An adaptive method of estimation and outlier detection in regression applicable for small to moderate sample sizes

Brenton R. Clarke

Discussiones Mathematicae Probability and Statistics (2000)

  • Volume: 20, Issue: 1, pages 25-50
  • ISSN: 1509-9423

Abstract

In small to moderate sample sizes it is important to make use of all the data when there are no outliers, for reasons of efficiency. It is equally important to guard against the possibility that there may be single or multiple outliers which can have disastrous effects on normal theory least squares estimation and inference. The purpose of this paper is to describe and illustrate the use of an adaptive regression estimation algorithm which can be used to highlight outliers, either single or multiple of varying number. The outliers can include 'bad' leverage points. Illustration is given of how 'good' leverage points are retained and 'bad' leverage points discarded. The adaptive regression estimator generalizes its high breakdown point adaptive location estimator counterpart and thus is expected to have high efficiency at the normal model. Simulations confirm this. On the other hand, examples demonstrate that the regression algorithm given highlights outliers and 'potential' outliers for closer scrutiny. The algorithm is computer intensive because it is a global algorithm designed to highlight outliers automatically. This also obviates the problem of searching out 'local minima' encountered by some algorithms designed as fast search methods. Instead the objective here is to assess all observations and subsets of observations with the intention of culling all outliers, which can range up to approximately half the data. It is assumed that the distributional form of the data less outliers is approximately normal. If this distributional assumption fails, plots can be used to indicate such failure, and transformations may be required before potential outliers are deemed outliers. A well known set of data illustrates this point.
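The global, all-subsets flavour of the search described in the abstract can be sketched with a least trimmed squares (LTS) fit, one of the paper's keywords. The following is an illustrative toy in plain Python, not the paper's actual adaptive algorithm: it enumerates every size-h subset of the observations, so it is deliberately computer intensive and feasible only for small samples, but being a global search it cannot get trapped in a local minimum.

```python
import itertools

def ols_line(pts):
    """Ordinary least squares for y = a + b*x on a list of (x, y) pairs."""
    n = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts)
    sxy = sum(x * y for x, y in pts)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

def lts_line(pts, h):
    """Exact least trimmed squares by global subset search: fit OLS to every
    size-h subset and keep the line whose h smallest squared residuals over
    the full sample have minimal sum.  Exponential in n, hence 'computer
    intensive', but the exhaustive search avoids local minima entirely."""
    best_crit, best_line = float("inf"), None
    for idx in itertools.combinations(range(len(pts)), h):
        a, b = ols_line([pts[i] for i in idx])
        r2 = sorted((y - a - b * x) ** 2 for x, y in pts)
        crit = sum(r2[:h])
        if crit < best_crit:
            best_crit, best_line = crit, (a, b)
    return best_line

# Toy data: y = 2x, with one gross outlier planted at index 4.
pts = [(float(i), 2.0 * i) for i in range(10)]
pts[4] = (4.0, 58.0)

a, b = lts_line(pts, h=8)
resid = [abs(y - a - b * x) for x, y in pts]  # largest residual flags the outlier
```

With h = 8 the trimmed fit ignores the planted point and recovers the clean line, whereas an ordinary least squares fit to all ten points would be dragged toward the outlier.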

How to cite


Brenton R. Clarke. "An adaptive method of estimation and outlier detection in regression applicable for small to moderate sample sizes." Discussiones Mathematicae Probability and Statistics 20.1 (2000): 25-50. <http://eudml.org/doc/287713>.

@article{BrentonR2000,
abstract = {In small to moderate sample sizes it is important to make use of all the data when there are no outliers, for reasons of efficiency. It is equally important to guard against the possibility that there may be single or multiple outliers which can have disastrous effects on normal theory least squares estimation and inference. The purpose of this paper is to describe and illustrate the use of an adaptive regression estimation algorithm which can be used to highlight outliers, either single or multiple of varying number. The outliers can include 'bad' leverage points. Illustration is given of how 'good' leverage points are retained and 'bad' leverage points discarded. The adaptive regression estimator generalizes its high breakdown point adaptive location estimator counterpart and thus is expected to have high efficiency at the normal model. Simulations confirm this. On the other hand, examples demonstrate that the regression algorithm given highlights outliers and 'potential' outliers for closer scrutiny. The algorithm is computer intensive because it is a global algorithm designed to highlight outliers automatically. This also obviates the problem of searching out 'local minima' encountered by some algorithms designed as fast search methods. Instead the objective here is to assess all observations and subsets of observations with the intention of culling all outliers, which can range up to approximately half the data. It is assumed that the distributional form of the data less outliers is approximately normal. If this distributional assumption fails, plots can be used to indicate such failure, and transformations may be required before potential outliers are deemed outliers. A well known set of data illustrates this point.},
author = {Brenton R. Clarke},
journal = {Discussiones Mathematicae Probability and Statistics},
keywords = {outlier; least median of squares regression; least trimmed squares; trimmed likelihood; adaptive estimation; leverage; tables},
language = {eng},
number = {1},
pages = {25-50},
title = {An adaptive method of estimation and outlier detection in regression applicable for small to moderate sample sizes},
url = {http://eudml.org/doc/287713},
volume = {20},
year = {2000},
}

TY - JOUR
AU - Brenton R. Clarke
TI - An adaptive method of estimation and outlier detection in regression applicable for small to moderate sample sizes
JO - Discussiones Mathematicae Probability and Statistics
PY - 2000
VL - 20
IS - 1
SP - 25
EP - 50
AB - In small to moderate sample sizes it is important to make use of all the data when there are no outliers, for reasons of efficiency. It is equally important to guard against the possibility that there may be single or multiple outliers which can have disastrous effects on normal theory least squares estimation and inference. The purpose of this paper is to describe and illustrate the use of an adaptive regression estimation algorithm which can be used to highlight outliers, either single or multiple of varying number. The outliers can include 'bad' leverage points. Illustration is given of how 'good' leverage points are retained and 'bad' leverage points discarded. The adaptive regression estimator generalizes its high breakdown point adaptive location estimator counterpart and thus is expected to have high efficiency at the normal model. Simulations confirm this. On the other hand, examples demonstrate that the regression algorithm given highlights outliers and 'potential' outliers for closer scrutiny. The algorithm is computer intensive because it is a global algorithm designed to highlight outliers automatically. This also obviates the problem of searching out 'local minima' encountered by some algorithms designed as fast search methods. Instead the objective here is to assess all observations and subsets of observations with the intention of culling all outliers, which can range up to approximately half the data. It is assumed that the distributional form of the data less outliers is approximately normal. If this distributional assumption fails, plots can be used to indicate such failure, and transformations may be required before potential outliers are deemed outliers. A well known set of data illustrates this point.
LA - eng
KW - outlier; least median of squares regression; least trimmed squares; trimmed likelihood; adaptive estimation; leverage; tables
UR - http://eudml.org/doc/287713
ER -

References

  [1] A.C. Atkinson, Two graphical displays for outlying and influential observations in regression, Biometrika 68 (1981), 13-20. Zbl0462.62049
  [2] A.C. Atkinson, Masking unmasked, Biometrika 73 (1986a), 533-541. Zbl0614.62092
  [3] A.C. Atkinson, Comment: Aspects of diagnostic regression analysis, Statistical Science 1 (1986b), 397-401.
  [4] A.C. Atkinson, Fast very robust methods for the detection of multiple outliers, Journal of the American Statistical Association 89 (1994), 1329-1339. Zbl0825.62429
  [5] V. Barnett and T. Lewis, Outliers in Statistical Data, 3rd ed., New York, Wiley, 1994. Zbl0801.62001
  [6] T. Bednarski and B.R. Clarke, Trimmed likelihood estimation of location and scale of the normal distribution, Australian Journal of Statistics 35 (1993), 141-153. Zbl0798.62043
  [7] R.L. Brown, J. Durbin and J.M. Evans, Techniques for testing the constancy of regression relationships over time, Journal of the Royal Statistical Society, Series B 37 (1975), 149-192. Zbl0321.62063
  [8] K.A. Brownlee, Statistical Theory and Methodology in Science and Engineering, 2nd ed., New York, Wiley, 1965. Zbl0136.39203
  [9] R.W. Butler, Nonparametric interval and point prediction using data trimmed by a Grubbs-type outlier rule, Annals of Statistics 10 (1982), 197-204. Zbl0487.62040
  [10] R.L. Chambers and C.R. Heathcote, On the estimation of slope and the identification of outliers in linear regression, Biometrika 68 (1981), 21-33. Zbl0463.62061
  [11] B.R. Clarke, Empirical evidence for adaptive confidence intervals and identification of outliers using methods of trimming, Australian Journal of Statistics 36 (1994), 45-58. Zbl0825.62418
  [12] R.D. Cook and S. Weisberg, Residuals and Influence in Regression, New York and London, Chapman and Hall, 1982. Zbl0564.62054
  [13] P.L. Davies, The asymptotics of S-estimators in the linear regression model, Annals of Statistics 18 (1990), 1651-1675. Zbl0719.62042
  [14] P.L. Davies and U. Gather, The identification of multiple outliers (with discussion), Journal of the American Statistical Association 88 (1993), 782-801. Zbl0797.62025
  [15] N.R. Draper and H. Smith, Applied Regression Analysis, New York, Wiley, 1966. Zbl0158.17101
  [16] W. Fung, Unmasking outliers and leverage points: A confirmation, Journal of the American Statistical Association 88 (1993), 515-519.
  [17] A.S. Hadi and J.S. Simonoff, Procedures for the identification of multiple outliers in linear models, Journal of the American Statistical Association 88 (1993), 1264-1272.
  [18] F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw and W.J. Stahel, Robust Statistics, the Approach Based on Influence Functions, New York, Wiley, 1986. Zbl0593.62027
  [19] D.M. Hawkins, D. Bradu and G.V. Kass, Location of several outliers in multiple-regression data using elemental sets, Technometrics 26 (1984), 197-208.
  [20] T.P. Hettmansperger and S.J. Sheather, A cautionary note on the method of least median squares, American Statistician 46 (1992), 79-83.
  [21] L.A. Jaeckel, Some flexible estimates of location, Annals of Mathematical Statistics 42 (1971), 1540-1552. Zbl0232.62008
  [22] F. Kianifard and W.H. Swallow, Using recursive residuals, calculated on adaptively-ordered observations, to identify outliers in linear regression, Biometrics 45 (1989), 571-585. Zbl0715.62144
  [23] F. Kianifard and W.H. Swallow, A Monte Carlo comparison of five procedures for identifying outliers in linear regression, Communications in Statistics, Part A-Theory and Methods 19 (1990), 1913-1938.
  [24] M.G. Marasinghe, A multistage procedure for detecting several outliers in linear regression, Technometrics 27 (1985), 395-399.
  [25] P.J. Rousseeuw, Least median of squares regression, Journal of the American Statistical Association 79 (1984), 871-880. Zbl0547.62046
  [26] P.J. Rousseeuw and A.M. Leroy, Robust Regression and Outlier Detection, New York, Wiley, 1987. Zbl0711.62030
  [27] P.J. Rousseeuw and B.C. van Zomeren, Unmasking multivariate outliers and leverage points, Journal of the American Statistical Association 85 (1990), 633-651.
  [28] P.J. Rousseeuw and V.J. Yohai, Robust regression by means of S-estimators, Robust and Nonlinear Time Series Analysis, eds. J. Franke, W. Härdle and R.D. Martin (Lecture Notes in Statistics), New York, Springer-Verlag, (1984), 256-272. Zbl0567.62027
  [29] D. Ruppert, Computing S-estimators for regression and multivariate location/dispersion, Journal of Computational and Graphical Statistics 1 (1992), 253-270.
  [30] T.P. Ryan, Comment on Hadi and Simonoff, Letters to the Editor, Journal of the American Statistical Association 90 (1995), 811.
  [31] D.G. Simpson, D. Ruppert and R.J. Carroll, On one-step GM estimates and stability of inferences in linear regression, Journal of the American Statistical Association 87 (1992), 439-450. Zbl0781.62104
  [32] W.H. Swallow and F. Kianifard, Using robust scale estimates in detecting multiple outliers in linear regression, Biometrics 52 (1996), 545-556. Zbl0875.62283
  [33] J.W. Tukey and D.H. McLaughlin, Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1, Sankhya 25 (A) (1963), 331-352. Zbl0116.10904
  [34] W.N. Venables and B.D. Ripley, Modern Applied Statistics with S-Plus, New York, Springer-Verlag, 1994. Zbl0806.62002
  [35] D.L. Woodruff and D.M. Rocke, Computable robust estimation of multivariate location and shape in high dimension using compound estimators, Journal of the American Statistical Association 89 (1994), 888-896. Zbl0825.62485
  [36] V.J. Yohai, High breakdown point and high-efficiency robust estimates for regression, Annals of Statistics 15 (1987), 642-656. Zbl0624.62037
  [37] V.J. Yohai and R.H. Zamar, High breakdown-point estimates of regression by means of the minimization of an efficient scale, Journal of the American Statistical Association 83 (1988), 406-413. Zbl0648.62036
