How many bins should be put in a regular histogram

Lucien Birgé; Yves Rozenholc

ESAIM: Probability and Statistics (2006)

  • Volume: 10, pages 24-45
  • ISSN: 1292-8100

Abstract

Given an n-sample from some unknown density f on [0,1], it is easy to construct a histogram of the data based on some given partition of [0,1], but little is known about an optimal choice of the partition, especially when the data set is not large, even if one restricts attention to partitions into intervals of equal length. Existing methods are either rules of thumb or based on asymptotic considerations, and they often involve smoothness properties of f. Our purpose in this paper is to give an automatic, easy-to-program and efficient method for choosing the number of bins of the partition from the data. It is based on bounds on the risk of penalized maximum likelihood estimators due to Castellan and on extensive simulations that allowed us to optimize the form of the penalty function. These simulations show that the method works quite well for sample sizes as small as 25.
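The rule the paper arrives at fits in a few lines. The Python sketch below is a minimal illustration, not the authors' code: it picks the number of bins D maximizing the histogram log-likelihood minus the penalty pen(D) = D - 1 + (log D)^2.5 reported by the paper; the function name and the search range up to about n/log n are our own assumptions.

import numpy as np

def birge_rozenholc_bins(x, d_max=None):
    # Penalized maximum likelihood choice of the number D of equal-width
    # bins for a histogram of data supported on [0, 1]; assumes n >= 2.
    x = np.asarray(x, dtype=float)
    n = x.size
    if d_max is None:
        # search range up to roughly n / log(n) bins (an assumption,
        # in the spirit of the paper's simulation setup)
        d_max = max(1, int(n / np.log(n)))
    best_d, best_crit = 1, -np.inf
    for d in range(1, d_max + 1):
        counts, _ = np.histogram(x, bins=d, range=(0.0, 1.0))
        nz = counts[counts > 0]
        # log-likelihood of the regular histogram with d bins:
        # sum_j N_j * log(d * N_j / n); empty bins contribute 0
        loglik = float(np.sum(nz * np.log(d * nz / n)))
        # penalty pen(D) = D - 1 + (log D)^2.5, the form the authors
        # optimized by simulation
        pen = (d - 1) + np.log(d) ** 2.5
        crit = loglik - pen
        if crit > best_crit:
            best_d, best_crit = d, crit
    return best_d

# example: 100 draws from a Beta(2, 5) density on [0, 1]
rng = np.random.default_rng(0)
print(birge_rozenholc_bins(rng.beta(2.0, 5.0, size=100)))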

How to cite

Birgé, Lucien, and Rozenholc, Yves. "How many bins should be put in a regular histogram." ESAIM: Probability and Statistics 10 (2006): 24-45. <http://eudml.org/doc/249764>.

@article{Birgé2006,
abstract = { Given an n-sample from some unknown density f on [0,1], it is easy to construct a histogram of the data based on some given partition of [0,1], but little is known about an optimal choice of the partition, especially when the data set is not large, even if one restricts attention to partitions into intervals of equal length. Existing methods are either rules of thumb or based on asymptotic considerations, and they often involve smoothness properties of f. Our purpose in this paper is to give an automatic, easy-to-program and efficient method for choosing the number of bins of the partition from the data. It is based on bounds on the risk of penalized maximum likelihood estimators due to Castellan and on extensive simulations that allowed us to optimize the form of the penalty function. These simulations show that the method works quite well for sample sizes as small as 25. },
author = {Birgé, Lucien and Rozenholc, Yves},
journal = {ESAIM: Probability and Statistics},
keywords = {Regular histograms; density estimation; penalized maximum likelihood; model selection},
language = {eng},
month = {1},
pages = {24-45},
publisher = {EDP Sciences},
title = {How many bins should be put in a regular histogram},
url = {http://eudml.org/doc/249764},
volume = {10},
year = {2006},
}

TY - JOUR
AU - Birgé, Lucien
AU - Rozenholc, Yves
TI - How many bins should be put in a regular histogram
JO - ESAIM: Probability and Statistics
DA - 2006/1//
PB - EDP Sciences
VL - 10
SP - 24
EP - 45
AB - Given an n-sample from some unknown density f on [0,1], it is easy to construct a histogram of the data based on some given partition of [0,1], but little is known about an optimal choice of the partition, especially when the data set is not large, even if one restricts attention to partitions into intervals of equal length. Existing methods are either rules of thumb or based on asymptotic considerations, and they often involve smoothness properties of f. Our purpose in this paper is to give an automatic, easy-to-program and efficient method for choosing the number of bins of the partition from the data. It is based on bounds on the risk of penalized maximum likelihood estimators due to Castellan and on extensive simulations that allowed us to optimize the form of the penalty function. These simulations show that the method works quite well for sample sizes as small as 25.
LA - eng
KW - Regular histograms; density estimation; penalized maximum likelihood; model selection
UR - http://eudml.org/doc/249764
ER -

References

  1. H. Akaike, A new look at the statistical model identification. IEEE Trans. Automatic Control 19 (1974) 716–723.
  2. A.R. Barron, L. Birgé and P. Massart, Risk bounds for model selection via penalization. Probab. Theory Relat. Fields 113 (1999) 301–415.
  3. L. Birgé and P. Massart, From model selection to adaptive estimation, in Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics, D. Pollard, E. Torgersen and G. Yang, Eds., Springer-Verlag, New York (1997) 55–87.
  4. L. Birgé and P. Massart, Gaussian model selection. J. Eur. Math. Soc. 3 (2001) 203–268.
  5. G. Castellan, Modified Akaike's criterion for histogram density estimation. Technical Report, Université Paris-Sud, Orsay (1999).
  6. G. Castellan, Sélection d'histogrammes à l'aide d'un critère de type Akaike. CRAS 330 (2000) 729–732.
  7. J. Daly, The construction of optimal histograms. Commun. Stat., Theory Methods 17 (1988) 2921–2931.
  8. L. Devroye, A Course in Density Estimation. Birkhäuser, Boston (1987).
  9. L. Devroye and L. Györfi, Nonparametric Density Estimation: The L1 View. John Wiley, New York (1985).
  10. L. Devroye and G. Lugosi, Combinatorial Methods in Density Estimation. Springer-Verlag, New York (2001).
  11. D. Freedman and P. Diaconis, On the histogram as a density estimator: L2 theory. Z. Wahrscheinlichkeitstheor. Verw. Geb. 57 (1981) 453–476.
  12. P. Hall, Akaike's information criterion and Kullback-Leibler loss for histogram density estimation. Probab. Theory Relat. Fields 85 (1990) 449–467.
  13. P. Hall and E.J. Hannan, On stochastic complexity and nonparametric density estimation. Biometrika 75 (1988) 705–714.
  14. K. He and G. Meeden, Selecting the number of bins in a histogram: A decision theoretic approach. J. Stat. Plann. Inference 61 (1997) 49–59.
  15. D.R.M. Herrick, G.P. Nason and B.W. Silverman, Some new methods for wavelet density estimation. Sankhya, Series A 63 (2001) 394–411.
  16. M.C. Jones, On two recent papers of Y. Kanazawa. Statist. Probab. Lett. 24 (1995) 269–271.
  17. Y. Kanazawa, Hellinger distance and Akaike's information criterion for the histogram. Statist. Probab. Lett. 17 (1993) 293–298.
  18. L.M. Le Cam, Asymptotic Methods in Statistical Decision Theory. Springer-Verlag, New York (1986).
  19. L.M. Le Cam and G.L. Yang, Asymptotics in Statistics: Some Basic Concepts. Second Edition. Springer-Verlag, New York (2000).
  20. J. Rissanen, Stochastic complexity and the MDL principle. Econ. Rev. 6 (1987) 85–102.
  21. M. Rudemo, Empirical choice of histograms and kernel density estimators. Scand. J. Statist. 9 (1982) 65–78.
  22. D.W. Scott, On optimal and data-based histograms. Biometrika 66 (1979) 605–610.
  23. H.A. Sturges, The choice of a class interval. J. Am. Stat. Assoc. 21 (1926) 65–66.
  24. C.C. Taylor, Akaike's information criterion and the histogram. Biometrika 74 (1987) 636–639.
  25. G.R. Terrell, The maximal smoothing principle in density estimation. J. Am. Stat. Assoc. 85 (1990) 470–477.
  26. M.P. Wand, Data-based choice of histogram bin width. Am. Statistician 51 (1997) 59–64.
