A graph-based estimator of the number of clusters

Gérard Biau; Benoît Cadre; Bruno Pelletier

ESAIM: Probability and Statistics (2007)

  • Volume: 11, page 272-280
  • ISSN: 1292-8100

Abstract

top
Assessing the number of clusters of a statistical population is one of the essential issues of unsupervised learning. Given n independent observations X1,...,Xn drawn from an unknown multivariate probability density f, we propose a new approach to estimate the number of connected components, or clusters, of the t-level set ( t ) = { x : f ( x ) t } . The basic idea is to form a rough skeleton of the set ( t ) using any preliminary estimator of f, and to count the number of connected components of the resulting graph. Under mild analytic conditions on f, and using tools from differential geometry, we establish the consistency of our method.

How to cite

top

Biau, Gérard, Cadre, Benoît, and Pelletier, Bruno. "A graph-based estimator of the number of clusters." ESAIM: Probability and Statistics 11 (2007): 272-280. <http://eudml.org/doc/250119>.

@article{Biau2007,
abstract = { Assessing the number of clusters of a statistical population is one of the essential issues of unsupervised learning. Given n independent observations X1,...,Xn drawn from an unknown multivariate probability density f, we propose a new approach to estimate the number of connected components, or clusters, of the t-level set $\mathcal L(t)=\\{x:f(x) \geq t\\}$. The basic idea is to form a rough skeleton of the set $\mathcal L(t)$ using any preliminary estimator of f, and to count the number of connected components of the resulting graph. Under mild analytic conditions on f, and using tools from differential geometry, we establish the consistency of our method. },
author = {Biau, Gérard, Cadre, Benoît, Pelletier, Bruno},
journal = {ESAIM: Probability and Statistics},
keywords = {Cluster analysis; connected component; level set; graph; tubular neighborhood.; cluster analysis; tubular neighborhood},
language = {eng},
month = {6},
pages = {272-280},
publisher = {EDP Sciences},
title = {A graph-based estimator of the number of clusters},
url = {http://eudml.org/doc/250119},
volume = {11},
year = {2007},
}

TY - JOUR
AU - Biau, Gérard
AU - Cadre, Benoît
AU - Pelletier, Bruno
TI - A graph-based estimator of the number of clusters
JO - ESAIM: Probability and Statistics
DA - 2007/6//
PB - EDP Sciences
VL - 11
SP - 272
EP - 280
AB - Assessing the number of clusters of a statistical population is one of the essential issues of unsupervised learning. Given n independent observations X1,...,Xn drawn from an unknown multivariate probability density f, we propose a new approach to estimate the number of connected components, or clusters, of the t-level set $\mathcal L(t)=\{x:f(x) \geq t\}$. The basic idea is to form a rough skeleton of the set $\mathcal L(t)$ using any preliminary estimator of f, and to count the number of connected components of the resulting graph. Under mild analytic conditions on f, and using tools from differential geometry, we establish the consistency of our method.
LA - eng
KW - Cluster analysis; connected component; level set; graph; tubular neighborhood.; cluster analysis; tubular neighborhood
UR - http://eudml.org/doc/250119
ER -

References

top
  1. G.E. Bredon, Topology and Geometry, Springer-Verlag, New York, Graduate Texts in Mathematics 139 (1993).  
  2. M.R. Brito, E.L. Chavez, A.J. Quiroz and J.E. Yukich, Connectivity of the mutual k-nearest neighbor graph in clustering and outlier detection. Statist. Probab. Lett.35 (1997) 33–42.  
  3. B. Cadre, Kernel estimation of density level sets. J. Multivariate Anal.97 (2006) 999–1023.  
  4. I. Chavel, Riemannian Geometry: A Modern Introduction. Cambridge University Press, Cambridge (1993).  
  5. T.H. Cormen, C.E. Leiserson and R.L. Rivest, Introduction to Algorithms. The MIT Press, Cambridge (1990).  
  6. A. Cuevas, M. Febrero and R. Fraiman, Estimating the number of clusters. Canad. J. Statist.28 (2000) 367–382.  
  7. A. Cuevas, M. Febrero and R. Fraiman, Cluster analysis: a further approach based on density estimation. Comput. Statist. Data Anal.36 (2001) 441–459.  
  8. L. Devroye and G. Wise, Detection of abnormal behavior via nonparametric estimation of the support. SIAM J. Appl. Math.38 (1980) 480–488.  
  9. R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, 2nd edition. Wiley-Interscience, New York (2000).  
  10. L. Györfi, M. Kohler, A. Krzyżak and H. Walk, A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, New York (2002).  
  11. J.A. Hartigan, Clustering Algorithms. John Wiley, New York (1975).  
  12. T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (2001).  
  13. S. Kobayashi and K. Nomizu, Foundations of Differential Geometry, Vol. I & II, 2nd edition. Wiley, New York (1996).  
  14. U. von Luxburg and S. Ben-David, Towards a statistical theory of clustering. PASCAL Workshop on Statistics and Optimization of Clustering (2005).  
  15. M.D. Penrose, A strong law for the longest edge of the minimal spanning tree. Ann. Probab.27 (1999) 246–260.  
  16. A. Polonik, Measuring mass concentrations and estimating density contour clusters–an excess mass approach. Ann. Statist.23 (1995) 855–881.  
  17. B.L.S. Prakasa Rao, Nonparametric Functional Estimation. Academic Press, Orlando (1983).  
  18. A.B. Tsybakov, On nonparametric estimation of density level sets. Ann. Statist.25 (1997) 948–969.  

NotesEmbed ?

top

You must be logged in to post comments.

To embed these notes on your page include the following JavaScript code on your page where you want the notes to appear.

Only the controls for the widget will be shown in your chosen language. Notes will be shown in their authored language.

Tells the widget how many notes to show per page. You can cycle through additional notes using the next and previous controls.

    
                

Note: Best practice suggests putting the JavaScript code just before the closing </body> tag.