Data-driven penalty calibration: A case study for gaussian mixture model selection

Cathy Maugis; Bertrand Michel

ESAIM: Probability and Statistics (2011)

  • Volume: 15, page 320-339
  • ISSN: 1292-8100

Abstract

In the companion paper [C. Maugis and B. Michel, A non asymptotic penalized criterion for Gaussian mixture model selection. ESAIM: P&S 15 (2011) 41–68], a penalized likelihood criterion is proposed to select a Gaussian mixture model among a specific model collection. This criterion depends on unknown constants which have to be calibrated in practical situations. A “slope heuristics” method is described and experimented to deal with this practical problem. In a model-based clustering context, the specific form of the considered Gaussian mixtures allows us to detect the noisy variables in order to improve the data clustering and its interpretation. The behavior of our data-driven criterion is highlighted on simulated datasets, a curve clustering example and a genomics application.

How to cite

Maugis, Cathy, and Michel, Bertrand. "Data-driven penalty calibration: A case study for gaussian mixture model selection." ESAIM: Probability and Statistics 15 (2011): 320-339. <http://eudml.org/doc/277145>.

@article{Maugis2011,
abstract = {In the companion paper [C. Maugis and B. Michel, A non asymptotic penalized criterion for Gaussian mixture model selection. ESAIM: P&S 15 (2011) 41–68] , a penalized likelihood criterion is proposed to select a Gaussian mixture model among a specific model collection. This criterion depends on unknown constants which have to be calibrated in practical situations. A “slope heuristics” method is described and experimented to deal with this practical problem. In a model-based clustering context, the specific form of the considered Gaussian mixtures allows us to detect the noisy variables in order to improve the data clustering and its interpretation. The behavior of our data-driven criterion is highlighted on simulated datasets, a curve clustering example and a genomics application.},
author = {Maugis, Cathy and Michel, Bertrand},
journal = {ESAIM: Probability and Statistics},
keywords = {slope heuristics; penalized likelihood criterion; model-based clustering; noisy variable detection},
language = {eng},
pages = {320-339},
publisher = {EDP-Sciences},
title = {Data-driven penalty calibration: A case study for gaussian mixture model selection},
url = {http://eudml.org/doc/277145},
volume = {15},
year = {2011},
}

TY - JOUR
AU - Maugis, Cathy
AU - Michel, Bertrand
TI - Data-driven penalty calibration: A case study for gaussian mixture model selection
JO - ESAIM: Probability and Statistics
PY - 2011
PB - EDP-Sciences
VL - 15
SP - 320
EP - 339
AB - In the companion paper [C. Maugis and B. Michel, A non asymptotic penalized criterion for Gaussian mixture model selection. ESAIM: P&S 15 (2011) 41–68] , a penalized likelihood criterion is proposed to select a Gaussian mixture model among a specific model collection. This criterion depends on unknown constants which have to be calibrated in practical situations. A “slope heuristics” method is described and experimented to deal with this practical problem. In a model-based clustering context, the specific form of the considered Gaussian mixtures allows us to detect the noisy variables in order to improve the data clustering and its interpretation. The behavior of our data-driven criterion is highlighted on simulated datasets, a curve clustering example and a genomics application.
LA - eng
KW - slope heuristics; penalized likelihood criterion; model-based clustering; noisy variable detection
UR - http://eudml.org/doc/277145
ER -

References

  [1] C. Abraham, P.A. Cornillon, E. Matzner-Løber and N. Molinari, Unsupervised curve clustering using B-splines. Scand. J. Stat. Th. Appl. 30 (2003) 581–595. Zbl 1039.91067, MR 2002229
  [2] H. Akaike, Information theory and an extension of the maximum likelihood principle, in Second International Symposium on Information Theory (Tsahkadsor, 1971). Akadémiai Kiadó, Budapest (1973) 267–281. Zbl 0283.62006, MR 483125
  [3] H. Akaike, A new look at the statistical model identification. IEEE Trans. Automatic Control AC-19 (1974) 716–723. System identification and time-series analysis. Zbl 0314.62039, MR 423716
  [4] S. Arlot, Rééchantillonnage et sélection de modèles, Ph.D. thesis, Université Paris-Sud XI (2007).
  [5] S. Arlot and P. Massart, Slope heuristics for heteroscedastic regression on a random design. Submitted to the Annals of Statistics (2008).
  [6] D. Babusiaux, S. Barreau and P.-R. Bauquis, Oil and gas exploration and production, reserves, costs, contracts. Technip, Paris (2007).
  [7] J.D. Banfield and A.E. Raftery, Model-based gaussian and non-gaussian clustering. Biometrics 49 (1993) 803–821. Zbl 0794.62034, MR 1243494
  [8] A. Barron, L. Birgé and P. Massart, Risk bounds for model selection via penalization. Prob. Th. Rel. Fields 113 (1999) 301–413. Zbl 0946.62036, MR 1679028
  [9] J.-P. Baudry, Clustering through model selection criteria. Poster session at One Day Statistical Workshop in Lisieux. http://www.math.u-psud.fr/ baudry, June (2007).
  [10] A. Berlinet, G. Biau and L. Rouvière, Functional classification with wavelets. Technical report (2008). To appear in Annales de l'ISUP. MR 2435041
  [11] C. Biernacki, G. Celeux and G. Govaert, Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 719–725.
  [12] C. Biernacki, G. Celeux, G. Govaert and F. Langrognet, Model-based cluster and discriminant analysis with the MIXMOD software. Comp. Stat. Data Anal. 51 (2006) 587–600. Zbl 1157.62431, MR 2297473
  [13] L. Birgé and P. Massart, Gaussian model selection. J. Eur. Math. Soc. (JEMS) 3 (2001) 203–268. Zbl 1037.62001, MR 1848946
  [14] L. Birgé and P. Massart, Minimal penalties for Gaussian model selection. Prob. Th. Rel. Fields 138 (2006) 33–73. Zbl 1112.62082, MR 2288064
  [15] K.-E. Blake and C. Merz, UCI repository of machine learning databases (1999). http://mlearn.ics.uci.edu/MLSummary.html.
  [16] L. Breiman, J.H. Friedman, R.A. Olshen and C.J. Stone, Classification and regression trees. Wadsworth Statistics/Probability Series. Wadsworth Advanced Books and Software, Belmont, CA (1984). Zbl 0541.62042, MR 726392
  [17] G. Celeux and G. Govaert, Gaussian parsimonious clustering models. Patt. Recog. 28 (1995) 781–793.
  [18] A.P. Dempster, N.M. Laird and D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B. Methodol. 39 (1977) 1–38. With discussion. Zbl 0364.62022, MR 501537
  [19] S. Gagnot, J.-P. Tamby, M.-L. Martin-Magniette, F. Bitton, L. Taconnat, S. Balzergue, S. Aubourg, J.-P. Renou, A. Lecharny and V. Brunaud, CATdb: a public access to Arabidopsis transcriptome data from the URGV-CATMA platform. Nucleic Acids Res. 36 (2008) 986–990.
  [20] L.A. García-Escudero and A. Gordaliza, A proposal for robust curve clustering. J. Class. 22 (2005) 185–201. Zbl 1336.62179
  [21] P.J. Huber, Robust Statistics. Wiley (1981). Zbl 1276.62022, MR 606374
  [22] G.M. James and C.A. Sugar, Clustering for sparsely sampled functional data. J. Am. Stat. Assoc. 98 (2003) 397–408. Zbl 1041.62052, MR 1995716
  [23] D. Jiang, C. Tang and A. Zhang, Cluster analysis for gene expression data: A survey. IEEE Trans. Knowl. Data Eng. 16 (2004) 1370–1386.
  [24] C. Keribin, Consistent estimation of the order of mixture models. Sankhyā Ser. A 62 (2000) 49–66. Zbl 1081.62516, MR 1769735
  [25] E. Lebarbier, Detecting multiple change-points in the mean of Gaussian process by model selection. Signal Proc. 85 (2005) 717–736. Zbl 1148.94403
  [26] V. Lepez, Potentiel de réserves d'un bassin pétrolier: modélisation et estimation, Ph.D. thesis, Université Paris Sud (2002).
  [27] C. Lurin, C. Andréas, S. Aubourg, M. Bellaoui, F. Bitton, C. Bruyère, M. Caboche, J. Debast, C. Gualberto, B. Hoffmann, M. Lecharny, A. LeRet, M.-L. Martin-Magniette, H. Mireau, N. Peeters, J.-P. Renou, B. Szurek, L. Taconnat and I. Small, Genome-wide analysis of arabidopsis pentatricopeptide repeat proteins reveals their essential role in organelle biogenesis. Plant Cell 16 (2004) 2089–103.
  [28] P. Ma, W. Castillo-Davis, C. Zhong and J.S. Liu, A data-driven clustering method for time course gene expression data. Nucleic Acids Res. 34 (2006) 1261–1269.
  [29] C.L. Mallows, Some comments on Cp. Technometrics 37 (1973) 362–372. Zbl 0862.62061, MR 1365719
  [30] P. Massart, Concentration inequalities and model selection, Lecture Notes in Mathematics Vol. 1896. Springer, Berlin (2007). Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23 (2003). Zbl 1170.60006, MR 2319879
  [31] C. Maugis, G. Celeux and M.-L. Martin-Magniette, Variable selection for clustering with Gaussian mixture models. Biometrics 65 (2009) 701–709. Zbl 1172.62021, MR 2649842
  [32] C. Maugis, G. Celeux and M.-L. Martin-Magniette, Variable selection in model-based clustering: A general variable role modeling. Comput. Stat. Data Anal. 53 (2009) 3872–3882. Zbl 05689143, MR 2749931
  [33] C. Maugis and B. Michel, A non asymptotic penalized criterion for Gaussian mixture model selection. ESAIM: P&S 15 (2011) 41–68. Zbl 06157507, MR 2870505
  [34] B. Michel, Modélisation de la production d'hydrocarbures dans un bassin pétrolier, Ph.D. thesis, Université Paris-Sud 11 (2008).
  [35] B.P. Percival and A.T. Walden, Wavelet methods for time series analysis. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, New York (2000). Zbl 0963.62079, MR 1770693
  [36] A.E. Raftery and N. Dean, Variable selection for model-based clustering. J. Am. Stat. Assoc. 101 (2006) 168–178. Zbl 1118.62339, MR 2268036
  [37] G. Schwarz, Estimating the dimension of a model. Ann. Stat. 6 (1978) 461–464. Zbl 0379.62005, MR 468014
  [38] R. Sharan, R. Elkon and R. Shamir, Cluster analysis and its applications to gene expression data. In Ernst Schering Workshop on Bioinformatics and Genome Analysis. Springer Verlag (2002).
  [39] T. Tarpey and K.K.J. Kinateder, Clustering functional data. J. Class. 20 (2003) 93–114. Zbl 1112.62327, MR 1983123
  [40] F. Villers, Tests et sélection de modèles pour l'analyse de données protéomiques et transcriptomiques, Ph.D. thesis, Université Paris-Sud 11 (2007).
