Variable selection through CART

Marie Sauvé; Christine Tuleau-Malot

ESAIM: Probability and Statistics (2014)

  • Volume: 18, pages 770–798
  • ISSN: 1292-8100

Abstract

This paper deals with variable selection in the regression and binary classification frameworks. It proposes an automatic and exhaustive procedure relying on the CART algorithm and on model selection via penalization. This work, of a theoretical nature, aims at determining adequate penalties, i.e. penalties that yield oracle-type inequalities justifying the performance of the proposed procedure. Since the exhaustive procedure cannot be carried out when the number of variables is too large, a more practical procedure, still theoretically validated, is also proposed. A simulation study completes the theoretical results.
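
In the generic notation of model selection via penalization (a schematic summary; the authors' exact notation and penalty shape may differ), the procedure picks, among the CART estimators \hat{s}_m built on the different subsets m of variables, the one minimizing a penalized empirical risk, and a penalty is "adequate" when the selected estimator satisfies an oracle-type inequality:

\hat{m} \in \arg\min_{m \in \mathcal{M}} \left\{ \gamma_n(\hat{s}_m) + \mathrm{pen}(m) \right\},
\qquad
\mathbb{E}\left[ \ell(s, \hat{s}_{\hat{m}}) \right] \le C \inf_{m \in \mathcal{M}} \left\{ \ell(s, s_m) + \mathrm{pen}(m) \right\} + \frac{C'}{n},

where \gamma_n is the empirical risk, \ell the loss (quadratic in regression, 0-1 in binary classification), s the target function, and s_m the best approximation of s within the model indexed by m.

The sketch below illustrates the exhaustive variant in Python using scikit-learn's CART implementation; the penalty proportional to the number of leaves and the constant alpha are illustrative assumptions in the spirit of the pruning penalties studied by Gey and Nédélec [12], not the authors' calibrated penalty.

from itertools import combinations

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def penalized_risk(X, y, subset, alpha=1.0):
    """Grow a CART tree on the given variable subset and return its
    empirical risk plus a size penalty (assumed form: alpha * leaves / n)."""
    n = len(y)
    tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
    tree.fit(X[:, subset], y)
    empirical_risk = np.mean((y - tree.predict(X[:, subset])) ** 2)
    return empirical_risk + alpha * tree.get_n_leaves() / n

def exhaustive_selection(X, y, alpha=1.0):
    """Visit every non-empty subset of variables and keep the one whose
    CART tree minimizes the penalized criterion. There are 2^p - 1
    subsets, hence the need for the more practical procedure when the
    number of variables p is large."""
    p = X.shape[1]
    best_subset, best_crit = None, np.inf
    for size in range(1, p + 1):
        for subset in combinations(range(p), size):
            crit = penalized_risk(X, y, list(subset), alpha)
            if crit < best_crit:
                best_subset, best_crit = subset, crit
    return best_subset, best_crit

# Toy run: the response only depends on the first two of five variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)
print(exhaustive_selection(X, y))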

How to cite


Sauvé, Marie, and Tuleau-Malot, Christine. "Variable selection through CART." ESAIM: Probability and Statistics 18 (2014): 770-798. <http://eudml.org/doc/273651>.

@article{Sauve2014,
abstract = {This paper deals with variable selection in the regression and binary classification frameworks. It proposes an automatic and exhaustive procedure relying on the CART algorithm and on model selection via penalization. This work, of a theoretical nature, aims at determining adequate penalties, i.e. penalties that yield oracle-type inequalities justifying the performance of the proposed procedure. Since the exhaustive procedure cannot be carried out when the number of variables is too large, a more practical procedure, still theoretically validated, is also proposed. A simulation study completes the theoretical results.},
author = {Sauvé, Marie and Tuleau-Malot, Christine},
journal = {ESAIM: Probability and Statistics},
keywords = {binary classification; CART; model selection; penalization; regression; variable selection},
language = {eng},
pages = {770-798},
publisher = {EDP-Sciences},
title = {Variable selection through CART},
url = {http://eudml.org/doc/273651},
volume = {18},
year = {2014},
}

TY - JOUR
AU - Sauvé, Marie
AU - Tuleau-Malot, Christine
TI - Variable selection through CART
JO - ESAIM: Probability and Statistics
PY - 2014
PB - EDP-Sciences
VL - 18
SP - 770
EP - 798
AB - This paper deals with variable selection in the regression and binary classification frameworks. It proposes an automatic and exhaustive procedure relying on the CART algorithm and on model selection via penalization. This work, of a theoretical nature, aims at determining adequate penalties, i.e. penalties that yield oracle-type inequalities justifying the performance of the proposed procedure. Since the exhaustive procedure cannot be carried out when the number of variables is too large, a more practical procedure, still theoretically validated, is also proposed. A simulation study completes the theoretical results.
LA - eng
KW - binary classification; CART; model selection; penalization; regression; variable selection
UR - http://eudml.org/doc/273651
ER -

References

  [1] S. Arlot and P. Bartlett, Margin adaptive model selection in statistical learning. Bernoulli 17 (2011) 687–713. Zbl 06083988, MR 2787611
  [2] L. Birgé and P. Massart, Minimal penalties for Gaussian model selection. Probab. Theory Relat. Fields 138 (2007) 33–73. Zbl 1112.62082, MR 2288064
  [3] L. Breiman, Random forests. Mach. Learn. 45 (2001) 5–32. Zbl 1007.68152
  [4] L. Breiman and A. Cutler, Random forests. http://www.stat.berkeley.edu/users/breiman/RandomForests/ (2005).
  [5] L. Breiman, J. Friedman, R. Olshen and C. Stone, Classification and Regression Trees. Chapman and Hall (1984). Zbl 0541.62042, MR 726392
  [6] R. Díaz-Uriarte and S. Alvarez de Andrés, Gene selection and classification of microarray data using random forest. BMC Bioinform. 7 (2006) 1–13.
  [7] B. Efron, T. Hastie, I. Johnstone and R. Tibshirani, Least angle regression. Ann. Stat. 32 (2004) 407–499. Zbl 1091.62054, MR 2060166
  [8] J. Fan and J. Lv, A selective overview of variable selection in high dimensional feature space. Stat. Sin. 20 (2010) 101–148. Zbl 1180.62080, MR 2640659
  [9] G.M. Furnival and R.W. Wilson, Regression by leaps and bounds. Technometrics 16 (1974) 499–511. Zbl 0294.62079
  [10] R. Genuer, J.M. Poggi and C. Tuleau-Malot, Variable selection using random forests. Pattern Recognit. Lett. 31 (2010) 2225–2236.
  [11] S. Gey, Margin adaptive risk bounds for classification trees. Preprint hal-00362281. Zbl 1242.62055
  [12] S. Gey and E. Nédélec, Model Selection for CART Regression Trees. IEEE Trans. Inf. Theory 51 (2005) 658–670. Zbl 1301.62064, MR 2236074
  [13] B. Ghattas and A. Ben Ishak, Sélection de variables pour la classification binaire en grande dimension: comparaisons et application aux données de biopuces [Variable selection for high-dimensional binary classification: comparisons and an application to microarray data]. Journal de la société française de statistique 149 (2008) 43–66. MR 2501989
  [14] U. Grömping, Estimators of relative importance in linear regression based on variance decomposition. The American Statistician 61 (2007) 139–147. MR 2368103
  [15] I. Guyon and A. Elisseeff, An introduction to variable and feature selection. J. Mach. Learn. Res. 3 (2003) 1157–1182. Zbl 1102.68556
  [16] I. Guyon, J. Weston, S. Barnhill and V.N. Vapnik, Gene selection for cancer classification using support vector machines. Mach. Learn. 46 (2002) 389–422. Zbl 0998.68111
  [17] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning. Springer (2001). Zbl 0973.62007, MR 1851606
  [18] T. Hesterberg, N.H. Choi, L. Meier and C. Fraley, Least angle and l1 penalized regression: A review. Stat. Surv. 2 (2008) 61–93. Zbl 1189.62070, MR 2520981
  [19] R. Kohavi and G.H. John, Wrappers for feature subset selection. Artificial Intelligence 97 (1997) 273–324. Zbl 0904.68143
  [20] V. Koltchinskii, Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Stat. 34 (2006) 2593–2656. Zbl 1118.62065, MR 2329442
  [21] E. Mammen and A. Tsybakov, Smooth discrimination analysis. Ann. Stat. 27 (1999) 1808–1829. Zbl 0961.62058, MR 1765618
  [22] P. Massart, Some applications of concentration inequalities to statistics. Annales de la faculté des sciences de Toulouse 9 (2000) 245–303. Zbl 0986.62002, MR 1813803
  [23] P. Massart, Concentration Inequalities and Model Selection. Lect. Notes Math. Springer (2003). Zbl 1170.60006
  [24] P. Massart and E. Nédélec, Risk bounds for statistical learning. Ann. Stat. 34 (2006). Zbl 1108.62007, MR 2291502
  [25] J.M. Poggi and C. Tuleau, Classification supervisée en grande dimension. Application à l'agrément de conduite automobile [High-dimensional supervised classification: an application to car driving pleasure assessment]. Revue de Statistique Appliquée LIV (2006) 41–60.
  [26] E. Rio, Une inégalité de Bennett pour les maxima de processus empiriques [A Bennett inequality for maxima of empirical processes]. Ann. Inst. Henri Poincaré, Probab. Stat. 38 (2002) 1053–1057. Zbl 1014.60011, MR 1955352
  [27] A. Saltelli, K. Chan and M. Scott, Sensitivity Analysis. Wiley (2000). Zbl 1152.62071, MR 1886391
  [28] M. Sauvé, Histogram selection in non-Gaussian regression. ESAIM: Probab. Stat. 13 (2009) 70–86. Zbl 1180.62061, MR 2502024
  [29] M. Sauvé and C. Tuleau-Malot, Variable selection through CART. Preprint hal-00551375.
  [30] I.M. Sobol, Sensitivity estimates for nonlinear mathematical models. Math. Mod. Comput. Experiment 1 (1993) 271–280. Zbl 1039.65505, MR 1335161
  [31] R. Tibshirani, Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 58 (1996) 267–288. Zbl 0850.62538, MR 1379242
  [32] A.B. Tsybakov, Optimal aggregation of classifiers in statistical learning. Ann. Stat. 32 (2004) 135–166. Zbl 1105.62353, MR 2051002
