Sélection de variables pour la classification binaire en grande dimension : comparaisons et application aux données de biopuces

Badih Ghattas; Anis Ben Ishak

Journal de la société française de statistique (2008)

  • Volume: 149, Issue: 3, pages 43-66
  • ISSN: 1962-5197

Abstract

In this paper we compare three recent methods for selecting important features in binary classification. We focus on the case where the number of variables is very large and far exceeds the sample size, as is typical of microarray data. The three approaches compared are based on Support Vector Machines, L1-constrained Generalized Linear Models, and Random Forests.
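As a rough illustration of the three families of methods named above, the sketch below runs SVM-based recursive feature elimination, L1-penalized logistic regression, and Random Forest importance ranking on simulated p >> n data. It is not the authors' procedure (the paper relies on SVM-based criteria, the GLMpath algorithm, and Random Forests as described in its references); the simulated dataset, parameter values, and the scikit-learn estimators used as stand-ins are illustrative assumptions only.

# Hedged sketch: scikit-learn stand-ins for the three feature-selection families
# compared in the paper (SVM-based criteria, L1-constrained GLM, Random Forests).
# Data, parameters and estimator choices are assumptions for illustration only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Simulate a microarray-like setting: few samples, many variables (p >> n).
X, y = make_classification(n_samples=60, n_features=2000, n_informative=20,
                           random_state=0)

# 1) SVM-based ranking: recursive feature elimination with a linear SVM,
#    in the spirit of SVM-RFE (Guyon et al., 2002).
svm_rfe = RFE(LinearSVC(C=1.0, dual=False), n_features_to_select=20, step=0.1)
svm_rfe.fit(X, y)
svm_selected = np.where(svm_rfe.support_)[0]

# 2) L1-constrained GLM: sparse logistic regression keeps few nonzero coefficients.
lasso_glm = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso_glm.fit(X, y)
glm_selected = np.where(lasso_glm.coef_.ravel() != 0)[0]

# 3) Random Forests: rank variables by impurity-based importance.
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
rf_selected = np.argsort(rf.feature_importances_)[::-1][:20]

print(len(svm_selected), len(glm_selected), len(rf_selected))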

How to cite


Ghattas, Badih, and Ben Ishak, Anis. "Sélection de variables pour la classification binaire en grande dimension : comparaisons et application aux données de biopuces." Journal de la société française de statistique 149.3 (2008): 43-66. <http://eudml.org/doc/93483>.

@article{Ghattas2008,
abstract = {Dans cet article nous nous proposons de comparer trois méthodes récentes de sélection de variables dans le cadre de la classification binaire. Le contexte auquel nous nous intéressons ici est celui où le nombre de variables est très grand et beaucoup plus important que le nombre d’observations, comme c’est le cas pour les données issues des biopuces. Les approches comparées sont de type SVM, GLM sous contraintes de type $L_{1}$ et Forêts Aléatoires.},
author = {Ghattas, Badih and Ben Ishak, Anis},
journal = {Journal de la société française de statistique},
keywords = {bootstrap; cross validation; feature selection; forward selection; GLMpath; microarray data; random forests; ranking rules; support vector machines; SVM-based criteria},
language = {fre},
number = {3},
pages = {43-66},
publisher = {Société française de statistique},
title = {Sélection de variables pour la classification binaire en grande dimension : comparaisons et application aux données de biopuces},
url = {http://eudml.org/doc/93483},
volume = {149},
year = {2008},
}

TY - JOUR
AU - Ghattas, Badih
AU - Ben Ishak, Anis
TI - Sélection de variables pour la classification binaire en grande dimension : comparaisons et application aux données de biopuces
JO - Journal de la société française de statistique
PY - 2008
PB - Société française de statistique
VL - 149
IS - 3
SP - 43
EP - 66
AB - Dans cet article nous nous proposons de comparer trois méthodes récentes de sélection de variables dans le cadre de la classification binaire. Le contexte auquel nous nous intéressons ici est celui où le nombre de variables est très grand et beaucoup plus important que le nombre d’observations, comme c’est le cas pour les données issues des biopuces. Les approches comparées sont de type SVM, GLM sous contraintes de type $L_{1}$ et Forêts Aléatoires.
LA - fre
KW - bootstrap; cross validation; feature selection; forward selection; GLMpath; microarray data; random forests; ranking rules; support vector machines; SVM-based criteria
UR - http://eudml.org/doc/93483
ER -

References

  1. [1] Alizadeh A. A. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403 : 503-511. 
  2. [2] Alon U., N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon cancer tissues probed by oligonucleotide arrays. Cell Biology, 96(12) : 6745-6750. 
  3. [3] Ambroise C. and G. McLachlan (2002). Selection bias in gene extraction on the basis of microarray gene expression data. Proceedings of the National Academy of Sciences, USA, 99(10) : 6562-6566. Zbl1034.92013
  4. [4] Ben Ishak A. and B. Ghattas (2005). An efficient method for variable selection using svm-based criteria. Pré-publication de l’Institut de Mathématiques de Luminy, Marseille, France. 
  5. [5] Ben Ishak A. (2007). Sélection de variables par les machines à vecteurs supports pour la discrimination binaire et multiclasse en grande dimension. PhD thesis, Université de la Méditerranée, defended 6 September 2007. (http://lumimath.univ-mrs.fr/~ghattas/theseAnisBenIshak.pdf) 
  6. [6] Boser B., I. Guyon, and V. Vapnik (1992). A training algorithm for optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, pages 144-152, Pittsburgh. ACM. 
  7. [7] Breiman L., J. H. Friedman, R. A. Olshen, and C. J. Stone (1984). Classification and Regression Trees. Wadsworth and Brooks. Zbl0541.62042MR726392
  8. [8] Breiman L. (2001). Random forests. Machine Learning Journal, 45 :5–32. Zbl1007.68152
  9. [9] Cristianini N. and J. Shawe-Taylor (2000). Introduction to Support Vector Machines. Cambridge University Press. Zbl0994.68074
  10. [10] Díaz-Uriarte R. and S. Alvarez de Andrés (2006). Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7 : 3, pp 1-13. 
  11. [11] Dudoit S., J. Fridlyand, and T. Speed (2002). Comparison of discrimination methods for the classification of tumors using gene expression data, J. Amer. Stat. Assoc.. Zbl1073.62576MR1963389
  12. [12] Efron B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. Annals of Statistics, 32(2) :407-499. Zbl1091.62054MR2060166
  13. [13] Ghattas B. and G. Oppenheim (2001). Etude de faisabilité : Modèles globaux pour la mise au point moteur. Renault technical report, 56 pages. 
  14. [14] Golub T. R., D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander (1999). Molecular classification of cancer : Class discovery and class prediction by gene expression monitoring. Science, 286 : 531-537. 
  15. [15] Guyon I. and A. Elisseeff (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3 : 1157-1182. Zbl1102.68556
  16. [16] Guyon I., J. Weston, S. Barnhill, and V. Vapnik (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3) : 389-422. Zbl0998.68111
  17. [17] Kohavi R. and G. H. John (1997). Wrappers for Feature Subset Selection. Artificial Intelligence, 97(1-2) : 273-324. Zbl0904.68143
  18. [18] Liaw A. and M. Wiener (2002). Classification and Regression by Random Forest. Rnews, 2 :18-22. 
  19. [19] Luntz A. and V. Brailovsky (1969). On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica, 3. 
  20. [20] McCullagh P. and J. Nelder (1989). Generalized Linear Models. CHAPMAN & HALL/CRC, Boca Raton. Zbl0744.62098MR727836
  21. [21] Park M. Y. and T. Hastie (2006). L1 Regularization Path Algorithm for Generalized Linear Models. Technical report, Stanford University. 
  22. [22] Poggi J. M. and C. Tuleau (2006). Classification supervisée en grande dimension. Application à l’agrément de conduite automobile. Revue de Statistique Appliquée, LIV (4), 39-58. 
  23. [23] Rakotomamonjy A. (2003). Variable selection using SVM-based criteria. Journal of Machine Learning Research, 3 : 1357-1370. Zbl1102.68583MR2020764
  24. [24] Reunanen J. (2003). Overfitting in Making Comparisons Between Variable Selection Methods. Journal of Machine Learning Research, 3 :1371-1382. Zbl1102.68635
  25. [25] Singh D., P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo, A. A. Renshaw, A. V. D’Amico, J. P. Richie, E. S. Lander, M. Loda, P. W. Kantoff, T. R. Golub, and W. R. Sellers (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2) : 203-209. 
  26. [26] Somol P., P. Pudil, J. Novovičová, and P. Paclik (1999). Adaptive floating search methods in feature selection. Pattern Recognition Letters, 20 :1157-1163. 
  27. [27] Svetnik V., A. Liaw, C. Tong, and T. Wang (2004). Application of Breiman’s random forest to modeling structure-activity relationships of pharmaceutical molecules. Multiple Classifier Systems. Lecture Notes in Computer Science, Springer, 3077 : 334-343. 
  28. [28] Vapnik V. (1995). The Nature of Statistical Learning Theory. Springer Verlag, New York. Zbl0833.62008MR1367965
  29. [29] Vapnik V. (1998). Statistical Learning Theory. John Wiley and Sons, New York. Zbl0935.62007MR1641250
  30. [30] Vapnik V. and O. Chapelle (2000). Bounds on error expectation for support vector machines. Neural Computation, 12 : 9. 
  31. [31] Weston J., A. Elisseeff, B. Schoelkopf, and M. Tipping (2003). Use of the zero norm with linear models and kernel methods. Journal of Machine Learning Research, 3 : 1439-1461. Zbl1102.68605MR2020766
