Sélection de variables pour la classification binaire en grande dimension : comparaisons et application aux données de biopuces

Badih Ghattas; Anis Ben Ishak

Journal de la société française de statistique (2008)

Volume: 149, Issue: 3, pages 43-66
ISSN: 1962-5197

Abstract

In this paper we compare three methods for selecting important features in binary classification. We focus on the case where the sample size is smaller than the number of variables. The three approaches used are based on Support Vector Machines, $L_{1}$-constrained Generalized Linear Models, and Random Forests.

How to cite

Ghattas, Badih, and Ben Ishak, Anis. "Sélection de variables pour la classification binaire en grande dimension : comparaisons et application aux données de biopuces." Journal de la société française de statistique 149.3 (2008): 43-66. <http://eudml.org/doc/93483>.

@article{Ghattas2008,
abstract = {Dans cet article nous nous proposons de comparer trois méthodes récentes de sélection de variables dans le cadre de la classification binaire. Le contexte auquel nous nous intéressons ici est celui où le nombre de variables est très grand et beaucoup plus important que le nombre d’observations, comme c’est le cas pour les données issues des biopuces. Les approches comparées sont de type SVM, GLM sous contraintes de type $L_{1}$ et Forêts Aléatoires.},
author = {Ghattas, Badih and Ben Ishak, Anis},
journal = {Journal de la société française de statistique},
keywords = {bootstrap; cross validation; feature selection; forward selection; GLMpath; microarray data; random forests; ranking rules; support vector machines; SVM-based criteria},
language = {fre},
number = {3},
pages = {43-66},
publisher = {Société française de statistique},
title = {Sélection de variables pour la classification binaire en grande dimension : comparaisons et application aux données de biopuces},
url = {http://eudml.org/doc/93483},
volume = {149},
year = {2008},
}

TY - JOUR
AU - Ghattas, Badih
AU - Ben Ishak, Anis
TI - Sélection de variables pour la classification binaire en grande dimension : comparaisons et application aux données de biopuces
JO - Journal de la société française de statistique
PY - 2008
PB - Société française de statistique
VL - 149
IS - 3
SP - 43
EP - 66
AB - Dans cet article nous nous proposons de comparer trois méthodes récentes de sélection de variables dans le cadre de la classification binaire. Le contexte auquel nous nous intéressons ici est celui où le nombre de variables est très grand et beaucoup plus important que le nombre d’observations, comme c’est le cas pour les données issues des biopuces. Les approches comparées sont de type SVM, GLM sous contraintes de type $L_{1}$ et Forêts Aléatoires.
LA - fre
KW - bootstrap; cross validation; feature selection; forward selection; GLMpath; microarray data; random forests; ranking rules; support vector machines; SVM-based criteria
UR - http://eudml.org/doc/93483
ER -

Citations in EuDML Documents

Marie Sauve, Christine Tuleau-Malot, Variable selection through CART
