Nonparametric statistical analysis for multiple comparison of machine learning regression algorithms

Bogdan Trawiński; Magdalena Smętek; Zbigniew Telec; Tadeusz Lasota

International Journal of Applied Mathematics and Computer Science (2012)

Volume: 22, Issue: 4, page 867-881
ISSN: 1641-876X

Abstract

In this paper, we present guidelines for applying nonparametric statistical tests and post-hoc procedures devised to perform multiple comparisons of machine learning algorithms. We emphasize that it is necessary to distinguish between pairwise and multiple comparison tests. We show that the pairwise Wilcoxon test, when employed for multiple comparisons, leads to overoptimistic conclusions. We carry out an intensive normality examination employing ten different tests, showing that the output of machine learning algorithms for regression problems does not satisfy normality requirements. We conduct experiments on nonparametric statistical tests and post-hoc procedures designed for multiple 1 × N and N × N comparisons with six different neural regression algorithms over 29 benchmark regression data sets. Our investigation demonstrates the usefulness and strength of multiple comparison statistical procedures for analysing and selecting machine learning algorithms.
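The abstract's warning about using a pairwise test for multiple comparisons can be illustrated with a short sketch. It is not taken from the paper: it assumes independent tests for the error-rate formula (pairwise comparisons over the same data sets are not strictly independent), uses Holm's step-down adjustment as one example of a post-hoc procedure, and the p-values are hypothetical. The function names are likewise illustrative.

```python
from math import comb


def familywise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false rejection among m tests at level alpha.

    Assumes the m tests are independent, which is a simplification.
    """
    return 1.0 - (1.0 - alpha) ** m


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p-values must be non-decreasing along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Six algorithms, as in the abstract: N x N means all pairs are compared.
n_algorithms = 6
n_pairs = comb(n_algorithms, 2)
fwer = familywise_error_rate(0.05, n_pairs)

# Hypothetical unadjusted p-values for a 1 x N comparison
# (one control algorithm against the five others).
raw_p = [0.001, 0.012, 0.030, 0.041, 0.200]
adjusted_p = holm_adjust(raw_p)

print(f"pairs compared: {n_pairs}")
print(f"family-wise error rate at alpha = 0.05: {fwer:.3f}")
for raw, adj in zip(raw_p, adjusted_p):
    print(f"raw p = {raw:.3f}  Holm-adjusted p = {adj:.3f}")
```

Under these assumptions, 15 unadjusted pairwise tests at the 0.05 level carry roughly a one-in-two chance of at least one false rejection. In the hypothetical 1 × N example, two of the four raw p-values below 0.05 no longer fall below 0.05 after adjustment.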

How to cite

Bogdan Trawiński, et al. "Nonparametric statistical analysis for multiple comparison of machine learning regression algorithms." International Journal of Applied Mathematics and Computer Science 22.4 (2012): 867-881. <http://eudml.org/doc/244548>.

@article{BogdanTrawiński2012,
abstract = {In this paper, we present guidelines for applying nonparametric statistical tests and post-hoc procedures devised to perform multiple comparisons of machine learning algorithms. We emphasize that it is necessary to distinguish between pairwise and multiple comparison tests. We show that the pairwise Wilcoxon test, when employed for multiple comparisons, leads to overoptimistic conclusions. We carry out an intensive normality examination employing ten different tests, showing that the output of machine learning algorithms for regression problems does not satisfy normality requirements. We conduct experiments on nonparametric statistical tests and post-hoc procedures designed for multiple 1 × N and N × N comparisons with six different neural regression algorithms over 29 benchmark regression data sets. Our investigation demonstrates the usefulness and strength of multiple comparison statistical procedures for analysing and selecting machine learning algorithms.},
author = {Trawiński, Bogdan and Smętek, Magdalena and Telec, Zbigniew and Lasota, Tadeusz},
journal = {International Journal of Applied Mathematics and Computer Science},
keywords = {machine learning; nonparametric statistical tests; statistical regression; neural networks; multiple comparison tests},
language = {eng},
number = {4},
pages = {867-881},
title = {Nonparametric statistical analysis for multiple comparison of machine learning regression algorithms},
url = {http://eudml.org/doc/244548},
volume = {22},
year = {2012},
}

TY - JOUR
AU - Bogdan Trawiński
AU - Magdalena Smętek
AU - Zbigniew Telec
AU - Tadeusz Lasota
TI - Nonparametric statistical analysis for multiple comparison of machine learning regression algorithms
JO - International Journal of Applied Mathematics and Computer Science
PY - 2012
VL - 22
IS - 4
SP - 867
EP - 881
AB - In this paper, we present guidelines for applying nonparametric statistical tests and post-hoc procedures devised to perform multiple comparisons of machine learning algorithms. We emphasize that it is necessary to distinguish between pairwise and multiple comparison tests. We show that the pairwise Wilcoxon test, when employed for multiple comparisons, leads to overoptimistic conclusions. We carry out an intensive normality examination employing ten different tests, showing that the output of machine learning algorithms for regression problems does not satisfy normality requirements. We conduct experiments on nonparametric statistical tests and post-hoc procedures designed for multiple 1 × N and N × N comparisons with six different neural regression algorithms over 29 benchmark regression data sets. Our investigation demonstrates the usefulness and strength of multiple comparison statistical procedures for analysing and selecting machine learning algorithms.
LA - eng
KW - machine learning; nonparametric statistical tests; statistical regression; neural networks; multiple comparison tests
UR - http://eudml.org/doc/244548
ER -
