An algorithm for reducing the dimension and size of a sample for data exploration procedures

Piotr Kulczycki; Szymon Łukasik

International Journal of Applied Mathematics and Computer Science (2014)

  • Volume: 24, Issue: 1, pages 133-149
  • ISSN: 1641-876X

Abstract

The paper deals with the issue of reducing the dimension and size of a data set (random sample) for exploratory data analysis procedures. The algorithm investigated here is based on a linear transformation to a space of lower dimension, while preserving as far as possible the distances between particular elements. The elements of the transformation matrix are computed using the parallel fast simulated annealing metaheuristic. Moreover, data set elements which have undergone a significant change in location relative to the others are eliminated or assigned reduced importance. The presented method can be applied universally across a wide range of data exploration problems, offering flexible customization, suitability for use in dynamic data environments, and performance comparable to or better than that of principal component analysis. Its positive features were verified in detail for the domain's fundamental tasks of clustering, classification and detection of atypical elements (outliers).
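
As an illustration of the core mechanism described above, the sketch below searches for a linear projection matrix that minimizes a raw stress measure of pairwise-distance distortion, using a basic fast simulated annealing loop with a heavy-tailed (Cauchy) neighbor generator and a 1/k cooling schedule. This is a minimal sketch, not the authors' implementation: the names raw_stress and anneal_projection, the schedule and all parameter values are illustrative assumptions, and the paper's parallel annealing runs and its elimination/reweighting of strongly displaced sample elements are omitted.

import numpy as np

def pairwise_distances(Z):
    # Euclidean distance matrix between the rows of Z.
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def raw_stress(X, A):
    # Sum of squared differences between pairwise distances in the
    # original space X and in the projected space X @ A.T.
    return ((pairwise_distances(X) - pairwise_distances(X @ A.T)) ** 2).sum()

def anneal_projection(X, m, iters=2000, t0=1.0, seed=0):
    # Simulated annealing search for an m x n projection matrix A
    # (n = original dimension) minimizing the raw stress; all
    # parameter defaults here are assumptions, not the paper's values.
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    A = rng.standard_normal((m, n))
    cur = best = raw_stress(X, A)
    best_A = A.copy()
    for k in range(1, iters + 1):
        t = t0 / k                                  # fast (1/k) cooling
        cand = A + t * rng.standard_cauchy((m, n))  # heavy-tailed move
        cost = raw_stress(X, cand)
        # Always accept improvements; accept deteriorations with
        # Boltzmann probability exp(-(cost - cur) / t).
        if cost < cur or rng.random() < np.exp((cur - cost) / t):
            A, cur = cand, cost
            if cur < best:
                best, best_A = cur, A.copy()
    return best_A

# Example: project a 5-dimensional sample onto 2 dimensions.
X = np.random.default_rng(1).standard_normal((100, 5))
Y = X @ anneal_projection(X, m=2).T

In the spirit of the paper, the per-element contributions to the stress could then serve to eliminate, or assign lower weights to, those sample elements whose relative location changed most under the projection.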

How to cite


Piotr Kulczycki, and Szymon Łukasik. "An algorithm for reducing the dimension and size of a sample for data exploration procedures." International Journal of Applied Mathematics and Computer Science 24.1 (2014): 133-149. <http://eudml.org/doc/271922>.

@article{PiotrKulczycki2014,
abstract = {The paper deals with the issue of reducing the dimension and size of a data set (random sample) for exploratory data analysis procedures. The algorithm investigated here is based on a linear transformation to a space of lower dimension, while preserving as far as possible the distances between particular elements. The elements of the transformation matrix are computed using the parallel fast simulated annealing metaheuristic. Moreover, data set elements which have undergone a significant change in location relative to the others are eliminated or assigned reduced importance. The presented method can be applied universally across a wide range of data exploration problems, offering flexible customization, suitability for use in dynamic data environments, and performance comparable to or better than that of principal component analysis. Its positive features were verified in detail for the domain's fundamental tasks of clustering, classification and detection of atypical elements (outliers).},
author = {Piotr Kulczycki and Szymon Łukasik},
journal = {International Journal of Applied Mathematics and Computer Science},
keywords = {dimension reduction; sample size reduction; linear transformation; simulated annealing; data mining},
language = {eng},
number = {1},
pages = {133-149},
title = {An algorithm for reducing the dimension and size of a sample for data exploration procedures},
url = {http://eudml.org/doc/271922},
volume = {24},
year = {2014},
}

TY - JOUR
AU - Piotr Kulczycki
AU - Szymon Łukasik
TI - An algorithm for reducing the dimension and size of a sample for data exploration procedures
JO - International Journal of Applied Mathematics and Computer Science
PY - 2014
VL - 24
IS - 1
SP - 133
EP - 149
AB - The paper deals with the issue of reducing the dimension and size of a data set (random sample) for exploratory data analysis procedures. The algorithm investigated here is based on a linear transformation to a space of lower dimension, while preserving as far as possible the distances between particular elements. The elements of the transformation matrix are computed using the parallel fast simulated annealing metaheuristic. Moreover, data set elements which have undergone a significant change in location relative to the others are eliminated or assigned reduced importance. The presented method can be applied universally across a wide range of data exploration problems, offering flexible customization, suitability for use in dynamic data environments, and performance comparable to or better than that of principal component analysis. Its positive features were verified in detail for the domain's fundamental tasks of clustering, classification and detection of atypical elements (outliers).
LA - eng
KW - dimension reduction; sample size reduction; linear transformation; simulated annealing; data mining
UR - http://eudml.org/doc/271922
ER -

References

  1. Aarts, E., Korst, J. and van Laarhoven, P. (1997). Simulated annealing, in E. Aarts and J. Lenstra (Eds.), Local Search in Combinatorial Optimization, Wiley, Chichester, pp. 91-120. Zbl0905.90140
  2. Alba, E. (2005). Parallel Metaheuristics: A New Class of Algorithms, Wiley, New York, NY. Zbl1094.90052
  3. Aswani Kumar, C. and Srinivas, S. (2006). Latent semantic indexing using eigenvalue analysis for efficient information retrieval, International Journal of Applied Mathematics and Computer Science 16(4): 551-558. Zbl1122.68047
  4. Aswani Kumar, C. (2009). Analysis of unsupervised dimensionality reduction techniques, Computer Science and Information Systems 6(2): 217-227. 
  5. Azencott, R. (1992). Simulated Annealing: Parallelization Techniques, Wiley, New York, NY. 
  6. Bartenhagen, C., Klein, H.-U., Ruckert, C., Jiang, X. and Dugas, M. (2010). Comparative study of unsupervised dimension reduction techniques for the visualization of microarray gene expression data, BMC Bioinformatics 11, paper no. 567. 
  7. Bartkutė, V. and Sakalauskas, L. (2009). Statistical inferences for termination of Markov type random search algorithms, Journal of Optimization Theory and Applications 141(3): 475-493. Zbl1183.62081
  8. Ben-Ameur, W. (2004). Computing the initial temperature of simulated annealing, Computational Optimization and Applications 29(3): 367-383. Zbl1062.90073
  9. Borg, I. and Groenen, P. (2005). Modern Multidimensional Scaling. Theory and Applications, Springer-Verlag, Berlin. Zbl1085.62079
  10. Camastra, F. (2003). Data dimensionality estimation methods: A survey, Pattern Recognition 36(12): 2945-2954. Zbl1059.68100
  11. Charytanowicz, M., Niewczas, J., Kulczycki, P., Kowalski, P., Łukasik, S. and Żak, S. (2010). Complete gradient clustering algorithm for features analysis of X-ray images, in E. Piętka and J. Kawa (Eds.), Information Technologies in Biomedicine, Vol. 2, Springer-Verlag, Berlin, pp. 15-24. 
  12. Cortez, P., Cerdeira, A., Almeida, F., Matos, T. and Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties, Decision Support Systems 47(4): 547-553. 
  13. Cox, T. and Cox, M. (2000). Multidimensional Scaling, Chapman and Hall, Boca Raton, FL. Zbl1147.68460
  14. Cunningham, P. (2007). Dimension reduction, Technical report, UCD School of Computer Science and Informatics, Dublin. 
  15. Czarnowski, I. and Jędrzejowicz, P. (2011). Application of agent-based simulated annealing and tabu search procedures to solving the data reduction problem, International Journal of Applied Mathematics and Computer Science 21(1): 57-68, DOI: 10.2478/v10006-011-0004-3. Zbl1221.68191
  16. David, H. and Nagaraja, H. (2003). Order Statistics, Wiley, New York, NY. Zbl1053.62060
  17. Deng, Z., Chung, F.-L. and Wang, S. (2008). FRSDE: Fast reduced set density estimator using minimal enclosing ball approximation, Pattern Recognition 41(4): 1363-1372. Zbl1131.68086
  18. François, D., Wertz, V. and Verleysen, M. (2007). The concentration of fractional distances, IEEE Transactions on Knowledge and Data Engineering 19(7): 873-886. 
  19. Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence 6(6): 721-741. Zbl0573.62030
  20. Gendreau, M. and Potvin, J.-Y. (2010). Handbook of Metaheuristics, Springer, New York, NY. Zbl1198.90002
  21. Han, J. and Kamber, M. (2006). Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, CA. Zbl05951239
  22. Ingber, L. (1996). Adaptive simulated annealing (ASA): Lessons learned, Control and Cybernetics 25(1): 33-54. Zbl0860.93035
  23. Inza, I., Larrañaga, P., Etxeberria, R. and Sierra, B. (2000). Feature subset selection by Bayesian network-based optimization, Artificial Intelligence 123(1-2): 157-184. Zbl0952.68118
  24. Ishibuchi, H., Nakashima, T. and Murata, T. (2001). Three-objective genetics-based machine learning for linguistic rule extraction, Information Sciences 136(1-4): 109-133. Zbl0996.68175
  25. Kerdprasop, K., Kerdprasop, N. and Sattayatham, P. (2005). Weighted k-means for density-biased clustering, in A. Tjoa and J. Trujillo (Eds.), Data Warehousing and Knowledge Discovery, Lecture Notes in Computer Science, Vol. 3589, Springer-Verlag, Berlin, pp. 488-497. 
  26. Kulczycki, P. (2005). Kernel Estimators in System Analysis, WNT, Warsaw, (in Polish). Zbl1122.93077
  27. Kulczycki, P. (2008). Kernel estimators in industrial applications, in B. Prasad (Ed.), Soft Computing Applications in Industry, Springer-Verlag, Berlin, pp. 69-91. 
  28. Kulczycki, P. and Charytanowicz, M. (2010). A complete gradient clustering algorithm formed with kernel estimators, International Journal of Applied Mathematics and Computer Science 20(1): 123-134, DOI: 10.2478/v10006-010-0009-3. Zbl1300.62043
  29. Kulczycki, P. and Kowalski, P. (2011). Bayes classification of imprecise information of interval type, Control and Cybernetics 40(1): 101-123. Zbl1318.62212
  30. Kulczycki, P. and Łukasik, S. (2014). Reduction of dimension and size of data set by parallel fast simulated annealing, in L.T. Kóczy, C.R. Pozna and J. Kacprzyk (Eds.), Issues and Challenges of Intelligent Systems and Computational Intelligence, Springer-Verlag, Berlin, pp. 273-292. 
  31. Kuo, Y. (2010). Using simulated annealing to minimize fuel consumption for the time-dependent vehicle routing problem, Computers & Industrial Engineering 59(1): 157-165. 
  32. Łukasik, S. and Kulczycki, P. (2011). An algorithm for sample and data dimensionality reduction using fast simulated annealing, in J. Tang, I. King, L. Chen and J. Wang (Eds.), Advanced Data Mining and Applications, Lecture Notes in Computer Science, Vol. 7120, Springer-Verlag, Berlin, pp. 152-161. 
  33. Łukasik, S. and Kulczycki, P. (2013). Using topology preservation measures for multidimensional intelligent data analysis in the reduced feature space, in L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. Zadeh and J. Zurada (Eds.), Artificial Intelligence and Soft Computing, Lecture Notes in Computer Science, Vol. 7895, Springer-Verlag, Berlin, pp. 184-193. 
  34. van der Maaten, L. (2009). Feature Extraction from Visual Data, Ph.D. thesis, Tilburg University, Tilburg. 
  35. Mangasarian, O. and Wolberg, W. (1990). Cancer diagnosis via linear programming, SIAM News 23(5): 1-18. 
  36. Mitra, P., Murthy, C. and Pal, S. (2002). Density-based multiscale data condensation, IEEE Transactions on Pattern Analysis and Machine Intelligence 24(6): 734-747. 
  37. Nam, D., Lee, J.-S. and Park, C. (2004). n-dimensional Cauchy neighbor generation for the fast simulated annealing, IEICE Transactions on Information and Systems E87-D(11): 2499-2502. 
  38. Oliveira, J. and Pedrycz, W. (Eds.) (2007). Advances in Fuzzy Clustering and Its Applications, Wiley, Chichester. 
  39. Pal, S. and Mitra, P. (2004). Pattern Recognition Algorithms for Data Mining, Chapman and Hall, Boca Raton, FL. Zbl1099.68091
  40. Rand, W.M. (1971). Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association 66(336): 846-850. 
  41. Parvin, H., Alizadeh, H. and Minati, B. (2010). A modification on k-nearest neighbor classifier, Global Journal of Computer Science and Technology 10(14): 37-41. 
  42. Sait, S. and Youssef, H. (2000). Iterative Computer Algorithms with Applications in Engineering: Solving Combinatorial Optimization Problems, IEEE Computer Society Press, Los Alamitos, CA. Zbl0933.68151
  43. Sammon, J. (1969). A nonlinear mapping for data structure analysis, IEEE Transactions on Computers 18(5): 401-409. 
  44. Saxena, A., Pal, N. and Vora, M. (2010). Evolutionary methods for unsupervised feature selection using Sammon's stress function, Fuzzy Information and Engineering 2(3): 229-247. 
  45. Strickert, M., Teichmann, S., Sreenivasulu, N. and Seiffert, U. (2005). DIPPP online self-improving linear map for distance-preserving data analysis, 5th Workshop on Self-Organizing Maps, WSOM'05, Paris, France, pp. 661-668. 
  46. Sumi, S.M., Zaman, M.F. and Hirose, H. (2012). A rainfall forecasting method using machine learning models and its application to the Fukuoka city case, International Journal of Applied Mathematics and Computer Science 22(4): 841-854, DOI: 10.2478/v10006-012-0062-1. Zbl1283.68305
  47. Szu, H. and Hartley, R. (1987). Fast simulated annealing, Physics Letters A 122(3-4): 157-162. 
  48. Tian, T., Wilcox, R. and James, G. (2010). Data reduction in classification: A simulated annealing based projection method, Statistical Analysis and Data Mining 3(5): 319-331. 
  49. UC Irvine Machine Learning Repository (2013). http://archive.ics.uci.edu/ml/. 
  50. Vanstrum, M. and Starks, S. (1981). An algorithm for optimal linear maps, Southeastcon Conference, Huntsville, AL, USA, pp. 106-110. 
  51. Wand, M. and Jones, M. (1995). Kernel Smoothing, Chapman and Hall, London. Zbl0854.62043
  52. Wilson, D. and Martinez, T. (2000). Reduction techniques for instance-based learning algorithms, Machine Learning 38(3): 257-286. Zbl0954.68126
  53. Xu, R. and Wunsch, D. (2009). Clustering, Wiley, Hoboken, NJ. 
  54. Zhigljavsky, A. and Žilinskas, A. (2008). Stochastic Global Optimization, Springer-Verlag, Berlin. Zbl1136.90003
