An alternative extension of the k-means algorithm for clustering categorical data

Ohn San; Van-Nam Huynh; Yoshiteru Nakamori

International Journal of Applied Mathematics and Computer Science (2004)

  • Volume: 14, Issue: 2, page 241-247
  • ISSN: 1641-876X

Abstract

top
Most of the earlier work on clustering has mainly been focused on numerical data whose inherent geometric properties can be exploited to naturally define distance functions between data points. Recently, the problem of clustering categorical data has started drawing interest. However, the computational cost makes most of the previous algorithms unacceptable for clustering very large databases. The -means algorithm is well known for its efficiency in this respect. At the same time, working only on numerical data prohibits them from being used for clustering categorical data. The main contribution of this paper is to show how to apply the notion of 'cluster centers' on a dataset of categorical objects and how to use this notion for formulating the clustering problem of categorical objects as a partitioning problem. Finally, a -means-like algorithm for clustering categorical data is introduced. The clustering performance of the algorithm is demonstrated with two well-known data sets, namely, em soybean disease and em nursery databases.

How to cite

top

San, Ohn, Huynh, Van-Nam, and Nakamori, Yoshiteru. "An alternative extension of the k-means algorithm for clustering categorical data." International Journal of Applied Mathematics and Computer Science 14.2 (2004): 241-247. <http://eudml.org/doc/207695>.

@article{San2004,
abstract = {Most of the earlier work on clustering has mainly been focused on numerical data whose inherent geometric properties can be exploited to naturally define distance functions between data points. Recently, the problem of clustering categorical data has started drawing interest. However, the computational cost makes most of the previous algorithms unacceptable for clustering very large databases. The -means algorithm is well known for its efficiency in this respect. At the same time, working only on numerical data prohibits them from being used for clustering categorical data. The main contribution of this paper is to show how to apply the notion of 'cluster centers' on a dataset of categorical objects and how to use this notion for formulating the clustering problem of categorical objects as a partitioning problem. Finally, a -means-like algorithm for clustering categorical data is introduced. The clustering performance of the algorithm is demonstrated with two well-known data sets, namely, em soybean disease and em nursery databases.},
author = {San, Ohn, Huynh, Van-Nam, Nakamori, Yoshiteru},
journal = {International Journal of Applied Mathematics and Computer Science},
keywords = {data mining; cluster analysis; categorical data},
language = {eng},
number = {2},
pages = {241-247},
title = {An alternative extension of the k-means algorithm for clustering categorical data},
url = {http://eudml.org/doc/207695},
volume = {14},
year = {2004},
}

TY - JOUR
AU - San, Ohn
AU - Huynh, Van-Nam
AU - Nakamori, Yoshiteru
TI - An alternative extension of the k-means algorithm for clustering categorical data
JO - International Journal of Applied Mathematics and Computer Science
PY - 2004
VL - 14
IS - 2
SP - 241
EP - 247
AB - Most of the earlier work on clustering has mainly been focused on numerical data whose inherent geometric properties can be exploited to naturally define distance functions between data points. Recently, the problem of clustering categorical data has started drawing interest. However, the computational cost makes most of the previous algorithms unacceptable for clustering very large databases. The -means algorithm is well known for its efficiency in this respect. At the same time, working only on numerical data prohibits them from being used for clustering categorical data. The main contribution of this paper is to show how to apply the notion of 'cluster centers' on a dataset of categorical objects and how to use this notion for formulating the clustering problem of categorical objects as a partitioning problem. Finally, a -means-like algorithm for clustering categorical data is introduced. The clustering performance of the algorithm is demonstrated with two well-known data sets, namely, em soybean disease and em nursery databases.
LA - eng
KW - data mining; cluster analysis; categorical data
UR - http://eudml.org/doc/207695
ER -

References

top
  1. Anderberg M.R. (1973): Cluster Analysis for Applications. - New York: Academic Press. Zbl0299.62029
  2. Ball G.H. and Hall D.J. (1967): A clustering technique for summarizing multivariate data.- Behav. Sci., Vol. 12, No. 2, pp. 153-155. 
  3. Bezdek J.C. (1980): A convergence theorem for the fuzzy ISODATA clustering algorithms.- IEEE Trans. Pattern Anal. Mach. Intell., Vol. 2, No. 1, pp. 1-8. Zbl0441.62055
  4. Blake C.L. and Merz C.J. (1998): UCI Repository of machine learning databases. - Available at: http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, CA: University of California, Department of Information and Computer Science. 
  5. Cox D. (1957): Note on grouping.- J. Amer. Stat. Assoc., Vol. 52, pp. 543-547. Zbl0088.35402
  6. Fisher D.H. (1987): Knowledge acquisition via incremental conceptual clustering.- Mach. Learn., Vol. 2, No. 2, pp. 139-172. 
  7. Ganti V., Gehrke J. and Ramakrishnan R. (1999): CATUS - Clustering categorical data using summaries. -Proc. Int. Conf. Knowledge Discovery and Data Mining, San Diego, USA, pp. 73-83. 
  8. Gibson D., Kleinberg J. and Raghavan P. (1998) Clustering categorical data: An approach based on dynamic systems. Proc. 24-th Int. Conf. Very LargeDatabases, New York, pp. 311-323. 
  9. Gowda K.C. and Diday E. (1991): Symbolic clustering using a new dissimilarity measure. - Pattern Recogn., Vol. 24, No. 6, pp. 567-578. 
  10. Guha S., Rastogi R. and Shim K. (2000): ROCK: A robust clustering algorithm for categorical attributes. - Inf. Syst., Vol. 25, No. 5, pp. 345-366. 
  11. Han J. and Kamber M. (2001): Data Mining: Concepts and Techniques. - San Francisco: Morgan Kaufmann Publishers. 
  12. Hathaway R.J. and Bezdek J.C. (1986): Local convergence of the c-means algorithms.- Pattern Recogn., Vol. 19, No. 6, pp. 477-480. Zbl0623.62058
  13. Huang Z. (1997): Clustering large data sets with mixed numeric and categorical values, In: KDD: Techniques and Applications (H. Lu, H. Motoda and H. Luu, Eds.). - Singapore: World Scientific, pp. 21-34. 
  14. Huang Z. (1998): Extensions to the k-means algorithm for clustering large data setswith categorical values. - Data Mining Knowl. Discov., Vol. 2, No. 2, pp. 283-304. 
  15. Huang Z. and Ng M.K. (1999): A fuzzy k-modes algorithm for clustering categorical data.- IEEE Trans. Fuzzy Syst., Vol. 7, No. 4, pp. 446-452. 
  16. Ismail M.A. and Selim S.Z. (1986): Fuzzy c-means: Optimality of solutions and effective termination of the problem.- Pattern Recogn., Vol. 19, No. 6, pp. 481-485. Zbl0623.62059
  17. Jain A.K. and Dubes R.C. (1988): Algorithms for Clustering Data.- Englewood Cliffs: Prentice Hall. Zbl0665.62061
  18. Kaufman L. and Rousseeuw P.J. (1990): Finding Groups in Data. - New York: Wiley. 
  19. MacQueen J.B. (1967): Some methods for classification and analysis of multivariate observations. - Proc. 5-th Symp. Mathematical Statistics and Probability, Berkelely, CA, Vol. 1, pp. 281-297. Zbl0214.46201
  20. Michalski R.S. and Stepp R.E. (1983): Automated construction of classifications: Conceptual clustering versus numerical taxonomy.- IEEE Trans. Pattern Anal. Mach. Intell., Vol. PAMI-5, No. 4, pp. 396-410. 
  21. Ralambondrainy H. (1995): A conceptual version of the k-means algorithm.- Pattern Recogn. Lett., Vol. 15, No. 11, pp. 1147-1157. 
  22. Ruspini E.R. (1969): A new approach to clustering. -Inf. Contr., Vol. 15, No. 1, pp. 22-32. Zbl0192.57101
  23. Selim S.Z. and Ismail M.A. (1984): k-Means-type algorithms: A generalized convergence theorem and characterization of local optimality. - IEEE Trans. Pattern Anal. Mach. Intell., Vol. PAMI-6, No. 1, pp. 81-87. Zbl0546.62037

NotesEmbed ?

top

You must be logged in to post comments.

To embed these notes on your page include the following JavaScript code on your page where you want the notes to appear.

Only the controls for the widget will be shown in your chosen language. Notes will be shown in their authored language.

Tells the widget how many notes to show per page. You can cycle through additional notes using the next and previous controls.

    
                

Note: Best practice suggests putting the JavaScript code just before the closing </body> tag.