Text document classification based on mixture models

Jana Novovičová; Antonín Malík

Kybernetika (2004)

  • Volume: 40, Issue: 3, page [293]-304
  • ISSN: 0023-5954

Abstract

Finite mixture modelling of class-conditional distributions is a standard method in statistical pattern recognition. Using a bag-of-words vector representation of documents, this paper explores the mixture of multinomial distributions as a model of the class-conditional distribution for the multiclass text document classification task. The proposed model was compared experimentally with the standard Bernoulli and multinomial models, as well as with a model based on a mixture of multivariate Bernoulli distributions, on the Reuters-21578 and Newsgroups data sets. Preliminary experimental results indicate the effectiveness of the proposed model for text classification.
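
To make the idea in the abstract concrete, the following is a minimal sketch of a classifier whose class-conditional distribution is a mixture of multinomials over bag-of-words count vectors, fitted with the EM algorithm. It is not the authors' implementation: the function names, the additive smoothing constant alpha, the random initialization, and the fixed number of EM iterations are all assumptions chosen for illustration.

```python
import math
import random


def log_multinomial(counts, log_theta):
    # Log-likelihood of a count vector, omitting the multinomial coefficient,
    # which is the same for every component and class for a fixed document.
    return sum(c * lt for c, lt in zip(counts, log_theta) if c)


def logsumexp(values):
    # Numerically stable log(sum(exp(v))).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def component_logs(doc, weights, log_thetas):
    # log(weight_m) + log p(doc | component m); weights are floored to avoid log(0).
    return [math.log(max(w, 1e-12)) + log_multinomial(doc, lt)
            for w, lt in zip(weights, log_thetas)]


def fit_mixture(docs, n_components, n_iter=50, alpha=1.0, seed=0):
    # EM for one class: docs is a list of equal-length word-count vectors.
    # alpha is an assumed additive smoothing constant (not taken from the paper).
    rng = random.Random(seed)
    vocab = len(docs[0])
    # Assumed initialization: random soft assignments of documents to components.
    resp = []
    for _ in docs:
        row = [rng.random() + 1e-3 for _ in range(n_components)]
        total = sum(row)
        resp.append([r / total for r in row])
    for _ in range(n_iter):
        # M-step: re-estimate mixing weights and smoothed word probabilities.
        weights = [sum(r[m] for r in resp) / len(docs) for m in range(n_components)]
        log_thetas = []
        for m in range(n_components):
            totals = [alpha + sum(r[m] * d[v] for r, d in zip(resp, docs))
                      for v in range(vocab)]
            z = sum(totals)
            log_thetas.append([math.log(t / z) for t in totals])
        # E-step: posterior probability of each component given each document.
        resp = []
        for d in docs:
            logs = component_logs(d, weights, log_thetas)
            norm = logsumexp(logs)
            resp.append([math.exp(v - norm) for v in logs])
    return weights, log_thetas


def train(docs_by_class, n_components):
    # One mixture per class, plus a class prior from relative class frequency.
    total = sum(len(docs) for docs in docs_by_class.values())
    return {label: (len(docs) / total, *fit_mixture(docs, n_components))
            for label, docs in docs_by_class.items()}


def class_log_score(doc, prior, weights, log_thetas):
    # log P(class) + log of the mixture likelihood of the document.
    return math.log(prior) + logsumexp(component_logs(doc, weights, log_thetas))


def classify(doc, model):
    # Bayes decision rule: pick the class with the highest posterior score.
    return max(model, key=lambda label: class_log_score(doc, *model[label]))
```

Calling train with a dictionary that maps each class label to a list of count vectors returns a model that classify can apply to a new count vector. Setting n_components to 1 reduces each class model to a single smoothed multinomial, which corresponds to the standard multinomial model mentioned in the abstract as a baseline.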

How to cite

Novovičová, Jana, and Malík, Antonín. "Text document classification based on mixture models." Kybernetika 40.3 (2004): [293]-304. <http://eudml.org/doc/33701>.

@article{Novovičová2004,
abstract = {Finite mixture modelling of class-conditional distributions is a standard method in statistical pattern recognition. Using a bag-of-words vector representation of documents, this paper explores the mixture of multinomial distributions as a model of the class-conditional distribution for the multiclass text document classification task. The proposed model was compared experimentally with the standard Bernoulli and multinomial models, as well as with a model based on a mixture of multivariate Bernoulli distributions, on the Reuters-21578 and Newsgroups data sets. Preliminary experimental results indicate the effectiveness of the proposed model for text classification.},
author = {Novovičová, Jana and Malík, Antonín},
journal = {Kybernetika},
keywords = {text classification; multinomial mixture model},
language = {eng},
number = {3},
pages = {[293]-304},
publisher = {Institute of Information Theory and Automation AS CR},
title = {Text document classification based on mixture models},
url = {http://eudml.org/doc/33701},
volume = {40},
year = {2004},
}

TY - JOUR
AU - Novovičová, Jana
AU - Malík, Antonín
TI - Text document classification based on mixture models
JO - Kybernetika
PY - 2004
PB - Institute of Information Theory and Automation AS CR
VL - 40
IS - 3
SP - [293]
EP - 304
AB - Finite mixture modelling of class-conditional distributions is a standard method in statistical pattern recognition. Using a bag-of-words vector representation of documents, this paper explores the mixture of multinomial distributions as a model of the class-conditional distribution for the multiclass text document classification task. The proposed model was compared experimentally with the standard Bernoulli and multinomial models, as well as with a model based on a mixture of multivariate Bernoulli distributions, on the Reuters-21578 and Newsgroups data sets. Preliminary experimental results indicate the effectiveness of the proposed model for text classification.
LA - eng
KW - text classification; multinomial mixture model
UR - http://eudml.org/doc/33701
ER -
