Text document classification based on mixture models

Jana Novovičová; Antonín Malík

Kybernetika (2004)

Volume: 40, Issue: 3, page [293]-304
ISSN: 0023-5954

Abstract

Finite mixture modelling of class-conditional distributions is a standard method in statistical pattern recognition. Using the bag-of-words vector representation of documents, this paper explores the mixture of multinomial distributions as a model of the class-conditional distribution in multiclass text document classification. The proposed model was compared experimentally with the standard Bernoulli and multinomial models, as well as with a model based on a mixture of multivariate Bernoulli distributions, on the Reuters-21578 and Newsgroups data sets. Preliminary experimental results indicate the effectiveness of the proposed model for text classification.
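
As an illustration of the idea in the abstract, the following is a minimal sketch of classification with a multinomial mixture as the class-conditional model. It is not the authors' implementation: the function names, the four-word vocabulary, the class labels, and all parameter values are hypothetical and set by hand. In practice the parameters would be estimated from training data, for instance with the EM algorithm. The sketch assumes every word probability is strictly positive (that is, already smoothed).

```python
import math


def log_mixture_likelihood(counts, weights, word_probs):
    """Log-likelihood of a bag-of-words count vector under a mixture of multinomials.

    The multinomial coefficient is omitted: it depends only on the document,
    so it does not affect the comparison between classes.
    Assumes all word probabilities are strictly positive.
    """
    component_logs = []
    for weight, theta in zip(weights, word_probs):
        log_p = math.log(weight)
        for n, p in zip(counts, theta):
            if n > 0:
                log_p += n * math.log(p)
        component_logs.append(log_p)
    # Log-sum-exp over components for numerical stability.
    top = max(component_logs)
    return top + math.log(sum(math.exp(v - top) for v in component_logs))


def classify(counts, priors, models):
    """Pick the class with the largest log prior plus log class-conditional likelihood."""
    scores = {
        label: math.log(priors[label]) + log_mixture_likelihood(counts, *models[label])
        for label in priors
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy setup: a four-word vocabulary and two classes,
# each modelled by a two-component multinomial mixture.
VOCAB = ["ball", "goal", "vote", "law"]
PRIORS = {"sport": 0.5, "politics": 0.5}
MODELS = {
    # (mixture weights, per-component word probabilities)
    "sport": ([0.5, 0.5], [[0.6, 0.3, 0.05, 0.05], [0.3, 0.6, 0.05, 0.05]]),
    "politics": ([0.5, 0.5], [[0.05, 0.05, 0.6, 0.3], [0.05, 0.05, 0.3, 0.6]]),
}

# A document containing "ball" three times and "goal" once.
label, scores = classify([3, 1, 0, 0], PRIORS, MODELS)
print(label)
```

With a single component of weight 1, the mixture reduces to the standard multinomial model that the abstract uses as a baseline.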

How to cite

Novovičová, Jana, and Malík, Antonín. "Text document classification based on mixture models." Kybernetika 40.3 (2004): [293]-304. <http://eudml.org/doc/33701>.

@article{Novovičová2004,
abstract = {Finite mixture modelling of class-conditional distributions is a standard method in statistical pattern recognition. Using the bag-of-words vector representation of documents, this paper explores the mixture of multinomial distributions as a model of the class-conditional distribution in multiclass text document classification. The proposed model was compared experimentally with the standard Bernoulli and multinomial models, as well as with a model based on a mixture of multivariate Bernoulli distributions, on the Reuters-21578 and Newsgroups data sets. Preliminary experimental results indicate the effectiveness of the proposed model for text classification.},
author = {Novovičová, Jana and Malík, Antonín},
journal = {Kybernetika},
keywords = {text classification; multinomial mixture model},
language = {eng},
number = {3},
pages = {[293]-304},
publisher = {Institute of Information Theory and Automation AS CR},
title = {Text document classification based on mixture models},
url = {http://eudml.org/doc/33701},
volume = {40},
year = {2004},
}

TY - JOUR
AU - Novovičová, Jana
AU - Malík, Antonín
TI - Text document classification based on mixture models
JO - Kybernetika
PY - 2004
PB - Institute of Information Theory and Automation AS CR
VL - 40
IS - 3
SP - [293]
EP - 304
AB - Finite mixture modelling of class-conditional distributions is a standard method in statistical pattern recognition. Using the bag-of-words vector representation of documents, this paper explores the mixture of multinomial distributions as a model of the class-conditional distribution in multiclass text document classification. The proposed model was compared experimentally with the standard Bernoulli and multinomial models, as well as with a model based on a mixture of multivariate Bernoulli distributions, on the Reuters-21578 and Newsgroups data sets. Preliminary experimental results indicate the effectiveness of the proposed model for text classification.
LA - eng
KW - text classification; multinomial mixture model
UR - http://eudml.org/doc/33701
ER -
