Mathematical Document Classification via Symbol Frequency Analysis

Watt, Stephen M.

  • Towards Digital Mathematics Library. Birmingham, United Kingdom, July 27th, 2008. Publisher: Masaryk University (Brno), pages 29-40

Abstract

Earlier work has examined the frequency of symbol and expression use in mathematical documents for various purposes, including mathematical handwriting recognition and forming the most natural output from computer algebra systems. This work has found, unsurprisingly, that the particulars of symbol and expression use vary from area to area and, in particular, between different top-level subjects of the 2000 Mathematical Subject Classification. If the area of mathematics is known in advance, then area-specific information can be used for the recognition or output problem. What is more interesting is that although the specifics of which symbols are ranked as most frequent vary from area to area, the shape of the relative frequency curve remains the same. The present work examines the inverse problem: Given the relative frequencies of symbols in a document, is it possible to classify the document and determine the most likely area of mathematics of the work? We examine the symbol frequency “fingerprints” for the different areas of the Mathematical Subject Classification.

How to cite

Watt, Stephen M. "Mathematical Document Classification via Symbol Frequency Analysis." Towards Digital Mathematics Library. Birmingham, United Kingdom, July 27th, 2008. Brno: Masaryk University, 2008. 29-40. <http://eudml.org/doc/220164>.

@inProceedings{Watt2008,
abstract = {Earlier work has examined the frequency of symbol and expression use in mathematical documents for various purposes, including mathematical handwriting recognition and forming the most natural output from computer algebra systems. This work has found, unsurprisingly, that the particulars of symbol and expression use vary from area to area and, in particular, between different top-level subjects of the 2000 Mathematical Subject Classification. If the area of mathematics is known in advance, then area-specific information can be used for the recognition or output problem. What is more interesting is that although the specifics of which symbols are ranked as most frequent vary from area to area, the shape of the relative frequency curve remains the same. The present work examines the inverse problem: Given the relative frequencies of symbols in a document, is it possible to classify the document and determine the most likely area of mathematics of the work? We examine the symbol frequency “fingerprints” for the different areas of the Mathematical Subject Classification.},
author = {Watt, Stephen M.},
booktitle = {Towards Digital Mathematics Library. Birmingham, United Kingdom, July 27th, 2008},
keywords = {mathematical document classification; Mathematical Subject Classification},
location = {Brno},
pages = {29-40},
publisher = {Masaryk University},
title = {Mathematical Document Classification via Symbol Frequency Analysis},
url = {http://eudml.org/doc/220164},
year = {2008},
}

TY - CPAPER
AU - Watt, Stephen M.
TI - Mathematical Document Classification via Symbol Frequency Analysis
T2 - Towards Digital Mathematics Library. Birmingham, United Kingdom, July 27th, 2008
PY - 2008
CY - Brno
PB - Masaryk University
SP - 29
EP - 40
AB - Earlier work has examined the frequency of symbol and expression use in mathematical documents for various purposes, including mathematical handwriting recognition and forming the most natural output from computer algebra systems. This work has found, unsurprisingly, that the particulars of symbol and expression use vary from area to area and, in particular, between different top-level subjects of the 2000 Mathematical Subject Classification. If the area of mathematics is known in advance, then area-specific information can be used for the recognition or output problem. What is more interesting is that although the specifics of which symbols are ranked as most frequent vary from area to area, the shape of the relative frequency curve remains the same. The present work examines the inverse problem: Given the relative frequencies of symbols in a document, is it possible to classify the document and determine the most likely area of mathematics of the work? We examine the symbol frequency “fingerprints” for the different areas of the Mathematical Subject Classification.
KW - mathematical document classification; Mathematical Subject Classification
UR - http://eudml.org/doc/220164
ER -

References

  1. arXiv e-Print archive, http://arxiv.org.
  2. 2000 Mathematics Subject Classification, American Mathematical Society, http://www.ams.org/msc.
  3. Garain, U., Chaudhuri, B. B., A corpus for OCR research on mathematical expressions, International Journal on Document Analysis and Recognition, Vol. 7, Issue 4, pp. 241–259 (September 2005).
  4. Uchida, S., Nomura, A., Suzuki, M., Quantitative analysis of mathematical documents, International Journal on Document Analysis and Recognition, Vol. 7, Issue 4, pp. 211–218 (September 2005).
  5. So, C. M., Watt, S. M., Determining Empirical Properties of Mathematical Expression Use, Proc. Fourth International Conference on Mathematical Knowledge Management (MKM 2005), July 15–17, 2005, Bremen, Germany, Springer Verlag LNCS 3863, pp. 361–375.
  6. So, C. M., An Analysis of Mathematical Expressions Used in Practice, Masters Thesis, University of Western Ontario, 2005.
  7. Watt, S. M., Exploiting Implicit Mathematical Semantics in Conversion between TeX and MathML, Proc. Internet Accessible Mathematical Communication, http://www.symbolicnet.org/conferences/iamc02, July 7, 2002, Lille, France.
  8. Watt, S. M., An Empirical Measure on the Set of Symbols Occurring in Engineering Mathematics Texts, Proc. 8th IAPR International Workshop on Document Analysis Systems (DAS 2008), Sept 17–19, 2008, Nara, Japan (IEEE, to appear).
  9. Kreyszig, E., Advanced Engineering Mathematics, 8th ed., Wiley & Sons, 1999. MR1665766
  10. Kreyszig, E., Advanced Engineering Mathematics, 9th ed., Wiley & Sons, 2006.
  11. Greenberg, M., Advanced Engineering Mathematics, 2nd ed., Prentice Hall, 1998.
  12. O’Neil, P., Advanced Engineering Mathematics, 5th ed., Thomson-Nelson, 2003.
  13. Suzuki, M., Tamari, F., Fukuda, R., Uchida, S., Kanahori, T., Infty: an integrated OCR system for mathematical documents, Proceedings of ACM Symposium on Document Engineering 2003, Grenoble, 2003, pp. 95–104.
  14. Smirnova, E., Watt, S. M., Context-Sensitive Mathematical Character Recognition, August 19–21, 2008, Montreal, Canada (IEEE, to appear).
  15. Zipf, G. K., Human Behavior and the Principle of Least-Effort, Addison-Wesley, 1949.
