Mathematical Formulae Recognition and Logical Structure Analysis of Mathematical Papers

Suzuki, Masakazu

  • Towards a Digital Mathematics Library. Paris, France, July 7-8th, 2010. Publisher: Masaryk University Press (Brno, Czech Republic), page 7

Abstract

In most cases, current on-line journals in mathematics are supplied as PDF files with print images of the papers in front and OCR’ed hidden text behind, to provide keyword search. The embedded hidden text usually does not include good information about the mathematical formulae in the papers. For the future development of the DML, it is desirable to include more structured information about the content of mathematical papers in the digitised journals, e.g. tag information indicating the logical structure of papers, such as section headings, definitions, theorems, lemmas, etc., together with the structure of the mathematical formulae. In the talk, I will present the current state of our technology for extracting such information from scanned images of retro-digitised mathematical papers. Mechanically prepared new journals in PDF form are also a target of our research, since it is not easy to obtain a uniform structural description of mathematical formulae from, for example, the original LaTeX source, where styles and macro commands vary from author to author. Although many methods for recognizing mathematical formulae have been presented in the literature, very few applications have appeared that perform this task in a practical sense. One of the major problems in the development of math OCR is avoiding the fatal effects caused by mis-recognition and mis-segmentation of characters and symbols. In the talk, I will first explain the method we took to overcome this difficulty. A demonstration of our software InftyReader, which recognizes mathematical documents, will also be given in the lecture. Secondly, as a better approach to recognizing a large number of pages, as in the case of the DML, I will present our adaptive method for improving the recognition rates of characters/symbols, mathematical formula structures, and the logical structure of articles.

How to cite

Suzuki, Masakazu. "Mathematical Formulae Recognition and Logical Structure Analysis of Mathematical Papers." Towards a Digital Mathematics Library. Paris, France, July 7-8th, 2010. Brno, Czech Republic: Masaryk University Press, 2010. 7-7. <http://eudml.org/doc/220749>.

@inProceedings{Suzuki2010,
author = {Suzuki, Masakazu},
booktitle = {Towards a Digital Mathematics Library. Paris, France, July 7-8th, 2010},
keywords = {InftyReader},
location = {Brno, Czech Republic},
pages = {7-7},
publisher = {Masaryk University Press},
title = {Mathematical Formulae Recognition and Logical Structure Analysis of Mathematical Papers},
url = {http://eudml.org/doc/220749},
year = {2010},
}

TY - CLSWK
AU - Suzuki, Masakazu
TI - Mathematical Formulae Recognition and Logical Structure Analysis of Mathematical Papers
T2 - Towards a Digital Mathematics Library. Paris, France, July 7-8th, 2010
PY - 2010
CY - Brno, Czech Republic
PB - Masaryk University Press
SP - 7
EP - 7
KW - InftyReader
UR - http://eudml.org/doc/220749
ER -
