Extracting Precise Data on the Mathematical Content of PDF Documents

Baker, Josef B.; Sexton, Alan P.; Sorge, Volker

Towards Digital Mathematics Library. Birmingham, United Kingdom, July 27th, 2008. Publisher: Masaryk University (Brno), pages 75-79

Abstract

As more and more scientific documents become available in PDF format, their automatic analysis becomes increasingly important. We present a procedure that extracts mathematical symbols from PDF documents by examining both the original PDF file and a rasterized version. This provides more precise information than is available either directly from the PDF file or by traditional character recognition techniques. The data can then be used to improve mathematical parsing methods that transform the mathematics into richer formats such as MathML.
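The abstract does not spell out the procedure, so the following is only a minimal sketch of the general idea of combining the two sources: symbol identities read from the PDF file are paired with bounding boxes measured on the rasterized page. Everything here is an assumption made for illustration, not the authors' method: the record types `PdfGlyph` and `Box`, the function names, the greedy nearest-box matching, and the `max_dist` threshold are all hypothetical, and both inputs are assumed to be already expressed in the same pixel coordinate system.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PdfGlyph:
    """A symbol as reported by the PDF file (hypothetical record)."""
    name: str   # e.g. "x" or "summation"
    x: float    # approximate position, already converted to raster pixels
    y: float


@dataclass(frozen=True)
class Box:
    """Bounding box of one connected component in the rasterized page."""
    left: int
    top: int
    right: int
    bottom: int


def distance_to_box(glyph: PdfGlyph, box: Box) -> float:
    """Euclidean distance from the glyph position to the box (0 if inside)."""
    dx = max(box.left - glyph.x, 0, glyph.x - box.right)
    dy = max(box.top - glyph.y, 0, glyph.y - box.bottom)
    return (dx * dx + dy * dy) ** 0.5


def match_glyphs(
    glyphs: List[PdfGlyph], boxes: List[Box], max_dist: float = 5.0
) -> List[Tuple[PdfGlyph, Optional[Box]]]:
    """Greedily pair each PDF glyph with the nearest unused raster box.

    A glyph with no box within max_dist pixels is paired with None.
    """
    remaining = list(boxes)
    result: List[Tuple[PdfGlyph, Optional[Box]]] = []
    for glyph in glyphs:
        best = min(
            remaining,
            key=lambda b: distance_to_box(glyph, b),
            default=None,
        )
        if best is not None and distance_to_box(glyph, best) <= max_dist:
            remaining.remove(best)
            result.append((glyph, best))
        else:
            result.append((glyph, None))
    return result


if __name__ == "__main__":
    glyphs = [PdfGlyph("x", 10, 20), PdfGlyph("summation", 40, 25)]
    boxes = [Box(38, 5, 60, 30), Box(9, 12, 16, 20)]
    for glyph, box in match_glyphs(glyphs, boxes):
        print(glyph.name, box)
```

In this sketch the PDF side supplies the name of each symbol while the raster side supplies its exact extent. A realistic version would also have to group symbols drawn as several components (such as "i" or "=") before matching.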

How to cite

Baker, Josef B., Sexton, Alan P., and Sorge, Volker. "Extracting Precise Data on the Mathematical Content of PDF Documents." Towards Digital Mathematics Library. Birmingham, United Kingdom, July 27th, 2008. Brno: Masaryk University, 2008. 75-79. <http://eudml.org/doc/221546>.

@inProceedings{Baker2008,
abstract = {As more and more scientific documents become available in PDF format, their automatic analysis becomes increasingly important. We present a procedure that extracts mathematical symbols from PDF documents by examining both the original PDF file and a rasterized version. This provides more precise information than is available either directly from the PDF file or by traditional character recognition techniques. The data can then be used to improve mathematical parsing methods that transform the mathematics into richer formats such as MathML.},
author = {Baker, Josef B. and Sexton, Alan P. and Sorge, Volker},
booktitle = {Towards Digital Mathematics Library. Birmingham, United Kingdom, July 27th, 2008},
keywords = {document analysis},
location = {Brno},
pages = {75-79},
publisher = {Masaryk University},
title = {Extracting Precise Data on the Mathematical Content of PDF Documents},
url = {http://eudml.org/doc/221546},
year = {2008},
}

TY - CLSWK
AU - Baker, Josef B.
AU - Sexton, Alan P.
AU - Sorge, Volker
TI - Extracting Precise Data on the Mathematical Content of PDF Documents
T2 - Towards Digital Mathematics Library. Birmingham, United Kingdom, July 27th, 2008
PY - 2008
CY - Brno
PB - Masaryk University
SP - 75
EP - 79
AB - As more and more scientific documents become available in PDF format, their automatic analysis becomes increasingly important. We present a procedure that extracts mathematical symbols from PDF documents by examining both the original PDF file and a rasterized version. This provides more precise information than is available either directly from the PDF file or by traditional character recognition techniques. The data can then be used to improve mathematical parsing methods that transform the mathematics into richer formats such as MathML.
KW - document analysis
UR - http://eudml.org/doc/221546
ER -
