Towards Reverse Engineering of PDF Documents

Baker, Josef B.; Sexton, Alan P.; Sorge, Volker

Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011. Publisher: Masaryk University Press (Brno, Czech Republic), pages 65-75

Abstract

We present a progress report on our ongoing project of reverse engineering scientific PDF documents. The aim is to obtain mathematical markup that can be used as source for regenerating a document that resembles the original as closely as possible. This source can then serve as a basis for further document processing. Our current tool combines specialised PDF extraction with image analysis to produce near-perfect input for parsing mathematical formulae. By applying a linear grammar and a specific driver for each output format to this input, we can produce an accurate reproduction of formulae when presented with their coordinates. In this paper we show how this information can be exploited to discover the locations of both inline and display formulae, and also to perform rudimentary layout analysis of the whole document, identifying structures such as headings and paragraphs.

How to cite

MLA
BibTeX
RIS

Baker, Josef B., Sexton, Alan P., and Sorge, Volker. "Towards Reverse Engineering of PDF Documents." Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011. Brno, Czech Republic: Masaryk University Press, 2011. 65-75. <http://eudml.org/doc/221395>.

@inProceedings{Baker2011,
abstract = {We present a progress report on our ongoing project of reverse engineering scientific PDF documents. The aim is to obtain mathematical markup that can be used as source for regenerating a document that resembles the original as closely as possible. This source can then serve as a basis for further document processing. Our current tool combines specialised PDF extraction with image analysis to produce near-perfect input for parsing mathematical formulae. By applying a linear grammar and a specific driver for each output format to this input, we can produce an accurate reproduction of formulae when presented with their coordinates. In this paper we show how this information can be exploited to discover the locations of both inline and display formulae, and also to perform rudimentary layout analysis of the whole document, identifying structures such as headings and paragraphs.},
author = {Baker, Josef B. and Sexton, Alan P. and Sorge, Volker},
booktitle = {Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011},
location = {Brno, Czech Republic},
pages = {65--75},
publisher = {Masaryk University Press},
title = {Towards Reverse Engineering of PDF Documents},
url = {http://eudml.org/doc/221395},
year = {2011},
}

TY - CLSWK
AU - Baker, Josef B.
AU - Sexton, Alan P.
AU - Sorge, Volker
TI - Towards Reverse Engineering of PDF Documents
T2 - Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011
PY - 2011
CY - Brno, Czech Republic
PB - Masaryk University Press
SP - 65
EP - 75
AB - We present a progress report on our ongoing project of reverse engineering scientific PDF documents. The aim is to obtain mathematical markup that can be used as source for regenerating a document that resembles the original as closely as possible. This source can then serve as a basis for further document processing. Our current tool combines specialised PDF extraction with image analysis to produce near-perfect input for parsing mathematical formulae. By applying a linear grammar and a specific driver for each output format to this input, we can produce an accurate reproduction of formulae when presented with their coordinates. In this paper we show how this information can be exploited to discover the locations of both inline and display formulae, and also to perform rudimentary layout analysis of the whole document, identifying structures such as headings and paragraphs.
UR - http://eudml.org/doc/221395
ER -

References

Anderson, R.H., Syntax-Directed Recognition of Hand-Printed Two-dimensional Mathematics, Ph.D. thesis, Harvard University, Cambridge, MA (1968). Zbl 0207.17806
Baker, J., Sexton, A., Sorge, V., A linear grammar approach to mathematical formula recognition from PDF, In: Proceedings of Intelligent Computer Mathematics (2009).
Baker, J., Sexton, A.P., Sorge, V., Suzuki, M., Comparing approaches to mathematical document analysis, In: 11th International Conference on Document Analysis and Recognition (to appear) (2011).
Baker, J., Sexton, A., Sorge, V., Faithful mathematical formula recognition from PDF documents, In: 9th IAPR International Workshop on Document Analysis Systems, Extended Abstracts. pp. 485–492. ACM Press, Boston, USA (2010).
Garain, U., Identification of mathematical expressions in document images, In: Document Analysis and Recognition, International Conference on. pp. 1340–1344. IEEE Computer Society, Los Alamitos, CA, USA (2009).
Mittelbach, F., Goossens, M., The LaTeX Companion, Pearson Education, 2nd edn. (2005), TeX spacing table, page 525.
Sternberg, S., Theory of functions of a real variable, (2005), http://www.math.harvard.edu/~shlomo/docs/Real_Variables.pdf
Suzuki, M., Uchida, S., Nomura, A., A ground-truthed mathematical character and symbol image database, In: Proc. of ICDAR. pp. 675–679. IEEE Computer Society (2005).
Suzuki, M., Infty, (2011), http://www.inftyproject.org
