Towards Reverse Engineering of PDF Documents
Baker, Josef B.; Sexton, Alan P.; Sorge, Volker
Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011. Publisher: Masaryk University Press (Brno, Czech Republic), pages 65-75
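Abstract
We present a progress report on our ongoing project of reverse engineering scientific PDF documents. The aim is to obtain mathematical markup that can be used as source for regenerating a document that resembles the original as closely as possible. This source can then be a basis for further document processing. Our current tool uses specialised PDF extraction together with image analysis to produce near-perfect input for parsing mathematical formulae. Applying a linear grammar and specific drivers for each output format to this input, we can produce an accurate reproduction of formulae when presented with their coordinates. In this paper we will show how this information can be exploited to discover the locations of both inline and display formulae, and also to perform rudimentary layout analysis of the whole document, identifying structures such as headings and paragraphs.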
How to cite
Baker, Josef B., Sexton, Alan P., and Sorge, Volker. "Towards Reverse Engineering of PDF Documents." Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011. Brno, Czech Republic: Masaryk University Press, 2011. 65-75. <http://eudml.org/doc/221395>.
@inProceedings{Baker2011,
abstract = {We present a progress report on our ongoing project of reverse engineering scientific PDF documents. The aim is to obtain mathematical markup that can be used as source for regenerating a document that resembles the original as closely as possible. This source can then be a basis for further document processing. Our current tool uses specialised PDF extraction together with image analysis to produce near-perfect input for parsing mathematical formulae. Applying a linear grammar and specific drivers for each output format to this input, we can produce an accurate reproduction of formulae when presented with their coordinates. In this paper we will show how this information can be exploited to discover the locations of both inline and display formulae, and also to perform rudimentary layout analysis of the whole document, identifying structures such as headings and paragraphs.},
author = {Baker, Josef B. and Sexton, Alan P. and Sorge, Volker},
booktitle = {Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011},
location = {Brno, Czech Republic},
pages = {65--75},
publisher = {Masaryk University Press},
title = {Towards Reverse Engineering of PDF Documents},
url = {http://eudml.org/doc/221395},
year = {2011},
}
TY - CLSWK
AU - Baker, Josef B.
AU - Sexton, Alan P.
AU - Sorge, Volker
TI - Towards Reverse Engineering of PDF Documents
T2 - Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011
PY - 2011
CY - Brno, Czech Republic
PB - Masaryk University Press
SP - 65
EP - 75
AB - We present a progress report on our ongoing project of reverse engineering scientific PDF documents. The aim is to obtain mathematical markup that can be used as source for regenerating a document that resembles the original as closely as possible. This source can then be a basis for further document processing. Our current tool uses specialised PDF extraction together with image analysis to produce near-perfect input for parsing mathematical formulae. Applying a linear grammar and specific drivers for each output format to this input, we can produce an accurate reproduction of formulae when presented with their coordinates. In this paper we will show how this information can be exploited to discover the locations of both inline and display formulae, and also to perform rudimentary layout analysis of the whole document, identifying structures such as headings and paragraphs.
UR - http://eudml.org/doc/221395
ER -
References
- Anderson, R.H., Syntax-Directed Recognition of Hand-Printed Two-dimensional Mathematics, Ph.D. thesis, Harvard University, Cambridge, MA (1968). Zbl 0207.17806
- Baker, J., Sexton, A., Sorge, V., A linear grammar approach to mathematical formula recognition from PDF, In: Proceedings of Intelligent Computer Mathematics (2009).
- Baker, J., Sexton, A.P., Sorge, V., Suzuki, M., Comparing approaches to mathematical document analysis, In: 11th International Conference on Document Analysis and Recognition (to appear) (2011).
- Baker, J., Sexton, A., Sorge, V., Faithful mathematical formula recognition from PDF documents, In: 9th IAPR International Workshop on Document Analysis Systems, Extended Abstracts. pp. 485–492. ACM Press, Boston, USA (2010).
- Garain, U., Identification of mathematical expressions in document images, In: Document Analysis and Recognition, International Conference on. pp. 1340–1344. IEEE Computer Society, Los Alamitos, CA, USA (2009).
- Mittelbach, F., Goossens, M., The LaTeX Companion, Pearson Education, 2nd edn. (2005), TeX spacing table, page 525.
- Sternberg, S., Theory of functions of a real variable (2005), http://www.math.harvard.edu/~shlomo/docs/Real_Variables.pdf
- Suzuki, M., Uchida, S., Nomura, A., A ground-truthed mathematical character and symbol image database, In: Proc. of ICDAR. pp. 675–679. IEEE Computer Society (2005).
- Suzuki, M., Infty (2011), http://www.inftyproject.org