Workflow of Metadata Extraction from Retro-Born Digital Documents

Tkaczyk, Dominika; Bolikowski, Łukasz

  • Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011, Publisher: Masaryk University Press (Brno, Czech Republic), pages 39-44

Abstract

In this work-in-progress report we propose a workflow for metadata extraction from articles in a digital form. We decompose the problem into clearly defined sub-tasks and outline possible implementations of the sub-tasks. We report the progress of implementation and tests, and state future work.

How to cite

Tkaczyk, Dominika, and Bolikowski, Łukasz. "Workflow of Metadata Extraction from Retro-Born Digital Documents." Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011. Brno, Czech Republic: Masaryk University Press, 2011. 39-44. <http://eudml.org/doc/221804>.

@inProceedings{Tkaczyk2011,
abstract = {In this work-in-progress report we propose a workflow for metadata extraction from articles in a digital form. We decompose the problem into clearly defined sub-tasks and outline possible implementations of the sub-tasks. We report the progress of implementation and tests, and state future work.},
author = {Tkaczyk, Dominika and Bolikowski, Łukasz},
booktitle = {Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011},
keywords = {metadata extraction; page segmentation; zone classification; Hidden Markov Model},
location = {Brno, Czech Republic},
pages = {39-44},
publisher = {Masaryk University Press},
title = {Workflow of Metadata Extraction from Retro-Born Digital Documents},
url = {http://eudml.org/doc/221804},
year = {2011},
}

TY - CLSWK
AU - Tkaczyk, Dominika
AU - Bolikowski, Łukasz
TI - Workflow of Metadata Extraction from Retro-Born Digital Documents
T2 - Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011
PY - 2011
CY - Brno, Czech Republic
PB - Masaryk University Press
SP - 39
EP - 44
AB - In this work-in-progress report we propose a workflow for metadata extraction from articles in a digital form. We decompose the problem into clearly defined sub-tasks and outline possible implementations of the sub-tasks. We report the progress of implementation and tests, and state future work.
KW - metadata extraction; page segmentation; zone classification; Hidden Markov Model
UR - http://eudml.org/doc/221804
ER -

