On an application of the work of D.E. Knuth to semigroups.
This paper describes several innovative PDF document enhancements and tools that can be used when building a digital library. The main result presented in this paper is the PDF re-compression tool, developed using the jbig2enc encoder called pdfJbIm. This re-compression tool enables the size of the original bitonal PDFs to be, on average, downsized by one third. Some modifications to the jbig2enc encoder that increase the compression ratio even further are also described here. Together with another...
We describe here how Tralics can be used to convert LaTeX documents into XML or HTML. It uses an ad-hoc DTD (a simplification of the TEI), but the translation of the math formulas is conforming to the presentation MathML 2.0 recommendations. We explain how to run and parametrize the software. We give an overview of the various MathML constructs, and how they are rendered by different browsers.
The project DML-CZ: The Czech Digital Mathematics Library has been implemented since 2005 and in 2010 switched over to routine operation. This report describes progress, growth and usage of the DML-CZ, the experience from cooperation with content providers in the designed editorial workflow, some newly implemented features, adjustments of the workflow following from both the ongoing practical experience and the requirements of the advancing EuDML project, the general public acceptance and attendance...
We present three corpus-based studies on symbol declaration in mathematical writing. We focus on simple object denoting symbols which may be part of larger expressions. We look into whether the symbols are explicitly introduced into the discourse and whether the information on once interpreted symbols can be used to interpret structurally related symbols. Our goal is to support fine-grained semantic interpretation of simple and complex mathematical expressions. The results of our analysis empirically...
After an extensive study of the metadata policy of each of its content partners, the EuDML project evaluated many different strategies and existing schemas that could store every detail faithfully, and yet reserve room for the enhancements foreseen in the project’s work plan. The framework provided by the so-called NLM Journal Archiving and Interchange Tag Suite was selected as best readily available approximation of our needs. Some modifications of it have been endorsed by the project, defining...
The exchange of preprints and journals plays an important role to communicate new research ideas and results in many academic fields. Distribution of preprints and journal articles by electronic file via the Internet has become a primary method in addition to paper publication. Electronic preprints and articles in the paperless era should be certified in terms of existence proof and tamper resistance because they are easily modified by their site administrator. We developed a secure preprint and...
The workshop’s objectives were to formulate the strategy and goals of a global mathematical digital library and to summarize the current successes and failures of ongoing technologies and related projects. There is already some experience with building smaller DMLs and/or building big thematical scientific digital libraries. Why there are already big fulltext digital library in some domains like PubMed in biomedical one, but none in others? We try to pose such and other questions, and try to find...
The DML workshop’s objectives were to formulate the strategy and goals of a global mathematical digital library and to summarize the current successes and failures of ongoing technologies and related projects. There is already experience with building regional DMLs or building big thematic scientific digital libraries. EuDML project reached it halflife period. While there are already big fulltext digital libraries in some domains like PubMed Central in the biomedical domain, Inspire in high-energy...
In this paper we propose a flexible, modular framework for author name disambiguation. Our solution consists of the core which orchestrates the disambiguation process, and replaceable modules performing concrete tasks. The approach is suitable for distributed computing, in particular it maps well to the MapReduce framework. We describe each component in detail and discuss possible alternatives. Finally, we propose procedures for calibration and evaluation of the described system.
We present a progress report on our ongoing project of reverse engineering scientific PDF documents. The aim is to obtain mathematical markup that can be used as source for regenerating a document that resembles the original as closely as possible. This source can then be a basis for further document processing. Our current tool uses specialised PDF extraction together with image analysis to produce near perfect input for parsing mathematical formula. Applying a linear grammar and specific drivers...
We present a method for determining the context-dependent denotation of simple object-denoting mathematical expressions in mathematical documents. Our approach relies on estimating the similarity between the linguistic context within which the given expression occurs and a set of terms from a flat domain taxonomy of mathematical concepts; one of 7 head concepts dominating a set of terms with highest similarity score to the symbol’s context is assigned as the symbol’s interpretation. The taxonomy...
We demonstrate searching of mathematical expressions in technical digital libraries on a MREC collection of 439,423 real scientific documents with more than 158 million mathematical formulae. Our solution—the WebMIaS system—allows the retrieval of mathematical expressions written in TeX or MathML. TeX queries are converted on-the-fly into tree representations of Presentation MathML, which is used for indexing. WebMIaS allows complex queries composed of plain text and mathematical formulae, using...
In this work-in-progress report we propose a workflow for metadata extraction from articles in a digital form. We decompose the problem into clearly defined sub-tasks and outline possible implementations of the sub-tasks. We report the progress of implementation and tests, and state future work.