The Unreasonable Effectiveness of Pattern Generation

Petr Sojka; Ondřej Sojka

Zpravodaj Československého sdružení uživatelů TeXu (2019)

  • Volume: 029, Issue: 1-4, pages 73-86
  • ISSN: 1211-6661

Abstract

Languages are constantly evolving, and so are their hyphenation rules and needs. The effectiveness and utility of TeX's hyphenation have been proven by its usage in almost all typesetting systems in use today. The current Czech hyphenation patterns were generated in 1995, and no hyphenated word database was freely available. We have developed a new Czech word database and have used the patgen program to generate new effective Czech hyphenation patterns efficiently and evaluated their generalization qualities. We have achieved full coverage on the training dataset of 3,000,000 words, and developed a validation procedure of new patterns for Czech based on the testing database of 105,000 words approved by the Czech Academy of Science linguists. Our pattern generation case study exemplifies a practical solution to the widespread dictionary problem. The study has proven the versatility, effectiveness, and extensibility of Liang's approach to hyphenation developed for TeX. The unreasonable effectiveness of the pattern technology has led to applications that are and will be used, even more widely now, nearly 40 years after its inception.
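
The pattern technique mentioned in the abstract can be illustrated with a minimal sketch of how Liang-style patterns are applied to a word. This is not the authors' code and does not use the new Czech patterns: the nine patterns below are a small illustrative English set in the style of the example in Liang's thesis, and the names `parse_pattern`, `hyphenate`, and `PATTERNS` are hypothetical. The defaults `left_min=2` and `right_min=3` are assumed here to mirror TeX's usual `\lefthyphenmin` and `\righthyphenmin` settings for English.

```python
import re

# Illustrative English patterns only (an assumption for this sketch);
# real pattern sets produced by patgen contain thousands of entries.
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into its letters and inter-letter levels."""
    letters = re.sub(r"\d", "", pattern)
    levels = [0] * (len(letters) + 1)  # levels[k] is the gap before letters[k]
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            levels[pos] = int(ch)
        else:
            pos += 1
    return letters, levels


def hyphenate(word, patterns, left_min=2, right_min=3):
    """Return the pieces of `word` split at the positions the patterns allow."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."      # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)       # scores[i] is the gap before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            levels = table.get(padded[start:end])
            if levels is not None:
                # Competing patterns: the highest level at each gap wins.
                for k, level in enumerate(levels):
                    scores[start + k] = max(scores[start + k], level)
    pieces, last = [], 0
    for j in range(left_min, len(word) - right_min + 1):
        if scores[j + 1] % 2 == 1:         # odd level permits a break before word[j]
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return pieces


print("-".join(hyphenate("hyphenation", PATTERNS)))  # hy-phen-ation
```

Odd levels allow a break and even levels forbid one, so a more specific pattern with a higher level can override a more general one. The sketch scans every substring for clarity; production implementations typically store the patterns in a trie.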

How to cite

Sojka, Petr, and Sojka, Ondřej. "The Unreasonable Effectiveness of Pattern Generation." Zpravodaj Československého sdružení uživatelů TeXu 029.1-4 (2019): 73-86. <http://eudml.org/doc/298743>.

@article{Sojka2019,
abstract = {Languages are constantly evolving, and so are their hyphenation rules and needs. The effectiveness and utility of TeX's hyphenation have been proven by its usage in almost all typesetting systems in use today. The current Czech hyphenation patterns were generated in 1995, and no hyphenated word database was freely available. We have developed a new Czech word database and have used the patgen program to generate new effective Czech hyphenation patterns efficiently and evaluated their generalization qualities. We have achieved full coverage on the training dataset of 3,000,000 words, and developed a validation procedure of new patterns for Czech based on the testing database of 105,000 words approved by the Czech Academy of Science linguists. Our pattern generation case study exemplifies a practical solution to the widespread dictionary problem. The study has proven the versatility, effectiveness, and extensibility of Liang's approach to hyphenation developed for TeX. The unreasonable effectiveness of the pattern technology has led to applications that are and will be used, even more widely now, nearly 40 years after its inception.},
author = {Sojka, Petr and Sojka, Ondřej},
journal = {Zpravodaj Československého sdružení uživatelů TeXu},
keywords = {hyphenation patterns; patgen; unreasonable effectiveness; Czech},
language = {eng},
number = {1-4},
pages = {73-86},
publisher = {Československé sdružení uživatelů TeXu},
title = {The Unreasonable Effectiveness of Pattern Generation},
url = {http://eudml.org/doc/298743},
volume = {029},
year = {2019},
}

TY - JOUR
AU - Sojka, Petr
AU - Sojka, Ondřej
TI - The Unreasonable Effectiveness of Pattern Generation
JO - Zpravodaj Československého sdružení uživatelů TeXu
PY - 2019
PB - Československé sdružení uživatelů TeXu
VL - 029
IS - 1-4
SP - 73
EP - 86
AB - Languages are constantly evolving, and so are their hyphenation rules and needs. The effectiveness and utility of TeX's hyphenation have been proven by its usage in almost all typesetting systems in use today. The current Czech hyphenation patterns were generated in 1995, and no hyphenated word database was freely available. We have developed a new Czech word database and have used the patgen program to generate new effective Czech hyphenation patterns efficiently and evaluated their generalization qualities. We have achieved full coverage on the training dataset of 3,000,000 words, and developed a validation procedure of new patterns for Czech based on the testing database of 105,000 words approved by the Czech Academy of Science linguists. Our pattern generation case study exemplifies a practical solution to the widespread dictionary problem. The study has proven the versatility, effectiveness, and extensibility of Liang's approach to hyphenation developed for TeX. The unreasonable effectiveness of the pattern technology has led to applications that are and will be used, even more widely now, nearly 40 years after its inception.
LA - eng
KW - hyphenation patterns; patgen; unreasonable effectiveness; Czech
UR - http://eudml.org/doc/298743
ER -

References

  1. Pereira, Fernando; Norvig, Peter; Halevy, Alon. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems. 2009, vol. 24, no. 2, pp. 8–12. ISSN 1541-1672. DOI: 10.1109/MIS.2009.36
  2. Wigner, Eugene P. The Unreasonable Effectiveness of Mathematics in the Natural Sciences. Richard Courant Lecture in Mathematical Sciences delivered at New York University, May 11, 1959. Communications on Pure and Applied Mathematics. 1960, vol. 13, no. 1, pp. 1–14. DOI: 10.1002/cpa.3160130102. MR0824292
  3. Hamming, Richard W. The Unreasonable Effectiveness of Mathematics. The American Mathematical Monthly. 1980, vol. 87, no. 2, pp. 81–90. ISSN 0002-9890, 1930-0972. Also available from: https://www.jstor.org/stable/2321982. DOI: 10.1080/00029890.1980.11994966. MR0559142
  4. Liang, Franklin M. Word Hy-phen-a-tion by Com-put-er. 1983. Dissertation. Stanford University. Also available from: https://tug.org/docs/liang/.
  5. Sojka, Petr. Competing Patterns in Language Engineering and Computer Typesetting. 2005. Dissertation. Faculty of Informatics.
  6. Reutenauer, Arthur; Miklavec, Mojca. TeX hyphenation patterns [online]. TUG [cited 2019-11-14]. Available from: https://tug.org/tex-hyphen/.
  7. Lemberg, Werner. A database of German words with hyphenation information. Also available from: https://repo.or.cz/wortliste.git.
  8. Sojka, Petr; Ševeček, Pavel. Hyphenation in TeX - Quo Vadis? TUGboat. 1995, vol. 16, no. 3, pp. 280–289.
  9. Internetová jazyková příručka (Internet Language Reference Book) [online]. Institute of the Czech Language, Czech Academy of Sciences [cited 2019-07-18]. Available from: http://prirucka.ujc.cas.cz/?id=135.
  10. Sojka, Petr. Hyphenation on Demand. TUGboat. 1999, vol. 20, no. 3, pp. 241–247. Also available from: https://tug.org/TUGboat/tb20-3/tb64sojka.pdf.
  11. Sojka, Ondřej; Sojka, Petr. cshyphen repository. Also available from: https://github.com/tensojka/cshyphen.
  12. Sojka, Petr. Notes on Compound Word Hyphenation in TeX. TUGboat. 1995, vol. 16, no. 3, pp. 290–297.
  13. Jakubíček, Miloš; Kilgarriff, Adam; Kovář, Vojtěch; Rychlý, Pavel; Suchomel, Vít. The TenTen Corpus Family. In: Proc. of 7th International Corpus Linguistics Conference (CL). Lancaster, 2013, pp. 125–127.
  14. Suchomel, Vít; Pomikálek, Jan. Efficient Web Crawling for Large Text Corpora. In: Kilgarriff, Adam; Sharoff, Serge (eds.). Proc. of the seventh Web as Corpus Workshop (WAC). Lyon, 2012, pp. 39–43. Also available from: https://sigwac.org.uk/raw-attachment/wiki/WAC7/wac7-proc.pdf.
  15. Šmerk, Pavel. Fast Morphological Analysis of Czech. In: Sojka, Petr; Horák, Aleš (eds.). Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2009. Karlova Studánka, Czech Republic: Masaryk University, 2009, pp. 13–16. ISBN 978-80-210-5048-8. Also available from: http://nlp.fi.muni.cz/raslan/2009/.
  16. Scannell, Kevin Patrick. Hyphenation Patterns for Minority Languages. TUGboat. 2003, vol. 24, no. 2, pp. 236–239.
  17. Shao, Yan; Hardmeier, Christian; Nivre, Joakim. Universal Word Segmentation: Implementation and Interpretation. Transactions of the Association for Computational Linguistics. 2018, vol. 6, pp. 421–435. DOI: 10.1162/tacl_a_00033
  18. Sennrich, Rico; Haddow, Barry; Birch, Alexandra. Neural Machine Translation of Rare Words with Subword Units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics, 2016, pp. 1715–1725. DOI: 10.18653/v1/P16-1162
  19. Zeldes, Amir. A Characterwise Windowed Approach to Hebrew Morphological Segmentation. In: Proc. of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology. Brussels, Belgium: Association for Computational Linguistics, 2018, pp. 101–110. DOI: 10.18653/v1/W18-5811
  20. Lample, Guillaume; Sablayrolles, Alexandre; Ranzato, Marc'Aurelio; Denoyer, Ludovic; Jégou, Hervé. Large Memory Layers with Product Keys [online]. 2019 [cited 2019-07-18]. Available from arXiv: 1907.05242 [cs.CL].
