Correcting spelling errors by modelling their causes

Sebastian Deorowicz; Marcin Ciura

International Journal of Applied Mathematics and Computer Science (2005)

  • Volume: 15, Issue: 2, page 275-285
  • ISSN: 1641-876X

Abstract

top
This paper accounts for a new technique of correcting isolated words in typed texts. A language-dependent set of string substitutions reflects the surface form of errors that result from vocabulary incompetence, misspellings, or mistypings. Candidate corrections are formed by applying the substitutions to text words absent from the computer lexicon. A minimal acyclic deterministic finite automaton storing the lexicon allows quick rejection of nonsense corrections, while costs associated with the substitutions serve to rank the remaining ones. A comparison of the correction lists generated by several spellcheckers for two corpora of English spelling errors shows that our technique suggests the right words more accurately than the others.

How to cite

top

Deorowicz, Sebastian, and Ciura, Marcin. "Correcting spelling errors by modelling their causes." International Journal of Applied Mathematics and Computer Science 15.2 (2005): 275-285. <http://eudml.org/doc/207742>.

@article{Deorowicz2005,
abstract = {This paper accounts for a new technique of correcting isolated words in typed texts. A language-dependent set of string substitutions reflects the surface form of errors that result from vocabulary incompetence, misspellings, or mistypings. Candidate corrections are formed by applying the substitutions to text words absent from the computer lexicon. A minimal acyclic deterministic finite automaton storing the lexicon allows quick rejection of nonsense corrections, while costs associated with the substitutions serve to rank the remaining ones. A comparison of the correction lists generated by several spellcheckers for two corpora of English spelling errors shows that our technique suggests the right words more accurately than the others.},
author = {Deorowicz, Sebastian, Ciura, Marcin},
journal = {International Journal of Applied Mathematics and Computer Science},
keywords = {spelling errors; finite state automata; spelling correction},
language = {eng},
number = {2},
pages = {275-285},
title = {Correcting spelling errors by modelling their causes},
url = {http://eudml.org/doc/207742},
volume = {15},
year = {2005},
}

TY - JOUR
AU - Deorowicz, Sebastian
AU - Ciura, Marcin
TI - Correcting spelling errors by modelling their causes
JO - International Journal of Applied Mathematics and Computer Science
PY - 2005
VL - 15
IS - 2
SP - 275
EP - 285
AB - This paper accounts for a new technique of correcting isolated words in typed texts. A language-dependent set of string substitutions reflects the surface form of errors that result from vocabulary incompetence, misspellings, or mistypings. Candidate corrections are formed by applying the substitutions to text words absent from the computer lexicon. A minimal acyclic deterministic finite automaton storing the lexicon allows quick rejection of nonsense corrections, while costs associated with the substitutions serve to rank the remaining ones. A comparison of the correction lists generated by several spellcheckers for two corpora of English spelling errors shows that our technique suggests the right words more accurately than the others.
LA - eng
KW - spelling errors; finite state automata; spelling correction
UR - http://eudml.org/doc/207742
ER -

References

top
  1. Atkinson K. (2002): Spell checker test kernel results. - Available at http://aspell.net/test 
  2. Atkinson K. (2003): GNU Aspell. - Available at http://aspell.sourceforge.net/ 
  3. Baeza-Yates R. and Navarro G. (1998): Fast approximate string matching in a dictionary. - Proc. 5th Int. Symp. String Processing and Information Retrieval, SPIRE'98,Santa Crus de la Sierra, Bolivia, pp. 14-22. 
  4. Berkel B. van and Smedt K.D. (1988): Triphone analysis: A combined method for the correction of orthographical and typographical errors. -Proc. 2nd Conf. Applied Natural Language Processing, Austin, pp. 77-83. 
  5. Brill E. and Moore R.C. (2000): An improved error model for noisy channel spelling correction. -Proc. 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, pp. 286-293. 
  6. Carrasco R. and Forcada M. (2002): Incremental construction and maintenance of minimal finite-state automata. - Comput. Linguistics, Vol. 28, No. 2, pp. 207-216. Zbl1232.68080
  7. Church K.W. and Gale W.A. (1991): Probability scoring for spelling correction. -Statist. Comput., Vol. 1, No. 1, pp. 93-103. 
  8. Ciura M.G. and Deorowicz S. (2001): How to squeeze a lexicon. -Soft. Pract. Exper., Vol. 31, No. 11, pp. 1077-1090. Zbl0987.68782
  9. Czech Z.J., Havas G. and Majewski B.H. (1997): Perfect hashing. -Theoret. Comput. Sci., Vol. 182, No. 1-2, pp. 1-143. 
  10. Daciuk J., Mihov S., Watson B.W. and Watson R.E. (2000): Incremental construction of minimal acyclic finite-state automata. -Comput. Linguistics, Vol. 26, No. 1, pp. 3-16. Zbl1232.68081
  11. Damerau F.J. (1964): A technique for computer detection and correction of spelling errors. -Comm. ACM, Vol. 7, No. 3, pp. 171-176. 
  12. Damerau F.J. (1990): Evaluating computer-generated domain-vocabularies. -Inf. Process. Manag., Vol. 26, No. 6, pp. 791-801. 
  13. Damerau F.J. and Mays E. (1989): An examination of undetected typing errors. -Inf. Process. Manag., Vol. 25, No. 6, pp. 659-664. 
  14. Darragh J.J., Cleary J.G. and Witten I.H. (1993): Bonsai: A compact representation of trees. -Softw. Pract. Exper., Vol. 23, No. 3, pp. 277-291. 
  15. Gadd T.N. (1990): PHONIX: The algorithm. -Program: Automat. Library Inf. Syst.,Vol. 24, No. 4, pp. 363-366. 
  16. Grudin J. (1983): Error patterns in skilled and novice transcription typing, In: Cognitive Aspects of Skilled Typewriting, (W.E. Copper, Ed.). -New York: Springer-Verlag, pp. 121-143. 
  17. Hodge V.J. and Austin J. (2003): A comparison of standard spell checking algorithms and a novel binary neural approach. - IEEE Trans. Knowl. Data Eng., Vol. 15, No. 5, pp. 1073-1081. 
  18. Jassem W. (1983): The Phonology of Modern English. -Warsaw: Polish Scientific Publishers. 
  19. Knuth D.E. (1973): The Art of Computer Programming (Vol. 3. Sorting and Searching Algorithms). - Reading, MA: Addison-Wesley. Zbl0302.68010
  20. Kuenning G.H. (2003): International ispell. - Available at http://www.cs.hmc.edu/~geoff/ispell.html. 
  21. Kukich K. (1988): Variations on a back-propagation name recognition net. - Proc. Advanced Technology Conference, U.S. Postal Service, Wash. D.C., USA, pp. 722-735. 
  22. Kukich K. (1992): Techniques for automatically correcting words in text. - ACM Comput. Surveys, Vol. 24, No. 4, pp. 377-439. 
  23. Levenshtein V.I. (1966): Binary codes capable of correcting deletions, insertions and reversals. - Soviet Physics Doklady, Vol. 10, pp. 707-710. 
  24. Maly K. (1976): Compressed tries. - Comm. ACM, Vol. 19, No. 7, pp. 409-415. Zbl0329.68037
  25. Mateescu D. (2003): English phonetics and phonological theory. - University of Bucharest, Romania. Available at http://www.unibuc.ro/eBooks/filologie/mateescu 
  26. Merriam-Webster (2002): A dictionary of prefixes, suffixes, and combining forms from Webster's third new international dictionary, unabridged. - Merriam-Webster. Available at http://www.spellingbee.com/pre_suf_comb.pdf 
  27. Mihov S. and Schulz K.U. (2004): Fast approximate search in large dictionaries. - Comput. Linguistics, Vol. 30, No. 4, pp. 451-477. Zbl1234.68424
  28. Mitton R. (1987): Spelling checkers, spelling correctors, and the misspellings of poor spellers. - Inf. Process. Manag., Vol. 23, No. 5, pp. 495-505. 
  29. Mitton R. (1996): Spellchecking by computer. - J. Simplif. Spell. Soc., Vol. 20. No. 1, pp. 4-11. 
  30. Morrison D.R. (1968): PATRICIA-Practical Algorithm to Retrieve Information Coded in Alphanumeric. - J. ACM, Vol. 15, No. 4, pp. 514-534. 
  31. Odell M.K. and Russell R.C. (1918): U.S. Patent Numbers, 1,261,167 (1918) and 1,435,663 (1922). - U.S. Patent Office, Washington, D.C. 
  32. Oflazer K. (1996): Error-tolerant finite-state recognition with applications to morphological analysis and spelling correction. - Comput. Linguistics, Vol. 22, No. 1, pp. 73-89. 
  33. Peterson J.L. (1986): A note on undetected typing errors. - Comm. ACM, Vol. 29, No. 7, pp. 633-637. 
  34. Philips L. (1990): Hanging on the metaphone. - Comput. Lang. Mag., Vol. 7, No. 12, pp. 38-44. 
  35. Philips L. (2000): The double metaphone search algorithm. - CC++ Users Journal, Vol. 18, No. 6. 
  36. Pollock J.J. and Zamora A. (1983): Collection and characterization of spelling errors in scientific and scholary text. - J. Amer. Soc. Inf. Sci., Vol. 34, No. 1, pp. 51-58. 
  37. Pollock J.J. and Zamora A. (1984): Automatic spelling correction in scientific and scholarly text. - Comm. ACM, Vol. 27, No. 4, pp. 358-368. 
  38. Savary A. (2001): Typographical nearest-neighbor search in a finite-state lexicon and its application to spelling correction. - Lect. Notes Comput. Sci., Vol. 2494, pp. 251-260. Zbl1015.68555
  39. Schulz K.U. and Mihov S. (2002): Fast string correction with Levenshtein-automata. - Int. J. Docum. Anal. Recognit., Vol. 5, No. 1, pp. 67-85. Zbl1039.68137
  40. Toutanova K. and Moore R.C. (2002): Pronunciation modelling for improved spelling correction. - Proc. 40th Annual Meeting of the Association for Computational Linguistics, Hong Kong, pp. 144-151. 
  41. Wagner R.A. (1974): Order-n correction for regular languages. - Comm. ACM, Vol. 17, No. 5, pp. 265-268. Zbl0276.68011
  42. Wikipedia (2003): Wikipedia: List of common misspellings. - Available at http://en2.wikipedia.org/wiki/Wikipedia:List_of_common_misspellings 
  43. Yannakoudakis E.J. and Fawthrop D. (1983a): An intelligent spelling corrector. - Inf. Process. Manag., Vol. 19, No. 12, pp. 101-108. 
  44. Yannakoudakis E.J. and Fawthrop D. (1983b): The rules of spelling errors. - Inf. Process. Manag., Vol. 19, No. 2, pp. 87-99. 

NotesEmbed ?

top

You must be logged in to post comments.

To embed these notes on your page include the following JavaScript code on your page where you want the notes to appear.

Only the controls for the widget will be shown in your chosen language. Notes will be shown in their authored language.

Tells the widget how many notes to show per page. You can cycle through additional notes using the next and previous controls.

    
                

Note: Best practice suggests putting the JavaScript code just before the closing </body> tag.