Forbidden factors and fragment assembly

F. Mignosi; A. Restivo; M. Sciortino

RAIRO - Theoretical Informatics and Applications - Informatique Théorique et Applications (2001)

  • Volume: 35, Issue: 6, page 565-577
  • ISSN: 0988-3754

Abstract

top
In this paper methods and results related to the notion of minimal forbidden words are applied to the fragment assembly problem. The fragment assembly problem can be formulated, in its simplest form, as follows: reconstruct a word w from a given set I of substrings (fragments) of a word w . We introduce an hypothesis involving the set of fragments I and the maximal length m ( w ) of the minimal forbidden factors of w . Such hypothesis allows us to reconstruct uniquely the word w from the set I in linear time. We prove also that, if w is a word randomly generated by a memoryless source with identical symbol probabilities, m ( w ) is logarithmic with respect to the size of w . This result shows that our reconstruction algorithm is suited to approach the above problem in several practical applications e.g. in the case of DNA sequences.

How to cite

top

Mignosi, F., Restivo, A., and Sciortino, M.. "Forbidden factors and fragment assembly." RAIRO - Theoretical Informatics and Applications - Informatique Théorique et Applications 35.6 (2001): 565-577. <http://eudml.org/doc/92685>.

@article{Mignosi2001,
abstract = {In this paper methods and results related to the notion of minimal forbidden words are applied to the fragment assembly problem. The fragment assembly problem can be formulated, in its simplest form, as follows: reconstruct a word $w$ from a given set $I$ of substrings (fragments) of a word $w$. We introduce an hypothesis involving the set of fragments $I$ and the maximal length $m(w)$ of the minimal forbidden factors of $w$. Such hypothesis allows us to reconstruct uniquely the word $w$ from the set $I$ in linear time. We prove also that, if $w$ is a word randomly generated by a memoryless source with identical symbol probabilities, $m(w)$ is logarithmic with respect to the size of $w$. This result shows that our reconstruction algorithm is suited to approach the above problem in several practical applications e.g. in the case of DNA sequences.},
author = {Mignosi, F., Restivo, A., Sciortino, M.},
journal = {RAIRO - Theoretical Informatics and Applications - Informatique Théorique et Applications},
keywords = {factor automaton; minimal forbidden factor; fragment assembly; minimal forbidden words},
language = {eng},
number = {6},
pages = {565-577},
publisher = {EDP-Sciences},
title = {Forbidden factors and fragment assembly},
url = {http://eudml.org/doc/92685},
volume = {35},
year = {2001},
}

TY - JOUR
AU - Mignosi, F.
AU - Restivo, A.
AU - Sciortino, M.
TI - Forbidden factors and fragment assembly
JO - RAIRO - Theoretical Informatics and Applications - Informatique Théorique et Applications
PY - 2001
PB - EDP-Sciences
VL - 35
IS - 6
SP - 565
EP - 577
AB - In this paper methods and results related to the notion of minimal forbidden words are applied to the fragment assembly problem. The fragment assembly problem can be formulated, in its simplest form, as follows: reconstruct a word $w$ from a given set $I$ of substrings (fragments) of a word $w$. We introduce an hypothesis involving the set of fragments $I$ and the maximal length $m(w)$ of the minimal forbidden factors of $w$. Such hypothesis allows us to reconstruct uniquely the word $w$ from the set $I$ in linear time. We prove also that, if $w$ is a word randomly generated by a memoryless source with identical symbol probabilities, $m(w)$ is logarithmic with respect to the size of $w$. This result shows that our reconstruction algorithm is suited to approach the above problem in several practical applications e.g. in the case of DNA sequences.
LA - eng
KW - factor automaton; minimal forbidden factor; fragment assembly; minimal forbidden words
UR - http://eudml.org/doc/92685
ER -

References

top
  1. [1] A.V. Aho, J.E. Hopcroft and J.D. Ullman, Data Structures and Algorithms. Addison Wesley, Reading, Mass (1983). Zbl0487.68005MR666695
  2. [2] M.-P. Béal, F. Mignosi, A. Restivo and M. Sciortino, Forbidden Words in Symbolic Dynamics. Adv. in Appl. Math. 25 (2000) 163-193. Zbl0965.37014MR1780764
  3. [3] A. Blumer, J. Blumer, A. Ehrenfeucht, D. Haussler, M.T. Chen and J. Seiferas, The smallest automaton recognizing the subwords of a text. Theoret. Comput. Sci. 40 (1985) 31-55. Zbl0574.68070MR828515
  4. [4] A. Carpi, A. de Luca and S. Varricchio, Words, univalent factors, and boxes. Acta Inform. (to appear). Zbl1025.68052
  5. [5] M. Crochemore and C. Hancart, Automata for matching patterns, in Handbook of Formal Languages, Vol. 2, Chap. 9, edited by G. Rosenberg and A. Salomaan. Springer (1997) 399-462. MR1470014
  6. [6] M. Crochemore, F. Mignosi and A. Restivo, Automata and forbidden words. Inform. Process. Lett. 67 (1998) 111-117. Zbl06590735MR1638178
  7. [7] M. Crochemore, F. Mignosi, A. Restivo and S. Salemi, Data compression using antidictionaries, in Proc. of the IEEE, Special Issue on Lossless Data Compression, Vol. 88, edited by J.A. Storer (2000) 1756-1768. 
  8. [8] A. Frieze and B.V. Halldórsson, Optimal Sequencing by Hybridization in Rounds, in Proc. of RECOMB 2001, edited by T. Lengauer, D. Sankoff, S. Istrail, P. Pevzner and M. Waterman. ACM Press (2001) 141-148 
  9. [9] D. Gusfield, Algorithms on strings, trees, and sequences: Computer science and computational biology. Cambridge University Press (1997). Zbl0934.68103MR1460730
  10. [10] R. Idury and M. Waterman, A new algorithm for DNA sequence assembly. J. Comput. Biol. 2 (1995) 291-306. 
  11. [11] F. Mignosi and A. Restivo, Periodicity, in M. Lothaire, Algebraic Combinatorics on Words, Chap. 8. Cambridge University Press (to appear) 237-274. Also available at url: http://www-igm.univ-mlv.fr/~berstel/Lothaire/index.html MR1905123
  12. [12] F. Mignosi, A. Restivo and M. Sciortino, Forbidden Factors and Fragment Assembly. Lecture Notes in Comput. Sci. (2001). Proceedings of DLT’01. Zbl1073.68704
  13. [13] F. Mignosi, A. Restivo and M. Sciortino, Words and Forbidden Factors. Theoret. Comput. Sci. 273 (2002) 99-117. Zbl0997.68093MR1872445
  14. [14] F. Mignosi, A. Restivo, M. Sciortino and J. Storer, On Sequence Assembly, Technical Report cs-00-210. Brandeis University (2000). 
  15. [15] S. Muthukrishnan and S.C. Sahinalp, Approximate nearest neighbors and sequence comparison with block operations. ACM Press (2000). Proceedings of STOC 2000. Zbl1296.68082MR2115277
  16. [16] G. Myers, Whole-Genome DNA Sequencing, IEEE Comput. Engrg. Sci. 3 (1999) 33-43. 
  17. [17] H. Peltola, H. Soderlund and E. Ukkonen, SEQAID: A DNA Sequence Assembly Program Based on a Mathematical Model. Nucl. Acids Res. 12 (1984) 307-321. 
  18. [18] M. Peltola, H. Soderlund, J. Tarhio and E. Ukkonen, Algorithms for some string matching problems arising in molecular genetics, in Proc. of the 9th IFIP World Computer Congress (1983) 59-64. 
  19. [19] P.A. Pevzner, H. Tang and M. Waterman, A New Approach Fragment Assembly in DNA Sequencing, in Proc. of RECOMB 2001, edited by T. Lengauer, D. Sankoff, S. Istrail, P. Pevzner and M. Waterman. ACM Press (2001) 141-148. 
  20. [20] R. Shamir and D. Tsur, Large Scale Sequencing by Hybridization, in Proc. of RECOMB 2001, edited by T. Lengauer, D. Sankoff, S. Istrail, P. Pevzner and M. Waterman. ACM Press (2001) 269-278. 
  21. [21] M.S. Waterman, Introduction to computational biology: Maps, sequences and genomes. Chapman & Hall (1995). Zbl0831.92011

NotesEmbed ?

top

You must be logged in to post comments.

To embed these notes on your page include the following JavaScript code on your page where you want the notes to appear.

Only the controls for the widget will be shown in your chosen language. Notes will be shown in their authored language.

Tells the widget how many notes to show per page. You can cycle through additional notes using the next and previous controls.

    
                

Note: Best practice suggests putting the JavaScript code just before the closing </body> tag.