Digital search trees and chaos game representation*

Peggy Cénac; Brigitte Chauvin; Stéphane Ginouillac; Nicolas Pouyanne

ESAIM: Probability and Statistics (2009)

  • Volume: 13, page 15-37
  • ISSN: 1292-8100

Abstract

top
In this paper, we consider a possible representation of a DNA sequence in a quaternary tree, in which one can visualize repetitions of subwords (seen as suffixes of subsequences). The CGR-tree turns a sequence of letters into a Digital Search Tree (DST), obtained from the suffixes of the reversed sequence. Several results are known concerning the height, the insertion depth for DST built from independent successive random sequences having the same distribution. Here the successive inserted words are strongly dependent. We give the asymptotic behaviour of the insertion depth and the length of branches for the CGR-tree obtained from the suffixes of a reversed i.i.d. or Markovian sequence. This behaviour turns out to be at first order the same one as in the case of independent words. As a by-product, asymptotic results on the length of longest runs in a Markovian sequence are obtained.

How to cite

top

Cénac, Peggy, et al. "Digital search trees and chaos game representation*." ESAIM: Probability and Statistics 13 (2009): 15-37. <http://eudml.org/doc/250658>.

@article{Cénac2009,
abstract = { In this paper, we consider a possible representation of a DNA sequence in a quaternary tree, in which one can visualize repetitions of subwords (seen as suffixes of subsequences). The CGR-tree turns a sequence of letters into a Digital Search Tree (DST), obtained from the suffixes of the reversed sequence. Several results are known concerning the height, the insertion depth for DST built from independent successive random sequences having the same distribution. Here the successive inserted words are strongly dependent. We give the asymptotic behaviour of the insertion depth and the length of branches for the CGR-tree obtained from the suffixes of a reversed i.i.d. or Markovian sequence. This behaviour turns out to be at first order the same one as in the case of independent words. As a by-product, asymptotic results on the length of longest runs in a Markovian sequence are obtained. },
author = {Cénac, Peggy, Chauvin, Brigitte, Ginouillac, Stéphane, Pouyanne, Nicolas},
journal = {ESAIM: Probability and Statistics},
keywords = {Random tree; Digital Search Tree; CGR; lengths of the paths; height; insertion depth; asymptotic growth; strong convergence; random tree; digital search tree; DST; length of branches; asymptotic behaviour},
language = {eng},
month = {2},
pages = {15-37},
publisher = {EDP Sciences},
title = {Digital search trees and chaos game representation*},
url = {http://eudml.org/doc/250658},
volume = {13},
year = {2009},
}

TY - JOUR
AU - Cénac, Peggy
AU - Chauvin, Brigitte
AU - Ginouillac, Stéphane
AU - Pouyanne, Nicolas
TI - Digital search trees and chaos game representation*
JO - ESAIM: Probability and Statistics
DA - 2009/2//
PB - EDP Sciences
VL - 13
SP - 15
EP - 37
AB - In this paper, we consider a possible representation of a DNA sequence in a quaternary tree, in which one can visualize repetitions of subwords (seen as suffixes of subsequences). The CGR-tree turns a sequence of letters into a Digital Search Tree (DST), obtained from the suffixes of the reversed sequence. Several results are known concerning the height, the insertion depth for DST built from independent successive random sequences having the same distribution. Here the successive inserted words are strongly dependent. We give the asymptotic behaviour of the insertion depth and the length of branches for the CGR-tree obtained from the suffixes of a reversed i.i.d. or Markovian sequence. This behaviour turns out to be at first order the same one as in the case of independent words. As a by-product, asymptotic results on the length of longest runs in a Markovian sequence are obtained.
LA - eng
KW - Random tree; Digital Search Tree; CGR; lengths of the paths; height; insertion depth; asymptotic growth; strong convergence; random tree; digital search tree; DST; length of branches; asymptotic behaviour
UR - http://eudml.org/doc/250658
ER -

References

top
  1. M. Abramowitz and I.A. Stegun, Handbook of mathematical functions with formulas, graphs, and mathematical tables, National Bureau of Standards Applied Mathematics Series55. For sale by the superintendent of Documents, U.S. Government Printing Office, Washington, D.C. (1964).  
  2. D. Aldous and P. Shields, A diffusion limit for a class of randomly-growing binary search trees. Probab. Theory Related Fields79 (1998) 509–542.  
  3. J.S. Almeida, J.A. Carriço, A. Maretzek, P.A. Noble and M. Fletcher, Analysis of genomic sequences by Chaos Game Representation. Bioinformatics17 (2001) 429–437.  
  4. G. Blom and D. Thorburn, How many random digits are required until given sequences are obtained? J. Appl. Probab.19 (1982) 518–531.  
  5. P. Cénac, Test on the structure of biological sequences via chaos game representation. Stat. Appl. Genet. Mol. Biol.4 (2005) 36 (electronic).  
  6. P. Cénac, G. Fayolle and J.M. Lasgouttes, Dynamical systems in the analysis of biological sequences. Technical Report 5351, INRIA (2004).  
  7. M. Drmota, The variance of the height of digital search trees. Acta Informatica38 (2002) 261–276.  
  8. M. Duflo, Random Iterative Models. Springer (1997).  
  9. P. Erdős and P. Révész, On the length of the longest head run, in Topics in Information Theory16 (1975) 219–228, I. Csizàr and P. Elias, Eds. North-Holland, Amsterdam Colloq. Math. Soc. Jànos Bolyai.  
  10. P. Erdős and P. Révész, On the length of the longest head-run. In Topics in information theory (Second Colloq., Keszthely, 1975). Colloq. Math. Soc. János Bolyai 16 (1977) 219–228.  
  11. J. Fayolle, Compression de données sans perte et combinatoire analytique. Thèse de l'université Paris VI (2006), available at  fayolle/these.pdf.  URIhttp://www.lri.fr/
  12. J.C. Fu, Bounds for reliability of large consecutive-k-out-of-n:f system. IEEE Trans. Reliability35 (1986) 316–319.  
  13. J.C. Fu and M.V. Koutras, Distribution theory of runs: a markov chain approach. J. Amer. Statist. Soc.89 (1994) 1050–1058.  
  14. H. Gerber and S. Li, The occurence of sequence patterns in repeated experiments and hitting times in a markov chain. Stochastic Process. Appl.11 (1981) 101–108.  
  15. N. Goldman, Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences. Nucleic Acids Res.21 (1993) 2487–2491.  
  16. L. Gordon, M.F. Schilling and M.S. Waterman, An extreme value theory for long head runs. Probab. Theory Related Fields72 (1986) 279–287.  
  17. H.J. Jeffrey, Chaos Game Representation of gene structure. Nucleic Acid. Res.18 (1990) 2163–2170.  
  18. M.V. Koutras, Waiting times and number of appearances of events in a sequence of discrete random variables, in Advances in combinatorial methods and applications to probability and statistics, Stat. Ind. Technol., Birkhäuser Boston, Boston, MA (1997) 363–384.  
  19. Shuo-Yen Robert Li, A martingale approach to the study of occurrence of sequence patterns in repeated experiments. Ann. Probab.8 (1980) 1171–1176.  
  20. H.M. Mahmoud, Evolution of random search trees. Wiley-Interscience Series in Discrete Mathematics and Optimization. John Wiley & Sons Inc., New York (1992).  
  21. W. Penney, Problem: Penney-ante. J. Recreational Math.2 (1969) 241.  
  22. V. Petrov, On the probabilities of large deviations for sums of independent random variables. Theory Prob. Appl. (1965) 287–298.  
  23. B. Pittel, Asymptotic growth of a class of random trees. Annals Probab.13 (1985) 414–427.  
  24. V. Pozdnyakov, J. Glaz, M. Kulldorff and J.M. Steele, A martingale approach to scan statistics. Ann. Inst. Statist. Math.57 (2005) 21–37.  
  25. M. Régnier, A unified approach to word occurence probabilities. Discrete Appl. Math.104 (2000) 259–280.  
  26. G. Reinert, S. Schbath and M.S. Waterman, Probabilistic and statistical properties of words: An overview. J. Comput. Biology7 (2000) 1–46.  
  27. S. Robin and J.J. Daudin, Exact distribution of word occurences in a random sequence of letters. J. Appl. Prob.36 (1999) 179–193.  
  28. A. Roy, C. Raychaudhury and A. Nandy, Novel techniques of graphical representation and analysis of DNA sequences – A review. J. Biosci.23 (1998) 55–71.  
  29. S.S. Samarova, On the length of the longest head-run for a markov chain with two states. Theory Probab. Appl.26 (1981) 498–509.  
  30. V. Stefanov and A.G. Pakes, Explicit distributional results in pattern formation. Ann. Appl. Probab.7 (1997) 666–678.  
  31. W. Szpankowski, Average Case Analysis of Algorithms on Sequences. John Wiley & Sons, New York (2001).  
  32. D. Williams, Probability with martingales. Cambridge Mathematical Textbooks. Cambridge University Press, Cambridge (1991).  
  33. * Partially supported by the French Agence Nationale de la Recherche, project SADA ANR-05-BLAN-0372.  

NotesEmbed ?

top

You must be logged in to post comments.

To embed these notes on your page include the following JavaScript code on your page where you want the notes to appear.

Only the controls for the widget will be shown in your chosen language. Notes will be shown in their authored language.

Tells the widget how many notes to show per page. You can cycle through additional notes using the next and previous controls.

    
                

Note: Best practice suggests putting the JavaScript code just before the closing </body> tag.