Compound Poisson approximation of word counts in DNA sequences

Sophie Schbath

ESAIM: Probability and Statistics (2010)

  • Volume: 1, page 1-16
  • ISSN: 1292-8100

Abstract

Identifying words with unexpected frequencies is an important problem in the analysis of long DNA sequences. To solve it, we need an approximation of the distribution of the number of occurrences N(W) of a word W. Modeling DNA sequences with m-order Markov chains, we use the Chen-Stein method to obtain Poisson approximations for two different counts. We approximate the “declumped” count of W by a Poisson variable and the number of occurrences N(W) by a compound Poisson variable. Combinatorial results are used to solve the general case of overlapping words and to calculate the parameters of these distributions.
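The two counts named in the abstract can be illustrated on a toy sequence. The following is a minimal sketch, not code from the paper: the function names `occurrences`, `periods`, and `clump_count` are hypothetical, a clump is assumed to be a maximal run of mutually overlapping occurrences of W, and a period of W is assumed to be a shift p (with 0 < p < |W|) at which W overlaps itself. The sketch only computes the counts; it does not fit a Markov model or evaluate the Poisson or compound Poisson approximations.

```python
def occurrences(seq, word):
    """Start positions of all occurrences of `word` in `seq`, overlaps included."""
    h = len(word)
    return [i for i in range(len(seq) - h + 1) if seq[i:i + h] == word]


def periods(word):
    """Shifts p (0 < p < len(word)) at which `word` overlaps itself.

    Assumed definition: p is a period if word[i] == word[i + p]
    for every valid index i.
    """
    h = len(word)
    return [p for p in range(1, h) if word[:h - p] == word[p:]]


def clump_count(seq, word):
    """Declumped count: occurrences that do not overlap the previous occurrence.

    Assumption: an occurrence opens a new clump when it starts at least
    len(word) positions after the start of the preceding occurrence.
    """
    h = len(word)
    count = 0
    prev = None
    for start in occurrences(seq, word):
        if prev is None or start - prev >= h:
            count += 1
        prev = start
    return count


if __name__ == "__main__":
    seq, word = "AAAATAAA", "AAA"
    print(len(occurrences(seq, word)))  # N(W): all occurrences
    print(clump_count(seq, word))       # declumped count
    print(periods(word))                # self-overlap shifts of W
```

For a word with no periods (for example `ACGT`), occurrences cannot overlap, so the two counts coincide; the compound Poisson approximation matters for self-overlapping words such as `AAA`.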

How to cite

Schbath, Sophie. "Compound Poisson approximation of word counts in DNA sequences." ESAIM: Probability and Statistics 1 (2010): 1-16. <http://eudml.org/doc/116578>.

@article{Schbath2010,
abstract = { Identifying words with unexpected frequencies is an important problem in the analysis of long DNA sequences. To solve it, we need an approximation of the distribution of the number of occurrences N(W) of a word W. Modeling DNA sequences with m-order Markov chains, we use the Chen-Stein method to obtain Poisson approximations for two different counts. We approximate the “declumped” count of W by a Poisson variable and the number of occurrences N(W) by a compound Poisson variable. Combinatorial results are used to solve the general case of overlapping words and to calculate the parameters of these distributions. },
author = {Schbath, Sophie},
journal = {ESAIM: Probability and Statistics},
keywords = {DNA sequences; word counts; Poisson approximations; compound Poisson distribution; Chen-Stein method; Markov chains; word periods},
language = {eng},
month = {3},
pages = {1-16},
publisher = {EDP Sciences},
title = {Compound Poisson approximation of word counts in DNA sequences},
url = {http://eudml.org/doc/116578},
volume = {1},
year = {2010},
}

TY - JOUR
AU - Schbath, Sophie
TI - Compound Poisson approximation of word counts in DNA sequences
JO - ESAIM: Probability and Statistics
DA - 2010/3//
PB - EDP Sciences
VL - 1
SP - 1
EP - 16
AB - Identifying words with unexpected frequencies is an important problem in the analysis of long DNA sequences. To solve it, we need an approximation of the distribution of the number of occurrences N(W) of a word W. Modeling DNA sequences with m-order Markov chains, we use the Chen-Stein method to obtain Poisson approximations for two different counts. We approximate the “declumped” count of W by a Poisson variable and the number of occurrences N(W) by a compound Poisson variable. Combinatorial results are used to solve the general case of overlapping words and to calculate the parameters of these distributions.
LA - eng
KW - DNA sequences; word counts; Poisson approximations; compound Poisson distribution; Chen-Stein method; Markov chains; word periods
UR - http://eudml.org/doc/116578
ER -
