Two algorithms based on Markov chains and their application to recognition of protein coding genes in prokaryotic genomes

Małgorzata Grabińska; Paweł Błażej; Paweł Mackiewicz

Applicationes Mathematicae (2013)

  • Volume: 40, Issue: 4, page 447-457
  • ISSN: 1233-7234

Abstract

top
Methods based on the theory of Markov chains are most commonly used in the recognition of protein coding sequences. However, they require big learning sets to fill up all elements in transition probability matrices describing dependence between nucleotides in the analyzed sequences. Moreover, gene prediction is strongly influenced by the nucleotide bias measured by e.g. G+C content. In this paper we compare two methods: (i) the classical GeneMark algorithm, which uses a three-periodic non-homogeneous Markov chain, and (ii) an algorithm called PMC that considers six independent homogeneous Markov chains to describe transition between nucleotides separately for each of three codon positions in two DNA strands. We have tested the efficiency (in terms of true positive rate) of these two Markov chain methods for the model bacterial genome of Escherichia coli depending on the size of the learning set, uncertainty of ORFs' function annotation, and model order of these algorithms. We have also applied the methods with different model orders for 163 prokaryotic genomes that covered a wide range of G+C content. The PMC algorithm of different chain orders turns out to be more stable in comparison to the GeneMark algorithm. The PMC also outperforms the GM algorithm giving a higher fraction of coding sequences in the tested set of annotated genes. Moreover, it requires much smaller learning sets than GM to work properly.

How to cite

top

Małgorzata Grabińska, Paweł Błażej, and Paweł Mackiewicz. "Two algorithms based on Markov chains and their application to recognition of protein coding genes in prokaryotic genomes." Applicationes Mathematicae 40.4 (2013): 447-457. <http://eudml.org/doc/279938>.

@article{MałgorzataGrabińska2013,
abstract = {Methods based on the theory of Markov chains are most commonly used in the recognition of protein coding sequences. However, they require big learning sets to fill up all elements in transition probability matrices describing dependence between nucleotides in the analyzed sequences. Moreover, gene prediction is strongly influenced by the nucleotide bias measured by e.g. G+C content. In this paper we compare two methods: (i) the classical GeneMark algorithm, which uses a three-periodic non-homogeneous Markov chain, and (ii) an algorithm called PMC that considers six independent homogeneous Markov chains to describe transition between nucleotides separately for each of three codon positions in two DNA strands. We have tested the efficiency (in terms of true positive rate) of these two Markov chain methods for the model bacterial genome of Escherichia coli depending on the size of the learning set, uncertainty of ORFs' function annotation, and model order of these algorithms. We have also applied the methods with different model orders for 163 prokaryotic genomes that covered a wide range of G+C content. The PMC algorithm of different chain orders turns out to be more stable in comparison to the GeneMark algorithm. The PMC also outperforms the GM algorithm giving a higher fraction of coding sequences in the tested set of annotated genes. Moreover, it requires much smaller learning sets than GM to work properly.},
author = {Małgorzata Grabińska, Paweł Błażej, Paweł Mackiewicz},
journal = {Applicationes Mathematicae},
keywords = {Markov chains; DNA sequence; recognition of protein coding sequences; open reading frame},
language = {eng},
number = {4},
pages = {447-457},
title = {Two algorithms based on Markov chains and their application to recognition of protein coding genes in prokaryotic genomes},
url = {http://eudml.org/doc/279938},
volume = {40},
year = {2013},
}

TY - JOUR
AU - Małgorzata Grabińska
AU - Paweł Błażej
AU - Paweł Mackiewicz
TI - Two algorithms based on Markov chains and their application to recognition of protein coding genes in prokaryotic genomes
JO - Applicationes Mathematicae
PY - 2013
VL - 40
IS - 4
SP - 447
EP - 457
AB - Methods based on the theory of Markov chains are most commonly used in the recognition of protein coding sequences. However, they require big learning sets to fill up all elements in transition probability matrices describing dependence between nucleotides in the analyzed sequences. Moreover, gene prediction is strongly influenced by the nucleotide bias measured by e.g. G+C content. In this paper we compare two methods: (i) the classical GeneMark algorithm, which uses a three-periodic non-homogeneous Markov chain, and (ii) an algorithm called PMC that considers six independent homogeneous Markov chains to describe transition between nucleotides separately for each of three codon positions in two DNA strands. We have tested the efficiency (in terms of true positive rate) of these two Markov chain methods for the model bacterial genome of Escherichia coli depending on the size of the learning set, uncertainty of ORFs' function annotation, and model order of these algorithms. We have also applied the methods with different model orders for 163 prokaryotic genomes that covered a wide range of G+C content. The PMC algorithm of different chain orders turns out to be more stable in comparison to the GeneMark algorithm. The PMC also outperforms the GM algorithm giving a higher fraction of coding sequences in the tested set of annotated genes. Moreover, it requires much smaller learning sets than GM to work properly.
LA - eng
KW - Markov chains; DNA sequence; recognition of protein coding sequences; open reading frame
UR - http://eudml.org/doc/279938
ER -

NotesEmbed ?

top

You must be logged in to post comments.

To embed these notes on your page include the following JavaScript code on your page where you want the notes to appear.

Only the controls for the widget will be shown in your chosen language. Notes will be shown in their authored language.

Tells the widget how many notes to show per page. You can cycle through additional notes using the next and previous controls.

    
                

Note: Best practice suggests putting the JavaScript code just before the closing </body> tag.