KIS: An automated attribute induction method for classification of DNA sequences

Rafał Biedrzycki; Jarosław Arabas

International Journal of Applied Mathematics and Computer Science (2012)

Volume: 22, Issue: 3, page 711-721
ISSN: 1641-876X

Abstract

This paper applies machine learning methods to the task of DNA sequence recognition. We present an algorithm that learns to recognize groups of DNA sequences sharing common features, such as sequence functionality. We demonstrate the algorithm on splice site detection, i.e., on correctly identifying donor and acceptor sequences, and compare the results with those of reference methods designed and tuned specifically to detect splice sites. We also show how to use the algorithm to find a human-readable model of the IRE (Iron-Responsive Element) and to find IRE sequences. Although the method is universal, its results are comparable in quality to those of the reference methods. In contrast to the reference methods, it uses models that operate on sequence patterns, which makes the results easier for humans to interpret.
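
The abstract does not describe how KIS builds its attributes, so the following is only an illustrative sketch of the general idea of a model that "operates on sequence patterns": each pattern becomes one binary attribute recording whether it occurs in a sequence. The function names, the example patterns (loosely modeled on commonly cited donor and acceptor consensus motifs), and the toy sequences are all assumptions made for illustration; none of them is taken from the paper.

```python
import re

# Subset of the standard IUPAC nucleotide codes, mapped to regex classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",    # purine
    "Y": "[CT]",    # pyrimidine
    "N": "[ACGT]",  # any base
}

def pattern_to_regex(pattern):
    """Compile an IUPAC-style pattern (hypothetical helper) into a regex."""
    return re.compile("".join(IUPAC[ch] for ch in pattern))

def induce_attributes(sequence, patterns):
    """Return one 0/1 attribute per pattern: 1 if it occurs in the sequence."""
    return [1 if pattern_to_regex(p).search(sequence) else 0 for p in patterns]

# Illustrative patterns and toy sequences (assumptions, not data from the paper).
patterns = ["GTRAGT", "YYYYNCAG"]
sequences = ["CAGGTAAGTAT", "TTTTCCAGGCT"]

attribute_vectors = [induce_attributes(s, patterns) for s in sequences]
for seq, vec in zip(sequences, attribute_vectors):
    print(seq, vec)
```

Attribute vectors of this kind could then be passed to any standard classifier, and because each attribute corresponds to a readable pattern, the resulting model stays open to inspection. How KIS actually selects or optimizes its patterns is described in the full article, not here.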

How to cite

Rafał Biedrzycki, and Jarosław Arabas. "KIS: An automated attribute induction method for classification of DNA sequences." International Journal of Applied Mathematics and Computer Science 22.3 (2012): 711-721. <http://eudml.org/doc/244053>.

@article{RafałBiedrzycki2012,
author = {Rafał Biedrzycki and Jarosław Arabas},
journal = {International Journal of Applied Mathematics and Computer Science},
keywords = {classification; optimization; annotation; patterns; DNA},
language = {eng},
number = {3},
pages = {711--721},
title = {KIS: An automated attribute induction method for classification of DNA sequences},
url = {http://eudml.org/doc/244053},
volume = {22},
year = {2012},
}

TY - JOUR
AU - Rafał Biedrzycki
AU - Jarosław Arabas
TI - KIS: An automated attribute induction method for classification of DNA sequences
JO - International Journal of Applied Mathematics and Computer Science
PY - 2012
VL - 22
IS - 3
SP - 711
EP - 721
LA - eng
KW - classification; optimization; annotation; patterns; DNA
UR - http://eudml.org/doc/244053
ER -
