The EM algorithm and its implementation for the estimation of frequencies of SNP-haplotypes

Joanna Polańska

International Journal of Applied Mathematics and Computer Science (2003)

  • Volume: 13, Issue: 3, pages 419-429
  • ISSN: 1641-876X

Abstract

Haplotype analysis is becoming increasingly important in the study of complex genetic diseases. Various algorithms and specialized computer software have been developed to statistically estimate haplotype frequencies from marker phenotypes in unrelated individuals. However, there are currently very few empirical reports on how well these methods recover haplotype frequencies. One of the most widely used methods of haplotype reconstruction is the Maximum Likelihood method, which employs the Expectation-Maximization (EM) algorithm. The aim of this study is to explore the variability of the EM estimates of haplotype frequencies for real data. We analyzed haplotypes at the BLM, WRN, RECQL and ATM genes with 8-14 biallelic markers per gene in 300 individuals. We also re-analyzed the data presented by Mano et al. (2002). We studied the convergence speed, the shape of the loglikelihood hypersurface, and the existence of local maxima, as well as their relations with heterozygosity, linkage disequilibrium and departures from the Hardy-Weinberg equilibrium. Our study contributes to determining practical values for algorithm sensitivities.
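
As an illustration of the EM approach named in the abstract, here is a minimal Python sketch for the simplest case of two biallelic markers. It is not the paper's implementation: the genotype coding (per-locus counts of allele 1), the uniform default starting point, the function names, and the stopping tolerance are all assumptions made for this example, and the E-step assumes Hardy-Weinberg proportions.

```python
from itertools import product

# Illustrative sketch only; names, coding and defaults are assumptions,
# not taken from the paper.
HAPLOTYPES = ["00", "01", "10", "11"]  # alleles at locus 1 and locus 2


def compatible_pairs(genotype):
    """Return ordered haplotype pairs consistent with an unphased genotype.

    `genotype` is a tuple of per-locus counts of allele 1 (0, 1 or 2).
    """
    return [
        (h1, h2)
        for h1, h2 in product(HAPLOTYPES, repeat=2)
        if all(int(a) + int(b) == g for a, b, g in zip(h1, h2, genotype))
    ]


def em_haplotype_frequencies(genotypes, start=None, tol=1e-10, max_iter=1000):
    """Estimate haplotype frequencies by EM; returns (frequencies, iterations)."""
    if start is None:
        freqs = {h: 1.0 / len(HAPLOTYPES) for h in HAPLOTYPES}
    else:
        freqs = dict(start)
    n_chromosomes = 2 * len(genotypes)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # E-step: expected haplotype counts given the current frequencies.
        counts = dict.fromkeys(HAPLOTYPES, 0.0)
        for genotype in genotypes:
            pairs = compatible_pairs(genotype)
            weights = [freqs[h1] * freqs[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: new frequencies are the normalized expected counts.
        new_freqs = {h: counts[h] / n_chromosomes for h in HAPLOTYPES}
        change = max(abs(new_freqs[h] - freqs[h]) for h in HAPLOTYPES)
        freqs = new_freqs
        if change < tol:
            break
    return freqs, iteration
```

With two markers only the double heterozygote `(1, 1)` is phase-ambiguous, so it is the only genotype whose expected counts depend on the current estimates. Calling the function with different `start` dictionaries is one simple way to check whether the estimates depend on the starting point, which relates to the question of local maxima raised in the abstract.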

How to cite

Polańska, Joanna. "The EM algorithm and its implementation for the estimation of frequencies of SNP-haplotypes." International Journal of Applied Mathematics and Computer Science 13.3 (2003): 419-429. <http://eudml.org/doc/207655>.

@article{Polańska2003,
author = {Polańska, Joanna},
journal = {International Journal of Applied Mathematics and Computer Science},
keywords = {gene frequency; algorithms; likelihood functions; haplotypes},
language = {eng},
number = {3},
pages = {419-429},
title = {The EM algorithm and its implementation for the estimation of frequencies of SNP-haplotypes},
url = {http://eudml.org/doc/207655},
volume = {13},
year = {2003},
}

TY - JOUR
AU - Polańska, Joanna
TI - The EM algorithm and its implementation for the estimation of frequencies of SNP-haplotypes
JO - International Journal of Applied Mathematics and Computer Science
PY - 2003
VL - 13
IS - 3
SP - 419
EP - 429
LA - eng
KW - gene frequency; algorithms; likelihood functions; haplotypes
UR - http://eudml.org/doc/207655
ER -

References

  1. Bonnen P.E., Story M.D., Ashorn C.L., Buchholz T.A., Weil M.M. and Nelson D.L. (2000): Haplotypes at ATM identify coding-sequence variation and indicate a region of extensive linkage disequilibrium. — Am. J. Hum. Genet., Vol. 67, No. 6, pp. 1437–1451. 
  2. Chiano M.N. and Clayton D.G. (1998): Fine genetic mapping using haplotype analysis and the missing data problem. — Ann. Hum. Genet., Vol. 62, Pt. 1, pp. 55–60. 
  3. Clark A.G. (1990): Inference of haplotypes from PCR-amplified samples of diploid populations. — Mol. Biol. Evol., Vol. 7, No. 2, pp. 111–122. 
  4. Clark V.J., Metheny N., Dean M. and Peterson R.J. (2001): Statistical estimation and pedigree analysis of CCR2-CCR5 haplotypes. — Hum. Genet., Vol. 108, No. 6, pp. 484–493. 
  5. Dempster A.P., Laird N.M. and Rubin D.B. (1977): Maximum likelihood from incomplete data via the EM algorithm. — J. R. Stat. Soc., Vol. 39, No. 1, pp. 1–38. Zbl0364.62022
  6. Excoffier L. and Slatkin M. (1995): Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. — Mol. Biol. Evol., Vol. 12, No. 5, pp. 921–927.
  7. Fallin D. and Schork N.J. (2000): Accuracy of haplotype frequency estimation for biallelic loci, via the Expectation-Maximization algorithm for unphased diploid genotype data. — Am. J. Hum. Genet., Vol. 67, No. 4, pp. 947–959.
  8. Ghosh S. and Majumder P.P. (2000): Mapping a quantitative trait locus via the EM algorithm and Bayesian classification. — Genet. Epidemiol., Vol. 19, No. 2, pp. 97–126. 
  9. Hawley M.E. and Kidd K.K. (1995): HAPLO: A program using the EM algorithm to estimate the frequencies of multi-site haplotypes. — J. Heredity, Vol. 86, No. 5, pp. 409–411. 
  10. Hudson R.R. and Kaplan N.L. (1985): Statistical properties of the number of recombination events in the history of a sample of DNA sequence. — Genetics, Vol. 111, No. 1, pp. 147–164. 
  11. Kalinowski S.T. and Hedrick P.W. (2001): Estimation of linkage disequilibrium for loci with multiple alleles: Basic approach and an application using data from bighorn sheep. — Heredity, Vol. 87, Pt. 6, pp. 698–708.
  12. Lin S., Cutler D.J., Zwick M.E. and Chakravarti A. (2002): Haplotype inference in random population samples. — Am. J. Hum. Genet., Vol. 71, No. 5, pp. 1129–1137. 
  13. Long J.C., Williams R.C. and Urbanek M. (1995): An E-M algorithm and testing strategy for multiple-locus haplotypes. — Am. J. Hum. Genet., Vol. 56, No. 3, pp. 799–810. 
  14. Mano S., Yasuda N., Tamiya G., Inoko H., Gojobori T. and Imanishi T. (2002): Phase space structure of haplotype frequency estimation by the EM algorithm. — Proc. Waterfront Symp. Human Genome Science WASH 2002, Tokyo, Japan.
  15. McKeigue P.M. (2000): Efficiency of estimation of haplotype frequencies: Use of marker phenotypes of unrelated individuals versus counting of phase-known gametes. — Am. J. Hum. Genet., Vol. 67, No. 6, pp. 1626–1627. 
  16. McLachlan G.J. and Krishnan T. (1997): The EM algorithm and extensions. — New York: Wiley. Zbl0882.62012
  17. Meng X. and van Dyk D. (1997): The EM algorithm — An old folk-song sung to a fast new tune. — J. R. Statist. Soc. B, Vol. 59, No. 3, pp. 511–567. Zbl1090.62518
  18. Niu T., Qin Z.S., Xu X. and Liu J.S. (2002): Bayesian haplotype inference for multiple linked Single-Nucleotide Polymorphisms. — Am. J. Hum. Genet., Vol. 70, No. 1, pp. 157–169.
  19. Patil N., Berno A.J., Hinds D.A., Barrett W.A., Doshi J.M., Hacker C.R., Kautzer C.R., Lee D.H., Marjoribanks C., McDonough D.P., et al. (2001): Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. — Science, Vol. 294, No. 5547, pp. 1719–1723.
  20. Qin Z.S., Niu T. and Liu J.S. (2002): Partition-Ligation-Expectation-Maximization algorithm for haplotype inference with Single-Nucleotide Polymorphism. — Am. J. Hum. Genet., Vol. 71, No. 5, pp. 1242–1247.
  21. Rohde K. and Fuerst R. (2001): Haplotyping and estimation of haplotype frequencies for closely linked biallelic multilocus genetic phenotypes including nuclear family information. — Hum. Mutat., Vol. 17, No. 4, pp. 289–295. 
  22. Schneider S., Roessli D. and Excoffier L. (2000): Arlequin 2.001: A software for population genetics data analysis. — Genetics and Biometry Laboratory, University of Geneva, Switzerland. 
  23. Single R.M., Meyer D., Hollenbach J.A., Nelson M.P., Noble J.A., Erlich H.A. and Thomson G. (2002): Haplotype frequency estimation in patient populations: the effect of departures from Hardy-Weinberg proportions and collapsing over a locus in the HLA region. — Genet. Epidemiol., Vol. 22, No. 2, pp. 186–195.
  24. Slatkin M. and Excoffier L. (1996): Testing for linkage disequilibrium in genotypic data using the Expectation-Maximization algorithm. — Heredity, Vol. 76, Pt. 4, pp. 377–383.
  25. Stephens M., Smith N.J. and Donnelly P. (2001a): A new statistical method for haplotype reconstruction from population data. — Am. J. Hum. Genet., Vol. 68, No. 4, pp. 978–989. 
  26. Stephens M., Smith N.J. and Donnelly P. (2001b): Reply to Zhang et al. — Am. J. Hum. Genet., Vol. 69, No. 4, pp. 912–914. 
  27. Tishkoff S.A., Pakstis A.J., Ruano G. and Kidd K.K. (2000): The accuracy of statistical methods for estimation of haplotype frequencies: An example from the CD4 locus. — Am. J. Hum. Genet., Vol. 67, No. 2, pp. 518–522. 
  28. Trikka D., Fang Z., Renwick A., Jones S.H., Chakraborty R., Kimmel M. and Nelson D.L. (2002): Complex SNP-based haplotypes in three human helicases: implication for cancer association studies. — Genome Res., Vol. 12, No. 4, pp. 627–639. 
  29. Wang N., Akey J.M., Zhang K., Chakraborty R. and Jin L. (2002): Distribution of recombination crossovers and the origin of haplotype blocks: The interplay of population history, recombination, and mutation. — Am. J. Hum. Genet., Vol. 71, No. 5, pp. 1227–1234. 
  30. Wu C.F.J. (1983): On the convergence properties of the EM algorithm. — Ann. Stat., Vol. 11, No. 1, pp. 95–103. Zbl0517.62035
  31. Xu C.F., Lewis K., Cantone K.L., Khan P., Donnelly C., White N., Crocker N., Boyd P.R., Zaykin D.V. and Purvis I.J. (2002): Effectiveness of computational methods in haplotype prediction. — Hum. Genet., Vol. 110, No. 2, pp. 148–156.
  32. Zhang S., Pakstis A.J., Kidd K.K. and Zhao H. (2001): Comparison of two methods for haplotype reconstruction and haplotype frequency estimation from population data. — Am. J. Hum. Genet., Vol. 69, No. 4, pp. 906–912.
