Estimating a discrete distribution via histogram selection

Nathalie Akakpo

Estimating a discrete distribution via histogram selection

Nathalie Akakpo

ESAIM: Probability and Statistics (2011)

Volume: 15, page 1-29
ISSN: 1292-8100

Access Full Article

top

Access to full text

Abstract

top

Our aim is to estimate the joint distribution of a finite sequence of independent categorical variables. We consider the collection of partitions into dyadic intervals and the associated histograms, and we select from the data the best histogram by minimizing a penalized least-squares criterion. The choice of the collection of partitions is inspired from approximation results due to DeVore and Yu. Our estimator satisfies a nonasymptotic oracle-type inequality and adaptivity properties in the minimax sense. Moreover, its computational complexity is only linear in the length of the sequence. We also use that estimator during the preliminary stage of a hybrid procedure for detecting multiple change-points in the joint distribution of the sequence. That second procedure still satisfies adaptivity properties and can be implemented efficiently. We provide a simulation study and apply the hybrid procedure to the segmentation of a DNA sequence.

How to cite

top

MLA
BibTeX
RIS

Akakpo, Nathalie. "Estimating a discrete distribution via histogram selection." ESAIM: Probability and Statistics 15 (2011): 1-29. <http://eudml.org/doc/277143>.

@article{Akakpo2011,
abstract = {Our aim is to estimate the joint distribution of a finite sequence of independent categorical variables. We consider the collection of partitions into dyadic intervals and the associated histograms, and we select from the data the best histogram by minimizing a penalized least-squares criterion. The choice of the collection of partitions is inspired from approximation results due to DeVore and Yu. Our estimator satisfies a nonasymptotic oracle-type inequality and adaptivity properties in the minimax sense. Moreover, its computational complexity is only linear in the length of the sequence. We also use that estimator during the preliminary stage of a hybrid procedure for detecting multiple change-points in the joint distribution of the sequence. That second procedure still satisfies adaptivity properties and can be implemented efficiently. We provide a simulation study and apply the hybrid procedure to the segmentation of a DNA sequence.},
author = {Akakpo, Nathalie},
journal = {ESAIM: Probability and Statistics},
keywords = {adaptive estimator; approximation result; categorical variable; change-point detection; minimax estimation; model selection; nonparametric estimation; penalized least-squares estimation},
language = {eng},
pages = {1-29},
publisher = {EDP-Sciences},
title = {Estimating a discrete distribution via histogram selection},
url = {http://eudml.org/doc/277143},
volume = {15},
year = {2011},
}

TY - JOUR
AU - Akakpo, Nathalie
TI - Estimating a discrete distribution via histogram selection
JO - ESAIM: Probability and Statistics
PY - 2011
PB - EDP-Sciences
VL - 15
SP - 1
EP - 29
AB - Our aim is to estimate the joint distribution of a finite sequence of independent categorical variables. We consider the collection of partitions into dyadic intervals and the associated histograms, and we select from the data the best histogram by minimizing a penalized least-squares criterion. The choice of the collection of partitions is inspired from approximation results due to DeVore and Yu. Our estimator satisfies a nonasymptotic oracle-type inequality and adaptivity properties in the minimax sense. Moreover, its computational complexity is only linear in the length of the sequence. We also use that estimator during the preliminary stage of a hybrid procedure for detecting multiple change-points in the joint distribution of the sequence. That second procedure still satisfies adaptivity properties and can be implemented efficiently. We provide a simulation study and apply the hybrid procedure to the segmentation of a DNA sequence.
LA - eng
KW - adaptive estimator; approximation result; categorical variable; change-point detection; minimax estimation; model selection; nonparametric estimation; penalized least-squares estimation
UR - http://eudml.org/doc/277143
ER -

References

top

[1] M. Aerts and N. Veraverbeke, Bootstrapping a nonparametric polytomous regression model. Math. Meth. Statist.4 (1995) 189–200. Zbl0832.62031 MR1335154
[2] Y. Baraud and L. Birgé, Estimating the intensity of a random measure by histogram type estimators. Prob. Theory Relat. Fields143 (2009) 239–284. Zbl1149.62019 MR2449129
[3] A. Barron, L. Birgé and P. Massart, Risk bounds for model selection via penalization. Prob. Theory Relat. Fields113 (1999) 301–413. Zbl0946.62036 MR1679028
[4] C. Bennett and R. Sharpley, Interpolation of operators, volume 129 of Pure and Applied Mathematics. Academic Press Inc., Boston, M.A. (1988). Zbl0647.46057 MR928802
[5] L. Birgé, Model selection via testing: an alternative to (penalized) maximum likelihood estimators. Ann. Inst. H. Poincaré Probab. Statist.42 (2006) 273–325. Zbl1333.62094 MR2219712
[6] L. Birgé, Model selection for Poisson processes, in Asymptotics: Particles, Processes and Inverse Problems, Festschrift for Piet Groeneboom. IMS Lect. Notes Monograph Ser. 55. IMS, Beachwood, USA (2007) 32–64. Zbl1176.62082 MR2459930
[7] L. Birgé and P. Massart, Minimal penalties for Gaussian model selection. Prob. Theory Relat. Fields138 (2007) 33–73. Zbl1112.62082 MR2288064
[8] J.V. Braun and H.-G. Müller, Statistical methods for DNA sequence segmentation. Stat. Sci.13 (1998) 142–162. Zbl0960.62121
[9] J.V. Braun, R.K. Braun and H.-G. Müller, Multiple changepoint fitting via quasilikelihood, with application to DNA sequence segmentation. Biometrika87 (2000) 301–314. Zbl0963.62067 MR1782480
[10] T.H. Cormen, C.E. Leiserson, R.L. Rivest and C. Stein, Introduction to algorithms. Second edition. MIT Press, Cambridge, MA (2001). Zbl1187.68679 MR1848805
[11] M. Csűrös, Algorithms for finding maximum-scoring segment sets, in Proc. of the 4th international workshop on algorithms in bioinformatics 2004. Lect. Notes Comput. Sci. 3240. Springer, Berlin, Heidelberg (2004) 62–73. MR2155594
[12] R.A. DeVore and G.G. Lorentz, Constructive approximation. Springer-Verlag, Berlin, Heidelberg (1993). Zbl0797.41016 MR1261635
[13] R.A. DeVore and R.C. Sharpley, Maximal functions measuring smoothness. Mem. Amer. Math. Soc. 47 (1984) 293. Zbl0529.42005 MR727820
[14] R.A. DeVore and X.M. Yu, Degree of adaptive approximation. Math. Comp.55 (1990) 625–635. Zbl0723.41015 MR1035930
[15] C. Durot, E. Lebarbier and A.-S. Tocquet, Estimating the joint distribution of independent categorical variables via model selection. Bernoulli15 (2009) 475–507. Zbl1200.62024 MR2543871
[16] Y.-X. Fu and R.N. Curnow, Maximum likelihood estimation of multiple change points. Biometrika77 (1990) 562–565. Zbl0724.62025 MR1087847
[17] S. GeyS. and E. Lebarbier, Using CART to detect multiple change-points in the mean for large samples. SSB preprint, Research report No. 12 (2008).
[18] M. Hoebeke, P. Nicolas and P. Bessières, MuGeN: simultaneous exploration of multiple genomes and computer analysis results. Bioinformatics19 (2003) 859–864.
[19] E. Lebarbier, Quelques approches pour la détection de ruptures à horizon fini. Ph.D. thesis, Université Paris Sud, Orsay, 2002.
[20] E. Lebarbier and E. Nédélec, Change-points detection for discrete sequences via model selection. SSB preprint, Research Report No. 9 (2007).
[21] P. Massart, Concentration inequalities and model selection. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003. Lect. Notes Math. 1896. Springer, Berlin, Heidelberg (2007). Zbl1170.60006 MR2319879
[22] P. Nicolas et al., Mining Bacillus subtilis chromosome heterogeneities using hidden Markov models. Nucleic Acids Res.30 (2002) 1418–1426.
[23] W. Szpankowski, L. Szpankowski and W. Ren, An optimal DNA segmentation based on the MDL principle. Int. J. Bioinformatics Res. Appl.1 (2005) 3–17.

Citations in EuDML Documents

top

Alice Cleynen, Emilie Lebarbier, Segmentation of the Poisson and negative binomial rate models: a penalized estimator

NotesEmbed ?

top

You must be logged in to post comments.

To embed these notes on your page include the following JavaScript code on your page where you want the notes to appear.

Language to use for this widget.

Only the controls for the widget will be shown in your chosen language. Notes will be shown in their authored language.

Number of notes per page

Tells the widget how many notes to show per page. You can cycle through additional notes using the next and previous controls.

Note: Best practice suggests putting the JavaScript code just before the closing </body> tag.