Sparsity in penalized empirical risk minimization

Vladimir Koltchinskii

Annales de l'I.H.P. Probabilités et statistiques (2009)

  • Volume: 45, Issue: 1, pages 7–57
  • ISSN: 0246-0203

Abstract

Let (X, Y) be a random couple in S×T with unknown distribution P. Let (X1, Y1), …, (Xn, Yn) be i.i.d. copies of (X, Y), Pn being their empirical distribution. Let h1, …, hN: S↦[−1, 1] be a dictionary consisting of N functions. For λ∈ℝN, denote fλ:=∑j=1Nλjhj. Let ℓ:T×ℝ↦ℝ be a given loss function, which is convex with respect to the second variable. Denote (ℓ•f)(x, y):=ℓ(y; f(x)). We study the following penalized empirical risk minimization problem
\[\hat{\lambda}^{\varepsilon}:=\mathop{\operatorname{argmin}}_{\lambda\in\mathbb{R}^{N}}\bigl[P_{n}(\ell\bullet f_{\lambda})+\varepsilon\Vert\lambda\Vert_{\ell_{p}}^{p}\bigr],\]
which is an empirical version of the problem
\[\lambda^{\varepsilon}:=\mathop{\operatorname{argmin}}_{\lambda\in\mathbb{R}^{N}}\bigl[P(\ell\bullet f_{\lambda})+\varepsilon\Vert\lambda\Vert_{\ell_{p}}^{p}\bigr]\]
(here ɛ≥0 is a regularization parameter; λ0 corresponds to ɛ=0). A number of regression and classification problems fit this general framework. We are interested in the case when p≥1, but it is close enough to 1 (so that p−1 is of the order $\frac{1}{\log N}$, or smaller). We show that the “sparsity” of λɛ implies the “sparsity” of λ̂ɛ and study the impact of “sparsity” on bounding the excess risk P(ℓ•fλ̂ɛ)−P(ℓ•fλ0) of solutions of empirical risk minimization problems.
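Why p−1 of order 1/log N? A standard norm comparison (a common motivation for this regime, stated here for orientation rather than quoted from the paper) shows that for such p the ℓp and ℓ1 norms on ℝN are equivalent up to the absolute constant e, so the penalty acts like an ℓ1 (LASSO-type) penalty while remaining strictly convex for p>1. By Hölder's inequality,
\[\Vert\lambda\Vert_{\ell_{p}}\le\Vert\lambda\Vert_{\ell_{1}}\le N^{1-1/p}\Vert\lambda\Vert_{\ell_{p}},\qquad N^{1-1/p}\le N^{p-1}=N^{1/\log N}=e\quad\text{for }p=1+\frac{1}{\log N}.\]
With p=1 exactly, the empirical problem is the LASSO of Tibshirani [25].

The following minimal numerical sketch illustrates the empirical minimization problem above. It is not the paper's algorithm; the concrete choices (quadratic loss ℓ(y, u)=(y−u)², a random dictionary, the derivative-free Powell solver, all constants) are illustrative assumptions only.

    # Illustrative sketch of
    #   lambda_hat := argmin_lambda [ P_n(ell . f_lambda) + eps * ||lambda||_p^p ]
    # with quadratic loss ell(y, u) = (y - u)^2 and a random dictionary
    # h_1, ..., h_N taking values in [-1, 1]; all names here are hypothetical.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    n, N = 100, 20                      # sample size, dictionary size
    p = 1.0 + 1.0 / np.log(N)           # p - 1 of order 1/log N, as in the paper
    eps = 0.1                           # regularization parameter epsilon

    H = rng.uniform(-1.0, 1.0, size=(n, N))   # H[i, j] = h_j(X_i)
    lam_star = np.zeros(N)
    lam_star[:3] = [1.0, -0.5, 0.75]          # sparse target coefficients
    y = H @ lam_star + 0.1 * rng.standard_normal(n)

    def penalized_risk(lam):
        # empirical risk P_n(ell . f_lambda) plus the ell_p^p penalty
        return np.mean((y - H @ lam) ** 2) + eps * np.sum(np.abs(lam) ** p)

    lam_hat = minimize(penalized_risk, x0=np.zeros(N), method="Powell").x
    print("estimated support:", np.flatnonzero(np.abs(lam_hat) > 1e-2))

Since p>1, each |λj|^p is differentiable with derivative vanishing at 0, so generic smooth or derivative-free solvers apply; at p=1 one would instead use a dedicated ℓ1 (LASSO) routine.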

How to cite


Koltchinskii, Vladimir. "Sparsity in penalized empirical risk minimization." Annales de l'I.H.P. Probabilités et statistiques 45.1 (2009): 7-57. <http://eudml.org/doc/78023>.

@article{Koltchinskii2009,
abstract = {Let (X, Y) be a random couple in S×T with unknown distribution P. Let (X1, Y1), …, (Xn, Yn) be i.i.d. copies of (X, Y), Pn being their empirical distribution. Let h1, …, hN:S↦[−1, 1] be a dictionary consisting of N functions. For λ∈ℝN, denote fλ:=∑j=1Nλjhj. Let ℓ:T×ℝ↦ℝ be a given loss function, which is convex with respect to the second variable. Denote (ℓ•f)(x, y):=ℓ(y; f(x)). We study the following penalized empirical risk minimization problem \[\hat\{\lambda \}^\{\varepsilon \}:=\mathop \{\operatorname\{argmin\}\}\_\{\lambda \in \{\mathbb \{R\}\}^\{N\}\}\bigl [P\_\{n\}(\ell \bullet f\_\{\lambda \})+\varepsilon \Vert \lambda \Vert \_\{\ell \_\{p\}\}^\{p\}\bigr ],\] which is an empirical version of the problem \[\lambda ^\{\varepsilon \}:=\mathop \{\operatorname\{argmin\}\}\_\{\lambda \in \{\mathbb \{R\}\}^\{N\}\}\bigl [P(\ell \bullet f\_\{\lambda \})+\varepsilon \Vert \lambda \Vert \_\{\ell \_\{p\}\}^\{p\}\bigr ]\] (here ɛ≥0 is a regularization parameter; λ0 corresponds to ɛ=0). A number of regression and classification problems fit this general framework. We are interested in the case when p≥1, but it is close enough to 1 (so that p−1 is of the order $\frac\{1\}\{\log N\}$, or smaller). We show that the “sparsity” of λɛ implies the “sparsity” of λ̂ɛ and study the impact of “sparsity” on bounding the excess risk P(ℓ•fλ̂ɛ)−P(ℓ•fλ0) of solutions of empirical risk minimization problems.},
author = {Koltchinskii, Vladimir},
journal = {Annales de l'I.H.P. Probabilités et statistiques},
keywords = {empirical risk; penalized empirical risk; ℓ\_p-penalty; sparsity; oracle inequalities; Rademacher processes},
language = {eng},
number = {1},
pages = {7-57},
publisher = {Gauthier-Villars},
title = {Sparsity in penalized empirical risk minimization},
url = {http://eudml.org/doc/78023},
volume = {45},
year = {2009},
}

TY - JOUR
AU - Koltchinskii, Vladimir
TI - Sparsity in penalized empirical risk minimization
JO - Annales de l'I.H.P. Probabilités et statistiques
PY - 2009
PB - Gauthier-Villars
VL - 45
IS - 1
SP - 7
EP - 57
AB - Let (X, Y) be a random couple in S×T with unknown distribution P. Let (X1, Y1), …, (Xn, Yn) be i.i.d. copies of (X, Y), Pn being their empirical distribution. Let h1, …, hN:S↦[−1, 1] be a dictionary consisting of N functions. For λ∈ℝN, denote fλ:=∑j=1Nλjhj. Let ℓ:T×ℝ↦ℝ be a given loss function, which is convex with respect to the second variable. Denote (ℓ•f)(x, y):=ℓ(y; f(x)). We study the following penalized empirical risk minimization problem \[\hat{\lambda }^{\varepsilon }:=\mathop {\operatorname{argmin}}_{\lambda \in {\mathbb {R}}^{N}}\bigl [P_{n}(\ell \bullet f_{\lambda })+\varepsilon \Vert \lambda \Vert _{\ell _{p}}^{p}\bigr ],\] which is an empirical version of the problem \[\lambda ^{\varepsilon }:=\mathop {\operatorname{argmin}}_{\lambda \in {\mathbb {R}}^{N}}\bigl [P(\ell \bullet f_{\lambda })+\varepsilon \Vert \lambda \Vert _{\ell _{p}}^{p}\bigr ]\] (here ɛ≥0 is a regularization parameter; λ0 corresponds to ɛ=0). A number of regression and classification problems fit this general framework. We are interested in the case when p≥1, but it is close enough to 1 (so that p−1 is of the order $\frac{1}{\log N}$, or smaller). We show that the “sparsity” of λɛ implies the “sparsity” of λ̂ɛ and study the impact of “sparsity” on bounding the excess risk P(ℓ•fλ̂ɛ)−P(ℓ•fλ0) of solutions of empirical risk minimization problems.
LA - eng
KW - empirical risk; penalized empirical risk; ℓ_p-penalty; sparsity; oracle inequalities; Rademacher processes
UR - http://eudml.org/doc/78023
ER -

References

  1. [1] A. Barron, L. Birgé and P. Massart. Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 (1999) 301–413. Zbl0946.62036MR1679028
  2. [2] P. Bartlett, O. Bousquet and S. Mendelson. Local Rademacher complexities. Ann. Statist. 33 (2005) 1497–1537. Zbl1083.62034MR2166554
  3. [3] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization. Analysis, Algorithms and Engineering Applications. MPS/SIAM, Series on Optimization, Philadelphia, 2001. Zbl0986.90032MR1857264
  4. [4] F. Bunea, A. Tsybakov and M. Wegkamp. Aggregation for Gaussian regression. Ann. Statist. 35 (2007) 1674–1697. Zbl1209.62065MR2351101
  5. [5] F. Bunea, A. Tsybakov and M. Wegkamp. Sparsity oracle inequalities for the LASSO. Electron. J. Statist. 1 (2007) 169–194. Zbl1146.62028MR2312149
  6. [6] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Statist. 35 (2007) 2313–2351. Zbl1139.62019MR2382644
  7. [7] E. Candes, M. Rudelson, T. Tao and R. Vershynin. Error correction via linear programming. In Proc. 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS'05), 295–308. IEEE, Pittsburgh, PA, 2005. 
  8. [8] E. Candes, J. Romberg and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math. 59 (2006) 1207–1223. Zbl1098.94009MR2230846
  9. [9] O. Catoni. Statistical Learning Theory and Stochastic Optimization. Springer, New York, 2004. Zbl1076.93002MR2163920
  10. [10] D. L. Donoho. For most large underdetermined systems of equations the minimal ℓ1-norm near-solution approximates the sparsest near-solution. Preprint, 2004. Zbl1105.90068
  11. [11] D. L. Donoho. For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution. Comm. Pure Appl. Math. 59 (2006) 797–829. Zbl1113.15004MR2217606
  12. [12] D. L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory 52 (2006) 1289–1306. Zbl1288.94016MR2241189
  13. [13] D. L. Donoho, M. Elad and V. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Inform. Theory 52 (2006) 6–18. Zbl1288.94017MR2237332
  14. [14] S. van de Geer. High-dimensional generalized linear models and the Lasso. Ann. Statist. 36 (2008) 614–645. Zbl1138.62323MR2396809
  15. [15] V. Koltchinskii. Model selection and aggregation in sparse classification problems. Oberwolfach Reports Meeting on Statistical and Probabilistic Methods of Model Selection, October, 2005. 
  16. [16] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 (2006) 2593–2656. Zbl1118.62065MR2329442
  17. [17] V. Koltchinskii and D. Panchenko. Complexities of convex combinations and bounding the generalization error in classification. Ann. Statist. 33 (2005) 1455–1496. Zbl1080.62045MR2166553
  18. [18] M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, New York, 1991. Zbl0748.60004MR1102015
  19. [19] P. Massart. Some applications of concentration inequalities to statistics. Ann. Fac. Sci. Toulouse (IX) 9 (2000) 245–303. Zbl0986.62002MR1813803
  20. [20] P. Massart. Concentration Inequalities and Model Selection. Springer, Berlin, 2007. Zbl1170.60006MR2319879
  21. [21] S. Mendelson, A. Pajor and N. Tomczak-Jaegermann. Reconstruction and subgaussian operators in Asymptotic Geometric Analysis. Geom. Funct. Anal. 17 (2007) 1248–1282. Zbl1163.46008MR2373017
  22. [22] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the LASSO. Ann. Statist. 34 (2006) 1436–1462. Zbl1113.62082MR2278363
  23. [23] A. Nemirovski. Topics in non-parametric statistics. In École d'Été de Probabilités de Saint-Flour XXVIII, 1998, 85–277. P. Bernard (Ed.). Springer, New York, 2000. Zbl0998.62033MR1775640
  24. [24] M. Rudelson and R. Vershynin. Geometric approach to error correcting codes and reconstruction of signals. Int. Math. Res. Not. 64 (2005) 4019–4041. Zbl1103.94014MR2206919
  25. [25] R. Tibshirani. Regression shrinkage and selection via the Lasso. J. Royal Statist. Soc. Ser. B 58 (1996) 267–288. Zbl0850.62538MR1379242
  26. [26] A. Tsybakov. Optimal rates of aggregation. In Proc. 16th Annual Conference on Learning Theory (COLT) and 7th Annual Workshop on Kernel Machines, 303–313. Lecture Notes in Artificial Intelligence 2777. Springer, New York, 2003. Zbl1208.62073
  27. [27] A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer, New York, 1996. Zbl0862.60002MR1385671
  28. [28] Y. Yang. Mixing strategies for density estimation. Ann. Statist. 28 (2000) 75–87. Zbl1106.62322MR1762904
  29. [29] Y. Yang. Aggregating regression procedures for a better performance. Bernoulli 10 (2004) 25–47. Zbl1040.62030MR2044592
  30. [30] P. Zhao and B. Yu. On model selection consistency of LASSO. J. Mach. Learn. Res. 7 (2006) 2541–2563. Zbl1222.62008MR2274449
