| author | Gertjan van den Burg <gertjanvandenburg@gmail.com> | 2021-01-09 22:14:01 +0000 |
|---|---|---|
| committer | Gertjan van den Burg <gertjanvandenburg@gmail.com> | 2021-01-09 22:14:01 +0000 |
| commit | bd8e6991b350a69fd0e08720711ede17261b1025 (patch) | |
| tree | c6d4b039690ecb1d82f590c0335e8ec2f5ff4da5 | |
| parent | Documentation updates (diff) | |
Update readme with mini-tutorial
| -rw-r--r-- | .github/images/sparsestep_prostate_1.png | bin | 0 -> 20549 bytes |
| -rw-r--r-- | .github/images/sparsestep_prostate_2.png | bin | 0 -> 23840 bytes |
| -rw-r--r-- | README.md | 193 |
3 files changed, 143 insertions, 50 deletions
diff --git a/.github/images/sparsestep_prostate_1.png b/.github/images/sparsestep_prostate_1.png
Binary files differ
new file mode 100644
index 0000000..8f53392
--- /dev/null
+++ b/.github/images/sparsestep_prostate_1.png
diff --git a/.github/images/sparsestep_prostate_2.png b/.github/images/sparsestep_prostate_2.png
Binary files differ
new file mode 100644
index 0000000..b76492f
--- /dev/null
+++ b/.github/images/sparsestep_prostate_2.png
@@ -1,36 +1,145 @@
-SparseStep R Package
-====================
+# SparseStep R Package

-Paper: [SparseStep: Approximating the Counting Norm for Sparse
+SparseStep is an R package for sparse regularized regression and provides an
+alternative to methods such as best subset selection, elastic net, lasso, and
+lars. The SparseStep method is introduced in the following paper:
+
+[SparseStep: Approximating the Counting Norm for Sparse
 Regularization](https://arxiv.org/abs/1701.06967) by G.J.J. van den Burg,
 P.J.F. Groenen, and A. Alfons (*Arxiv preprint arXiv:1701.06967 [stat.ME]*,
 2017).

-GitHub:
-[https://github.com/GjjvdBurg/SparseStep](https://github.com/GjjvdBurg/SparseStep).
-
-Introduction
-------------
-
-This R package implements the SparseStep method for solving the regression
-problem with a sparsity constraint on the parameters. The package is
-extensively documented through the builtin R documentation. See:
-
- ?'sparsestep-package'
- ?sparsestep
- ?path.sparsestep
-
-for more information.
-
-Installation
-------------
-
-This package can be installed through CRAN:
-
- install.packages('sparsestep')
-
-Reference
----------
+This R package can be easily installed by running
+``install.packages('sparsestep')`` in R. If you use the package in your work,
+please cite the above reference using, for instance, the following BibTeX
+entry:
+
+```bibtex
+@article{vandenburg2017sparsestep,
+ title = {{SparseStep}: Approximating the Counting Norm for Sparse Regularization},
+ author = {{Van den Burg}, G. J. J. and Groenen, P. J. F. and Alfons, A.},
+ journal = {arXiv preprint arXiv:1701.06967},
+ year = {2017}
+}
+```
+
+## Introduction
+
+The SparseStep method solves the regression problem regularized with the
+[`l_0` norm](https://en.wikipedia.org/wiki/Lp_space#When_p_=_0). Since the
+`l_0` term is highly non-convex and therefore difficult to optimize, this
+non-convexity is introduced gradually in SparseStep during optimization. As in
+other regularized regression methods such as ridge regression and lasso, a
+regularization parameter ``lambda`` can be specified to control the amount of
+regularization. The choice of regularization parameter affects how many
+non-zero variables remain in the final model.
+
+We will give a quick guide to SparseStep using the Prostate dataset from the
+book [Elements of Statistical
+Learning](https://web.stanford.edu/~hastie/ElemStatLearn/).
+
+We will show a few examples of running SparseStep on the Prostate dataset from
+the [lasso2](https://cran.r-project.org/web/packages/lasso2/index.html)
+package. First we load the data and create a data matrix and outcome vector:
+
+```r
+> prostate <-
+> read.table("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data")
+> X <- prostate[prostate$train == T, c(-1, -10)]
+> X <- as.matrix(X)
+> y <- prostate[prostate$train == T, 1]
+> y <- as.vector(y)
+```
+
+The easiest way to fit a SparseStep model is to use the ``path.sparsestep``
+function. This estimates the entire path of solutions for the SparseStep model
+for different values of the regularization parameter using a [golden section
+search](https://en.wikipedia.org/wiki/Golden-section_search) algorithm.
+
+```r
+> path <- path.sparsestep(X, y)
+Found maximum value of lambda: 2^( 7 )
+Found minimum value of lambda: 2^( -3 )
+Running search in interval [ -3 , 7 ] ...
+Running search in interval [ -3 , 2 ] ...
+Running search in interval [ -3 , -0.5 ] ...
+Running search in interval [ -3 , -1.75 ] ...
+Running search in interval [ -0.5 , 2 ] ...
+Running search in interval [ -0.5 , 0.75 ] ...
+Running search in interval [ 0.125 , 0.75 ] ...
+Running search in interval [ 2 , 7 ] ...
+
+> plot(path, col=1:nrow(path$beta)) # col specifies colors to matplot
+> legend('topleft', legend=rownames(path$beta), lty=1, col=1:nrow(path$beta))
+```
+
+In the resulting plot we can see the coefficients of the features that are
+included in the model at different values of ``lambda``:
+
+
+
+The coefficients of the model can be obtained using ``coef(path)``, which
+returns a sparse matrix:
+
+```r
+> coef(path)
+9 x 9 sparse Matrix of class "dgCMatrix"
+                   s0           s1           s2          s3           s4         s5        s6       s7
+Intercept  1.31349155  1.313491553  1.313491553  1.31349155  1.313491553 1.31349155 1.3134916 1.313492
+lweight   -0.11336968 -0.113485291            .           .            .          .         .        .
+age        0.02010188  0.020182049  0.018605327  0.01491472  0.018704172 0.01623212         .        .
+lbph      -0.05698125 -0.059026246 -0.069116923           .            .          .         .        .
+svi        0.03511645            .            .           .            .          .         .        .
+lcp        0.41845469  0.423398063  0.420516410  0.43806447  0.433449263 0.38174743 0.3887863        .
+gleason    0.22438690  0.222333394  0.236944796  0.23503609            .          .         .        .
+pgg45     -0.00911273 -0.009084031 -0.008949463 -0.00853420 -0.004328518          .         .        .
+lpsa       0.57545508  0.580111724  0.561063637  0.53017309  0.528953966 0.51473225 0.5336907 0.754266
+                s8
+Intercept 1.313492
+lweight          .
+age              .
+lbph             .
+svi              .
+lcp              .
+gleason          .
+pgg45            .
+lpsa             .
+```
+
+Note that the final model included in ``coef(beta)`` is a intercept-only
+model, which is generally not very useful. Predicting out-of-sample data can
+be done easily using the ``predict`` function.
+
+By default SparseStep centers the regressors and outcome variable ``y`` and
+normalizes the regressors ``X`` to ensure that the regularization is applied
+evenly among them and the intercept is not penalized. If you prefer to use a
+constant term in the regression and penalize this as well, you'll have to
+transform the input data and disable the intercept:
+
+```r
+> Z <- cbind(constant=1, X)
+> path <- path.sparsestep(Z, y, intercept=F)
+...
+> plot(path, col=1:nrow(path$beta))
+> legend('bottomright', legend=rownames(path$beta), lty=1, col=1:nrow(path$beta))
+```
+
+Note that since we add the constant through the data matrix it is subject to
+regularization and therefore sparsity:
+
+
+
+For more information and examples, please see the documentation included with
+the package. In particular, the following pages are good places to start:
+
+```r
+> ?'sparsestep-package'
+> ?sparsestep
+> ?path.sparsestep
+```
+
+## Reference

 If you use SparseStep in any of your projects, please cite the paper using the
 information available through the R command:
@@ -51,26 +160,10 @@ or use the following BibTeX code:
 keywords = {Statistics - Methodology, 62J05, 62J07},
 }

-License
-------
-
- Copyright 2016, G.J.J. van den Burg.
-
- SparseStep is free software: you can redistribute it and/or modify
- it under the terms of the GNU General Public License as published by
- the Free Software Foundation, either version 3 of the License, or
- (at your option) any later version.
-
- SparseStep is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- GNU General Public License for more details.
-
- You should have received a copy of the GNU General Public License
- along with SparseStep. If not, see <http://www.gnu.org/licenses/>.
-
- For more information please contact:
-
- G.J.J. van den Burg
- email: gertjanvandenburg@gmail.com
+## Notes
+This package is licensed under GPLv3. Please see the LICENSE file for more
+information. If you have any questions or comments about this package, please
+open an issue [on GitHub](https://github.com/GjjvdBurg/sparsestep) (don't
+hesitate, you're helping to make this project better for everyone!). If you
+prefer to use email, please write to ``gertjanvandenburg at gmail dot com``.
