# SparseStep R Package

SparseStep is an R package for sparse regularized regression and provides an 
alternative to methods such as best subset selection, elastic net, lasso, and 
lars. The SparseStep method is introduced in the following paper:

[SparseStep: Approximating the Counting Norm for Sparse 
Regularization](https://arxiv.org/abs/1701.06967) by G.J.J. van den Burg, 
P.J.F. Groenen, and A. Alfons (*Arxiv preprint arXiv:1701.06967 [stat.ME]*, 
2017).

This R package can be easily installed by running 
``install.packages('sparsestep')`` in R. If you use the package in your work, 
please cite the above reference using, for instance, the following BibTeX 
entry:

```bibtex
@article{vandenburg2017sparsestep,
  title = {{SparseStep}: Approximating the Counting Norm for Sparse Regularization},
  author = {{Van den Burg}, G. J. J. and Groenen, P. J. F. and Alfons, A.},
  journal = {arXiv preprint arXiv:1701.06967},
  year = {2017}
}
```

## Introduction

The SparseStep method solves the regression problem regularized with the 
[`l_0` norm](https://en.wikipedia.org/wiki/Lp_space#When_p_=_0). Since the 
`l_0` term is highly non-convex and therefore difficult to optimize, this 
non-convexity is introduced gradually in SparseStep during optimization. As in 
other regularized regression methods such as ridge regression and lasso, a 
regularization parameter ``lambda`` can be specified to control the amount of 
regularization.  The choice of regularization parameter affects how many 
non-zero variables remain in the final model.

We will give a quick guide to SparseStep using the Prostate dataset from the 
book [Elements of Statistical 
Learning](https://web.stanford.edu/~hastie/ElemStatLearn/). 

We will show a few examples of running SparseStep on the Prostate dataset from 
the [lasso2](https://cran.r-project.org/web/packages/lasso2/index.html) 
package. First we load the data and create a data matrix and outcome vector:

```r
> prostate <- read.table("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data")
> X <- prostate[prostate$train == T, c(-1, -10)]
> X <- as.matrix(X)
> y <- prostate[prostate$train == T, 1]
> y <- as.vector(y)
```

The easiest way to fit a SparseStep model is to use the ``path.sparsestep`` 
function. This estimates the entire path of solutions for the SparseStep model 
for different values of the regularization parameter using a [golden section 
search](https://en.wikipedia.org/wiki/Golden-section_search) algorithm.

```r
> path <- path.sparsestep(X, y)
Found maximum value of lambda: 2^( 7 )
Found minimum value of lambda: 2^( -3 )
Running search in interval [ -3 , 7 ] ...
Running search in interval [ -3 , 2 ] ...
Running search in interval [ -3 , -0.5 ] ...
Running search in interval [ -3 , -1.75 ] ...
Running search in interval [ -0.5 , 2 ] ...
Running search in interval [ -0.5 , 0.75 ] ...
Running search in interval [ 0.125 , 0.75 ] ...
Running search in interval [ 2 , 7 ] ...

> plot(path, col=1:nrow(path$beta))     # col specifies colors to matplot
> legend('topleft', legend=rownames(path$beta), lty=1, col=1:nrow(path$beta))
```

In the resulting plot we can see the coefficients of the features that are 
included in the model at different values of ``lambda``:

![SparseStep regression on Prostate dataset](./.github/images/sparsestep_prostate_1.png)

The coefficients of the model can be obtained using ``coef(path)``, which 
returns a sparse matrix:

```r
> coef(path)
9 x 9 sparse Matrix of class "dgCMatrix"
                   s0           s1           s2          s3           s4         s5        s6       s7
Intercept  1.31349155  1.313491553  1.313491553  1.31349155  1.313491553 1.31349155 1.3134916 1.313492
lweight   -0.11336968 -0.113485291  .            .           .           .          .         .
age        0.02010188  0.020182049  0.018605327  0.01491472  0.018704172 0.01623212 .         .
lbph      -0.05698125 -0.059026246 -0.069116923  .           .           .          .         .
svi        0.03511645  .            .            .           .           .          .         .
lcp        0.41845469  0.423398063  0.420516410  0.43806447  0.433449263 0.38174743 0.3887863 .
gleason    0.22438690  0.222333394  0.236944796  0.23503609  .           .          .         .
pgg45     -0.00911273 -0.009084031 -0.008949463 -0.00853420 -0.004328518 .          .         .
lpsa       0.57545508  0.580111724  0.561063637  0.53017309  0.528953966 0.51473225 0.5336907 0.754266
                s8
Intercept 1.313492
lweight   .
age       .
lbph      .
svi       .
lcp       .
gleason   .
pgg45     .
lpsa      .
```

Note that the final model included in ``coef(beta)`` is a intercept-only 
model, which is generally not very useful. Predicting out-of-sample data can 
be done easily using the ``predict`` function.

By default SparseStep centers the regressors and outcome variable ``y`` and 
normalizes the regressors ``X`` to ensure that the regularization is applied 
evenly among them and the intercept is not penalized. If you prefer to use a 
constant term in the regression and penalize this as well, you'll have to 
transform the input data and disable the intercept:

```r
> Z <- cbind(constant=1, X)
> path <- path.sparsestep(Z, y, intercept=F)
...
> plot(path, col=1:nrow(path$beta))
> legend('bottomright', legend=rownames(path$beta), lty=1, col=1:nrow(path$beta))
```

Note that since we add the constant through the data matrix it is subject to 
regularization and therefore sparsity:

![SparseStep regression on Prostate dataset (with constant)](./.github/images/sparsestep_prostate_2.png)

For more information and examples, please see the documentation included with 
the package. In particular, the following pages are good places to start:

```r
> ?'sparsestep-package'
> ?sparsestep
> ?path.sparsestep
```

## Reference

If you use SparseStep in any of your projects, please cite the paper using the 
information available through the R command:

    citation('sparsestep')

or use the following BibTeX code:

    @article{van2017sparsestep,
      title = {{SparseStep}: Approximating the Counting Norm for Sparse Regularization},
      author = {Gerrit J.J. {van den Burg} and Patrick J.F. Groenen and Andreas Alfons},
      journal = {arXiv preprint arXiv:1701.06967},
      archiveprefix = {arXiv},
      year = {2017},
      eprint = {1701.06967},
      url = {https://arxiv.org/abs/1701.06967},
      primaryclass = {stat.ME},
      keywords = {Statistics - Methodology, 62J05, 62J07},
    }

## Notes

This package is licensed under GPLv3. Please see the LICENSE file for more 
information. If you have any questions or comments about this package, please 
open an issue [on GitHub](https://github.com/GjjvdBurg/sparsestep) (don't 
hesitate, you're helping to make this project better for everyone!). If you 
prefer to use email, please write to ``gertjanvandenburg at gmail dot com``.