# SparseStep R Package

SparseStep is an R package for sparse regularized regression that provides an 
alternative to methods such as best subset selection, the elastic net, the 
lasso, and LARS. The SparseStep method is introduced in the following paper:

[SparseStep: Approximating the Counting Norm for Sparse 
Regularization](https://arxiv.org/abs/1701.06967) by G.J.J. van den Burg, 
P.J.F. Groenen, and A. Alfons (*Arxiv preprint arXiv:1701.06967 [stat.ME]*, 
2017).

This R package can be easily installed by running 
``install.packages('sparsestep')`` in R. If you use the package in your work, 
please cite the above reference using, for instance, the following BibTeX 
entry:

```bibtex
@article{vandenburg2017sparsestep,
  title = {{SparseStep}: Approximating the Counting Norm for Sparse Regularization},
  author = {{Van den Burg}, G. J. J. and Groenen, P. J. F. and Alfons, A.},
  journal = {arXiv preprint arXiv:1701.06967},
  year = {2017}
}
```

## Introduction

The SparseStep method solves the regression problem regularized with the 
[`l_0` norm](https://en.wikipedia.org/wiki/Lp_space#When_p_=_0). Since the 
`l_0` term is highly non-convex and therefore difficult to optimize, 
SparseStep introduces this non-convexity gradually during optimization. As in 
other regularized regression methods such as ridge regression and the lasso, a 
regularization parameter ``lambda`` can be specified to control the amount of 
regularization. The choice of this parameter affects how many non-zero 
variables remain in the final model.
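As a minimal sketch of how ``lambda`` affects sparsity, the ``sparsestep`` 
function can be fitted at a few regularization levels (the data and the 
``lambda`` values below are purely illustrative assumptions, not from the 
paper):

```r
library(sparsestep)

# Simulated data: only coefficients 1 and 3 are truly non-zero.
set.seed(1)
X <- matrix(rnorm(100 * 5), ncol = 5)
y <- X %*% c(2, 0, -1, 0, 0) + rnorm(100)

# Fit SparseStep at a few illustrative lambda values; larger lambda
# should leave fewer non-zero coefficients in the model.
fit <- sparsestep(X, y, lambda = c(0.1, 1, 10))
print(coef(fit))
```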

We will give a quick guide to SparseStep using the Prostate dataset from the 
book [Elements of Statistical 
Learning](https://web.stanford.edu/~hastie/ElemStatLearn/). 

We will show a few examples of running SparseStep on the Prostate dataset from 
the [lasso2](https://cran.r-project.org/web/packages/lasso2/index.html) 
package. First we load the data and create a data matrix and outcome vector:

```r
> prostate <- read.table("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data")
> X <- prostate[prostate$train == T, c(-1, -10)]
> X <- as.matrix(X)
> y <- prostate[prostate$train == T, 1]
> y <- as.vector(y)
```

The easiest way to fit a SparseStep model is to use the ``path.sparsestep`` 
function. This estimates the entire path of solutions for the SparseStep model 
for different values of the regularization parameter using a [golden section 
search](https://en.wikipedia.org/wiki/Golden-section_search) algorithm.

```r
> path <- path.sparsestep(X, y)
Found maximum value of lambda: 2^( 7 )
Found minimum value of lambda: 2^( -3 )
Running search in interval [ -3 , 7 ] ...
Running search in interval [ -3 , 2 ] ...
Running search in interval [ -3 , -0.5 ] ...
Running search in interval [ -3 , -1.75 ] ...
Running search in interval [ -0.5 , 2 ] ...
Running search in interval [ -0.5 , 0.75 ] ...
Running search in interval [ 0.125 , 0.75 ] ...
Running search in interval [ 2 , 7 ] ...

> plot(path, col=1:nrow(path$beta))     # col specifies colors to matplot
> legend('topleft', legend=rownames(path$beta), lty=1, col=1:nrow(path$beta))
```

In the resulting plot we can see the coefficients of the features that are 
included in the model at different values of ``lambda``:

![SparseStep regression on Prostate dataset](./.github/images/sparsestep_prostate_1.png)

The coefficients of the model can be obtained using ``coef(path)``, which 
returns a sparse matrix:

```r
> coef(path)
9 x 9 sparse Matrix of class "dgCMatrix"
                   s0           s1           s2          s3           s4         s5        s6       s7
Intercept  1.31349155  1.313491553  1.313491553  1.31349155  1.313491553 1.31349155 1.3134916 1.313492
lweight   -0.11336968 -0.113485291  .            .           .           .          .         .
age        0.02010188  0.020182049  0.018605327  0.01491472  0.018704172 0.01623212 .         .
lbph      -0.05698125 -0.059026246 -0.069116923  .           .           .          .         .
svi        0.03511645  .            .            .           .           .          .         .
lcp        0.41845469  0.423398063  0.420516410  0.43806447  0.433449263 0.38174743 0.3887863 .
gleason    0.22438690  0.222333394  0.236944796  0.23503609  .           .          .         .
pgg45     -0.00911273 -0.009084031 -0.008949463 -0.00853420 -0.004328518 .          .         .
lpsa       0.57545508  0.580111724  0.561063637  0.53017309  0.528953966 0.51473225 0.5336907 0.754266
                s8
Intercept 1.313492
lweight   .
age       .
lbph      .
svi       .
lcp       .
gleason   .
pgg45     .
lpsa      .
```

Note that the final model included in ``coef(path)`` is an intercept-only 
model, which is generally not very useful. Predictions for out-of-sample data 
can be made using the ``predict`` function.
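For instance, predictions for the held-out observations of the Prostate 
dataset loaded above can be sketched as follows (this assumes the ``predict`` 
method accepts a matrix of new observations, matching the columns used to fit 
the model):

```r
# Select the held-out rows (train == FALSE), with the same columns
# used for fitting: drop the outcome (column 1) and the train flag.
X.test <- as.matrix(prostate[prostate$train == FALSE, c(-1, -10)])
y.test <- prostate[prostate$train == FALSE, 1]

# Predicted outcomes for each model along the fitted path.
yhat <- predict(path, X.test)
```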

By default SparseStep centers the regressors and outcome variable ``y`` and 
normalizes the regressors ``X`` to ensure that the regularization is applied 
evenly among them and the intercept is not penalized. If you prefer to use a 
constant term in the regression and penalize this as well, you'll have to 
transform the input data and disable the intercept:

```r
> Z <- cbind(constant=1, X)
> path <- path.sparsestep(Z, y, intercept=F)
...
> plot(path, col=1:nrow(path$beta))
> legend('bottomright', legend=rownames(path$beta), lty=1, col=1:nrow(path$beta))
```

Note that since the constant is now added through the data matrix, it is 
subject to regularization and can therefore be set to zero:

![SparseStep regression on Prostate dataset (with constant)](./.github/images/sparsestep_prostate_2.png)

For more information and examples, please see the documentation included with 
the package. In particular, the following pages are good places to start:

```r
> ?'sparsestep-package'
> ?sparsestep
> ?path.sparsestep
```

## Reference

If you use SparseStep in any of your projects, please cite the paper using the 
information available through the R command:

```r
citation('sparsestep')
```

or use the following BibTeX code:

```bibtex
@article{van2017sparsestep,
  title = {{SparseStep}: Approximating the Counting Norm for Sparse Regularization},
  author = {Gerrit J.J. {van den Burg} and Patrick J.F. Groenen and Andreas Alfons},
  journal = {arXiv preprint arXiv:1701.06967},
  archiveprefix = {arXiv},
  year = {2017},
  eprint = {1701.06967},
  url = {https://arxiv.org/abs/1701.06967},
  primaryclass = {stat.ME},
  keywords = {Statistics - Methodology, 62J05, 62J07},
}
```

## Notes

This package is licensed under GPLv3. Please see the LICENSE file for more 
information. If you have any questions or comments about this package, please 
open an issue [on GitHub](https://github.com/GjjvdBurg/sparsestep) (don't 
hesitate, you're helping to make this project better for everyone!). If you 
prefer to use email, please write to ``gertjanvandenburg at gmail dot com``.