# GenSVM Python Package
[Build status](https://travis-ci.org/GjjvdBurg/PyGenSVM)
[Documentation status](https://gensvm.readthedocs.io/en/latest/?badge=latest)
This is the Python package for the GenSVM multiclass classifier by [Gerrit
J.J. van den Burg](https://gertjanvandenburg.com) and [Patrick J.F.
Groenen](https://personal.eur.nl/groenen/).
**Useful links:**
- [PyGenSVM on GitHub](https://github.com/GjjvdBurg/PyGenSVM)
- [PyGenSVM on PyPI](https://pypi.org/project/gensvm/)
- [Package documentation](https://gensvm.readthedocs.io/en/latest/)
- Journal paper: [GenSVM: A Generalized Multiclass Support Vector
Machine](http://www.jmlr.org/papers/v17/14-526.html) JMLR, 17(225):1−42,
2016.
- There is also an [R package](https://github.com/GjjvdBurg/RGenSVM)
- Or you can directly use [the C library](https://github.com/GjjvdBurg/GenSVM)
## Installation
**Before** GenSVM can be installed, a working NumPy installation is required,
so GenSVM can be installed using the following command:
```bash
$ pip install numpy && pip install gensvm
```
If you encounter any errors, please [open an issue on
GitHub](https://github.com/GjjvdBurg/PyGenSVM). Don't hesitate, you're helping
to make this project better!
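
To confirm that both packages are visible to the interpreter after installation, a small check like the one below can help. This is only a sketch using the standard library; the helper name `is_installed` is made up for this example and is not part of GenSVM.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


# Report whether the two packages named in the install command are available.
for name in ("numpy", "gensvm"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either package is reported as missing, rerunning the `pip` command above in the same environment is the first thing to try.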
## Citing
If you use this package in your research, please cite the paper, for instance
using the following BibTeX entry:
```bib
@article{JMLR:v17:14-526,
author = {{van den Burg}, G. J. J. and Groenen, P. J. F.},
title = {{GenSVM}: A Generalized Multiclass Support Vector Machine},
journal = {Journal of Machine Learning Research},
year = {2016},
volume = {17},
number = {225},
pages = {1-42},
url = {http://jmlr.org/papers/v17/14-526.html}
}
```
## Usage
The package contains two classes to fit the GenSVM model: [GenSVM] and
[GenSVMGridSearchCV]. These classes respectively fit a single GenSVM model or
fit a series of models for a parameter grid search. The interface to these
classes is the same as that of classifiers in [Scikit-Learn], so users
familiar with Scikit-Learn should have no trouble using this package. Below
we will show some examples of using the GenSVM classifier and the
GenSVMGridSearchCV class in practice.
In the examples we assume that we have loaded the [iris
dataset](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html)
from Scikit-Learn as follows:
```python
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.preprocessing import MaxAbsScaler
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
>>> scaler = MaxAbsScaler().fit(X_train)
>>> X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```
Note that we scale the data using the
[MaxAbsScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html)
class. This scales the columns of the data matrix to ``[-1, 1]`` without
breaking sparsity. Scaling the dataset can have a significant effect on the
computation time of GenSVM and is [generally recommended for
SVMs](https://stats.stackexchange.com/q/65094).
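
To make the scaling step concrete, here is a pure-Python sketch of the idea behind max-abs scaling: each column is divided by its largest absolute value in the training data. It is an illustration only, not Scikit-Learn's implementation, and the helper names `fit_max_abs` and `transform_max_abs` are invented for this example.

```python
def fit_max_abs(rows):
    """Return the largest absolute value found in each column of `rows`."""
    n_cols = len(rows[0])
    scales = [max(abs(row[j]) for row in rows) for j in range(n_cols)]
    # Columns that are entirely zero get a scale of 1 to avoid division by zero.
    return [s if s > 0 else 1.0 for s in scales]


def transform_max_abs(rows, scales):
    """Divide every column by its scale; entries that are zero stay zero."""
    return [[value / s for value, s in zip(row, scales)] for row in rows]


train = [[5.0, 0.0, -2.0], [2.5, 4.0, 0.0], [0.0, -8.0, 1.0]]
test = [[10.0, 2.0, 0.0]]

scales = fit_max_abs(train)  # learned from the training rows only
train_scaled = transform_max_abs(train, scales)
test_scaled = transform_max_abs(test, scales)

print(scales)        # [5.0, 8.0, 2.0]
print(train_scaled)  # every training value lies in [-1, 1]
print(test_scaled)   # [[2.0, 0.25, 0.0]]
```

Because the scales come from the training data alone, test values can fall outside ``[-1, 1]``, as the first test entry does here. Zeros are left untouched, which is why this kind of scaling does not break sparsity.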
### Example 1: Fitting a single GenSVM model
Let's start by fitting the most basic GenSVM model on the training data:
```python
>>> from gensvm import GenSVM
>>> clf = GenSVM()
>>> clf.fit(X_train, y_train)
GenSVM(coef=0.0, degree=2.0, epsilon=1e-06, gamma='auto', kappa=0.0,
kernel='linear', kernel_eigen_cutoff=1e-08, lmd=1e-05,
max_iter=100000000.0, p=1.0, random_state=None, verbose=0,
weights='unit')
```
With the model fitted, we can predict the test dataset:
```python
>>> y_pred = clf.predict(X_test)
```
Next, we can compute a score for the predictions. The GenSVM class has a
``score`` method which computes the
[accuracy_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)
for the predictions. In the GenSVM paper, the [adjusted Rand
index](https://en.wikipedia.org/wiki/Rand_index#Adjusted_Rand_index) is often
used to compare performance. We illustrate both options below (your results
may be different depending on the exact train/test split):
```python
>>> clf.score(X_test, y_test)
1.0
>>> from sklearn.metrics import adjusted_rand_score
>>> adjusted_rand_score(clf.predict(X_test), y_test)
1.0
```
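
The two scores measure different things, which the following pure-Python sketch tries to show. It is a simplified re-implementation for illustration, not the Scikit-Learn code, and it assumes at least two samples. Accuracy counts exact label matches, while the adjusted Rand index only looks at how the samples are grouped.

```python
from collections import Counter
from math import comb


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index computed from pair counts of the two labelings."""
    n = len(labels_a)
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:  # degenerate case, e.g. one group in both labelings
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


y_true = [0, 0, 1, 1, 2, 2]
y_same = [0, 0, 1, 1, 2, 2]      # identical labels
y_renamed = [1, 1, 2, 2, 0, 0]   # same grouping, but every label is renamed

print(accuracy(y_true, y_same), adjusted_rand_index(y_true, y_same))
print(accuracy(y_true, y_renamed), adjusted_rand_index(y_true, y_renamed))
```

For the renamed labels the accuracy drops to zero while the adjusted Rand index stays at one, so for a classifier it is worth reading the two numbers side by side.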
We can try this again with different model parameters. For instance, we can
turn on verbosity and use the Euclidean norm in the GenSVM model by setting ``p = 2``:
```python
>>> clf2 = GenSVM(verbose=True, p=2)
>>> clf2.fit(X_train, y_train)
Starting main loop.
Dataset:
n = 112
m = 4
K = 3
Parameters:
kappa = 0.000000
p = 2.000000
lambda = 0.0000100000000000
epsilon = 1e-06
iter = 0, L = 3.4499531579689533, Lbar = 7.3369415851139745, reldiff = 1.1266786095824437
...
Optimization finished, iter = 4046, loss = 0.0230726364692517, rel. diff. = 0.0000009998645783
Number of support vectors: 9
GenSVM(coef=0.0, degree=2.0, epsilon=1e-06, gamma='auto', kappa=0.0,
kernel='linear', kernel_eigen_cutoff=1e-08, lmd=1e-05,
max_iter=100000000.0, p=2, random_state=None, verbose=True,
weights='unit')
```
For other parameters that can be tuned in the GenSVM model, see [GenSVM].
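
As a reminder of what the choice of ``p`` means numerically, the sketch below computes a generic ℓp norm for ``p = 1`` and ``p = 2``. It only illustrates the norm itself; how GenSVM applies the norm inside its loss function is described in the paper, and the function name `lp_norm` is made up for this example.

```python
def lp_norm(values, p):
    """Generic l_p norm: (sum of |v|^p) raised to the power 1/p."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


errors = [3.0, 4.0]

print(lp_norm(errors, 1))  # sum of absolute values: 7.0
print(lp_norm(errors, 2))  # Euclidean norm: 5.0
```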
### Example 2: Fitting a GenSVM model with a "warm start"
One of the key features of the GenSVM classifier is that training can be
accelerated by using so-called "warm-starts". This way the optimization can be
started in a location that is closer to the final solution than a random
starting position would be. To support this, the ``fit`` method of the GenSVM
class has an optional ``seed_V`` parameter. We'll illustrate how this can be
used below.
We start with a relatively large value for the ``epsilon`` parameter in the
model. This is the stopping parameter that determines how long the
optimization continues (and therefore how exact the fit is).
```python
>>> clf1 = GenSVM(epsilon=1e-3)
>>> clf1.fit(X_train, y_train)
...
>>> clf1.n_iter_
163
```
The ``n_iter_`` attribute tells us how many iterations the model ran. Now, we
can use the solution of this model to start the training for the next model:
```python
>>> clf2 = GenSVM(epsilon=1e-8)
>>> clf2.fit(X_train, y_train, seed_V=clf1.combined_coef_)
...
>>> clf2.n_iter_
3196
```
Compare this to a model with the same stopping parameter, but without the warm
start:
```python
>>> clf2.fit(X_train, y_train)
...
>>> clf2.n_iter_
3699
```
So we saved about 500 iterations! This effect will be especially significant
with large datasets and when you try out many parameter configurations.
Therefore this technique is built into the [GenSVMGridSearchCV] class that can
be used to do a grid search of parameters.
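
The reason a warm start saves iterations can be seen on a toy problem. The sketch below uses plain gradient descent on a one-dimensional quadratic, which is an assumption made for illustration and is not the optimizer used by GenSVM. The `minimize` function and its parameters are hypothetical.

```python
def minimize(start, epsilon, step=0.1, max_iter=10_000):
    """Minimize f(x) = (x - 3)^2 by gradient descent.

    Stops once the size of an update drops below `epsilon` and returns the
    final position together with the number of iterations used.
    """
    x = start
    for n_iter in range(1, max_iter + 1):
        update = step * 2.0 * (x - 3.0)  # step size times the gradient of f
        x -= update
        if abs(update) < epsilon:
            return x, n_iter
    return x, max_iter


# Cold start: go straight for the strict stopping criterion.
_, cold_iters = minimize(start=0.0, epsilon=1e-8)

# Warm start: solve loosely first, then continue from that solution.
rough_x, rough_iters = minimize(start=0.0, epsilon=1e-3)
_, warm_iters = minimize(start=rough_x, epsilon=1e-8)

print(cold_iters, rough_iters, warm_iters)
```

The strict run that starts from the rough solution needs fewer iterations than the strict run from scratch. The saving pays off when the rough solution is already available, for example from a previously fitted model with similar parameters.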
### Example 3: Running a GenSVM grid search
Often when we're fitting a machine learning model such as GenSVM, we have to
try several parameter configurations to figure out which one performs best on
our given dataset. This is usually combined with [cross
validation](http://scikit-learn.org/stable/modules/cross_validation.html) to
avoid overfitting. To do this efficiently and to make use of warm starts, the
[GenSVMGridSearchCV] class is available. This class works in the same way as
the
[GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
class of [Scikit-Learn], but uses the GenSVM C library for speed.
To do a grid search, we first have to define the parameters that we want to
vary and what values we want to try:
```python
>>> from gensvm import GenSVMGridSearchCV
>>> param_grid = {'p': [1.0, 2.0], 'lmd': [1e-8, 1e-6, 1e-4, 1e-2, 1.0], 'kappa': [-0.9, 0.0] }
```
For the values that are not varied in the parameter grid, the default values
will be used. This means that if you want to change a specific value (such as
``epsilon`` for instance), you can add this to the parameter grid as a
parameter with a single value to try (e.g. ``'epsilon': [1e-8]``).
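
To get a feeling for how many models a grid implies, the grid can be expanded by hand with the standard library. This is a sketch of the general idea of a parameter grid; it makes no claim about how [GenSVMGridSearchCV] enumerates configurations internally, and `expand_grid` is a made-up helper.

```python
from itertools import product

param_grid = {
    'p': [1.0, 2.0],
    'lmd': [1e-8, 1e-6, 1e-4, 1e-2, 1.0],
    'kappa': [-0.9, 0.0],
}


def expand_grid(grid):
    """Return one dict per combination of the values in `grid`."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


configs = expand_grid(param_grid)
print(len(configs))  # 2 * 5 * 2 = 20 configurations

# Adding a parameter with a single value keeps the number of configurations
# the same, but fixes that value in every configuration.
fixed = expand_grid({**param_grid, 'epsilon': [1e-8]})
print(len(fixed), fixed[0])
```

Each configuration is additionally fitted once per cross-validation fold, so the total number of model fits grows quickly with the size of the grid.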
Running the grid search is now straightforward:
```python
>>> gg = GenSVMGridSearchCV(param_grid)
>>> gg.fit(X_train, y_train)
GenSVMGridSearchCV(cv=None, iid=True,
param_grid={'p': [1.0, 2.0], 'lmd': [1e-08, 1e-06, 0.0001, 0.01, 1.0], 'kappa': [-0.9, 0.0]},
refit=True, return_train_score=True, scoring=None, verbose=0)
```
Note that if we have set ``refit=True`` (the default), then we can use the
[GenSVMGridSearchCV] instance to predict or score using the best estimator
found in the grid search:
```python
>>> y_pred = gg.predict(X_test)
>>> gg.score(X_test, y_test)
1.0
```
A nice feature borrowed from [Scikit-Learn] is that the results from the grid
search can be represented as a ``pandas`` DataFrame:
```python
>>> from pandas import DataFrame
>>> df = DataFrame(gg.cv_results_)
```
This can make it easier to explore the results of the grid search.
## Known Limitations
The following are known limitations that are on the roadmap for a future
release of the package. If you need any of these features, please vote on them
on the linked GitHub issues (this can make us add them sooner!).
1. [Support for sparse
matrices](https://github.com/GjjvdBurg/PyGenSVM/issues/1). SciPy supports
sparse matrices, as does the GenSVM C library. Getting them to work
together requires some additional effort. In the meantime, if you really
want to use sparse data with GenSVM (this can lead to significant
speedups!), check out the GenSVM C library.
2. [Specification of class misclassification
weights](https://github.com/GjjvdBurg/PyGenSVM/issues/3). Currently,
incorrectly classifying an object from class A to class C is as bad as
incorrectly classifying an object from class B to class C. Depending on the
application, this may not be the desired effect. Adding class
misclassification weights can solve this issue.
## Questions and Issues
If you have any questions or encounter any issues with using this package,
please ask them on [GitHub](https://github.com/GjjvdBurg/PyGenSVM).
## License
This package is licensed under the GNU General Public License version 3.
Copyright (c) G.J.J. van den Burg, excluding the sections of the code that are
explicitly marked to come from Scikit-Learn.
[Scikit-Learn]: http://scikit-learn.org/stable/index.html
[GenSVM]: https://gensvm.readthedocs.io/en/latest/#gensvm
[GenSVMGridSearchCV]: https://gensvm.readthedocs.io/en/latest/#gensvmgridsearchcv