GenSVM C Package
================

GenSVM: A Generalized Multiclass Support Vector Machine.

Author: Gertjan van den Burg (<gertjanvandenburg@gmail.com>)

Introduction
------------

This is the C library for the GenSVM method. GenSVM is a generalized multiclass
support vector machine, which you can use for classification problems with
multiple classes. Thanks to its support for warm starts, GenSVM can be trained
efficiently in cross-validation or grid search setups. See the
[paper]() for more information, and Usage below for how to use GenSVM.

The library has support for datasets in [MSVMpack]() and [LibSVM/SVMlight]() 
format, and can take advantage of sparse datasets. There is also (preliminary) 
support for nonlinear GenSVM through kernels.
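
To make the sparse LibSVM/SVMlight format concrete: each line holds a label
followed by ``index:value`` pairs, and features that are not listed are taken
to be zero. Below is a minimal sketch of parsing one such line. The function
``parse_libsvm_line`` is a hypothetical helper written for this illustration;
it is not the reader used inside GenSVM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, NOT part of the GenSVM API: parse one line of the
 * form "<label> <index>:<value> <index>:<value> ...".
 * Returns the number of pairs stored, or -1 on malformed input or when
 * the line holds more than `capacity` pairs. */
long parse_libsvm_line(const char *line, double *label,
                       long *indices, double *values, size_t capacity)
{
    assert(line && label && indices && values);

    char *end;
    *label = strtod(line, &end);
    if (end == line)
        return -1;                      /* no label found */

    size_t n = 0;
    const char *p = end;
    for (;;) {
        /* skip whitespace between pairs */
        while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r')
            p++;
        if (*p == '\0' || *p == '#')    /* end of line or trailing comment */
            break;

        long idx = strtol(p, &end, 10);
        if (end == p || *end != ':')
            return -1;                  /* expected "<index>:" */
        p = end + 1;

        double val = strtod(p, &end);
        if (end == p || n == capacity)
            return -1;                  /* missing value or buffer full */

        indices[n] = idx;
        values[n] = val;
        n++;
        p = end;
    }
    return (long)n;
}
```

For the line ``2 1:0.5 3:-1.25 4:7`` this stores three pairs with label ``2``;
feature 2 is absent from the line and is therefore zero.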

For documentation on how the library is implemented, see the Doxygen 
documentation available [here](). There are also many unit tests, which you 
can use to further understand how the library works. Test coverage for the 
current version is reported [here]().

Usage
-----

First, download and compile the library. The minimal requirements for
compilation are a working BLAS and LAPACK installation, which you can likely
obtain from your package manager. It is, however, recommended to use the ATLAS
versions of these libraries, since this will give a significant increase in
speed. If you choose not to use ATLAS, remove linking with ``-latlas`` in the
``LDFLAGS`` variable in the Makefile.

Then, compile the library with a simple:

    make

If you would like to run the tests, use ``make test`` on the command line.

After successful compilation, you will have two executables ``gensvm`` and 
``gensvm_grid``. Type:

    ./gensvm

to get an overview of the command line options of the executable (the same
works for ``gensvm_grid``).

The ``gensvm`` executable can be used to train a GenSVM model on a dataset 
with a single hyperparameter configuration, whereas the ``gensvm_grid`` 
executable can be used to run a grid search on a dataset.

Here's an example of using the ``gensvm`` executable on a single dataset, with 
some custom parameters:

    ./gensvm -l 1e-5 -k 1.0 -p 1.5 data/iris.train

This fits the model with regularization parameter ``1e-5``, Huber hinge 
parameter ``1.0`` and lp norm parameter ``1.5``, and default settings 
otherwise. On my computer this yields a model with 18 support vectors in about 
0.1 seconds. The ``gensvm`` executable can also be used to get predictions for 
a test dataset, if it is supplied as final argument to the command. In this 
case, predictions will be printed to stdout, unless an output file is 
specified with the ``-o`` option.
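
To give some intuition for the Huber hinge parameter set with ``-k``, here is
a small sketch of the Huber hinge function as it is defined in the GenSVM
paper, written as a standalone function. The name ``huber_hinge`` is chosen
for this illustration and is an assumption; it is not necessarily how the
library implements or names it.

```c
#include <assert.h>

/* Illustrative only, NOT the library's API.
 * q     : the value the loss is evaluated at
 * kappa : Huber hinge parameter, must be larger than -1 */
double huber_hinge(double q, double kappa)
{
    assert(kappa > -1.0);

    if (q <= -kappa)
        /* linear part for strongly violated margins */
        return 1.0 - q - (kappa + 1.0) / 2.0;
    if (q <= 1.0)
        /* quadratic part, joins the linear part smoothly at q = -kappa */
        return (1.0 - q) * (1.0 - q) / (2.0 * (kappa + 1.0));
    /* no loss once q exceeds 1 */
    return 0.0;
}
```

With ``kappa = 1.0`` as in the command above, ``huber_hinge(0.0, 1.0)`` is
``0.25`` and ``huber_hinge(-3.0, 1.0)`` is ``3.0``. Larger values of the
parameter widen the quadratic region.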

The input to the ``gensvm_grid`` executable is a file (called a grid file),
which specifies the values to try for each of the parameters. See the
``training`` directory for examples and the documentation [here]() for more
info on the file format. One important thing to note is that when the
``repeats`` field has a positive value, a so-called "consistency check" will
be performed after the grid search has finished. This is a robustness check on
the best performing configurations, to find the hyperparameter configuration
that combines the best performance with the smallest training time. In this
robustness check warm starts are not used, to ensure the observations are
independent measurements of training time.
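
Conceptually, a grid search walks over the Cartesian product of the candidate
values of each parameter. The sketch below shows one way to count and
enumerate such a grid. The functions ``grid_size`` and ``grid_decode`` are
hypothetical and written only for this illustration; they say nothing about
how ``gensvm_grid`` orders or stores its configurations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration, NOT GenSVM code.
 * counts[i] is the number of candidate values for parameter i. */
size_t grid_size(const size_t *counts, size_t n_params)
{
    assert(counts != NULL);
    size_t total = 1;
    for (size_t i = 0; i < n_params; i++)
        total *= counts[i];
    return total;
}

/* Turn configuration number `config` into one value index per parameter
 * (mixed-radix decoding, the last parameter varies fastest). */
void grid_decode(size_t config, const size_t *counts, size_t n_params,
                 size_t *choice)
{
    assert(counts != NULL && choice != NULL);
    for (size_t i = n_params; i-- > 0;) {
        assert(counts[i] > 0);
        choice[i] = config % counts[i];
        config /= counts[i];
    }
}
```

For example, with made-up counts ``{3, 2, 2}`` the grid has 12 configurations,
and configuration number 7 corresponds to value indices ``{1, 1, 1}``.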

Here's an example of running ``gensvm_grid`` without repeats on the iris 
dataset:

    ./gensvm_grid training/iris_norepeats.training

On my computer this runs through 342 hyperparameter configurations in about 8
seconds. Alternatively, if consistency checks are desired, we can run:

    ./gensvm_grid training/iris.training

which runs the same grid search but also does 5 consistency repeats for each
of the configurations in the top 5% by performance. Note that performance is
measured by cross-validated accuracy. This example runs in about 13 seconds on
my computer.
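
As a rough sketch of what selecting the best configurations can look like, the
code below sorts a list of accuracy scores and reports how many fall in the
top percentage. This is only one possible reading of "the top 5%": the
function ``top_percent`` is hypothetical, and how GenSVM computes the cutoff
or handles ties is not covered here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparison function for qsort: larger accuracies come first. */
static int cmp_desc(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x < y) - (x > y);
}

/* Hypothetical helper, NOT GenSVM code: sort `acc` in descending order in
 * place and return how many leading entries make up the top `percent`
 * percent, rounded up. */
size_t top_percent(double *acc, size_t n, unsigned percent)
{
    assert(acc != NULL && percent >= 1 && percent <= 100);
    qsort(acc, n, sizeof *acc, cmp_desc);
    return (n * percent + 99) / 100;    /* integer ceiling of n * percent / 100 */
}
```

Under this reading, a grid of 342 configurations with ``percent = 5`` keeps
the 18 highest-scoring entries for the repeats.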

Reference
---------

If you use GenSVM in any of your projects, please cite the GenSVM paper 
available at [link](link). You can use the following BibTeX code:

    bibtex here

License
-------

    Copyright 2016, G.J.J. van den Burg.

    GenSVM is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    GenSVM is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with GenSVM. If not, see <http://www.gnu.org/licenses/>.

    For more information please contact:

    G.J.J. van den Burg
    email: gertjanvandenburg@gmail.com