author: Gertjan van den Burg <gertjanvandenburg@gmail.com> 2017-12-12 20:19:12 -0500
committer: Gertjan van den Burg <gertjanvandenburg@gmail.com> 2017-12-12 20:19:12 -0500
commit: 7d255c08c589a443aa72ff247b46022204a2ef22
tree: 68c8f872966852d5627cef748da05612f693e4ef
parent: added gridsearch and extended gensvm class
added documentation
-rw-r--r-- CHANGELOG.rst 2
-rw-r--r-- README.rst 267
-rw-r--r-- docs/Makefile 20
-rw-r--r-- docs/conf.py 212
-rw-r--r-- docs/index.rst 25
5 files changed, 526 insertions, 0 deletions
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
new file mode 100644
index 0000000..9a203e1
--- /dev/null
+++ b/CHANGELOG.rst
@@ -0,0 +1,2 @@
+Change Log
+==========
diff --git a/README.rst b/README.rst
index e69de29..0182103 100644
--- a/README.rst
+++ b/README.rst
@@ -0,0 +1,267 @@
+GenSVM Python Package
+=====================
+
+This is the documentation of the Python package for the GenSVM classifier,
+introduced in `GenSVM: A Generalized Multiclass Support Vector Machine
+<http://www.jmlr.org/papers/v17/14-526.html>`_ by `Gerrit J.J. van den Burg
+<https://gertjanvandenburg.com>`_ and `Patrick J.F. Groenen
+<https://personal.eur.nl/groenen/>`_.
+
+The source code of this package is available on GitHub at:
+`https://github.com/GjjvdBurg/PyGenSVM
+<https://github.com/GjjvdBurg/PyGenSVM>`_.
+
+Installation
+------------
+
+GenSVM can be easily installed through pip:
+
+.. code:: bash
+
+ pip install gensvm
+
+Usage
+-----
+
+The package contains two classes to fit the GenSVM model: :class:`GenSVM` and
+:class:`GenSVMGridSearchCV`. These classes respectively fit a single GenSVM
+model or fit a series of models for a parameter grid search. The interface to
+these classes is the same as that of classifiers in `Scikit-Learn
+<http://scikit-learn.org/stable/index.html>`_, so users familiar with
+Scikit-Learn should have no trouble using this package. Below we show some
+examples of using the GenSVM classifier and the GenSVMGridSearchCV class in
+practice.
+
+In the examples we assume that we have loaded the `iris dataset
+<http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html>`_
+from Scikit-Learn as follows:
+
+.. code:: python
+
+ >>> from sklearn.datasets import load_iris
+ >>> from sklearn.model_selection import train_test_split
+ >>> from sklearn.preprocessing import maxabs_scale
+ >>> X, y = load_iris(return_X_y=True)
+ >>> X = maxabs_scale(X)
+ >>> X_train, X_test, y_train, y_test = train_test_split(X, y)
+
+Note that we scale the data using the `maxabs_scale
+<http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.maxabs_scale.html>`_
+function. This scales the columns of the data matrix to ``[-1, 1]`` without
+breaking sparsity. Scaling the dataset can have a significant effect on the
+computation time of GenSVM and is `generally recommended for SVMs
+<https://stats.stackexchange.com/q/65094>`_.
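+
+As a small illustration of what this scaling does, the sketch below applies
+``maxabs_scale`` to a made-up matrix (the values are hypothetical and not taken
+from the iris dataset). Each column is divided by its largest absolute value,
+so zero entries stay zero:
+
+.. code:: python
+
+    import numpy as np
+    from sklearn.preprocessing import maxabs_scale
+
+    # Hypothetical toy data with several zero entries.
+    X_toy = np.array([[0.0, 2.0, -4.0],
+                      [1.0, 0.0, 8.0],
+                      [0.0, -4.0, 0.0]])
+
+    X_scaled = maxabs_scale(X_toy)
+
+    # The largest absolute value in each column is now 1.
+    col_max = np.abs(X_scaled).max(axis=0)
+    # Zero entries are preserved, which keeps sparse data sparse.
+    zeros_kept = np.array_equal(X_scaled == 0, X_toy == 0)
+    print(col_max, zeros_kept)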
+
+
+Example 1: Fitting a single GenSVM model
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Let's start by fitting the most basic GenSVM model on the training data:
+
+.. code:: python
+
+ >>> from gensvm import GenSVM
+ >>> clf = GenSVM()
+ >>> clf.fit(X_train, y_train)
+ GenSVM(coef=0.0, degree=2.0, epsilon=1e-06, gamma='auto', kappa=0.0,
+ kernel='linear', kernel_eigen_cutoff=1e-08, lmd=1e-05,
+ max_iter=100000000.0, p=1.0, random_state=None, verbose=0,
+ weights='unit')
+
+
+With the model fitted, we can predict the test dataset:
+
+.. code:: python
+
+ >>> y_pred = clf.predict(X_test)
+
+Next, we can compute a score for the predictions. The GenSVM class has a
+``score`` method which computes the `accuracy_score
+<http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html>`_
+for the predictions. In the GenSVM paper, the `adjusted Rand index
+<https://en.wikipedia.org/wiki/Rand_index#Adjusted_Rand_index>`_ is often used
+to compare performance. We illustrate both options below (your results may be
+different depending on the exact train/test split):
+
+.. code:: python
+
+ >>> clf.score(X_test, y_test)
+ 1.0
+ >>> from sklearn.metrics import adjusted_rand_score
+ >>> adjusted_rand_score(clf.predict(X_test), y_test)
+ 1.0
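+
+The two scores measure different things. As a minimal sketch with made-up
+labels: accuracy requires the predicted labels to match exactly, while the
+adjusted Rand index only compares how the samples are grouped, so it is
+unaffected by a renaming of the classes:
+
+.. code:: python
+
+    from sklearn.metrics import accuracy_score, adjusted_rand_score
+
+    # Hypothetical labels: the same grouping, but every class is renamed.
+    y_true = [0, 0, 1, 1, 2, 2]
+    y_renamed = [1, 1, 2, 2, 0, 0]
+
+    acc = accuracy_score(y_true, y_renamed)       # no label matches exactly
+    ari = adjusted_rand_score(y_true, y_renamed)  # the grouping is identical
+    print(acc, ari)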
+
+We can try this again with different model parameters. For instance, we can
+turn on verbosity and use the Euclidean norm in the GenSVM model by setting
+``p = 2``:
+
+.. code:: python
+
+ >>> clf2 = GenSVM(verbose=True, p=2)
+ >>> clf2.fit(X_train, y_train)
+ Starting main loop.
+ Dataset:
+ n = 112
+ m = 4
+ K = 3
+ Parameters:
+ kappa = 0.000000
+ p = 2.000000
+ lambda = 0.0000100000000000
+ epsilon = 1e-06
+
+ iter = 0, L = 3.4499531579689533, Lbar = 7.3369415851139745, reldiff = 1.1266786095824437
+ ...
+ Optimization finished, iter = 4046, loss = 0.0230726364692517, rel. diff. = 0.0000009998645783
+ Number of support vectors: 9
+ GenSVM(coef=0.0, degree=2.0, epsilon=1e-06, gamma='auto', kappa=0.0,
+ kernel='linear', kernel_eigen_cutoff=1e-08, lmd=1e-05,
+ max_iter=100000000.0, p=2, random_state=None, verbose=True,
+ weights='unit')
+
+For other parameters that can be tuned in the GenSVM model, see `GenSVM`_.
+
+
+Example 2: Fitting a GenSVM model with a "warm start"
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+One of the key features of the GenSVM classifier is that training can be
+accelerated by using so-called "warm-starts". This way the optimization can be
+started in a location that is closer to the final solution than a random
+starting position would be. To support this, the ``fit`` method of the GenSVM
+class has an optional ``seed_V`` parameter. We'll illustrate how this can be
+used below.
+
+We start with a relatively large value for the ``epsilon`` parameter in the
+model. This is the stopping parameter that determines how long the
+optimization continues (and therefore how exact the fit is).
+
+.. code:: python
+
+ >>> clf1 = GenSVM(epsilon=1e-3)
+ >>> clf1.fit(X_train, y_train)
+ ...
+ >>> clf1.n_iter_
+ 163
+
+The ``n_iter_`` attribute tells us how many iterations the optimization ran.
+Now, we can use the solution of this model to start the training for the next
+model:
+
+.. code:: python
+
+ >>> clf2 = GenSVM(epsilon=1e-8)
+ >>> clf2.fit(X_train, y_train, seed_V=clf1.combined_coef_)
+ ...
+ >>> clf2.n_iter_
+ 3196
+
+Compare this to a model with the same stopping parameter, but without the warm
+start:
+
+.. code:: python
+
+ >>> clf2.fit(X_train, y_train)
+ ...
+ >>> clf2.n_iter_
+ 3699
+
+So we saved about 500 iterations! This effect will be especially significant
+with large datasets and when you try out many parameter configurations.
+Therefore this technique is built into the `GenSVMGridSearchCV`_ class that
+can be used to do a grid search of parameters.
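+
+The same idea extends to a sequence of stopping parameters. The helper below
+is a hypothetical sketch and not part of the package: it relies only on the
+``seed_V`` argument of ``fit`` and the ``combined_coef_`` attribute shown
+above, and the name ``fit_with_warm_starts`` is made up for illustration.
+
+.. code:: python
+
+    def fit_with_warm_starts(make_model, X, y, epsilons):
+        """Fit one model per epsilon, seeding each fit with the previous one.
+
+        ``make_model`` is any callable that accepts ``epsilon`` and returns
+        an estimator with a GenSVM-style ``fit`` (for instance ``GenSVM``).
+        """
+        seed = None
+        model = None
+        for eps in epsilons:
+            model = make_model(epsilon=eps)
+            if seed is None:
+                # First fit: no previous solution is available yet.
+                model.fit(X, y)
+            else:
+                # Later fits start from the previous solution.
+                model.fit(X, y, seed_V=seed)
+            seed = model.combined_coef_
+        return model
+
+With the data from above, this could be called as
+``fit_with_warm_starts(GenSVM, X_train, y_train, [1e-3, 1e-5, 1e-8])``.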
+
+
+Example 3: Running a GenSVM grid search
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Often when we're fitting a machine learning model such as GenSVM, we have to
+try several parameter configurations to figure out which one performs best on
+our given dataset. This is usually combined with `cross validation
+<http://scikit-learn.org/stable/modules/cross_validation.html>`_ to avoid
+overfitting. To do this efficiently and to make use of warm starts, the
+`GenSVMGridSearchCV`_ class is available. This class works in the same way as
+the `GridSearchCV
+<http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html>`_
+class of `Scikit-Learn <http://scikit-learn.org/stable/index.html>`_, but uses
+the GenSVM C library for speed.
+
+To do a grid search, we first have to define the parameters that we want to
+vary and what values we want to try:
+
+.. code:: python
+
+ >>> from gensvm import GenSVMGridSearchCV
+ >>> param_grid = {'p': [1.0, 2.0], 'lmd': [1e-8, 1e-6, 1e-4, 1e-2, 1.0], 'kappa': [-0.9, 0.0] }
+
+For the values that are not varied in the parameter grid, the default values
+will be used. This means that if you want to change a specific value (such as
+``epsilon`` for instance), you can add this to the parameter grid as a
+parameter with a single value to try (e.g. ``'epsilon': [1e-8]``).
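+
+To get a feeling for the size of such a grid, the combinations can be
+enumerated with ``ParameterGrid`` from Scikit-Learn. This only illustrates how
+the grid expands; whether ``GenSVMGridSearchCV`` uses ``ParameterGrid``
+internally is not assumed here.
+
+.. code:: python
+
+    from sklearn.model_selection import ParameterGrid
+
+    param_grid = {'p': [1.0, 2.0],
+                  'lmd': [1e-8, 1e-6, 1e-4, 1e-2, 1.0],
+                  'kappa': [-0.9, 0.0]}
+
+    grid = list(ParameterGrid(param_grid))
+    # 2 values of p x 5 values of lmd x 2 values of kappa = 20 configurations.
+    print(len(grid))
+    print(grid[0])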
+
+Running the grid search is now straightforward:
+
+.. code:: python
+
+ >>> gg = GenSVMGridSearchCV(param_grid)
+ >>> gg.fit(X_train, y_train)
+ GenSVMGridSearchCV(cv=None, iid=True,
+ param_grid={'p': [1.0, 2.0], 'lmd': [1e-08, 1e-06, 0.0001, 0.01, 1.0], 'kappa': [-0.9, 0.0]},
+ refit=True, return_train_score=True, scoring=None, verbose=0)
+
+Note that if we have set ``refit=True`` (the default), then we can use the
+`GenSVMGridSearchCV`_ instance to predict or score using the best estimator
+found in the grid search:
+
+.. code:: python
+
+ >>> y_pred = gg.predict(X_test)
+ >>> gg.score(X_test, y_test)
+ 1.0
+
+A nice feature borrowed from `Scikit-Learn <http://scikit-learn.org>`_ is that
+the results from the grid search can be represented as a ``pandas`` DataFrame:
+
+.. code:: python
+
+ >>> from pandas import DataFrame
+ >>> df = DataFrame(gg.cv_results_)
+
+This can make it easier to explore the results of the grid search.
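+
+As a rough sketch of such an exploration, the snippet below sorts a results
+table by rank. The dictionary is a placeholder standing in for
+``gg.cv_results_``: its values are made up, and the column names are an
+assumption based on the conventions of Scikit-Learn's ``GridSearchCV``.
+
+.. code:: python
+
+    from pandas import DataFrame
+
+    # Placeholder values standing in for gg.cv_results_; not real results.
+    cv_results = {
+        'param_p': [1.0, 2.0, 1.0],
+        'param_lmd': [1e-6, 1e-6, 1e-2],
+        'mean_test_score': [0.95, 0.97, 0.90],
+        'rank_test_score': [2, 1, 3],
+    }
+    df = DataFrame(cv_results)
+
+    # Put the best-ranked configurations first.
+    best_first = df.sort_values('rank_test_score')
+    print(best_first.head())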
+
+Known Limitations
+-----------------
+
+The following are known limitations that are on the roadmap for a future
+release of the package. If you need any of these features, please vote on them
+on the linked GitHub issues (this can make us add them sooner!).
+
+1. `Support for sparse matrices
+ <https://github.com/GjjvdBurg/PyGenSVM/issues/1>`_. SciPy supports sparse
+ matrices, as does the GenSVM C library. Getting them to work together
+ requires some time. In the meantime, if you really want to use sparse data
+ with GenSVM (this can lead to significant speedups!), check out the GenSVM
+ C library.
+2. `Specification of instance weights
+ <https://github.com/GjjvdBurg/PyGenSVM/issues/2>`_. Currently the package
+ allows for two modes of instance weights: ``unit`` weights where each
+ instance gets weight 1 and ``group`` weights where instances get weights
+ inversely proportional to the size of their class. In the future, we want
+ to allow the user to specify a vector of weights as well.
+3. `Specification of class misclassification weights
+ <https://github.com/GjjvdBurg/PyGenSVM/issues/3>`_. Currently, incorrectly
+ classifying an object from class A as class C is as bad as incorrectly
+ classifying an object from class B as class C. Depending on the
+ application, this may not be the desired effect. Adding class
+ misclassification weights can solve this issue.
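+
+To make the ``group`` weighting from item 2 concrete, the sketch below computes
+weights inversely proportional to class size with NumPy. It is an illustration
+and not the package's code: the function name is hypothetical, and the
+normalisation (weights summing to the number of instances) is an assumption.
+
+.. code:: python
+
+    import numpy as np
+
+    def group_weights(y):
+        """Weight each instance by n / (K * size of its class)."""
+        y = np.asarray(y)
+        classes, inverse, counts = np.unique(
+            y, return_inverse=True, return_counts=True)
+        n, K = y.size, classes.size
+        # counts[inverse] gives, per instance, the size of its class.
+        return n / (K * counts[inverse])
+
+    # Hypothetical labels: class sizes 3, 2 and 1.
+    w = group_weights([0, 0, 0, 1, 1, 2])
+    print(w)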
+
+Questions and Issues
+--------------------
+
+If you have any questions or encounter any issues with using this package,
+please raise them on `GitHub <https://github.com/GjjvdBurg/PyGenSVM>`_.
+
+License
+-------
+
+This package is licensed under the GNU General Public License version 3.
+Copyright G.J.J. van den Burg, excluding the sections of the code that are
+explicitly marked to come from Scikit-Learn.
+
diff --git a/docs/Makefile b/docs/Makefile
new file mode 100644
index 0000000..ac6c1f0
--- /dev/null
+++ b/docs/Makefile
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line.
+SPHINXOPTS =
+SPHINXBUILD = python -msphinx
+SPHINXPROJ = GenSVM
+SOURCEDIR = .
+BUILDDIR = ../../gensvm_docs
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
diff --git a/docs/conf.py b/docs/conf.py
new file mode 100644
index 0000000..a5c06ea
--- /dev/null
+++ b/docs/conf.py
@@ -0,0 +1,212 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+#
+# GenSVM documentation build configuration file, created by
+# sphinx-quickstart on Tue Sep 26 00:11:33 2017.
+#
+# This file is execfile()d with the current directory set to its
+# containing dir.
+#
+# Note that not all possible configuration values are present in this
+# autogenerated file.
+#
+# All configuration values have a default; values that are commented out
+# serve to show the default.
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#
+
+import os
+import sys
+import sphinx_rtd_theme
+
+from unittest.mock import MagicMock
+
+sys.path.insert(0, os.path.abspath('..'))
+
+# mock out C extensions for ReadTheDocs
+# (http://docs.readthedocs.io/en/latest/faq.html)
+class Mock(MagicMock):
+    @classmethod
+    def __getattr__(cls, name):
+        return MagicMock()
+
+MOCK_MODULES = ['gensvm.wrapper']
+sys.modules.update((mod_name, Mock()) for mod_name in MOCK_MODULES)
+
+
+# -- General configuration ------------------------------------------------
+
+# If your documentation needs a minimal Sphinx version, state it here.
+#
+# needs_sphinx = '1.0'
+
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = ['sphinx.ext.autodoc',
+ 'sphinx.ext.doctest',
+ 'sphinx.ext.coverage',
+ 'sphinx.ext.mathjax',
+ 'sphinx.ext.githubpages',
+ 'sphinx.ext.napoleon',
+ 'sphinx.ext.intersphinx'
+ ]
+
+# intersphinx mappings (https://kev.inburke.com/kevin/sphinx-interlinks/)
+# https://stackoverflow.com/q/46080681
+intersphinx_mapping = {
+ 'sklearn': ('http://scikit-learn.org/stable', None)
+ }
+
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+
+# The suffix(es) of source filenames.
+# You can specify multiple suffix as a list of string:
+#
+# source_suffix = ['.rst', '.md']
+source_suffix = '.rst'
+
+# The master toctree document.
+master_doc = 'index'
+
+# General information about the project.
+project = 'GenSVM'
+copyright = '2017, Gertjan van den Burg'
+author = 'Gertjan van den Burg'
+
+# The version info for the project you're documenting, acts as replacement for
+# |version| and |release|, also used in various other places throughout the
+# built documents.
+#
+# The short X.Y version.
+#version = '0.1.0'
+# The full version, including alpha/beta/rc tags.
+#release = '0.1.0'
+__version__ = "1.0.0"
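+try:
+    pth = os.path.realpath(__file__)
+    dr = os.path.dirname(pth)
+    init_pth = os.path.realpath(os.path.join(dr, '..', 'gensvm',
+                                             '__init__.py'))
+    # The first line of gensvm/__init__.py is expected to hold the version.
+    with open(init_pth) as fh:
+        line = fh.readlines()[0]
+    __version__ = line.split('=')[-1].strip("\n '")
+except (OSError, IndexError):
+    # Keep the default version above if the file is missing or empty.
+    pass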
+
+version = __version__
+release = version
+
+# The language for content autogenerated by Sphinx. Refer to documentation
+# for a list of supported languages.
+#
+# This is also used if you do content translation via gettext catalogs.
+# Usually you set "language" from the command line for these cases.
+language = None
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# These patterns also affect html_static_path and html_extra_path
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+
+# The name of the Pygments (syntax highlighting) style to use.
+pygments_style = 'sphinx'
+
+# If true, `todo` and `todoList` produce output, else they produce nothing.
+todo_include_todos = False
+
+
+# -- Options for HTML output ----------------------------------------------
+
+# The theme to use for HTML and HTML Help pages. See the documentation for
+# a list of builtin themes.
+#
+html_theme = 'sphinx_rtd_theme'
+html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
+
+# Theme options are theme-specific and customize the look and feel of a theme
+# further. For a list of options available for each theme, see the
+# documentation.
+#
+# html_theme_options = {}
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
+
+# Custom sidebar templates, must be a dictionary that maps document names
+# to template names.
+#
+# This is required for the alabaster theme
+# refs: http://alabaster.readthedocs.io/en/latest/installation.html#sidebars
+html_sidebars = {
+ '**': [
+ 'about.html',
+ 'navigation.html',
+ 'relations.html', # needs 'show_related': True theme option to display
+ 'searchbox.html',
+ 'donate.html',
+ ]
+}
+
+
+# -- Options for HTMLHelp output ------------------------------------------
+
+# Output file base name for HTML help builder.
+htmlhelp_basename = 'GenSVMdoc'
+
+
+# -- Options for LaTeX output ---------------------------------------------
+
+latex_elements = {
+ # The paper size ('letterpaper' or 'a4paper').
+ #
+ # 'papersize': 'letterpaper',
+
+ # The font size ('10pt', '11pt' or '12pt').
+ #
+ # 'pointsize': '10pt',
+
+ # Additional stuff for the LaTeX preamble.
+ #
+ # 'preamble': '',
+
+ # Latex figure (float) alignment
+ #
+ # 'figure_align': 'htbp',
+}
+
+# Grouping the document tree into LaTeX files. List of tuples
+# (source start file, target name, title,
+# author, documentclass [howto, manual, or own class]).
+latex_documents = [
+ (master_doc, 'GenSVM.tex', 'GenSVM Documentation',
+ 'Gertjan van den Burg', 'manual'),
+]
+
+
+# -- Options for manual page output ---------------------------------------
+
+# One entry per manual page. List of tuples
+# (source start file, name, description, authors, manual section).
+man_pages = [
+ (master_doc, 'gensvm', 'GenSVM Documentation',
+ [author], 1)
+]
+
+
+# -- Options for Texinfo output -------------------------------------------
+
+# Grouping the document tree into Texinfo files. List of tuples
+# (source start file, target name, title, author,
+# dir menu entry, description, category)
+texinfo_documents = [
+ (master_doc, 'GenSVM', 'GenSVM Documentation',
+ author, 'GenSVM', 'Implementation of the GenSVM classifier in Python',
+ 'Miscellaneous'),
+]
diff --git a/docs/index.rst b/docs/index.rst
new file mode 100644
index 0000000..d8f8425
--- /dev/null
+++ b/docs/index.rst
@@ -0,0 +1,25 @@
+.. GenSVM documentation master file, created by
+ sphinx-quickstart on Tue Sep 26 00:11:33 2017.
+ You can adapt this file completely to your liking, but it should at least
+ contain the root `toctree` directive.
+
+
+.. include:: ../README.rst
+
+Classes
+=======
+
+The complete documentation of the available GenSVM classes is presented below.
+
+GenSVM
+------
+
+.. autoclass:: gensvm.core.GenSVM
+
+GenSVMGridSearchCV
+------------------
+
+.. autoclass:: gensvm.gridsearch.GenSVMGridSearchCV
+
+
+.. include:: ../CHANGELOG.rst