| -rw-r--r-- | CHANGELOG.rst | 2 |
| -rw-r--r-- | README.rst | 267 |
| -rw-r--r-- | docs/Makefile | 20 |
| -rw-r--r-- | docs/conf.py | 212 |
| -rw-r--r-- | docs/index.rst | 25 |
5 files changed, 526 insertions, 0 deletions
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
new file mode 100644
index 0000000..9a203e1
--- /dev/null
+++ b/CHANGELOG.rst
@@ -0,0 +1,2 @@
+Change Log
+==========
diff --git a/README.rst b/README.rst
new file mode 100644
--- /dev/null
+++ b/README.rst
@@ -0,0 +1,267 @@
+GenSVM Python Package
+=====================
+
+This is the documentation of the Python package for the GenSVM classifier,
+introduced in `GenSVM: A Generalized Multiclass Support Vector Machine
+<http://www.jmlr.org/papers/v17/14-526.html>`_ by `Gerrit J.J. van den Burg
+<https://gertjanvandenburg.com>`_ and `Patrick J.F. Groenen
+<https://personal.eur.nl/groenen/>`_.
+
+The source code of this package is available on GitHub at:
+`https://github.com/GjjvdBurg/PyGenSVM
+<https://github.com/GjjvdBurg/PyGenSVM>`_.
+
+Installation
+------------
+
+GenSVM can be easily installed through pip:
+
+.. code:: bash
+
+    pip install gensvm
+
+Usage
+-----
+
+The package contains two classes to fit the GenSVM model: :class:`GenSVM` and
+:class:`GenSVMGridSearchCV`. These classes respectively fit a single GenSVM
+model or fit a series of models for a parameter grid search. The interface to
+these classes is the same as that of classifiers in `Scikit-Learn <http://scikit-learn.org/stable/index.html>`_ so users
+familiar with `Scikit-Learn <http://scikit-learn.org/stable/index.html>`_ should have no trouble using this package. Below
+we will show some examples of using the GenSVM classifier and the
+GenSVMGridSearchCV class in practice.
+
+In the examples we assume that we have loaded the `iris dataset
+<http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html>`_
+from Scikit-Learn as follows:
+
+.. code:: python
+
+    >>> from sklearn.datasets import load_iris
+    >>> from sklearn.model_selection import train_test_split
+    >>> from sklearn.preprocessing import maxabs_scale
+    >>> X, y = load_iris(return_X_y=True)
+    >>> X = maxabs_scale(X)
+    >>> X_train, X_test, y_train, y_test = train_test_split(X, y)
+
+Note that we scale the data using the `maxabs_scale
+<http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.maxabs_scale.html>`_
+function. This scales the columns of the data matrix to ``[-1, 1]`` without
+breaking sparsity. Scaling the dataset can have a significant effect on the
+computation time of GenSVM and is `generally recommended for SVMs
+<https://stats.stackexchange.com/q/65094>`_.
+
+
+Example 1: Fitting a single GenSVM model
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Let's start by fitting the most basic GenSVM model on the training data:
+
+.. code:: python
+
+    >>> from gensvm import GenSVM
+    >>> clf = GenSVM()
+    >>> clf.fit(X_train, y_train)
+    GenSVM(coef=0.0, degree=2.0, epsilon=1e-06, gamma='auto', kappa=0.0,
+        kernel='linear', kernel_eigen_cutoff=1e-08, lmd=1e-05,
+        max_iter=100000000.0, p=1.0, random_state=None, verbose=0,
+        weights='unit')
+
+
+With the model fitted, we can predict the test dataset:
+
+.. code:: python
+
+    >>> y_pred = clf.predict(X_test)
+
+Next, we can compute a score for the predictions. The GenSVM class has a
+``score`` method which computes the `accuracy_score
+<http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html>`_
+for the predictions. In the GenSVM paper, the `adjusted Rand index
+<https://en.wikipedia.org/wiki/Rand_index#Adjusted_Rand_index>`_ is often used
+to compare performance. We illustrate both options below (your results may be
+different depending on the exact train/test split):
+
+.. code:: python
+
+    >>> clf.score(X_test, y_test)
+    1.0
+    >>> from sklearn.metrics import adjusted_rand_score
+    >>> adjusted_rand_score(clf.predict(X_test), y_test)
+    1.0
+
+We can try this again by changing the model parameters. For instance, we can
+turn on verbosity and use the Euclidean norm in the GenSVM model by setting ``p = 2``:
+
+.. code:: python
+
+    >>> clf2 = GenSVM(verbose=True, p=2)
+    >>> clf2.fit(X_train, y_train)
+    Starting main loop.
+    Dataset:
+        n = 112
+        m = 4
+        K = 3
+    Parameters:
+        kappa = 0.000000
+        p = 2.000000
+        lambda = 0.0000100000000000
+        epsilon = 1e-06
+
+    iter = 0, L = 3.4499531579689533, Lbar = 7.3369415851139745, reldiff = 1.1266786095824437
+    ...
+    Optimization finished, iter = 4046, loss = 0.0230726364692517, rel. diff. = 0.0000009998645783
+    Number of support vectors: 9
+    GenSVM(coef=0.0, degree=2.0, epsilon=1e-06, gamma='auto', kappa=0.0,
+        kernel='linear', kernel_eigen_cutoff=1e-08, lmd=1e-05,
+        max_iter=100000000.0, p=2, random_state=None, verbose=True,
+        weights='unit')
+
+For other parameters that can be tuned in the GenSVM model, see `GenSVM`_.
+
+
+Example 2: Fitting a GenSVM model with a "warm start"
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+One of the key features of the GenSVM classifier is that training can be
+accelerated by using so-called "warm-starts". This way the optimization can be
+started in a location that is closer to the final solution than a random
+starting position would be. To support this, the ``fit`` method of the GenSVM
+class has an optional ``seed_V`` parameter. We'll illustrate how this can be
+used below.
+
+We start with a relatively large value for the ``epsilon`` parameter in the
+model. This is the stopping parameter that determines how long the
+optimization continues (and therefore how exact the fit is).
+
+.. code:: python
+
+    >>> clf1 = GenSVM(epsilon=1e-3)
+    >>> clf1.fit(X_train, y_train)
+    ...
+    >>> clf1.n_iter_
+    163
+
+The ``n_iter_`` attribute tells us how many iterations the model did. Now, we
+can use the solution of this model to start the training for the next model:
+
+.. code:: python
+
+    >>> clf2 = GenSVM(epsilon=1e-8)
+    >>> clf2.fit(X_train, y_train, seed_V=clf1.combined_coef_)
+    ...
+    >>> clf2.n_iter_
+    3196
+
+Compare this to a model with the same stopping parameter, but without the warm
+start:
+
+.. code:: python
+
+    >>> clf2.fit(X_train, y_train)
+    ...
+    >>> clf2.n_iter_
+    3699
+
+So we saved about 500 iterations! This effect will be especially significant
+with large datasets and when you try out many parameter configurations.
+Therefore this technique is built into the `GenSVMGridSearchCV`_ class that
+can be used to do a grid search of parameters.
+
+
+Example 3: Running a GenSVM grid search
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Often when we're fitting a machine learning model such as GenSVM, we have to
+try several parameter configurations to figure out which one performs best on
+our given dataset. This is usually combined with `cross validation
+<http://scikit-learn.org/stable/modules/cross_validation.html>`_ to avoid
+overfitting. To do this efficiently and to make use of warm starts, the
+`GenSVMGridSearchCV`_ class is available. This class works in the same way as
+the `GridSearchCV
+<http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html>`_
+class of `Scikit-Learn <http://scikit-learn.org/stable/index.html>`_, but uses
+the GenSVM C library for speed.
+
+To do a grid search, we first have to define the parameters that we want to
+vary and what values we want to try:
+
+.. code:: python
+
+    >>> from gensvm import GenSVMGridSearchCV
+    >>> param_grid = {'p': [1.0, 2.0], 'lmd': [1e-8, 1e-6, 1e-4, 1e-2, 1.0], 'kappa': [-0.9, 0.0] }
+
+For the values that are not varied in the parameter grid, the default values
+will be used. This means that if you want to change a specific value (such as
+``epsilon`` for instance), you can add this to the parameter grid as a
+parameter with a single value to try (e.g. ``'epsilon': [1e-8]``).
+
+Running the grid search is now straightforward:
+
+.. code:: python
+
+    >>> gg = GenSVMGridSearchCV(param_grid)
+    >>> gg.fit(X_train, y_train)
+    GenSVMGridSearchCV(cv=None, iid=True,
+              param_grid={'p': [1.0, 2.0], 'lmd': [1e-08, 1e-06, 0.0001, 0.01, 1.0], 'kappa': [-0.9, 0.0]},
+              refit=True, return_train_score=True, scoring=None, verbose=0)
+
+Note that if we have set ``refit=True`` (the default), then we can use the
+`GenSVMGridSearchCV`_ instance to predict or score using the best estimator
+found in the grid search:
+
+.. code:: python
+
+    >>> y_pred = gg.predict(X_test)
+    >>> gg.score(X_test, y_test)
+    1.0
+
+A nice feature borrowed from `Scikit-Learn <http://scikit-learn.org>`_ is that
+the results from the grid search can be represented as a ``pandas`` DataFrame:
+
+.. code:: python
+
+    >>> from pandas import DataFrame
+    >>> df = DataFrame(gg.cv_results_)
+
+This can make it easier to explore the results of the grid search.
+
+Known Limitations
+-----------------
+
+The following are known limitations that are on the roadmap for a future
+release of the package. If you need any of these features, please vote on them
+on the linked GitHub issues (this can make us add them sooner!).
+
+1. `Support for sparse matrices
+   <https://github.com/GjjvdBurg/PyGenSVM/issues/1>`_. SciPy supports sparse
+   matrices, as does the GenSVM C library. Getting them to work together
+   requires some time. In the meantime, if you really want to use sparse data
+   with GenSVM (this can lead to significant speedups!), check out the GenSVM
+   C library.
+2. `Specification of instance weights
+   <https://github.com/GjjvdBurg/PyGenSVM/issues/2>`_. Currently the package
+   allows for two modes of instance weights: ``unit`` weights where each
+   instance gets weight 1 and ``group`` weights where instances get weights
+   inversely proportional to the size of their class. In the future, we want
+   to allow the user to specify a vector of weights as well.
+3. `Specification of class misclassification weights
+   <https://github.com/GjjvdBurg/PyGenSVM/issues/3>`_. Currently, incorrectly
+   classifying an object from class A to class C is as bad as incorrectly
+   classifying an object from class B to class C. Depending on the
+   application, this may not be the desired effect. Adding class
+   misclassification weights can solve this issue.
+
+Questions and Issues
+--------------------
+
+If you have any questions or encounter any issues with using this package,
+please ask them on `GitHub <https://github.com/GjjvdBurg/PyGenSVM>`_.
+
+License
+-------
+
+This package is licensed under the GNU General Public License version 3.
+Copyright G.J.J. van den Burg, excluding the sections of the code that are
+explicitly marked to come from Scikit-Learn.
+
diff --git a/docs/Makefile b/docs/Makefile
new file mode 100644
index 0000000..ac6c1f0
--- /dev/null
+++ b/docs/Makefile
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line.
+SPHINXOPTS =
+SPHINXBUILD = python -msphinx
+SPHINXPROJ = GenSVM
+SOURCEDIR = .
+BUILDDIR = ../../gensvm_docs
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
diff --git a/docs/conf.py b/docs/conf.py
new file mode 100644
index 0000000..a5c06ea
--- /dev/null
+++ b/docs/conf.py
@@ -0,0 +1,212 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+#
+# GenSVM documentation build configuration file, created by
+# sphinx-quickstart on Tue Sep 26 00:11:33 2017.
+#
+# This file is execfile()d with the current directory set to its
+# containing dir.
+#
+# Note that not all possible configuration values are present in this
+# autogenerated file.
+#
+# All configuration values have a default; values that are commented out
+# serve to show the default.
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#
+
+import os
+import sys
+import sphinx_rtd_theme
+
+from unittest.mock import MagicMock
+
+sys.path.insert(0, os.path.abspath('..'))
+
+# mock out C extensions for ReadTheDocs
+# (http://docs.readthedocs.io/en/latest/faq.html)
+class Mock(MagicMock):
+    @classmethod
+    def __getattr__(cls, name):
+        return MagicMock()
+
+MOCK_MODULES = ['gensvm.wrapper']
+sys.modules.update((mod_name, Mock()) for mod_name in MOCK_MODULES)
+
+
+# -- General configuration ------------------------------------------------
+
+# If your documentation needs a minimal Sphinx version, state it here.
+#
+# needs_sphinx = '1.0'
+
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = ['sphinx.ext.autodoc',
+    'sphinx.ext.doctest',
+    'sphinx.ext.coverage',
+    'sphinx.ext.mathjax',
+    'sphinx.ext.githubpages',
+    'sphinx.ext.napoleon',
+    'sphinx.ext.intersphinx'
+    ]
+
+# intersphinx mappings (https://kev.inburke.com/kevin/sphinx-interlinks/)
+# https://stackoverflow.com/q/46080681
+intersphinx_mapping = {
+        'sklearn': ('http://scikit-learn.org/stable', None)
+        }
+
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+
+# The suffix(es) of source filenames.
+# You can specify multiple suffix as a list of string:
+#
+# source_suffix = ['.rst', '.md']
+source_suffix = '.rst'
+
+# The master toctree document.
+master_doc = 'index'
+
+# General information about the project.
+project = 'GenSVM'
+copyright = '2017, Gertjan van den Burg'
+author = 'Gertjan van den Burg'
+
+# The version info for the project you're documenting, acts as replacement for
+# |version| and |release|, also used in various other places throughout the
+# built documents.
+#
+# The short X.Y version.
+#version = '0.1.0'
+# The full version, including alpha/beta/rc tags.
+#release = '0.1.0'
+__version__ = "1.0.0"
+try:
+    pth = os.path.realpath(__file__)
+    dr = os.path.dirname(pth)
+    init_pth = os.path.realpath(os.path.join(dr, '..', 'gensvm',
+        '__init__.py'))
+    line = open(init_pth).readlines()[0]
+    __version__ = line.split('=')[-1].strip("\n '")
+except:
+    pass
+
+version = __version__
+release = version
+
+# The language for content autogenerated by Sphinx. Refer to documentation
+# for a list of supported languages.
+#
+# This is also used if you do content translation via gettext catalogs.
+# Usually you set "language" from the command line for these cases.
+language = None
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# These patterns also affect html_static_path and html_extra_path
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+
+# The name of the Pygments (syntax highlighting) style to use.
+pygments_style = 'sphinx'
+
+# If true, `todo` and `todoList` produce output, else they produce nothing.
+todo_include_todos = False
+
+
+# -- Options for HTML output ----------------------------------------------
+
+# The theme to use for HTML and HTML Help pages. See the documentation for
+# a list of builtin themes.
+#
+html_theme = 'sphinx_rtd_theme'
+html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
+
+# Theme options are theme-specific and customize the look and feel of a theme
+# further. For a list of options available for each theme, see the
+# documentation.
+#
+# html_theme_options = {}
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
+
+# Custom sidebar templates, must be a dictionary that maps document names
+# to template names.
+#
+# This is required for the alabaster theme
+# refs: http://alabaster.readthedocs.io/en/latest/installation.html#sidebars
+html_sidebars = {
+    '**': [
+        'about.html',
+        'navigation.html',
+        'relations.html',  # needs 'show_related': True theme option to display
+        'searchbox.html',
+        'donate.html',
+    ]
+}
+
+
+# -- Options for HTMLHelp output ------------------------------------------
+
+# Output file base name for HTML help builder.
+htmlhelp_basename = 'GenSVMdoc'
+
+
+# -- Options for LaTeX output ---------------------------------------------
+
+latex_elements = {
+    # The paper size ('letterpaper' or 'a4paper').
+    #
+    # 'papersize': 'letterpaper',
+
+    # The font size ('10pt', '11pt' or '12pt').
+    #
+    # 'pointsize': '10pt',
+
+    # Additional stuff for the LaTeX preamble.
+    #
+    # 'preamble': '',
+
+    # Latex figure (float) alignment
+    #
+    # 'figure_align': 'htbp',
+}
+
+# Grouping the document tree into LaTeX files. List of tuples
+# (source start file, target name, title,
+#  author, documentclass [howto, manual, or own class]).
+latex_documents = [
+    (master_doc, 'GenSVM.tex', 'GenSVM Documentation',
+     'Gertjan van den Burg', 'manual'),
+]
+
+
+# -- Options for manual page output ---------------------------------------
+
+# One entry per manual page. List of tuples
+# (source start file, name, description, authors, manual section).
+man_pages = [
+    (master_doc, 'gensvm', 'GenSVM Documentation',
+     [author], 1)
+]
+
+
+# -- Options for Texinfo output -------------------------------------------
+
+# Grouping the document tree into Texinfo files. List of tuples
+# (source start file, target name, title, author,
+#  dir menu entry, description, category)
+texinfo_documents = [
+    (master_doc, 'GenSVM', 'GenSVM Documentation',
+     author, 'GenSVM', 'Implementation of the GenSVM classifier in Python',
+     'Miscellaneous'),
+]
diff --git a/docs/index.rst b/docs/index.rst
new file mode 100644
index 0000000..d8f8425
--- /dev/null
+++ b/docs/index.rst
@@ -0,0 +1,25 @@
+.. GenSVM documentation master file, created by
+   sphinx-quickstart on Tue Sep 26 00:11:33 2017.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+
+
+.. include:: ../README.rst
+
+Classes
+=======
+
+The complete documentation of the available GenSVM classes is presented below.
+
+GenSVM
+------
+
+.. autoclass:: gensvm.core.GenSVM
+
+GenSVMGridSearchCV
+------------------
+
+.. autoclass:: gensvm.gridsearch.GenSVMGridSearchCV
+
+
+.. include:: ../CHANGELOG.rst
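
Note on the version lookup in docs/conf.py: the added code reads the first line of gensvm/__init__.py and keeps whatever follows the last ``=``, stripped of newlines, spaces and single quotes. The parsing step can be sketched in isolation as below. This is an illustration only, not part of the commit: the helper name `parse_version_line` is hypothetical, the `"1.0.0"` fallback mirrors the default set in conf.py, and two behaviours are added assumptions (double quotes are stripped as well, and a line without ``=`` falls back to the default instead of being returned whole).

```python
def parse_version_line(line, default="1.0.0"):
    """Return the version string from a line like ``__version__ = '0.1.0'``.

    Hypothetical helper for illustration; it is not part of the commit.
    """
    if "=" not in line:
        # Assumption: with no assignment on the line, use the default.
        return default
    # Keep the text after the last '=' and strip whitespace and quotes.
    value = line.split("=")[-1].strip("\n '\"")
    # An empty right-hand side also falls back to the default.
    return value or default


print(parse_version_line("__version__ = '0.1.0'\n"))
print(parse_version_line("# no assignment here"))
```

Keeping the parsing in a small function makes the fallback explicit, which the bare ``except: pass`` in conf.py otherwise hides.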
