| author | Gertjan van den Burg <gertjanvandenburg@gmail.com> | 2020-03-06 16:48:47 +0000 |
|---|---|---|
| committer | Gertjan van den Burg <gertjanvandenburg@gmail.com> | 2020-03-06 16:48:47 +0000 |
| commit | 95bca1b119088b500ad7539253e6025c2449bb98 (patch) | |
| tree | 308002e560b0144c17c057d807f5c6f488fa9d3e | |
| parent | Merge branch 'packaging' (diff) | |
Update documentation to use markdown where possible
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | CHANGELOG.md (renamed from CHANGELOG.rst) | 21 |
| -rw-r--r-- | Makefile | 4 |
| -rw-r--r-- | README.md | 291 |
| -rw-r--r-- | README.rst | 313 |
| -rw-r--r-- | docs/CHANGELOG.rst | 49 |
| -rw-r--r-- | docs/README.rst | 300 |
| -rw-r--r-- | docs/auto_functions.txt | 71 |
| -rw-r--r-- | docs/cls_gensvm.txt | 139 |
| -rw-r--r-- | docs/cls_gridsearch.txt | 285 |
| -rw-r--r-- | docs/generate_autodocs.py | 10 |
| -rw-r--r-- | docs/index.rst | 12 |
| -rw-r--r-- | docs/kernels.txt (renamed from docs/kernels.rst) | 0 |
| -rw-r--r-- | setup.py | 6 |
13 files changed, 1159 insertions, 342 deletions
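Note on the heading changes: the CHANGELOG hunk in this commit swaps reStructuredText underlined headings (`Change Log` over `----------`, `Version 0.2.4` over `^^^^^^^^^^^^^`) for Markdown `##` and `###` headings. The sketch below illustrates that one transformation using only the standard library. It is not the tool used for this commit, and the function name `rst_headings_to_markdown` and the `LEVELS` mapping are hypothetical, chosen to match the layout of this CHANGELOG. The Makefile in this commit goes the other way, calling `m2r` to regenerate `.rst` files from the Markdown sources before the Sphinx build.

```python
# Assumption: underline characters map to fixed heading levels as in this
# CHANGELOG ('-' for the title, '^' for version entries). reStructuredText
# itself assigns levels by order of appearance, so this is a simplification.
LEVELS = {"=": 1, "-": 2, "^": 3}


def rst_headings_to_markdown(text):
    """Rewrite underlined RST headings as ATX-style Markdown headings."""
    lines = text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        title = lines[i]
        underline = lines[i + 1] if i + 1 < len(lines) else ""
        # A heading is a non-empty title followed by a row made of a single
        # repeated underline character that is at least as long as the title.
        is_heading = (
            bool(title.strip())
            and len(underline) >= len(title)
            and underline[:1] in LEVELS
            and set(underline) == {underline[0]}
        )
        if is_heading:
            out.append("#" * LEVELS[underline[0]] + " " + title.strip())
            i += 2  # skip the underline row
        else:
            out.append(title)
            i += 1
    return "\n".join(out)


sample = (
    "Change Log\n"
    "----------\n"
    "\n"
    "Version 0.2.4\n"
    "^^^^^^^^^^^^^\n"
    "\n"
    "- Add support for retrieving support vectors"
)
print(rst_headings_to_markdown(sample))
```

Only headings are handled here. Links, code blocks, and directives (the bulk of the README changes in this commit) need a real converter.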
diff --git a/CHANGELOG.rst b/CHANGELOG.md index 26b499d..0d4de78 100644 --- a/CHANGELOG.rst +++ b/CHANGELOG.md @@ -1,29 +1,23 @@ -Change Log ----------- +## Change Log -Version 0.2.4 -^^^^^^^^^^^^^ +### Version 0.2.4 - Add support for retrieving support vectors -Version 0.2.3 -^^^^^^^^^^^^^ +### Version 0.2.3 - Bugfix for prediction with gamma = 'auto' -Version 0.2.2 -^^^^^^^^^^^^^ +### Version 0.2.2 - Add error when unsupported ShuffleSplits are used -Version 0.2.1 -^^^^^^^^^^^^^ +### Version 0.2.1 - Update docs - Speed up unit tests -Version 0.2.0 -^^^^^^^^^^^^^ +### Version 0.2.0 - Add support for interrupting training and retreiving partial results - Allow specification of sample weights in GenSVM.fit() @@ -35,8 +29,7 @@ Version 0.2.0 - Minor bugfixes, documentation improvement, and code cleanup - Add continuous integration through Travis-CI. -Version 0.1.6 -^^^^^^^^^^^^^ +### Version 0.1.6 - Fix segfault caused by error function in C library. - Add "load_default_grid" function to gensvm.gridsearch @@ -36,6 +36,8 @@ dist: ## Make Python source distribution docs: doc doc: venv ## Build documentation with Sphinx + source $(VENV_DIR)/bin/activate && m2r README.md && mv README.rst $(DOC_DIR) + source $(VENV_DIR)/bin/activate && m2r CHANGELOG.md && mv CHANGELOG.rst $(DOC_DIR) source $(VENV_DIR)/bin/activate && $(MAKE) -C $(DOC_DIR) html clean: ## Clean build dist and egg directories left after install @@ -56,5 +58,5 @@ venv: $(VENV_DIR)/bin/activate $(VENV_DIR)/bin/activate: test -d $(VENV_DIR) || virtualenv $(VENV_DIR) - source $(VENV_DIR)/bin/activate && pip install -e .[dev] + source $(VENV_DIR)/bin/activate && pip install numpy && pip install -e .[dev] touch $(VENV_DIR)/bin/activate diff --git a/README.md b/README.md new file mode 100644 index 0000000..6fc5392 --- /dev/null +++ b/README.md @@ -0,0 +1,291 @@ +# GenSVM Python Package + +[](https://travis-ci.org/GjjvdBurg/PyGenSVM) +[](https://gensvm.readthedocs.io/en/latest/?badge=latest) + +This is the Python 
package for the GenSVM multiclass classifier by [Gerrit +J.J. van den Burg](https://gertjanvandenburg.com) and [Patrick J.F. +Groenen](https://personal.eur.nl/groenen/). + +**Useful links:** + +- [PyGenSVM on GitHub](https://github.com/GjjvdBurg/PyGenSVM) +- [PyGenSVM on PyPI](https://pypi.org/project/gensvm/) +- [Package documentation](https://gensvm.readthedocs.io/en/latest/) +- Journal paper: [GenSVM: A Generalized Multiclass Support Vector + Machine](http://www.jmlr.org/papers/v17/14-526.html) JMLR, 17(225):1−42, + 2016. +- There is also an [R package](https://github.com/GjjvdBurg/RGenSVM) +- Or you can directly use [the C library](https://github.com/GjjvdBurg/GenSVM) + + +## Installation + +**Before** GenSVM can be installed, a working NumPy installation is required. +so GenSVM can be installed using the following command: + +```bash +$ pip install numpy && pip install gensvm +``` + +If you encounter any errors, please [open an issue on +GitHub](https://github.com/GjjvdBurg/PyGenSVM). Don't hesitate, you're helping +to make this project better! + + +## Citing + +If you use this package in your research please cite the paper, for instance +using the following BibTeX entry:: + +```bib +@article{JMLR:v17:14-526, + author = {{van den Burg}, G. J. J. and Groenen, P. J. F.}, + title = {{GenSVM}: A Generalized Multiclass Support Vector Machine}, + journal = {Journal of Machine Learning Research}, + year = {2016}, + volume = {17}, + number = {225}, + pages = {1-42}, + url = {http://jmlr.org/papers/v17/14-526.html} +} +``` + +## Usage + +The package contains two classes to fit the GenSVM model: [GenSVM] and +[GenSVMGridSearchCV]. These classes respectively fit a single GenSVM model or +fit a series of models for a parameter grid search. The interface to these +classes is the same as that of classifiers in [Scikit-Learn] so users +familiar with Scikit-Learn should have no trouble using this package. 
Below +we will show some examples of using the GenSVM classifier and the +GenSVMGridSearchCV class in practice. + +In the examples we assume that we have loaded the [iris +dataset](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html) +from Scikit-Learn as follows: + + +```python +>>> from sklearn.datasets import load_iris +>>> from sklearn.model_selection import train_test_split +>>> from sklearn.preprocessing import MaxAbsScaler +>>> X, y = load_iris(return_X_y=True) +>>> X_train, X_test, y_train, y_test = train_test_split(X, y) +>>> scaler = MaxAbsScaler().fit(X_train) +>>> X_train, X_test = scaler.transform(X_train), scaler.transform(X_test) +``` + +Note that we scale the data using the +[MaxAbsScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html) +function. This scales the columns of the data matrix to ``[-1, 1]`` without +breaking sparsity. Scaling the dataset can have a significant effect on the +computation time of GenSVM and is [generally recommended for +SVMs](https://stats.stackexchange.com/q/65094). + + +### Example 1: Fitting a single GenSVM model + +Let's start by fitting the most basic GenSVM model on the training data: + + +```python +>>> from gensvm import GenSVM +>>> clf = GenSVM() +>>> clf.fit(X_train, y_train) +GenSVM(coef=0.0, degree=2.0, epsilon=1e-06, gamma='auto', kappa=0.0, +kernel='linear', kernel_eigen_cutoff=1e-08, lmd=1e-05, +max_iter=100000000.0, p=1.0, random_state=None, verbose=0, +weights='unit') +``` + +With the model fitted, we can predict the test dataset: + +```python +>>> y_pred = clf.predict(X_test) +``` + +Next, we can compute a score for the predictions. The GenSVM class has a +``score`` method which computes the +[accuracy_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) +for the predictions. 
In the GenSVM paper, the [adjusted Rand +index](https://en.wikipedia.org/wiki/Rand_index#Adjusted_Rand_index) is often +used to compare performance. We illustrate both options below (your results +may be different depending on the exact train/test split): + +```python +>>> clf.score(X_test, y_test) +1.0 +>>> from sklearn.metrics import adjusted_rand_score +>>> adjusted_rand_score(clf.predict(X_test), y_test) +1.0 +``` + +We can try this again by changing the model parameters, for instance we can +turn on verbosity and use the Euclidean norm in the GenSVM model by setting ``p = 2``: + +```python +>>> clf2 = GenSVM(verbose=True, p=2) +>>> clf2.fit(X_train, y_train) +Starting main loop. +Dataset: + n = 112 + m = 4 + K = 3 +Parameters: + kappa = 0.000000 + p = 2.000000 + lambda = 0.0000100000000000 + epsilon = 1e-06 + +iter = 0, L = 3.4499531579689533, Lbar = 7.3369415851139745, reldiff = 1.1266786095824437 +... +Optimization finished, iter = 4046, loss = 0.0230726364692517, rel. diff. = 0.0000009998645783 +Number of support vectors: 9 +GenSVM(coef=0.0, degree=2.0, epsilon=1e-06, gamma='auto', kappa=0.0, + kernel='linear', kernel_eigen_cutoff=1e-08, lmd=1e-05, + max_iter=100000000.0, p=2, random_state=None, verbose=True, + weights='unit') +``` + +For other parameters that can be tuned in the GenSVM model, see [GenSVM]. + +### Example 2: Fitting a GenSVM model with a "warm start" + +One of the key features of the GenSVM classifier is that training can be +accelerated by using so-called "warm-starts". This way the optimization can be +started in a location that is closer to the final solution than a random +starting position would be. To support this, the ``fit`` method of the GenSVM +class has an optional ``seed_V`` parameter. We'll illustrate how this can be +used below. + +We start with relatively large value for the ``epsilon`` parameter in the +model. 
This is the stopping parameter that determines how long the +optimization continues (and therefore how exact the fit is). + +```python +>>> clf1 = GenSVM(epsilon=1e-3) +>>> clf1.fit(X_train, y_train) +... +>>> clf1.n_iter_ +163 +``` + +The ``n_iter_`` attribute tells us how many iterations the model did. Now, we +can use the solution of this model to start the training for the next model: + +```python +>>> clf2 = GenSVM(epsilon=1e-8) +>>> clf2.fit(X_train, y_train, seed_V=clf1.combined_coef_) +... +>>> clf2.n_iter_ +3196 +``` + +Compare this to a model with the same stopping parameter, but without the warm +start: + +```python +>>> clf2.fit(X_train, y_train) +... +>>> clf2.n_iter_ +3699 +``` + +So we saved about 500 iterations! This effect will be especially significant +with large datasets and when you try out many parameter configurations. +Therefore this technique is built into the [GenSVMGridSearchCV] class that can +be used to do a grid search of parameters. + +### Example 3: Running a GenSVM grid search + +Often when we're fitting a machine learning model such as GenSVM, we have to +try several parameter configurations to figure out which one performs best on +our given dataset. This is usually combined with [cross +validation](http://scikit-learn.org/stable/modules/cross_validation.html) to +avoid overfitting. To do this efficiently and to make use of warm starts, the +[GenSVMGridSearchCV] class is available. This class works in the same way as +the +[GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) +class of [Scikit-Learn], but uses the GenSVM C library for speed. 
+ +To do a grid search, we first have to define the parameters that we want to +vary and what values we want to try: + +```python +>>> from gensvm import GenSVMGridSearchCV +>>> param_grid = {'p': [1.0, 2.0], 'lmd': [1e-8, 1e-6, 1e-4, 1e-2, 1.0], 'kappa': [-0.9, 0.0] } +``` + +For the values that are not varied in the parameter grid, the default values +will be used. This means that if you want to change a specific value (such as +``epsilon`` for instance), you can add this to the parameter grid as a +parameter with a single value to try (e.g. ``'epsilon': [1e-8]``). + +Running the grid search is now straightforward: + +```python +>>> gg = GenSVMGridSearchCV(param_grid) +>>> gg.fit(X_train, y_train) +GenSVMGridSearchCV(cv=None, iid=True, + param_grid={'p': [1.0, 2.0], 'lmd': [1e-06, 0.0001, 0.01, 1.0], 'kappa': [-0.9, 0.0]}, + refit=True, return_train_score=True, scoring=None, verbose=0) +``` + +Note that if we have set ``refit=True`` (the default), then we can use the +[GenSVMGridSearchCV] instance to predict or score using the best estimator +found in the grid search: + +```python +>>> y_pred = gg.predict(X_test) +>>> gg.score(X_test, y_test) +1.0 +``` + +A nice feature borrowed from `Scikit-Learn`_ is that the results from the grid +search can be represented as a ``pandas`` DataFrame: + +```python +>>> from pandas import DataFrame +>>> df = DataFrame(gg.cv_results_) +``` + +This can make it easier to explore the results of the grid search. + +## Known Limitations + +The following are known limitations that are on the roadmap for a future +release of the package. If you need any of these features, please vote on them +on the linked GitHub issues (this can make us add them sooner!). + +1. [Support for sparse + matrices](https://github.com/GjjvdBurg/PyGenSVM/issues/1). NumPy supports + sparse matrices, as does the GenSVM C library. Getting them to work + together requires some additional effort. 
In the meantime, if you really + want to use sparse data with GenSVM (this can lead to significant + speedups!), check out the GenSVM C library. +2. [Specification of class misclassification + weights](https://github.com/GjjvdBurg/PyGenSVM/issues/3). Currently, + incorrectly classification an object from class A to class C is as bad as + incorrectly classifying an object from class B to class C. Depending on the + application, this may not be the desired effect. Adding class + misclassification weights can solve this issue. + + +## Questions and Issues + +If you have any questions or encounter any issues with using this package, +please ask them on [GitHub](https://github.com/GjjvdBurg/PyGenSVM). + +## License + +This package is licensed under the GNU General Public License version 3. + +Copyright (c) G.J.J. van den Burg, excluding the sections of the code that are +explicitly marked to come from Scikit-Learn. + +[Scikit-Learn]: http://scikit-learn.org/stable/index.html + +[GenSVM]: https://gensvm.readthedocs.io/en/latest/#gensvm + +[GenSVMGridSearchCV]: https://gensvm.readthedocs.io/en/latest/#gensvmgridsearchcv diff --git a/README.rst b/README.rst deleted file mode 100644 index 9f53d0c..0000000 --- a/README.rst +++ /dev/null @@ -1,313 +0,0 @@ -GenSVM Python Package -===================== - -.. image:: https://travis-ci.org/GjjvdBurg/PyGenSVM.svg?branch=master - :target: https://travis-ci.org/GjjvdBurg/PyGenSVM - -.. image:: https://readthedocs.org/projects/gensvm/badge/?version=latest - :target: https://gensvm.readthedocs.io/en/latest/?badge=latest - :alt: Documentation Status - - -This is the Python package for the GenSVM multiclass classifier by `Gerrit -J.J. van den Burg <https://gertjanvandenburg.com>`_ and `Patrick J.F. Groenen -<https://personal.eur.nl/groenen/>`_. - -**Important links:** - -- Source repository: `https://github.com/GjjvdBurg/PyGenSVM - <https://github.com/GjjvdBurg/PyGenSVM>`_. 
- -- Package on PyPI: `https://pypi.org/project/gensvm/ - <https://pypi.org/project/gensvm/>`_. - -- Journal paper: `GenSVM: A Generalized Multiclass Support Vector Machine - <http://www.jmlr.org/papers/v17/14-526.html>`_ JMLR, 17(225):1−42, 2016. - -- Package documentation: `Read The Docs - <https://gensvm.readthedocs.io/en/latest/>`_. - -- There is also an `R package <https://github.com/GjjvdBurg/RGenSVM>`_. - -- Or you can directly use `the C library - <https://github.com/GjjvdBurg/GenSVM>`_. - - -Installation ------------- - -**Before** GenSVM can be installed, a working NumPy installation is required, -so GenSVM can be installed using the following command: - -.. code:: bash - - pip install numpy && pip install gensvm - -If you encounter any errors, please open an issue on `GitHub -<https://github.com/GjjvdBurg/PyGenSVM>`_. - -Citing ------- - -If you use this package in your research please cite the paper, for instance -using the following BibTeX entry:: - - @article{JMLR:v17:14-526, - author = {{van den Burg}, G. J. J. and Groenen, P. J. F.}, - title = {{GenSVM}: A Generalized Multiclass Support Vector Machine}, - journal = {Journal of Machine Learning Research}, - year = {2016}, - volume = {17}, - number = {225}, - pages = {1-42}, - url = {http://jmlr.org/papers/v17/14-526.html} - } - -Usage ------ - -The package contains two classes to fit the GenSVM model: `GenSVM`_ and -`GenSVMGridSearchCV`_. These classes respectively fit a single GenSVM model -or fit a series of models for a parameter grid search. The interface to these -classes is the same as that of classifiers in `Scikit-Learn`_ so users -familiar with Scikit-Learn should have no trouble using this package. Below -we will show some examples of using the GenSVM classifier and the -GenSVMGridSearchCV class in practice. - -In the examples we assume that we have loaded the `iris dataset -<http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html>`_ -from Scikit-Learn as follows: - -.. 
code:: python - - >>> from sklearn.datasets import load_iris - >>> from sklearn.model_selection import train_test_split - >>> from sklearn.preprocessing import MaxAbsScaler - >>> X, y = load_iris(return_X_y=True) - >>> X_train, X_test, y_train, y_test = train_test_split(X, y) - >>> scaler = MaxAbsScaler().fit(X_train) - >>> X_train, X_test = scaler.transform(X_train), scaler.transform(X_test) - -Note that we scale the data using the `MaxAbsScaler -<http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html>`_ -function. This scales the columns of the data matrix to ``[-1, 1]`` without -breaking sparsity. Scaling the dataset can have a significant effect on the -computation time of GenSVM and is `generally recommended for SVMs -<https://stats.stackexchange.com/q/65094>`_. - - -Example 1: Fitting a single GenSVM model -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Let's start by fitting the most basic GenSVM model on the training data: - -.. code:: python - - >>> from gensvm import GenSVM - >>> clf = GenSVM() - >>> clf.fit(X_train, y_train) - GenSVM(coef=0.0, degree=2.0, epsilon=1e-06, gamma='auto', kappa=0.0, - kernel='linear', kernel_eigen_cutoff=1e-08, lmd=1e-05, - max_iter=100000000.0, p=1.0, random_state=None, verbose=0, - weights='unit') - - -With the model fitted, we can predict the test dataset: - -.. code:: python - - >>> y_pred = clf.predict(X_test) - -Next, we can compute a score for the predictions. The GenSVM class has a -``score`` method which computes the `accuracy_score -<http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html>`_ -for the predictions. In the GenSVM paper, the `adjusted Rand index -<https://en.wikipedia.org/wiki/Rand_index#Adjusted_Rand_index>`_ is often used -to compare performance. We illustrate both options below (your results may be -different depending on the exact train/test split): - -.. 
code:: python - - >>> clf.score(X_test, y_test) - 1.0 - >>> from sklearn.metrics import adjusted_rand_score - >>> adjusted_rand_score(clf.predict(X_test), y_test) - 1.0 - -We can try this again by changing the model parameters, for instance we can -turn on verbosity and use the Euclidean norm in the GenSVM model by setting ``p = 2``: - -.. code:: python - - >>> clf2 = GenSVM(verbose=True, p=2) - >>> clf2.fit(X_train, y_train) - Starting main loop. - Dataset: - n = 112 - m = 4 - K = 3 - Parameters: - kappa = 0.000000 - p = 2.000000 - lambda = 0.0000100000000000 - epsilon = 1e-06 - - iter = 0, L = 3.4499531579689533, Lbar = 7.3369415851139745, reldiff = 1.1266786095824437 - ... - Optimization finished, iter = 4046, loss = 0.0230726364692517, rel. diff. = 0.0000009998645783 - Number of support vectors: 9 - GenSVM(coef=0.0, degree=2.0, epsilon=1e-06, gamma='auto', kappa=0.0, - kernel='linear', kernel_eigen_cutoff=1e-08, lmd=1e-05, - max_iter=100000000.0, p=2, random_state=None, verbose=True, - weights='unit') - -For other parameters that can be tuned in the GenSVM model, see `GenSVM`_. - - -Example 2: Fitting a GenSVM model with a "warm start" -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -One of the key features of the GenSVM classifier is that training can be -accelerated by using so-called "warm-starts". This way the optimization can be -started in a location that is closer to the final solution than a random -starting position would be. To support this, the ``fit`` method of the GenSVM -class has an optional ``seed_V`` parameter. We'll illustrate how this can be -used below. - -We start with relatively large value for the ``epsilon`` parameter in the -model. This is the stopping parameter that determines how long the -optimization continues (and therefore how exact the fit is). - -.. code:: python - - >>> clf1 = GenSVM(epsilon=1e-3) - >>> clf1.fit(X_train, y_train) - ... 
- >>> clf1.n_iter_ - 163 - -The ``n_iter_`` attribute tells us how many iterations the model did. Now, we -can use the solution of this model to start the training for the next model: - -.. code:: python - - >>> clf2 = GenSVM(epsilon=1e-8) - >>> clf2.fit(X_train, y_train, seed_V=clf1.combined_coef_) - ... - >>> clf2.n_iter_ - 3196 - -Compare this to a model with the same stopping parameter, but without the warm -start: - -.. code:: python - - >>> clf2.fit(X_train, y_train) - ... - >>> clf2.n_iter_ - 3699 - -So we saved about 500 iterations! This effect will be especially significant -with large datasets and when you try out many parameter configurations. -Therefore this technique is built into the `GenSVMGridSearchCV`_ class that -can be used to do a grid search of parameters. - - -Example 3: Running a GenSVM grid search -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Often when we're fitting a machine learning model such as GenSVM, we have to -try several parameter configurations to figure out which one performs best on -our given dataset. This is usually combined with `cross validation -<http://scikit-learn.org/stable/modules/cross_validation.html>`_ to avoid -overfitting. To do this efficiently and to make use of warm starts, the -`GenSVMGridSearchCV`_ class is available. This class works in the same way as -the `GridSearchCV -<http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html>`_ -class of `Scikit-Learn`_, but uses the GenSVM C library for speed. - -To do a grid search, we first have to define the parameters that we want to -vary and what values we want to try: - -.. code:: python - - >>> from gensvm import GenSVMGridSearchCV - >>> param_grid = {'p': [1.0, 2.0], 'lmd': [1e-8, 1e-6, 1e-4, 1e-2, 1.0], 'kappa': [-0.9, 0.0] } - -For the values that are not varied in the parameter grid, the default values -will be used. 
This means that if you want to change a specific value (such as -``epsilon`` for instance), you can add this to the parameter grid as a -parameter with a single value to try (e.g. ``'epsilon': [1e-8]``). - -Running the grid search is now straightforward: - -.. code:: python - - >>> gg = GenSVMGridSearchCV(param_grid) - >>> gg.fit(X_train, y_train) - GenSVMGridSearchCV(cv=None, iid=True, - param_grid={'p': [1.0, 2.0], 'lmd': [1e-06, 0.0001, 0.01, 1.0], 'kappa': [-0.9, 0.0]}, - refit=True, return_train_score=True, scoring=None, verbose=0) - -Note that if we have set ``refit=True`` (the default), then we can use the -`GenSVMGridSearchCV`_ instance to predict or score using the best estimator -found in the grid search: - -.. code:: python - - >>> y_pred = gg.predict(X_test) - >>> gg.score(X_test, y_test) - 1.0 - -A nice feature borrowed from `Scikit-Learn`_ is that the results from the grid -search can be represented as a ``pandas`` DataFrame: - -.. code:: python - - >>> from pandas import DataFrame - >>> df = DataFrame(gg.cv_results_) - -This can make it easier to explore the results of the grid search. - -Known Limitations ------------------ - -The following are known limitations that are on the roadmap for a future -release of the package. If you need any of these features, please vote on them -on the linked GitHub issues (this can make us add them sooner!). - -1. `Support for sparse matrices - <https://github.com/GjjvdBurg/PyGenSVM/issues/1>`_. NumPy supports sparse - matrices, as does the GenSVM C library. Getting them to work together - requires some time. In the meantime, if you really want to use sparse data - with GenSVM (this can lead to significant speedups!), check out the GenSVM - C library. -2. `Specification of class misclassification weights - <https://github.com/GjjvdBurg/PyGenSVM/issues/3>`_. Currently, incorrectly - classification an object from class A to class C is as bad as incorrectly - classifying an object from class B to class C. 
Depending on the - application, this may not be the desired effect. Adding class - misclassification weights can solve this issue. - -Questions and Issues --------------------- - -If you have any questions or encounter any issues with using this package, -please ask them on `GitHub <https://github.com/GjjvdBurg/PyGenSVM>`_. - -License -------- - -This package is licensed under the GNU General Public License version 3. - -Copyright G.J.J. van den Burg, excluding the sections of the code that are -explicitly marked to come from Scikit-Learn. - -.. _Scikit-Learn: - http://scikit-learn.org/stable/index.html - -.. _GenSVM: - https://gensvm.readthedocs.io/en/latest/#gensvm - -.. _GenSVMGridSearchCV: - https://gensvm.readthedocs.io/en/latest/#gensvmgridsearchcv diff --git a/docs/CHANGELOG.rst b/docs/CHANGELOG.rst new file mode 100644 index 0000000..c7e4ba6 --- /dev/null +++ b/docs/CHANGELOG.rst @@ -0,0 +1,49 @@ + +Change Log +---------- + +Version 0.2.4 +^^^^^^^^^^^^^ + + +* Add support for retrieving support vectors + +Version 0.2.3 +^^^^^^^^^^^^^ + + +* Bugfix for prediction with gamma = 'auto' + +Version 0.2.2 +^^^^^^^^^^^^^ + + +* Add error when unsupported ShuffleSplits are used + +Version 0.2.1 +^^^^^^^^^^^^^ + + +* Update docs +* Speed up unit tests + +Version 0.2.0 +^^^^^^^^^^^^^ + + +* Add support for interrupting training and retreiving partial results +* Allow specification of sample weights in GenSVM.fit() +* Extract per-split durations from grid search results +* Add pre-defined parameter grids 'tiny', 'small', and 'full' +* Add code for prediction with kernels +* Add unit tests +* Change default coef in poly kernel to 1.0 for inhomogeneous kernel +* Minor bugfixes, documentation improvement, and code cleanup +* Add continuous integration through Travis-CI. + +Version 0.1.6 +^^^^^^^^^^^^^ + + +* Fix segfault caused by error function in C library. 
+* Add "load_default_grid" function to gensvm.gridsearch diff --git a/docs/README.rst b/docs/README.rst new file mode 100644 index 0000000..70a27d2 --- /dev/null +++ b/docs/README.rst @@ -0,0 +1,300 @@ + +GenSVM Python Package +===================== + + +.. image:: https://travis-ci.org/GjjvdBurg/PyGenSVM.svg?branch=master + :target: https://travis-ci.org/GjjvdBurg/PyGenSVM + :alt: Build Status + + +.. image:: https://readthedocs.org/projects/gensvm/badge/?version=latest + :target: https://gensvm.readthedocs.io/en/latest/?badge=latest + :alt: Documentation Status + + +This is the Python package for the GenSVM multiclass classifier by `Gerrit +J.J. van den Burg <https://gertjanvandenburg.com>`_ and `Patrick J.F. +Groenen <https://personal.eur.nl/groenen/>`_. + +**Useful links:** + + +* `PyGenSVM on GitHub <https://github.com/GjjvdBurg/PyGenSVM>`_ +* `PyGenSVM on PyPI <https://pypi.org/project/gensvm/>`_ +* `Package documentation <https://gensvm.readthedocs.io/en/latest/>`_ +* Journal paper: `GenSVM: A Generalized Multiclass Support Vector + Machine <http://www.jmlr.org/papers/v17/14-526.html>`_ JMLR, 17(225):1−42, + 2016. +* There is also an `R package <https://github.com/GjjvdBurg/RGenSVM>`_ +* Or you can directly use `the C library <https://github.com/GjjvdBurg/GenSVM>`_ + +Installation +------------ + +**Before** GenSVM can be installed, a working NumPy installation is required. +so GenSVM can be installed using the following command: + +.. code-block:: bash + + $ pip install numpy && pip install gensvm + +If you encounter any errors, please `open an issue on +GitHub <https://github.com/GjjvdBurg/PyGenSVM>`_. Don't hesitate, you're helping +to make this project better! + +Citing +------ + +If you use this package in your research please cite the paper, for instance +using the following BibTeX entry: + +.. code-block:: bib + + @article{JMLR:v17:14-526, + author = {{van den Burg}, G. J. J. and Groenen, P. J. 
F.}, + title = {{GenSVM}: A Generalized Multiclass Support Vector Machine}, + journal = {Journal of Machine Learning Research}, + year = {2016}, + volume = {17}, + number = {225}, + pages = {1-42}, + url = {http://jmlr.org/papers/v17/14-526.html} + } + +Usage +----- + +The package contains two classes to fit the GenSVM model: `GenSVM <https://gensvm.readthedocs.io/en/latest/#gensvm>`_ and +`GenSVMGridSearchCV <https://gensvm.readthedocs.io/en/latest/#gensvmgridsearchcv>`_. These classes respectively fit a single GenSVM model or +fit a series of models for a parameter grid search. The interface to these +classes is the same as that of classifiers in `Scikit-Learn <http://scikit-learn.org/stable/index.html>`_ so users +familiar with Scikit-Learn should have no trouble using this package. Below +we will show some examples of using the GenSVM classifier and the +GenSVMGridSearchCV class in practice. + +In the examples we assume that we have loaded the `iris +dataset <http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html>`_ +from Scikit-Learn as follows: + +.. code-block:: python + + >>> from sklearn.datasets import load_iris + >>> from sklearn.model_selection import train_test_split + >>> from sklearn.preprocessing import MaxAbsScaler + >>> X, y = load_iris(return_X_y=True) + >>> X_train, X_test, y_train, y_test = train_test_split(X, y) + >>> scaler = MaxAbsScaler().fit(X_train) + >>> X_train, X_test = scaler.transform(X_train), scaler.transform(X_test) + +Note that we scale the data using the +`MaxAbsScaler <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html>`_ +function. This scales the columns of the data matrix to ``[-1, 1]`` without +breaking sparsity. Scaling the dataset can have a significant effect on the +computation time of GenSVM and is `generally recommended for +SVMs <https://stats.stackexchange.com/q/65094>`_. 
+ +Example 1: Fitting a single GenSVM model +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Let's start by fitting the most basic GenSVM model on the training data: + +.. code-block:: python + + >>> from gensvm import GenSVM + >>> clf = GenSVM() + >>> clf.fit(X_train, y_train) + GenSVM(coef=0.0, degree=2.0, epsilon=1e-06, gamma='auto', kappa=0.0, + kernel='linear', kernel_eigen_cutoff=1e-08, lmd=1e-05, + max_iter=100000000.0, p=1.0, random_state=None, verbose=0, + weights='unit') + +With the model fitted, we can predict the test dataset: + +.. code-block:: python + + >>> y_pred = clf.predict(X_test) + +Next, we can compute a score for the predictions. The GenSVM class has a +``score`` method which computes the +`accuracy_score <http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html>`_ +for the predictions. In the GenSVM paper, the `adjusted Rand +index <https://en.wikipedia.org/wiki/Rand_index#Adjusted_Rand_index>`_ is often +used to compare performance. We illustrate both options below (your results +may be different depending on the exact train/test split): + +.. code-block:: python + + >>> clf.score(X_test, y_test) + 1.0 + >>> from sklearn.metrics import adjusted_rand_score + >>> adjusted_rand_score(clf.predict(X_test), y_test) + 1.0 + +We can try this again by changing the model parameters, for instance we can +turn on verbosity and use the Euclidean norm in the GenSVM model by setting ``p = 2``\ : + +.. code-block:: python + + >>> clf2 = GenSVM(verbose=True, p=2) + >>> clf2.fit(X_train, y_train) + Starting main loop. + Dataset: + n = 112 + m = 4 + K = 3 + Parameters: + kappa = 0.000000 + p = 2.000000 + lambda = 0.0000100000000000 + epsilon = 1e-06 + + iter = 0, L = 3.4499531579689533, Lbar = 7.3369415851139745, reldiff = 1.1266786095824437 + ... + Optimization finished, iter = 4046, loss = 0.0230726364692517, rel. diff. 
= 0.0000009998645783 + Number of support vectors: 9 + GenSVM(coef=0.0, degree=2.0, epsilon=1e-06, gamma='auto', kappa=0.0, + kernel='linear', kernel_eigen_cutoff=1e-08, lmd=1e-05, + max_iter=100000000.0, p=2, random_state=None, verbose=True, + weights='unit') + +For other parameters that can be tuned in the GenSVM model, see `GenSVM <https://gensvm.readthedocs.io/en/latest/#gensvm>`_. + +Example 2: Fitting a GenSVM model with a "warm start" +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +One of the key features of the GenSVM classifier is that training can be +accelerated by using so-called "warm-starts". This way the optimization can be +started in a location that is closer to the final solution than a random +starting position would be. To support this, the ``fit`` method of the GenSVM +class has an optional ``seed_V`` parameter. We'll illustrate how this can be +used below. + +We start with relatively large value for the ``epsilon`` parameter in the +model. This is the stopping parameter that determines how long the +optimization continues (and therefore how exact the fit is). + +.. code-block:: python + + >>> clf1 = GenSVM(epsilon=1e-3) + >>> clf1.fit(X_train, y_train) + ... + >>> clf1.n_iter_ + 163 + +The ``n_iter_`` attribute tells us how many iterations the model did. Now, we +can use the solution of this model to start the training for the next model: + +.. code-block:: python + + >>> clf2 = GenSVM(epsilon=1e-8) + >>> clf2.fit(X_train, y_train, seed_V=clf1.combined_coef_) + ... + >>> clf2.n_iter_ + 3196 + +Compare this to a model with the same stopping parameter, but without the warm +start: + +.. code-block:: python + + >>> clf2.fit(X_train, y_train) + ... + >>> clf2.n_iter_ + 3699 + +So we saved about 500 iterations! This effect will be especially significant +with large datasets and when you try out many parameter configurations. 
+Therefore this technique is built into the `GenSVMGridSearchCV <https://gensvm.readthedocs.io/en/latest/#gensvmgridsearchcv>`_ class that can +be used to do a grid search of parameters. + +Example 3: Running a GenSVM grid search +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Often when we're fitting a machine learning model such as GenSVM, we have to +try several parameter configurations to figure out which one performs best on +our given dataset. This is usually combined with `cross +validation <http://scikit-learn.org/stable/modules/cross_validation.html>`_ to +avoid overfitting. To do this efficiently and to make use of warm starts, the +`GenSVMGridSearchCV <https://gensvm.readthedocs.io/en/latest/#gensvmgridsearchcv>`_ class is available. This class works in the same way as +the +`GridSearchCV <http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html>`_ +class of `Scikit-Learn <http://scikit-learn.org/stable/index.html>`_\ , but uses the GenSVM C library for speed. + +To do a grid search, we first have to define the parameters that we want to +vary and what values we want to try: + +.. code-block:: python + + >>> from gensvm import GenSVMGridSearchCV + >>> param_grid = {'p': [1.0, 2.0], 'lmd': [1e-8, 1e-6, 1e-4, 1e-2, 1.0], 'kappa': [-0.9, 0.0] } + +For the values that are not varied in the parameter grid, the default values +will be used. This means that if you want to change a specific value (such as +``epsilon`` for instance), you can add this to the parameter grid as a +parameter with a single value to try (e.g. ``'epsilon': [1e-8]``\ ). + +Running the grid search is now straightforward: + +.. 
code-block:: python + + >>> gg = GenSVMGridSearchCV(param_grid) + >>> gg.fit(X_train, y_train) + GenSVMGridSearchCV(cv=None, iid=True, + param_grid={'p': [1.0, 2.0], 'lmd': [1e-08, 1e-06, 0.0001, 0.01, 1.0], 'kappa': [-0.9, 0.0]}, + refit=True, return_train_score=True, scoring=None, verbose=0) + +Note that if we have set ``refit=True`` (the default), then we can use the +`GenSVMGridSearchCV <https://gensvm.readthedocs.io/en/latest/#gensvmgridsearchcv>`_ instance to predict or score using the best estimator +found in the grid search: + +.. code-block:: python + + >>> y_pred = gg.predict(X_test) + >>> gg.score(X_test, y_test) + 1.0 + +A nice feature borrowed from `Scikit-Learn`_ is that the results from the grid +search can be represented as a ``pandas`` DataFrame: + +.. code-block:: python + + >>> from pandas import DataFrame + >>> df = DataFrame(gg.cv_results_) + +This can make it easier to explore the results of the grid search. + +Known Limitations +----------------- + +The following are known limitations that are on the roadmap for a future +release of the package. If you need any of these features, please vote on them +on the linked GitHub issues (this can make us add them sooner!). + + +#. `Support for sparse + matrices <https://github.com/GjjvdBurg/PyGenSVM/issues/1>`_. SciPy supports + sparse matrices, as does the GenSVM C library. Getting them to work + together requires some additional effort. In the meantime, if you really + want to use sparse data with GenSVM (this can lead to significant + speedups!), check out the GenSVM C library. +#. `Specification of class misclassification + weights <https://github.com/GjjvdBurg/PyGenSVM/issues/3>`_. Currently, + incorrectly classifying an object from class A to class C is as bad as + incorrectly classifying an object from class B to class C. Depending on the + application, this may not be the desired effect. Adding class + misclassification weights can solve this issue. 
+ +Questions and Issues +-------------------- + +If you have any questions or encounter any issues with using this package, +please ask them on `GitHub <https://github.com/GjjvdBurg/PyGenSVM>`_. + +License +------- + +This package is licensed under the GNU General Public License version 3. + +Copyright (c) G.J.J. van den Burg, excluding the sections of the code that are +explicitly marked to come from Scikit-Learn. diff --git a/docs/auto_functions.txt b/docs/auto_functions.txt new file mode 100644 index 0000000..2a6596f --- /dev/null +++ b/docs/auto_functions.txt @@ -0,0 +1,71 @@ + +.. py:function:: load_grid_tiny() + :noindex: + :module: gensvm.gridsearch + + Load a tiny parameter grid for the GenSVM grid search + + This function returns a parameter grid to use in the GenSVM grid search. + This grid was obtained by analyzing the experiments done for the GenSVM + paper and selecting the configurations that achieve accuracy within the + 95th percentile on over 90% of the datasets. It is a good start for a + parameter search with a reasonably high chance of achieving good + performance on most datasets. + + Note that this grid is only tested to work well in combination with the + linear kernel. + + :returns: **pg** -- List of 10 parameter configurations that are likely to perform + reasonably well. + :rtype: list + + +.. py:function:: load_grid_small() + :noindex: + :module: gensvm.gridsearch + + Load a small parameter grid for GenSVM + + This function loads a default parameter grid to use for the GenSVM + gridsearch. It contains all possible combinations of the following + parameter sets:: + + pg = { + 'p': [1.0, 1.5, 2.0], + 'lmd': [1e-8, 1e-6, 1e-4, 1e-2, 1], + 'kappa': [-0.9, 0.5, 5.0], + 'weights': ['unit', 'group'], + } + + :returns: **pg** -- Mapping from parameters to lists of values for those parameters. To be + used as input for the :class:`.GenSVMGridSearchCV` class. + :rtype: dict + + +.. 
py:function:: load_grid_full() + :noindex: + :module: gensvm.gridsearch + + Load the full parameter grid for GenSVM + + This is the parameter grid used in the GenSVM paper to run the grid search + experiments. It uses a large grid for the ``lmd`` regularization parameter + and converges with a stopping criterion of ``1e-8``. This is a relatively + small stopping criterion and in practice good classification results can be + obtained by using a larger stopping criterion. + + The function returns the following grid:: + + pg = { + 'lmd': [pow(2, x) for x in range(-18, 19, 2)], + 'kappa': [-0.9, 0.5, 5.0], + 'p': [1.0, 1.5, 2.0], + 'weights': ['unit', 'group'], + 'epsilon': [1e-8], + 'kernel': ['linear'] + } + + :returns: **pg** -- Mapping from parameters to lists of values for those parameters. To be + used as input for the :class:`.GenSVMGridSearchCV` class. + :rtype: dict + diff --git a/docs/cls_gensvm.txt b/docs/cls_gensvm.txt new file mode 100644 index 0000000..b4bc9a7 --- /dev/null +++ b/docs/cls_gensvm.txt @@ -0,0 +1,139 @@ + +.. py:class:: GenSVM(p=1.0, lmd=1e-05, kappa=0.0, epsilon=1e-06, weights='unit', kernel='linear', gamma='auto', coef=1.0, degree=2.0, kernel_eigen_cutoff=1e-08, verbose=0, random_state=None, max_iter=100000000.0) + :noindex: + :module: gensvm.core + + Generalized Multiclass Support Vector Machine Classification. + + This class implements the basic GenSVM classifier. GenSVM is a generalized + multiclass SVM which is flexible in the weighting of misclassification + errors. It is this flexibility that makes it perform well on diverse + datasets. + + The :func:`~GenSVM.fit` and :func:`~GenSVM.predict` methods of this class + use the GenSVM C library for the actual computations. 
+ + :param p: Parameter for the L_p norm of the loss function (1.0 <= p <= 2.0) + :type p: float, optional (default=1.0) + :param lmd: Parameter for the regularization term of the loss function (lmd > 0) + :type lmd: float, optional (default=1e-5) + :param kappa: Parameter for the hinge function in the loss function (kappa > -1.0) + :type kappa: float, optional (default=0.0) + :param weights: Type of sample weights to use. Options are 'unit' for unit weights and + 'group' for group size correction weights (equation 4 in the paper). + + It is also possible to provide an explicit vector of sample weights + through the :func:`~GenSVM.fit` method. If so, it will override the + setting provided here. + :type weights: string, optional (default='unit') + :param kernel: Specify the kernel type to use in the classifier. It must be one of + 'linear', 'poly', 'rbf', or 'sigmoid'. + :type kernel: string, optional (default='linear') + :param gamma: Kernel parameter for the rbf, poly, and sigmoid kernel. If gamma is + 'auto' then 1/n_features will be used. See `Kernels in GenSVM + <gensvm_kernels_>`_ for the exact implementation of the kernels. + :type gamma: float, optional (default='auto') + :param coef: Kernel parameter for the poly and sigmoid kernel. See `Kernels in + GenSVM <gensvm_kernels_>`_ for the exact implementation of the kernels. + :type coef: float, optional (default=1.0) + :param degree: Kernel parameter for the poly kernel. See `Kernels in GenSVM + <gensvm_kernels_>`_ for the exact implementation of the kernels. + :type degree: float, optional (default=2.0) + :param kernel_eigen_cutoff: Cutoff point for the reduced eigendecomposition used with nonlinear + GenSVM. Eigenvectors for which the ratio between their corresponding + eigenvalue and the largest eigenvalue is smaller than the cutoff will + be dropped. 
+ :type kernel_eigen_cutoff: float, optional (default=1e-8) + :param verbose: Enable verbose output + :type verbose: int, (default=0) + :param random_state: The seed for the random number generation used for initialization where + necessary. See the documentation of + ``sklearn.utils.check_random_state`` for more info. + :type random_state: None, int, instance of RandomState + :param max_iter: The maximum number of iterations to be run. + :type max_iter: int, (default=1e8) + + .. attribute:: coef_ + + *array, shape = [n_features, n_classes-1]* -- Weights assigned to the features (coefficients in the primal problem) + + .. attribute:: intercept_ + + *array, shape = [n_classes-1]* -- Constants in the decision function + + .. attribute:: combined_coef_ + + *array, shape = [n_features+1, n_classes-1]* -- Combined weights matrix for the seed_V parameter to the fit method + + .. attribute:: n_iter_ + + *int* -- The number of iterations that were run during training. + + .. attribute:: n_support_ + + *int* -- The number of support vectors that were found + + .. attribute:: SVs_ + + *array, shape = [n_observations, ]* -- Index vector that marks the support vectors (1 = SV, 0 = no SV) + + .. seealso:: + + :class:`.GenSVMGridSearchCV` + Helper class to run an efficient grid search for GenSVM. + + .. _gensvm_kernels: + https://gensvm.readthedocs.io/en/latest/#kernels-in-gensvm + + + + .. py:method:: GenSVM.fit(X, y, sample_weight=None, seed_V=None) + :noindex: + :module: gensvm.core + + Fit the GenSVM model on the given data + + The model can be fit with or without a seed matrix (``seed_V``). This + can be used to provide warm starts for the algorithm. + + :param X: The input data. It is expected that only numeric data is given. + :type X: array, shape = (n_observations, n_features) + :param y: The label vector, labels can be numbers or strings. + :type y: array, shape = (n_observations, ) + :param sample_weight: Array of weights that are assigned to individual samples. 
If not + provided, then the weight specification in the constructor is used + ('unit' or 'group'). + :type sample_weight: array, shape = (n_observations, ) + :param seed_V: Seed coefficient array to use as a warm start for the optimization. + It can for instance be the :attr:`combined_coef_ + <.GenSVM.combined_coef_>` attribute of a different GenSVM model. + This is only supported for the linear kernel. + + NOTE: the size of the seed_V matrix is ``n_features+1`` by + ``n_classes - 1``. The number of columns of ``seed_V`` determines + the number of classes in the model. For example, if ``y`` + contains 3 different classes and ``seed_V`` has 3 columns, we + assume that there are actually 4 classes in the problem but one + class is just not represented in this training data. This can be useful + for problems where a certain class has only a few samples. + :type seed_V: array, shape = (n_features+1, n_classes-1), optional + + :returns: **self** -- Returns self. + :rtype: object + + + .. py:method:: GenSVM.predict(X, trainX=None) + :noindex: + :module: gensvm.core + + Predict the class labels on the given data + + :param X: Data for which to predict the labels + :type X: array, shape = [n_test_samples, n_features] + :param trainX: Only for nonlinear prediction with kernels: the training data used + to train the model. + :type trainX: array, shape = [n_train_samples, n_features] + + :returns: **y_pred** -- Predicted class labels of the data in X. + :rtype: array, shape = (n_samples, ) + diff --git a/docs/cls_gridsearch.txt b/docs/cls_gridsearch.txt new file mode 100644 index 0000000..6a2c05e --- /dev/null +++ b/docs/cls_gridsearch.txt @@ -0,0 +1,285 @@ + +.. py:class:: GenSVMGridSearchCV(param_grid='tiny', scoring=None, iid=True, cv=None, refit=True, verbose=0, return_train_score=True) + :noindex: + :module: gensvm.gridsearch + + GenSVM cross validated grid search + + This class implements efficient GenSVM grid search with cross validation. 
+ One of the strong features of GenSVM is that seeding the classifier + properly can greatly reduce total training time. This class ensures that + the grid search is done in the most efficient way possible. + + The implementation of this class is based on the `GridSearchCV`_ class in + scikit-learn. The documentation of the various parameters is therefore + mostly the same. This is done to provide the user with a familiar and + easy-to-use interface to doing a grid search with GenSVM. A separate class + was needed to benefit from the fast low-level C implementation of grid + search in the GenSVM library. + + :param param_grid: If a string, it must be either 'tiny', 'small', or 'full' to load the + predefined parameter grids (see the functions :func:`load_grid_tiny`, + :func:`load_grid_small`, and :func:`load_grid_full`). + + Otherwise, a dictionary of parameter names (strings) as keys and lists + of parameter settings to evaluate as values, or a list of such dicts. + The GenSVM model will be evaluated at all combinations of the + parameters. + :type param_grid: string, dict, or list of dicts + :param scoring: A single string (see :ref:`scoring_parameter`) or a callable (see + :ref:`scoring`) to evaluate the predictions on the test set. + + For evaluating multiple metrics, either give a list of (unique) strings + or a dict with names as keys and callables as values. + + NOTE that when using custom scorers, each scorer should return a single + value. Metric functions returning a list/array of values can be wrapped + into multiple scorers that return one value each. + + If None, the `accuracy_score`_ is used. + :type scoring: string, callable, list/tuple, dict or None + :param iid: If True, the data is assumed to be identically distributed across the + folds, and the loss minimized is the total loss per sample and not the + mean loss across the folds. + :type iid: boolean, default=True + :param cv: Determines the cross-validation splitting strategy. 
Possible inputs for + cv are: + + - None, to use the default 5-fold cross validation, + - integer, to specify the number of folds in a `(Stratified)KFold`, + - An object to be used as a cross-validation generator. + - An iterable yielding train, test splits. + + For integer/None inputs, :class:`StratifiedKFold + <sklearn.model_selection.StratifiedKFold>` is used. In all other + cases, :class:`KFold <sklearn.model_selection.KFold>` is used. + + Refer to the `scikit-learn User Guide on cross validation`_ for the + various strategies that can be used here. + + NOTE: At the moment, the ShuffleSplit and StratifiedShuffleSplit are + not supported in this class. If you need these, you can use the GenSVM + classifier directly with the GridSearchCV object from scikit-learn. + (these methods require significant changes in the low-level library + before they can be supported). + :type cv: int, cross-validation generator or an iterable, optional + :param refit: Refit the GenSVM estimator with the best found parameters on the whole + dataset. + + For multiple metric evaluation, this needs to be a string denoting the + scorer to be used to find the best parameters for refitting the + estimator at the end. + + The refitted estimator is made available at the :attr:`best_estimator_ + <.GenSVMGridSearchCV.best_estimator_>` attribute and allows the user to + use the :func:`~GenSVMGridSearchCV.predict` method directly on this + :class:`.GenSVMGridSearchCV` instance. + + Also for multiple metric evaluation, the attributes :attr:`best_index_ + <.GenSVMGridSearchCV.best_index_>`, :attr:`best_score_ + <.GenSVMGridSearchCV.best_score_>` and :attr:`best_params_ + <.GenSVMGridSearchCV.best_params_>` will only be available if ``refit`` + is set and all of them will be determined w.r.t this specific scorer. + + See ``scoring`` parameter to know more about multiple metric + evaluation. 
+ :type refit: boolean, or string, default=True + :param verbose: Controls the verbosity: the higher, the more messages. + :type verbose: integer + :param return_train_score: If ``False``, the :attr:`cv_results_ <.GenSVMGridSearchCV.cv_results_>` + attribute will not include training scores. + :type return_train_score: boolean, default=True + + .. rubric:: Examples + + >>> from gensvm import GenSVMGridSearchCV + >>> from sklearn.datasets import load_iris + >>> iris = load_iris() + >>> param_grid = {'p': [1.0, 2.0], 'kappa': [-0.9, 0.0, 1.0]} + >>> clf = GenSVMGridSearchCV(param_grid) + >>> clf.fit(iris.data, iris.target) + GenSVMGridSearchCV(cv=None, iid=True, + param_grid={'p': [1.0, 2.0], 'kappa': [-0.9, 0.0, 1.0]}, + refit=True, return_train_score=True, scoring=None, verbose=0) + + .. attribute:: cv_results_ + + *dict of numpy (masked) ndarrays* -- A dict with keys as column headers and values as columns, that can be + imported into a pandas `DataFrame`_. + + For instance the below given table + + +------------+-----------+------------+-----------------+---+---------+ + |param_kernel|param_gamma|param_degree|split0_test_score|...|rank_t...| + +============+===========+============+=================+===+=========+ + | 'poly' | -- | 2 | 0.8 |...| 2 | + +------------+-----------+------------+-----------------+---+---------+ + | 'poly' | -- | 3 | 0.7 |...| 4 | + +------------+-----------+------------+-----------------+---+---------+ + | 'rbf' | 0.1 | -- | 0.8 |...| 3 | + +------------+-----------+------------+-----------------+---+---------+ + | 'rbf' | 0.2 | -- | 0.9 |...| 1 | + +------------+-----------+------------+-----------------+---+---------+ + + will be represented by a ``cv_results_`` dict of:: + + { + 'param_kernel': masked_array(data = ['poly', 'poly', 'rbf', 'rbf'], + mask = [False False False False]...) 
+ 'param_gamma': masked_array(data = [-- -- 0.1 0.2], + mask = [ True True False False]...), + 'param_degree': masked_array(data = [2.0 3.0 -- --], + mask = [False False True True]...), + 'split0_test_score' : [0.8, 0.7, 0.8, 0.9], + 'split1_test_score' : [0.82, 0.5, 0.7, 0.78], + 'mean_test_score' : [0.81, 0.60, 0.75, 0.82], + 'std_test_score' : [0.02, 0.01, 0.03, 0.03], + 'rank_test_score' : [2, 4, 3, 1], + 'split0_train_score' : [0.8, 0.9, 0.7], + 'split1_train_score' : [0.82, 0.5, 0.7], + 'mean_train_score' : [0.81, 0.7, 0.7], + 'std_train_score' : [0.03, 0.03, 0.04], + 'mean_fit_time' : [0.73, 0.63, 0.43, 0.49], + 'std_fit_time' : [0.01, 0.02, 0.01, 0.01], + 'mean_score_time' : [0.007, 0.06, 0.04, 0.04], + 'std_score_time' : [0.001, 0.002, 0.003, 0.005], + 'params' : [{'kernel': 'poly', 'degree': 2}, ...], + } + + NOTE: + + The key ``'params'`` is used to store a list of parameter settings + dicts for all the parameter candidates. + + The ``mean_fit_time``, ``std_fit_time``, ``mean_score_time`` and + ``std_score_time`` are all in seconds. + + For multi-metric evaluation, the scores for all the scorers are + available in the :attr:`cv_results_ <.GenSVMGridSearchCV.cv_results_>` + dict at the keys ending with that scorer's name (``'_<scorer_name>'``) + instead of ``'_score'`` shown above. ('split0_test_precision', + 'mean_train_precision' etc.) + + .. attribute:: best_estimator_ + + *estimator or dict* -- Estimator that was chosen by the search, i.e. estimator which gave + highest score (or smallest loss if specified) on the left out data. Not + available if ``refit=False``. + + See ``refit`` parameter for more information on allowed values. + + .. attribute:: best_score_ + + *float* -- Mean cross-validated score of the best_estimator + + For multi-metric evaluation, this is present only if ``refit`` is + specified. + + .. attribute:: best_params_ + + *dict* -- Parameter setting that gave the best results on the hold out data. 
+ + For multi-metric evaluation, this is present only if ``refit`` is + specified. + + .. attribute:: best_index_ + + *int* -- The index (of the ``cv_results_`` arrays) which corresponds to the best + candidate parameter setting. + + The dict at ``search.cv_results_['params'][search.best_index_]`` gives + the parameter setting for the best model, that gives the highest mean + score (``search.best_score_``). + + For multi-metric evaluation, this is present only if ``refit`` is + specified. + + .. attribute:: scorer_ + + *function or a dict* -- Scorer function used on the held out data to choose the best parameters + for the model. + + For multi-metric evaluation, this attribute holds the validated + ``scoring`` dict which maps the scorer key to the scorer callable. + + .. attribute:: n_splits_ + + *int* -- The number of cross-validation splits (folds/iterations). + + .. rubric:: Notes + + The parameters selected are those that maximize the score of the left out + data, unless an explicit score is passed in which case it is used instead. + + .. seealso:: + + `ParameterGrid`_: + Generates all the combinations of a hyperparameter grid. + + :class:`.GenSVM`: + The GenSVM classifier + + .. _GridSearchCV: + http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html + .. _accuracy_score: + http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html + .. _scikit-learn User Guide on cross validation: + http://scikit-learn.org/stable/modules/cross_validation.html + + .. _ParameterGrid: + http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ParameterGrid.html + .. _DataFrame: + https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html + + + .. 
py:method:: GenSVMGridSearchCV.fit(X, y, groups=None) + :noindex: + :module: gensvm.gridsearch + + Run GenSVM grid search with all sets of parameters + + :param X: Training data, where n_samples is the number of observations and + n_features is the number of features. + :type X: array-like, shape = (n_samples, n_features) + :param y: Target vector for the training data. + :type y: array-like, shape = (n_samples, ) + :param groups: Group labels for the samples used while splitting the dataset into + train/test sets. + :type groups: array-like, with shape (n_samples, ), optional + + :returns: **self** -- Return self. + :rtype: object + + + .. py:method:: GenSVMGridSearchCV.predict(X, trainX=None) + :noindex: + :module: gensvm.gridsearch + + Predict the class labels on the test data + + :param X: Test data, where n_samples is the number of observations and + n_features is the number of features. + :type X: array-like, shape = (n_samples, n_features) + :param trainX: Only for nonlinear prediction with kernels: the training data used + to train the model. + :type trainX: array, shape = [n_train_samples, n_features] + + :returns: **y_pred** -- Predicted class labels of the data in X. + :rtype: array-like, shape = (n_samples, ) + + + .. py:method:: GenSVMGridSearchCV.score(X, y) + :noindex: + :module: gensvm.gridsearch + + Compute the score on the test data given the true labels + + :param X: Test data, where n_samples is the number of observations and + n_features is the number of features. + :type X: array-like, shape = (n_samples, n_features) + :param y: True labels for the test data. 
+ :type y: array-like, shape = (n_samples, ) + + :returns: **score** + :rtype: float + diff --git a/docs/generate_autodocs.py b/docs/generate_autodocs.py index 1aa8f7d..a0544ef 100644 --- a/docs/generate_autodocs.py +++ b/docs/generate_autodocs.py @@ -42,11 +42,11 @@ FULL_NAMES = { } OUTPUT_FILES = { - "GenSVMGridSearchCV": os.path.join(DOCDIR, "cls_gridsearch.rst"), - "GenSVM": os.path.join(DOCDIR, "cls_gensvm.rst"), - "load_grid_tiny": os.path.join(DOCDIR, "auto_functions.rst"), - "load_grid_small": os.path.join(DOCDIR, "auto_functions.rst"), - "load_grid_full": os.path.join(DOCDIR, "auto_functions.rst"), + "GenSVMGridSearchCV": os.path.join(DOCDIR, "cls_gridsearch.txt"), + "GenSVM": os.path.join(DOCDIR, "cls_gensvm.txt"), + "load_grid_tiny": os.path.join(DOCDIR, "auto_functions.txt"), + "load_grid_small": os.path.join(DOCDIR, "auto_functions.txt"), + "load_grid_full": os.path.join(DOCDIR, "auto_functions.txt"), } diff --git a/docs/index.rst b/docs/index.rst index 403dc8b..6845d73 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -5,7 +5,7 @@ .. toctree:: -.. include:: ../README.rst +.. include:: ./README.rst Classes ------- @@ -13,19 +13,19 @@ Classes GenSVM ^^^^^^ -.. include:: ./cls_gensvm.rst +.. include:: ./cls_gensvm.txt GenSVMGridSearchCV ^^^^^^^^^^^^^^^^^^ -.. include:: ./cls_gridsearch.rst +.. include:: ./cls_gridsearch.txt Functions --------- -.. include:: ./auto_functions.rst +.. include:: ./auto_functions.txt -.. include:: ./kernels.rst +.. include:: ./kernels.txt -.. include:: ../CHANGELOG.rst +.. 
include:: ./CHANGELOG.rst diff --git a/docs/kernels.rst b/docs/kernels.txt index 479b6c0..479b6c0 100644 --- a/docs/kernels.rst +++ b/docs/kernels.txt @@ -3,7 +3,6 @@ import os import re -import sys # Package meta-data AUTHOR = "Gertjan van den Burg" @@ -20,7 +19,7 @@ VERSION = None REQUIRED = ["scikit-learn", "numpy"] -docs_require = ["Sphinx==1.6.5", "sphinx_rtd_theme>=0.4.3"] +docs_require = ["Sphinx==1.6.5", "sphinx_rtd_theme>=0.4.3", "m2r"] test_require = [] dev_require = ["Cython"] @@ -333,7 +332,8 @@ if __name__ == "__main__": attr["version"] = version attr["description"] = DESCRIPTION - attr["long_description"] = read("README.rst") + attr["long_description"] = read("README.md") + attr["long_description_content_type"] = "text/markdown" attr["packages"] = [NAME] attr["url"] = URL attr["author"] = AUTHOR |
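
The setup.py hunk above switches the package's long description from reStructuredText to Markdown. For reference, setuptools takes the README text and its format in two separate keywords, `long_description` and `long_description_content_type`. The following is a minimal sketch of filling both from a file path, not the code used in this repository: the helper name `long_description_fields` and the `CONTENT_TYPES` table are hypothetical and introduced only for illustration.

```python
import os
import tempfile

# Content types accepted for a package's long description.
CONTENT_TYPES = {
    ".md": "text/markdown",
    ".rst": "text/x-rst",
    ".txt": "text/plain",
}


def long_description_fields(readme_path):
    """Return the two setup() keywords that describe a README file.

    Hypothetical helper for illustration; it is not part of setup.py in
    this repository.
    """
    ext = os.path.splitext(readme_path)[1].lower()
    with open(readme_path, encoding="utf-8") as fh:
        text = fh.read()
    return {
        # The README text itself.
        "long_description": text,
        # The format of that text, stored under its own key so that it
        # does not overwrite the text.
        "long_description_content_type": CONTENT_TYPES.get(ext, "text/plain"),
    }


if __name__ == "__main__":
    # Demonstrate on a throwaway README so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "README.md")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("# GenSVM Python Package\n")
        attr = long_description_fields(path)
        print(attr["long_description_content_type"])
```

In a setup script the returned dict could be merged into the keyword arguments with `attr.update(long_description_fields("README.md"))` before calling `setup(**attr)`.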
