author    Gertjan van den Burg <gertjanvandenburg@gmail.com>  2020-06-23 16:45:18 +0100
committer Gertjan van den Burg <gertjanvandenburg@gmail.com>  2020-06-23 16:45:18 +0100
commit    6f35564b83a9facf0c468742ce8d000427a58b97 (patch)
tree      ecbc0ea50cd271a8a642e431f1705e41b7162331
parent    Merge branch 'update'
Add additional documentation on using the code
-rw-r--r--  README.md                        253
-rw-r--r--  execs/R/utils.R                   72
-rw-r--r--  execs/python/cpdbench_utils.py    48
3 files changed, 372 insertions(+), 1 deletion(-)
diff --git a/README.md b/README.md
index 9256f01b..e72fec40 100644
--- a/README.md
+++ b/README.md
@@ -188,6 +188,259 @@ on your machine where you want to store the files (so that results are not
lost when the docker container closes, see [docker
volumes](https://docs.docker.com/storage/volumes/)).
+## Extending the Benchmark
+
+It should be relatively straightforward to extend the benchmark with your own
+methods and datasets.
+
+### Adding a new method
+
+To add a new method to the benchmark, you'll need to write a script in the
+``execs`` folder that takes a dataset file as input and computes the change
+point locations. Currently the methods are organized by language (R and
+Python), but you don't necessarily need to follow this structure when adding
+a new method. Please do check the existing code for inspiration though, as
+adding a new method is easiest when following the same structure.
+
+Experiments are managed using the [abed](https://github.com/GjjvdBurg/abed)
+command line application. This facilitates running all the methods with all
+their hyperparameter settings on all datasets.
+
+Note that the methods currently write their results to stdout, so any
+diagnostic printing in your script should go to stderr.
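+
+For example, a method script (the script name here is hypothetical) would be
+invoked along these lines, with the result JSON captured from stdout:
+
+```
+$ python ./execs/python/cpdbench_yourmethod.py -i ./datasets/nile/nile.json > result.json
+```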
+
+#### Python
+
+When adding a method in Python, you can start with the
+[cpdbench_zero.py](./execs/python/cpdbench_zero.py) file as a template, as
+this contains most of the boilerplate code. A script should accept a
+``-i/--input`` argument giving the path to a dataset file, and can
+optionally take further command line arguments for hyperparameter settings.
+Specifying these items on the command line facilitates reproducibility.
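+
+As a minimal sketch (the ``--param-1`` flag is a hypothetical hyperparameter;
+use whatever parameters your method needs), argument parsing could look like
+this:
+
+```python
+import argparse
+
+def parse_args():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("-i", "--input", required=True,
+                        help="path to the dataset file")
+    parser.add_argument("-o", "--output", help="path to the output file")
+    # one flag per tunable hyperparameter
+    parser.add_argument("--param-1", type=float, help="example hyperparameter")
+    return parser.parse_args()
+```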
+
+Roughly, the main function of a Python method could look like this:
+
+```python
+# Adding a new Python method to CPDBench
+
+import time
+
+# these helpers are provided in cpdbench_utils.py
+from cpdbench_utils import (
+    load_dataset,
+    make_param_dict,
+    exit_with_error,
+    exit_success,
+)
+
+def main():
+    args = parse_args()
+
+    # data is the raw dataset dictionary, mat is a T x d matrix of observations
+    data, mat = load_dataset(args.input)
+
+ # set algorithm parameters that are not varied in the grid search
+ defaults = {
+ 'param_1': value_1,
+ 'param_2': value_2
+ }
+
+ # combine command line arguments with defaults
+ parameters = make_param_dict(args, defaults)
+
+ # start the timer
+ start_time = time.time()
+ error = None
+ status = 'fail' # if not overwritten, it must have failed
+
+ # run the algorithm in a try/except
+ try:
+ locations = your_custom_method(mat, parameters)
+ status = 'success'
+ except Exception as err:
+ error = repr(err)
+
+ stop_time = time.time()
+ runtime = stop_time - start_time
+
+ # exit with error if the run failed
+ if status == 'fail':
+ exit_with_error(data, args, parameters, error, __file__)
+
+ # make sure locations are 0-based and integer!
+
+ exit_success(data, args, parameters, locations, runtime, __file__)
+```
+
+Remember to add the following to the bottom of the script so it can be run
+from the command line:
+
+```python
+if __name__ == '__main__':
+ main()
+```
+
+If you need to add a timeout to your method, take a look at the
+[BOCPDMS](./execs/python/cpdbench_bocpdms.py) example.
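+One possible approach (a sketch only, assuming a Unix system; the BOCPDMS
+script is the authoritative example) is to use ``signal.alarm``:
+
+```python
+import signal
+import time
+
+class Timeout(Exception):
+    pass
+
+def handler(signum, frame):
+    raise Timeout()
+
+signal.signal(signal.SIGALRM, handler)
+signal.alarm(1800)  # raise Timeout after 30 minutes
+
+start_time = time.time()
+try:
+    locations = your_custom_method(mat, parameters)
+except Timeout:
+    runtime = time.time() - start_time
+    exit_with_timeout(data, args, parameters, runtime, __file__)
+finally:
+    signal.alarm(0)  # cancel the alarm
+```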
+
+#### R
+
+Adding a method implemented in R to the benchmark can be done similarly to how
+it is done for Python. Again, the input file path and the hyperparameters are
+specified by command line arguments, which are parsed using
+[argparse](https://cran.r-project.org/web/packages/argparse/index.html). For R
+scripts we use a number of utility functions in the
+[utils.R](./execs/R/utils.R) file. To reliably load this file you can use the
+``load.utils()`` function available in all R scripts.
+
+The main function of a method implemented in R could be roughly as follows:
+
+```R
+main <- function()
+{
+ args <- parse.args()
+
+ # load the data
+ data <- load.dataset(args$input)
+
+ # create list of default algorithm parameters
+ defaults <- list(param_1=value_1, param_2=value_2)
+
+ # combine defaults and command line arguments
+ params <- make.param.list(args, defaults)
+
+ # Start the timer
+ start.time <- Sys.time()
+
+ # call the detection function in a tryCatch
+ result <- tryCatch({
+ locs <- your.custom.method(data$mat, params)
+ list(locations=locs, error=NULL)
+ }, error=function(e) {
+ return(list(locations=NULL, error=e$message))
+ })
+
+ stop.time <- Sys.time()
+
+ # Compute runtime, note units='secs' is not optional!
+ runtime <- difftime(stop.time, start.time, units='secs')
+
+ if (!is.null(result$error))
+ exit.with.error(data$original, args, params, result$error)
+
+ # convert result$locations to 0-based if needed
+
+    exit.success(data$original, args, params, result$locations, runtime)
+```
+
+Remember to add the following to the bottom of the script so it can be run
+from the command line:
+
+```R
+load.utils()
+main()
+```
+
+#### Adding the method to the experimental configuration
+
+When you've written the command line script to run your method and verified
+that it works correctly, it's time to add it to the experiment configuration.
+For this, we'll have to edit the [abed_conf.py](./abed_conf.py) file.
+
+1. To add your method, locate the ``METHODS`` list in the configuration file
+   and add entries ``best_<yourmethod>`` and ``default_<yourmethod>``,
+   replacing ``<yourmethod>`` with the name of your method (without spaces or
+   underscores). A sketch of all three changes follows this list.
+2. Next, add the method to the ``PARAMS`` dictionary. This is where you
+ specify all the hyperparameters that your method takes (for the ``best``
+ experiment). The hyperparameters are specified with a name and a list of
+ values to explore (see the current configuration for examples). For the
+ default experiment, add an entry ``"default_<yourmethod>" : {"no_param":
+ [0]}``. This ensures it will be run without any parameters.
+3. Finally, add the command that needs to be executed to run your method to
+ the ``COMMANDS`` dictionary. You'll need an entry for ``best_<yourmethod>``
+ and for ``default_<yourmethod>``. Please use the existing entries as
+ examples. Methods implemented in R are run with Rscript. The ``{execdir}``,
+ ``{datadir}``, and ``{dataset}`` values will be filled in by abed based on
+ the other settings. Use curly braces to specify hyperparameters, matching
+ the names of the fields in the ``PARAMS`` dictionary.
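+
+As an illustration, the entries for a hypothetical Python method ``foo``
+with a single ``threshold`` hyperparameter might look roughly as follows
+(the real file will differ in its details):
+
+```python
+METHODS = [
+    # ... existing methods ...
+    "best_foo",
+    "default_foo",
+]
+
+PARAMS = {
+    # ... existing entries ...
+    "best_foo": {"threshold": [0.1, 0.5, 1.0]},  # grid to explore
+    "default_foo": {"no_param": [0]},            # single run with defaults
+}
+
+COMMANDS = {
+    # ... existing entries ...
+    "best_foo": (
+        "python {execdir}/python/cpdbench_foo.py -i {datadir}/{dataset}.json "
+        "--threshold {threshold}"
+    ),
+    "default_foo": "python {execdir}/python/cpdbench_foo.py -i {datadir}/{dataset}.json",
+}
+```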
+
+
+#### Dependencies
+
+If your method needs external R or Python packages to operate, you can add
+them to the respective dependency lists.
+
+* For R, simply add the package name to the [Rpackages.txt](./Rpackages.txt)
+ file. Next, run ``make clean_R_venv`` and ``make R_venv`` to add the package
+ to the R virtual environment. It is recommended to be specific in the
+ version of the package you want to use in the ``Rpackages.txt`` file, for
+ future reference and reproducibility.
+* For Python, individual methods use individual virtual environments, as can
+ be seen from the bocpdms and rbocpdms examples. These virtual environments
+ need to be activated in the ``COMMANDS`` section of the ``abed_conf.py``
+file. Setting up these environments is done through the Makefile. Simply add
+a ``requirements.txt`` file in your method's directory, similar to what is
+done for bocpdms and rbocpdms, copy and edit the corresponding lines in the
+Makefile, and run ``make venv_<yourmethod>`` to build the virtual environment
+(the commands are summarized after this list).
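+
+In summary (``<yourmethod>`` is a placeholder for your method's name):
+
+```
+$ make clean_R_venv
+$ make R_venv
+$ make venv_<yourmethod>
+```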
+
+
+#### Running experiments
+
+When you've added the method and set up the environment, run
+
+```
+$ abed reload_tasks
+```
+
+to have abed generate the new tasks for your method (see above under [Getting
+Started](#getting-started)). Note that abed automatically does a Git commit
+when you do this, so you may want to switch to a separate branch. You can see
+the tasks that abed has generated (and thus the command that will be executed)
+using the command:
+
+```
+$ abed explain_tbd_tasks
+```
+
+If you're satisfied with the commands, you can run the experiments using:
+
+```
+$ mpiexec -np 4 abed local
+```
+
+You can subsequently use the Makefile to generate updated figures and tables
+with your method or dataset.
+
+### Adding a new dataset
+
+To add a new dataset to the benchmark you'll need both a dataset file (in JSON
+format) and annotations (for evaluation). More information on how the datasets
+are constructed can be found in the
+[TCPD](https://github.com/alan-turing-institute/TCPD) repository, which also
+includes a schema file. A high-level overview is as follows (a minimal
+example follows the list):
+
+* Each dataset has a short name in the ``name`` field and a longer more
+ descriptive name in the ``longname`` field. The ``name`` field must be
+ unique.
+* The number of observations and the number of dimensions are given in the
+  ``n_obs`` and ``n_dim`` fields.
+* The time axis is defined in the ``time`` field. This has at least an
+  ``index`` field to mark the indices of each data point. At the moment, these
+  indices need to be consecutive integers. This entry mainly exists for a
+  future scenario where we may want to consider non-consecutive time axes. If
+  the time axis can be mapped to a date or time, then a type and format of
+  this field can be specified (see e.g. the [nile
+  dataset](https://github.com/alan-turing-institute/TCPD/blob/master/datasets/nile/nile.json#L8),
+  which has year labels).
+* The actual observations are specified in the ``series`` field. This is an
+ ordered list of JSON objects, one for each dimension. Every dimension has a
+ label, a data type, and a ``"raw"`` field with the actual observations.
+ Missing values in the time series can be marked with ``null`` (see e.g.
+ [uk_coal_employ](https://github.com/alan-turing-institute/TCPD/blob/master/datasets/uk_coal_employ/uk_coal_employ.json#L236)
+ for an example).
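+
+Putting this together, a minimal one-dimensional dataset could look roughly
+like this (values are made up; consult the schema in the TCPD repository for
+the authoritative format):
+
+```json
+{
+    "name": "example",
+    "longname": "Example dataset",
+    "n_obs": 4,
+    "n_dim": 1,
+    "time": {
+        "index": [0, 1, 2, 3]
+    },
+    "series": [
+        {
+            "label": "V1",
+            "type": "float",
+            "raw": [1.0, 2.0, null, 4.0]
+        }
+    ]
+}
+```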
+
+If you want to evaluate the methods in the benchmark on a new dataset, you may
+want to collect annotations for the dataset. These annotations can be
+collected in the [annotations.json](./analysis/annotations/annotations.json)
+file, which is an object that maps each dataset name to a map from
+annotator IDs to the marked change points. You can collect annotations using
+the [annotation tool](https://github.com/alan-turing-institute/annotatechange)
+created for this project.
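+
+Schematically, the annotations file looks like this (the dataset name,
+annotator IDs, and locations here are hypothetical):
+
+```json
+{
+    "example": {
+        "1": [10, 42],
+        "2": [11, 42]
+    }
+}
+```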
+
+Finally, add your dataset to the ``DATASETS`` field in the ``abed_conf.py``
+file. Proceed with running the experiments as described above.
+
## License
The code in this repository is licensed under the MIT license, unless
diff --git a/execs/R/utils.R b/execs/R/utils.R
index 504b5373..a170a1c0 100644
--- a/execs/R/utils.R
+++ b/execs/R/utils.R
@@ -10,6 +10,16 @@ library(RJSONIO)
printf <- function(...) invisible(cat(sprintf(...)));
+#' Load a TCPDBench dataset
+#'
+#' This function reads in a JSON dataset in TCPDBench format (see TCPD
+#' repository for schema) and creates a matrix representation of the dataset.
+#' The dataset is scaled in the process.
+#'
+#' @param filename Path to the JSON file
+#' @return List object with the raw data in the \code{original} field, the time
+#' index in the \code{time} field, and the data matrix in the \code{mat} field.
+#'
load.dataset <- function(filename)
{
data <- fromJSON(filename)
@@ -48,6 +58,28 @@ load.dataset <- function(filename)
return(out)
}
+#' Prepare the experiment output
+#'
+#' This function creates a list of the necessary output data. This includes the
+#' exact command that was run, dataset and script information, the hostname,
+#' output status, any errors if present, and the detected change point location
+#' and runtime.
+#'
+#' @param data the raw data loaded from the JSON file
+#' @param data.filename the path to the dataset file
+#' @param status the output status code of the experiment. Currently in use are
+#' 'SUCCESS' for when an experiment exited successfully, 'TIMEOUT' if the
+#' experiment exceeded a limit on runtime, 'SKIP' if the method was supplied
+#' with improper hyperparameters, and 'FAIL' if an error occurred.
+#' @param error a description of the error, if one occurred
+#' @param params input parameters (including defaults) to the method
+#' @param locations detected change point locations (note: these locations
+#' are 0-based, whereas R array indices are 1-based, so be sure to convert
+#' them accordingly. Change point locations should be integers on the interval
+#' [0, T-1], including both endpoints).
+#' @param runtime the runtime of the method.
+#'
+#' @return list with all the necessary output fields.
prepare.result <- function(data, data.filename, status, error,
params, locations, runtime) {
out <- list(error=NULL)
@@ -94,6 +126,13 @@ prepare.result <- function(data, data.filename, status, error,
return(out)
}
+#' Combine default parameters and command line arguments
+#'
+#' @param args the command line arguments
+#' @param defaults default algorithm parameters
+#' @return a combined list with both the default parameter settings and those
+#' provided on the command line. If a parameter appears both in the defaults
+#' and on the command line, the command line value takes precedence.
make.param.list <- function(args, defaults)
{
params <- defaults
@@ -106,6 +145,14 @@ make.param.list <- function(args, defaults)
return(params)
}
+#' Write output to a file or stdout
+#'
+#' This function takes an output list generated by \code{\link{prepare.result}}
+#' and writes it out as JSON, to a file if one is provided or to stdout
+#' otherwise.
+#'
+#' @param out experimental results as a list
+#' @param filename (optional) output file to write to
+#'
dump.output <- function(out, filename) {
json.out <- toJSON(out, pretty=T)
if (!is.null(filename))
@@ -114,6 +161,16 @@ dump.output <- function(out, filename) {
cat(json.out, '\n')
}
+#' Exit with SKIP status due to multidimensional data
+#'
+#' This is a shorthand for \code{\link{exit.with.error}} where the error is
+#' already set for methods that don't handle multidimensional data. Writes out
+#' the data and exits.
+#'
+#' @param data original data loaded by \code{\link{load.dataset}}
+#' @param args command line arguments
+#' @param params combined hyperparameters generated by
+#' \code{\link{make.param.list}}
exit.error.multidim <- function(data, args, params) {
status = 'SKIP'
error = 'This method has no support for multidimensional data.'
@@ -122,6 +179,13 @@ exit.error.multidim <- function(data, args, params) {
quit(save='no')
}
+#' Exit with FAIL status and a custom error message
+#'
+#' @param data original data loaded by \code{\link{load.dataset}}
+#' @param args command line arguments
+#' @param params combined hyperparameters generated by
+#' \code{\link{make.param.list}}
+#' @param error custom error message
exit.with.error <- function(data, args, params, error) {
status = 'FAIL'
out <- prepare.result(data, args$input, status, error, params, NULL, NULL)
@@ -129,6 +193,14 @@ exit.with.error <- function(data, args, params, error) {
quit(save='no')
}
+#' Exit with SUCCESS status
+#'
+#' @param data original data loaded by \code{\link{load.dataset}}
+#' @param args command line arguments
+#' @param params combined hyperparameters generated by
+#' \code{\link{make.param.list}}
+#' @param locations detected change point locations (0-based!)
+#' @param runtime runtime in seconds
exit.success <- function(data, args, params, locations, runtime) {
status = 'SUCCESS'
error = NULL
diff --git a/execs/python/cpdbench_utils.py b/execs/python/cpdbench_utils.py
index cb074c69..65e632c1 100644
--- a/execs/python/cpdbench_utils.py
+++ b/execs/python/cpdbench_utils.py
@@ -19,6 +19,7 @@ import sys
def md5sum(filename):
+ """Compute the MD5 checksum of a given file"""
blocksize = 65536
hasher = hashlib.md5()
with open(filename, "rb") as fp:
@@ -30,6 +31,7 @@ def md5sum(filename):
def load_dataset(filename):
+ """ Load a CPDBench dataset """
with open(filename, "r") as fp:
data = json.load(fp)
@@ -58,6 +60,45 @@ def prepare_result(
runtime,
script_filename,
):
+ """Prepare the experiment output as a dictionary
+
+ Parameters
+ ----------
+ data : dict
+ The CPDBench dataset object
+
+ data_filename : str
+ Absolute path to the dataset file
+
+ status : str
+        Status of the experiment. Commonly used status codes are: SUCCESS if
+        the experiment was successful, SKIP if the method was provided improper
+        parameters, FAIL if the method failed for whatever reason, and TIMEOUT
+        if the method ran too long.
+
+ error : str
+ If an error occurred, this field can be used to describe what it is.
+
+ params : dict
+        Dictionary of parameters provided to the method. It is good to be as
+        complete as possible, so even default parameter values should be added
+        to this field. This enhances reproducibility.
+
+ locations : list
+ Detected change point locations. Remember that change locations are
+ indices of time points and are 0-based (start counting at zero, thus
+ change locations are integers on the interval [0, T-1], including both
+ endpoints).
+
+ runtime : float
+ Runtime of the method. This should be computed as accurately as
+ possible, excluding any method-specific setup code.
+
+    script_filename : str
+ Path to the script of the method. This is hashed to enable rough
+ versioning.
+
+ """
out = {}
# record the command that was used
@@ -88,7 +129,7 @@ def prepare_result(
def dump_output(output, filename=None):
- """Save result to output file or write to stdout """
+ """Save result to output file or write to stdout (json format)"""
if filename is None:
print(json.dumps(output, sort_keys=True, indent="\t"))
else:
@@ -97,6 +138,7 @@ def dump_output(output, filename=None):
def make_param_dict(args, defaults):
+ """Create the parameter dict combining CLI arguments and defaults"""
params = copy.deepcopy(vars(args))
del params["input"]
if "output" in params:
@@ -106,6 +148,7 @@ def make_param_dict(args, defaults):
def exit_with_error(data, args, parameters, error, script_filename):
+ """Exit and save result using the 'FAIL' exit status"""
status = "FAIL"
out = prepare_result(
data,
@@ -120,7 +163,9 @@ def exit_with_error(data, args, parameters, error, script_filename):
dump_output(out, args.output)
raise SystemExit
+
def exit_with_timeout(data, args, parameters, runtime, script_filename):
+ """Exit and save result using the 'TIMEOUT' exit status"""
status = "TIMEOUT"
out = prepare_result(
data,
@@ -137,6 +182,7 @@ def exit_with_timeout(data, args, parameters, runtime, script_filename):
def exit_success(data, args, parameters, locations, runtime, script_filename):
+ """Exit and save result using the 'SUCCESS' exit status"""
status = "SUCCESS"
error = None
out = prepare_result(