| author | Gertjan van den Burg <gertjanvandenburg@gmail.com> | 2020-06-23 16:45:18 +0100 |
|---|---|---|
| committer | Gertjan van den Burg <gertjanvandenburg@gmail.com> | 2020-06-23 16:45:18 +0100 |
| commit | 6f35564b83a9facf0c468742ce8d000427a58b97 | |
| tree | ecbc0ea50cd271a8a642e431f1705e41b7162331 | |
| parent | Merge branch 'update' | |
Add additional documentation on using the code
| -rw-r--r-- | README.md | 253 |
| -rw-r--r-- | execs/R/utils.R | 72 |
| -rw-r--r-- | execs/python/cpdbench_utils.py | 48 |
3 files changed, 372 insertions, 1 deletion
diff --git a/README.md b/README.md
--- a/README.md
+++ b/README.md
@@ -188,6 +188,259 @@ on your machine where you want to store the files
 (so that results are not lost when the docker container closes, see
 [docker volumes](https://docs.docker.com/storage/volumes/)).
 
+## Extending the Benchmark
+
+It should be relatively straightforward to extend the benchmark with your own
+methods and datasets.
+
+### Adding a new method
+
+To add a new method to the benchmark, you'll need to write a script in the
+``execs`` folder that takes a dataset file as input and computes the change
+point locations. Currently the methods are organized by language (R and
+Python), but you don't need to follow this structure when adding a new
+method. Please do check the existing code for inspiration though, as adding
+a new method is probably easiest when following the same structure.
+
+Experiments are managed using the [abed](https://github.com/GjjvdBurg/abed)
+command line application. This facilitates running all the methods with all
+their hyperparameter settings on all datasets.
+
+Note that the methods currently write their results to stdout, so if you
+want to print debug information from your script, use stderr.
+
+#### Python
+
+When adding a method in Python, you can start with the
+[cpdbench_zero.py](./execs/python/cpdbench_zero.py) file as a template, as
+this contains most of the boilerplate code. A script should take command
+line arguments where ``-i/--input`` marks the path to a dataset file and can
+optionally take further command line arguments for hyperparameter settings.
+Specifying these items from the command line facilitates reproducibility.
+
+Roughly, the main function of a Python method could look like this:
+
+```python
+# Adding a new Python method to CPDBench
+import time
+
+# shared utilities from execs/python/cpdbench_utils.py
+from cpdbench_utils import (
+    load_dataset,
+    make_param_dict,
+    exit_with_error,
+    exit_success,
+)
+
+
+def main():
+    args = parse_args()  # script-specific argument parsing
+
+    # data is the raw dataset dictionary, mat is a T x d matrix of observations
+    data, mat = load_dataset(args.input)
+
+    # set algorithm parameters that are not varied in the grid search
+    defaults = {
+        'param_1': value_1,
+        'param_2': value_2
+    }
+
+    # combine command line arguments with defaults
+    parameters = make_param_dict(args, defaults)
+
+    # start the timer
+    start_time = time.time()
+    error = None
+    status = 'fail'  # if not overwritten, it must have failed
+
+    # run the algorithm in a try/except
+    try:
+        locations = your_custom_method(mat, parameters)
+        status = 'success'
+    except Exception as err:
+        error = repr(err)
+
+    stop_time = time.time()
+    runtime = stop_time - start_time
+
+    # exit with error if the run failed
+    if status == 'fail':
+        exit_with_error(data, args, parameters, error, __file__)
+
+    # make sure locations are 0-based and integer!
+
+    exit_success(data, args, parameters, locations, runtime, __file__)
+```
+
+Remember to add the following to the bottom of the script so it can be run
+from the command line:
+
+```python
+if __name__ == '__main__':
+    main()
+```
+
+If you need to add a timeout to your method, take a look at the
+[BOCPDMS](./execs/python/cpdbench_bocpdms.py) example; a rough sketch of one
+possible approach follows below.
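+
+A minimal sketch of one way to enforce a timeout using only the standard
+library (this is an illustration, not necessarily the mechanism used in the
+BOCPDMS script; ``your_custom_method`` and ``max_seconds`` are
+placeholders). If the call times out, you can record the elapsed time and
+exit through ``exit_with_timeout`` from ``cpdbench_utils``:
+
+```python
+import multiprocessing
+
+
+def run_with_timeout(func, func_args, max_seconds):
+    """Run func(*func_args) in a worker process; return None on timeout."""
+    pool = multiprocessing.Pool(processes=1)
+    async_result = pool.apply_async(func, func_args)
+    try:
+        return async_result.get(timeout=max_seconds)
+    except multiprocessing.TimeoutError:
+        return None
+    finally:
+        pool.terminate()
+```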
+
+#### R
+
+Adding a method implemented in R to the benchmark can be done similarly to
+how it is done for Python. Again, the input file path and the
+hyperparameters are specified by command line arguments, which are parsed
+using
+[argparse](https://cran.r-project.org/web/packages/argparse/index.html). For
+R scripts we use a number of utility functions in the
+[utils.R](./execs/R/utils.R) file. To reliably load this file you can use
+the ``load.utils()`` function available in all R scripts.
+
+The main function of a method implemented in R could be roughly as follows:
+
+```R
+main <- function()
+{
+    args <- parse.args()  # script-specific argument parsing
+
+    # load the data
+    data <- load.dataset(args$input)
+
+    # create a list of default algorithm parameters
+    defaults <- list(param_1=value_1, param_2=value_2)
+
+    # combine defaults and command line arguments
+    params <- make.param.list(args, defaults)
+
+    # start the timer
+    start.time <- Sys.time()
+
+    # call the detection function in a tryCatch
+    result <- tryCatch({
+        locs <- your.custom.method(data$mat, params)
+        list(locations=locs, error=NULL)
+    }, error=function(e) {
+        return(list(locations=NULL, error=e$message))
+    })
+
+    stop.time <- Sys.time()
+
+    # compute the runtime; note that units='secs' is not optional!
+    runtime <- difftime(stop.time, start.time, units='secs')
+
+    if (!is.null(result$error))
+        exit.with.error(data$original, args, params, result$error)
+
+    # convert the locations to 0-based integers if needed, for example:
+    locations <- result$locations - 1
+
+    exit.success(data$original, args, params, locations, runtime)
+}
+```
+
+Remember to add the following to the bottom of the script so it can be run
+from the command line:
+
+```R
+load.utils()
+main()
+```
+
+#### Adding the method to the experimental configuration
+
+When you've written the command line script to run your method and verified
+that it works correctly, it's time to add it to the experiment
+configuration. For this, we'll have to edit the
+[abed_conf.py](./abed_conf.py) file.
+
+1. To add your method, locate the ``METHODS`` list in the configuration file
+   and add an entry ``best_<yourmethod>`` and ``default_<yourmethod>``,
+   replacing ``<yourmethod>`` with the name of your method (without spaces
+   or underscores).
+2. Next, add the method to the ``PARAMS`` dictionary. This is where you
+   specify all the hyperparameters that your method takes (for the ``best``
+   experiment). The hyperparameters are specified with a name and a list of
+   values to explore (see the current configuration for examples). For the
+   default experiment, add an entry ``"default_<yourmethod>" : {"no_param":
+   [0]}``. This ensures it will be run without any parameters.
+3. Finally, add the command that needs to be executed to run your method to
+   the ``COMMANDS`` dictionary. You'll need an entry for
+   ``best_<yourmethod>`` and for ``default_<yourmethod>``. Please use the
+   existing entries as examples. Methods implemented in R are run with
+   Rscript. The ``{execdir}``, ``{datadir}``, and ``{dataset}`` values will
+   be filled in by abed based on the other settings. Use curly braces to
+   specify hyperparameters, matching the names of the fields in the
+   ``PARAMS`` dictionary. A sketch of all three entries is shown after this
+   list.
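+
+As an illustration, the entries for a hypothetical Python method called
+``mymethod`` with a single hyperparameter ``penalty`` might look as follows
+(the parameter values and the exact command are placeholders; Python methods
+will typically also need to activate their virtual environment in the
+command, see Dependencies below):
+
+```python
+METHODS = [
+    # ... existing methods ...
+    "best_mymethod",
+    "default_mymethod",
+]
+
+PARAMS = {
+    # grid of hyperparameter values to explore in the "best" experiment
+    "best_mymethod": {"penalty": [10, 50, 100]},
+    # the default experiment runs without any parameters
+    "default_mymethod": {"no_param": [0]},
+}
+
+COMMANDS = {
+    # {execdir}, {datadir}, and {dataset} are filled in by abed;
+    # {penalty} matches the field name in PARAMS
+    "best_mymethod": "python {execdir}/python/cpdbench_mymethod.py "
+    "-i {datadir}/{dataset}.json --penalty {penalty}",
+    "default_mymethod": "python {execdir}/python/cpdbench_mymethod.py "
+    "-i {datadir}/{dataset}.json",
+}
+```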
+
+#### Dependencies
+
+If your method needs external R or Python packages to operate, you can add
+them to the respective dependency lists.
+
+* For R, simply add the package name to the [Rpackages.txt](./Rpackages.txt)
+  file. Next, run ``make clean_R_venv`` and ``make R_venv`` to add the
+  package to the R virtual environment. It is recommended to pin the
+  specific version of the package in the ``Rpackages.txt`` file, for future
+  reference and reproducibility.
+* For Python, individual methods use individual virtual environments, as can
+  be seen from the bocpdms and rbocpdms examples. These virtual environments
+  need to be activated in the ``COMMANDS`` section of the ``abed_conf.py``
+  file. Setting up these environments is done through the Makefile. Simply
+  add a ``requirements.txt`` file for your method similarly to what is done
+  for bocpdms and rbocpdms, copy and edit the corresponding lines in the
+  Makefile, and run ``make venv_<yourmethod>`` to build the virtual
+  environment.
+
+#### Running experiments
+
+When you've added the method and set up the environment, run
+
+```
+$ abed reload_tasks
+```
+
+to have abed generate the new tasks for your method (see above under
+[Getting Started](#getting-started)). Note that abed automatically does a
+Git commit when you do this, so you may want to switch to a separate branch.
+You can see the tasks that abed has generated (and thus the commands that
+will be executed) using:
+
+```
+$ abed explain_tbd_tasks
+```
+
+If you're satisfied with the commands, you can run the experiments using:
+
+```
+$ mpiexec -np 4 abed local
+```
+
+You can subsequently use the Makefile to generate updated figures and tables
+that include your method or dataset.
+
+### Adding a new dataset
+
+To add a new dataset to the benchmark you'll need both a dataset file (in
+JSON format) and annotations (for evaluation). More information on how the
+datasets are constructed can be found in the
+[TCPD](https://github.com/alan-turing-institute/TCPD) repository, which also
+includes a schema file. A high-level overview of the format is as follows (a
+minimal example is sketched after this list):
+
+* Each dataset has a short name in the ``name`` field and a longer, more
+  descriptive name in the ``longname`` field. The ``name`` field must be
+  unique.
+* The number of observations and the number of dimensions are defined in the
+  ``n_obs`` and ``n_dim`` fields, respectively.
+* The time axis is defined in the ``time`` field. This has at least an
+  ``index`` field to mark the indices of each data point. At the moment,
+  these indices need to be consecutive integers. This entry mainly exists
+  for a future scenario where we may want to consider non-consecutive time
+  axes. If the time axis can be mapped to a date or time, then the type and
+  format of this field can be specified (see e.g. the [nile
+  dataset](https://github.com/alan-turing-institute/TCPD/blob/master/datasets/nile/nile.json#L8),
+  which has year labels).
+* The actual observations are specified in the ``series`` field. This is an
+  ordered list of JSON objects, one for each dimension. Every dimension has
+  a label, a data type, and a ``"raw"`` field with the actual observations.
+  Missing values in the time series can be marked with ``null`` (see e.g.
+  [uk_coal_employ](https://github.com/alan-turing-institute/TCPD/blob/master/datasets/uk_coal_employ/uk_coal_employ.json#L236)
+  for an example).
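+
+A minimal one-dimensional example based on the fields described above (all
+values are made up; see the TCPD schema file for the authoritative
+definition):
+
+```json
+{
+    "name": "example",
+    "longname": "Example dataset",
+    "n_obs": 4,
+    "n_dim": 1,
+    "time": {
+        "index": [0, 1, 2, 3]
+    },
+    "series": [
+        {
+            "label": "V1",
+            "type": "float",
+            "raw": [10.2, 10.4, null, 12.1]
+        }
+    ]
+}
+```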
+
+If you want to evaluate the methods in the benchmark on a new dataset, you
+may want to collect annotations for the dataset. These annotations can be
+collected in the
+[annotations.json](./analysis/annotations/annotations.json) file, which is
+an object that maps each dataset name to a map from annotator IDs to the
+marked change points, as illustrated below. You can collect annotations
+using the [annotation
+tool](https://github.com/alan-turing-institute/annotatechange) created for
+this project.
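+
+A hypothetical entry (the dataset name, annotator IDs, and change point
+indices are all made up):
+
+```json
+{
+    "example": {
+        "annotator_1": [28],
+        "annotator_2": [28, 45]
+    }
+}
+```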
+
+Finally, add your dataset to the ``DATASETS`` field in the ``abed_conf.py``
+file. Proceed with running the experiments as described above.
+
 ## License
 
 The code in this repository is licensed under the MIT license, unless
diff --git a/execs/R/utils.R b/execs/R/utils.R
index 504b5373..a170a1c0 100644
--- a/execs/R/utils.R
+++ b/execs/R/utils.R
@@ -10,6 +10,16 @@ library(RJSONIO)
 
 printf <- function(...)
     invisible(cat(sprintf(...)));
 
+#' Load a TCPDBench dataset
+#'
+#' This function reads in a JSON dataset in TCPDBench format (see the TCPD
+#' repository for the schema) and creates a matrix representation of the
+#' dataset. The dataset is scaled in the process.
+#'
+#' @param filename Path to the JSON file
+#' @return List object with the raw data in the \code{original} field, the
+#' time index in the \code{time} field, and the data matrix in the
+#' \code{mat} field.
+#'
 load.dataset <- function(filename)
 {
     data <- fromJSON(filename)
@@ -48,6 +58,28 @@ load.dataset <- function(filename)
     return(out)
 }
 
+#' Prepare the experiment output
+#'
+#' This function creates a list of the necessary output data. This includes
+#' the exact command that was run, dataset and script information, the
+#' hostname, the output status, any errors if present, and the detected
+#' change point locations and the runtime.
+#'
+#' @param data the raw data loaded from the JSON file
+#' @param data.filename the path to the dataset file
+#' @param status the output status code of the experiment. Currently in use
+#' are 'SUCCESS' for when an experiment exited successfully, 'TIMEOUT' if
+#' the experiment exceeded a limit on runtime, 'SKIP' if the method was
+#' supplied with improper hyperparameters, and 'FAIL' if an error occurred.
+#' @param error a description of the error, if one occurred
+#' @param params input parameters (including defaults) to the method
+#' @param locations detected change point locations. Important: these
+#' locations are 0-based, whereas R array indices are 1-based, so convert
+#' them accordingly. Change point locations should be integers on the
+#' interval [0, T-1], including both endpoints.
+#' @param runtime the runtime of the method
+#'
+#' @return list with all the necessary output fields
 prepare.result <- function(data, data.filename, status, error,
                            params, locations, runtime)
 {
     out <- list(error=NULL)
@@ -94,6 +126,13 @@ prepare.result <- function(data, data.filename, status, error,
     return(out)
 }
 
+#' Combine default parameters and command line arguments
+#'
+#' @param args the command line arguments
+#' @param defaults default algorithm parameters
+#' @return a combined list with both the default parameter settings and
+#' those provided on the command line. If a parameter appears in both the
+#' defaults and the command line arguments, the command line value takes
+#' precedence.
 make.param.list <- function(args, defaults)
 {
     params <- defaults
@@ -106,6 +145,14 @@ make.param.list <- function(args, defaults)
     return(params)
 }
 
+#' Write output to a file or stdout
+#'
+#' This function takes an output list generated by
+#' \code{\link{prepare.result}} and writes it out as JSON, to a file if one
+#' is provided and to stdout otherwise.
+#'
+#' @param out experimental results as a list
+#' @param filename (optional) output file to write to
+#'
 dump.output <- function(out, filename) {
     json.out <- toJSON(out, pretty=T)
     if (!is.null(filename))
@@ -114,6 +161,16 @@ dump.output <- function(out, filename) {
     cat(json.out, '\n')
 }
 
+#' Exit with SKIP status due to multidimensional data
+#'
+#' This is a shorthand for \code{\link{exit.with.error}} with the error
+#' message already set, for methods that don't support multidimensional
+#' data. Writes out the result and exits.
+#'
+#' @param data original data loaded by \code{\link{load.dataset}}
+#' @param args command line arguments
+#' @param params combined hyperparameters generated by
+#' \code{\link{make.param.list}}
 exit.error.multidim <- function(data, args, params) {
     status = 'SKIP'
     error = 'This method has no support for multidimensional data.'
@@ -122,6 +179,13 @@ exit.error.multidim <- function(data, args, params) {
     quit(save='no')
 }
 
+#' Exit with FAIL status and a custom error message
+#'
+#' @param data original data loaded by \code{\link{load.dataset}}
+#' @param args command line arguments
+#' @param params combined hyperparameters generated by
+#' \code{\link{make.param.list}}
+#' @param error custom error message
 exit.with.error <- function(data, args, params, error) {
     status = 'FAIL'
     out <- prepare.result(data, args$input, status, error, params, NULL, NULL)
@@ -129,6 +193,14 @@ exit.with.error <- function(data, args, params, error) {
     quit(save='no')
 }
 
+#' Exit with SUCCESS status
+#'
+#' @param data original data loaded by \code{\link{load.dataset}}
+#' @param args command line arguments
+#' @param params combined hyperparameters generated by
+#' \code{\link{make.param.list}}
+#' @param locations detected change point locations (0-based!)
+#' @param runtime runtime in seconds
 exit.success <- function(data, args, params, locations, runtime) {
     status = 'SUCCESS'
     error = NULL
diff --git a/execs/python/cpdbench_utils.py b/execs/python/cpdbench_utils.py
index cb074c69..65e632c1 100644
--- a/execs/python/cpdbench_utils.py
+++ b/execs/python/cpdbench_utils.py
@@ -19,6 +19,7 @@ import sys
 
 
 def md5sum(filename):
+    """Compute the MD5 checksum of a given file"""
     blocksize = 65536
     hasher = hashlib.md5()
     with open(filename, "rb") as fp:
@@ -30,6 +31,7 @@ def md5sum(filename):
 
 
 def load_dataset(filename):
+    """Load a CPDBench dataset"""
     with open(filename, "r") as fp:
         data = json.load(fp)
 
@@ -58,6 +60,45 @@ def prepare_result(
     runtime,
     script_filename,
 ):
+    """Prepare the experiment output as a dictionary
+
+    Parameters
+    ----------
+    data : dict
+        The CPDBench dataset object
+
+    data_filename : str
+        Absolute path to the dataset file
+
+    status : str
+        Status of the experiment. Commonly used status codes are: SUCCESS if
+        the experiment was successful, SKIP if the method was provided
+        improper parameters, FAIL if the method failed for whatever reason,
+        and TIMEOUT if the method ran too long.
+
+    error : str
+        If an error occurred, this field can be used to describe what it is.
+
+    params : dict
+        Dictionary of parameters provided to the method. It is good to be as
+        complete as possible, so even default parameter values should be
+        added to this field. This enhances reproducibility.
+
+    locations : list
+        Detected change point locations. Remember that change locations are
+        indices of time points and are 0-based (start counting at zero), so
+        change locations are integers on the interval [0, T-1], including
+        both endpoints.
+
+    runtime : float
+        Runtime of the method. This should be computed as accurately as
+        possible, excluding any method-specific setup code.
+
+    script_filename : str
+        Path to the script of the method. This is hashed to enable rough
+        versioning.
+
+    """
     out = {}
 
     # record the command that was used
@@ -88,7 +129,7 @@ def prepare_result(
 
 
 def dump_output(output, filename=None):
-    """Save result to output file or write to stdout """
+    """Save result to output file or write to stdout (json format)"""
     if filename is None:
         print(json.dumps(output, sort_keys=True, indent="\t"))
     else:
@@ -97,6 +138,7 @@ def dump_output(output, filename=None):
 
 
 def make_param_dict(args, defaults):
+    """Create the parameter dict combining CLI arguments and defaults"""
     params = copy.deepcopy(vars(args))
     del params["input"]
     if "output" in params:
@@ -106,6 +148,7 @@ def make_param_dict(args, defaults):
 
 
 def exit_with_error(data, args, parameters, error, script_filename):
+    """Exit and save result using the 'FAIL' exit status"""
    status = "FAIL"
    out = prepare_result(
        data,
@@ -120,7 +163,9 @@ def exit_with_error(data, args, parameters, error, script_filename):
     dump_output(out, args.output)
     raise SystemExit
 
+
 def exit_with_timeout(data, args, parameters, runtime, script_filename):
+    """Exit and save result using the 'TIMEOUT' exit status"""
     status = "TIMEOUT"
     out = prepare_result(
         data,
@@ -137,6 +182,7 @@ def exit_with_timeout(data, args, parameters, runtime, script_filename):
 
 
 def exit_success(data, args, parameters, locations, runtime, script_filename):
+    """Exit and save result using the 'SUCCESS' exit status"""
     status = "SUCCESS"
     error = None
     out = prepare_result(
