Contributing Guide#

scikit-activeml is a library that implements the most important query strategies of active learning. It is built upon the well-known machine learning framework scikit-learn.

Overview#

Our philosophy is to extend the sklearn ecosystem with the most relevant query strategies for active learning and to implement tools for working with partially unlabeled data. An overview of our repository’s structure is given in the image below. Each node represents a class or interface. The arrows illustrate the inheritance hierarchy among them. The functionality of a dashed node is not yet available in our library.

https://raw.githubusercontent.com/scikit-activeml/scikit-activeml/master/docs/logos/scikit-activeml-structure.png

In our package skactiveml, there are three major components, i.e., SkactivemlClassifier, SkactivemlRegressor, and QueryStrategy. The classifier and regressor modules are necessary to deal with partially unlabeled data and to implement active-learning-specific estimators. This way, an active learning cycle can easily be implemented to start with zero initial labels. Regarding the active learning query strategies, we currently distinguish between the pool-based (a large pool of unlabeled samples is available) and the stream-based (unlabeled samples arrive sequentially, i.e., as a stream) paradigm. Within both paradigms, we also distinguish between the single- and multi-annotator settings. In the latter setting, multiple error-prone annotators are queried to provide labels. As a result, an active learning query strategy decides not only which samples but also which annotators should be queried.

Thank you, contributors!#

A big thank you to all contributors who provide the scikit-activeml project with new enhancements and bug fixes.

Getting Help#

If you have any questions, please reach out to other developers via the following channels:

Roadmap#

Our roadmap is summarized in the issue Upcoming Features.

Get Started#

Before you can contribute to this project, complete the following steps.

Setup Development Environment#

There are several ways to create a local Python environment, such as virtualenv, pipenv, miniconda, etc. One possible workflow is to install miniconda and use it to create a Python environment.

Example With miniconda#

Create a new Python environment named scikit-activeml:

conda create -n scikit-activeml

To be sure that the correct environment is active:

conda activate scikit-activeml

Then install pip:

conda install pip

Install Dependencies#

Now we can install some required project dependencies, which are defined in the requirements.txt and requirements_extra.txt (for development) files.

# Make sure your scikit-activeml python environment is active!
cd <project-root>
pip install -r requirements.txt
pip install -r requirements_extra.txt

After the pip installation has succeeded, install pandoc and ghostscript if they are not already installed.

Example with MacOS (Homebrew)#

brew install pandoc ghostscript

Contributing Code#

General Coding Conventions#

As this library conforms to the conventions of scikit-learn, the code should conform to the PEP 8 Style Guide for Python Code. For linting, we recommend flake8. The Python package black provides a simple way to format the code accordingly. Concretely, you can install it and format the code via the following commands:

pip install black
black --line-length 79 example_file.py

Example for C3 (Code Contribution Cycle) and Pull Requests#

  1. Fork the repository using the GitHub Fork button.

  2. Then, clone your fork to your local machine:

git clone https://github.com/<your-username>/scikit-activeml.git

  3. Create a new branch for your changes from the development branch:

git checkout -b <branch-name>

  4. After you have finished implementing the feature, make sure that all tests pass. The tests can be run as follows:

pytest

Make sure that your tests cover all lines:

pytest --cov=./skactiveml

  5. Commit and push the changes:

git add <modified-files>
git commit -m "<commit-message>"
git push

  6. Create a pull request.

Query Strategies#

All query strategies inherit from skactiveml.base.QueryStrategy, the abstract superclass implemented in skactiveml/base.py. This superclass inherits from sklearn.base.BaseEstimator. By default, the __init__ method requires a random_state parameter, and the abstract method query enforces the implementation of the sample selection logic.
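The contract described here can be sketched with the standard library alone. The names SketchQueryStrategy, FirstCandidates, and can_instantiate are hypothetical and are not part of skactiveml; the real superclass additionally builds on scikit-learn's estimator base class and validates its inputs.

```python
from abc import ABC, abstractmethod


class SketchQueryStrategy(ABC):
    """Hypothetical stand-in for an abstract query strategy superclass."""

    def __init__(self, random_state=None):
        # Store parameters unchanged, as scikit-learn estimators do.
        self.random_state = random_state

    @abstractmethod
    def query(self, *args, **kwargs):
        """Return the indices of the samples whose labels are requested."""


class FirstCandidates(SketchQueryStrategy):
    """Toy strategy that always selects the first `batch_size` candidates."""

    def query(self, candidates, batch_size=1):
        return list(candidates)[:batch_size]


def can_instantiate(cls):
    """Return True if `cls` implements every abstract method."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

Because query is abstract, a subclass that forgets to implement it cannot be instantiated, which is the enforcement mechanism the paragraph refers to.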

Single-annotator Pool-based Query Strategies#

General#

Single-annotator pool-based query strategies are stored in a file skactiveml/pool/*.py and inherit from skactiveml.base.SingleAnnotatorPoolQueryStrategy.

The class must implement the following methods:

| Method | Description |
|---|---|
| __init__ | Method for initialization. |
| query | Select the samples whose labels are to be queried. |

__init__ method#

For typical class parameters, we use standard names:

| Parameter | Description |
|---|---|
| random_state | Number or np.random.RandomState, like sklearn. |
| prior, optional | Prior probabilities for the distribution of probabilistic strategies. |
| method, optional | String for classes that implement multiple methods. |
| cost_matrix, optional | Cost matrix defining the cost of interchanging classes. |

query method#

Required Parameters:

| Parameter | Description |
|---|---|
| X | Training data set, usually complete, i.e., including the labeled and unlabeled samples. |
| y | Labels of the training data set (possibly including unlabeled ones indicated by MISSING_LABEL). |
| candidates, optional | If candidates is None, the unlabeled samples from (X, y) are considered as candidates. If candidates is of shape (n_candidates,) and of type int, candidates is considered as the indices of the samples in (X, y). If candidates is of shape (n_candidates, n_features), the candidates are directly given in candidates (not necessarily contained in X). This is not supported by all query strategies. |
| batch_size, optional | Number of samples to be selected in one AL cycle. |
| return_utilities, optional | If True, additionally return the utilities of the query strategy. |

Returns:

| Parameter | Description |
|---|---|
| query_indices | The query_indices indicate for which candidate sample a label is to be queried, e.g., query_indices[0] indicates the first selected sample. If candidates is None or of shape (n_candidates,), the indexing refers to samples in X. If candidates is of shape (n_candidates, n_features), the indexing refers to samples in candidates. |
| utilities, optional | The utilities of samples after each selected sample of the batch, e.g., utilities[0] indicates the utilities used for selecting the first sample (with index query_indices[0]) of the batch. Utilities for labeled samples will be set to np.nan. If candidates is None or of shape (n_candidates,), the indexing refers to samples in X. If candidates is of shape (n_candidates, n_features), the indexing refers to samples in candidates. |

General advice#

Use the self._validate_data method (implemented in the superclass). Check the inputs X and y only once. Fit the classifier or regressor if it is not yet fitted (you may use fit_if_not_fitted from utils). Calculate utilities via an extra function that should be public. Use the simple_batch function from utils for determining query_indices and setting utilities in naive batch query strategies.
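To make the query contract more tangible, here is a minimal NumPy-only sketch of a greedy batch selection. It is not the library's simple_batch implementation; toy_utilities and toy_query are hypothetical names, missing labels are assumed to be encoded as NaN, and only index-based candidates are handled.

```python
import numpy as np


def toy_utilities(X, y):
    """Hypothetical utility: distance of each sample to the feature mean."""
    return np.linalg.norm(X - X.mean(axis=0), axis=1)


def toy_query(X, y, candidates=None, batch_size=1, return_utilities=False):
    """Greedy batch selection over candidate indices of (X, y)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if candidates is None:
        # Default: every unlabeled sample (label is NaN) is a candidate.
        candidates = np.flatnonzero(np.isnan(y))
    candidates = np.asarray(candidates, dtype=int)

    # Utilities are indexed like X; non-candidates stay NaN.
    utilities = np.full(len(X), np.nan)
    utilities[candidates] = toy_utilities(X, y)[candidates]

    batch_size = min(batch_size, len(candidates))
    query_indices = np.empty(batch_size, dtype=int)
    batch_utilities = np.empty((batch_size, len(X)))
    for b in range(batch_size):
        # Row b stores the utilities used for the b-th selection.
        batch_utilities[b] = utilities
        query_indices[b] = np.nanargmax(utilities)
        # A selected sample cannot be selected again in the same batch.
        utilities[query_indices[b]] = np.nan

    if return_utilities:
        return query_indices, batch_utilities
    return query_indices
```

In an active learning cycle, the returned indices would be used to obtain labels, e.g., `y[query_indices] = y_true[query_indices]`, before the next call.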

Testing#

The test classes skactiveml.pool.test.TestQueryStrategy of single-annotator pool-based query strategies need to inherit from the test template skactiveml.tests.template_query_strategy.TemplateSingleAnnotatorPoolQueryStrategy. As a result, many required functionalities are tested automatically. As a requirement, one needs to specify the parameters qs_class and init_default_params of the __init__ method accordingly. Depending on whether the query strategy can handle regression, classification, or both settings, one additionally needs to define the parameters query_default_params_reg and/or query_default_params_clf. Once the parameters are set, the developer needs to adjust the test until all errors are resolved. In particular, the method test_query must be implemented. We refer to the test template for more detailed information.

Single-annotator Stream-based Query Strategies#

General#

All query strategies are stored in a file skactiveml/stream/*.py. Every query strategy inherits from SingleAnnotatorStreamQueryStrategy. Every query strategy has either an internal budget handling or an outsourced budget_manager.

For typical class parameters we use standard names:

| Parameter | Description |
|---|---|
| random_state | Integer that acts as random seed or np.random.RandomState, like sklearn. |
| budget | The share of labels that the strategy is allowed to query. |
| budget_manager, optional | Enforces the budget constraint. |

The class must implement the following methods:

| Function | Description |
|---|---|
| __init__ | Function for initialization. |
| query | Identify the instances whose labels are to be queried, without adapting the internal state. |
| update | Adapt the budget monitoring according to the queried labels. |

query method#

Required Parameters:

| Parameter | Description |
|---|---|
| candidates | Set of candidate instances, inherited from SingleAnnotatorStreamQueryStrategy. |
| clf, optional | The classifier used by the strategy. |
| X, optional | Set of labeled and unlabeled instances. |
| y, optional | Labels of X (an entry may be set to MISSING_LABEL if the label is unknown). |
| sample_weight, optional | Weights for each instance in X, or None if all are equally weighted. |
| fit_clf, optional | Whether to use X and y to fit the classifier. |
| return_utilities | Whether to return the candidates’ utilities, inherited from SingleAnnotatorStreamQueryStrategy. |

Returns:

| Parameter | Description |
|---|---|
| queried_indices | Indices of the best instances from candidates. |
| utilities | Utilities of all candidate instances, only returned if return_utilities is True. |

General advice#

The query method must not change the internal state of the query strategy (budget, budget_manager, and random_state included) so that multiple instances can be assessed with the same state. Update the internal state in the update() method. If the class implements a classifier (clf), the optional attributes need to be implemented. Use the self._validate_data method (implemented in the superclass). Check the inputs X and y only once. Fit the classifier if fit_clf is set to True.

update method#

Required Parameters:

| Parameter | Description |
|---|---|
| candidates | Set of candidate instances, inherited from SingleAnnotatorStreamQueryStrategy. |
| queried_indices | Typically the return value of query. |
| budget_manager_param_dict | Provides additional parameters to the update method of the budget_manager (only include if a budget_manager is used). |

General advice#

Use self._validate_data in case the strategy is used without calling the query method first (i.e., if parameters need to be initialized before the update). If a budget_manager is used, forward the update call to the budget_manager.update method.
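The separation between a side-effect-free query and a state-changing update can be illustrated with a small standard-library sketch. ToyStreamStrategy is a hypothetical class that does not inherit from any skactiveml class; it draws random utilities and selects an instance if its utility is at least 1 - budget, which is an assumption made only for this illustration.

```python
import copy
import random


class ToyStreamStrategy:
    """Hypothetical stream strategy that queries at random."""

    def __init__(self, budget=0.1, random_state=None):
        self.budget = budget
        self.random_state = random_state
        self._rng = random.Random(random_state)
        self.n_seen = 0
        self.n_queried = 0

    def query(self, candidates, return_utilities=False):
        # Work on a copy of the generator so that repeated calls with the
        # same candidates yield the same decision.
        rng = copy.deepcopy(self._rng)
        utilities = [rng.random() for _ in candidates]
        queried_indices = [
            i for i, u in enumerate(utilities) if u >= 1 - self.budget
        ]
        if return_utilities:
            return queried_indices, utilities
        return queried_indices

    def update(self, candidates, queried_indices):
        # Only here is the internal state advanced.
        for _ in candidates:
            self._rng.random()
        self.n_seen += len(candidates)
        self.n_queried += len(queried_indices)
        return self
```

Calling query twice in a row returns identical results; only after update does the next call see a new state.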

Testing#

All stream query strategies are tested by a general unittest (stream/tests/test_stream.py). For every class ExampleQueryStrategy that inherits from SingleAnnotatorStreamQueryStrategy (stored in _example.py), it is automatically tested whether there exists a file tests/test_example.py. It is necessary that both filenames are the same. Moreover, the test class must be called TestExampleQueryStrategy and inherit from unittest.TestCase. For every parameter in __init__(), it is tested whether it is stored as a class attribute of the same name. For every parameter arg in __init__(), it is checked whether the test class TestExampleQueryStrategy has a method called test_init_param_arg(). For every parameter arg in query(), it is checked whether the test class TestExampleQueryStrategy has a method called test_query_param_arg(). It is also tested whether the internal state remains unchanged after multiple calls of query() without using update().
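As a rough illustration of this naming convention (not the project's actual test code), the following sketch derives the expected test method names from the signatures of a class. The function expected_test_names and the class ExampleQueryStrategy are hypothetical.

```python
import inspect


def expected_test_names(cls):
    """List the test method names that the naming convention asks for."""
    names = []
    for method, prefix in (("__init__", "init"), ("query", "query")):
        params = inspect.signature(getattr(cls, method)).parameters
        for param in params:
            if param != "self":
                names.append(f"test_{prefix}_param_{param}")
    return names


class ExampleQueryStrategy:
    """Hypothetical strategy used only to demonstrate the naming scheme."""

    def __init__(self, budget=None, random_state=None):
        self.budget = budget
        self.random_state = random_state

    def query(self, candidates, return_utilities=False):
        return []
```

For the toy class above, the helper lists test_init_param_budget, test_init_param_random_state, test_query_param_candidates, and test_query_param_return_utilities.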

General advice for the budget_manager#

All budget managers are stored in skactiveml/stream/budget_manager/*.py. The class must implement the following methods:

| Method | Description |
|---|---|
| __init__ | Function for initialization. |
| query_by_utilities | Identify which instances to query based on the assessed utility. |
| update | Adapt the budget monitoring according to the queried labels. |

update method#

The update method of the budget manager has the same functionality as the query strategy update.

Required Parameters:

| Parameter | Description |
|---|---|
| budget | Percentage of labels that the strategy is allowed to query. |
| random_state | Integer that acts as random seed or np.random.RandomState, like sklearn. |

query_by_utilities method#

Required Parameters:

| Parameter | Description |
|---|---|
| utilities | The utilities of candidates calculated by the query strategy, inherited from BudgetManager. |

General advice for working with a budget_manager:#

If a budget_manager is used, the _validate_data of the query strategy needs to be adapted accordingly:

  • If only a budget is given, use the default budget_manager with the given budget.

  • If only a budget_manager is given, use the budget_manager.

  • If neither is given, use the default budget_manager with the default budget.

  • If both are given and the budget differs from budget_manager.budget, raise an error.
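The four rules can be sketched as a small helper. ToyBudgetManager, DEFAULT_BUDGET, and resolve_budget_manager are hypothetical names chosen for illustration, and the default value 0.1 is an assumption rather than the library's actual default.

```python
class ToyBudgetManager:
    """Hypothetical budget manager that only stores its budget."""

    def __init__(self, budget=None):
        self.budget = budget


DEFAULT_BUDGET = 0.1  # assumed default, chosen for illustration only


def resolve_budget_manager(budget=None, budget_manager=None):
    """Apply the four rules and return a usable budget manager."""
    if budget_manager is None:
        # Only a budget or nothing is given: use the default manager.
        effective = DEFAULT_BUDGET if budget is None else budget
        return ToyBudgetManager(budget=effective)
    if budget is not None and budget != budget_manager.budget:
        # Both are given but contradict each other.
        raise ValueError(
            f"budget={budget} differs from "
            f"budget_manager.budget={budget_manager.budget}"
        )
    # Only a manager is given, or both are given and consistent.
    return budget_manager
```

Raising a ValueError for the conflicting case is a design choice of this sketch; the rule itself only demands that an error is raised.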

Testing#

All budget managers are tested by a general unittest (stream/budget_manager/tests/test_budget_manager.py). For every class ExampleBudgetManager that inherits from BudgetManager (stored in _example.py), it is automatically tested whether there exists a file tests/test_example.py. It is necessary that both filenames are the same. Moreover, the test class must be called TestExampleBudgetManager and inherit from unittest.TestCase. For every parameter in __init__(), it is tested whether it is stored as a class attribute of the same name. For every parameter arg in __init__(), it is checked whether the test class TestExampleBudgetManager has a method called test_init_param_arg(). For every parameter arg in query_by_utilities(), it is checked whether the test class TestExampleBudgetManager has a method called test_query_by_utilities_param_arg(). It is also tested whether the internal state remains unchanged after multiple calls of query_by_utilities() without using update().

Multi-Annotator Pool-based Query Strategies#

All query strategies are stored in a file skactiveml/pool/multi/*.py and inherit from skactiveml.base.MultiAnnotatorPoolQueryStrategy.

The class must implement the following methods:

| Method | Description |
|---|---|
| __init__ | Method for initialization. |
| query | Select the annotator-sample pairs to decide which sample’s class label is to be queried from which annotator. |

query method#

Required Parameters:

| Parameter | Description |
|---|---|
| X | Training data set, usually complete, i.e., including the labeled and unlabeled samples. |
| y | Labels of the training data set for each annotator (possibly including unlabeled ones indicated by self.MISSING_LABEL), meaning that y[i, j] contains the label provided by annotator j for sample i. |
| candidates, optional | If candidates is None, the samples from (X, y) for which an annotator exists such that the annotator-sample pair is unlabeled are considered as sample candidates. If candidates is of shape (n_candidates,) and of type int, candidates is considered as the indices of the sample candidates in (X, y). If candidates is of shape (n_candidates, n_features), the sample candidates are directly given in candidates (not necessarily contained in X). This is not supported by all query strategies. |
| annotators, optional | If annotators is None, all annotators are considered as available annotators. If annotators is of shape (n_avl_annotators,) and of type int, annotators is considered as the indices of the available annotators. If candidate samples and available annotators are specified, the annotator-sample pairs for which the sample is a candidate sample and the annotator is an available annotator are considered as candidate annotator-sample pairs. If annotators is a boolean array of shape (n_candidates, n_avl_annotators), the annotator-sample pairs for which the sample is a candidate sample and the boolean matrix has entry True are considered as candidate annotator-sample pairs. |
| batch_size, optional | The number of annotator-sample pairs to be selected in one AL cycle. |
| return_utilities, optional | If True, also return the utilities based on the query strategy. |
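How candidates and annotators combine into candidate annotator-sample pairs can be sketched with NumPy. The function candidate_pairs is hypothetical; it assumes that rows of y index samples and columns index annotators, that missing labels are encoded as NaN, and it covers only the index-based forms of candidates and annotators.

```python
import numpy as np


def candidate_pairs(y, candidates=None, annotators=None):
    """Boolean mask of shape (n_samples, n_annotators) of candidate pairs."""
    y = np.asarray(y, dtype=float)
    n_samples, n_annotators = y.shape

    sample_mask = np.zeros(n_samples, dtype=bool)
    if candidates is None:
        # Samples with at least one unlabeled annotator-sample pair.
        sample_mask[:] = np.isnan(y).any(axis=1)
    else:
        sample_mask[np.asarray(candidates, dtype=int)] = True

    annotator_mask = np.zeros(n_annotators, dtype=bool)
    if annotators is None:
        # All annotators are available.
        annotator_mask[:] = True
    else:
        annotator_mask[np.asarray(annotators, dtype=int)] = True

    # A pair is a candidate if its sample and its annotator both qualify.
    return sample_mask[:, None] & annotator_mask[None, :]
```

A query strategy would then restrict its utilities to the entries where this mask is True.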

Returns:

| Parameter | Description |
|---|---|
| query_indices | The query_indices indicate for which candidate sample a label is to be queried, e.g., query_indices[0] indicates the first selected sample. If candidates is None or of shape (n_candidates,), the indexing refers to samples in X. If candidates is of shape (n_candidates, n_features), the indexing refers to samples in candidates. |
| utilities | The utilities of samples after each selected sample of the batch, e.g., utilities[0] indicates the utilities used for selecting the first sample (with index query_indices[0]) of the batch. Utilities for labeled samples will be set to np.nan. If candidates is None or of shape (n_candidates,), the indexing refers to samples in X. If candidates is of shape (n_candidates, n_features), the indexing refers to samples in candidates. |

General advice#

Use the self._validate_data method (implemented in the superclass). Check the inputs X and y only once. Fit the classifier if it is not yet fitted (you may use fit_if_not_fitted from utils). If the strategy combines a single-annotator query strategy with a performance estimate:

  • define an aggregation function,

  • evaluate the performance for each sample-annotator pair,

  • use the SingleAnnotatorWrapper.

If the strategy is a greedy method regarding the utilities:

  • calculate utilities (in an extra function),

  • use skactiveml.utils.simple_batch function for returning values.

Testing#

The test classes skactiveml.pool.multiannotator.test.TestQueryStrategy of multi-annotator pool-based query strategies need to inherit from unittest.TestCase. In this class, each parameter a of the __init__ method needs to be tested via a method test_init_param_a. This also applies to each parameter a of the query method, which is tested via a method test_query_param_a. The main logic of the query strategy is tested via the method test_query.

Classifiers#

Standard classifier implementations are part of the subpackage skactiveml.classifier and classifiers learning from multiple annotators are implemented in its subpackage skactiveml.classifier.multiannotator. Every class of a classifier inherits from skactiveml.base.SkactivemlClassifier.

The class must implement the following methods:

| Method | Description |
|---|---|
| __init__ | Method for initialization. |
| fit | Method to fit the classifier for given training data. |
| predict_proba | Method predicting class-membership probabilities for samples. |
| predict | Method predicting class labels for samples. The superclass already provides an implementation using predict_proba. |

__init__ method#

Required Parameters:

| Parameter | Description |
|---|---|
| classes, optional | Holds the label for each class. If None, the classes are determined during the fit. |
| missing_label, optional | Value to represent a missing label. |
| cost_matrix, optional | Cost matrix with cost_matrix[i, j] indicating the cost of predicting class classes[j] for a sample of class classes[i]. Can only be set if classes is not None. |
| random_state, optional | Ensures reproducibility (cf. scikit-learn). |

fit method#

Required Parameters:

| Parameter | Description |
|---|---|
| X | Matrix of feature values representing the samples. |
| y | Contains the class labels of the training samples. Missing labels are represented through the attribute missing_label. Usually, y is a column array, except for multi-annotator classifiers, which expect a matrix whose columns contain the class labels provided by a specific annotator. |
| sample_weight, optional | Contains the weights of the training samples’ class labels. It must have the same shape as y. |

Returns:

| Parameter | Description |
|---|---|
| self | The fitted classifier object. |

General advice#

Use the self._validate_data method (implemented in the superclass) to check the standard parameters of the __init__ and fit methods. If the classes parameter was provided, the classifier can be fitted with training samples that are all assigned the missing_label. In this case, the classifier should make random predictions, i.e., output uniform class-membership probabilities when calling predict_proba. Ensure that the classifier can also handle missing labels in other cases.
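The expected behavior for a training set without any labels can be sketched with a toy classifier. ToyFrequencyClassifier is hypothetical, does not inherit from SkactivemlClassifier, skips all input validation, and assumes numeric class labels with a numeric missing_label (NaN by default).

```python
import numpy as np


class ToyFrequencyClassifier:
    """Hypothetical classifier predicting the observed class frequencies."""

    def __init__(self, classes=None, missing_label=np.nan):
        self.classes = classes
        self.missing_label = missing_label

    def fit(self, X, y):
        y = np.asarray(y, dtype=float)
        # NaN never equals itself, so it needs a dedicated check.
        if np.isnan(self.missing_label):
            is_labeled = ~np.isnan(y)
        else:
            is_labeled = y != self.missing_label
        if self.classes is None:
            self.classes_ = np.unique(y[is_labeled])
        else:
            self.classes_ = np.asarray(self.classes, dtype=float)
        if len(self.classes_) == 0:
            raise ValueError("classes must be given if no labels exist.")
        self.counts_ = np.array(
            [np.sum(y[is_labeled] == c) for c in self.classes_], dtype=float
        )
        return self

    def predict_proba(self, X):
        if self.counts_.sum() == 0:
            # No labels seen: fall back to uniform probabilities.
            proba = np.full(len(self.classes_), 1 / len(self.classes_))
        else:
            proba = self.counts_ / self.counts_.sum()
        return np.tile(proba, (len(X), 1))
```

With classes=[0, 1] and only missing labels, every row of predict_proba is [0.5, 0.5]; once labels arrive, the rows follow the observed class frequencies.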

predict_proba method#

Required Parameters:

| Parameter | Description |
|---|---|
| X | Matrix of feature values representing the samples for which the classifier will make predictions. |

Returns:

| Parameter | Description |
|---|---|
| P | The estimated class-membership probabilities per sample. |

General advice#

Check parameter X regarding its shape, i.e., use superclass method self._check_n_features to ensure a correct number of features. Check that the classifier has been fitted. If the classifier is a skactiveml.base.ClassFrequencyEstimator, this method is already implemented in the superclass.

predict method#

Required Parameters:

| Parameter | Description |
|---|---|
| X | Matrix of feature values representing the samples for which the classifier will make predictions. |

Returns:

| Parameter | Description |
|---|---|
| y_pred | The estimated class label of each sample. |

General advice#

Usually, this method is already implemented by the superclass through calling the predict_proba method. If the superclass method is overridden, ensure that it can handle imbalanced costs and missing labels.
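One common way to account for imbalanced costs is to predict the class with the minimum expected cost. The following sketch illustrates this decision rule under the cost_matrix convention described for the initialization parameters; predict_min_expected_cost is a hypothetical helper and is not claimed to be how the superclass implements predict.

```python
import numpy as np


def predict_min_expected_cost(P, classes, cost_matrix=None):
    """Pick, per sample, the class with the lowest expected cost.

    P has shape (n_samples, n_classes); cost_matrix[i, j] is the cost of
    predicting classes[j] for a sample of class classes[i].
    """
    P = np.asarray(P, dtype=float)
    classes = np.asarray(classes)
    if cost_matrix is None:
        # Zero-one costs reduce the rule to the most probable class.
        cost_matrix = 1 - np.eye(len(classes))
    # Entry (n, j): expected cost of predicting classes[j] for sample n.
    expected_cost = P @ np.asarray(cost_matrix, dtype=float)
    return classes[np.argmin(expected_cost, axis=1)]
```

For P = [[0.7, 0.3]], zero-one costs select the first class, whereas a cost of 5 for missing the second class shifts the decision to the second class.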

score method#

Required Parameters:

| Parameter | Description |
|---|---|
| X | Matrix of feature values representing the samples for which the classifier will make predictions. |
| y | Contains the true label of each sample. |
| sample_weight, optional | Defines the importance of each sample when computing the accuracy of the classifier. |

Returns:

| Parameter | Description |
|---|---|
| score | Mean accuracy of self.predict(X) regarding y. |

General advice#

Usually, this method is already implemented by the superclass. If the superclass method is overridden, ensure that it checks the parameters and that the classifier has been fitted.

Testing#

All classifiers are tested by a general unittest (skactiveml/classifier/tests/test_classifier.py). For every class ExampleClassifier that inherits from skactiveml.base.SkactivemlClassifier (stored in _example_classifier.py), it is automatically tested if there exists a file tests/test_example_classifier.py. It is necessary that both filenames are the same. Moreover, the test class must be called TestExampleClassifier and inherit from unittest.TestCase. For each parameter of an implemented method, there must be a test method called test_methodname_parametername in the Python file tests/test_example_classifier.py. It is to check whether invalid parameters are handled correctly. For each implemented method, there must be a test method called test_methodname in the Python file tests/test_example_classifier.py. It is to check whether the method works as intended.

Regressors#

Standard regressor implementations are part of the subpackage skactiveml.regressor. Every class of a regressor inherits from skactiveml.base.SkactivemlRegressor.

The class must implement the following methods:

| Method | Description |
|---|---|
| __init__ | Method for initialization. |
| fit | Method to fit the regressor for given training data. |
| predict | Method predicting the target values (labels) for samples. |

__init__ method#

Required Parameters:

| Parameter | Description |
|---|---|
| random_state, optional | Ensures reproducibility (cf. scikit-learn). |
| missing_label, optional | Value to represent a missing label. |

fit method#

Required Parameters:

| Parameter | Description |
|---|---|
| X | Matrix of feature values representing the samples. |
| y | Contains the target values of the training samples. Missing labels are represented through the attribute missing_label. Usually, y is a column array, except for multi-target regressors, which expect a matrix whose columns contain the different target types. |
| sample_weight, optional | Contains the weights of the training samples’ targets. It must have the same shape as y. |

Returns:

| Parameter | Description |
|---|---|
| self | The fitted regressor object. |

General advice#

Use the self._validate_data method (implemented in the superclass) to check the standard parameters of the __init__ and fit methods. If the regressor was fitted on training samples that are all assigned the missing_label, the regressor should predict a default value of zero when calling predict. Ensure that the regressor can also handle missing labels in other cases.

predict method#

Required Parameters:

| Parameter | Description |
|---|---|
| X | Matrix of feature values representing the samples for which the regressor will make predictions. |

Returns:

| Parameter | Description |
|---|---|
| y_pred | The estimated targets per sample. |

General advice#

Check parameter X regarding its shape, i.e., use the superclass method self._check_n_features to ensure a correct number of features. Check that the regressor has been fitted. If the regressor is a skactiveml.base.ProbabilisticRegressor, this method is already implemented in the superclass.

score method#

Required Parameters:

| Parameter | Description |
|---|---|
| X | Matrix of feature values representing the samples for which the regressor will make predictions. |
| y | Contains the true target of each sample. |
| sample_weight, optional | Defines the importance of each sample when computing the R2 score of the regressor. |

Returns:

| Parameter | Description |
|---|---|
| score | R2 score of self.predict(X) regarding y. |

General advice#

Usually, this method is already implemented by the superclass. If the superclass method is overridden, ensure that it checks the parameters and that the regressor has been fitted.

Testing#

For every class ExampleRegressor that inherits from skactiveml.base.SkactivemlRegressor (stored in _example_regressor.py), there needs to be a file tests/test_example_regressor.py. It is necessary that both filenames are the same. Moreover, the test class must be called TestExampleRegressor and inherit from unittest.TestCase. For each parameter of an implemented method, there must be a test method called test_methodname_parametername in the Python file tests/test_example_regressor.py. Its purpose is to check whether invalid parameters are handled correctly. For each implemented method, there must be a test method called test_methodname in the Python file tests/test_example_regressor.py. Its purpose is to check whether the method works as intended.

Annotator Models#

Annotator models are marked by implementing the interface skactiveml.base.AnnotatorModelMixin. These models can estimate the performances of annotators for given samples. The class of an annotator model must implement the predict_annotator_perf method estimating the performances per sample of each annotator as proxies of the provided annotations’ qualities.

predict_annotator_perf method#

Required Parameters:

| Parameter | Description |
|---|---|
| X | Matrix of feature values representing the samples. |

Returns:

| Parameter | Description |
|---|---|
| P_annot | The estimated performances per sample-annotator pair. |

General advice#

Check parameter X regarding its shape and check that the annotator model has been fitted. If no samples or class labels were provided during the previous call of the fit method, the maximum value of annotator performance should be outputted for each sample-annotator pair.

Examples#

Two of our main goals are to make active learning more understandable and to improve our framework’s usability. Therefore, we require the implementation of an example for each query strategy. To do so, one needs to create a file named scikit-activeml/docs/examples/query_strategy.json. Currently, we support examples for single-annotator pool-based query strategies and single-annotator stream-based query strategies.

The .json file supports the following entries:

| Entry | Description |
|---|---|
| class | Query strategy’s class name. |
| package | Name of the sub-package, e.g., pool. |
| method | Query strategy’s official name. |
| category | The methodological category of this query strategy, i.e., Expected Error Reduction, Model Change, Query-by-Committee, Random Sampling, Uncertainty Sampling, or Others. |
| template | Defines the general setup/setting of the example. Supported templates are examples/template_pool.py, examples/template_pool_regression.py, examples/template_stream.py, and examples/template_pool_batch.py. |
| tags | Defines search categories. Supported tags are pool, stream, single-annotator, multi-annotator, classification, and regression. |
| title | Title of the example, usually named after the query strategy. |
| text_0 | Placeholder for additional explanations. |
| refs | References (BibTeX key) to the paper(s) of the query strategy. |
| sequence | Order in which content is displayed, usually ["title", "text_0", "plot", "refs"]. |
| import_misc | Python code for imports, e.g., "from skactiveml.pool import RandomSampling". |
| n_samples | Number of samples of the example data set. |
| init_qs | Python code to initialize the query strategy object, e.g., "RandomSampling()". |
| query_params | Python code of parameters passed to the query method of the query strategy, e.g., "X=X, y=y". |
| preproc | Python code for preprocessing before executing the AL cycle, e.g., "X = (X-X.min())/(X.max()-X.min())". |
| n_cycles | Number of AL cycles. |
| init_clf | Python code to initialize the classifier object, e.g., "ParzenWindowClassifier(classes=[0, 1])". Only supported for examples/template_pool.py, examples/template_pool_batch.py, and examples/template_stream.py. |
| init_reg | Python code to initialize the regressor object, e.g., "NICKernelRegressor()". Only supported for examples/template_pool_regression.py. |
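As a rough sketch, an entry for a pool-based classification example could be assembled and serialized as shown here. The entry names and the quoted code snippets are taken from the descriptions of the supported entries; everything else is an assumption, in particular the value types (e.g., lists for tags and refs), the numbers for n_samples and n_cycles, and the overall layout of the file, so compare with an existing file in docs/examples before adding a new one.

```python
import json

# Illustrative values only; value types and numbers are assumptions.
example_entry = {
    "class": "RandomSampling",
    "package": "pool",
    "method": "Random Sampling",
    "category": "Random Sampling",
    "template": "examples/template_pool.py",
    "tags": ["pool", "single-annotator", "classification"],
    "title": "Random Sampling",
    "text_0": "",
    "refs": [],
    "sequence": ["title", "text_0", "plot", "refs"],
    "import_misc": "from skactiveml.pool import RandomSampling",
    "n_samples": 100,
    "init_qs": "RandomSampling()",
    "query_params": "X=X, y=y",
    "preproc": "X = (X-X.min())/(X.max()-X.min())",
    "n_cycles": 20,
    "init_clf": "ParzenWindowClassifier(classes=[0, 1])",
}

# Serialize the entry; json.dumps also verifies that all values are
# JSON-compatible.
json_text = json.dumps(example_entry, indent=4)
```

Printing json_text yields the content that could be stored in the .json file.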

Testing and code coverage#

Please ensure test coverage is close to 100%. The current code coverage can be viewed here.

Documentation#

Guidelines for writing documentation#

In scikit-activeml, the guidelines for writing the documentation are adopted from scikit-learn.

Building the documentation#

To ensure the documentation of your work is well formatted, build the sphinx documentation by executing the following line.

sphinx-build -b html docs docs/_build

Issue Tracking#

We use GitHub Issues as our issue tracker. If you think you have found a bug in scikit-activeml, you can report it to the issue tracker. Documentation bugs can also be reported there.

Checking If A Bug Already Exists#

The first step before filing an issue report is to see whether the problem has already been reported. Checking if the problem is an existing issue will:

  1. Help you see if the problem has already been resolved or has been fixed for the next release

  2. Save time for you and the developers

  3. Help you learn what needs to be done to fix it

  4. Determine if additional information, such as how to replicate the issue, is needed

To see if the issue already exists, search the issue database (bug label) using the search box on the top of the issue tracker page.

Reporting an issue#

Use the following labels to report an issue:

| Label | Use case |
|---|---|
| bug | Something isn’t working |
| enhancement | New feature |
| documentation | Improvements or additions to the documentation |
| question | General questions |