Contributing Guide#

scikit-activeml is a library that implements the most important query strategies for active learning. It is built upon the well-known machine learning framework scikit-learn.

Overview#

Our philosophy is to extend the sklearn ecosystem with the most relevant query strategies for active learning and to implement tools for working with partially unlabeled data. An overview of our repository’s structure is provided in the image below. Each node represents a class or interface, and the arrows illustrate the inheritance hierarchy among them. Dashed nodes indicate functionality that is not yet available in our library. scikit-learn is used in this image for identification only and does not imply endorsement.

https://raw.githubusercontent.com/scikit-activeml/scikit-activeml/master/docs/logos/scikit-activeml-structure.png

In our package skactiveml, there are three major components: SkactivemlClassifier, SkactivemlRegressor, and QueryStrategy. The classifier and regressor modules are necessary to handle partially unlabeled data and to implement active-learning–specific estimators. This way, an active learning cycle can be easily implemented starting with zero initial labels. Regarding active learning query strategies, we currently differentiate between the pool-based paradigm (a large pool of unlabeled samples is available) and the stream-based paradigm (unlabeled samples arrive sequentially, i.e., as a stream). Furthermore, we distinguish between the single-annotator and multi-annotator settings. In the latter case, multiple error-prone annotators are queried to provide labels. As a result, an active learning query strategy not only decides which samples to query but also which annotators should be queried.

Thank You, Contributors!#

A big thank you to all contributors who provide the scikit-activeml project with new enhancements and bug fixes.

Getting Help#

If you have any questions, please reach out to other developers via the following channels:

Roadmap#

Our roadmap is summarized in the issue Upcoming Features.

Get Started#

Before you contribute to this project, please follow the steps below.

Setup Development Environment#

There are several ways to create a local Python environment, such as virtualenv, pipenv, or miniconda. One possible workflow is to install miniconda and use it to create a Python environment.

Example with miniconda#

Create a new Python environment named scikit-activeml:

conda create -n scikit-activeml

To ensure that the correct environment is active:

conda activate scikit-activeml

Then install pip:

conda install pip

Install Dependencies#

Now, install the required project dependencies, which are defined in the requirements.txt and requirements_extra.txt (for development) files.

# Make sure your scikit-activeml Python environment is active!
cd <project-root>
pip install -e ".[dev]"

After the pip installation is successful, you must install pandoc and ghostscript if they are not already installed.

Example with macOS (Homebrew)#

brew install pandoc ghostscript

Contributing Code#

General Coding Conventions#

This library follows the conventions of scikit-learn and should conform to the PEP 8 Style Guide for Python code. For linting, the use of flake8 is recommended. The Python package black provides a simple solution for code formatting. For example, you can format your code using the following command:

black skactiveml

Example for Code Contribution Cycle (C3) and Pull Requests#

  1. Fork the repository using the GitHub Fork button.

  2. Clone your fork to your local machine:

git clone https://github.com/<your-username>/scikit-activeml.git

  3. Create a new branch for your changes from the development branch:

git checkout -b <branch-name>

  4. After you have finished implementing the feature, ensure that all tests pass. You can run the tests using:

pytest

Make sure you have covered all lines with tests.

pytest --cov=./skactiveml

  5. Commit and push your changes.

git add <modified-files>
git commit -m "<commit-message>"
git push

  6. Create a pull request.

Query Strategies#

All query strategies inherit from the abstract superclass skactiveml.base.QueryStrategy, which is implemented in skactiveml/base.py. This superclass inherits from sklearn.base.BaseEstimator. By default, its __init__ method requires a random_state parameter, and the abstract query method enforces the implementation of the sample selection logic.

Single-annotator Pool-based Query Strategies#

General#

Single-annotator pool-based query strategies are stored in the files skactiveml/pool/*.py and inherit from skactiveml.base.SingleAnnotatorPoolQueryStrategy.

The class must implement the following methods:

| Method | Description |
| --- | --- |
| __init__ | Method for initialization. |
| query | Select the samples whose labels are to be queried. |

__init__#

For typical class parameters, we use standard names:

| Parameter | Description |
| --- | --- |
| random_state | An integer or a np.random.RandomState, similar to scikit-learn. |
| prior, optional | Prior probabilities for the distribution in probabilistic strategies. |
| method, optional | A string for classes that implement multiple methods. |
| cost_matrix, optional | A cost matrix defining the cost of misclassifying samples. |

query#

Required Parameters:

| Parameter | Description |
| --- | --- |
| X | Training dataset, usually complete (i.e., including both labeled and unlabeled samples). |
| y | Labels of the training dataset. (May include unlabeled samples, indicated by a MISSING_LABEL.) |
| candidates, optional | If candidates is None, the unlabeled samples from (X, y) are considered as candidates. If candidates is an array of integers with shape (n_candidates,), it is considered as indices of the samples in (X, y). If it is an array with shape (n_candidates, n_features), the candidates are directly provided (and may not be contained in X). This is not supported by all query strategies. |
| batch_size, optional | Number of samples to be selected in one AL cycle. |
| return_utilities, optional | If True, additionally return the utilities computed by the query strategy. |

Returns:

| Parameter | Description |
| --- | --- |
| query_indices | Indices indicating which candidate sample’s label is to be queried. For example, query_indices[0] indicates the first selected sample. Depending on the shape of candidates, the indexing refers either to samples in X or directly to candidates. |
| utilities, optional | Utilities of the samples after selection. For example, utilities[0] indicates the utility for selecting the first sample. For labeled samples, the utility will be set to np.nan. |
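The three accepted forms of candidates can be told apart roughly as in the following sketch. It is plain Python for illustration only and not the library's implementation: the helper name resolve_candidates is hypothetical, and None stands in for MISSING_LABEL.

```python
def resolve_candidates(y, candidates=None, missing_label=None):
    """Return ("indices", idx) or ("samples", rows) for the given candidates.

    Hypothetical helper; None is used as the missing label in this sketch.
    """
    if candidates is None:
        # Default: every unlabeled sample in (X, y) is a candidate.
        return "indices", [i for i, label in enumerate(y) if label is missing_label]
    if all(isinstance(c, int) for c in candidates):
        # Shape (n_candidates,): integer indices into (X, y).
        return "indices", list(candidates)
    # Shape (n_candidates, n_features): candidates are given directly as samples.
    return "samples", [list(row) for row in candidates]


y = [0, None, 1, None]
print(resolve_candidates(y))                           # unlabeled samples of y
print(resolve_candidates(y, candidates=[1, 2]))        # explicit indices
print(resolve_candidates(y, candidates=[[0.5, 1.0]]))  # explicit samples
```

In the first two cases the returned query indices would refer to X, in the third case to candidates itself.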

General Advice#

Use the self._validate_data method (implemented in the superclass) to check the inputs X and y only once. Fit the classifier or regressor if it is not yet fitted (using fit_if_not_fitted from utils). Calculate utilities via an extra public function. Use the simple_batch function from utils to determine the query indices and set the utilities in naive batch query strategies.
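The idea behind a naive batch selection can be sketched as follows. This is a simplified illustration in plain Python under our own assumptions: the helper naive_batch_select is hypothetical and does not reproduce the signature, return values, or tie-breaking of the simple_batch function from utils.

```python
import math


def naive_batch_select(utilities, batch_size=1):
    """Pick the batch_size highest utilities, skipping NaN entries.

    Hypothetical helper for illustration only.
    """
    # NaN marks samples that are not selectable (e.g., already labeled).
    selectable = [i for i, u in enumerate(utilities) if not math.isnan(u)]
    ranked = sorted(selectable, key=lambda i: utilities[i], reverse=True)
    return ranked[:batch_size]


utilities = [0.2, math.nan, 0.9, 0.5]
query_indices = naive_batch_select(utilities, batch_size=2)
print(query_indices)
```

Keeping the utility computation in its own function makes it possible to test and reuse it independently of the selection step.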

Testing#

The test classes in skactiveml.pool.test.TestQueryStrategy for single-annotator pool-based query strategies must inherit from the test template skactiveml.tests.template_query_strategy.TemplateSingleAnnotatorPoolQueryStrategy. As a result, many required functionalities will be automatically tested. You must specify the parameters of qs_class and init_default_params in the __init__ accordingly. Depending on whether the query strategy can handle regression, classification, or both, you also need to define the parameters query_default_params_reg or query_default_params_clf. Once the parameters are set, adjust the tests until all errors are resolved. Please refer to the test template for more detailed information.

Single-annotator Stream-based Query Strategies#

General#

All query strategies are stored in the files skactiveml/stream/*.py. Every query strategy inherits from skactiveml.base.SingleAnnotatorStreamQueryStrategy. Every query strategy either handles the budget internally or delegates it to a budget_manager.

For typical class parameters, we use standard names:

| Parameter | Description |
| --- | --- |
| random_state | An integer that acts as a random seed, or a np.random.RandomState, similar to scikit-learn. |
| budget | The share of labels that the strategy is allowed to query. |
| budget_manager, optional | Enforces the budget constraint. |

The class must implement the following methods:

| Method | Description |
| --- | --- |
| __init__ | Method for initialization. |
| query | Identify the instances whose labels are to be queried, without adapting the internal state. |
| update | Adapt the budget monitoring according to the queried labels. |

query#

Required Parameters:

| Parameter | Description |
| --- | --- |
| candidates | Set of candidate instances, inherited from SingleAnnotatorStreamQueryStrategy. |
| clf, optional | The classifier used by the strategy. |
| X, optional | Set of labeled and unlabeled instances. |
| y, optional | Labels of X (it may be set to MISSING_LABEL if y is unknown). |
| sample_weight, optional | Weights for each instance in X or None if all are equally weighted. |
| fit_clf, optional | Uses X and y to fit the classifier. |
| return_utilities | Whether to return the candidates’ utilities, inherited from SingleAnnotatorStreamQueryStrategy. |

Returns:

| Parameter | Description |
| --- | --- |
| queried_indices | Indices of the best instances from candidates. |
| utilities | Utilities of all candidate instances, only if return_utilities is True. |

General advice#

The query method must not change the internal state of the query strategy (including budget, budget_manager, and random_state), so that multiple instances can be assessed with the same state. Update the internal state in the update method. If the class implements a classifier (clf), the optional attributes need to be implemented. Use the self._validate_data method (implemented in the superclass) to check the inputs X and y only once. Fit the classifier if fit_clf is set to True.

update#

Required Parameters:

| Parameter | Description |
| --- | --- |
| candidates | Set of candidate instances, inherited from SingleAnnotatorStreamQueryStrategy. |
| queried_indices | Typically the return value of query. |
| budget_manager_param_dict | Provides additional parameters to the update method of the budget_manager (only include if a budget_manager is used). |

General advice#

Use self._validate_data in case the strategy is updated without a prior call to the query method (i.e., if parameters need to be initialized before the update). If a budget_manager is used, forward the update call to the budget_manager.update method.
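The separation between a side-effect-free query and a state-changing update can be illustrated with a toy example. The following sketch is plain Python under our own assumptions and does not use the skactiveml base classes: ToyStreamStrategy, its random utilities, the 0.5 threshold, and the counters n_seen_ and n_queried_ are all hypothetical.

```python
import random


class ToyStreamStrategy:
    """Hypothetical strategy that queries random candidates within a budget."""

    def __init__(self, budget=0.5, random_state=0):
        self.budget = budget
        self.random_state = random_state
        self.n_seen_ = 0
        self.n_queried_ = 0

    def query(self, candidates, return_utilities=False):
        # Work on local copies only, so that repeated calls give the same answer.
        rng = random.Random(self.random_state + self.n_seen_)
        n_seen, n_queried = self.n_seen_, self.n_queried_
        queried_indices, utilities = [], []
        for i, _ in enumerate(candidates):
            utility = rng.random()
            utilities.append(utility)
            n_seen += 1
            # Query only while the labeling rate stays within the budget.
            if utility >= 0.5 and (n_queried + 1) / n_seen <= self.budget:
                queried_indices.append(i)
                n_queried += 1
        if return_utilities:
            return queried_indices, utilities
        return queried_indices

    def update(self, candidates, queried_indices):
        # Only update() is allowed to change the internal state.
        self.n_seen_ += len(candidates)
        self.n_queried_ += len(queried_indices)
        return self


qs = ToyStreamStrategy(budget=0.5, random_state=0)
stream = [[0.1], [0.4], [0.7], [0.9]]
first = qs.query(stream)
second = qs.query(stream)  # same internal state, hence the same result
qs.update(stream, first)
```

Because query leaves the counters untouched, several candidate sets can be assessed against the same state before a single update call commits the decision.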

Testing#

The test classes in skactiveml.stream.test.TestQueryStrategy for single-annotator stream-based query strategies must inherit from the test template skactiveml.tests.template_query_strategy.TemplateSingleAnnotatorStreamQueryStrategy. As a result, many required functionalities will be automatically tested. You must specify the parameters of qs_class and init_default_params in the __init__ accordingly. Depending on whether the query strategy can handle regression, classification, or both, you also need to define the parameters query_default_params_reg or query_default_params_clf. Once the parameters are set, adjust the tests until all errors are resolved. Please refer to the test template for more detailed information.

budget_manager#

All budget managers are stored in skactiveml/stream/budget_manager/*.py. The class must implement the following methods:

| Method | Description |
| --- | --- |
| __init__ | Method for initialization. |
| query_by_utilities | Identify which instances to query based on the assessed utility. |
| update | Adapt the budget monitoring according to the queried labels. |

update#

The update method of the budget manager has the same functionality as the query strategy update.

Required Parameters:

| Parameter | Description |
| --- | --- |
| budget | The share of labels that the strategy is allowed to query. |
| random_state | An integer that acts as a random seed, or a np.random.RandomState, similar to scikit-learn. |

query_by_utilities#

Required Parameters:

| Parameter | Description |
| --- | --- |
| utilities | The utilities of candidates calculated by the query strategy, inherited from BudgetManager. |

Returns:

| Parameter | Description |
| --- | --- |
| queried_indices | The indices of samples in candidates whose labels are queried, with 0 <= queried_indices < n_candidates. |

General advice for working with a budget_manager:#

If a budget_manager is used, the _validate_data of the query strategy needs to be adapted accordingly:

  • If only a budget is given, use the default budget_manager with the given budget.

  • If only a budget_manager is given, use the budget_manager.

  • If neither is given, use the default budget_manager with the default budget.

  • If both are given and the budget differs from budget_manager.budget, throw an error.
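The four rules above can be sketched as a small resolution function. This is a minimal illustration in plain Python, not the library's validation code: ToyBudgetManager, resolve_budget_manager, and the default budget of 0.1 are assumptions made for this example.

```python
DEFAULT_BUDGET = 0.1  # assumed default, used in this sketch only


class ToyBudgetManager:
    """Hypothetical stand-in for a budget manager."""

    def __init__(self, budget=None):
        self.budget = DEFAULT_BUDGET if budget is None else budget


def resolve_budget_manager(budget=None, budget_manager=None):
    """Apply the four rules for combining budget and budget_manager."""
    if budget_manager is None:
        # Covers "only a budget" and "neither": use the default manager.
        return ToyBudgetManager(budget=budget)
    if budget is not None and budget != budget_manager.budget:
        # Both are given, but they contradict each other.
        raise ValueError("budget differs from budget_manager.budget")
    # Only a budget_manager is given, or both agree.
    return budget_manager


print(resolve_budget_manager().budget)
print(resolve_budget_manager(budget=0.3).budget)
```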

Testing#

The test classes skactiveml.stream.budgetmanager.test.TestBudgetManager of budget managers need to inherit from the test template skactiveml.tests.template_budget_manager.TemplateBudgetManager. As a result, many required functionalities will be automatically tested. You must specify the parameters bm_class, init_default_params, and query_by_utility_params in the __init__ accordingly. Once the parameters are set, adjust the tests until all errors are resolved. Please refer to the test template for more detailed information.

Multi-Annotator Pool-based Query Strategies#

All query strategies are stored in the files skactiveml/pool/multiannotator/*.py and inherit from skactiveml.base.MultiAnnotatorPoolQueryStrategy.

The class must implement the following methods:

| Method | Description |
| --- | --- |
| __init__ | Method for initialization. |
| query | Select the annotator-sample pairs to decide which sample’s class label is to be queried from which annotator. |

query#

Required Parameters:

| Parameter | Description |
| --- | --- |
| X | Training data set, usually complete, i.e., including the labeled and unlabeled samples. |
| y | Labels of the training data set for each annotator (possibly including unlabeled ones indicated by self.MISSING_LABEL), meaning that y[i, j] contains the label provided by annotator j for sample i. |
| candidates, optional | If candidates is None, the samples from (X, y) for which an annotator exists such that the annotator-sample pair is unlabeled are considered as sample candidates. If candidates is of shape (n_candidates,) and of type int, candidates is considered as the indices of the sample candidates in (X, y). If candidates is of shape (n_candidates, n_features), the sample candidates are directly given in candidates (not necessarily contained in X). This is not supported by all query strategies. |
| annotators, optional | If annotators is None, all annotators are considered as available annotators. If annotators is of shape (n_avl_annotators,) and of type int, annotators is considered as the indices of the available annotators. If both candidate samples and available annotators are specified, the annotator-sample pairs for which the sample is a candidate sample and the annotator is an available annotator are considered as candidate annotator-sample pairs. If annotators is a boolean array of shape (n_candidates, n_avl_annotators), the annotator-sample pairs for which the sample is a candidate sample and the boolean matrix has the entry True are considered as candidate annotator-sample pairs. |
| batch_size, optional | The number of annotator-sample pairs to be selected in one AL cycle. |
| return_utilities, optional | If True, also return the utilities based on the query strategy. |

Returns:

| Parameter | Description |
| --- | --- |
| query_indices | The query_indices indicate for which candidate sample a label is to be queried, e.g., query_indices[0] indicates the first selected sample. If candidates is None or of shape (n_candidates,), the indexing refers to samples in X. If candidates is of shape (n_candidates, n_features), the indexing refers to samples in candidates. |
| utilities | The utilities of samples after each selected sample of the batch, e.g., utilities[0] indicates the utilities used for selecting the first sample (with index query_indices[0]) of the batch. Utilities for labeled samples will be set to np.nan. If candidates is None or of shape (n_candidates,), the indexing refers to samples in X. If candidates is of shape (n_candidates, n_features), the indexing refers to samples in candidates. |
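How candidate samples and available annotators combine into candidate annotator-sample pairs can be sketched as follows. This is a plain-Python illustration under our own assumptions, not the library's code: the helper candidate_pairs is hypothetical, y is assumed to hold one row per sample and one column per annotator, None stands in for MISSING_LABEL, and the boolean-matrix form of annotators is omitted.

```python
def candidate_pairs(y, candidates=None, annotators=None, missing_label=None):
    """Return candidate (sample index, annotator index) pairs.

    Hypothetical helper; y is a list of rows with one column per annotator.
    """
    n_samples, n_annotators = len(y), len(y[0])
    if candidates is None:
        # Samples with at least one unlabeled annotator-sample pair.
        candidates = [
            i for i in range(n_samples)
            if any(label is missing_label for label in y[i])
        ]
    if annotators is None:
        # All annotators are considered available.
        annotators = list(range(n_annotators))
    return [(i, j) for i in candidates for j in annotators]


y = [[0, None], [1, 1], [None, None]]
print(candidate_pairs(y))
print(candidate_pairs(y, candidates=[1], annotators=[1]))
```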

General advice#

Use the self._validate_data method (implemented in the superclass) to check the inputs X and y only once. Fit the classifier if it is not yet fitted (e.g., using fit_if_not_fitted from utils). If the strategy combines a single-annotator query strategy with a performance estimate:

  • define an aggregation function,

  • evaluate the performance for each sample-annotator pair,

  • use the SingleAnnotatorWrapper.

If the strategy is a greedy method regarding the utilities:

  • calculate utilities (in an extra function),

  • use skactiveml.utils.simple_batch function for returning values.

Testing#

The test classes skactiveml.pool.multiannotator.test.TestQueryStrategy of multi-annotator pool-based query strategies need to inherit from unittest.TestCase. In this class, each parameter a of the __init__ method needs to be tested via a method test_init_param_a. This also applies to each parameter a of the query method, which is tested via a method test_query_param_a. The main logic of the query strategy is tested via the method test_query.
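The naming scheme for the test methods might look as follows. The sketch uses a hypothetical ToyStrategy with one __init__ parameter (scale) and one query parameter (batch_size) instead of a real multi-annotator strategy, so only the structure of the test class carries over.

```python
import unittest


class ToyStrategy:
    """Hypothetical strategy used only to demonstrate the test layout."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def query(self, utilities, batch_size=1):
        # Parameters are validated when query is called, not in __init__.
        if not isinstance(self.scale, (int, float)) or self.scale <= 0:
            raise ValueError("scale must be a positive number")
        if not isinstance(batch_size, int) or batch_size < 1:
            raise ValueError("batch_size must be a positive integer")
        ranked = sorted(
            range(len(utilities)), key=lambda i: -self.scale * utilities[i]
        )
        return ranked[:batch_size]


class TestToyStrategy(unittest.TestCase):
    def test_init_param_scale(self):
        # One test method per __init__ parameter.
        for scale in (0, -1.0, "invalid"):
            with self.assertRaises(ValueError):
                ToyStrategy(scale=scale).query([0.1, 0.2])

    def test_query_param_batch_size(self):
        # One test method per query parameter.
        for batch_size in (0, 1.5, "invalid"):
            with self.assertRaises(ValueError):
                ToyStrategy().query([0.1, 0.2], batch_size=batch_size)

    def test_query(self):
        # Main logic: the two highest utilities are selected in order.
        result = ToyStrategy().query([0.1, 0.9, 0.5], batch_size=2)
        self.assertEqual(result, [1, 2])
```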

Classifiers#

Standard classifier implementations are part of the subpackage skactiveml.classifier, and classifiers learning from multiple annotators are implemented in the subpackage skactiveml.classifier.multiannotator. Every classifier inherits from skactiveml.base.SkactivemlClassifier and must implement the following methods:

| Method | Description |
| --- | --- |
| __init__ | Method for initialization. |
| fit | Method to fit the classifier for given training data. |
| predict_proba | Method predicting class-membership probabilities for samples. |
| predict | Method predicting class labels for samples. The super implementation uses predict_proba. |

__init__#

| Parameter | Description |
| --- | --- |
| classes, optional | Holds the label for each class. If None, the classes are determined during fitting. |
| missing_label, optional | Value representing a missing label. |
| cost_matrix, optional | A cost matrix where cost_matrix[i,j] indicates the cost of predicting class classes[j] for a sample of class classes[i]. Only set if classes is not None. |
| random_state, optional | Ensures reproducibility (cf. scikit-learn). |

fit#

Required Parameters:

| Parameter | Description |
| --- | --- |
| X | Matrix of feature values representing the samples. |
| y | Contains the class labels of the training samples. Missing labels are represented by the attribute missing_label. Usually, y is a column array except for multi-annotator classifiers, which expect a matrix with columns for each annotator. |
| sample_weight, optional | Contains weights for the training samples’ class labels. Must have the same shape as y. |

Returns:

| Parameter | Description |
| --- | --- |
| self | The fitted classifier object. |

General advice#

Use the self._validate_data method (implemented in the superclass) to check the standard parameters of the __init__ and fit methods. If the classes parameter was provided, the classifier can be fitted with training samples that are all assigned a missing_label. In this case, the classifier should make random predictions, i.e., output uniform class-membership probabilities when calling predict_proba. Ensure that the classifier can also handle missing labels in other cases.
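The expected fallback to uniform probabilities can be illustrated with a toy class-frequency estimate. The function toy_predict_proba is hypothetical, ignores the features entirely, and uses None as the missing label, so it only demonstrates the behavior for fully unlabeled training data.

```python
def toy_predict_proba(y_train, n_test, classes, missing_label=None):
    """Class-frequency estimate that falls back to uniform probabilities.

    Hypothetical helper for illustration only.
    """
    labeled = [label for label in y_train if label is not missing_label]
    if not labeled:
        # No labels at all: every class is equally likely.
        return [[1 / len(classes)] * len(classes) for _ in range(n_test)]
    # Otherwise use the relative class frequencies of the labeled samples.
    freq = [labeled.count(c) / len(labeled) for c in classes]
    return [list(freq) for _ in range(n_test)]


print(toy_predict_proba([None, None], n_test=2, classes=[0, 1]))
print(toy_predict_proba([0, None, 0, 1], n_test=1, classes=[0, 1]))
```

Note that the fallback is only possible because classes is known in advance. Without it, the number of classes cannot be inferred from unlabeled data.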

predict_proba#

Required Parameters:

| Parameter | Description |
| --- | --- |
| X | Matrix of feature values representing the samples for which predictions are made. |

Returns:

| Parameter | Description |
| --- | --- |
| P | The estimated class-membership probabilities per sample. |

General advice#

Check parameter X regarding its shape, i.e., use superclass method self._check_n_features to ensure a correct number of features. Check that the classifier has been fitted. If the classifier is a skactiveml.base.ClassFrequencyEstimator, this method is already implemented in the superclass.

predict#

Required Parameters:

| Parameter | Description |
| --- | --- |
| X | Matrix of feature values representing the samples for which predictions are made. |

Returns:

| Parameter | Description |
| --- | --- |
| y_pred | The estimated class label of each sample. |

General advice#

Usually, this method is already implemented by the superclass through calling the predict_proba method. If the superclass method is overridden, ensure that it can handle imbalanced costs and missing labels.
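One common way to account for imbalanced costs is to predict the class with the minimal expected cost. The following plain-Python sketch assumes this decision rule and the cost_matrix convention of the __init__ table (cost_matrix[i][j] is the cost of predicting classes[j] for a sample of class classes[i]). The helper predict_from_proba is hypothetical, and ties are resolved in favor of the first class, which is a choice of this sketch.

```python
def predict_from_proba(P, classes, cost_matrix=None):
    """Pick the class with minimal expected cost for each sample.

    Hypothetical helper for illustration only.
    """
    n_classes = len(classes)
    if cost_matrix is None:
        # Zero-one costs: reduces to picking the most probable class.
        cost_matrix = [
            [0 if i == j else 1 for j in range(n_classes)]
            for i in range(n_classes)
        ]
    y_pred = []
    for p in P:
        # Expected cost of predicting classes[j]: sum_i p[i] * cost_matrix[i][j].
        expected = [
            sum(p[i] * cost_matrix[i][j] for i in range(n_classes))
            for j in range(n_classes)
        ]
        y_pred.append(classes[expected.index(min(expected))])
    return y_pred


P = [[0.7, 0.3], [0.5, 0.5]]
print(predict_from_proba(P, classes=["a", "b"]))
# Misclassifying a true "b" is five times as costly as misclassifying a true "a".
print(predict_from_proba(P, classes=["a", "b"], cost_matrix=[[0, 1], [5, 0]]))
```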

score#

Required Parameters:

| Parameter | Description |
| --- | --- |
| X | Matrix of feature values representing the samples for which predictions are made. |
| y | Contains the true labels for each sample. |
| sample_weight, optional | Defines the importance of each sample when computing accuracy. |

Returns:

| Parameter | Description |
| --- | --- |
| score | Mean accuracy of self.predict(X) regarding y. |

General advice#

Usually, this method is already implemented by the superclass. If the superclass method is overridden, ensure that it checks the parameters and that the classifier has been fitted.

Testing#

The test classes skactiveml.classifier.TestClassifier of classifiers need to inherit from the test template skactiveml.tests.template_estimators.TemplateSkactivemlClassifier. As a result, many required functionalities will be automatically tested. You must specify the parameters estimator_class, init_default_params, fit_default_params, and predict_default_params in the __init__ accordingly. Once the parameters are set, adjust the tests until all errors are resolved. Please refer to the test template for more detailed information.

Regressors#

Standard regressor implementations are part of the subpackage skactiveml.regressor. Every regressor inherits from skactiveml.base.SkactivemlRegressor and must implement the following methods:

| Method | Description |
| --- | --- |
| __init__ | Method for initialization. |
| fit | Method to fit the regressor for given training data. |
| predict | Method predicting the target values for samples. |

__init__#

Required Parameters:

| Parameter | Description |
| --- | --- |
| random_state, optional | Ensures reproducibility (cf. scikit-learn). |
| missing_label, optional | Value representing a missing label. |

fit#

Required Parameters:

| Parameter | Description |
| --- | --- |
| X | Matrix of feature values representing the samples. |
| y | Contains the target values of the training samples. Missing labels are represented by the attribute missing_label. Usually, y is a column array except for multi-target regressors, which expect a matrix with columns for each target type. |
| sample_weight, optional | Contains weights for the training samples’ targets. Must have the same shape as y. |

Returns:

| Parameter | Description |
| --- | --- |
| self | The fitted regressor object. |

General advice#

Use the self._validate_data method (implemented in the superclass) to check the standard parameters of the __init__ and fit methods. If the regressor was fitted on training samples that are all assigned a missing_label, the regressor should predict a default value of zero when calling predict. Ensure that the regressor can also handle missing labels in other cases.

predict#

Required Parameters:

| Parameter | Description |
| --- | --- |
| X | Matrix of feature values representing the samples for which predictions are made. |

Returns:

| Parameter | Description |
| --- | --- |
| y_pred | The estimated targets per sample. |

General advice#

Check parameter X regarding its shape, i.e., use method skactiveml.utils.check_n_features to ensure a correct number of features. Check that the regressor has been fitted. If the regressor is a skactiveml.base.ProbabilisticRegressor, this method is already implemented in the superclass.

score#

Required Parameters:

| Parameter | Description |
| --- | --- |
| X | Matrix of feature values representing the samples for which predictions are made. |
| y | Contains the true target values for each sample. |
| sample_weight, optional | Defines the importance of each sample when computing the R2 score. |

Returns:

| Parameter | Description |
| --- | --- |
| score | R2 score of self.predict(X) regarding y. |

General advice#

Usually, this method is already implemented by the superclass. If the superclass method is overridden, ensure that it checks the parameters and that the regressor has been fitted.
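For reference, the R2 score with optional sample weights follows the standard definition 1 - SS_res / SS_tot. The sketch below implements this definition in plain Python. The helper name weighted_r2 is hypothetical, and the sketch does not handle constant targets (where SS_tot is zero).

```python
def weighted_r2(y_true, y_pred, sample_weight=None):
    """Coefficient of determination with optional sample weights.

    Hypothetical helper for illustration only.
    """
    w = sample_weight if sample_weight is not None else [1.0] * len(y_true)
    # Weighted mean of the true targets.
    y_mean = sum(wi * yi for wi, yi in zip(w, y_true)) / sum(w)
    # Weighted residual and total sums of squares.
    ss_res = sum(wi * (yi - pi) ** 2 for wi, yi, pi in zip(w, y_true, y_pred))
    ss_tot = sum(wi * (yi - y_mean) ** 2 for wi, yi in zip(w, y_true))
    return 1.0 - ss_res / ss_tot


print(weighted_r2([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect predictions
print(weighted_r2([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # predicting the mean
```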

Testing#

The test classes skactiveml.classifier.TestRegressor of regressors need to inherit from the test template skactiveml.tests.template_estimators.TemplateSkactivemlRegressor. As a result, many required functionalities will be automatically tested. You must specify the parameters estimator_class, init_default_params, fit_default_params, and predict_default_params in the __init__ accordingly. Once the parameters are set, adjust the tests until all errors are resolved. Please refer to the test template for more detailed information.

Annotator Models#

Annotator models implement the interface skactiveml.base.AnnotatorModelMixin. These models can estimate the performance of annotators for given samples. Each annotator model must implement the predict_annotator_perf method, which estimates the performance per sample for each annotator as a proxy for the quality of the provided annotations.

predict_annotator_perf#

Required Parameters:

| Parameter | Description |
| --- | --- |
| X | Matrix of feature values representing the samples. |

Returns:

| Parameter | Description |
| --- | --- |
| P_annot | The estimated performance per sample-annotator pair. |

General advice#

Check parameter X regarding its shape and check that the annotator model has been fitted. If no samples or class labels were provided during the previous call of the fit method, the maximum value of annotator performance should be outputted for each sample-annotator pair.

Examples#

Two of our main goals are to make active learning more understandable and improve our framework’s usability. Therefore, we require an example for each query strategy. To do so, create a file named scikit-activeml/docs/examples/query_strategy.json. Currently, we support examples for single-annotator pool-based and stream-based query strategies.

The JSON file supports the following entries:

| Entry | Description |
| --- | --- |
| class | Query strategy’s class name. |
| package | Name of the sub-package (e.g., pool). |
| method | Query strategy’s official name. |
| category | The methodological category of this query strategy, e.g., Expected Error Reduction, Model Change, Query-by-Committee, Random Sampling, Uncertainty Sampling, or Others. |
| template | Defines the general setup/setting of the example. Supported templates include: examples/template_pool.py, examples/template_pool_regression.py, examples/template_stream.py, and examples/template_pool_batch.py. |
| tags | Search categories. Supported tags include pool, stream, single-annotator, multi-annotator, classification, and regression. |
| title | Title of the example, usually named after the query strategy. |
| text_0 | Placeholder for additional explanations. |
| refs | References (BibTeX keys) to the paper(s) describing the query strategy. |
| sequence | Order in which content is displayed, usually ["title", "text_0", "plot", "refs"]. |
| import_misc | Python code for imports (e.g., from skactiveml.pool import RandomSampling). |
| n_samples | Number of samples in the example dataset. |
| init_qs | Python code to initialize the query strategy object, e.g., RandomSampling(). |
| query_params | Python code for parameters passed to the query method, e.g., X=X, y=y. |
| preproc | Python code for preprocessing before executing the AL cycle, e.g., X = (X-X.min())/(X.max()-X.min()). |
| n_cycles | Number of active learning cycles. |
| init_clf | Python code to initialize the classifier object, e.g., ParzenWindowClassifier(classes=[0, 1]). (Only supported for certain templates.) |
| init_reg | Python code to initialize the regressor object, e.g., NICKernelRegressor(). (Only supported for the regression template.) |
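To give an impression of such a file, the following Python snippet assembles one possible entry and prints it as JSON. It is a sketch under several assumptions that should be checked against existing files in docs/examples: whether tags and refs are lists or strings, whether the file holds a single object or a list of objects, and the concrete values for method, n_samples, and n_cycles are all placeholders. The remaining values are taken from the examples in the table above.

```python
import json

# Hypothetical entry; value types and numbers are assumptions of this sketch.
example = {
    "class": "RandomSampling",
    "package": "pool",
    "method": "Random Sampling",
    "category": "Random Sampling",
    "template": "examples/template_pool.py",
    "tags": ["pool", "single-annotator", "classification"],
    "title": "Random Sampling",
    "text_0": "",
    "refs": [],  # BibTeX keys of the describing paper(s) would go here
    "sequence": ["title", "text_0", "plot", "refs"],
    "import_misc": "from skactiveml.pool import RandomSampling",
    "n_samples": 100,
    "init_qs": "RandomSampling()",
    "query_params": "X=X, y=y",
    "preproc": "",
    "n_cycles": 20,
    "init_clf": "ParzenWindowClassifier(classes=[0, 1])",
}

print(json.dumps(example, indent=2))
```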

Testing and Code Coverage#

Please ensure test coverage is close to 100%. The current code coverage can be viewed here.

Documentation#

The guidelines for writing documentation in scikit-activeml follow those used by scikit-learn.

Building the Documentation#

To ensure your documentation is well formatted, build it using Sphinx:

sphinx-build -b html docs docs/_build

Issue Tracking#

We use GitHub Issues as our issue tracker. If you believe you have found a bug in scikit-activeml, please report it there. Documentation bugs can also be reported.

Checking If a Bug Already Exists#

Before filing an issue, please check whether the problem has already been reported. This will help determine if the problem is resolved or fixed in an upcoming release, save time, and provide guidance on how to fix it. Search the issue database using the search box at the top of the issue tracker page (filter by the bug label).

Reporting an Issue#

Use the following labels when reporting an issue:

| Label | Use Case |
| --- | --- |
| bug | Something isn’t working |
| enhancement | Request for a new feature |
| documentation | Improvements or additions to documentation |
| question | General questions |