Contributing Guide#
scikit-activeml is a library that implements the most important query strategies for active learning. It is built upon the well-known machine learning framework scikit-learn.
Overview#
Our philosophy is to extend the sklearn
ecosystem with the most relevant
query strategies for active learning and to implement tools for working with
partially unlabeled data. An overview of our repository’s structure is provided
in the image below. Each node represents a class or interface, and the arrows
illustrate the inheritance hierarchy among them. Dashed nodes indicate
functionality that is not yet available in our library. scikit-learn
is used in this image for identification only and does not imply endorsement.

In our package skactiveml, there are three major components:
SkactivemlClassifier, SkactivemlRegressor, and QueryStrategy.
The classifier and regressor modules are necessary to handle partially unlabeled
data and to implement active-learning–specific estimators. This way, an active
learning cycle can be easily implemented starting with zero initial labels.
Regarding active learning query strategies, we currently differentiate between
the pool-based paradigm (a large pool of unlabeled samples is available) and the
stream-based paradigm (unlabeled samples arrive sequentially, i.e., as a stream).
Furthermore, we distinguish between the single-annotator and multi-annotator
settings. In the latter case, multiple error-prone annotators are queried to
provide labels. As a result, an active learning query strategy not only decides
which samples to query but also which annotators should be queried.
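The pool-based cycle described above can be sketched in plain Python. This is only an illustration under stated assumptions: it does not use the skactiveml API, and `random_query`, `oracle`, and `is_unlabeled` are hypothetical helpers introduced here. It shows how a cycle can start with zero labels by marking every label as missing.

```python
import math
import random

MISSING_LABEL = math.nan  # stand-in marker for "no label yet"


def is_unlabeled(label):
    """Return True if the label is the missing-label marker."""
    return isinstance(label, float) and math.isnan(label)


def random_query(y, batch_size, rng):
    """Hypothetical query strategy: pick unlabeled indices at random."""
    candidates = [i for i, label in enumerate(y) if is_unlabeled(label)]
    return rng.sample(candidates, min(batch_size, len(candidates)))


def oracle(x):
    """Hypothetical annotator: labels a sample by the sign of its feature."""
    return int(x >= 0.0)


rng = random.Random(0)
X = [rng.uniform(-1.0, 1.0) for _ in range(20)]  # pool of samples
y = [MISSING_LABEL] * len(X)  # the cycle starts with zero labels

for cycle in range(5):
    # Select samples from the pool, then ask the annotator for their labels.
    query_idx = random_query(y, batch_size=2, rng=rng)
    for i in query_idx:
        y[i] = oracle(X[i])

n_labeled = sum(not is_unlabeled(label) for label in y)
print(n_labeled)
```

In a real cycle, a classifier or regressor would be refitted on the partially labeled data after each query, and the query strategy would use it to rank the candidates.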
Thank You, Contributors!#
A big thank you to all contributors who provide the scikit-activeml project with new enhancements and bug fixes.
Getting Help#
If you have any questions, please reach out to other developers via the following channels:
Roadmap#
Our roadmap is summarized in the issue Upcoming Features.
Get Started#
Before you contribute to this project, please follow the steps below.
Setup Development Environment#
There are several ways to create a local Python environment, such as
virtualenv,
pipenv, or
miniconda. One possible
workflow is to install miniconda
and use it to create a Python environment.
Example with miniconda#
Create a new Python environment named scikit-activeml:
conda create -n scikit-activeml
To ensure that the correct environment is active:
conda activate scikit-activeml
Then install pip:
conda install pip
Install Dependencies#
Now, install the required project dependencies, which are defined in the
requirements.txt and requirements_extra.txt (for development) files.
# Make sure your scikit-activeml Python environment is active!
cd <project-root>
# The quotes keep shells such as zsh from interpreting the square brackets.
pip install -e ".[dev]"
After the pip installation is successful, you must install pandoc and
ghostscript if they are not already installed.
Example with macOS (Homebrew)#
brew install pandoc ghostscript
Contributing Code#
General Coding Conventions#
This library follows the conventions of scikit-learn and should conform to the PEP 8 Style Guide for Python code. For linting, the use of flake8 is recommended. The Python package black provides a simple solution for code formatting. For example, you can format your code using the following command:
black skactiveml
Example for Code Contribution Cycle (C3) and Pull Requests#
Fork the repository using the GitHub Fork button.
Clone your fork to your local machine:
git clone https://github.com/<your-username>/scikit-activeml.git
Create a new branch for your changes from the development branch:
git checkout -b <branch-name>
After you have finished implementing the feature, ensure that all tests pass. You can run the tests using:
pytest
Make sure you have covered all lines with tests.
pytest --cov=./skactiveml
Commit and push your changes.
git add <modified-files>
git commit -m "<commit-message>"
git push
Create a pull request.
Query Strategies#
All query strategies inherit from the abstract superclass
skactiveml.base.QueryStrategy, which is implemented in skactiveml/base.py.
This superclass inherits from sklearn.base.BaseEstimator. By default, its
__init__ method takes a random_state parameter, and the abstract query
method enforces the implementation of the sample selection logic.
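The pattern described above can be illustrated with a simplified stand-in. The classes below are hypothetical and are not the library's code: they omit the scikit-learn base class and only show how an abstract `query` method forces subclasses to supply the selection logic, and how `random_state` is stored unchanged in `__init__`.

```python
import random
from abc import ABC, abstractmethod


class SketchQueryStrategy(ABC):
    """Simplified stand-in for an abstract query strategy base class."""

    def __init__(self, random_state=None):
        # Store the parameter unchanged, as scikit-learn estimators do.
        self.random_state = random_state

    @abstractmethod
    def query(self, *args, **kwargs):
        """Subclasses must implement the sample selection logic."""


class SketchRandomSampling(SketchQueryStrategy):
    """Hypothetical concrete strategy: select candidates at random."""

    def query(self, candidates, batch_size=1):
        rng = random.Random(self.random_state)
        return rng.sample(list(candidates), batch_size)


# The abstract class cannot be instantiated; the subclass can.
strategy = SketchRandomSampling(random_state=0)
selected = strategy.query(range(10), batch_size=3)
```

Instantiating `SketchQueryStrategy` directly raises a `TypeError`, which is the mechanism that enforces the implementation of `query`.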
Single-annotator Pool-based Query Strategies#
General#
Single-annotator pool-based query strategies are stored in files matching
skactiveml/pool/*.py and inherit from
skactiveml.base.SingleAnnotatorPoolQueryStrategy.
The class must implement the following methods:
Method | Description |
---|---|
__init__ | Method for initialization. |
query | Select the samples whose labels are to be queried. |
__init__#
For typical class parameters, we use standard names:
Parameter |
Description |
---|---|
|
An integer or a np.random.RandomState, similar to scikit-learn. |
|
Prior probabilities for the distribution in probabilistic strategies. |
|
A string for classes that implement multiple methods. |
|
A cost matrix defining the cost of misclassifying samples. |
query#
Required Parameters:
Parameter |
Description |
---|---|
|
Training dataset, usually complete (i.e., including both labeled and unlabeled samples). |
|
Labels of the training dataset. (May include unlabeled samples, indicated by a MISSING_LABEL.) |
|
If |
|
Number of samples to be selected in one AL cycle. |
|
If True, additionally return the utilities computed by the query strategy. |
Returns:
Parameter |
Description |
---|---|
|
Indices indicating which candidate sample’s
label is to be queried. For example,
|
|
Utilities of the samples after selection.
For example, |
General Advice#
Use the self._validate_data method (implemented in the superclass) to check
the inputs X and y only once. Fit the classifier or regressor if it is not yet
fitted (using fit_if_not_fitted from utils). Calculate utilities via an extra
public function. Use the simple_batch function from utils to determine the
query indices and set the utilities in naive batch query strategies.
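The idea behind a naive batch selection can be sketched as follows. The helper `select_top_k` is hypothetical and is not the library's `simple_batch` function, whose signature and tie-breaking may differ; the sketch only shows the principle of computing utilities first and then picking the `batch_size` best candidates.

```python
def select_top_k(utilities, batch_size=1):
    """Return the indices of the batch_size largest utilities, best first."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # sorted() is stable, so ties keep their original candidate order.
    order = sorted(
        range(len(utilities)), key=lambda i: utilities[i], reverse=True
    )
    return order[:batch_size]


utilities = [0.1, 0.7, 0.3, 0.9]  # one utility per candidate sample
query_indices = select_top_k(utilities, batch_size=2)
print(query_indices)  # prints [3, 1]
```

Keeping the utility computation separate from the selection step makes the utilities easy to test and to return alongside the query indices.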
Testing#
The test classes in skactiveml.pool.test.TestQueryStrategy for
single-annotator pool-based query strategies must inherit from the test
template
skactiveml.tests.template_query_strategy.TemplateSingleAnnotatorPoolQueryStrategy.
As a result, many required functionalities will be automatically tested.
You must specify the parameters of qs_class and init_default_params in the
__init__ accordingly. Depending on whether the query strategy can handle
regression, classification, or both, you also need to define the parameters
query_default_params_reg or query_default_params_clf. Once the parameters
are set, adjust the tests until all errors are resolved. Please refer to the test
template for more detailed information.
Single-annotator Stream-based Query Strategies#
General#
All query strategies are stored in files matching skactiveml/stream/*.py.
Every query strategy inherits from
skactiveml.base.SingleAnnotatorStreamQueryStrategy. Every query strategy has
either internal budget handling or an outsourced budget_manager.
For typical class parameters we use standard names:
Parameter |
Description |
---|---|
|
Integer that acts as random seed
or |
|
The share of labels that the strategy is allowed to query. |
|
Enforces the budget constraint. |
The class must implement the following methods:
Function | Description |
---|---|
__init__ | Function for initialization. |
query | Identify the instances whose labels are to be queried, without adapting the internal state. |
update | Adapt the budget monitoring according to the queried labels. |
query#
Required Parameters:
Parameter |
Description |
---|---|
|
Set of candidate instances,
inherited from
|
|
The classifier used by the strategy. |
|
Set of labeled and unlabeled instances. |
|
Labels of |
|
Weights for each instance in
|
|
Uses |
|
Whether to return the candidates’ utilities,
inherited from |
Returns:
Parameter |
Description |
---|---|
|
Indices of the best instances
from |
|
Utilities of all candidate
instances, only if
|
General advice#
The query method must not change the internal state of the query strategy
(budget, budget_manager, and random_state included), so that multiple
instances can be assessed with the same state. Update the internal state
in the update() method. If the class implements a classifier (clf), the
optional attributes need to be implemented. Use the self._validate_data method
(implemented in the superclass). Check the inputs X and y only once. Fit the
classifier if fit_clf is set to True.
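The separation between a side-effect-free query and a state-changing update can be sketched as follows. The class, its attributes, and its budget rule are hypothetical and much simpler than the library's strategies: `query` works on local copies of the counters, and only `update` writes to the object.

```python
class SketchBudgetThresholdStrategy:
    """Hypothetical stream strategy: query an instance if its utility exceeds
    a threshold and the share of queried instances stays within the budget."""

    def __init__(self, budget=0.1, threshold=0.5):
        self.budget = budget
        self.threshold = threshold
        self.n_seen_ = 0
        self.n_queried_ = 0

    def query(self, utilities):
        """Return the indices to query without modifying any attribute."""
        n_seen, n_queried = self.n_seen_, self.n_queried_  # local copies
        queried_indices = []
        for i, utility in enumerate(utilities):
            n_seen += 1
            within_budget = (n_queried + 1) / n_seen <= self.budget
            if utility > self.threshold and within_budget:
                queried_indices.append(i)
                n_queried += 1
        return queried_indices

    def update(self, candidates, queried_indices):
        """Adapt the budget monitoring according to the queried labels."""
        self.n_seen_ += len(candidates)
        self.n_queried_ += len(queried_indices)
        return self


strategy = SketchBudgetThresholdStrategy(budget=0.5)
utilities = [0.9, 0.2, 0.8, 0.95]
first = strategy.query(utilities)
second = strategy.query(utilities)  # same state, hence the same answer
strategy.update(utilities, first)   # only now does the state change
```

Because `query` leaves the object untouched, the same candidates can be assessed repeatedly, for example to compare several strategies on one stream.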
update#
Required Parameters:
Parameter |
Description |
---|---|
|
Set of candidate instances,
inherited from
|
|
Typically the return value of
|
|
Provides additional parameters to
the |
General advice#
Use self._validate_data in case the strategy is used without calling the query
method (if parameters need to be initialized before the update). If a
budget_manager is used, forward the update call to the budget_manager.update
method.
Testing#
The test classes in skactiveml.stream.test.TestQueryStrategy for
single-annotator stream-based query strategies must inherit from the test
template
skactiveml.tests.template_query_strategy.TemplateSingleAnnotatorStreamQueryStrategy.
As a result, many required functionalities will be automatically tested.
You must specify the parameters of qs_class and init_default_params in the
__init__ accordingly. Depending on whether the query strategy can handle
regression, classification, or both, you also need to define the parameters
query_default_params_reg or query_default_params_clf. Once the parameters
are set, adjust the tests until all errors are resolved. Please refer to the test
template for more detailed information.
budget_manager#
All budget managers are stored in skactiveml/stream/budget_manager/*.py.
The class must implement the following methods:
Method | Description |
---|---|
__init__ | Function for initialization. |
query_by_utilities | Identify which instances to query based on the assessed utility. |
update | Adapt the budget monitoring according to the queried labels. |
update#
The update method of the budget manager has the same functionality as the query strategy update.
Required Parameters:
Parameter |
Description |
---|---|
|
% of labels that the strategy is allowed to query |
|
Integer that acts as random seed
or |
query_by_utilities#
Required Parameters:
Parameter |
Description |
---|---|
|
The |
Returns:
Parameter |
Description |
---|---|
|
The indices of samples in
candidates whose labels are
queried, with
|
General advice for working with a budget_manager:#
If a budget_manager is used, the _validate_data of the query strategy needs to
be adapted accordingly:
If only a budget is given, use the default budget_manager with the given budget.
If only a budget_manager is given, use the budget_manager.
If neither is given, use the default budget_manager with the default budget.
If both are given and the budget differs from budget_manager.budget, throw an error.
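The four rules above can be sketched as a small helper. `SketchBudgetManager`, `DEFAULT_BUDGET`, and `resolve_budget_manager` are hypothetical names introduced for illustration; the default value and the error type are assumptions and are not taken from the library.

```python
DEFAULT_BUDGET = 0.1  # hypothetical default, chosen only for this sketch


class SketchBudgetManager:
    """Hypothetical stand-in for a default budget manager."""

    def __init__(self, budget=None):
        self.budget = DEFAULT_BUDGET if budget is None else budget


def resolve_budget_manager(budget=None, budget_manager=None):
    """Apply the four validation rules for budget and budget_manager."""
    if budget_manager is None:
        # Covers "only a budget" and "neither given" (default budget).
        return SketchBudgetManager(budget=budget)
    if budget is not None and budget != budget_manager.budget:
        raise ValueError(
            f"budget={budget} differs from "
            f"budget_manager.budget={budget_manager.budget}"
        )
    # Covers "only a budget_manager" and "both given and consistent".
    return budget_manager


manager = resolve_budget_manager(budget=0.3)
print(manager.budget)  # prints 0.3
```

Resolving both parameters in one place keeps the strategy's query and update methods free of repeated consistency checks.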
Testing#
The test classes skactiveml.stream.budgetmanager.test.TestBudgetManager
of budget managers need to inherit from the test template
skactiveml.tests.template_budget_manager.TemplateBudgetManager.
As a result, many required functionalities will be automatically tested.
You must specify the parameters of bm_class, init_default_params, and
query_by_utility_params in the __init__ accordingly. Once the parameters are
set, adjust the test until all errors are resolved. Please refer to the test
template for more detailed information.
Multi-Annotator Pool-based Query Strategies#
All query strategies are stored in files matching
skactiveml/pool/multiannotator/*.py and inherit from
skactiveml.base.MultiAnnotatorPoolQueryStrategy.
The class must implement the following methods:
Method | Description |
---|---|
__init__ | Method for initialization. |
query | Select the annotator-sample pairs to decide which sample’s class label is to be queried from which annotator. |
query#
Required Parameters:
Parameter |
Description |
---|---|
|
Training data set, usually complete, i.e. including the labeled and unlabeled samples. |
|
Labels of the training data set
for each annotator (possibly
including unlabeled ones
indicated by self.MISSING_LABEL),
meaning that |
|
If |
|
If |
|
The number of annotator-sample pairs to be selected in one AL cycle. |
|
If |
Returns:
Parameter |
Description |
---|---|
|
The |
|
The utilities of samples after
each selected sample of the
batch, e.g., |
General advice#
Use the self._validate_data method (implemented in the superclass).
Check the inputs X and y only once. Fit the classifier if it is not
yet fitted (you may use fit_if_not_fitted from utils).
If the strategy combines a single-annotator query strategy with a performance
estimate:
define an aggregation function,
evaluate the performance for each sample-annotator pair,
use the SingleAnnotatorWrapper.
If the strategy is a greedy method regarding the utilities:
calculate utilities (in an extra function),
use the skactiveml.utils.simple_batch function for returning values.
Testing#
The test classes skactiveml.pool.multiannotator.test.TestQueryStrategy of
multi-annotator pool-based query strategies need to inherit from
unittest.TestCase. In this class, each parameter a of the __init__ method
needs to be tested via a method test_init_param_a. This also applies to a
parameter a of the query method, which is tested via a method
test_query_param_a. The main logic of the query strategy is tested via the
method test_query.
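The naming convention above can be sketched with a toy example. `SketchStrategy` and its parameters are hypothetical and stand in for a real query strategy; the sketch only shows how the test method names map to the parameters of `__init__` and `query`.

```python
import unittest


class SketchStrategy:
    """Hypothetical strategy with one __init__ and two query parameters."""

    def __init__(self, method="max"):
        self.method = method

    def query(self, utilities, batch_size=1):
        if self.method not in ("max", "min"):
            raise ValueError("method must be 'max' or 'min'")
        if not isinstance(batch_size, int) or batch_size < 1:
            raise ValueError("batch_size must be a positive integer")
        order = sorted(
            range(len(utilities)),
            key=lambda i: utilities[i],
            reverse=(self.method == "max"),
        )
        return order[:batch_size]


class TestSketchStrategy(unittest.TestCase):
    def test_init_param_method(self):
        # Parameter "method" of __init__.
        with self.assertRaises(ValueError):
            SketchStrategy(method="unknown").query([0.1, 0.2])

    def test_query_param_utilities(self):
        # Parameter "utilities" of query.
        with self.assertRaises(TypeError):
            SketchStrategy().query(None)

    def test_query_param_batch_size(self):
        # Parameter "batch_size" of query.
        with self.assertRaises(ValueError):
            SketchStrategy().query([0.1, 0.2], batch_size=0)

    def test_query(self):
        # Main logic of the strategy.
        result = SketchStrategy().query([0.1, 0.9, 0.5], batch_size=2)
        self.assertEqual(result, [1, 2])
```

Such a test class can be run with pytest or with `python -m unittest`.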
Classifiers#
Standard classifier implementations are part of the subpackage
skactiveml.classifier, and classifiers learning from multiple annotators are
implemented in the subpackage skactiveml.classifier.multiannotator. Every
classifier inherits from skactiveml.base.SkactivemlClassifier and must
implement the following methods:
Method | Description |
---|---|
__init__ | Method for initialization. |
fit | Method to fit the classifier for given training data. |
predict_proba | Method predicting class-membership probabilities for samples. |
predict | Method predicting class labels for samples. The super implementation uses predict_proba. |
__init__#
Parameter |
Description |
---|---|
|
Holds the label for each class. If None, the classes are determined during fitting. |
|
Value representing a missing label. |
|
A cost matrix where |
|
Ensures reproducibility (cf. scikit-learn). |
fit#
Required Parameters:
Parameter |
Description |
---|---|
|
Matrix of feature values representing the samples. |
|
Contains the class labels of the training
samples. Missing labels are represented by
the attribute |
|
Contains weights for the training samples’
class labels. Must have the same shape as
|
Returns:
Parameter |
Description |
---|---|
|
The fitted classifier object. |
General advice#
Use the self._validate_data method (implemented in the superclass) to
check the standard parameters of the __init__ and fit methods. If the
classes parameter was provided, the classifier can be fitted with
training samples that are all assigned a missing_label.
In this case, the classifier should make random predictions, i.e.,
output uniform class-membership probabilities when predict_proba is called.
Ensure that the classifier can also handle missing labels in other cases.
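The expected behavior for a fully unlabeled training set can be sketched with a toy classifier. `SketchFrequencyClassifier` is hypothetical and does not inherit from the library's base class; it ignores the features and predicts the class frequencies of the labeled samples, falling back to uniform probabilities when every label is missing.

```python
import math


class SketchFrequencyClassifier:
    """Hypothetical classifier predicting the labeled class frequencies."""

    def __init__(self, classes=None, missing_label=math.nan):
        self.classes = classes
        self.missing_label = missing_label

    def _is_missing(self, label):
        # NaN never equals itself, so it needs a dedicated check.
        if isinstance(self.missing_label, float) and math.isnan(
            self.missing_label
        ):
            return isinstance(label, float) and math.isnan(label)
        return label == self.missing_label

    def fit(self, X, y):
        labeled = [label for label in y if not self._is_missing(label)]
        if self.classes is None:
            if not labeled:
                raise ValueError("classes is required if all labels are missing")
            self.classes_ = sorted(set(labeled))
        else:
            self.classes_ = list(self.classes)
        self.counts_ = [labeled.count(c) for c in self.classes_]
        return self

    def predict_proba(self, X):
        total = sum(self.counts_)
        n_classes = len(self.classes_)
        if total == 0:
            row = [1.0 / n_classes] * n_classes  # no labels: uniform
        else:
            row = [count / total for count in self.counts_]
        return [list(row) for _ in X]


clf = SketchFrequencyClassifier(classes=[0, 1])
clf.fit([[0.0], [1.0]], [math.nan, math.nan])  # every label is missing
print(clf.predict_proba([[0.5]]))  # prints [[0.5, 0.5]]
```

Without the `classes` parameter, the set of classes cannot be inferred from a fully unlabeled training set, which is why the sketch raises an error in that case.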
predict_proba#
Required Parameters:
Parameter |
Description |
---|---|
|
Matrix of feature values representing the samples for which predictions are made. |
Returns:
Parameter |
Description |
---|---|
|
The estimated class-membership probabilities per sample. |
General advice#
Check the shape of parameter X, i.e., use the superclass method
self._check_n_features to ensure the correct number of features. Check
that the classifier has been fitted. If the classifier is a
skactiveml.base.ClassFrequencyEstimator, this method is already
implemented in the superclass.
predict#
Required Parameters:
Parameter |
Description |
---|---|
|
Matrix of feature values representing the samples for which predictions are made. |
Returns:
Parameter |
Description |
---|---|
|
The estimated class label per sample. |
General advice#
Usually, this method is already implemented by the superclass through
calling the predict_proba method. If the superclass method is
overwritten, ensure that it can handle imbalanced costs and missing
labels.
score#
Required Parameters:
Parameter |
Description |
---|---|
|
Matrix of feature values representing the samples for which predictions are made. |
|
Contains the true labels for each sample. |
|
Defines the importance of each sample when computing accuracy. |
Returns:
Parameter |
Description |
---|---|
|
Mean accuracy of
|
General advice#
Usually, this method is already implemented by the superclass. If the superclass method is overwritten, ensure that it checks the parameters and that the classifier has been fitted.
Testing#
The test classes skactiveml.classifier.TestClassifier
of classifiers need to inherit from the test template
skactiveml.tests.template_estimators.TemplateSkactivemlClassifier.
As a result, many required functionalities will be automatically tested.
You must specify the parameters of estimator_class, init_default_params,
fit_default_params, and predict_default_params in the __init__ accordingly.
Once the parameters are set, adjust the test until all errors are resolved.
Please refer to the test template for more detailed information.
Regressors#
Standard regressor implementations are part of the subpackage
skactiveml.regressor. Every regressor inherits from
skactiveml.base.SkactivemlRegressor and must implement the following methods:
Method | Description |
---|---|
__init__ | Method for initialization. |
fit | Method to fit the regressor for given training data. |
predict | Method predicting the target values for samples. |
__init__#
Required Parameters:
Parameter |
Description |
---|---|
|
Ensures reproducibility (cf. scikit-learn). |
|
Value representing a missing label. |
fit#
Required Parameters:
Parameter |
Description |
---|---|
|
Matrix of feature values representing the samples. |
|
Contains the target values of the training
samples. Missing labels are represented by
the attribute |
|
Contains weights for the training samples’
targets. Must have the same shape as |
Returns:
Parameter |
Description |
---|---|
|
The fitted regressor object. |
General advice#
Use the self._validate_data method (implemented in the superclass) to
check the standard parameters of the __init__ and fit methods. If the
regressor was fitted on training samples that are all assigned a
missing_label, the regressor should predict a default value of zero when
predict is called. Ensure that the regressor can also handle missing labels
in other cases.
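The default-value rule can be sketched with a toy regressor. `SketchMeanRegressor` is hypothetical, supports only NaN as the missing label, and ignores the features: it predicts the mean of the labeled targets and falls back to zero when no target is labeled.

```python
import math


class SketchMeanRegressor:
    """Hypothetical regressor predicting the mean of the labeled targets."""

    def fit(self, X, y):
        # Keep only targets that are not NaN, i.e., not missing.
        targets = [
            t for t in y if not (isinstance(t, float) and math.isnan(t))
        ]
        # Default to zero if every target is missing.
        self.mean_ = sum(targets) / len(targets) if targets else 0.0
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


reg = SketchMeanRegressor().fit([[0.0], [1.0]], [math.nan, math.nan])
print(reg.predict([[0.5], [2.0]]))  # prints [0.0, 0.0]
```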
predict#
Required Parameters:
Parameter |
Description |
---|---|
|
Matrix of feature values representing the samples for which predictions are made. |
Returns:
Parameter |
Description |
---|---|
|
The estimated targets per sample. |
General advice#
Check the shape of parameter X, i.e., use the method
skactiveml.utils.check_n_features to ensure the correct number of
features. Check that the regressor has been fitted. If the regressor is a
skactiveml.base.ProbabilisticRegressor, this method is already
implemented in the superclass.
score#
Required Parameters:
Parameter |
Description |
---|---|
|
Matrix of feature values representing the samples for which predictions are made. |
|
Contains the true target values for each sample. |
|
Defines the importance of each sample when computing the R2 score. |
Returns:
Parameter |
Description |
---|---|
|
|
General advice#
Usually, this method is already implemented by the superclass. If the superclass method is overwritten, ensure that it checks the parameters and that the regressor has been fitted.
Testing#
The test classes skactiveml.regressor.TestRegressor
of regressors need to inherit from the test template
skactiveml.tests.template_estimators.TemplateSkactivemlRegressor.
As a result, many required functionalities will be automatically tested.
You must specify the parameters of estimator_class, init_default_params,
fit_default_params, and predict_default_params in the __init__ accordingly.
Once the parameters are set, adjust the test until all errors are resolved.
Please refer to the test template for more detailed information.
Annotator Models#
Annotator models implement the interface
skactiveml.base.AnnotatorModelMixin. These models can estimate the
performance of annotators for given samples. Each annotator model must implement
the predict_annotator_perf method, which estimates the performance per
sample for each annotator as a proxy for the quality of the provided annotations.
predict_annotator_perf#
Required Parameters:
Parameter |
Description |
---|---|
|
Matrix of feature values representing the samples. |
Returns:
Parameter |
Description |
---|---|
|
The estimated performance per sample-annotator pair. |
General advice#
Check the shape of parameter X and check that the annotator model has been
fitted. If no samples or class labels were provided during the previous call
of the fit method, the maximum value of annotator performance should be
returned for each sample-annotator pair.
Examples#
Two of our main goals are to make active learning more understandable and
improve our framework’s usability. Therefore, we require an example for each
query strategy. To do so, create a file named
scikit-activeml/docs/examples/query_strategy.json. Currently, we support
examples for single-annotator pool-based and stream-based query strategies.
The JSON file supports the following entries:
Entry |
Description |
---|---|
|
Query strategy’s class name. |
|
Name of the sub-package (e.g., pool). |
|
Query strategy’s official name. |
|
The methodological category of this query strategy, e.g., Expected Error Reduction, Model Change, Query-by-Committee, Random Sampling, Uncertainty Sampling, or Others. |
|
Defines the general setup/setting of the example.
Supported templates include:
|
|
Search categories. Supported tags include |
|
Title of the example, usually named after the query strategy. |
|
Placeholder for additional explanations. |
|
References (BibTeX keys) to the paper(s) describing the query strategy. |
|
Order in which content is displayed, usually
|
|
Python code for imports (e.g.,
|
|
Number of samples in the example dataset. |
|
Python code to initialize the query strategy object,
e.g., |
|
Python code for parameters passed to the query method,
e.g., |
|
Python code for preprocessing before executing the AL
cycle, e.g., |
|
Number of active learning cycles. |
|
Python code to initialize the classifier object, e.g.,
|
|
Python code to initialize the regressor object, e.g.,
|
Testing and Code Coverage#
Please ensure test coverage is close to 100%. The current code coverage can be viewed here.
Documentation#
Guidelines for writing documentation in scikit-activeml follow the
documentation guidelines used by scikit-learn.
Building the Documentation#
To ensure your documentation is well formatted, build it using Sphinx:
sphinx-build -b html docs docs/_build
Issue Tracking#
We use GitHub Issues as our issue tracker. If you believe you have found a bug
in scikit-activeml, please report it there. Documentation bugs can also be
reported.
Checking If a Bug Already Exists#
Before filing an issue, please check whether the problem has already been reported.
This will help determine if the problem is resolved or fixed in an upcoming release,
save time, and provide guidance on how to fix it. Search the issue database using
the search box at the top of the issue tracker page (filter by the bug label).
Reporting an Issue#
Use the following labels when reporting an issue:
Label |
Use Case |
---|---|
|
Something isn’t working |
|
Request for a new feature |
|
Improvement or additions to documentation |
|
General questions |