Contributing Guide#
scikit-activeml is a library that implements the most important query strategies of active learning. It is built upon the well-known machine learning framework scikit-learn.
Overview#
Our philosophy is to extend the sklearn
eco-system with the most relevant
query strategies for active learning and to implement tools for working with
partially unlabeled data. An overview of our repository’s structure is given in
the image below. Each node represents a class or interface. The arrows
illustrate the inheritance hierarchy among them. The functionality of a dashed
node is not yet available in our library.

In our package skactiveml, there are three major components, i.e., SkactivemlClassifier, SkactivemlRegressor, and QueryStrategy.
The classifier and regressor modules are necessary to deal with partially
unlabeled data and to implement active-learning-specific estimators. This way,
an active learning cycle that starts with zero initial labels can be
implemented easily. Regarding the active learning query strategies, we currently
distinguish between the pool-based (a large pool of unlabeled samples is available)
and the stream-based (unlabeled samples arrive sequentially, i.e., as a stream)
paradigm. On top of both paradigms, we also distinguish between the single- and
multi-annotator setting. In the latter setting, multiple error-prone annotators
are queried to provide labels. As a result, an active learning query strategy
decides not only which samples but also which annotators should be queried.
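To make the idea of a pool-based cycle that starts with zero labels concrete, here is a minimal sketch in plain NumPy. It does not use the library's classes: the helper query_random, the NaN marker for missing labels, and the toy data are assumptions made only for illustration.

```python
import numpy as np

MISSING = np.nan  # assumed stand-in for the library's missing-label marker


def query_random(y, rng, batch_size=1):
    """Pick `batch_size` indices among the samples whose label is missing."""
    unlabeled = np.flatnonzero(np.isnan(y))
    return rng.choice(unlabeled, size=batch_size, replace=False)


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))            # pool of samples
y_true = (X[:, 0] > 0).astype(float)    # oracle labels, hidden from the learner
y = np.full(len(X), MISSING)            # the cycle starts with zero labels

for _ in range(5):                      # five active learning cycles
    idx = query_random(y, rng)
    y[idx] = y_true[idx]                # the annotator provides the label

n_labeled = int(np.sum(~np.isnan(y)))
```

A real query strategy would replace query_random with a utility-based selection, and an estimator that tolerates missing labels would be refitted on (X, y) in every cycle.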
Thank you, contributors!#
A big thank you to all contributors who provide the scikit-activeml project with new enhancements and bug fixes.
Getting Help#
If you have any questions, please reach out to other developers via the following channels:
Roadmap#
Our roadmap is summarized in the issue Upcoming Features.
Get Started#
Before you can contribute to this project, complete the following steps.
Setup Development Environment#
There are several ways to create a local Python environment, such as
virtualenv,
pipenv,
miniconda, etc. One
possible workflow is to install miniconda
and use it to create a
Python environment.
Example With miniconda#
Create a new Python environment named scikit-activeml:
conda create -n scikit-activeml
To be sure that the correct environment is active:
conda activate scikit-activeml
Then install pip:
conda install pip
Install Dependencies#
Now we can install the required project dependencies, which are defined
in the requirements.txt and requirements_extra.txt (for development) files.
# Make sure your scikit-activeml python environment is active!
cd <project-root>
pip install -r requirements.txt
pip install -r requirements_extra.txt
After the pip installation has succeeded, install pandoc and ghostscript
if they are not already installed.
Example with macOS (Homebrew)#
brew install pandoc ghostscript
Contributing Code#
General Coding Conventions#
As this library follows the conventions of scikit-learn, the code should conform to the PEP 8 Style Guide for Python Code. For linting, we recommend flake8. For formatting, the Python package black provides a simple solution. Concretely, you can install it and format the code via the following commands:
pip install black
black --line-length 79 example_file.py
Example for C3 (Code Contribution Cycle) and Pull Requests#
1. Fork the repository using the GitHub Fork button.
2. Clone your fork to your local machine:
git clone https://github.com/<your-username>/scikit-activeml.git
3. Create a new branch for your changes from the development branch:
git checkout -b <branch-name>
4. After you have finished implementing the feature, make sure that all tests pass. The tests can be run as follows:
pytest
Make sure that all lines are covered by tests:
pytest --cov=./skactiveml
5. Commit and push the changes:
git add <modified-files>
git commit -m "<commit-message>"
git push
6. Create a pull request.
Query Strategies#
All query strategies inherit from the abstract superclass
skactiveml.base.QueryStrategy, which is implemented in skactiveml/base.py.
This superclass inherits from sklearn.base.BaseEstimator. By default, the
__init__ method requires a random_state parameter, and the abstract method
query enforces the implementation of the sample selection logic.
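The pattern described above can be sketched with the standard library alone. The class names SketchQueryStrategy and SketchRandomSampling are hypothetical and do not mirror the actual superclass; the sketch only shows how an abstract query method forces subclasses to provide the selection logic.

```python
import random
from abc import ABC, abstractmethod


class SketchQueryStrategy(ABC):
    """Hypothetical stand-in for an abstract query strategy superclass."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    @abstractmethod
    def query(self, candidates):
        """Return the index of the candidate whose label is to be queried."""


class SketchRandomSampling(SketchQueryStrategy):
    """Concrete strategy: it must implement `query` to be instantiable."""

    def query(self, candidates):
        # A fresh generator seeded with random_state keeps calls reproducible.
        rng = random.Random(self.random_state)
        return rng.randrange(len(candidates))
```

Instantiating SketchQueryStrategy directly raises a TypeError, whereas SketchRandomSampling(random_state=0) can be created and queried.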
Single-annotator Pool-based Query Strategies#
General#
Single-annotator pool-based query strategies are stored in a file
skactiveml/pool/*.py and inherit from
skactiveml.base.SingleAnnotatorPoolQueryStrategy.
The class must implement the following methods:
Method | Description |
---|---|
__init__ | Method for initialization. |
query | Select the samples whose labels are to be queried. |
__init__ method#
For typical class parameters, we use standard names:
Parameter |
Description |
---|---|
|
|
|
|
|
String for classes that implement multiple methods. |
|
Cost matrix defining the cost of interchanging classes. |
query method#
Required Parameters:
Parameter | Description |
---|---|
X | Training data set, usually complete, i.e., including the labeled and unlabeled samples. |
y | Labels of the training data set (possibly including unlabeled ones indicated by MISSING_LABEL). |
candidates | If candidates is None, the unlabeled samples from (X, y) are considered as candidates. If candidates is of shape (n_candidates) and of type int, candidates is considered as the indices of the samples in (X, y). If candidates is of shape (n_candidates, n_features), the candidates are directly given in candidates (not necessarily contained in X). This is not supported by all query strategies. |
batch_size | Number of samples to be selected in one AL cycle. |
return_utilities | If True, additionally return the utilities of the query strategy. |
Returns:
Parameter |
Description |
---|---|
|
The |
|
The utilities of samples after
each selected sample of the
batch, e.g., |
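The three ways of passing candidates can be illustrated with a small helper. This is a sketch under assumptions: the function name resolve_candidates is hypothetical, and missing labels are assumed to be encoded as NaN.

```python
import numpy as np


def resolve_candidates(X, y, candidates=None):
    """Return (candidate_samples, indices_into_X_or_None)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if candidates is None:
        # Case 1: all unlabeled samples of (X, y) are candidates.
        idx = np.flatnonzero(np.isnan(y))
        return X[idx], idx
    candidates = np.asarray(candidates)
    if candidates.ndim == 1:
        # Case 2: candidates are indices of samples in (X, y).
        idx = candidates.astype(int)
        return X[idx], idx
    # Case 3: candidates are given directly as samples.
    return candidates.astype(float), None


X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, np.nan, np.nan])
```

In the third case no index mapping exists, which is one reason why not every query strategy can support directly given candidates.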
General advice#
Use the self._validate_data method (implemented in the superclass).
Check the inputs X and y only once. Fit the classifier or regressor if
it is not yet fitted (you may use fit_if_not_fitted from utils). Calculate
utilities via a separate function, which should be public. Use the
simple_batch function from utils to determine query_indices and to set
utilities in naive batch query strategies.
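To illustrate what a naive batch selection does, the following sketch picks the samples with the highest utilities one after another. It is a simplified stand-in written for this guide, not the library's simple_batch: the name top_k_batch is hypothetical, and ties are broken by position rather than randomly.

```python
import numpy as np


def top_k_batch(utilities, batch_size=1):
    """Greedily pick the `batch_size` samples with the highest utilities.

    Returns the selected indices and, per selection step, a copy of the
    utilities in which previously selected samples are set to NaN.
    """
    utilities = np.asarray(utilities, dtype=float).copy()
    indices, steps = [], []
    for _ in range(batch_size):
        steps.append(utilities.copy())
        best = int(np.nanargmax(utilities))  # ignores already selected samples
        indices.append(best)
        utilities[best] = np.nan             # exclude it from later steps
    return np.array(indices), np.vstack(steps)
```

Calling top_k_batch([0.1, 0.9, 0.5], batch_size=2) selects index 1 first and index 2 second; the second row of the returned utilities has NaN at position 1.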
Testing#
The test classes skactiveml.pool.test.TestQueryStrategy of single-annotator
pool-based query strategies need to inherit from the test template
skactiveml.tests.template_query_strategy.TemplateSingleAnnotatorPoolQueryStrategy.
As a result, many required functionalities are tested automatically.
As a requirement, one needs to specify the parameters qs_class and
init_default_params of the __init__ method accordingly. Depending on whether
the query strategy can handle regression, classification, or both settings, one
additionally needs to define the parameters
query_default_params_reg and/or query_default_params_clf.
Once the parameters are set, the developer needs to adjust the test until
all errors are resolved. In particular, the method test_query must
be implemented. We refer to the test template for more detailed information.
Single-annotator Stream-based Query Strategies#
General#
All query strategies are stored in a file skactiveml/stream/*.py.
Every query strategy inherits from SingleAnnotatorStreamQueryStrategy.
Every query strategy has either an internal budget handling or an
outsourced budget_manager.
For typical class parameters, we use standard names:
Parameter |
Description |
---|---|
|
Integer that acts as random seed
or |
|
The share of labels that the strategy is allowed to query |
|
Enforces the budget constraint |
The class must implement the following methods:
Function | Description |
---|---|
__init__ | Function for initialization |
query | Identify the instances whose labels are to be queried without adapting the internal state |
update | Adapt the budget monitoring according to the queried labels |
query method#
Required Parameters:
Parameter |
Description |
---|---|
|
Set of candidate instances,
inherited from
|
|
The classifier used by the strategy |
|
Set of labeled and unlabeled instances |
|
Labels of |
|
Weights for each instance in
|
|
uses |
|
Whether to return the candidates’ utilities,
inherited from |
Returns:
Parameter |
Description |
---|---|
|
Indices of the best instances
from |
|
Utilities of all candidate
instances, only if
|
General advice#
The query method must not change the internal state of the query
strategy (budget, budget_manager, and random_state included) so that
multiple instances can be assessed with the same state. Update the internal
state in the update() method. If the class implements a classifier (clf), the
optional attributes need to be implemented. Use the self._validate_data method
(implemented in the superclass). Check the inputs X and y only once. Fit the
classifier if fit_clf is set to True.
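The separation between a state-preserving query and a state-changing update can be sketched as follows. The class SketchStreamSampling and its attribute n_seen_ are hypothetical and only serve to illustrate the contract; the actual superclass provides more functionality.

```python
import random


class SketchStreamSampling:
    """Hypothetical stream strategy that queries a share of the samples."""

    def __init__(self, budget=0.5, random_state=0):
        self.budget = budget
        self.random_state = random_state
        self.n_seen_ = 0  # internal state, changed only by update()

    def query(self, candidates):
        # The generator is derived from the seed and the current state, so
        # repeated calls with an unchanged state return the same indices.
        rng = random.Random(self.random_state + self.n_seen_)
        return [i for i in range(len(candidates)) if rng.random() < self.budget]

    def update(self, candidates, queried_indices):
        # Only update() advances the internal state.
        self.n_seen_ += len(candidates)
        return self
```

Because query leaves n_seen_ untouched, several candidate sets can be assessed against the same state before a single update call commits the decision.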
update method#
Required Parameters:
Parameter |
Description |
---|---|
|
Set of candidate instances,
inherited from
|
|
Typically the return value of
|
|
Provides additional parameters to
the |
General advice#
Use self._validate_data in case the strategy is used without calling
the query method first (if parameters need to be initialized before the
update). If a budget_manager is used, forward the update call to the
budget_manager.update method.
Testing#
All stream query strategies are tested by a general unittest
(stream/tests/test_stream.py). For every class ExampleQueryStrategy
that inherits from SingleAnnotatorStreamQueryStrategy (stored in
_example.py), it is automatically tested whether there exists a file
test/test_example.py. It is necessary that both filenames are the same.
Moreover, the test class must be called TestExampleQueryStrategy and
inherit from unittest.TestCase. For every parameter in __init__(), it is
tested whether it is stored as an attribute with the same name. For every
parameter arg in __init__(), it is checked whether the test class
TestExampleQueryStrategy has a method called test_init_param_arg().
For every parameter arg in query(), it is checked whether the test class
TestExampleQueryStrategy has a method called test_query_param_arg().
It is also tested whether the internal state of the query strategy is
unchanged after multiple calls of query() without using update().
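The naming convention above can be checked mechanically. The following sketch uses only the standard library; ExampleQueryStrategy, its parameters, and the helper missing_init_tests are hypothetical and are not part of the general unittest mentioned above.

```python
import inspect
import unittest


class ExampleQueryStrategy:
    """Hypothetical strategy under test."""

    def __init__(self, budget=0.1, random_state=None):
        self.budget = budget
        self.random_state = random_state


class TestExampleQueryStrategy(unittest.TestCase):
    def test_init_param_budget(self):
        self.assertEqual(ExampleQueryStrategy(budget=0.2).budget, 0.2)

    def test_init_param_random_state(self):
        self.assertIsNone(ExampleQueryStrategy().random_state)


def missing_init_tests(strategy_cls, test_cls):
    """List __init__ parameters without a test_init_param_<name> method."""
    params = inspect.signature(strategy_cls.__init__).parameters
    return [
        name
        for name in params
        if name != "self" and not hasattr(test_cls, f"test_init_param_{name}")
    ]
```

An empty result from missing_init_tests(ExampleQueryStrategy, TestExampleQueryStrategy) means that every __init__ parameter has a matching test method.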
General advice for the budget_manager#
All budget managers are stored in skactiveml/stream/budget_manager/*.py.
The class must implement the following methods:
Parameter |
Description |
---|---|
|
Function for initialization |
|
Identify which instances to query based on the assessed utility |
|
Adapting the budget monitoring according to the queried labels |
update method#
The update method of the budget manager has the same functionality as the query strategy update.
Required Parameters:
Parameter |
Description |
---|---|
|
% of labels that the strategy is allowed to query |
|
Integer that acts as random seed
or |
query_by_utilities method#
Required Parameters:
Parameter |
Description |
---|---|
|
The |
General advice for working with a budget_manager#
If a budget_manager is used, the _validate_data of the query
strategy needs to be adapted accordingly:
If only a budget is given, use the default budget_manager with the given budget.
If only a budget_manager is given, use the budget_manager.
If neither is given, use the default budget_manager with the default budget.
If both are given and the budget differs from budget_manager.budget, throw an error.
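The four rules above can be written down as a small function. This is a sketch under assumptions: SketchBudgetManager, its default budget of 0.1, and resolve_budget_manager are hypothetical names, and returning the given manager when both values agree is our reading of the last rule.

```python
class SketchBudgetManager:
    """Hypothetical budget manager that only stores its budget."""

    def __init__(self, budget=0.1):
        self.budget = budget


def resolve_budget_manager(budget=None, budget_manager=None,
                           default_cls=SketchBudgetManager):
    """Return the budget manager to use for the given arguments."""
    if budget_manager is None:
        if budget is None:
            return default_cls()            # default manager, default budget
        return default_cls(budget=budget)   # default manager, given budget
    if budget is not None and budget != budget_manager.budget:
        raise ValueError("budget and budget_manager.budget differ")
    return budget_manager                   # given manager is used as is
```

For example, resolve_budget_manager(budget=0.3) creates a default manager with a budget of 0.3, while passing a manager with a budget of 0.2 together with budget=0.3 raises a ValueError.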
All budget managers are tested by a general unittest
(stream/budget_manager/tests/test_budget_manager.py). For every
class ExampleBudgetManager that inherits from BudgetManager
(stored in _example.py), it is automatically tested whether there exists
a file test/test_example.py. It is necessary that both filenames are
the same.
Testing#
Moreover, the test class must be called TestExampleBudgetManager and
inherit from unittest.TestCase. For every parameter in __init__(), it is
tested whether it is stored as an attribute with the same name. For every
parameter arg in __init__(), it is checked whether the test class
TestExampleBudgetManager has a method called test_init_param_arg().
For every parameter arg in query_by_utility(), it is checked whether the
test class TestExampleBudgetManager has a method called
test_query_by_utility_param_arg(). It is also tested whether the internal
state is unchanged after multiple calls of query() without using update().
Multi-Annotator Pool-based Query Strategies#
All query strategies are stored in a file skactiveml/pool/multi/*.py
and inherit from skactiveml.base.MultiAnnotatorPoolQueryStrategy.
The class must implement the following methods:
Method | Description |
---|---|
__init__ | Method for initialization. |
query | Select the annotator-sample pairs to decide which sample’s class label is to be queried from which annotator. |
query method#
Required Parameters:
Parameter |
Description |
---|---|
|
Training data set, usually complete, i.e. including the labeled and unlabeled samples. |
|
Labels of the training data set
for each annotator (possibly
including unlabeled ones
indicated by self.MISSING_LABEL),
meaning that |
|
If |
|
If |
|
The number of annotator-sample pairs to be selected in one AL cycle. |
|
If |
Returns:
Parameter |
Description |
---|---|
|
The |
|
The utilities of samples after
each selected sample of the
batch, e.g., |
General advice#
Use the self._validate_data method (implemented in the superclass).
Check the inputs X and y only once. Fit the classifier if it is not
yet fitted (you may use fit_if_not_fitted from utils).
If the strategy combines a single-annotator query strategy with a performance
estimate:
define an aggregation function,
evaluate the performance for each sample-annotator pair,
use the SingleAnnotatorWrapper.
If the strategy is a greedy method regarding the utilities:
calculate utilities (in an extra function),
use the skactiveml.utils.simple_batch function for returning values.
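For the greedy case, selecting annotator-sample pairs amounts to picking the largest entries of a utility matrix. The sketch below assumes a matrix of shape (n_samples, n_annotators); the function select_pairs is hypothetical and written in plain NumPy instead of the library helper.

```python
import numpy as np


def select_pairs(utilities, batch_size=1):
    """Greedily pick the sample-annotator pairs with the highest utilities.

    Returns an array of shape (batch_size, 2) whose rows hold the sample
    index and the annotator index of each selected pair.
    """
    utilities = np.asarray(utilities, dtype=float)
    # Sort the flattened matrix in descending order of utility.
    flat_order = np.argsort(-utilities, axis=None, kind="stable")[:batch_size]
    rows, cols = np.unravel_index(flat_order, utilities.shape)
    return np.column_stack([rows, cols])
```

With utilities [[0.2, 0.9], [0.7, 0.1]] and batch_size=2, the first selected pair is sample 0 with annotator 1, followed by sample 1 with annotator 0.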
Testing#
The test classes skactiveml.pool.multiannotator.test.TestQueryStrategy of
multi-annotator pool-based query strategies need to inherit from
unittest.TestCase. In this class, each parameter a of the __init__ method
needs to be tested via a method test_init_param_a. This also applies to a
parameter a of the query method, which is tested via a method
test_query_param_a. The main logic of the query strategy is tested via the
method test_query.
Classifiers#
Standard classifier implementations are part of the subpackage
skactiveml.classifier, and classifiers learning from multiple
annotators are implemented in its subpackage
skactiveml.classifier.multiannotator. Every class of a classifier inherits
from skactiveml.base.SkactivemlClassifier.
The class must implement the following methods:
Method | Description |
---|---|
__init__ | Method for initialization. |
fit | Method to fit the classifier for given training data. |
predict_proba | Method predicting class-membership probabilities for samples. |
predict | Method predicting class labels for samples. The superclass already provides an implementation using predict_proba. |
__init__ method#
Required Parameters:
Parameter |
Description |
---|---|
|
Holds the label for each class.
If |
|
Value to represent a missing label. |
|
Cost matrix with
|
|
Ensures reproducibility (cf. scikit-learn). |
fit method#
Required Parameters:
Parameter |
Description |
---|---|
|
Is a matrix of feature values representing the samples. |
|
Contains the class labels of the
training samples. Missing labels
are represented through the
attribute |
|
Contains the weights of the
training samples’ class labels.
It must have the same shape as
|
Returns:
Parameter |
Description |
---|---|
|
The fitted classifier object. |
General advice#
Use the self._validate_data method (implemented in the superclass) to
check the standard parameters of the __init__ and fit methods. If the
classes parameter was provided, the classifier can be fitted with
training samples that were all assigned a missing_label.
In this case, the classifier should make random predictions, i.e.,
output uniform class-membership probabilities when calling
predict_proba. Ensure that the classifier can also handle missing labels
in other cases.
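The expected behavior for a training set without any label can be sketched with a toy classifier. SketchFrequencyClassifier is a hypothetical class that predicts class frequencies; it assumes that missing labels are encoded as NaN and does not inherit from the library's superclass.

```python
import numpy as np


class SketchFrequencyClassifier:
    """Hypothetical classifier predicting the observed class frequencies."""

    def __init__(self, classes):
        self.classes = classes

    def fit(self, X, y):
        y = np.asarray(y, dtype=float)
        labeled = y[~np.isnan(y)]  # drop missing labels
        self.counts_ = np.array(
            [np.sum(labeled == c) for c in self.classes], dtype=float
        )
        return self

    def predict_proba(self, X):
        n_classes = len(self.classes)
        if self.counts_.sum() == 0:
            # No label was observed: fall back to uniform probabilities.
            proba = np.full(n_classes, 1.0 / n_classes)
        else:
            proba = self.counts_ / self.counts_.sum()
        return np.tile(proba, (len(X), 1))
```

Fitting with classes=[0, 1] on labels that are all NaN yields probabilities of 0.5 for both classes, which is only possible because the classes were provided up front.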
predict_proba method#
Required Parameters:
Parameter |
Description |
---|---|
|
Is a matrix of feature values representing the samples, for which the classifier will make predictions. |
Returns:
Parameter |
Description |
---|---|
|
The estimated class-membership probabilities per sample. |
General advice#
Check the parameter X regarding its shape, i.e., use the superclass method
self._check_n_features to ensure a correct number of features. Check
that the classifier has been fitted. If the classifier is a
skactiveml.base.ClassFrequencyEstimator, this method is already
implemented in the superclass.
predict method#
Required Parameters:
Parameter |
Description |
---|---|
|
Is a matrix of feature values representing the samples, for which the classifier will make predictions. |
Returns:
Parameter |
Description |
---|---|
|
The estimated class label of each sample. |
General advice#
Usually, this method is already implemented by the superclass through
calling the predict_proba method. If the superclass method is
overridden, ensure that it can handle imbalanced costs and missing
labels.
score method#
Required Parameters:
Parameter |
Description |
---|---|
|
Is a matrix of feature values representing the samples, for which the classifier will make predictions. |
|
Contains the true label of each sample. |
|
Defines the importance of each sample when computing the accuracy of the classifier. |
Returns:
Parameter |
Description |
---|---|
|
Mean accuracy of
|
General advice#
Usually, this method is already implemented by the superclass. If the superclass method is overridden, ensure that it checks the parameters and that the classifier has been fitted.
Testing#
All classifiers are tested by a general unittest
(skactiveml/classifier/tests/test_classifier.py). For every class
ExampleClassifier that inherits from skactiveml.base.SkactivemlClassifier
(stored in _example_classifier.py), it is automatically tested whether there
exists a file tests/test_example_classifier.py. It is necessary that both
filenames are the same. Moreover, the test class must be called
TestExampleClassifier and inherit from unittest.TestCase. For
each parameter of an implemented method, there must be a test method
called test_methodname_parametername in the Python file
tests/test_example_classifier.py. It checks whether invalid parameters
are handled correctly. For each implemented method, there must be a test
method called test_methodname in the Python file
tests/test_example_classifier.py. It checks whether the method works
as intended.
Regressors#
Standard regressor implementations are part of the subpackage
skactiveml.regressor. Every class of a regressor inherits
from skactiveml.base.SkactivemlRegressor.
The class must implement the following methods:
Method | Description |
---|---|
__init__ | Method for initialization. |
fit | Method to fit the regressor for given training data. |
predict | Method predicting the target values (labels) for samples. |
__init__ method#
Required Parameters:
Parameter |
Description |
---|---|
|
Ensures reproducibility (cf. scikit-learn). |
|
Value to represent a missing label. |
fit method#
Required Parameters:
Parameter |
Description |
---|---|
|
Is a matrix of feature values representing the samples. |
|
Contains the target values of the
training samples. Missing labels
are represented through the
attribute |
|
Contains the weights of the
training samples’ targets.
It must have the same shape as
|
Returns:
Parameter |
Description |
---|---|
|
The fitted regressor object. |
General advice#
Use the self._validate_data method (implemented in the superclass) to
check the standard parameters of the __init__ and fit methods. If the
regressor was fitted on training samples that were all assigned a
missing_label, the regressor should predict a default value of zero when
calling predict. Ensure that the regressor can also handle missing labels
in other cases.
predict method#
Required Parameters:
Parameter |
Description |
---|---|
|
Is a matrix of feature values representing the samples, for which the regressor will make predictions. |
Returns:
Parameter |
Description |
---|---|
|
The estimated targets per sample. |
General advice#
Check the parameter X regarding its shape, i.e., use the superclass method
self._check_n_features to ensure a correct number of features. Check
that the regressor has been fitted. If the regressor is a
skactiveml.base.ProbabilisticRegressor, this method is already
implemented in the superclass.
score method#
Required Parameters:
Parameter |
Description |
---|---|
|
Is a matrix of feature values representing the samples, for which the regressor will make predictions. |
|
Contains the true target of each sample. |
|
Defines the importance of each sample when computing the R2 score of the regressor. |
Returns:
Parameter |
Description |
---|---|
|
|
General advice#
Usually, this method is already implemented by the superclass. If the superclass method is overridden, ensure that it checks the parameters and that the regressor has been fitted.
Testing#
For every class ExampleRegressor that inherits from
skactiveml.base.SkactivemlRegressor (stored in _example_regressor.py),
there needs to be a file tests/test_example_regressor.py. It is necessary
that both filenames are the same. Moreover, the test class must be called
TestExampleRegressor and inherit from unittest.TestCase. For
each parameter of an implemented method, there must be a test method
called test_methodname_parametername in the Python file
tests/test_example_regressor.py. It checks whether invalid parameters
are handled correctly. For each implemented method, there must be a test
method called test_methodname in the Python file
tests/test_example_regressor.py. It checks whether the method works
as intended.
Annotator Models#
Annotator models are marked by implementing the interface
skactiveml.base.AnnotatorModelMixin. These models can estimate the
performances of annotators for given samples. The class of an annotator model
must implement the predict_annotator_perf method, which estimates the
performance of each annotator per sample as a proxy of the quality of the
provided annotations.
predict_annotator_perf method#
Required Parameters:
Parameter |
Description |
---|---|
|
Is a matrix of feature values representing the samples. |
Returns:
Parameter |
Description |
---|---|
|
The estimated performances per sample-annotator pair. |
General advice#
Check the parameter X regarding its shape and check that the annotator
model has been fitted. If no samples or class labels were provided
during the previous call of the fit method, the maximum value of
annotator performance should be returned for each sample-annotator
pair.
Examples#
Two of our main goals are to make active learning more understandable and
to improve our framework’s usability.
Therefore, we require the implementation of an example for each query strategy.
To do so, one needs to create a file named
scikit-activeml/docs/examples/query_strategy.json. Currently, we support
examples for single-annotator pool-based query strategies and single-annotator
stream-based query strategies.
The .json file supports the following entries:
Entry |
Description |
---|---|
|
Query strategy’s class name. |
|
Name of the sub-package, e.g., pool. |
|
Query strategy’s official name. |
|
The methodological category of this query strategy, i.e., Expected Error Reduction, Model Change, Query-by-Committee, Random Sampling, Uncertainty Sampling, or Others. |
|
Defines the general setup/setting of the example.
Supported templates are
|
|
Defines search categories. Supported tags are |
|
Title of the example, usually named after the query strategy. |
|
Placeholder for additional explanations. |
|
References (BibTeX key) to the paper(s) of the query strategy. |
|
Order in which content is displayed, usually [“title”, “text_0”, “plot”, “refs”]. |
|
Python code for imports, e.g., “from skactiveml.pool import RandomSampling”. |
|
Number of samples of the example data set. |
|
Python code to initialize the query strategy object, e.g., “RandomSampling()”. |
|
Python code of parameters passed to the query method of the query strategy, e.g., “X=X, y=y”. |
|
Python code for preprocessing before executing the AL cycle, e.g., “X = (X-X.min())/(X.max()-X.min())”. |
|
Number of AL cycles. |
|
Python code to initialize the classifier object, e.g.,
“ParzenWindowClassifier(classes=[0, 1])”. Only supported
for |
|
Python code to initialize the regressor object, e.g.,
“NICKernelRegressor()”. Only supported for
|
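As a rough illustration of such a file, the following sketch builds an entry in Python and serializes it to JSON. All key names are assumptions, since only the descriptions of the entries are given above; the values are taken from the examples in the descriptions where available, while n_samples and n_cycles are arbitrary illustrative numbers. Entries whose supported values are not listed above (template and tags) are left out.

```python
import json

# All key names below are assumptions made for illustration.
example = {
    "class": "RandomSampling",
    "package": "pool",
    "method": "Random Sampling",
    "category": "Random Sampling",
    "title": "Random Sampling",
    "text_0": "",
    "refs": [],
    "sequence": ["title", "text_0", "plot", "refs"],
    "import_misc": "from skactiveml.pool import RandomSampling",
    "n_samples": 100,  # illustrative value
    "init_qs": "RandomSampling()",
    "query_params": "X=X, y=y",
    "preproc": "X = (X-X.min())/(X.max()-X.min())",
    "n_cycles": 20,  # illustrative value
}


def unknown_sequence_items(entry):
    """Return sequence items that are neither 'plot' nor a key of the entry."""
    return [s for s in entry["sequence"] if s != "plot" and s not in entry]


json_text = json.dumps(example, indent=2)
```

The helper unknown_sequence_items is a hypothetical sanity check: it assumes that every item of the display order except the plot refers to another entry of the same file.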
Testing and code coverage#
Please ensure test coverage is close to 100%. The current code coverage can be viewed here.
Documentation#
Guidelines for writing documentation#
In scikit-activeml, the guidelines for writing the documentation are
adopted from scikit-learn.
Building the documentation#
To ensure the documentation of your work is well formatted, build the sphinx documentation by executing the following line.
sphinx-build -b html docs docs/_build
Issue Tracking#
We use GitHub Issues as our issue tracker. If you think you have found a
bug in scikit-activeml, you can report it to the issue tracker.
Documentation bugs can also be reported there.
Checking If A Bug Already Exists#
The first step before filing an issue report is to see whether the problem has already been reported. Checking if the problem is an existing issue will:
Help you see if the problem has already been resolved or has been fixed for the next release
Save time for you and the developers
Help you learn what needs to be done to fix it
Determine if additional information, such as how to replicate the issue, is needed
To see if the issue already exists, search the issue database (bug
label) using the search box on the top of the issue tracker page.
Reporting an issue#
Use the following labels to report an issue:
Label | Use case |
---|---|
bug | Something isn’t working |
enhancement | New feature |
documentation | Improvements or additions to the documentation |
question | General questions |