
scikit-activeml: A Library and Toolbox for Active Learning Algorithms#
Machine learning models often need large amounts of training data to perform well. While unlabeled data can be gathered with relative ease, labeling is typically difficult, time-consuming, or expensive. Active learning addresses this challenge by querying labels for the most informative samples, enabling high performance with fewer labeled examples. With this goal in mind, scikit-activeml has been developed as a Python library for active learning on top of scikit-learn.
User Installation#
The easiest way to install scikit-activeml is using pip:
pip install -U scikit-activeml
This installation via pip includes only the minimum requirements to avoid potential package downgrades within your installation. If you encounter any incompatibility issues, you can install the maximum requirements, which have been tested for the current package release:
pip install -U scikit-activeml[max]
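After installation, a quick import check confirms that the package is available. This is a minimal sketch; it assumes the package exposes the conventional __version__ attribute.
import skactiveml
print(skactiveml.__version__)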
Examples#
We provide a broad overview of different use cases in our tutorial section, including:
Deep Pool-based Active Learning - scikit-activeml with Skorch
Multi-annotator Pool-based Active Learning - Getting Started
Batch Stream-based Active Learning with Pool Query Strategies
Below are two code snippets illustrating how straightforward it is to implement active learning cycles using our Python package skactiveml.
Pool-based Active Learning#
The following snippet implements an active learning cycle with 20 iterations using a Gaussian process classifier and uncertainty sampling. You can substitute other classifiers from sklearn or those provided by skactiveml. Note that when using active learning with sklearn, unlabeled data is represented by the value MISSING_LABEL in the label vector y. Additional query strategies are available in our documentation.
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.datasets import make_blobs
from skactiveml.pool import UncertaintySampling
from skactiveml.utils import MISSING_LABEL
from skactiveml.classifier import SklearnClassifier

# Generate data set.
X, y_true = make_blobs(n_samples=200, centers=4, random_state=0)
y = np.full(shape=y_true.shape, fill_value=MISSING_LABEL)

# Use the first 10 samples as initial training data.
y[:10] = y_true[:10]

# Create classifier and query strategy.
clf = SklearnClassifier(
    GaussianProcessClassifier(random_state=0),
    classes=np.unique(y_true),
    random_state=0,
)
qs = UncertaintySampling(method='entropy')

# Execute active learning cycle.
n_cycles = 20
for c in range(n_cycles):
    query_idx = qs.query(X=X, y=y, clf=clf)
    y[query_idx] = y_true[query_idx]

# Fit final classifier.
clf.fit(X, y)
As a result, an actively trained Gaussian process classifier is obtained. A visualization of its decision boundary (black line) along with sample utilities (greenish contours) is shown below.
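To quantify how well the actively trained classifier performs, you can evaluate it on the full data set. The following sketch reuses the variables from the snippet above; the evaluation metric and the use of the is_labeled utility are illustrative additions, not part of the original example.
from sklearn.metrics import accuracy_score
from skactiveml.utils import is_labeled
# Number of labeled samples after the active learning cycle.
print(is_labeled(y).sum())
# Accuracy of the actively trained classifier on the full data set.
print(accuracy_score(y_true, clf.predict(X)))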

Stream-based Active Learning#
The following snippet implements an active learning cycle with 200 data points and a default budget of 10% using a Parzen window classifier and split uncertainty sampling. Similar to the pool-based example, you can wrap classifiers from sklearn, use sklearn-compatible classifiers, or choose from the example classifiers provided by skactiveml.
import numpy as np
from sklearn.datasets import make_blobs
from skactiveml.classifier import ParzenWindowClassifier
from skactiveml.stream import Split
from skactiveml.utils import MISSING_LABEL

# Generate data set.
X, y_true = make_blobs(n_samples=200, centers=4, random_state=0)

# Create classifier and query strategy.
clf = ParzenWindowClassifier(random_state=0, classes=np.unique(y_true))
qs = Split(random_state=0)

# Initialize training data as empty lists.
X_train = []
y_train = []

# Initialize a list to store prediction results.
correct_classifications = []

# Execute active learning cycle.
for x_t, y_t in zip(X, y_true):
    # Represent the current stream instance as a candidate set of size one.
    X_cand = x_t.reshape([1, -1])
    y_cand = y_t
    # Train the classifier on all data seen so far.
    clf.fit(X_train, y_train)
    # Record whether the classifier predicts the correct label.
    correct_classifications.append(clf.predict(X_cand)[0] == y_cand)
    # Decide whether to query the candidate's label.
    sampled_indices = qs.query(candidates=X_cand, clf=clf)
    # Update the query strategy with the query decision to track the budget.
    qs.update(candidates=X_cand, queried_indices=sampled_indices)
    # Add the current sample to the training data.
    X_train.append(x_t)
    y_train.append(y_cand if len(sampled_indices) > 0 else MISSING_LABEL)
As a result, an actively trained Parzen window classifier is obtained. A visualization of its accuracy curve across the active learning cycle is shown below.
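The accuracy curve can be reproduced directly from the variables of the snippet above. The following sketch computes the running accuracy and the number of queried labels; it is an illustrative addition rather than part of the original example.
import numpy as np
from skactiveml.utils import is_labeled
# Running accuracy over the stream (the basis of the curve shown above).
correct = np.array(correct_classifications, dtype=float)
running_accuracy = np.cumsum(correct) / np.arange(1, len(correct) + 1)
print(running_accuracy[-1])
# Number of queried labels; with a 10% budget, roughly 20 of the 200 samples.
print(is_labeled(y_train).sum())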

Query Strategy Overview#
For better orientation, we provide an overview (including paper references and visual examples) of the query strategies implemented by skactiveml.
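As a rough illustration of how strategies can be exchanged, the pool-based snippet above keeps working when UncertaintySampling is replaced by another pool strategy with the same query interface. The sketch below uses RandomSampling as a simple baseline; note that it does not need a classifier, so the clf argument is dropped from the query call.
from skactiveml.pool import RandomSampling
# Swap the query strategy; the rest of the active learning cycle stays the same.
qs = RandomSampling(random_state=0)
query_idx = qs.query(X=X, y=y)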
Citing#
If you use skactiveml in your research projects and find it helpful, please cite the following:
@article{skactiveml2021,
    title={scikit-activeml: {A} {L}ibrary and {T}oolbox for {A}ctive {L}earning {A}lgorithms},
    author={Daniel Kottke and Marek Herde and Tuan Pham Minh and Alexander Benz and Pascal Mergard and Atal Roghman and Christoph Sandrock and Bernhard Sick},
    journal={Preprints},
    doi={10.20944/preprints202103.0194.v1},
    year={2021},
    url={https://github.com/scikit-activeml/scikit-activeml}
}