https://raw.githubusercontent.com/scikit-activeml/scikit-activeml/master/docs/logos/scikit-activeml-logo.png

scikit-activeml: A Library and Toolbox for Active Learning Algorithms#


Machine learning models often need large amounts of training data to perform well. While unlabeled data can be gathered with relative ease, labeling is typically difficult, time-consuming, or expensive. Active learning addresses this challenge by querying labels for the most informative samples, enabling high performance with fewer labeled examples. With this goal in mind, scikit-activeml has been developed as a Python library for active learning on top of scikit-learn.

User Installation#

The easiest way to install scikit-activeml is using pip:

pip install -U scikit-activeml

This installation via pip includes only the minimum requirements to avoid potential package downgrades within your installation. If you encounter any incompatibility issues, you can install the maximum requirements, which have been tested for the current package release:

pip install -U "scikit-activeml[max]"
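To confirm which release ended up in your environment, a minimal sketch using only the Python standard library is given here. The helper name `installed_version` is hypothetical, and the sketch assumes the distribution is registered under the PyPI name `scikit-activeml` used in the commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string, or None if the package is missing.
print(installed_version("scikit-activeml"))
```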

Examples#

We provide a broad overview of different use cases in our tutorial section.

The two code snippets below show how to implement active learning cycles with our Python package skactiveml.

Pool-based Active Learning#

The following snippet implements an active learning cycle with 20 iterations using a Gaussian process classifier and uncertainty sampling. You can substitute other classifiers from sklearn or those provided by skactiveml. Note that skactiveml represents unlabeled samples by the value MISSING_LABEL in the label vector y, also when the classifier comes from sklearn. Additional query strategies are available in our documentation.

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.datasets import make_blobs
from skactiveml.pool import UncertaintySampling
from skactiveml.utils import MISSING_LABEL
from skactiveml.classifier import SklearnClassifier

# Generate data set.
X, y_true = make_blobs(n_samples=200, centers=4, random_state=0)
y = np.full(shape=y_true.shape, fill_value=MISSING_LABEL)

# Use the first 10 samples as initial training data.
y[:10] = y_true[:10]

# Create classifier and query strategy.
clf = SklearnClassifier(
    GaussianProcessClassifier(random_state=0),
    classes=np.unique(y_true),
    random_state=0
)
qs = UncertaintySampling(method='entropy')

# Execute active learning cycle: query a sample, then reveal its label.
n_cycles = 20
for _ in range(n_cycles):
    query_idx = qs.query(X=X, y=y, clf=clf)
    y[query_idx] = y_true[query_idx]

# Fit final classifier.
clf.fit(X, y)

As a result, an actively trained Gaussian process classifier is obtained. A visualization of its decision boundary (black line) along with sample utilities (greenish contours) is shown below.

https://raw.githubusercontent.com/scikit-activeml/scikit-activeml/master/docs/logos/pal-example-output.png
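To make the two ingredients of this example more concrete, the following NumPy-only sketch mimics one query step by hand. It is not the library's implementation: the helper names `entropy_utilities` and `query_most_uncertain` are hypothetical, the class probabilities are made up, and it assumes that MISSING_LABEL is NaN so that unlabeled entries can be found with `np.isnan`.

```python
import numpy as np


def entropy_utilities(probas):
    """Return the Shannon entropy of each row of class probabilities."""
    probas = np.asarray(probas, dtype=float)
    # Clip inside the logarithm so that zero probabilities contribute 0
    # instead of producing NaN.
    log_probas = np.log(np.clip(probas, 1e-12, 1.0))
    return -np.sum(probas * log_probas, axis=1)


def query_most_uncertain(probas, y, batch_size=1):
    """Return indices of the unlabeled samples with the highest entropy."""
    utilities = entropy_utilities(probas)
    # Assumption: unlabeled entries of y are NaN; labeled samples are
    # excluded by giving them a utility of -inf.
    utilities = np.where(np.isnan(y), utilities, -np.inf)
    return np.argsort(-utilities)[:batch_size]


# Made-up predicted probabilities for four samples and two classes.
probas = np.array([[0.9, 0.1], [0.5, 0.5], [0.6, 0.4], [0.5, 0.5]])
# Sample 1 is already labeled; the others are unlabeled (NaN).
y = np.array([np.nan, 0.0, np.nan, np.nan])

query_idx = query_most_uncertain(probas, y)
print(query_idx)
```

Sample 1 has maximal entropy but is skipped because it already carries a label, so the query falls on the remaining unlabeled sample with equal class probabilities.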

Stream-based Active Learning#

The following snippet implements an active learning cycle with 200 data points and a default budget of 10% using a Parzen window classifier and split uncertainty sampling. Similar to the pool-based example, you can wrap classifiers from sklearn, use sklearn-compatible classifiers, or choose from the example classifiers provided by skactiveml.

import numpy as np
from sklearn.datasets import make_blobs
from skactiveml.classifier import ParzenWindowClassifier
from skactiveml.stream import Split
from skactiveml.utils import MISSING_LABEL

# Generate data set.
X, y_true = make_blobs(n_samples=200, centers=4, random_state=0)

# Create classifier and query strategy.
clf = ParzenWindowClassifier(random_state=0, classes=np.unique(y_true))
qs = Split(random_state=0)

# Initialize training data as empty lists.
X_train = []
y_train = []

# Initialize a list to store prediction results.
correct_classifications = []

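# Execute active learning cycle.
for x_t, y_t in zip(X, y_true):
    X_cand = x_t.reshape([1, -1])
    y_cand = y_t
    # Fit on all samples seen so far (unqueried ones carry MISSING_LABEL).
    clf.fit(X_train, y_train)
    # Test-then-train: evaluate the prediction before using the sample.
    correct_classifications.append(clf.predict(X_cand)[0] == y_cand)
    # Let the strategy decide whether to acquire the label, then update it.
    sampled_indices = qs.query(candidates=X_cand, clf=clf)
    qs.update(candidates=X_cand, queried_indices=sampled_indices)
    # Store the sample; keep its label only if it was queried.
    X_train.append(x_t)
    y_train.append(y_cand if len(sampled_indices) > 0 else MISSING_LABEL)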

As a result, an actively trained Parzen window classifier is obtained. A visualization of its accuracy curve across the active learning cycle is shown below.

https://raw.githubusercontent.com/scikit-activeml/scikit-activeml/master/docs/logos/stream-example-output.png
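An accuracy curve of this kind can be derived from the `correct_classifications` list collected in the loop above. The following is a minimal sketch under stated assumptions, not the plotting code behind the figure: the helper names `running_accuracy` and `labeled_fraction` are hypothetical, the inputs are made-up toy values, and MISSING_LABEL is assumed to be NaN.

```python
import numpy as np


def running_accuracy(correct_classifications):
    """Return the cumulative accuracy after each processed sample."""
    correct = np.asarray(correct_classifications, dtype=float)
    n_seen = np.arange(1, len(correct) + 1)
    return np.cumsum(correct) / n_seen


def labeled_fraction(y_train):
    """Return the share of samples whose label was acquired."""
    y_train = np.asarray(y_train, dtype=float)
    # Assumption: samples that were not queried are stored as NaN.
    return float(np.mean(~np.isnan(y_train)))


# Toy values standing in for the lists built during the stream loop.
correct_classifications = [True, False, True, True]
y_train = [2.0, np.nan, np.nan, 0.0]

accuracy_curve = running_accuracy(correct_classifications)
budget_used = labeled_fraction(y_train)
print(accuracy_curve, budget_used)
```

Comparing `labeled_fraction(y_train)` against the configured budget is a quick check of how many labels the strategy actually requested.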

Query Strategy Overview#

For better orientation, we provide an overview (including paper references and visual examples) of the query strategies implemented by skactiveml.


Citing#

If you use skactiveml in your research projects and find it helpful, please cite the following:

@article{skactiveml2021,
    title={scikit-activeml: {A} {L}ibrary and {T}oolbox for {A}ctive {L}earning {A}lgorithms},
    author={Daniel Kottke and Marek Herde and Tuan Pham Minh and Alexander Benz and Pascal Mergard and Atal Roghman and Christoph Sandrock and Bernhard Sick},
    journal={Preprints},
    doi={10.20944/preprints202103.0194.v1},
    year={2021},
    url={https://github.com/scikit-activeml/scikit-activeml}
}
