A Comprehensive and User-friendly Active Learning Library#
Machine learning models often require substantial amounts of training data to perform effectively. While unlabeled data can be gathered with relative ease, labeling it is typically difficult, time-consuming, or expensive. Active learning addresses this challenge by querying labels only for the most informative samples, achieving high performance with fewer labeled examples. With this goal in mind, scikit-activeml has been developed as a Python library for active learning on top of scikit-learn. Through its skorch integration, it also natively supports deep active learning. Code snippets illustrating pool-based and stream-based active learning are given below:
🏊 Pool-based Active Learning: Code Snippet
The following snippet implements an active learning cycle with 15
iterations using a PyTorch-based classifier (wrapped via
SkorchClassifier) and the BADGE query strategy on
sentence-transformer embeddings of the Reuters-21578 dataset obtained
from the pretrained SentenceTransformer model all-MiniLM-L6-v2.
Unlabeled data is represented by the value missing_label in the
label vector y_train. Note that the packages torch,
sentence_transformers, and datasets are not included in the
default skactiveml installation and must be installed separately. You
can do this via:
pip install -U torch torchvision
pip install -U scikit-activeml[opt] datasets sentence-transformers
Note that you might need to adjust this command for GPU support with
torch.
import numpy as np
import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from skorch.callbacks import LRScheduler
from skactiveml.classifier import SkorchClassifier
from skactiveml.pool import Badge
# Define the device depending on its availability.
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load data from Huggingface and encode it via `sentence_transformers`.
ds_train = load_dataset("yangwang825/reuters-21578", split="train")
ds_test = load_dataset("yangwang825/reuters-21578", split="test")
mdl = SentenceTransformer("all-MiniLM-L6-v2", device=device)
X_pool = mdl.encode(ds_train["text"], show_progress_bar=True)
y_pool = np.asarray(ds_train["label"], dtype=np.int64)
X_test = mdl.encode(ds_test["text"], show_progress_bar=True)
y_test = np.asarray(ds_test["label"], dtype=np.int64)
n_features, classes = X_pool.shape[1], np.unique(y_pool)
missing_label = -1
# Build your `torch` module for classification, which outputs:
# - classification logits,
# - learned sample embeddings.
class ClassificationModule(nn.Module):
    def __init__(self, n_features, n_classes, n_hidden_units):
        super().__init__()
        self.linear_1 = nn.Linear(n_features, n_hidden_units)
        self.linear_2 = nn.Linear(n_hidden_units, n_classes)
        self.activation = nn.ReLU()

    def forward(self, x):
        x_embed = self.linear_1(x)
        logits = self.linear_2(self.activation(x_embed))
        return logits, x_embed
# Wrap your torch module via a `skactiveml` wrapper, which requires the
# definition of training parameters.
clf = SkorchClassifier(
    module=ClassificationModule,
    criterion=nn.CrossEntropyLoss,
    # Each mapping entry is interpreted as:
    # name -> (idx in module.forward, post-hoc transform).
    forward_outputs={"proba": (0, nn.Softmax(dim=-1)), "emb": (1, None)},
    # Set `skorch`-specific parameters. Double underscores can be used to set
    # nested parameters, i.e., `module__n_features` sets `n_features` of
    # `module` (see the `skorch` documentation for more details).
    neural_net_param_dict={
        # Module-related parameters.
        "module__n_features": n_features,
        "module__n_hidden_units": 128,
        "module__n_classes": len(classes),
        # Optimizer-related parameters.
        "max_epochs": 100,
        "batch_size": 16,
        "lr": 0.01,
        "optimizer": torch.optim.RAdam,
        "callbacks": [
            ("lr_scheduler", LRScheduler(policy=CosineAnnealingLR, T_max=100))
        ],
        # General parameters.
        "verbose": 0,
        "device": device,
        "train_split": False,
        "iterator_train__shuffle": True,
    },
    classes=classes,
    missing_label=missing_label,
).initialize()
# Start the active learning cycle with zero initial labels.
y_train = np.full_like(y_pool, missing_label)
# Create query strategy such that:
# P_class, X_embed = clf.predict_proba(X, **clf_embedding_flag_name).
qs = Badge(
missing_label=missing_label,
clf_embedding_flag_name={"extra_outputs": "emb"},
)
# Define the active learning parameters.
n_cycles = 15
batch_size = 4
# Execute active learning cycles.
for c in range(n_cycles):
    # Since we train the classifier outside of `qs.query` in this example,
    # we use `fit_clf=False` to avoid unnecessary retraining during
    # querying.
    query_idx = qs.query(
        X=X_pool,
        y=y_train,
        batch_size=batch_size,
        clf=clf,
        fit_clf=False,
    )
    y_train[query_idx] = y_pool[query_idx]
    clf.fit(X_pool, y_train)
print(f"Final accuracy: {clf.score(X_test, y_test)}")
🌊 Stream-based Active Learning: Code Snippet
The following snippet implements a stream-based active learning cycle over
300 time steps on CIFAR-10 embeddings computed with the pretrained DINOv2
vision transformer. A PyTorch-based classifier
(wrapped via SkorchClassifier) is retrained whenever a new label is
acquired, and the Split query strategy is used with a labeling budget of
10% of the stream. Unlabeled data is represented by the value missing_label
in the label vector y_train. Note that the packages torch,
transformers, and datasets are not included in the
default skactiveml installation and must be installed separately:
pip install -U torch torchvision
pip install -U scikit-activeml[opt] datasets transformers
Note that you might need to adjust this command for GPU support with
torch.
import numpy as np
import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR
from datasets import load_dataset
from skorch.callbacks import LRScheduler
from transformers import AutoImageProcessor, Dinov2Model
from skactiveml.classifier import SkorchClassifier
from skactiveml.stream import Split
# Define the device depending on its availability.
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load data.
ds = load_dataset("cifar10")
processor = AutoImageProcessor.from_pretrained(
"facebook/dinov2-small", use_fast=True
)
model = Dinov2Model.from_pretrained("facebook/dinov2-small").to(device).eval()
def embed(batch):
    inputs = processor(images=batch["img"], return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(**inputs).last_hidden_state[:, 0]
    batch["emb"] = out.cpu().numpy()
    return batch
ds = ds.map(embed, batched=True, batch_size=128)
X_stream = np.stack(ds["train"]["emb"], dtype=np.float32)[:300]
y_stream = np.array(ds["train"]["label"], dtype=np.int64)[:300]
X_test = np.stack(ds["test"]["emb"], dtype=np.float32)
y_test = np.array(ds["test"]["label"], dtype=np.int64)
n_features, classes = X_stream.shape[1], np.unique(y_stream)
missing_label = -1
# Build `torch` module for classification, outputting classification logits.
class ClassificationModule(nn.Module):
    def __init__(self, n_features, n_classes, n_hidden_units):
        super().__init__()
        self.linear_1 = nn.Linear(n_features, n_hidden_units)
        self.linear_2 = nn.Linear(n_hidden_units, n_classes)
        self.activation = nn.ReLU()

    def forward(self, x):
        x_embed = self.linear_1(x)
        logits = self.linear_2(self.activation(x_embed))
        return logits
# Wrap your torch module via a `skactiveml` wrapper, which requires the
# definition of training parameters.
clf = SkorchClassifier(
    module=ClassificationModule,
    criterion=nn.CrossEntropyLoss,
    # Set `skorch`-specific parameters. Double underscores can be used to set
    # nested parameters, i.e., `module__n_features` sets `n_features` of
    # `module` (see the `skorch` documentation for more details).
    neural_net_param_dict={
        # Module-related parameters.
        "module__n_features": n_features,
        "module__n_hidden_units": 128,
        "module__n_classes": len(classes),
        # Optimizer-related parameters.
        "max_epochs": 100,
        "batch_size": 16,
        "lr": 0.01,
        "optimizer": torch.optim.RAdam,
        "callbacks": [
            ("lr_scheduler", LRScheduler(policy=CosineAnnealingLR, T_max=100))
        ],
        # General parameters.
        "verbose": 0,
        "device": device,
        "train_split": False,
        "iterator_train__shuffle": True,
    },
    classes=classes,
    missing_label=missing_label,
).initialize()
# Start the active learning cycle with zero initial labels.
y_train = np.full_like(y_stream, missing_label)
# Execute active learning cycle.
qs = Split(random_state=0, budget=0.1)
n_cycles = len(X_stream)
query_idx = []
for t in range(n_cycles):
    # Since we train the classifier outside of `qs.query` in this example,
    # we use `fit_clf=False` to avoid unnecessary retraining during
    # querying.
    query_idx = qs.query(
        candidates=X_stream[[t]], y=y_stream[t], clf=clf, fit_clf=False
    )
    qs.update(candidates=X_stream[[t]], queried_indices=query_idx)
    if len(query_idx) > 0:
        y_train[t] = y_stream[t]
        clf.fit(X_stream, y_train)
print(f"Final accuracy: {clf.score(X_test, y_test)}")
💾 User Installation#
In most cases, we recommend installing scikit-activeml together with the optional dependencies for better support of deep active learning:
pip install -U scikit-activeml[opt]
The opt extra installs additional packages such as skorch to enable
more sophisticated deep learning support.
Version constraints are chosen to be reasonably flexible so that scikit-activeml
can integrate well into an existing environment. The optional deep learning functionality
(via skorch) assumes that torch (PyTorch) is already installed in
your environment. Since the correct PyTorch build depends on your hardware
and CUDA setup, we do not install PyTorch automatically.
Please install PyTorch separately by following the official PyTorch installation instructions.
Minimal Installation
The minimal way to install scikit-activeml is:
pip install -U scikit-activeml
This installs only the minimum requirements to avoid potential package downgrades within your existing environment.
Tested Fallback Installation
If you prefer a configuration where dependency versions have been tested explicitly for this release, you can install scikit-activeml with the maximum tested core and optional requirements:
pip install -U scikit-activeml[max,opt_max]
This setup uses the versions listed in requirements_max.txt and
requirements_opt_max.txt and corresponds to the configuration used in
our continuous integration tests. You can also install only the maximum
tested core dependencies via:
pip install -U scikit-activeml[max]
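After installation, a short sanity check such as the following confirms which parts of the stack are available (the printed versions depend on your environment):
import skactiveml
print("scikit-activeml:", skactiveml.__version__)
# `torch` and `skorch` are only present if installed separately or via [opt].
try:
    import torch
    import skorch
    print("torch:", torch.__version__, "| skorch:", skorch.__version__)
except ImportError as e:
    print("Deep active learning support unavailable:", e)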
🗂️ Query Strategy Overview#
For better orientation, we provide an overview
(including paper references and visual examples)
of the over 60 query strategies implemented by skactiveml. The
following mind map illustrates different attributes of a query strategy.
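As an entry point to these strategies, the following minimal pool-based example combines the SklearnClassifier wrapper with UncertaintySampling; the toy data and hyperparameters are illustrative only:
import numpy as np
from sklearn.linear_model import LogisticRegression
from skactiveml.classifier import SklearnClassifier
from skactiveml.pool import UncertaintySampling
from skactiveml.utils import MISSING_LABEL
# Create a toy binary classification problem without any labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y_true = (X[:, 0] > 0).astype(int)
y = np.full(len(X), MISSING_LABEL)
# Wrap a scikit-learn model and repeatedly query the most uncertain sample.
clf = SklearnClassifier(LogisticRegression(), classes=np.unique(y_true))
qs = UncertaintySampling(method="entropy", random_state=0)
for _ in range(10):
    query_idx = qs.query(X=X, y=y, clf=clf, batch_size=1)
    y[query_idx] = y_true[query_idx]
clf.fit(X, y)
print(f"Toy accuracy: {clf.score(X, y_true)}")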
📚 In-depth Tutorials#
The table below summarizes a subset of our many in-depth tutorials. Each entry lists the active learning scenario, prediction task, data modality, and models used in the tutorial.
| Tutorial | Scenario | Task | Data | Model |
|---|---|---|---|---|
|  | Pool | Classification | Image |  |
|  | Pool | Regression | Tabular |  |
|  | Stream | Classification | Text |  |
📝 Citing#
If you use skactiveml in your research or projects, please cite the
following work and consider starring the repository to help others discover it:
@article{skactiveml2025,
  title={{scikit-activeml: A Comprehensive and User-friendly Active Learning Library}},
  author={Herde, Marek and Pham, Minh Tuan and Kottke, Daniel and Benz, Alexander and L{\"u}hrs, Lukas and Mergard, Pascal and Sandrock, Christoph and Cheng, Jiaying and Roghman, Atal and M{\"u}jde, Mehmet and Rauch, Lukas and Sick, Bernhard},
  journal={Preprints},
  doi={10.20944/preprints202507.0252.v1},
  year={2025},
  url={https://github.com/scikit-activeml/scikit-activeml}
}