{"cells": [{"metadata": {}, "cell_type": "markdown", "source": "# Advanced Active Learning with Multiple Annotators", "id": "ecb4c9ef9b411e30"}, {"metadata": {}, "cell_type": "markdown", "source": ["> **_Google Colab Note:_** If the notebook fails to run after installing the needed packages, try to restart the runtime (Ctrl + M) under Runtime -> Restart session.\n", "\n", "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/scikit-activeml/scikit-activeml.github.io/blob/gh-pages/latest/generated/tutorials_colab//11_multiple_annotators_advanced.ipynb)"], "id": "40e5bcb5ce6baa06"}, {"metadata": {}, "cell_type": "markdown", "source": ["**Notebook Dependencies**\n", "\n", "Uncomment the following cell to install all dependencies for this tutorial."], "id": "f71148c855a74294"}, {"metadata": {}, "cell_type": "code", "outputs": [], "execution_count": null, "source": "# !pip install scikit-activeml[opt] torch torchvision tqdm", "id": "66693b55107b4127"}, {"metadata": {}, "cell_type": "markdown", "source": "
", "id": "32f32816679bb797"}, {"metadata": {}, "cell_type": "markdown", "source": "This notebook demonstrates how to use `scikit-activeml` to perform pool-based active learning when labels are provided by multiple, potentially unreliable annotators. We work on the handwritten digits dataset and simulate a small crowd of annotators with different noise patterns. We then compare a simple majority-vote baseline with a dedicated multi-annotator model that learns annotator performance (i.e., annotation accuracy) while querying new labels.", "id": "5526b4044f2fb5b6"}, {"metadata": {"ExecuteTime": {"end_time": "2025-12-02T18:55:55.936404Z", "start_time": "2025-12-02T18:55:54.000590Z"}}, "cell_type": "code", "source": ["import warnings\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import torch\n", "import torch.nn.functional as F\n", "\n", "from sklearn.datasets import load_digits\n", "from sklearn.model_selection import train_test_split\n", "\n", "from skactiveml.classifier import SkorchClassifier\n", "from skactiveml.classifier.multiannotator import (\n", " AnnotMixClassifier,\n", " CrowdLayerClassifier,\n", ")\n", "from skactiveml.pool import RandomSampling\n", "from skactiveml.pool.multiannotator import SingleAnnotatorWrapper\n", "from skactiveml.utils import MISSING_LABEL, is_labeled, majority_vote\n", "\n", "from skorch.callbacks import LRScheduler\n", "from torch import nn\n", "from torch.optim import RAdam\n", "from torch.optim.lr_scheduler import CosineAnnealingLR\n", "from tqdm.auto import tqdm\n", "\n", "\n", "# Global configuration\n", "DEVICE = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", "torch.manual_seed(0)\n", "warnings.filterwarnings(\"ignore\")"], "id": "13c5edccd0d21e5d", "outputs": [], "execution_count": 2}, {"metadata": {}, "cell_type": "markdown", "source": ["## Data Preparation and Simulation of Multiple Annotators\n", "\n", "We start by loading the digits dataset and splitting it into training and test sets. 
Each image is scaled to the range \\([0, 1]\\) and reshaped to match the input format expected by a small convolutional neural network. On top of the ground-truth training labels `y_train`, we simulate five annotators with different reliability profiles:\n", "\n", "- some are almost perfect,\n", "- some are noisy,\n", "- one struggles with a specific digit,\n", "- and another one is biased towards the majority class.\n", "\n", "The resulting matrix `z_train` has one column per annotator and serves as crowd labels for the multi-annotator models.\n"], "id": "74b896636ca24294"},
{"metadata": {"ExecuteTime": {"end_time": "2025-12-02T18:55:55.998294Z", "start_time": "2025-12-02T18:55:55.981331Z"}}, "cell_type": "code", "source": ["# Load and preprocess the data\n", "digits = load_digits()\n", "X = digits.images[:, np.newaxis, :, :] / 16.0  # (n_samples, 1, 8, 8)\n", "y = digits.target\n", "classes = np.unique(y)\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    X,\n", "    y,\n", "    test_size=0.5,\n", "    stratify=y,\n", "    random_state=42,\n", ")\n", "\n", "rng = np.random.default_rng(0)\n", "n_samples = len(y_train)\n", "classes = np.asarray(classes)\n", "y_train = np.asarray(y_train)\n", "\n", "\n", "def sample_noisy_labels(y_true, accuracy, rng):\n", "    \"\"\"Generate annotator labels with a given accuracy.\n", "\n", "    With probability `accuracy` the annotator returns\n", "    the true label. Otherwise, it samples a wrong label\n", "    uniformly from the remaining classes.\n", "    \"\"\"\n", "    y_true = np.asarray(y_true)\n", "    n = y_true.shape[0]\n", "    y_annot = y_true.copy()\n", "\n", "    correct = rng.random(n) < accuracy\n", "    wrong_idx = np.flatnonzero(~correct)\n", "\n", "    if wrong_idx.size:\n", "        wrong_labels = []\n", "        for yt in y_true[wrong_idx]:\n", "            candidates = classes[classes != yt]\n", "            wrong_labels.append(rng.choice(candidates))\n", "        y_annot[wrong_idx] = wrong_labels\n", "\n", "    return y_annot\n", "\n", "\n", "# Annotator 1: almost perfect\n", "ann1 = sample_noisy_labels(y_train, accuracy=0.95, rng=rng)\n", "\n", "# Annotator 2: strong, but clearly imperfect\n", "ann2 = sample_noisy_labels(y_train, accuracy=0.85, rng=rng)\n", "\n", "# Annotator 3: noticeably noisy\n", "ann3 = sample_noisy_labels(y_train, accuracy=0.70, rng=rng)\n", "\n", "# Annotator 4: class-dependent difficulties\n", "# (struggles with one hard digit)\n", "hard_class = 9  # treat digit '9' as hard, for example\n", "acc_easy = 0.85\n", "acc_hard = 0.40\n", "\n", "per_sample_acc = np.where(y_train == hard_class, acc_hard, acc_easy)\n", "rand = rng.random(n_samples)\n", "correct = rand < per_sample_acc\n", "\n", "ann4 = y_train.copy()\n", "wrong_idx = np.flatnonzero(~correct)\n", "if wrong_idx.size:\n", "    wrong_labels = []\n", "    for yt in y_train[wrong_idx]:\n", "        candidates = classes[classes != yt]\n", "        wrong_labels.append(rng.choice(candidates))\n", "    ann4[wrong_idx] = wrong_labels\n", "\n", "# Annotator 5: biased towards the majority class\n", "values, counts = np.unique(y_train, return_counts=True)\n", "majority_class = values[counts.argmax()]\n", "\n", "ann5 = y_train.copy()\n", "bias_strength = 0.50  # probability of forcing the majority label\n", "bias_mask = rng.random(n_samples) < bias_strength\n", "ann5[bias_mask] = majority_class\n", "\n", "# Final annotation matrix: shape (n_samples, n_annotators)\n", "z_train = np.column_stack([ann1, ann2, ann3, ann4, ann5])\n", "n_annotators = z_train.shape[1]"], "id": "f3745dbd8c761421", "outputs": [], "execution_count": 3},
{"metadata": {}, "cell_type": "markdown", "source": ["## Neural Network Backbone and Classifiers\n", "\n", "The next step is to define a small convolutional neural network (CNN) that serves as a backbone for all classifiers. The network processes the 8 \u00d7 8 images, applies a single convolutional layer with max pooling, and maps the resulting 128-dimensional feature vector to class logits. The `forward` method returns both the logits and the intermediate embedding, which could be used by an active learning query strategy, such as `BADGE`.\n", "\n", "On top of this backbone, we define two classifiers.\n", "\n", "- The `MajorityVoteClassifier` first aggregates the crowd labels in `z_train` by taking a majority vote per sample and then fits a standard single-annotator model.\n", "- In contrast, `AnnotMixClassifier` directly consumes the full matrix of annotator labels and internally models annotator performance.\n", "\n", "Both models share the same neural architecture and optimization hyperparameters so that differences in performance can be attributed to how they handle multiple annotators.\n"], "id": "8237de992529d4de"},
{"metadata": {"ExecuteTime": {"end_time": "2025-12-02T18:55:56.027719Z", "start_time": "2025-12-02T18:55:56.023746Z"}}, "cell_type": "code", "source": ["# Small CNN backbone for 8\u00d78 digit images\n", "class CNNModule(nn.Module):\n", "    \"\"\"\n", "    Compact CNN that returns both logits\n", "    and a 128-dimensional embedding.\n", "    \"\"\"\n", "\n", "    feature_dim = 128\n", "\n", "    def __init__(self, n_classes=10):\n", "        super().__init__()\n", "\n", "        # Input: (B, 1, 8, 8)\n", "        self.conv = nn.Conv2d(\n", "            in_channels=1,\n", "            out_channels=8,\n", "            kernel_size=3,\n", "            padding=1,\n", "        )  # -> (B, 8, 8, 8)\n", "\n", "        # after 2\u00d72 pooling: (B, 8, 4, 4)\n", "        self.fc = nn.Linear(self.feature_dim, n_classes)\n", "\n", "    def forward(self, x: torch.Tensor):\n", "        # x: (B, 1, 8, 8)\n", "        x = self.conv(x)\n", "        x = F.relu(x)\n", "        x = F.max_pool2d(x, kernel_size=2)  # (B, 8, 4, 4)\n", "        x_embed = x.view(x.size(0), -1)  # (B, 128)\n", "        logits = self.fc(x_embed)  # (B, n_classes)\n", "        return logits, x_embed\n", "\n", "class MajorityVoteClassifier(SkorchClassifier):\n", "    \"\"\"\n", "    Classifier that trains on labels\n", "    aggregated by majority vote.\n", "    \"\"\"\n", "\n", "    def fit(self, X, y, **fit_params):\n", "        y_mv = majority_vote(\n", "            y,\n", "            classes=self.classes,\n", "            missing_label=self.missing_label,\n", "            random_state=self.random_state,\n", "        )\n", "        return super().fit(X, y_mv, **fit_params)\n", "\n", "\n", "\n", "# Parameters passed to the NeuralNet in skorch\n", "neural_net_param_dict = {\n", "    # Module-related parameters.\n", "    \"module__n_classes\": len(classes),\n", "    # Optimizer-related parameters.\n", "    \"max_epochs\": 100,\n", "    \"optimizer\": RAdam,\n", "    \"optimizer__weight_decay\": 0.0,\n", "    \"optimizer__lr\": 0.01,\n", "    # Data loading parameters.\n", "    \"iterator_train__shuffle\": True,\n", "    \"iterator_train__num_workers\": 1,\n", "    \"iterator_train__batch_size\": 16,\n", "    \"iterator_valid__batch_size\": 64,\n", "    \"iterator_train__drop_last\": True,\n", "    \"train_split\": None,\n", "    # Scheduler.\n", "    \"callbacks\": [\n", "        (\n", "            \"lr_scheduler\",\n", "            LRScheduler(policy=CosineAnnealingLR, T_max=100),\n", "        ),\n", "    ],\n", "    # Misc.\n", "    \"verbose\": 0,\n", "    \"device\": DEVICE,\n", "}\n", "\n", "# Common keyword arguments shared by all classifiers\n", "common_clf_kwargs = dict(\n", "    classes=classes,\n", "    missing_label=MISSING_LABEL,\n", "    sample_dtype=np.float32,\n", "    neural_net_param_dict=neural_net_param_dict,\n", ")\n", "\n", "\n", "def make_majority_vote_classifier(random_state):\n", "    return MajorityVoteClassifier(\n", "        module=CNNModule,\n", "        criterion=nn.CrossEntropyLoss,\n", "        random_state=random_state,\n", "        **common_clf_kwargs,\n", "    )\n", "\n", "\n", "def make_annot_mix_classifier(random_state):\n", "    return AnnotMixClassifier(\n", "        clf_module=CNNModule,\n", "        sample_embed_dim=2,\n", "        n_annotators=n_annotators,\n", "        random_state=random_state,\n", "        **common_clf_kwargs,\n", "    )\n", "\n", "classifier_dict = {\n", "    \"Majority-Vote\": make_majority_vote_classifier,\n", "    \"Annot-Mix\": make_annot_mix_classifier,\n", "}"], "id": "c4347732b970bb0f", "outputs": [], "execution_count": 4},
{"metadata": {}, "cell_type": "markdown", "source": ["## Active Learning Loop with Multiple Annotators\n", "\n", "We now set up the active learning loop. For each classifier, we run several independent repetitions, and in each repetition we start with an annotation matrix `y` that contains only missing labels. We then alternate between fitting the classifier on the currently observed noisy annotations and querying new labels from the simulated annotators.\n", "\n", "The query strategy is based on random sampling and wrapped in a `SingleAnnotatorWrapper` so that individual annotator\u2013sample pairs can be requested. For the `Annot-Mix` model, we take the annotator performances `A_perf` estimated by the model in every cycle after the initial warm-start iteration and pass them to the query strategy. 
After each cycle, we record the classification accuracy on the held-out test set and the accuracy of the collected labels with respect to `y_train`.\n"], "id": "5bc9b5db6d898f18"},
{"metadata": {"ExecuteTime": {"end_time": "2025-12-02T19:02:03.975360Z", "start_time": "2025-12-02T18:55:56.066640Z"}}, "cell_type": "code", "source": ["n_reps = 3\n", "n_cycles = 10\n", "query_batch_size = 64\n", "\n", "results = {}\n", "\n", "for clf_name, clf_factory in classifier_dict.items():\n", "    # rows: repetitions, cols: AL cycles (excluding warm start)\n", "    clf_acc = np.full((n_reps, n_cycles), np.nan)\n", "    label_acc = np.full((n_reps, n_cycles), np.nan)\n", "\n", "    for i_rep in range(n_reps):\n", "        # Start with all annotations missing\n", "        y = np.full_like(z_train, fill_value=MISSING_LABEL, dtype=np.float32)\n", "\n", "        clf = clf_factory(random_state=i_rep)\n", "\n", "        base_qs = RandomSampling(\n", "            missing_label=MISSING_LABEL,\n", "            random_state=i_rep,\n", "        )\n", "        ma_qs = SingleAnnotatorWrapper(\n", "            base_qs,\n", "            missing_label=MISSING_LABEL,\n", "            random_state=i_rep,\n", "        )\n", "\n", "        for c in tqdm(\n", "            range(n_cycles + 1),\n", "            total=n_cycles + 1,\n", "            desc=f\"{clf_name} | repetition {i_rep + 1}/{n_reps}\",\n", "        ):\n", "            # Train classifier on current annotations\n", "            clf.fit(X_train, y)\n", "\n", "            if c > 0:\n", "                # For Annot-Mix, use estimated annotator performance from the model\n", "                if clf_name == \"Annot-Mix\":\n", "                    _, A_perf = clf.predict_proba(\n", "                        X_train,\n", "                        extra_outputs=[\"annotator_perf\"],\n", "                    )\n", "                else:\n", "                    A_perf = None\n", "\n", "                # Evaluate label and test accuracy\n", "                is_lbld_y = is_labeled(y, missing_label=MISSING_LABEL)\n", "                n_labels = is_lbld_y.sum()\n", "\n", "                if n_labels > 0:\n", "                    is_true_y = y == y_train[:, None]\n", "                    n_correct_labels = (is_true_y & is_lbld_y).sum()\n", "                    label_acc[i_rep, c - 1] = n_correct_labels / n_labels\n", "                clf_acc[i_rep, c - 1] = clf.score(X_test, y_test)\n", "            else:\n", "                # Warm start: no model-based A_perf\n", "                A_perf = None\n", "\n", "            # Query new (sample, annotator) pairs and reveal their labels\n", "            query_idx = ma_qs.query(\n", "                X=X_train,\n", "                y=y,\n", "                batch_size=query_batch_size,\n", "                A_perf=A_perf,\n", "                n_annotators_per_sample=1,\n", "            )\n", "            sample_idx, annotator_idx = query_idx[:, 0], query_idx[:, 1]\n", "            y[sample_idx, annotator_idx] = z_train[sample_idx, annotator_idx]\n", "\n", "    results[f\"{clf_name}_clf-acc\"] = clf_acc\n", "    results[f\"{clf_name}_label-acc\"] = label_acc"], "id": "6205f3055b5f6b02", "outputs": [], "execution_count": 5},
{"metadata": {}, "cell_type": "markdown", "source": ["`AnnotMixClassifier` models the annotators' performances, which serves as the basis for increasing the correct annotation rate by greedily assigning samples to annotators with higher estimated performances.\n"], "id": "8bb09e8413c8c47b"},
{"metadata": {"ExecuteTime": {"end_time": "2025-12-02T19:02:04.214503Z", "start_time": "2025-12-02T19:02:04.038912Z"}}, "cell_type": "code", "source": ["fig, axes = plt.subplots(1, 2, figsize=(14, 5), constrained_layout=True)\n", "\n", "metrics = [\n", "    (\"clf-acc\", \"Test Classification Accuracy\"),\n", "    (\"label-acc\", \"Correct Annotation Rate\"),\n", "]\n", "\n", "x_values = np.arange(1, n_cycles + 1) * query_batch_size\n", "\n", "for (metric_key, ylabel), ax in zip(metrics, axes):\n", "    for clf_name in classifier_dict.keys():\n", "        result = results[\n", "            f\"{clf_name}_{metric_key}\"\n", "        ]\n", "\n", "        mean_curve = np.mean(result, axis=0)\n", "        std_curve = np.std(result, axis=0)\n", "\n", "        alcu = np.mean(mean_curve)\n", "\n", "        ax.errorbar(\n", "            x_values,\n", "            mean_curve,\n", "            yerr=std_curve,\n", "            label=f\"{clf_name}: ALCU={alcu:.3f}\",\n", "            alpha=1.0,\n", "            markersize=0.1,\n", "        )\n", "\n", "    ax.set_xlabel(\"# Acquired Annotations\", fontsize=14)\n", "    ax.set_ylabel(ylabel, fontsize=14)\n", "    ax.grid(True)\n", "    ax.legend(\n", "        fontsize=14,\n", "        loc=\"lower right\",\n", "        frameon=True,\n", "    )\n", "    ax.tick_params(axis=\"both\", which=\"major\", labelsize=12)\n", "\n", "plt.show()"], "id": "d908397d36ff9d5d", "outputs": [], "execution_count": 6}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.0"}, "nbsphinx": {"orphan": true}}, "nbformat": 4, "nbformat_minor": 5}