Pool-based Active Learning for Regression - Getting Started#

This notebook gives an overview of some query strategies for active learning in regression.

[1]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

from skactiveml.pool import GreedySamplingX, GreedySamplingTarget, QueryByCommittee, \
    KLDivergenceMaximization
from sklearn.ensemble import BaggingRegressor
from skactiveml.regressor import NICKernelRegressor, SklearnRegressor
from skactiveml.utils import call_func, is_labeled
from scipy.stats import norm, uniform

mpl.rcParams["figure.facecolor"] = "white"

Data Set Generation#

First, we generate the data. This notebook provides four one-dimensional datasets.

[2]:
random_state = 0
n_iterations = 8

def uniform_rvs(*pos_args, **key_word_args):
    return uniform.rvs(*pos_args, **key_word_args, random_state=random_state)

def norm_rvs(*pos_args, **key_word_args):
    return norm.rvs(*pos_args, **key_word_args, random_state=random_state)

settings = {
    'basic': (
        lambda X_: (X_**3 + 2*X_**2 + X_ - 1).flatten(),
        np.sort(np.concatenate((uniform_rvs(0, 1, 60), uniform_rvs(1, 0.5, 30), uniform_rvs(1.5, 0.5, 10)))),
        np.concatenate((norm_rvs(0, 1.5, 60), norm_rvs(0, 0.5, 40)))
    ),
    'complex_func': (
        lambda X_: (6/7*X_**6 - 10*X_**3 + 20*X_).flatten(),
        np.sort(uniform_rvs(0, 2, 100)),
        norm_rvs(0, 0.5, 100)
    ),
    'high_noise': (
        lambda X_: X_.flatten(),
        np.sort(np.concatenate(tuple(uniform_rvs(s, 0.5, n) for s, n in [(0, 10), (0.5, 40), (1.0, 40), (1.5, 10)]))),
        np.concatenate(tuple(norm_rvs(0, std, n) for std, n in [(0.5, 10), (2.5, 80), (0.5, 10)]))
    ),
    'high_density_diff': (
        lambda X_: (X_**2).flatten(),
        np.sort(np.concatenate((uniform_rvs(0, 1, 10), uniform_rvs(1, 0.25, 80), uniform_rvs(1.25, 0.75, 10)))),
        norm_rvs(0, 0.25, 100)
    )
}

for variant in ['basic', 'complex_func', 'high_noise', 'high_density_diff']:
    true_function, X, noise = settings[variant]
    X = X.reshape(-1, 1)
    y_true = true_function(X) + noise
    X_test = np.linspace(0, 2, num=200).reshape(-1, 1)

    plt.title(variant)
    plt.scatter(X, y_true)
    plt.plot(X, true_function(X))
    plt.show()
[Output: one scatter plot per dataset ('basic', 'complex_func', 'high_noise', 'high_density_diff') showing the noisy samples together with the true function.]
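
Each entry in the settings dictionary is a tuple consisting of the true target function, the sample positions, and the additive noise. Further one-dimensional datasets can therefore be plugged in by following the same pattern; the entry below is a purely hypothetical example and not part of the original notebook.

# Hypothetical additional dataset variant, following the same (function, X, noise) structure.
settings['linear_low_noise'] = (
    lambda X_: (2 * X_).flatten(),      # true target function
    np.sort(uniform_rvs(0, 2, 100)),    # 100 sample positions in [0, 2]
    norm_rvs(0, 0.1, 100),              # small additive noise
)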

Active Regression#

Now, we want to look at how the different query strategies determine which samples to query. We do so by inspecting the utility each query strategy assigns to querying a sample. The assigned utility is displayed as a green line plotted against the right-hand axis. The unlabeled samples are light blue, the already labeled samples are orange, and the sample selected to be queried next is red. The model predictions, fitted on the labeled (orange) samples, are displayed in black. We provide a small evaluation of the query behavior for the four datasets in the cell below; a minimal single-query sketch follows this list. We make the following observations:

- GreedySamplingX queries labels quite uniformly over the feature space, creating utility spikes evenly spaced between the labeled samples.
- GreedySamplingTarget has a strong tendency to query labels where the target function is steep if the function is monotone, since there the target values differ the most from the prediction.
- QueryByCommittee only queries where some target data is already available and only slowly gains diversity. This happens because all the learners share the same prior and thus do not differ where no target data exist.
- KLDivergenceMaximization seems to rely on the steepness of the target function, the sample density, and the variance, which results in a fairly uniform querying density.
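
Before the full comparison, the following minimal sketch queries a single sample with GreedySamplingX on the 'basic' dataset and inspects the returned utilities. It is an illustrative addition that reuses the objects defined above and the same call_func pattern as the cell below; other strategies could be substituted in the same way.

# Minimal single-query sketch (illustrative; assumes the cells above have been run).
true_function, X, noise = settings['basic']
X = X.reshape(-1, 1)
y_true = true_function(X) + noise
y = np.full_like(y_true, np.nan)              # no sample is labeled yet

reg = NICKernelRegressor(metric_dict={'gamma': 15.0})
qs = GreedySamplingX(random_state=random_state)

# call_func forwards only those keyword arguments that qs.query actually accepts.
query_indices, utilities = call_func(
    qs.query, X=X, y=y, reg=reg, fit_reg=True, return_utilities=True
)
y[query_indices] = y_true[query_indices]      # simulate the oracle providing the label
print(query_indices, utilities.shape)         # one queried index; utilities have shape (1, n_samples)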

[3]:
for variant in ['basic', 'complex_func', 'high_noise', 'high_density_diff']:
    print(variant)

    true_function, X, noise = settings[variant] # select another data set here
    X = X.reshape(-1, 1)
    y_true = true_function(X) + noise
    X_test = np.linspace(0, 2, num=200).reshape(-1, 1)

    qs_s = [
        GreedySamplingX(random_state=random_state),
        GreedySamplingTarget(random_state=random_state),
        QueryByCommittee(random_state=random_state),
        KLDivergenceMaximization(
            random_state=random_state,
            integration_dict_target_val={
                "method": "assume_linear",
                "n_integration_samples": 3,
            },
            integration_dict_cross_entropy={
                "method": "assume_linear",
                "n_integration_samples": 3,
            }
        ),
    ]

    y = np.full_like(y_true, np.nan)
    y_s = [y.copy() for _ in range(len(qs_s))]

    reg = NICKernelRegressor(metric_dict={'gamma': 15.0})

    for i in range(n_iterations):
        fig, axes = plt.subplots(1, len(qs_s), figsize=(20, 5))
        axes = [axes, ] if len(qs_s)==1 else axes

        for qs, ax, y in zip(qs_s, axes, y_s):
            reg.fit(X, y)
            indices, utils = call_func(qs.query,
                X=X,
                y=y,
                reg=reg,
                ensemble=SklearnRegressor(BaggingRegressor(reg, n_estimators=4)),
                fit_reg=True,
                return_utilities=True,
            )
            _, utilities_test = call_func(qs.query,
                X=X,
                y=y,
                reg=reg,
                ensemble=SklearnRegressor(BaggingRegressor(reg, n_estimators=4)),
                candidates=X_test,
                fit_reg=True,
                return_utilities=True,
            )
            old_is_lbld = is_labeled(y)
            y[indices] = y_true[indices]
            is_lbld = is_labeled(y)
            ax_t = ax.twinx()
            ax_t.plot(X_test, utilities_test.flatten(), c='green')

            ax.scatter(X[~is_lbld], y_true[~is_lbld], c='lightblue')
            ax.scatter(X[old_is_lbld], y[old_is_lbld], c='orange')
            ax.scatter(X[indices], y[indices], c='red')

            y_pred, y_std = reg.predict(X_test, return_std=True)
            ax.plot(X_test, y_pred, c='black')
            if i == 0:
                ax.set_title(qs.__class__.__name__, fontdict={'fontsize': 15})

        plt.show()
basic
[Output: 8 rows of plots, one per iteration, each showing the utilities, labeled/unlabeled samples, and model prediction for all four query strategies on the 'basic' dataset.]
complex_func
[Output: 8 rows of plots, one per iteration, for all four query strategies on the 'complex_func' dataset.]
high_noise
[Output: 8 rows of plots, one per iteration, for all four query strategies on the 'high_noise' dataset.]
high_density_diff
[Output: 8 rows of plots, one per iteration, for all four query strategies on the 'high_density_diff' dataset.]