Contributing Guide
==================
**scikit-activeml** is a library that implements the most important
query strategies of active learning. It is built upon the well-known
machine learning framework
`scikit-learn `__.
Overview
--------
Our philosophy is to extend the ``sklearn`` eco-system with the most relevant
query strategies for active learning and to implement tools for working with
partially unlabeled data. An overview of our repository's structure is given in
the image below. Each node represents a class or interface. The arrows
illustrate the inheritance hierarchy among them. The functionality of a dashed
node is not yet available in our library.
.. image:: https://raw.githubusercontent.com/scikit-activeml/scikit-activeml/master/docs/logos/scikit-activeml-structure.png
:width: 1000
In our package ``skactiveml``, there three major components, i.e.,
``SkactivemlClassifier``, ``SkactivemlRegressor``, and the ``QueryStrategy``.
The classifier and regressor modules are necessary to deal with partially
unlabeled data and to implement active-learning specific estimators. This way,
an active learning cycle can be easily implemented to start with zero initial
labels. Regarding the active learning query strategies, we currently differ
between the pool-based (a large pool of unlabeled samples is available) and
stream-based (unlabeled samples arrive sequentially, i.e., as a stream)
paradigm. On top of both paradigms, we also distinguish the single- and
multi-annotator setting. In the latter setting, multiple error-prone annotators
are queried to provide labels. As a result, an active learning query strategy
not only decides which samples but also which annotators should be queried.
Thank you, contributors!
~~~~~~~~~~~~~~~~~~~~~~~~
A big thank you to all contributors who provide the **scikit-activeml**
project with new enhancements and bug fixes.
Getting Help
~~~~~~~~~~~~
If you have any questions, please reach out to other developers via the
following channels:
- `Github
Issues `__
Roadmap
~~~~~~~
Our roadmap is summarized in the issue `Upcoming
Features `__.
Get Started
-----------
Before you can contribute to this project, you might execute the
following steps.
Setup Development Environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There are several ways to create a local Python environment, such as
`virtualenv `__,
`pipenv `__,
`miniconda `__, etc. One
possible workflow is to install ``miniconda`` and use it to create a
Python environment.
Example With miniconda
^^^^^^^^^^^^^^^^^^^^^^
Create a new Python environment named **scikit-activeml**:
.. code:: bash
conda create -n scikit-activeml
To be sure that the correct environment is active:
.. code:: bash
conda activate scikit-activeml
Then install ``pip``:
.. code:: bash
conda install pip
Install Dependencies
~~~~~~~~~~~~~~~~~~~~
Now we can install some required project dependencies, which are defined
in the ``requirements.txt`` and ``requirements_extra.txt`` (for development)
files.
.. code:: bash
# Make sure your scikit-activeml python environment is active!
cd
pip install -r requirements.txt
pip install -r requirements_extra.txt
After the pip installation was successful, we have to install ``pandoc``
and ``ghostscript`` if it is not already installed.
Example with MacOS (Homebrew)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: bash
brew install pandoc ghostscript
Contributing Code
-----------------
General Coding Conventions
~~~~~~~~~~~~~~~~~~~~~~~~~~
As this library conforms to the convention of
`scikit-learn `__,
the code should conform to `PEP
8 `__ Style Guide for Python
Code. For linting, the use of
`flake8 `__ is recommended. The Python
package `black `__ provides a simple
solution for this formatting. Concretely, you can install it and format
the code via the following commands:
.. code:: bash
pip install black
black --line-length 79 example_file.py
Example for C3 (Code Contribution Cycle) and Pull Requests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. Fork the repository using the Github `Fork `__
button.
2. Then, clone your fork to your local machine:
.. code:: bash
git clone https://github.com//scikit-activeml.git
3. Create a new branch for your changes from the ``development`` branch:
.. code:: bash
git checkout -b
4. After you have finished implementing the feature, make sure that all
the tests pass. The tests can be run as
.. code:: bash
$ pytest
Make sure, you covered all lines by tests.
.. code:: bash
$ pytest --cov=./skactiveml
5. Commit and push the changes.
.. code:: bash
$ git add
$ git commit -m ""
$ git push
6. Create a pull request.
Query Strategies
----------------
All query strategies inherit from ``skactiveml.base.QueryStrategy`` as abstract
superclass implemented in ``skactiveml/base.py``. This superclass inherits from
``sklearn.base.Estimator``. The ``__init__`` method requires by default a
``random_state`` parameter and the abstract method ``query`` is to enforce the
implementation of the sample selection logic.
Single-annotator Pool-based Query Strategies
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. _general-1:
General
^^^^^^^
Single-annotator pool-based query strategies are stored in a file
``skactiveml/pool/*.py`` and inherit from
``skactiveml.base.SingleAnnotatorPoolQueryStrategy``.
The class must implement the following methods:
+------------+----------------------------------------------------------------+
| Method | Description |
+============+================================================================+
| ``init`` | Method for initialization. |
+------------+----------------------------------------------------------------+
| ``query`` | Select the samples whose labels are to be queried. |
+------------+----------------------------------------------------------------+
.. _init-1:
``__init__`` method
^^^^^^^^^^^^^^^^^^^
For typical class parameters, we use standard names:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``random_state`` | Number or np.random.RandomState |
| | like sklearn. |
+-----------------------------------------------------------------------+
| ``prior``, optional | Prior probabilities for the |
| | distribution of probabilistic |
| | strategies. |
+-----------------------------------+-----------------------------------+
| ``method``, optional | String for classes that implement |
| | multiple methods. |
+-----------------------------------+-----------------------------------+
| ``cost_matrix``, optional | Cost matrix defining the cost of |
| | interchanging classes. |
+-----------------------------------+-----------------------------------+
.. _query-1:
``query`` method
^^^^^^^^^^^^^^^^
Required Parameters:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``X`` | Training data set, usually |
| | complete, i.e. including the |
| | labeled and unlabeled samples. |
+-----------------------------------+-----------------------------------+
| ``y`` | Labels of the training data set |
| | (possibly including unlabeled |
| | ones indicated by MISSING_LABEL.) |
+-----------------------------------+-----------------------------------+
| ``candidates``, optional | If candidates is None, the |
| | unlabeled samples from (X, y) are |
| | considered as candidates. If |
| | candidates is of shape |
| | (n_candidates) and of type int, |
| | candidates is considered as the |
| | indices of the samples in (X,y). |
| | If candidates is of shape |
| | (n_candidates, n_features), the |
| | candidates are directly given in |
| | candidates (not necessarily |
| | contained in X). This is not |
| | supported by all query |
| | strategies. |
+-----------------------------------+-----------------------------------+
| ``batch_size``, optional | Number of samples to be selected |
| | in one AL cycle. |
+-----------------------------------+-----------------------------------+
| ``return_utilities``, optional | If true, additionally return the |
| | utilities of the query strategy.` |
+-----------------------------------+-----------------------------------+
Returns:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``query_indices`` | The ``query_indices`` indicate |
| | for which candidate sample a |
| | label is to be queried, e.g., |
| | ``query_indices[0]`` indicates |
| | the first selected sample. If |
| | candidates is None or of shape |
| | (n_candidates), the indexing |
| | refers to samples in ``X``. If |
| | candidates is of shape |
| | (n_candidates, n_features), the |
| | indexing refers to samples in |
| | candidates. |
+-----------------------------------+-----------------------------------+
| ``utilities``, optional | The utilities of samples after |
| | each selected sample of the |
| | batch, e.g., ``utilities[0]`` |
| | indicates the utilities used for |
| | selecting the first sample (with |
| | index ``query_indices[0]``) of |
| | the batch. Utilities for labeled |
| | samples will be set to np.nan. If |
| | candidates is None or of shape |
| | (n_candidates), the indexing |
| | refers to samples in ``X``. If |
| | candidates is of shape |
| | (n_candidates, n_features), the |
| | indexing refers to samples in |
| | candidates. |
+-----------------------------------+-----------------------------------+
.. _general-advice-1:
General advice
''''''''''''''
Use ``self._validate_data`` method (implemented in the superclass).
Check the input ``X`` and ``y`` only once. Fit the classifier or regressors if
it is not yet fitted (may use ``fit_if_not_fitted`` from ``utils``). Calculate
utilities via an extra function that should be public. Use ``simple_batch``
function from ``utils`` for determining `query_indices` and setting ``utilities``
in naive batch query strategies.
.. _testing-1:
Testing
^^^^^^^
The test classes ``skactiveml.pool.test.TestQueryStrategy`` of single-annotator
pool-based query strategies need to inherit from the test template
``skactiveml.tests.template_query_strategy.TemplateSingleAnnotatorPoolQueryStrategy``.
As a result, many required functionalities will be automatically tested.
As a requirement, one needs to specify the parameters of ``qs_class``,
``init_default_params`` of the ``__init__`` accordingly. Depending on whether
the query strategy can handle regression/classification or both settings, one
needs to additionally define the parameters
``query_default_params_reg/query_default_params_clf``.
Once, the parameters are set, the developer needs to adjust the test until
all errors are resolved. In particular, the method ``test_query`` must
be implemented. We refer to the test template for more detailed information.
Single-annotator Stream-based Query Strategies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. _general-2:
General
^^^^^^^
All query strategies are stored in a file ``skactivml/stream/*.py``.
Every query strategy inherits from
``SingleAnnotatorStreamQueryStrategy``. Every query strategy has
either an internal budget handling or an outsourced ``budget_manager``.
For typical class parameters we use standard names:
+------------------------------+------------------------------------------+
| Parameter | Description |
+==============================+==========================================+
| ``random_state`` | Integer that acts as random seed |
| | or ``np.random.RandomState`` like |
| | sklearn |
+------------------------------+------------------------------------------+
| ``budget`` | The share of labels that thestrategy is |
| | allowed to query |
+------------------------------+------------------------------------------+
| ``budget_manager``, optional | Enforces the budget constraint |
+------------------------------+------------------------------------------+
The class must implement the following methods:
+------------+-----------------------------------------------------------------+
| Function | Description |
+============+=================================================================+
| ``init`` | Function for initialization |
+------------+-----------------------------------------------------------------+
| ``query`` | Identify the instances whose labels to select without adapting |
| | the internal state |
+------------+-----------------------------------------------------------------+
| ``update`` | Adapting the budget monitoring according to the queried labels |
+------------+-----------------------------------------------------------------+
.. _query-method-2:
``query`` method
^^^^^^^^^^^^^^^^^^
Required Parameters:
+------------------------------+-------------------------------------------------------------+
| Parameter | Description |
+==============================+=============================================================+
| ``candidates`` | Set of candidate instances, |
| | inherited from |
| | ``SingleAnnotatorStreamBasedQueryStrategy`` |
+------------------------------+-------------------------------------------------------------+
| ``clf``, optional | The classifier used by the |
| | strategy |
+------------------------------+-------------------------------------------------------------+
| ``X``, optional | Set of labeled and unlabeled |
| | instances |
+------------------------------+-------------------------------------------------------------+
| ``y``, optional | Labels of ``X`` (it may be set to |
| | ``MISSING_LABEL`` if ``y`` is |
| | unknown) |
+------------------------------+-------------------------------------------------------------+
| ``sample_weight``, optional | Weights for each instance in |
| | ``X`` or ``None`` if all are |
| | equally weighted |
+------------------------------+-------------------------------------------------------------+
| ``fit_clf``, optional | uses ``X`` and ``y`` to fit the classifier |
+------------------------------+-------------------------------------------------------------+
| ``return_utilities`` | Whether to return the candidates' utilities, |
| | inherited from ``SingleAnnotatorStreamBasedQueryStrategy`` |
+------------------------------+-------------------------------------------------------------+
Returns:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``queried_indices`` | Indices of the best instances |
| | from ``X_Cand`` |
+-----------------------------------+-----------------------------------+
| ``utilities`` | Utilities of all candidate |
| | instances, only if |
| | ``return_utilities`` is ``True`` |
+-----------------------------------+-----------------------------------+
.. _general-advice-2:
General advice
''''''''''''''
The ``query`` method must not change the internal state of the ``query``
strategy (``budget``, ``budget_manager`` and ``random_state`` included) to allow
for assessing multiple instances with the same state. Update the internal state
in the ``update()`` method. If the class implements a classifier (``clf``) the
optional attributes need to be implement. Use ``self._validate_data`` method
(is implemented in superclass). Check the input ``X`` and ``y`` only once. Fit
classifier if ``fit_clf`` is set to ``True``.
.. _update-1:
``update`` method
^^^^^^^^^^^^^^^^^^^
Required Parameters:
+-------------------------------+----------------------------------------------+
| Parameter | Description |
+===============================+==============================================+
| ``candidates`` | Set of candidate instances, |
| | inherited from |
| | ``SingleAnnotatorStreamBasedQueryStrategy`` |
+-------------------------------+----------------------------------------------+
| ``queried_indices`` | Typically the return value of |
| | ``query`` |
+-------------------------------+----------------------------------------------+
| ``budget_manager_param_dict`` | Provides additional parameters to |
| | the ``update`` method of the |
| | ``budget_manager`` (only include |
| | if a ``budget_manager`` is used) |
+-------------------------------+----------------------------------------------+
.. _general-advice-3:
General advice
''''''''''''''
Use ``self._validate_data`` in case the strategy is used without using
the ``query`` method (if parameters need to be initialized before the
update). If a ``budget_manager`` is used forward the update call to the
``budget_manager.update`` method.
.. _testing-2:
Testing
^^^^^^^
All stream query strategies are tested by a general unittest
(``stream/tests/test_stream.py``) -For every class
``ExampleQueryStrategy`` that inherits from
``SingleAnnotatorStreamQueryStrategy`` (stored in ``_example.py``), it
is automatically tested if there exists a file ``test/test_example.py``.
It is necessary that both filenames are the same. Moreover, the test
class must be called ``TestExampleQueryStrategy`` and inherit from
``unittest.TestCase``. Every parameter in ``init()`` will be tested if
it is written the same as a class variable. Every parameter arg in
``init()`` will be evaluated if there exists a method in the testclass
``TestExampleQueryStrategy`` that is called ``test_init_param_arg()``.
Every parameter arg in ``query()`` will be evaluated if there exists a
method in the testclass ``TestExampleQueryStrategy`` that is called
``test_query_param_arg()``. It is tested if the internal state of ``query()``
is unchanged after multiple calls without using ``update()``.
.. _general-advice-4:
General advice for the ``budget_manager``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
All budget managers are stored in
``skactivml/stream/budget_manager/*.py``. The class must implement the
following methods:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``__init__`` | Function for initialization |
+-----------------------------------+-----------------------------------+
| ``query_by_utilities`` | Identify which instances to query |
| | based on the assessed utility |
+-----------------------------------+-----------------------------------+
| ``update`` | Adapting the budget monitoring |
| | according to the queried labels |
+-----------------------------------+-----------------------------------+
.. _update-2:
``update`` method
^^^^^^^^^^^^^^^^^^^
The update method of the budget manager has the same functionality as
the query strategy update.
Required Parameters:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``budget`` | % of labels that the strategy is |
| | allowed to query |
+-----------------------------------+-----------------------------------+
| ``random_state`` | Integer that acts as random seed |
| | or ``np.random.RandomState`` like |
| | sklearn |
+-----------------------------------+-----------------------------------+
.. _query-by-utilities-1:
``query_by_utilities`` method
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Required Parameters:
+-----------------------------------+------------------------------------+
| Parameter | Description |
+===================================+====================================+
| ``utilities`` | The ``utilities`` of ``candidates``|
| | calculated by the query strategy, |
| | inherited from ``BudgetManager`` |
+-----------------------------------+------------------------------------+
.. _general-advice-5:
General advice for working with a ``budget_manager``:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If a ``budget_manager`` is used, the ``_validate_data`` of the query
strategy needs to be adapted accordingly:
- If only a ``budget`` is given use the default ``budget_manager`` with
the given budget
- If only a ``budget_manager`` is given use the ``budget_manager``
- If both are not given use the default ``budget_manager`` with the
default budget
- If both are given and the budget differs from
``budget_manager.budget`` throw an error
All budget managers are tested by a general unittest
(``stream/budget_manager/tests/test_budget_manager.py``). For every
class ``ExampleBudgetManager`` that inherits from ``BudgetManager``
(stored in ``_example.py``), it is automatically tested if there exists
a file ``test/test_example.py``. It is necessary that both filenames are
the same.
.. _testing-1:
Testing
^^^^^^^
Moreover, the test class must be called ``TestExampleBudgetManager`` and
inheriting from ``unittest.TestCase``. Every parameter in ``__init__()``
will be tested if it is written the same as a class variable. Every
parameter ``arg`` in ``__init__()`` will be evaluated if there exists a
method in the testclass ``TestExampleQueryStrategy`` that is called
``test_init_param_arg()``. Every parameter ``arg`` in
``query_by_utility()`` will be evaluated if there exists a method in the
testclass ``TestExampleQueryStrategy`` that is called
``test_query_by_utility`` ``_param_arg()``. It is tested if the internal state
of ``query()`` is unchanged after multiple calls without using ``update()``.
Multi-Annotator Pool-based Query Strategies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All query strategies are stored in a file
``skactiveml/pool/multi/*.py`` and inherit
``skactiveml.base.MultiAnnotatorPoolQueryStrategy``.
The class must implement the following methods:
+------------+----------------------------------------------------------------+
| Method | Description |
+============+================================================================+
| ``init`` | Method for initialization. |
+------------+----------------------------------------------------------------+
| ``query`` | Select the annotator-sample pairs to decide which sample's |
| | class label is to be queried from which annotator. |
+------------+----------------------------------------------------------------+
.. _query-method-3:
``query`` method
^^^^^^^^^^^^^^^^
Required Parameters:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``X`` | Training data set, usually |
| | complete, i.e. including the |
| | labeled and unlabeled samples. |
+-----------------------------------+-----------------------------------+
| ``y`` | Labels of the training data set |
| | for each annotator (possibly |
| | including unlabeled ones |
| | indicated by self.MISSING_LABEL), |
| | meaning that ``y[i, j]`` contains |
| | the label annotated by annotator |
| | ``i`` for sample ``j``. |
+-----------------------------------+-----------------------------------+
| ``candidates``, optional | If ``candidates`` is ``None``, |
| | the samples from ``(X, y)``, for |
| | which an annotator exists such |
| | that the annotator sample pair is |
| | unlabeled are considered as |
| | sample candidates. |
| | If ``candidates`` is of shape |
| | ``(n_candidates,)`` and of type |
| | int, ``candidates`` is considered |
| | as the indices of the sample |
| | candidates in ``(X, y)``. If |
| | ``candidates`` is of shape |
| | ``(n_candidates, n_features)``, |
| | the sample candidates are |
| | directly given in ``candidates`` |
| | (not necessarily contained in |
| | ``X``). This is not supported by |
| | all query strategies. |
+-----------------------------------+-----------------------------------+
| ``annotators``, optional | If ``annotators`` is ``None``, |
| | all annotators are considered as |
| | available annotators. If |
| | ``annotators`` is of shape |
| | (n_avl_annotators), and of type |
| | int, ``annotators`` is considered |
| | as the indices of the available |
| | annotators. If candidate samples |
| | and available annotators are |
| | specified: The annotator-sample |
| | pairs, for which the sample is a |
| | candidate sample and the |
| | annotator is an available |
| | annotator are considered as |
| | candidate annotator-sample-pairs. |
| | If ``annotators`` is a boolean |
| | array of shape (n_candidates, |
| | n_avl_annotators) the |
| | annotator-sample pairs, for which |
| | the sample is a candidate sample |
| | and the boolean matrix has entry |
| | ``True`` are considered as |
| | candidate annotator-sample pairs. |
+-----------------------------------+-----------------------------------+
| ``batch_size``, optional | The number of annotator-sample |
| | pairs to be selected in one AL |
| | cycle. |
+-----------------------------------+-----------------------------------+
| ``return_utilities``, optional | If ``True``, also return the |
| | utilities based on the query |
| | strategy. |
+-----------------------------------+-----------------------------------+
Returns:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``query_indices`` | The ``query_indices`` indicate |
| | for which candidate sample a |
| | label is to be queried, e.g., |
| | ``query_indices[0]`` indicates |
| | the first selected sample. If |
| | candidates is None or of shape |
| | (n_candidates), the indexing |
| | refers to samples in ``X``. If |
| | candidates is of shape |
| | (n_candidates, n_features), the |
| | indexing refers to samples in |
| | candidates. |
+-----------------------------------+-----------------------------------+
| ``utilities`` | The utilities of samples after |
| | each selected sample of the |
| | batch, e.g., ``utilities[0]`` |
| | indicates the utilities used for |
| | selecting the first sample (with |
| | index ``query_indices[0]``) of |
| | the batch. Utilities for labeled |
| | samples will be set to np.nan. If |
| | candidates is None or of shape |
| | (n_candidates), the indexing |
| | refers to samples in ``X``. If |
| | candidates is of shape |
| | (n_candidates, n_features), the |
| | indexing refers to samples in |
| | candidates. |
+-----------------------------------+-----------------------------------+
.. _general-advice-6:
General advice
''''''''''''''
Use ``self._validate_data method`` (is implemented in superclass).
Check the input ``X`` and ``y`` only once. Fit classifier if it is not
yet fitted (may use ``fit_if_not_fitted`` form ``utils``). If the
strategy combines a single annotator query strategy with a performance
estimate:
- define an aggregation function,
- evaluate the performance for each sample-annotator pair,
- use the ``SingleAnnotatorWrapper``.
If the strategy is a ``greedy`` method regarding the utilities:
- calculate utilities (in an extra function),
- use ``skactiveml.utils.simple_batch`` function for returning values.
.. _testing-3:
Testing
^^^^^^^
The test classes ``skactiveml.pool.multiannotator.test.TestQueryStrategy`` of
multi-annotator pool-based query strategies need inherit form
``unittest.TestCase``. In this class, each parameter ``a`` of the
``__init__`` method needs to be tested via a method ``test_init_param_a``.
This applies also for a parameter ``a`` of the ``query`` method, which is
tested via a method ``test_query_param_a``. The main logic of the query
strategy is test via the method ``test_query``.
Classifiers
-----------
Standard classifier implementations are part of the subpackage
``skactiveml.classifier`` and classifiers learning from multiple
annotators are implemented in its subpackage
``skactiveml.classifier.multiannotator``. Every class of a classifier inherits
from ``skactiveml.base.SkactivemlClassifier``.
The class must implement the following methods:
+-------------------+---------------------------------------------------------+
| Method | Description |
+===================+=========================================================+
| ``init`` | Method for initialization. |
+-------------------+---------------------------------------------------------+
| ``fit`` | Method to fit the classifier for given training data. |
+-------------------+---------------------------------------------------------+
| ``predict_proba`` | Method predicting class-membership probabilities for |
| | samples. |
+-------------------+---------------------------------------------------------+
| ``predict`` | Method predicting class labels for samples. The super |
| | already provides an implementation using |
| | ``predict_proba``. |
+-------------------+---------------------------------------------------------+
.. _init-2:
``init`` method
~~~~~~~~~~~~~~~
Required Parameters:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``classes``, optional | Holds the label for each class. |
| | If ``None``, the classes are |
| | determined during the fit. |
+-----------------------------------+-----------------------------------+
| ``missing_label``, optional | Value to represent a missing |
| | label. |
+-----------------------------------+-----------------------------------+
| ``cost_matrix``, optional | Cost matrix with |
| | ``cost_matrix[i,j]`` indicating |
| | cost of predicting class |
| | ``classes[j]`` for a sample of |
| | class ``classes[i]``. Can be only |
| | set, if classes is not ``None``. |
+-----------------------------------+-----------------------------------+
| ``random_state``, optional | Ensures reproducibility |
| | (cf. scikit-learn). |
+-----------------------------------+-----------------------------------+
.. _fit-1:
``fit`` method
~~~~~~~~~~~~~~
Required Parameters:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``X`` | Is a matrix of feature values |
| | representing the samples. |
+-----------------------------------+-----------------------------------+
| ``y`` | Contains the class labels of the |
| | training samples. Missing labels |
| | are represented through the |
| | attribute ``missing_label``. |
| | Usually, ``y`` is a column array |
| | except for multi-annotator |
| | classifiers which expect a matrix |
| | with columns containing the class |
| | labels provided by a specific |
| | annotator. |
+-----------------------------------+-----------------------------------+
| ``sample_weight``, optional | Contains the weights of the |
| | training samples' class labels. |
| | It must have the same shape as |
| | ``y``. |
+-----------------------------------+-----------------------------------+
Returns:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
|``self`` | The fitted classifier object. |
+-----------------------------------+-----------------------------------+
.. _general-advice-7:
General advice
^^^^^^^^^^^^^^
Use ``self._validate_data`` method (is implemented in superclass) to
check standard parameters of ``__init__`` and ``fit`` method. If the
``classes`` parameter was provided, the classifier can be fitted with
training sample of which each was assigned a ``missing_label``.
In this case, the classifier should make random predictions, i.e.,
outputting uniform class-membership probabilities when calling
``predict_proba``. Ensure that the classifier can handle ``missing labels``
also in other cases.
.. _predict-proba-1:
``predict_proba`` method
~~~~~~~~~~~~~~~~~~~~~~~~
Required Parameters:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``X`` | Is a matrix of feature values |
| | representing the samples, for |
| | which the classifier will make |
| | predictions. |
+-----------------------------------+-----------------------------------+
Returns:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``P`` | The estimated class-membership |
| | probabilities per sample. |
+-----------------------------------+-----------------------------------+
.. _general-advice-8:
General advice
^^^^^^^^^^^^^^
Check parameter ``X`` regarding its shape, i.e., use superclass method
``self._check_n_features`` to ensure a correct number of features. Check
that the classifier has been fitted. If the classifier is a
``skactiveml.base.ClassFrequencyEstimator``, this method is already
implemented in the superclass.
.. _predict-1:
``predict`` method
~~~~~~~~~~~~~~~~~~
Required Parameters:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``X`` | Is a matrix of feature values |
| | representing the samples, for |
| | which the classifier will make |
| | predictions. |
+-----------------------------------+-----------------------------------+
Returns:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``y_pred`` | The estimated class label |
| | of each per sample. |
+-----------------------------------+-----------------------------------+
.. _general-advice-9:
General advice
^^^^^^^^^^^^^^
Usually, this method is already implemented by the superclass through
calling the ``predict_proba`` method. If the superclass method is
overwritten, ensure that it can handle imbalanced costs and missing
labels.
.. _score-1:
``score`` method
~~~~~~~~~~~~~~~~
Required Parameters:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``X`` | Is a matrix of feature values |
| | representing the samples, for |
| | which the classifier will make |
| | predictions. |
+-----------------------------------+-----------------------------------+
| ``y`` | Contains the true label of each |
| | sample. |
+-----------------------------------+-----------------------------------+
| ``sample_weight``, optional | Defines the importance of each |
| | sample when computing the |
| | accuracy of the classifier. |
+-----------------------------------+-----------------------------------+
Returns:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``score`` | Mean accuracy of |
| | ``self.predict(X)`` regarding |
| | ``y``. |
+-----------------------------------+-----------------------------------+
.. _general-advice-10:
General advice
^^^^^^^^^^^^^^
Usually, this method is already implemented by the superclass. If the
superclass method is overwritten, ensure that it checks the parameters
and that the classifier has been fitted.
.. _testing-4:
Testing
~~~~~~~
All classifiers are tested by a general unittest
(``skactiveml/classifier/tests/test_classifier.py``). For every class
``ExampleClassifier`` that inherits from
``skactiveml.base.SkactivemlClassifier`` (stored in
``_example_classifier.py``), it is automatically tested if there exists
a file ``tests/test_example_classifier.py``. It is necessary that both
filenames are the same. Moreover, the test class must be called
``TestExampleClassifier`` and inherit from ``unittest.TestCase``. For
each parameter of an implemented method, there must be a test method
called ``test_methodname_parametername`` in the Python file
``tests/test_example_classifier.py``. It is to check whether invalid parameters
are handled correctly. For each implemented method, there must be a test
method called ``test_methodname`` in the Python file
``tests/test_example_classifier.py``. It is to check whether the method works
as intended.
Regressors
----------
Standard regressors implementations are part of the subpackage
``skactiveml.regressor``. Every class of a regressor inherits
from ``skactiveml.base.SkactivemlRegressor``.
The class must implement the following methods:
+-------------------+---------------------------------------------------------+
| Method | Description |
+===================+=========================================================+
| ``init`` | Method for initialization. |
+-------------------+---------------------------------------------------------+
| ``fit`` | Method to fit the regressor for given training data. |
+-------------------+---------------------------------------------------------+
| ``predict`` | Method predicting the target values (labels) for |
| | samples. |
+-------------------+---------------------------------------------------------+
.. _init-3:
``init`` method
~~~~~~~~~~~~~~~
Required Parameters:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``random_state``, optional | Ensures reproducibility |
| | (cf. scikit-learn). |
+-----------------------------------+-----------------------------------+
| ``missing_label``, optional | Value to represent a missing |
| | label. |
+-----------------------------------+-----------------------------------+
.. _fit-2:
``fit`` method
~~~~~~~~~~~~~~
Required Parameters:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``X`` | Is a matrix of feature values |
| | representing the samples. |
+-----------------------------------+-----------------------------------+
| ``y`` | Contains the target values of the |
| | training samples. Missing labels |
| | are represented through the |
| | attribute ``missing_label``. |
| | Usually, ``y`` is a column array |
| | except for multi-target |
| | regressors which expect a matrix |
| | with columns containing the |
| | different target types. |
+-----------------------------------+-----------------------------------+
| ``sample_weight``, optional | Contains the weights of the |
| | training samples' targets. |
| | It must have the same shape as |
| | ``y``. |
+-----------------------------------+-----------------------------------+
Returns:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
|``self`` | The fitted regressor object. |
+-----------------------------------+-----------------------------------+
.. _general-advice-11:
General advice
^^^^^^^^^^^^^^
Use ``self._validate_data`` method (is implemented in superclass) to
check standard parameters of ``__init__`` and ``fit`` method. If the regressor
was fitted on training sample of which each was assigned a ``missing_label``,
the regressor should predict a default value of zero when calling ``predict``.
Ensure that the regressor can handle ``missing labels`` also in other cases.
.. _predict-2:
``predict`` method
~~~~~~~~~~~~~~~~~~~~~~~~
Required Parameters:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``X`` | Is a matrix of feature values |
| | representing the samples, for |
| | which the regressor will make |
| | predictions. |
+-----------------------------------+-----------------------------------+
Returns:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``y_pred`` | The estimated targets per sample. |
+-----------------------------------+-----------------------------------+
.. _general-advice-12:
General advice
^^^^^^^^^^^^^^
Check parameter ``X`` regarding its shape, i.e., use superclass method
``self._check_n_features`` to ensure a correct number of features. Check
that the regressor has been fitted. If the classifier is a
``skactiveml.base.ProbabilisticRegressor``, this method is already
implemented in the superclass.
.. _score-2:
``score`` method
~~~~~~~~~~~~~~~~
Required Parameters:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``X`` | Is a matrix of feature values |
| | representing the samples, for |
| | which the regressor will make |
| | predictions. |
+-----------------------------------+-----------------------------------+
| ``y`` | Contains the true target of each |
| | sample. |
+-----------------------------------+-----------------------------------+
| ``sample_weight``, optional | Defines the importance of each |
| | sample when computing the |
| | R2 score of the regressor. |
+-----------------------------------+-----------------------------------+
Returns:
+-----------------------------------+-----------------------------------+
| Parameter | Description |
+===================================+===================================+
| ``score`` | R2 score of ``self.predict(X)`` |
| | regarding ``y``. |
+-----------------------------------+-----------------------------------+
.. _general-advice-13:
General advice
^^^^^^^^^^^^^^
Usually, this method is already implemented by the superclass. If the
superclass method is overwritten, ensure that it checks the parameters
and that the regressor has been fitted.
.. _testing-5:
Testing
~~~~~~~
For every class ``ExampleRegressor`` that inherits from
``skactiveml.base.SkactivemlRegressor`` (stored in
``_example_regressor.py``), there need to be a file
``tests/test_example_classifier.py``. It is necessary that both
filenames are the same. Moreover, the test class must be called
``TestExampleRegressor`` and inherit from ``unittest.TestCase``. For
each parameter of an implemented method, there must be a test method
called ``test_methodname_parametername`` in the Python file
``tests/test_example_regressor.py``. It is to check whether invalid parameters
are handled correctly. For each implemented method, there must be a test
method called ``test_methodname`` in the Python file
``tests/test_example_regressor.py``. It is to check whether the method works
as intended.
Annotators Models
-----------------
Annotator models are marked by implementing the interface
``skactiveml.base.AnnotatorModelMixin``. These models can estimate the
performances of annotators for given samples. The class of an annotator model
must implement the ``predict_annotator_perf`` method estimating the
performances per sample of each annotator as proxies of the provided
annotations' qualities.
.. _predict-annotator-perf-1:
``predict_annotator_perf`` method
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Required Parameters:
+-------------+---------------------------------------------------------+
| Parameter | Description |
+=============+=========================================================+
| ``X`` | Is a matrix of feature values representing the samples. |
+-------------+---------------------------------------------------------+
Returns:
+-------------+---------------------------------------------------------+
| Parameter | Description |
+=============+=========================================================+
| ``P_annot`` | The estimated performances per sample-annotator pair. |
+-------------+---------------------------------------------------------+
.. _general-advice-14:
General advice
^^^^^^^^^^^^^^
Check parameter ``X`` regarding its shape and check that the annotator
model has been fitted. If no samples or class labels were provided
during the previous call of the ``fit`` method, the maximum value of
annotator performance should be outputted for each sample-annotator
pair.
Examples
--------
Two of our main goals are to make active learning more understandable and
improve our framework's usability.
Therefore, we require the implementation of an example for each query strategy.
To do so, one needs to create a file name
``scikit-activeml/docs/examples/query_strategy.json``. Currently, we support
examples for single-annotator pool-based query strategies and single-annotator
stream-based query strategies.
The ``.json`` file supports the following entries:
+------------------+----------------------------------------------------------+
| Entry | Description |
+==================+==========================================================+
| ``class`` | Query strategy's class name. |
+------------------+----------------------------------------------------------+
| ``package`` | Name of the sub-package, e.g., pool. |
+------------------+----------------------------------------------------------+
| ``method`` | Query strategy's official name. |
+------------------+----------------------------------------------------------+
| ``category`` | The methodological category of this query strategy, |
| | i.e., Expected Error Reduction, Model Change, |
| | Query-by-Committee, Random Sampling, |
| | Uncertainty Sampling, or Others. |
+------------------+----------------------------------------------------------+
| ``template`` | Defines the general setup/setting of the example. |
| | Supported templates are ``examples/template_pool.py``, |
| | ``examples/template_pool_regression.py``, |
| | ``examples/template_stream.py``, and |
| | ``examples/template_pool_batch.py`` |
+------------------+----------------------------------------------------------+
| ``tags`` | Defines search categories. Supported tags are ``pool``, |
| | ``stream``, ``single-annotator``, ``multi-annotator``, |
| | ``classification``, and ``regression``. |
+------------------+----------------------------------------------------------+
| ``title`` | Title of the example, usually named after the query |
| | strategy. |
+------------------+----------------------------------------------------------+
| ``text_0`` | Placeholder for additional explanations. |
+------------------+----------------------------------------------------------+
| ``refs`` | References (BibTeX key) to the paper(s) of the query |
| | strategy. |
+------------------+----------------------------------------------------------+
| ``sequence`` | Order in which content is displayed, usually ["title", |
| | "text_0", "plot", "refs"]. |
+------------------+----------------------------------------------------------+
| ``import_misc`` | Python code for imports, e.g., |
| | "from skactiveml.pool import RandomSampling". |
+------------------+----------------------------------------------------------+
| ``n_samples`` | Number of samples of the example data set. |
+------------------+----------------------------------------------------------+
| ``init_qs`` | Python code to initialize the query strategy object, |
| | e.g., "RandomSampling()". |
+------------------+----------------------------------------------------------+
| ``query_params`` | Python code of parameters passed to the query method of |
| | the query strategy, e.g., "X=X, y=y". |
+------------------+----------------------------------------------------------+
| ``preproc`` | Python code for preprocessing before executing the AL |
| | cycle, e.g., "X = (X-X.min())/(X.max()-X.min())". |
+------------------+----------------------------------------------------------+
| ``n_cycles`` | Number of AL cycles. |
+------------------+----------------------------------------------------------+
| ``init_clf`` | Python code to initialize the classifier object, e.g., |
| | "ParzenWindowClassifier(classes=[0, 1])". Only supported |
| | for ``examples/template_pool.py``, |
| | ``examples/template_pool_batch.py``, and |
| | ``examples/template_stream.py``. |
+------------------+----------------------------------------------------------+
| ``init_reg`` | Python code to initialize the regressor object, e.g., |
| | "NICKernelRegressor()". Only supported for |
| | ``examples/template_pool_regression.py``. |
+------------------+----------------------------------------------------------+
Testing and code coverage
-------------------------
Please ensure test coverage is close to 100%. The current code coverage
can be viewed
`here `__.
Documentation
-------------
Guidelines for writing documentation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In ``scikit-activeml``, the
`guidelines `__
for writing the documentation are adopted from
`scikit-learn `__.
Building the documentation
~~~~~~~~~~~~~~~~~~~~~~~~~~
To ensure the documentation of your work is well formatted, build the sphinx
documentation by executing the following line.
.. code:: bash
sphinx-build -b html docs docs/_build
Issue Tracking
--------------
We use `Github
Issues `__ as
our issue tracker. If you think you have found a bug in ``scikit-activeml``,
you can report it to the issue tracker. Documentation bugs can also be reported
there.
Checking If A Bug Already Exists
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The first step before filing an issue report is to see whether the
problem has already been reported. Checking if the problem is an
existing issue will:
1. Help you see if the problem has already been resolved or has been
fixed for the next release
2. Save time for you and the developers
3. Help you learn what needs to be done to fix it
4. Determine if additional information, such as how to replicate the
issue, is needed
To see if the issue already exists, search the issue database (``bug``
label) using the search box on the top of the issue tracker page.
Reporting an issue
~~~~~~~~~~~~~~~~~~
Use the following labels to report an issue:
================= ====================================
Label Usecase
================= ====================================
``bug`` Something isn’t working
``enhancement`` New feature
``documentation`` Improvement or additions to document
``question`` General questions
================= ====================================