pydvl.valuation.utility.knn

This module implements the utility function used in KNN-Shapley, as introduced by Jia et al. (2019) [1].

Uses of this utility

Although this class can be used in conjunction with any semi-value method and sampler, when computing Shapley values it is recommended to use the dedicated valuation class KNNShapleyValuation, because it implements a more efficient algorithm which runs in \(O(n \log n)\) time for each test point.
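For illustration, here is a minimal sketch of that recommended route. Treat it as a sketch under assumptions: the (train, test) return value of Dataset.from_sklearn and the result accessor follow recent pyDVL versions and may differ in yours.

from joblib import parallel_config
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

from pydvl.valuation import Dataset, KNNShapleyValuation

# Assumption: Dataset.from_sklearn returns a (train, test) split pair.
train, test = Dataset.from_sklearn(load_iris(), train_size=0.8, random_state=42)
model = KNeighborsClassifier(n_neighbors=5)
valuation = KNNShapleyValuation(model, test, progress=True)
with parallel_config(n_jobs=-1):
    valuation.fit(train)
result = valuation.result  # accessor name may vary across pyDVL versions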

KNN-Shapley

See the documentation for an introduction to the method and our implementation.

The utility implemented by the class KNNClassifierUtility is defined as:

\[ u(S) := \frac{1}{n_{\text{test}}} \sum_{j = 1}^{n_{\text{test}}} \frac{1}{K} \sum_{k = 1}^{| \alpha^{(j)}(S) |} \mathbb{1} \{ y_{\alpha^{(j)}_k (S)} = y^{\text{test}}_j \}, \]

where \(\alpha^{(j)}(S)\) is the intersection of the \(K\)-nearest neighbors of the test point \(x^{\text{test}}_j\), computed over the whole training set, with the sample \(S\). In particular, \(\alpha^{(j)}_k(S)\) is the index of the training point in \(S\) which is ranked \(k\)-th closest to the test point \(x^{\text{test}}_j\).
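To make the definition concrete, the following self-contained sketch (toy data and plain numpy, not pyDVL code) evaluates \(u(S)\) for \(K = 3\), two test points and six training points:

import numpy as np

K = 3
y_train = np.array([0, 0, 1, 1, 0, 1])
y_test = np.array([0, 1])
# K nearest neighbors of each test point over the *whole* training set,
# sorted by distance (in practice obtained via KNeighborsClassifier.kneighbors).
sorted_neighbors = np.array([[0, 4, 1],
                             [5, 2, 3]])
S = np.array([1, 2, 5])  # the sample whose utility we evaluate

u = 0.0
for j, neighbors in enumerate(sorted_neighbors):
    alpha_j = neighbors[np.isin(neighbors, S)]  # restrict neighbors to S
    u += (y_train[alpha_j] == y_test[j]).sum() / K
u /= len(y_test)
print(u)  # (1/3 + 2/3) / 2 = 0.5

Note that the inner sum runs over the \(| \alpha^{(j)}(S) | \leq K\) restricted neighbors but is still divided by \(K\), so neighbors missing from \(S\) count as misses.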

References


  1. Jia, R. et al., 2019. Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms. In: Proceedings of the VLDB Endowment, Vol. 12, No. 11, pp. 1610–1623. 

KNNClassifierUtility

KNNClassifierUtility(
    model: KNeighborsClassifier,
    test_data: Dataset,
    *,
    catch_errors: bool = False,
    show_warnings: bool = False,
    cache_backend: CacheBackend | None = None,
    cached_func_options: CachedFuncConfig | None = None,
    clone_before_fit: bool = True,
)

Bases: ModelUtility[Sample, KNeighborsClassifier]

Utility object for KNN Classifiers.

The utility function is the model's predicted probability for the true class, averaged over the test set.


PARAMETER DESCRIPTION
model

A KNN classifier model.

TYPE: KNeighborsClassifier

test_data

The test data to evaluate the model on.

TYPE: Dataset

catch_errors

Set to True to catch errors when fit() fails. This can happen at several points in the pipeline, e.g. when too little training data is passed, which is common during Shapley value calculations. When it happens, the scorer's default value is returned as the score and computation continues.

TYPE: bool DEFAULT: False

show_warnings

Set to False to suppress warnings thrown by fit().

TYPE: bool DEFAULT: False

cache_backend

Optional instance of CacheBackend used to wrap the _utility method of the Utility instance. By default this is set to None, meaning that utility evaluations will not be cached.

TYPE: CacheBackend | None DEFAULT: None

cached_func_options

Optional configuration object for cached utility evaluation.

TYPE: CachedFuncConfig | None DEFAULT: None

clone_before_fit

If True, the model will be cloned before calling fit() in utility evaluations.

TYPE: bool DEFAULT: True

Source code in src/pydvl/valuation/utility/knn.py
def __init__(
    self,
    model: KNeighborsClassifier,
    test_data: Dataset,
    *,
    catch_errors: bool = False,
    show_warnings: bool = False,
    cache_backend: CacheBackend | None = None,
    cached_func_options: CachedFuncConfig | None = None,
    clone_before_fit: bool = True,
):
    self.test_data = test_data
    self.sorted_neighbors: NDArray[np.int_] | None = None
    dummy_scorer = _DummyScorer()

    super().__init__(
        model=model,
        scorer=dummy_scorer,  # not applicable
        catch_errors=catch_errors,
        show_warnings=show_warnings,
        cache_backend=cache_backend,
        cached_func_options=cached_func_options,
        clone_before_fit=clone_before_fit,
    )
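A usage sketch (the import path follows the module shown at the top of this page; train and test are placeholders for pyDVL Dataset objects):

from sklearn.neighbors import KNeighborsClassifier

from pydvl.valuation.utility.knn import KNNClassifierUtility

model = KNeighborsClassifier(n_neighbors=5)
utility = KNNClassifierUtility(model, test_data=test)
utility = utility.with_dataset(train)  # fits the model on the training data

Note that the scorer argument of the base class is replaced by a dummy: the KNN utility is computed directly from neighbor labels rather than through a scorer.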

cache_stats property

cache_stats: CacheStats | None

Cache statistics are gathered when cache is enabled. See CacheStats for all fields returned.

training_data property

training_data: Dataset | None

Retrieves the training data used by this utility.

This property is read-only. In order to set it, use with_dataset().

__call__

__call__(sample: SampleT | None) -> float
PARAMETER DESCRIPTION
sample

contains a subset of valid indices for the x_train attribute of Dataset.

TYPE: SampleT | None

Source code in src/pydvl/valuation/utility/modelutility.py
def __call__(self, sample: SampleT | None) -> float:
    """
    Args:
        sample: contains a subset of valid indices for the
            `x_train` attribute of [Dataset][pydvl.utils.dataset.Dataset].
    """
    if sample is None or len(sample.subset) == 0:
        return self.scorer.default

    return cast(float, self._utility_wrapper(sample))
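For illustration, evaluating the utility on a single coalition (assuming Sample lives in pydvl.valuation.types with idx and subset fields, as the SampleT annotation above suggests):

import numpy as np

from pydvl.valuation.types import Sample

sample = Sample(idx=None, subset=np.array([0, 1, 2]))
u = utility(sample)    # utility from the previous sketch, already fitted
empty = utility(None)  # returns the scorer's default value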

_maybe_clone_model staticmethod

_maybe_clone_model(model: ModelT, do_clone: bool) -> ModelT

Clones the passed model to avoid the possibility of reusing a fitted estimator.

PARAMETER DESCRIPTION
model

Any supervised model. Typical choices can be found in the scikit-learn documentation on supervised learning.

TYPE: ModelT

do_clone

Whether to clone the model or not.

TYPE: bool

Source code in src/pydvl/valuation/utility/modelutility.py
@staticmethod
def _maybe_clone_model(model: ModelT, do_clone: bool) -> ModelT:
    """Clones the passed model to avoid the possibility of reusing a fitted
    estimator.

    Args:
        model: Any supervised model. Typical choices can be found
            on [this page](https://scikit-learn.org/stable/supervised_learning.html)
        do_clone: Whether to clone the model or not.
    """
    if not do_clone:
        return model
    try:
        model = clone(model)
    except TypeError:
        # This happens if the passed model is not an sklearn model
        # In this case, we just make a deepcopy of the model.
        model = clone(model, safe=False)
    return cast(ModelT, model)

_utility

_utility(sample: SampleT) -> float
PARAMETER DESCRIPTION
sample

contains a subset of valid indices for the x attribute of Dataset.

TYPE: SampleT

RETURNS DESCRIPTION
float

0 if no indices are passed, otherwise the KNN utility for the sample.

Source code in src/pydvl/valuation/utility/knn.py
def _utility(self, sample: SampleT) -> float:
    """

    Args:
        sample: contains a subset of valid indices for the
            `x` attribute of [Dataset][pydvl.valuation.dataset.Dataset].

    Returns:
        0 if no indices are passed, otherwise the KNN utility for the sample.
    """
    if self.training_data is None:
        raise ValueError("No training data provided")

    check_is_fitted(
        self.model,
        msg="The KNN model has to be fitted before calling the utility.",
    )

    _, y_train = self.training_data.data()
    x_test, y_test = self.test_data.data()
    n_neighbors = self.model.get_params()["n_neighbors"]

    if self.sorted_neighbors is None:
        self.sorted_neighbors = self.model.kneighbors(x_test, return_distance=False)

    # Labels of the (restricted) nearest neighbors to each test point
    nns_labels = np.full((len(x_test), n_neighbors), None)
    for i, neighbors in enumerate(self.sorted_neighbors):
        restricted_ns = neighbors[np.isin(neighbors, sample.subset)]
        nns_labels[i][: len(restricted_ns)] = y_train[restricted_ns]
    # Likelihood of the correct labels
    probs = np.asarray(nns_labels == y_test[:, None]).sum(axis=1) / n_neighbors
    return float(probs.mean())
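The crucial detail above is that kneighbors() is queried once for the full training set and cached in sorted_neighbors; evaluating a sample only filters that list with np.isin, which preserves the distance ordering. A standalone illustration of the filtering step (toy indices):

import numpy as np

neighbors = np.array([7, 2, 9, 0, 5])  # indices sorted by distance to a test point
subset = np.array([9, 5, 7])           # the sample S
print(neighbors[np.isin(neighbors, subset)])  # [7 9 5]: distance order preserved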

sample_to_data

sample_to_data(sample: SampleT) -> tuple

Returns the raw data corresponding to a sample.

Subclasses can override this, e.g. to reshape tensors. Do not rely on self.training_data remaining unchanged between calls to this method. To manipulate it, use the with_dataset() method.

PARAMETER DESCRIPTION
sample

contains a subset of valid indices for the x_train attribute of Dataset.

TYPE: SampleT

Returns: Tuple of the training data and labels corresponding to the sample indices.

Source code in src/pydvl/valuation/utility/modelutility.py
def sample_to_data(self, sample: SampleT) -> tuple:
    """Returns the raw data corresponding to a sample.

    Subclasses can override this e.g. to do reshaping of tensors. Be careful not to
    rely on `self.training_data` not changing between calls to this method. For
    manipulations to it, use the `with_dataset()` method.

    Args:
        sample: contains a subset of valid indices for the
            `x_train` attribute of [Dataset][pydvl.utils.dataset.Dataset].
    Returns:
        Tuple of the training data and labels corresponding to the sample indices.
    """
    if self.training_data is None:
        raise ValueError("No training data provided")

    x_train, y_train = self.training_data.data(sample.subset)
    return x_train, y_train

with_dataset

with_dataset(data: Dataset, copy: bool = True) -> Self

Return the utility, or a copy of it, with the given dataset and the model fitted on it.

PARAMETER DESCRIPTION
data

The dataset to use.

TYPE: Dataset

copy

Whether to copy the utility object or not. Additionally, if True then the model is also cloned. If False, the model is only cloned if clone_before_fit is True.

TYPE: bool DEFAULT: True

Returns: The utility object.

Source code in src/pydvl/valuation/utility/knn.py
def with_dataset(self, data: Dataset, copy: bool = True) -> Self:
    """Return the utility, or a copy of it, with the given dataset and the model
    fitted on it.

    Args:
        data: The dataset to use.
        copy: Whether to copy the utility object or not. Additionally, if `True`
            then the model is also cloned. If `False`, the model is only cloned if
            `clone_before_fit` is `True`.
    Returns:
        The utility object.
    """
    utility: Self = super().with_dataset(data, copy)
    if copy or self.clone_before_fit:
        utility.model = self._maybe_clone_model(self.model, do_clone=True)
    utility.model.fit(*data.data())
    return utility
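Unlike the generic ModelUtility, this override fits the model on the full dataset right away: _utility() needs a fitted model in order to query kneighbors(), as enforced by the check_is_fitted call above.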