Skip to content

pydvl.valuation.methods.knn_shapley

This module contains Shapley computations for K-Nearest Neighbours.

Todo

Implement approximate KNN computation for sublinear complexity

References


  1. Jia, R. et al., 2019. Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms. In: Proceedings of the VLDB Endowment, Vol. 12, No. 11, pp. 1610–1623. 

KNNShapleyValuation

KNNShapleyValuation(utility: KNNClassifierUtility, progress: bool = True)

Bases: Valuation

Computes exact Shapley values for a KNN classifier.

This implements the method described in (Jia, R. et al., 2019)1. It exploits the local structure of K-Nearest Neighbours to reduce the number of calls to the utility function to a constant number per index, thus reducing computation time to \(O(n)\).

PARAMETER DESCRIPTION
utility

KNNUtility with a KNN model to extract parameters from. The object will not be modified nor used other than to call get_params()

TYPE: KNNClassifierUtility

progress

Whether to display a progress bar.

TYPE: bool DEFAULT: True

Source code in src/pydvl/valuation/methods/knn_shapley.py
def __init__(self, utility: KNNClassifierUtility, progress: bool = True):
    super().__init__()
    self.utility = utility
    self.progress = progress

    config = self.utility.model.get_params()
    self.n_neighbors = config["n_neighbors"]

    del config["weights"]
    self.helper_model = NearestNeighbors(**config)

values

values(sort: bool = False) -> ValuationResult

Returns a copy of the valuation result.

The valuation must have been run with fit() before calling this method.

PARAMETER DESCRIPTION
sort

Whether to sort the valuation result before returning it.

TYPE: bool DEFAULT: False

Returns: The result of the valuation.

Source code in src/pydvl/valuation/base.py
def values(self, sort: bool = False) -> ValuationResult:
    """Returns a copy of the valuation result.

    The valuation must have been run with `fit()` before calling this method.

    Args:
        sort: Whether to sort the valuation result before returning it.
    Returns:
        The result of the valuation.
    """
    if not self.is_fitted:
        raise NotFittedException(type(self))
    assert self.result is not None

    from copy import copy

    r = copy(self.result)
    if sort:
        r.sort()
    return r

fit

fit(data: Dataset) -> Self

Calculate exact shapley values for a KNN model on a dataset.

This fit method bypasses direct evaluations of the utility function and calculates the Shapley values directly.

In contrast to other data valuation models, the runtime increases linearly with the size of the test dataset.

Calculating the KNN valuation is a computationally expensive task that can be parallelized. To do so, call the fit() method inside a joblib.parallel_config context manager as follows:

from joblib import parallel_config

with parallel_config(n_jobs=4):
    valuation.fit(data)
Source code in src/pydvl/valuation/methods/knn_shapley.py
def fit(self, data: Dataset) -> Self:
    """Calculate exact shapley values for a KNN model on a dataset.

    This fit method bypasses direct evaluations of the utility function and
    calculates the Shapley values directly.

    In contrast to other data valuation models, the runtime increases linearly
    with the size of the test dataset.

    Calculating the KNN valuation is a computationally expensive task that
    can be parallelized. To do so, call the `fit()` method inside a
    `joblib.parallel_config` context manager as follows:

    ```python
    from joblib import parallel_config

    with parallel_config(n_jobs=4):
        valuation.fit(data)
    ```

    """
    self.helper_model = self.helper_model.fit(data.x)
    n_obs = len(data.x)
    n_test = len(self.utility.test_data)

    generator = zip(self.utility.test_data.x, self.utility.test_data.y)

    generator_with_progress = tqdm(
        generator,
        total=n_test,
        disable=not self.progress,
        position=0,
    )

    with Parallel(return_as="generator") as parallel:
        results = parallel(
            delayed(self._compute_values_for_one_test_point)(
                self.helper_model, x, y, data.y
            )
            for x, y in generator_with_progress
        )
        values = np.zeros(n_obs)
        for res in results:
            values += res
        values /= n_test

    res = ValuationResult(
        algorithm="knn_shapley",
        status=Status.Converged,
        values=values,
        data_names=data.data_names,
    )

    self.result = res
    return self