pydvl.valuation.methods.knn_shapley

This module contains Shapley value computations for the K-Nearest Neighbours classifier, introduced by Jia et al. (2019).¹

In particular it provides KNNShapleyValuation to compute exact Shapley values for a KNN classifier in \(O(n \log n)\) time per test point, as opposed to \(O(n^2 \log^2 n)\) if the model were simply fed to a generic ShapleyValuation object.

See the documentation or the paper for details.

Todo

Implement approximate KNN computation for sublinear complexity

References

1. Jia, R. et al., 2019. Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms. In: Proceedings of the VLDB Endowment, Vol. 12, No. 11, pp. 1610–1623.

KNNShapleyValuation

KNNShapleyValuation(
    model: KNeighborsClassifier,
    test_data: Dataset[NDArray],
    progress: bool = True,
    clone_before_fit: bool = True,
)

Bases: Valuation

Computes exact Shapley values for a KNN classifier.

This implements the method described in (Jia, R. et al., 2019)¹.

PARAMETER	DESCRIPTION
`model`	KNeighborsClassifier model to use for valuation TYPE: `KNeighborsClassifier`
`test_data`	Dataset containing test data to evaluate the model. TYPE: `Dataset[NDArray]`
`progress`	Whether to display a progress bar. TYPE: `bool` DEFAULT: `True`
`clone_before_fit`	Whether to clone the model before fitting. TYPE: `bool` DEFAULT: `True`

Source code in src/pydvl/valuation/methods/knn_shapley.py

def __init__(
    self,
    model: KNeighborsClassifier,
    test_data: Dataset[NDArray],
    progress: bool = True,
    clone_before_fit: bool = True,
):
    super().__init__()
    if not isinstance(model, KNeighborsClassifier):
        raise TypeError("KNN Shapley requires a K-Nearest Neighbours model")
    self.model = model
    self.test_data = test_data
    self.progress = progress
    self.clone_before_fit = clone_before_fit

result `property`

result: ValuationResult

The current valuation result (not a copy).

fit

fit(
    data: Dataset[NDArray], continue_from: ValuationResult | None = None
) -> Self

Calculate exact Shapley values for a KNN model on a dataset.

This fit method bypasses direct evaluations of the utility function and calculates the Shapley values directly.
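As an illustration of what "directly" means here, the following is a minimal sketch of the recursion from Jia et al. (2019) for a single test point. It assumes an unweighted KNN classifier, Euclidean distance, and at least `k` training points. The function name `knn_shapley_single` and the toy data are hypothetical and not part of pyDVL, and the library's own implementation may differ in its details. The sort dominates the cost, and the recursion itself is a single linear pass.

```python
import math


def knn_shapley_single(x_train, y_train, x_test, y_test, k):
    """Exact KNN Shapley values of all training points for ONE test point.

    Illustrative sketch of the recursion in Jia et al. (2019); not pyDVL code.
    """
    n = len(x_train)
    # Rank training points from nearest to farthest from the test point.
    order = sorted(range(n), key=lambda i: math.dist(x_train[i], x_test))
    # match[pos] is 1.0 if the point at that rank has the test label.
    match = [1.0 if y_train[i] == y_test else 0.0 for i in order]

    s = [0.0] * n
    # Base case: the farthest point.
    s[-1] = match[-1] / n
    # Walk from the second-farthest point towards the nearest one.
    for pos in range(n - 2, -1, -1):
        rank = pos + 1  # 1-based rank of the point at this position
        s[pos] = s[pos + 1] + (match[pos] - match[pos + 1]) / k * min(k, rank) / rank

    # Map the values back from rank order to the original training order.
    values = [0.0] * n
    for pos, i in enumerate(order):
        values[i] = s[pos]
    return values


# Toy one-dimensional data (made up for this example).
x_train = [(0.0,), (1.0,), (2.0,), (3.0,)]
y_train = [0, 1, 0, 1]
values = knn_shapley_single(x_train, y_train, x_test=(0.4,), y_test=0, k=2)
```

A useful sanity check is the efficiency property: the values sum to the utility of the full training set, which in this sketch is the fraction of the `k` nearest neighbours that carry the test label.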

In contrast to other data valuation methods, the runtime grows only quasi-linearly with the size of the dataset (\(O(n \log n)\) per test point).

Calculating the KNN valuation is a computationally expensive task that can be parallelized. To do so, call the fit() method inside a joblib.parallel_config context manager as follows:

from joblib import parallel_config

with parallel_config(n_jobs=4):
    valuation.fit(data)

PARAMETER	DESCRIPTION
`data`	The dataset to use for valuation. TYPE: `Dataset[NDArray]`
`continue_from`	A previously saved valuation result to continue from. TYPE: `ValuationResult | None` DEFAULT: `None`
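To make the parallelization concrete, here is a standard-library sketch of how the test points are split into batches and how the per-batch sums are averaged, following the logic in the source listing of `fit()`. The `chunked` function is a stand-in for the helper used in the listing, and `batch_size_for` and `batch_sum` are hypothetical names introduced only for this example. The per-point numbers are made up.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items (stand-in helper)."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def batch_size_for(n_test, n_jobs):
    # Ceiling division, as in the listing: at most n_jobs batches overall.
    return (n_test // n_jobs) + (1 if n_test % n_jobs else 0)


n_test, n_jobs = 10, 4
size = batch_size_for(n_test, n_jobs)
batches = list(chunked(range(n_test), size))


def batch_sum(batch):
    # Hypothetical per-batch result: summed contributions of each test
    # point in the batch to two training points.
    return [sum(float(j) for j in batch), sum(2.0 * j for j in batch)]


# Accumulate the partial sums in any order, then average over test points.
totals = [0.0, 0.0]
for batch in batches:
    for t, v in enumerate(batch_sum(batch)):
        totals[t] += v
values = [t / n_test for t in totals]
```

Because the partial results are only summed, the order in which batches finish does not matter. Note that ceiling division can yield fewer batches than `n_jobs`: with 9 test points and 4 jobs the batch size is 3, giving 3 batches.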

Source code in src/pydvl/valuation/methods/knn_shapley.py

def fit(
    self, data: Dataset[NDArray], continue_from: ValuationResult | None = None
) -> Self:
    """Calculate exact Shapley values for a KNN model on a dataset.

    This fit method bypasses direct evaluations of the utility function and
    calculates the Shapley values directly.

    In contrast to other data valuation methods, the runtime grows only
    quasi-linearly with the size of the dataset (O(n log n) per test point).

    Calculating the KNN valuation is a computationally expensive task that
    can be parallelized. To do so, call the `fit()` method inside a
    `joblib.parallel_config` context manager as follows:

    ```python
    from joblib import parallel_config

    with parallel_config(n_jobs=4):
        valuation.fit(data)
    ```

    Args:
        data: The dataset to use for valuation.
        continue_from: A previously saved valuation result to continue from.

    """
    if isinstance(data, GroupedDataset):
        raise TypeError("GroupedDataset is not supported by KNNShapleyValuation")

    self._result = self._init_or_check_result(data, continue_from)

    x_train, y_train = data.data()
    if self.clone_before_fit:
        self.model = cast(KNeighborsClassifier, clone(self.model))
    self.model.fit(x_train, y_train)

    n_test = len(self.test_data)

    _, n_jobs = get_active_backend()
    n_jobs = n_jobs or 1  # Handle None if outside a joblib context
    batch_size = (n_test // n_jobs) + (1 if n_test % n_jobs else 0)
    x_test, y_test = self.test_data.data()
    batches = zip(chunked(x_test, batch_size), chunked(y_test, batch_size))

    process = delayed(self._compute_values_for_test_points)
    with Parallel(return_as="generator_unordered") as parallel:
        results = parallel(
            process(self.model, x_test, y_test, y_train)
            for x_test, y_test in batches
        )
        values = np.zeros(len(data))
        # FIXME: this progress bar won't add much since we have n_jobs batches and
        #  they will all take about the same time
        for res in tqdm(results, total=n_jobs, disable=not self.progress):
            values += res
        values /= n_test

    self._result += ValuationResult(
        algorithm=str(self),
        status=Status.Converged,
        values=values,
        data_names=data.names,
    )

    return self

values

values(sort: bool = False) -> ValuationResult

Returns a copy of the valuation result. Deprecated in 0.10.0 and scheduled for removal in 0.11.0.

The valuation must have been run with fit() before calling this method.

PARAMETER	DESCRIPTION
`sort`	Whether to sort the valuation result by value before returning it. TYPE: `bool` DEFAULT: `False`

Returns: The result of the valuation.

Source code in src/pydvl/valuation/base.py

@deprecated(
    target=None,
    deprecated_in="0.10.0",
    remove_in="0.11.0",
)
def values(self, sort: bool = False) -> ValuationResult:
    """Returns a copy of the valuation result.

    The valuation must have been run with `fit()` before calling this method.

    Args:
        sort: Whether to sort the valuation result by value before returning it.
    Returns:
        The result of the valuation.
    """
    if not self.is_fitted:
        raise NotFittedException(type(self))
    assert self._result is not None

    r = self._result.copy()
    if sort:
        r.sort(inplace=True)
    return r