
pydvl.valuation.methods.classwise_shapley

Class-wise Shapley (Schoch et al., 2022) [1] offers a Shapley framework tailored to classification problems. Let \(D\) be a dataset, \(D_{y_i}\) the subset of \(D\) with label \(y_i\), and \(D_{-y_i}\) the complement of \(D_{y_i}\) in \(D\). The key idea is that a sample \((x_i, y_i)\) might improve the overall performance on \(D\) while being detrimental to the performance on \(D_{y_i}\). The class-wise value is defined as:

\[ v_u(i) = \frac{1}{2^{|D_{-y_i}|}} \sum_{S_{-y_i}} \frac{1}{|D_{y_i}|} \sum_{S_{y_i}} \binom{|D_{y_i}|-1}{|S_{y_i}|}^{-1} [u( S_{y_i} \cup \{i\} | S_{-y_i} ) - u( S_{y_i} | S_{-y_i})], \]

where \(S_{y_i} \subseteq D_{y_i} \setminus \{i\}\) and \(S_{-y_i} \subseteq D_{-y_i}\).
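For very small datasets the definition can be evaluated by brute force. The following is a minimal sketch, not pyDVL code: `classwise_value`, `powerset`, `toy_utility` and the `weights` table are hypothetical names introduced for illustration, and the inner sum is normalised by \(1/|D_{y_i}|\), the standard Shapley weighting for a sum over subsets.

```python
from itertools import chain, combinations
from math import comb


def powerset(items):
    """All subsets of `items`, as tuples."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )


def classwise_value(i, labels, utility):
    """Exact class-wise value of index i (hypothetical helper).

    labels:  dict mapping index -> label
    utility: callable (in_class_set, out_of_class_set) -> float
    """
    y_i = labels[i]
    in_class = [j for j, y in labels.items() if y == y_i]
    out_class = [j for j, y in labels.items() if y != y_i]
    others = [j for j in in_class if j != i]
    n = len(in_class)

    total = 0.0
    for s_out in map(frozenset, powerset(out_class)):
        inner = 0.0
        for s_in in map(frozenset, powerset(others)):
            marginal = utility(s_in | {i}, s_out) - utility(s_in, s_out)
            inner += marginal / comb(n - 1, len(s_in))
        total += inner / n
    return total / 2 ** len(out_class)


# Toy additive utility (an assumption for illustration only): each in-class
# point contributes a fixed weight, so its class-wise value is that weight.
weights = {0: 0.5, 1: 0.25, 2: 1.0, 3: 2.0}
labels = {0: 0, 1: 0, 2: 1, 3: 1}


def toy_utility(s_in, s_out):
    return sum(weights[j] for j in s_in) + 0.1 * len(s_out)


value_0 = classwise_value(0, labels, toy_utility)
print(value_0)
```

With an additive utility every marginal contribution of index 0 equals its weight, so the result should match `weights[0]` up to floating-point error. The cost is exponential in both class sizes, which is why sampling is used in practice.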

Analysis of Class-wise Shapley

For a detailed analysis of the method, with comparison to other valuation techniques, please refer to the main documentation.

In practice, the quantity above is estimated using Monte Carlo sampling of the powerset and the set of index permutations. This results in the estimator

\[ \hat{v}_u(i) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{L} \sum_{l=1}^{L} [u(\sigma^{(l)}_{:i} \cup \{i\} | S^{(k)} ) - u( \sigma^{(l)}_{:i} | S^{(k)})], \]

with \(S^{(1)}, \dots, S^{(K)} \subseteq T_{-y_i},\) \(\sigma^{(1)}, \dots, \sigma^{(L)} \in \Pi(T_{y_i}\setminus\{i\}),\) and \(\sigma^{(l)}_{:i}\) denoting the set of indices in permutation \(\sigma^{(l)}\) before the position where \(i\) appears. The sets \(T_{y_i}\) and \(T_{-y_i}\) contain the training points with label \(y_i\) and with any other label, respectively.
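The double average can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the library's sampler: `mc_classwise_value` and `toy_utility` are hypothetical names, out-of-class subsets are drawn uniformly from the powerset by including each index with probability 1/2, and permutations are drawn over the whole class (including \(i\)) so that the prefix before \(i\) is well defined.

```python
import random


def mc_classwise_value(i, labels, utility, n_subsets, n_permutations, seed=0):
    """Monte Carlo estimate of the class-wise value of index i (sketch)."""
    rng = random.Random(seed)
    y_i = labels[i]
    in_class = [j for j, y in labels.items() if y == y_i]
    out_class = [j for j, y in labels.items() if y != y_i]

    total = 0.0
    for _ in range(n_subsets):
        # Uniform sample from the powerset of the out-of-class indices.
        s_out = frozenset(j for j in out_class if rng.random() < 0.5)
        for _ in range(n_permutations):
            perm = in_class[:]
            rng.shuffle(perm)
            # In-class indices that precede i in this permutation.
            prefix = frozenset(perm[: perm.index(i)])
            total += utility(prefix | {i}, s_out) - utility(prefix, s_out)
    return total / (n_subsets * n_permutations)


# Toy additive utility, assumed only for this example.
weights = {0: 0.5, 1: 0.25, 2: 1.0, 3: 2.0}
labels = {0: 0, 1: 0, 2: 1, 3: 1}


def toy_utility(s_in, s_out):
    return sum(weights[j] for j in s_in) + 0.1 * len(s_out)


estimate = mc_classwise_value(2, labels, toy_utility, n_subsets=8, n_permutations=8)
print(estimate)
```

For this additive toy utility every sampled marginal equals `weights[2]`, so the estimate has no sampling noise. With a real model-based utility the variance depends on both \(K\) and \(L\).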

Notes for derivation of test cases

The unit tests include the following manually constructed data: let \(D=\{(1,0),(2,0),(3,0),(4,1)\}\) be the test set and \(T=\{(1,0),(2,0),(3,1),(4,1)\}\) the training set. This dataset is chosen because it allows solving the model

\[y = \max(0, \min(1, \text{round}(\beta^T x)))\]

in closed form: \(\beta = \frac{\text{dot}(x, y)}{\text{dot}(x, x)}\). From the closed-form solution, the tables of in-class accuracy \(a_S(D_{y_i})\) and out-of-class accuracy \(a_S(D_{-y_i})\) can be calculated. Using these tables and setting \(\{S^{(1)}, \dots, S^{(K)}\} = 2^{T_{-y_i}}\) and \(\{\sigma^{(1)}, \dots, \sigma^{(L)}\} = \Pi(T_{y_i}\setminus\{i\})\), where \(2^M\) denotes the powerset of \(M\), the Monte Carlo estimator can be evaluated. The details of the derivation are left to the eager reader.
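As a starting point for that derivation, the closed-form fit can be checked by hand on the full training set. The sketch below uses hypothetical helper names (`fit_beta`, `predict`) and assumes \(\beta = 0\) for an empty training subset, a case the text does not specify.

```python
def fit_beta(train):
    """Closed-form beta = dot(x, y) / dot(x, x) for 1-d inputs."""
    sxx = sum(x * x for x, _ in train)
    sxy = sum(x * y for x, y in train)
    # Assumption: an empty subset yields beta = 0 (not specified in the text).
    return sxy / sxx if sxx else 0.0


def predict(beta, x):
    """y = max(0, min(1, round(beta * x)))"""
    return max(0, min(1, round(beta * x)))


T = [(1, 0), (2, 0), (3, 1), (4, 1)]  # train set
D = [(1, 0), (2, 0), (3, 0), (4, 1)]  # test set

beta = fit_beta(T)  # (3 + 4) / (1 + 4 + 9 + 16) = 7 / 30
preds = [predict(beta, x) for x, _ in D]

# Number of correct test predictions per label.
correct = {
    label: sum(p == y for p, (_, y) in zip(preds, D) if y == label)
    for label in (0, 1)
}
print(beta, preds, correct)
```

On the full train set the model misclassifies only the test point \((3, 0)\): two of the three label-0 test points and the single label-1 test point are predicted correctly. Repeating the fit for each subset of \(T\) gives the raw counts behind the accuracy tables.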

References


  1. Schoch, Stephanie, Haifeng Xu, and Yangfeng Ji. CS-Shapley: Class-wise Shapley Values for Data Valuation in Classification. In Proc. of the Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS). New Orleans, Louisiana, USA, 2022. 

ClasswiseShapleyValuation

ClasswiseShapleyValuation(
    utility: ClasswiseModelUtility,
    sampler: ClasswiseSampler,
    is_done: StoppingCriterion,
    progress: dict[str, Any] | bool = False,
    *,
    normalize_values: bool = True
)

Bases: Valuation

Class to compute Class-wise Shapley values.

For each label, it samples independent permutations of the indices with that label, together with index sets drawn from the powerset of the complement (the indices with any other label).

PARAMETERS

utility (ClasswiseModelUtility): Classwise utility object with model and classwise scoring function. Its scorer must be a ClasswiseSupervisedScorer, otherwise a ValueError is raised.

sampler (ClasswiseSampler): Classwise sampling scheme to use.

is_done (StoppingCriterion): Stopping criterion to use.

progress (dict[str, Any] | bool, default False): Whether to show a progress bar. A dict is treated as additional keyword arguments for the progress bar.

normalize_values (bool, default True): Whether to normalize values after valuation.

Source code in src/pydvl/valuation/methods/classwise_shapley.py
def __init__(
    self,
    utility: ClasswiseModelUtility,
    sampler: ClasswiseSampler,
    is_done: StoppingCriterion,
    progress: dict[str, Any] | bool = False,
    *,
    normalize_values: bool = True,
):
    super().__init__()
    self.utility = utility
    self.sampler = sampler
    self.labels: NDArray | None = None
    if not isinstance(utility.scorer, ClasswiseSupervisedScorer):
        raise ValueError("scorer must be an instance of ClasswiseSupervisedScorer")
    self.scorer: ClasswiseSupervisedScorer = utility.scorer
    self.is_done = is_done
    self.tqdm_args: dict[str, Any] = {
        "desc": f"{self.__class__.__name__}: {str(is_done)}"
    }
    # HACK: parse additional args for the progress bar if any (we probably want
    #  something better)
    if isinstance(progress, bool):
        self.tqdm_args.update({"disable": not progress})
    else:
        self.tqdm_args.update(progress if isinstance(progress, dict) else {})
    self.normalize_values = normalize_values
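The handling of `progress` in the constructor can be isolated into a small function to see what each kind of argument produces. This is a sketch mirroring the branches above; `build_tqdm_args` is a hypothetical name and not part of pyDVL.

```python
from __future__ import annotations

from typing import Any


def build_tqdm_args(progress: dict[str, Any] | bool, desc: str) -> dict[str, Any]:
    """Mirror of the constructor's progress-bar argument handling (sketch)."""
    args: dict[str, Any] = {"desc": desc}
    if isinstance(progress, bool):
        # True shows the bar, False disables it.
        args["disable"] = not progress
    else:
        # A dict is merged as-is; anything else is silently ignored.
        args.update(progress if isinstance(progress, dict) else {})
    return args


off = build_tqdm_args(False, "demo")
custom = build_tqdm_args({"leave": False}, "demo")
print(off)
print(custom)
```

Note that passing a dict never sets `disable`, so the bar stays enabled unless the dict itself contains a `disable` entry. A dict can also override `desc`, because it is merged after the default description is set.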

values

values(sort: bool = False) -> ValuationResult

Returns a copy of the valuation result.

The valuation must have been run with fit() before calling this method.

PARAMETERS

sort (bool, default False): Whether to sort the valuation result before returning it.

Returns: The result of the valuation.

Source code in src/pydvl/valuation/base.py
def values(self, sort: bool = False) -> ValuationResult:
    """Returns a copy of the valuation result.

    The valuation must have been run with `fit()` before calling this method.

    Args:
        sort: Whether to sort the valuation result before returning it.
    Returns:
        The result of the valuation.
    """
    if not self.is_fitted:
        raise NotFittedException(type(self))
    assert self.result is not None

    from copy import copy

    r = copy(self.result)
    if sort:
        r.sort()
    return r
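The guard-then-copy pattern of `values()` can be illustrated with toy stand-ins. Everything in this sketch (`ToyValuation`, `ToyResult`, `ToyNotFittedError`) is hypothetical. In particular, the toy result copies its list when copied, which is an assumption of the sketch and says nothing about how `ValuationResult` behaves under `copy`.

```python
from copy import copy


class ToyNotFittedError(RuntimeError):
    """Stand-in for the library's not-fitted exception."""


class ToyResult:
    def __init__(self, scores):
        self.scores = list(scores)

    def __copy__(self):
        # Assumption of this toy: the copy owns its own list.
        return ToyResult(self.scores)

    def sort(self):
        self.scores.sort()


class ToyValuation:
    def __init__(self):
        self.result = None

    @property
    def is_fitted(self):
        return self.result is not None

    def fit(self, scores):
        self.result = ToyResult(scores)
        return self

    def values(self, sort=False):
        if not self.is_fitted:
            raise ToyNotFittedError(type(self).__name__)
        r = copy(self.result)
        if sort:
            r.sort()
        return r


valuation = ToyValuation()
try:
    valuation.values()
    raised = False
except ToyNotFittedError:
    raised = True

sorted_scores = valuation.fit([3, 1, 2]).values(sort=True).scores
print(raised, sorted_scores, valuation.result.scores)
```

In this toy, sorting the returned copy leaves the stored result in its original order, which is the usual reason for returning a copy rather than the internal object.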