Shapley value¶
The Shapley method computes data values using a concept originating in cooperative game theory. Shapley values are a common way of assigning payoffs to each participant in a cooperative game (i.e. one in which players can form coalitions), in a way that ensures certain axioms are fulfilled.
pyDVL implements several methods for the computation and approximation of Shapley values. Empirically, one of the most useful methods is the so-called Truncated Monte Carlo Shapley (Ghorbani and Zou, 2019)2, but several approximations exist with different convergence rates and computational costs.
Combinatorial Shapley¶
The first algorithm is just a verbatim implementation of the definition below. As such it returns as exact a value as the utility function allows (see what this means in Problems of Data Values).
The value \(v\) of the \(i\)-th sample in dataset \(D\) wrt. utility \(u\) is computed as a weighted sum of its marginal utility wrt. every possible coalition of training samples within the training set:

\[v_u(x_i) = \frac{1}{n} \sum_{S \subseteq D_{-i}} \binom{n-1}{|S|}^{-1} [u(S_{+i}) - u(S)], \tag{1}\]
where \(n = |D|,\) \(D_{-i}\) denotes the set of samples in \(D\) excluding \(x_i,\) and \(S_{+i}\) denotes the set \(S\) with \(x_i\) added.1
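To make the formula concrete, here is a minimal brute-force evaluation of definition (1) for a toy utility over index sets. This is purely illustrative code of ours, not pyDVL's implementation; the model-based utility used by pyDVL appears in the example below.

```python
from itertools import combinations
from math import comb

def shapley_value(i, n, u):
    """Evaluate definition (1) exactly for index i in {0, ..., n-1}."""
    rest = [j for j in range(n) if j != i]  # D_{-i}
    total = 0.0
    for k in range(n):                      # subset sizes 0, ..., n-1
        for s in combinations(rest, k):
            s = set(s)
            total += (u(s | {i}) - u(s)) / comb(n - 1, k)  # weighted Δ_i(S)
    return total / n

u = lambda s: len(s) ** 0.5  # toy utility over index sets
values = [shapley_value(i, 4, u) for i in range(4)]
assert abs(sum(values) - u(set(range(4)))) < 1e-12  # efficiency axiom
```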
Computing exact Shapley values
```python
from joblib import parallel_config
from pydvl.valuation import (
    DeterministicUniformSampler,
    ModelUtility,
    NoStopping,
    ShapleyValuation,
    SupervisedScorer,
)

train, test = SomeVerySmallDatasets()
model = ...
scorer = SupervisedScorer(model, test, default=...)
utility = ModelUtility(model, scorer)
sampler = DeterministicUniformSampler()
valuation = ShapleyValuation(utility, sampler, NoStopping(sampler))
with parallel_config(n_jobs=-1):
    valuation.fit(train)
result = valuation.result
```
We can convert the return value to a pandas.DataFrame with the to_dataframe method. Please refer to the introduction to data valuation and to the documentation in ValuationResult for more information.
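For example, continuing from the snippet above:

```python
df = result.to_dataframe()  # one row per data point
print(df.head())
```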
Monte Carlo Combinatorial Shapley¶
Because the number of subsets \(S \subseteq D_{-i}\) is \(2^{|D|-1},\) one must typically resort to approximations. The simplest is Monte Carlo sampling of the powerset \(\mathcal{P}(D).\) In pyDVL this simple technique is called "Monte Carlo Combinatorial". It has a very poor convergence rate, and other methods are generally preferred.
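To see where this estimator and its high variance come from, here is a plain-Python sketch of ours, not pyDVL's implementation. Sampling uniformly from \(\mathcal{P}(D_{-i})\) amounts to including each element independently with probability 1/2, and definition (1) becomes \(2^{n-1}/n\) times the expectation of the binomially-weighted marginal utility; the wildly varying weights are one reason for the poor convergence.

```python
from math import comb

import numpy as np

def mc_combinatorial_shapley(i, n, u, n_samples, rng):
    """Monte Carlo estimate of definition (1) for index i via uniform
    sampling from the powerset of D_{-i}."""
    rest = np.array([j for j in range(n) if j != i])
    total = 0.0
    for _ in range(n_samples):
        mask = rng.random(n - 1) < 0.5           # S ~ Uniform(P(D_{-i}))
        s = set(rest[mask].tolist())
        delta = u(s | {i}) - u(s)                # marginal utility Δ_i(S)
        total += delta / comb(n - 1, len(s))     # binomial weight from (1)
    # Rescale: the uniform distribution puts mass 2^{-(n-1)} on each subset
    return 2 ** (n - 1) / n * total / n_samples

rng = np.random.default_rng(42)
print(mc_combinatorial_shapley(0, 8, lambda s: len(s) ** 0.5, 50_000, rng))
```

If desired nevertheless, usage in pyDVL follows the same pattern as the previous example: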
Monte Carlo Combinatorial Shapley values
```python
from joblib import parallel_config
from pydvl.valuation import (
    MaxSamples,
    ModelUtility,
    ShapleyValuation,
    SupervisedScorer,
    UniformSampler,
)

model = SomeSKLearnModel()
scorer = SupervisedScorer("accuracy", test_data, default=0.0)
utility = ModelUtility(model, scorer)
sampler = UniformSampler(seed=42)
stopping = MaxSamples(sampler, 5000)
valuation = ShapleyValuation(utility, sampler, stopping)
with parallel_config(n_jobs=16):
    valuation.fit(training_data)
result = valuation.result
```
Note the use of [MaxSamples][pydvl.valuation.stopping.MaxSamples] as the stopping condition; it takes the sampler as an argument. This is a special instance of a StoppingCriterion. Other criteria, not tied to the sampler, include MaxTime (stops after a given wall-clock time), MinUpdates (counts the number of updates to the individual values), and AbsoluteStandardError (not very reliable as a stopping criterion), among others.
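Stopping criteria can also be composed with the boolean operators `&` and `|` (an example appears in the stratified snippet below). As a sketch, and assuming `MaxTime` takes a number of seconds, a criterion that stops after one hour or after 10000 value updates, whichever comes first, could look like this:

```python
from pydvl.valuation import MaxTime, MinUpdates

# Stop when either condition triggers
stopping = MaxTime(seconds=3600) | MinUpdates(10_000)
```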
A stratified approach¶
Let's decompose definition (1) into "layers", one per subset size \(k,\) by writing it in the equivalent form:1

\[v_u(x_i) = \frac{1}{n} \sum_{k=0}^{n-1} \binom{n-1}{k}^{-1} \sum_{S \in D_i^{k}} [u(S_{+i}) - u(S)].\]
Here \(D_i^{k}\) is the set of all subsets of size \(k\) of \(D_{-i},\) the complement of \(\{i\}.\) Since there are \(\binom{n-1}{k}\) such sets, the above is an average, over the \(n\) possible set sizes \(k,\) of the average marginal contribution of point \(i\) to all sets of size \(k.\)
We can now devise a sampling scheme over the powerset of \(D_{-i}\) that yields this expression:
- Sample \(k\) uniformly from \(\{0, ..., n-1\}.\)
- Sample \(S\) uniformly from \(D_i^{k},\) i.e. from the subsets of size \(k\) of \(D_{-i}.\)
Call this distribution \(\mathcal{L}_k.\) Then

\[v_u(x_i) = \mathbb{E}_{S \sim \mathcal{L}_k} [\Delta_i(S)].\]
The choice \(p(k) = 1/n\) is implemented in StratifiedShapleyValuation but can be changed to any other distribution over \(k.\) (Wu et al., 2023)3 introduced VRDS sampling as a way to reduce the variance of the estimator.
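As a plain-Python illustration of this two-stage scheme (ours, not pyDVL's implementation), using the same toy utility as above:

```python
import random

def sample_stratified(i, n, rng):
    """Draw S ~ L_k: first a size k with p(k) = 1/n, then a uniform
    subset of that size from D_{-i}."""
    rest = [j for j in range(n) if j != i]
    k = rng.randrange(n)             # k ~ Uniform{0, ..., n-1}
    return set(rng.sample(rest, k))  # S ~ Uniform over subsets of size k

rng = random.Random(0)
u = lambda s: len(s) ** 0.5  # toy utility
n, m = 8, 50_000
subsets = (sample_stratified(0, n, rng) for _ in range(m))
estimate = sum(u(s | {0}) - u(s) for s in subsets) / m  # E[Δ_0(S)] ≈ v_u(x_0)
print(estimate)
```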
Stratified Shapley
The specific instance of stratified sampling described above can be directly used by instantiating a StratifiedShapleyValuation object. For more general use cases, use ShapleyValuation with a custom sampler, for instance VRDSSampler. Note the use of the [History][pydvl.valuation.stopping.History] object, a stopping criterion that never halts the computation but records the trace of value updates in a rolling memory. The data can then be used to check for convergence, for debugging, plotting, etc.
```python
from joblib import parallel_config
from pydvl.valuation import (
    Dataset,
    History,
    MinUpdates,
    ModelUtility,
    StratifiedShapleyValuation,
    SupervisedScorer,
)

training_data, test_data = Dataset.from_arrays(...)
model = ...
scorer = SupervisedScorer(model, test_data, default=..., range=...)
utility = ModelUtility(model, scorer)
valuation = StratifiedShapleyValuation(
    utility=utility,
    is_done=MinUpdates(min_updates) | History(n_steps=min_updates),
    batch_size=batch_size,
    seed=seed,
    skip_converged=True,
    progress=True,
)
with parallel_config(n_jobs=-4):
    valuation.fit(training_data)
results = valuation.result
```
Permutation Shapley¶
An equivalent way of computing Shapley values (ApproShapley) appeared in (Castro et al., 2009)4 and is the basis for the method most often used in practice. It uses permutations over indices instead of subsets:

\[v_u(x_i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)} [u(S_{i}^{\sigma} \cup \{i\}) - u(S_{i}^{\sigma})],\]

where \(\Pi(n)\) is the set of all permutations of the \(n\) indices and \(S_{i}^{\sigma}\) denotes the set of indices in permutation \(\sigma\) before the position where \(i\) appears. To approximate this sum (which has \(\mathcal{O}(n!)\) terms!) one uses Monte Carlo sampling of permutations, which has surprisingly low sample complexity. One notable difference wrt. the combinatorial approach above is that these approximations always fulfill the efficiency axiom of Shapley values, namely \(\sum_{i=1}^n \hat{v}_i = u(D)\) (see (Castro et al., 2009)4, Proposition 3.2).
A note about implementation
The definition above uses all permutations to update a single data point \(i\). In practice, instead of searching for the position of a fixed index in every permutation, one uses a single permutation to update all data points, iterating through it and updating the value of the index at the current position. This has the added benefit that the utility computed for the previous index can be reused to compute the marginal utility of the current one, halving the number of utility calls. This strategy is implemented in PermutationEvaluationStrategy, and is automatically selected when using any of the permutation samplers.
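The following sketch (ours, not pyDVL's code) illustrates both the Monte Carlo permutation estimator and this scanning strategy: each sampled permutation is traversed once, updating every data point, and the utility of the growing prefix is reused so that each position costs a single utility call.

```python
import random

def permutation_shapley(n, u, n_permutations, rng):
    """MC permutation estimator: one pass per permutation updates all points."""
    values = [0.0] * n
    for _ in range(n_permutations):
        sigma = rng.sample(range(n), n)      # a uniformly random permutation
        prefix, prev_u = set(), u(set())     # S_i^sigma grows as we scan
        for i in sigma:
            prefix.add(i)
            new_u = u(prefix)
            values[i] += new_u - prev_u      # Δ_i(S_i^sigma), reusing prev_u
            prev_u = new_u
    return [v / n_permutations for v in values]

rng = random.Random(0)
values = permutation_shapley(6, lambda s: len(s) ** 0.5, 2_000, rng)
print(sum(values))  # efficiency: telescoping makes this exactly u(D) - u(∅)
```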
Truncated Monte Carlo Shapley¶
By adding two types of early stopping, the result is the so-called Truncated Monte Carlo Shapley (Ghorbani and Zou, 2019)2, which is efficient enough to be useful in applications.
The first is simply a convergence criterion, of which there are several to choose from. The second truncates the iteration over a single permutation: RelativeTruncation stops iterating over the samples in a permutation when the marginal utility becomes too small. The method is available through the class TMCShapleyValuation.
However, being a heuristic on top of permutation sampling, it can also be implemented "manually", by choosing a RelativeTruncation for a PermutationSampler when configuring ShapleyValuation (note however that this introduces some correcting factors, see Sampling strategies for semi-values).
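A sketch of this manual construction, assuming (as in pyDVL's permutation samplers) that the truncation policy is passed to the sampler, and reusing the utility from the examples above:

```python
from pydvl.valuation import (
    MinUpdates,
    PermutationSampler,
    RelativeTruncation,
    ShapleyValuation,
)

truncation = RelativeTruncation(rtol=0.05)
sampler = PermutationSampler(truncation=truncation, seed=42)
valuation = ShapleyValuation(utility, sampler, MinUpdates(5000))
```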
You can see this method in action in this example using the Spotify dataset.
Truncated Monte Carlo Shapley values
Use of this object follows the same pattern as the previous examples, except that separate instantiation of the sampler is no longer necessary. This has the drawback that we cannot use [MaxSamples][pydvl.valuation.stopping.MaxSamples] as the stopping criterion, since it requires the sampler. To work around this, use ShapleyValuation directly.
```python
from joblib import parallel_config
from pydvl.valuation import (
    MinUpdates,
    ModelUtility,
    RelativeTruncation,
    SupervisedScorer,
    TMCShapleyValuation,
)

model = SomeSKLearnModel()
scorer = SupervisedScorer("accuracy", test_data, default=0)
utility = ModelUtility(model, scorer, ...)
truncation = RelativeTruncation(rtol=0.05)
stopping = MinUpdates(5000)
valuation = TMCShapleyValuation(utility, truncation, stopping)
with parallel_config(n_jobs=16):
    valuation.fit(training_data)
result = valuation.result
```
Other approximation methods¶
As already mentioned, with the architecture of ShapleyValuation it is possible to try different importance-sampling schemes by swapping the sampler. Besides TMCS, pyDVL provides Owen sampling (Okhrati and Lipani, 2021)5, and Beta Shapley (Kwon and Zou, 2022)6, which recovers Shapley values when \(\alpha = \beta = 1.\)
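For instance, Owen sampling might be plugged in as follows. This is only a sketch: the constructor arguments shown for OwenSampler (the number of outer probability levels and of subsets per level) are assumptions to be checked against the API reference.

```python
from pydvl.valuation import MinUpdates, OwenSampler, ShapleyValuation

# Hypothetical parameters: outer samples over q in [0, 1], inner subsets per q
sampler = OwenSampler(n_samples_outer=200, n_samples_inner=2, seed=42)
valuation = ShapleyValuation(utility, sampler, MinUpdates(1000))
```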
A different approach reduces the computation to a constraint satisfaction (feasibility) problem over estimated differences of values, as done in Group Testing Shapley (Jia et al., 2019)7.
Yet another, which is applicable to any utility-based valuation method, is Data Utility Learning (Wang et al., 2022)8. This method learns a model of the utility function during a warmup phase, and then uses it to speed up marginal utility computations.
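Schematically, and hedged since the exact signature of DataUtilityLearning may differ between versions, the wrapper slots in wherever a plain utility would go:

```python
from pydvl.valuation import DataUtilityLearning

proxy_model = ...  # hypothetical placeholder: a regressor used as proxy for u
# After `training_budget` true evaluations, the learned proxy answers instead
# of retraining the model
learned_utility = DataUtilityLearning(utility, training_budget=300, model=proxy_model)
valuation = ShapleyValuation(learned_utility, sampler, stopping)
```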
Model-specific methods¶
Shapley values can have a closed-form expression or a simpler approximation scheme when the model class is restricted. The prime example is kNN-Shapley (Jia et al., 2019)9, which is exact for the kNN model and runs in \(\mathcal{O}(n_{\text{test}}\, n \log n)\) time.
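A sketch of its usage with KNNShapleyValuation, assuming the constructor takes the model and the test set as in the other valuation classes:

```python
from sklearn.neighbors import KNeighborsClassifier

from pydvl.valuation import KNNShapleyValuation

model = KNeighborsClassifier(n_neighbors=5)
valuation = KNNShapleyValuation(model, test_data, progress=True)
valuation.fit(training_data)
result = valuation.result
```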
-
The quantity \(u(S_{+i}) − u(S)\) is called the marginal utility of the sample \(x_i\) (with respect to \(S\)), and we will often denote it by \(\Delta_i(S, u),\) or, when no confusion is possible, simply \(\Delta_i(S).\) ↩↩
-
Ghorbani, A., Zou, J., 2019. Data Shapley: Equitable Valuation of Data for Machine Learning, in: Proceedings of the 36th International Conference on Machine Learning, PMLR. Presented at the International Conference on Machine Learning (ICML 2019), PMLR, pp. 2242--2251. ↩↩
-
Wu, M., Jia, R., Lin, C., Huang, W., Chang, X., 2023. Variance reduced Shapley value estimation for trustworthy data valuation. Computers & Operations Research 159, 106305. https://doi.org/10.1016/j.cor.2023.106305 ↩
-
Castro, J., Gómez, D., Tejada, J., 2009. Polynomial calculation of the Shapley value based on sampling. Computers & Operations Research, Selected papers presented at the Tenth International Symposium on Locational Decisions (ISOLDE X) 36, 1726--1730. https://doi.org/10.1016/j.cor.2008.04.004 ↩↩
-
Okhrati, R., Lipani, A., 2021. A Multilinear Sampling Algorithm to Estimate Shapley Values, in: 2020 25th International Conference on Pattern Recognition (ICPR). Presented at the 2020 25th International Conference on Pattern Recognition (ICPR), IEEE, pp. 7992--7999. https://doi.org/10.1109/ICPR48806.2021.9412511 ↩
-
Kwon, Y., Zou, J., 2022. Beta Shapley: A Unified and Noise-reduced Data Valuation Framework for Machine Learning, in: Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) 2022. Presented at AISTATS 2022, PMLR, Valencia, Spain. ↩
-
Jia, R., Dao, D., Wang, B., Hubis, F.A., Hynes, N., Gürel, N.M., Li, B., Zhang, C., Song, D., Spanos, C.J., 2019. Towards Efficient Data Valuation Based on the Shapley Value, in: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics. Presented at the International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, pp. 1167--1176. ↩
-
Wang, T., Yang, Y., Jia, R., 2022. Improving Cooperative Game Theory-based Data Valuation via Data Utility Learning. Presented at the International Conference on Learning Representations (ICLR 2022). Workshop on Socially Responsible Machine Learning, arXiv. https://doi.org/10.48550/arXiv.2107.06336 ↩
-
Jia, R., Dao, D., Wang, B., Hubis, F.A., Gurel, N.M., Li, B., Zhang, C., Spanos, C., Song, D., 2019. Efficient task-specific data valuation for nearest neighbor algorithms. Proc. VLDB Endow. 12, 1610--1623. https://doi.org/10.14778/3342263.3342637 ↩