-Shapley¶
Practical application
While we provide an implementation of the
A layer-wise approximation¶
As discussed in Stratified Shapley, it is possible
to compute Shapley values by averaging the "layer-wise" (i.e. per subset size)
marginal contributions
Therefore, one can estimate
Clipping the sample sizes¶
Additionally, the authors argue that, for certain model classes, small
coalitions are good estimators of the data value, because adding more data tends
to have diminishing returns. This motivates clipping
where
Delta Shapley as importance sampling¶
Above, we have departed from the paper and use the notation
Recall from Stratified Shapley that we if we define
a distribution
This holds because the probability under
Now,
And computes the approximation
Also see
Read Sampling strategies for semi-values for more on the interaction between coefficients and sampler weights in Monte Carlo estimation of semi-values.
Choosing the number of sets¶
As discussed, subset sizes are sampled according to the probability
The choice of
Delta-Shapley for non-convex SGD¶
Setting
All of these parameters must be set when instantiating the
Inconsistencies between the paper and the code
There are several inconsistencies between the paper and the code that we
could find online, which
we couldn't resolve to our satisfaction. These include:
1. There are minor discrepancies in the definition of
Powerset and permutation sampling¶
The original paper uses a standard powerset sampling approach, where the sets
Alternatively, we provide an experimental and approximate permutation-based
approach which clips permutations and keeps track of the sizes of sets returned.
This reduces computation by at least a factor of 2, since the evaluation
strategy can reuse the previously computed utility for the marginal
contribution. It is implemented in
StratifiedPermutationSampler.
Note that it does not guarantee sampling the exact number of set sizes
-
A technical detail is the assumption that the order in which batches are sampled from a coalition
when computing is not random (i.e. it is constant across epochs). ↩ -
We believe Definition 9 in the paper to be a bit misleading since it lacks the averaging of the marginals over the sampled sets. As it stands, it iss an unweighted average of the marginals, which would explode in magnitude for large
This seems substantiated by the fact that the code we found online does not implement this definition, but rather the one we provide here. ↩ -
Watson, L., Kujawa, Z., Andreeva, R., Yang, H.-T., Elahi, T., Sarkar, R., 2023. Accelerated Shapley Value Approximation for Data Evaluation [WWW Document]. https://doi.org/10.48550/arXiv.2311.05346 ↩