
δ-Shapley

δ-Shapley is a semi-value that approximates the average marginal contribution of a data point per subset size, truncated for sizes beyond a certain range. It was introduced in (Watson et al., 2023),[3] and is available in pyDVL as DeltaShapleyValuation, which is built around a stratified sampler. When its clipping feature (discussed below) is not used, it is an approximation to the Shapley value.

Practical application

While we provide an implementation of the δ-Shapley value for the sake of completeness, properly adjusting the required constants is often difficult in practice, making it hard to use. If one still wishes to use stratified sampling, it is possible to use subset size sampling strategies that don't require these constants, such as PowerLawSampleSize or VRDSSampler.

A layer-wise approximation

As discussed in Stratified Shapley, it is possible to compute Shapley values by averaging the "layer-wise" (i.e. per subset size) marginal contributions

$$\phi_i^k := \mathbb{E}_{S_j \sim \mathcal{U}(N_{-i}^k)} \left[ \Delta_i(S_j) \right],$$

where $N_{-i}^k$ denotes the collection of subsets of $N \setminus \{i\}$ of size $k$.
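
To see why averaging the $\phi_i^k$ recovers the Shapley value, group the terms of the standard Shapley sum by subset size:

$$v_\text{shap}(i) = \sum_{S \subseteq N_{-i}} \frac{1}{n} \binom{n-1}{|S|}^{-1} \Delta_i(S) = \frac{1}{n} \sum_{k=0}^{n-1} \phi_i^k.$$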

Therefore, one can estimate $v_\text{shap}(i)$ by approximating the $\phi_i^k$ and then averaging those. The approximation consists of sampling only $m_k$ of all the sets of size $k$ to average the $\Delta_i(S_j)$, $j = 1, \ldots, m_k$. The main contribution of the paper is a careful choice of $m_k$ (see below), which depends on the ML model for which the values are computed.

Clipping the sample sizes

Additionally, the authors argue that, for certain model classes, small coalitions are good estimators of the data value, because adding more data tends to have diminishing returns. This motivates clipping $k$ to a given range $[l, u]$, leading to the final definition of the δ-Shapley value:[2]

$$\hat{v}_\delta(i) := \frac{1}{u - l + 1} \sum_{k=l}^{u} \frac{1}{m_k} \sum_{j=1}^{m_k} \Delta_i(S_j),$$

where $l$ and $u$ are lower and upper bounds for $k$, the sets $S_j$ are sampled uniformly at random from $N_{-i}^k$, and $m_k$ is the number of samples at size $k$.
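
As a minimal sketch of this estimator (not the pyDVL implementation), assuming a user-supplied toy utility function `utility` mapping index tuples to real numbers, and positive per-size sample counts `m[k]` (both hypothetical stand-ins):

```python
from random import Random

def delta_shapley(i, n, utility, m, l, u, seed=42):
    """Monte Carlo estimate of the clipped delta-Shapley value of point i.

    `utility` maps a tuple of indices to a real number, `m[k]` is the number
    of sets to sample at size k, and [l, u] is the clipping range.
    """
    rng = Random(seed)
    others = [j for j in range(n) if j != i]
    total = 0.0
    for k in range(l, u + 1):
        marginals = 0.0
        for _ in range(m[k]):
            S = tuple(rng.sample(others, k))  # uniform draw from N_{-i}^k
            marginals += utility(S + (i,)) - utility(S)
        total += marginals / m[k]  # inner average over the m_k sampled sets
    return total / (u - l + 1)    # outer average over the clipped size range
```

With $l = 0$ and $u = n - 1$ this is an unclipped stratified estimate of the Shapley value; the clipped version trades the bias discussed above for fewer utility evaluations.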

Delta Shapley as importance sampling

Above, we have departed from the paper and used the notation $\hat{v}_\delta$ to indicate that this is an approximation to the Shapley value using importance sampling with a certain stratified sampling distribution.

Recall from Stratified Shapley that if we define a distribution $\mathcal{L}$ over sets by first sampling a size $k$ uniformly between $0$ and $n-1$, and then a subset $S \subseteq N_{-i}$ of size $k$ uniformly at random, then

$$v_\text{shap}(i) = \mathbb{E}_{\mathcal{L}} [\Delta_i(S)].$$

This holds because the probability under $\mathcal{L}$ of drawing any set $S$ is, letting $k = |S|$:

$$p_\mathcal{L}(S) = p_\mathcal{L}(k) \, p_\mathcal{L}(S \mid k) = \frac{1}{n} \binom{n-1}{k}^{-1}.$$
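
As a quick sanity check (ours, not the paper's): there are $\binom{n-1}{k}$ sets of each size $k$, so these probabilities sum to one,

$$\sum_{S \subseteq N_{-i}} p_\mathcal{L}(S) = \sum_{k=0}^{n-1} \binom{n-1}{k} \cdot \frac{1}{n} \binom{n-1}{k}^{-1} = 1.$$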

Now, δ-Shapley introduces a new stratified distribution $\mathcal{L}_\delta$, such that

$$p_{\mathcal{L}_\delta}(S) = p_{\mathcal{L}_\delta}(k) \, p_{\mathcal{L}_\delta}(S \mid k) = \frac{m_k}{\sum_j m_j} \binom{n-1}{k}^{-1},$$

and computes the approximation

$$\hat{v}_\delta(i) := \mathbb{E}_{\mathcal{L}_\delta} \left[ \Delta_i(S) \, \frac{p_\mathcal{L}(S)}{p_{\mathcal{L}_\delta}(S)} \right].$$
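
In code, this importance-sampling form amounts to drawing sets from $\mathcal{L}_\delta$ and reweighting each marginal by $p_\mathcal{L}(S) / p_{\mathcal{L}_\delta}(S) = \frac{\sum_j m_j}{n \, m_k}$, since the binomial factors cancel. A sketch under the same hypothetical `utility` and `m` as above:

```python
from random import Random

def delta_shapley_is(i, n, utility, m, n_draws=10_000, seed=0):
    """Importance-sampling estimate: sample sets from L_delta, then reweight
    each marginal by p_L(S) / p_L_delta(S). `m` maps sizes k to counts m_k."""
    rng = Random(seed)
    others = [j for j in range(n) if j != i]
    sizes, weights = zip(*m.items())
    total_m = sum(weights)
    acc = 0.0
    for _ in range(n_draws):
        k = rng.choices(sizes, weights=weights)[0]  # k ~ m_k / sum_j m_j
        S = tuple(rng.sample(others, k))            # S | k uniform, as in L
        acc += (utility(S + (i,)) - utility(S)) * total_m / (n * m[k])
    return acc / n_draws
```

Note that when the support of $\mathcal{L}_\delta$ is clipped (i.e. $m_k = 0$ outside $[l, u]$), this no longer estimates $v_\text{shap}(i)$ exactly; that is the bias incurred by clipping.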

Also see

Read Sampling strategies for semi-values for more on the interaction between coefficients and sampler weights in Monte Carlo estimation of semi-values.

Choosing the number of sets

As discussed, subset sizes are sampled with probability $m_k / \sum_j m_j$ (even though the exact procedure can vary, e.g. iterating through all $k$ deterministically or at random). This means that the probability of sampling a set $S$ of size $k$ is $p(S) = \frac{m_k}{\sum_j m_j} \binom{n-1}{k}^{-1}$, as given above.

The choice of $m_k$ is guided by theoretical bounds derived from the uniform stability properties of the learning algorithm. The authors of δ-Shapley derive bounds for different classes of loss functions using concentration inequalities, with Theorems 6 and 7 of the paper providing the choice of $m_k$ for the case of non-convex, smooth, Lipschitz models trained with SGD.[1] This is the case that we implement in DeltaShapleyNCSGDSampleSize, but any choice of $m_k$ can be implemented, as sketched below.
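
Any such choice boils down to a map from subset size $k$ to a sample count $m_k$. As a plain-Python illustration (not the actual pyDVL SampleSizeStrategy interface), a hypothetical power-law allocation of a fixed budget could look like:

```python
def power_law_m(n, budget, exponent=-0.5):
    """Hypothetical allocation of `budget` total sets over subset sizes
    k = 0, ..., n-1, with m_k proportional to (k + 1) ** exponent."""
    raw = [(k + 1) ** exponent for k in range(n)]
    scale = budget / sum(raw)
    return {k: max(1, round(r * scale)) for k, r in enumerate(raw)}
```

A dictionary like this can be plugged into the sketches above; in pyDVL, the analogous role is played by the classes inheriting from SampleSizeStrategy.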

Delta-Shapley for non-convex SGD

Setting $m_k$ for a general model trained with SGD requires several parameters: the number of SGD iterations, the range of the loss, the Lipschitz constant of the model, and the learning rate, which is assumed to decay as $\alpha_t = c / t$. For the exact expressions, see equations (8) and (9) of the paper.

All of these parameters must be set when instantiating the δ-Shapley class.

Inconsistencies between the paper and the code

There are several inconsistencies between the paper and the code that we could find online, which we couldn't resolve to our satisfaction. These include:

1. Minor discrepancies in the definition of $m_k$, most notably the introduction of a factor $n_\text{test}$.
2. The code uses a certain, rather arbitrary, number of SGD iterations $T$ to compute $m_k$, which is never actually used to train the model.
3. Most constants are set to arbitrary values, seemingly without any justification, potentially invalidating the application of the theoretical bounds.

For these reasons, the sample size strategy implementing these bounds provides two modes of operation, following either the paper or the code, for reproducibility purposes. See DeltaShapleyNCSGDSampleSize.

Powerset and permutation sampling

The original paper uses a standard powerset sampling approach, where the sets $S_j$ are sampled uniformly at random from $N_{-i}^k$. We provide this sampling method via StratifiedSampler, which can be configured with any of the classes inheriting from SampleSizeStrategy. These implement the choice of $m_k$, as well as the lower and upper bounds truncating $k$.

Alternatively, we provide an experimental, approximate permutation-based approach, which clips permutations and keeps track of the sizes of the sets returned. This reduces computation by at least a factor of 2, since the evaluation strategy can reuse the previously computed utility when computing a marginal contribution. It is implemented in StratifiedPermutationSampler. Note that it does not guarantee sampling exactly $m_k$ sets of each size $k$.
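
To see where the factor of 2 comes from, consider the marginals computed along a single permutation: each prefix utility $u(S \cup \{i\})$ doubles as the $u(S)$ of the next marginal, so each set size requires only one new evaluation. A minimal sketch, again assuming a hypothetical toy `utility`:

```python
from random import Random

def clipped_permutation_marginals(n, utility, l, u, seed=0):
    """Marginal contributions along one random permutation, restricted to
    set sizes in [l, u]. Each prefix utility is computed once and reused
    as the u(S) of the next marginal, roughly halving the evaluations."""
    rng = Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)
    prev_u = utility(tuple(perm[:l]))   # utility of the first clipped prefix
    marginals = {}                      # point -> (set size k, marginal at k)
    for pos in range(l, min(u + 1, n)):
        cur_u = utility(tuple(perm[:pos + 1]))
        marginals[perm[pos]] = (pos, cur_u - prev_u)
        prev_u = cur_u                  # reuse: u(S ∪ {i}) becomes next u(S)
    return marginals
```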


  1. A technical detail is the assumption that the order in which batches are sampled from a coalition $S$ when computing $u(S)$ is not random (i.e. it is constant across epochs).

  2. We believe Definition 9 in the paper to be a bit misleading, since it lacks the averaging of the marginals over the sampled sets. As it stands, it is an unweighted sum of the marginals, which would explode in magnitude for large $m_k$. This seems substantiated by the fact that the code we found online does not implement this definition, but rather the one we provide here.

  3. Watson, L., Kujawa, Z., Andreeva, R., Yang, H.-T., Elahi, T., Sarkar, R., 2023. Accelerated Shapley Value Approximation for Data Evaluation. https://doi.org/10.48550/arXiv.2311.05346