Semi-values

The well-known Shapley value is a particular case of a more general concept, the semi-value, which allows for different weighting schemes. A semi-value is any valuation function of the form:

\[ v_\text{semi}(i) = \sum_{k=0}^{n-1} w(k) \sum_{S \in D_{-i}^{k}} [u(S_{+i}) - u(S)], \]

where the coefficients \(w(k)\) satisfy the property:

\[\sum_{k=0}^{n-1} \binom{n-1}{k} w(k) = 1,\]

and \(D_{-i}^{k}\) is the set of all sets \(S\) of size \(k\) that do not include sample \(x_i\), \(S_{+i}\) is the set \(S\) with \(x_i\) added, and \(u\) is the utility function.

With \(w(k) = \frac{1}{n} \binom{n-1}{k}^{-1}\), we recover the Shapley value.
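As a small sanity check of the definition, the following sketch evaluates the double sum by brute force on a toy problem, with subset sizes \(k\) running from \(0\) to \(n-1\). The utility `toy_utility` and the helper names are hypothetical and chosen only for illustration; they are not part of pyDVL. Enumeration is exponential in \(n\), so this is only feasible for tiny datasets.

```python
from itertools import combinations
from math import comb, isclose


def shapley_weight(n: int, k: int) -> float:
    """w(k) = (1/n) * binom(n-1, k)^{-1} for subset sizes k = 0..n-1."""
    return 1.0 / (n * comb(n - 1, k))


def semivalue(i: int, n: int, utility, weight) -> float:
    """Exact semi-value of sample i: enumerate every subset S that excludes i."""
    rest = [j for j in range(n) if j != i]
    total = 0.0
    for k in range(n):  # subset sizes 0..n-1
        for subset in combinations(rest, k):
            s = frozenset(subset)
            total += weight(n, k) * (utility(s | {i}) - utility(s))
    return total


# Hypothetical toy data and utility (an assumption for this sketch):
# u(S) is the squared sum of the points indexed by S.
data = [1.0, 2.0, 3.0, 4.0]
n = len(data)


def toy_utility(subset) -> float:
    return float(sum(data[j] for j in subset)) ** 2


# The normalization property of the coefficients.
normalization = sum(comb(n - 1, k) * shapley_weight(n, k) for k in range(n))

# Semi-values with Shapley weights, one per sample.
values = [semivalue(i, n, toy_utility, shapley_weight) for i in range(n)]

print(normalization, values)
```

With the Shapley weights, the values add up to \(u(D) - u(\emptyset)\), which is the efficiency property of the Shapley value and a convenient check on the implementation.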

Two additional instances of semi-value are Data Banzhaf (Wang and Jia, 2023)² and Beta Shapley (Kwon and Zou, 2022)³, which offer improved numerical and rank stability in certain situations.
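To illustrate how weighting schemes differ, the sketch below compares the Shapley weights with the constant weight \(w(k) = 2^{-(n-1)}\) of the Banzhaf value. The Banzhaf weight is stated here from the standard definition of the Banzhaf value under this page's normalization, not taken from the text above, and the function names are hypothetical. Beta Shapley is omitted because its coefficients depend on two extra parameters.

```python
from math import comb, isclose


def shapley_weight(n: int, k: int) -> float:
    """w(k) = (1/n) * binom(n-1, k)^{-1}."""
    return 1.0 / (n * comb(n - 1, k))


def banzhaf_weight(n: int, k: int) -> float:
    """Constant weight: every subset of the other n-1 points counts equally."""
    return 1.0 / 2 ** (n - 1)


n = 6

# Total weight that each scheme assigns to all subsets of a given size k.
shapley_mass = [comb(n - 1, k) * shapley_weight(n, k) for k in range(n)]
banzhaf_mass = [comb(n - 1, k) * banzhaf_weight(n, k) for k in range(n)]

for k in range(n):
    print(k, round(shapley_mass[k], 4), round(banzhaf_mass[k], 4))
```

Both lists sum to one, as required by the normalization property. The Shapley weights spread the mass uniformly over subset sizes, while the Banzhaf weights concentrate it on the mid-sized subsets, which are the most numerous.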

All semi-values, including those two, are implemented in pyDVL by composing different sampling methods and weighting schemes. The abstract class from which they derive is SemivalueValuation, whose main abstract method is the weighting scheme \(k \mapsto w(k)\).

General semi-values

pyDVL provides a general method for computing semi-values from any combination of the three ingredients that define them:

A utility function \(u\).
A sampling method.
A weighting scheme \(w\).

You can construct any combination of these three ingredients with subclasses of SemivalueValuation and any of the samplers defined in pydvl.valuation.samplers.

Allowing any combination enables testing different importance-sampling schemes and can help when experimenting with models that are more sensitive to changes in training set size.¹
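The interplay of the three ingredients can be sketched in plain Python as a Monte Carlo estimator. This is a minimal illustration under stated assumptions, not pyDVL's implementation: `uniform_subset_sampler`, `mc_semivalue` and `toy_utility` are hypothetical names. Each sampled marginal contribution is multiplied by the semi-value coefficient \(w(|S|)\) and divided by the probability with which the sampler draws \(S\), so that the average is an unbiased estimate of the semi-value.

```python
import random
from math import comb


def shapley_weight(n: int, k: int) -> float:
    return 1.0 / (n * comb(n - 1, k))


def banzhaf_weight(n: int, k: int) -> float:
    return 1.0 / 2 ** (n - 1)


def uniform_subset_sampler(indices, rng) -> frozenset:
    """Include each index with probability 1/2: every subset is equally likely."""
    return frozenset(j for j in indices if rng.random() < 0.5)


def mc_semivalue(i, n, utility, weight, n_samples, rng) -> float:
    """Importance-weighted Monte Carlo estimate of the semi-value of sample i."""
    rest = [j for j in range(n) if j != i]
    p = 1.0 / 2 ** (n - 1)  # probability of any one subset under the sampler
    total = 0.0
    for _ in range(n_samples):
        s = uniform_subset_sampler(rest, rng)
        marginal = utility(s | {i}) - utility(s)
        total += weight(n, len(s)) / p * marginal
    return total / n_samples


# Hypothetical toy data and utility (assumptions for this sketch).
data = [1.0, 2.0, 3.0, 4.0]
n = len(data)


def toy_utility(subset) -> float:
    return float(sum(data[j] for j in subset)) ** 2


rng = random.Random(0)
banzhaf_estimates = [
    mc_semivalue(i, n, toy_utility, banzhaf_weight, 20_000, rng) for i in range(n)
]
shapley_estimates = [
    mc_semivalue(i, n, toy_utility, shapley_weight, 20_000, rng) for i in range(n)
]
print(banzhaf_estimates)
print(shapley_estimates)
```

With the uniform sampler, the correction factor \(w(|S|)/p(S)\) is exactly one for the Banzhaf weights, so the estimate reduces to a plain average of marginal contributions. For the Shapley weights the factor varies with the subset size, which tends to increase the variance of the estimate. Swapping the sampler or the weight function changes the estimator without touching the utility.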

For more on this topic and how Monte Carlo sampling interacts with the semi-value coefficient and the sampler probabilities, see Sampling strategies for semi-values.

¹ Note, however, that Data Banzhaf has been shown to be among the most robust to variance in the utility function, in the sense of rank stability, across a range of models and datasets (Wang and Jia, 2023)².
² Wang, J.T., Jia, R., 2023. Data Banzhaf: A Robust Data Valuation Framework for Machine Learning, in: Proceedings of The 26th International Conference on Artificial Intelligence and Statistics. PMLR, pp. 6388–6421.
³ Kwon, Y., Zou, J., 2022. Beta Shapley: A Unified and Noise-reduced Data Valuation Framework for Machine Learning, in: Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) 2022. PMLR, Valencia, Spain.