Shapley values
The Shapley method is an approach to computing data values that originates in cooperative game theory. Shapley values are a common way of assigning payoffs to each participant in a cooperative game (i.e. one in which players can form coalitions) in a way that ensures that certain axioms are fulfilled.
pyDVL implements several methods for the computation and approximation of Shapley values. They can all be accessed via the facade function compute_shapley_values. The supported methods are enumerated in ShapleyMode.
Empirically, the most useful method is the so-called Truncated Monte Carlo Shapley (Ghorbani and Zou, 2019), which is a Monte Carlo approximation of the permutation Shapley value.
Combinatorial Shapley
The first algorithm is just a verbatim implementation of the definition. As such it returns as exact a value as the utility function allows (see what this means in Problems of Data Values).
The value \(v\) of the \(i\)-th sample in dataset \(D\) wrt. utility \(u\) is computed as a weighted sum of its marginal utility wrt. every possible coalition of training samples within the training set:

\[
v(x_i) = \frac{1}{n} \sum_{S \subseteq D_{-i}} \binom{n-1}{|S|}^{-1} [u(S_{+i}) - u(S)],
\]

where \(n = |D|\), \(D_{-i}\) denotes the set of samples in \(D\) excluding \(x_i\), and \(S_{+i}\) denotes the set \(S\) with \(x_i\) added.
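For instance, for \(n = 2\) the sum has only two terms: \(v(x_1) = \frac{1}{2} [u(\{x_1\}) - u(\emptyset)] + \frac{1}{2} [u(D) - u(\{x_2\})]\), i.e. the average of the marginal contributions of \(x_1\) to the empty coalition and to \(\{x_2\}\). In pyDVL, the exact computation is invoked as follows: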
from pydvl.value import compute_shapley_values
values = compute_shapley_values(utility, mode="combinatorial_exact")
df = values.to_dataframe(column='value')
We can convert the return value to a pandas.DataFrame and name the column holding the results value. Please refer to the documentation in shapley and ValuationResult for more information.
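All examples in this section assume that a Dataset and a Utility have already been constructed. A minimal sketch of such a setup, using scikit-learn's iris data and an accuracy scorer (any model, dataset and scorer combination works the same way):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from pydvl.utils import Dataset, Utility

# Wrap a train/test split of the data and bind it to a model and scorer
data = Dataset.from_sklearn(load_iris(), train_size=0.8)
utility = Utility(LogisticRegression(), data, "accuracy")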
Monte Carlo Combinatorial Shapley
Because the number of subsets \(S \subseteq D_{-i}\) is \(2^{ | D | - 1 }\), one typically must resort to approximations. The simplest one is done via Monte Carlo sampling of the powerset \(\mathcal{P}(D)\). In pyDVL this simple technique is called "Monte Carlo Combinatorial". The method has a very poor convergence rate and others are preferred, but if desired, usage follows the same pattern:
from pydvl.value import compute_shapley_values, MaxUpdates
values = compute_shapley_values(
utility, mode="combinatorial_montecarlo", done=MaxUpdates(1000)
)
df = values.to_dataframe(column='cmc')
The DataFrames returned by most Monte Carlo methods will contain approximate standard errors as an additional column, in this case named cmc_stderr.
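To see where the poor convergence comes from, consider a toy version of the estimator (an illustrative sketch, not pyDVL's implementation; u is assumed to be a callable mapping lists of indices to utilities). Subsets drawn uniformly must be reweighted to match the weights in the definition, and these weights vary over many orders of magnitude with subset size, which inflates the variance:

import math, random

def mc_combinatorial(u, n, i, n_samples):
    """Uniform-sampling estimate of the combinatorial Shapley value of index i."""
    others = [j for j in range(n) if j != i]
    total = 0.0
    for _ in range(n_samples):
        s = [j for j in others if random.random() < 0.5]  # uniform subset of D_{-i}
        delta = u(s + [i]) - u(s)
        # Uniform sampling picks each subset with probability 1 / 2^(n-1), but the
        # definition weighs a size-k subset by 1 / (n * C(n-1, k)): reweight.
        total += delta * 2 ** (n - 1) / (n * math.comb(n - 1, len(s)))
    return total / n_samples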
Note the use of MaxUpdates as the stopping condition. This is an instance of a StoppingCriterion. Other examples are MaxTime and AbsoluteStandardError.
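Stopping criteria can also be combined with the bitwise operators & and |, e.g. to stop as soon as standard errors are small enough, while capping the total budget. A sketch (the threshold values here are arbitrary):

from pydvl.value import AbsoluteStandardError, MaxTime, MaxUpdates, compute_shapley_values

# Stop when standard errors drop below 0.01, but never run longer than
# an hour or for more than 5000 updates per value.
done = AbsoluteStandardError(threshold=0.01) | MaxTime(seconds=3600) | MaxUpdates(5000)
values = compute_shapley_values(utility, mode="combinatorial_montecarlo", done=done)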
Owen sampling
Owen Sampling (Okhrati and Lipani, 2021) is a practical algorithm based on the combinatorial definition. It uses a continuous extension of the utility from \(\{0,1\}^n\), where a 1 in position \(i\) means that sample \(x_i\) is used to train the model, to \([0,1]^n\). The ensuing expression for the Shapley value uses integration instead of discrete weights:

\[
v(x_i) = \int_0^1 \mathbb{E}_{S \sim P_q(D_{-i})} [u(S_{+i}) - u(S)] \, dq,
\]

where \(P_q(D_{-i})\) is the distribution over subsets of \(D_{-i}\) in which each sample is included independently with probability \(q\).
Using Owen sampling follows the same pattern as every other method for Shapley values in pyDVL. First construct the dataset and utility, then call compute_shapley_values:
from pydvl.value import compute_shapley_values
values = compute_shapley_values(
u=utility, mode="owen", n_iterations=4, max_q=200
)
There are more details on Owen sampling, and its variant Antithetic Owen Sampling, in the documentation for the function doing the work behind the scenes: owen_sampling_shapley.
Note that in this case we do not pass a StoppingCriterion to the function, but instead the number of iterations and the maximum number of samples to use in the integration.
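For intuition, a toy version of the estimator might look as follows (an illustrative sketch only; u is again assumed to map lists of indices to utilities, and the grid over \(q\) plays the role of max_q):

import numpy as np

def owen_value(u, n, i, n_iterations, max_q):
    """Toy Owen estimator: average the marginal utility of i over subsets
    drawn with inclusion probability q, then integrate over a grid of q."""
    total = 0.0
    for q in np.linspace(0, 1, max_q):
        for _ in range(n_iterations):
            mask = np.random.rand(n) < q  # sample S ~ P_q(D_{-i})
            mask[i] = False               # i itself is never in S
            s = list(np.flatnonzero(mask))
            total += u(s + [i]) - u(s)
    return total / (max_q * n_iterations)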
Permutation Shapley
An equivalent way of computing Shapley values (ApproShapley) appeared in (Castro et al., 2009) and is the basis for the method most often used in practice. It uses permutations over indices instead of subsets:

\[
v(x_i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)} [u(\sigma_{:i} \cup \{x_i\}) - u(\sigma_{:i})],
\]

where \(\sigma_{:i}\) denotes the set of indices in permutation \(\sigma\) before the position where \(i\) appears. To approximate this sum (which has \(\mathcal{O}(n!)\) terms!) one uses Monte Carlo sampling of permutations, something which has surprisingly low sample complexity. One notable difference wrt. the combinatorial approach above is that the approximations always fulfill the efficiency axiom of Shapley, namely \(\sum_{i=1}^n \hat{v}_i = u(D)\) (see (Castro et al., 2009), Proposition 3.2).
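A stripped-down sketch of the plain Monte Carlo version may help fix ideas (illustrative only; u is assumed to map lists of indices to utilities, and none of pyDVL's parallelization or stopping logic is shown):

import numpy as np

def permutation_montecarlo(u, n, n_permutations):
    """Average each index's marginal utility over random permutations."""
    values = np.zeros(n)
    for _ in range(n_permutations):
        sigma = np.random.permutation(n)
        previous = u([])  # utility of the empty coalition
        for position, i in enumerate(sigma):
            current = u(list(sigma[: position + 1]))
            values[i] += current - previous  # marginal contribution of i
            previous = current
    return values / n_permutations

Note that the updates along a single permutation telescope to \(u(D) - u(\emptyset)\), which is why efficiency holds for every sample drawn.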
By adding two types of early stopping, the result is the so-called Truncated Monte Carlo Shapley (Ghorbani and Zou, 2019), which is efficient enough to be useful in applications. The first is simply a convergence criterion, of which there are several to choose from. The second is a criterion to truncate the iteration over single permutations. RelativeTruncation chooses to stop iterating over samples in a permutation when the marginal utility becomes too small.
from pydvl.value import compute_shapley_values, MaxUpdates, RelativeTruncation
values = compute_shapley_values(
u=utility,
mode="permutation_montecarlo",
done=MaxUpdates(1000),
truncation=RelativeTruncation(utility, rtol=0.01)
)
You can see this method in action in this example using the Spotify dataset.
Exact Shapley for KNN
It is possible to exploit the local structure of K-Nearest Neighbours to reduce the number of subsets to consider: because no sample besides the K closest affects the score, most are irrelevant and it is possible to compute the values in almost linear time. This method was introduced in (Jia et al., 2019), and can be used in pyDVL with:
from pydvl.utils import Dataset, Utility
from pydvl.value import compute_shapley_values
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
data = Dataset(...)
utility = Utility(model, data)
values = compute_shapley_values(u=utility, mode="knn")
Group testing
An alternative method for the approximation of Shapley values, introduced in (Jia et al., 2019), first estimates the differences of values with a Monte Carlo sum. With estimates

\[
\hat{\Delta}_{ij} \approx v(x_i) - v(x_j),
\]

one then solves the following linear constraint satisfaction problem (CSP) to infer the final values:

\[
\sum_{i=1}^n v(x_i) = u(D), \qquad |v(x_i) - v(x_j) - \hat{\Delta}_{ij}| \leq \frac{\epsilon}{2 \sqrt{n}} \ \text{for all } i, j.
\]
Warning
We have reproduced this method in pyDVL for completeness and benchmarking, but we don't advocate its use because of the speed and memory cost. Despite our best efforts, the number of samples required in practice for convergence can be several orders of magnitude worse than with e.g. TMCS. Additionally, the CSP can sometimes turn out to be infeasible.
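For concreteness, the constraint system can be written down with a generic convex solver such as cvxpy. This is an illustrative sketch of the feasibility problem only, not pyDVL's implementation; delta_hat is assumed to hold the Monte Carlo estimates \(\hat{\Delta}_{ij}\) and total_utility the value \(u(D)\):

import cvxpy as cp
import numpy as np

def solve_gt_csp(delta_hat, total_utility, epsilon):
    """Feasibility solve for the group-testing constraints (toy sketch)."""
    n = delta_hat.shape[0]
    v = cp.Variable(n)
    slack = epsilon / (2 * np.sqrt(n))
    constraints = [cp.sum(v) == total_utility]
    for i in range(n):
        for j in range(i + 1, n):
            # Pairwise differences must match the estimates up to the slack
            constraints.append(cp.abs(v[i] - v[j] - delta_hat[i, j]) <= slack)
    cp.Problem(cp.Minimize(0), constraints).solve()
    return v.value  # None if the problem turned out to be infeasible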
Usage follows the same pattern as every other Shapley method, but with the addition of an epsilon parameter required for the solution of the CSP. It should be the same value used to compute the minimum number of samples required. This can be done with num_samples_eps_delta, but note that the number returned will be huge! In practice, fewer samples can be enough, but the actual number will strongly depend on the utility, in particular on its variance.
from pydvl.utils import Dataset, Utility
from pydvl.value import compute_shapley_values, num_samples_eps_delta

model = ...
data = Dataset(...)
utility = Utility(model, data, score_range=(_min, _max))
# epsilon and delta are the user-chosen bounds on error and failure probability
min_iterations = num_samples_eps_delta(epsilon, delta, len(data), utility.score_range)
values = compute_shapley_values(
    u=utility,
    mode="group_testing",
    n_iterations=min_iterations,
    epsilon=epsilon,
    delta=delta,
)
- Ghorbani, A., Zou, J., 2019. Data Shapley: Equitable Valuation of Data for Machine Learning, in: Proceedings of the 36th International Conference on Machine Learning (ICML 2019), PMLR, pp. 2242–2251.
- Okhrati, R., Lipani, A., 2021. A Multilinear Sampling Algorithm to Estimate Shapley Values, in: 2020 25th International Conference on Pattern Recognition (ICPR), IEEE, pp. 7992–7999. https://doi.org/10.1109/ICPR48806.2021.9412511
- Castro, J., Gómez, D., Tejada, J., 2009. Polynomial calculation of the Shapley value based on sampling. Computers & Operations Research 36, 1726–1730. https://doi.org/10.1016/j.cor.2008.04.004
- Jia, R., Dao, D., Wang, B., Hubis, F.A., Gurel, N.M., Li, B., Zhang, C., Spanos, C., Song, D., 2019. Efficient task-specific data valuation for nearest neighbor algorithms. Proc. VLDB Endow. 12, 1610–1623. https://doi.org/10.14778/3342263.3342637