pydvl.valuation.methods.gt_shapley
¶
This module implements Group Testing for the approximation of Shapley values, as introduced in (Jia, R. et al., 2019)1. The sampling of index subsets is done in such a way that an approximation to the true Shapley values can be computed with guarantees.
Warning
This method is very inefficient. Potential improvements to the implementation notwithstanding, convergence seems to be very slow (in terms of evaluations of the utility required). We recommend other Monte Carlo methods instead.
You can read more in the documentation.
References¶
-
Jia, R. et al., 2019. Towards Efficient Data Valuation Based on the Shapley Value. In: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pp. 1167–1176. PMLR. ↩
-
Jia, R. et al., 2023. A Note on "Towards Efficient Data Valuation Based on the Shapley Value". ↩
GroupTestingShapleyValuation
¶
GroupTestingShapleyValuation(
utility: UtilityBase,
n_samples: int,
epsilon: float,
solver_options: dict | None = None,
progress: bool = True,
seed: Seed | None = None,
batch_size: int = 1,
)
Bases: Valuation
Class to calculate the group-testing approximation to shapley values.
See Data valuation for an overview.
Warning
This method is very inefficient. Potential improvements to the implementation notwithstanding, convergence seems to be very slow (in terms of evaluations of the utility required). We recommend other Monte Carlo methods instead.
PARAMETER | DESCRIPTION |
---|---|
utility |
Utility object with model, data and scoring function.
TYPE:
|
n_samples |
The number of samples to use. A sample size with theoretical
guarantees can be computed using
TYPE:
|
epsilon |
The error tolerance.
TYPE:
|
solver_options |
Optional dictionary containing a CVXPY solver and options to configure it. For valid values to the "solver" key see here. For additional options see here.
TYPE:
|
progress |
Whether to show a progress bar during the construction of the group-testing problem.
TYPE:
|
seed |
Seed for the random number generator.
TYPE:
|
batch_size |
The number of samples to draw in each batch. Can be used to reduce parallelization overhead for fast utilities. Defaults to 1.
TYPE:
|
Source code in src/pydvl/valuation/methods/gt_shapley.py
values
¶
values(sort: bool = False) -> ValuationResult
Returns a copy of the valuation result.
The valuation must have been run with fit()
before calling this method.
PARAMETER | DESCRIPTION |
---|---|
sort |
Whether to sort the valuation result before returning it.
TYPE:
|
Returns: The result of the valuation.
Source code in src/pydvl/valuation/base.py
fit
¶
fit(data: Dataset) -> Self
Calculate the group-testing valuation on a dataset.
This method has to be called before calling values()
.
Calculating the least core valuation is a computationally expensive task that
can be parallelized. To do so, call the fit()
method inside a
joblib.parallel_config
context manager as follows:
Source code in src/pydvl/valuation/methods/gt_shapley.py
GroupTestingProblem
¶
GTSampler
¶
GTSampler(batch_size: int = 1, seed: Seed | None = None)
Bases: StochasticSamplerMixin
, IndexSampler
Sampler for the group-testing algorithm.
Methods that are specific to semi-values like weight and make_strategy are not implemented.
PARAMETER | DESCRIPTION |
---|---|
batch_size |
The number of samples to draw from each batch.
TYPE:
|
seed |
Seed for the random number generator.
TYPE:
|
Source code in src/pydvl/valuation/methods/gt_shapley.py
generate_batches
¶
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/base.py
compute_n_samples
¶
Compute the minimal sample size with epsilon-delta guarantees.
Based on the formula in Theorem 4 of (Jia, R. et al., 2023)2 which gives a lower bound on the number of samples required to obtain an (ε/√n,δ/(N(N-1))-approximation to all pair-wise differences of Shapley values, wrt. \(\ell_2\) norm.
The updated version refines the lower bound of the original paper. Note that the bound is tighter than earlier versions but might still overestimate the number of samples required.
PARAMETER | DESCRIPTION |
---|---|
epsilon |
The error tolerance.
TYPE:
|
delta |
The confidence level.
TYPE:
|
n_obs |
Number of data points.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
int
|
The sample size. |
Source code in src/pydvl/valuation/methods/gt_shapley.py
create_group_testing_problem
¶
create_group_testing_problem(
utility: UtilityBase,
sampler: IndexSampler,
n_samples: int,
progress: bool,
epsilon: float,
) -> GroupTestingProblem
Create the feasibility problem that characterizes group testing shapley values.
PARAMETER | DESCRIPTION |
---|---|
utility |
Utility object with model, data and scoring function.
TYPE:
|
sampler |
The sampler to use for the valuation.
TYPE:
|
n_samples |
The number of samples to use.
TYPE:
|
progress |
Whether to show a progress bar.
TYPE:
|
epsilon |
The error tolerance.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
GroupTestingProblem
|
A GroupTestingProblem object. |
Source code in src/pydvl/valuation/methods/gt_shapley.py
solve_group_testing_problem
¶
solve_group_testing_problem(
problem: GroupTestingProblem,
solver_options: dict | None,
algorithm_name: str,
data_names: NDArray[object_],
) -> ValuationResult
Solve the group testing problem and create a ValuationResult.
PARAMETER | DESCRIPTION |
---|---|
problem |
The group testing problem.
TYPE:
|
solver_options |
Optional dictionary containing a CVXPY solver and options to configure it. For valid values to the "solver" key see here. For additional options see here.
TYPE:
|
algorithm_name |
The name of the algorithm.
TYPE:
|
data_names |
The names of the data columns. |
RETURNS | DESCRIPTION |
---|---|
ValuationResult
|
A ValuationResult object. |