pydvl.valuation.methods.gt_shapley
¶
This module implements Group Testing for the approximation of Shapley values, as introduced in (Jia, R. et al., 2019).1
Computation of the Shapley value is transformed into a feasibility problem, where the constraints involve certain utility evaluations of sampled subsets. The sampling is done in such a way that an approximation to the true Shapley values can be computed with guarantees.3
Warning
Group Testing is very inefficient. Potential improvements to the implementation notwithstanding, convergence seems to be very slow (in terms of evaluations of the utility required). We recommend other Monte Carlo methods instead. See the whole list here.
You can read about data valuation in general in the documentation.
References¶
-
Jia, R. et al., 2019. Towards Efficient Data Valuation Based on the Shapley Value. In: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pp. 1167–1176. PMLR. ↩
-
Jia, R. et al., 2023. A Note on "Towards Efficient Data Valuation Based on the Shapley Value". ↩
-
Internally, this sampling is achieved with a StratifiedSampler with GroupTestingSampleSize strategy. ↩
GroupTestingProblem
¶
GroupTestingShapleyValuation
¶
GroupTestingShapleyValuation(
utility: UtilityBase,
n_samples: int,
epsilon: float,
solver_options: dict | None = None,
progress: bool = True,
seed: Seed | None = None,
batch_size: int = 1,
)
Bases: Valuation
Class to calculate the group-testing approximation to shapley values.
See (Jia, R. et al., 2019)1 for a description of the method, and Data valuation for an overview of data valuation.
Warning
Group Testing is very inefficient. Potential improvements to the implementation notwithstanding, convergence seems to be very slow (in terms of evaluations of the utility required). We recommend other Monte Carlo methods instead. See the whole list here.
PARAMETER | DESCRIPTION |
---|---|
utility
|
Utility object with model, data and scoring function.
TYPE:
|
n_samples
|
The number of samples to use. A sample size with theoretical guarantees can be computed using compute_n_samples().
TYPE:
|
epsilon
|
The error tolerance.
TYPE:
|
solver_options
|
Optional dictionary containing a CVXPY solver and options to configure it. For valid values to the "solver" key see this tutorial. For additional options cvxpy's documentation.
TYPE:
|
progress
|
Whether to show a progress bar during the construction of the group-testing problem.
TYPE:
|
seed
|
Seed for the random number generator.
TYPE:
|
batch_size
|
The number of samples to draw in each batch. Can be used to reduce parallelization overhead for fast utilities. Defaults to 1.
TYPE:
|
Source code in src/pydvl/valuation/methods/gt_shapley.py
fit
¶
fit(data: Dataset, continue_from: ValuationResult | None = None) -> Self
Calculate the group-testing valuation on a dataset.
This method has to be called before calling values()
.
Calculating the least core valuation is a computationally expensive task that
can be parallelized. To do so, call the fit()
method inside a
joblib.parallel_config
context manager as follows:
Source code in src/pydvl/valuation/methods/gt_shapley.py
values
¶
values(sort: bool = False) -> ValuationResult
Returns a copy of the valuation result.
The valuation must have been run with fit()
before calling this method.
PARAMETER | DESCRIPTION |
---|---|
sort
|
Whether to sort the valuation result by value before returning it.
TYPE:
|
Returns: The result of the valuation.
Source code in src/pydvl/valuation/base.py
compute_n_samples
¶
Compute the minimal sample size with epsilon-delta guarantees.
Based on the formula in Theorem 4 of (Jia, R. et al., 2023)2 which gives a lower bound on the number of samples required to obtain an (ε/√n,δ/(N(N-1))-approximation to all pair-wise differences of Shapley values, wrt. \(\ell_2\) norm.
The updated version refines the lower bound of the original paper. Note that the bound is tighter than earlier versions but might still overestimate the number of samples required.
PARAMETER | DESCRIPTION |
---|---|
epsilon
|
The error tolerance.
TYPE:
|
delta
|
The confidence level.
TYPE:
|
n_obs
|
Number of data points.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
int
|
The sample size. |
Source code in src/pydvl/valuation/methods/gt_shapley.py
create_group_testing_problem
¶
create_group_testing_problem(
utility: UtilityBase,
sampler: IndexSampler,
n_samples: int,
progress: bool,
epsilon: float,
) -> GroupTestingProblem
Create the feasibility problem that characterizes group testing shapley values.
PARAMETER | DESCRIPTION |
---|---|
utility
|
Utility object with model, data and scoring function.
TYPE:
|
sampler
|
The sampler to use for the valuation.
TYPE:
|
n_samples
|
The number of samples to use.
TYPE:
|
progress
|
Whether to show a progress bar.
TYPE:
|
epsilon
|
The error tolerance.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
GroupTestingProblem
|
A GroupTestingProblem object. |
Source code in src/pydvl/valuation/methods/gt_shapley.py
solve_group_testing_problem
¶
solve_group_testing_problem(
problem: GroupTestingProblem,
solver_options: dict | None,
algorithm_name: str,
data_names: NDArray[NameT],
) -> ValuationResult
Solve the group testing problem and create a ValuationResult.
PARAMETER | DESCRIPTION |
---|---|
problem
|
The group testing problem.
TYPE:
|
solver_options
|
Optional dictionary containing a CVXPY solver and options to configure it. For valid values to the "solver" key see here. For additional options see here.
TYPE:
|
algorithm_name
|
The name of the algorithm.
TYPE:
|
data_names
|
The names of the data columns.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
ValuationResult
|
A ValuationResult object. |