pydvl.valuation.methods.gt_shapley

This module implements Group Testing for the approximation of Shapley values, as introduced in (Jia, R. et al., 2019)1. The sampling of index subsets is done in such a way that an approximation to the true Shapley values can be computed with guarantees.

Warning

This method is very inefficient. Potential improvements to the implementation notwithstanding, convergence seems to be very slow (in terms of evaluations of the utility required). We recommend other Monte Carlo methods instead.

You can read more in the documentation.
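
A minimal usage sketch follows, assuming that a UtilityBase instance utility (model, data and scorer) and a Dataset train have already been constructed elsewhere; the names are placeholders:

```python
from joblib import parallel_config

from pydvl.valuation.methods.gt_shapley import (
    GroupTestingShapleyValuation,
    compute_n_samples,
)

# `utility` (a UtilityBase) and `train` (a Dataset) are assumed to exist;
# their construction is not part of this module.
n_samples = compute_n_samples(epsilon=0.1, delta=0.05, n_obs=len(train))

valuation = GroupTestingShapleyValuation(utility, n_samples=n_samples, epsilon=0.1)
with parallel_config(n_jobs=4):  # utility evaluations can be parallelized
    valuation.fit(train)
result = valuation.values(sort=True)
```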

References


  1. Jia, R. et al., 2019. Towards Efficient Data Valuation Based on the Shapley Value. In: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pp. 1167–1176. PMLR. 

  2. Jia, R. et al., 2023. A Note on "Towards Efficient Data Valuation Based on the Shapley Value".

GroupTestingShapleyValuation

GroupTestingShapleyValuation(
    utility: UtilityBase,
    n_samples: int,
    epsilon: float,
    solver_options: dict | None = None,
    progress: bool = True,
    seed: Seed | None = None,
    batch_size: int = 1,
)

Bases: Valuation

Class to calculate the group-testing approximation to Shapley values.

See Data valuation for an overview.

Warning

This method is very inefficient. Potential improvements to the implementation notwithstanding, convergence seems to be very slow (in terms of evaluations of the utility required). We recommend other Monte Carlo methods instead.

PARAMETER DESCRIPTION
utility

Utility object with model, data and scoring function.

TYPE: UtilityBase

n_samples

The number of samples to use. A sample size with theoretical guarantees can be computed using pydvl.valuation.methods.gt_shapley.compute_n_samples.

TYPE: int

epsilon

The error tolerance.

TYPE: float

solver_options

Optional dictionary containing a CVXPY solver and options to configure it. For valid values for the "solver" key see https://www.cvxpy.org/tutorial/solvers/index.html#choosing-a-solver. For additional options see https://www.cvxpy.org/tutorial/solvers/index.html#setting-solver-options.

TYPE: dict | None DEFAULT: None

progress

Whether to show a progress bar during the construction of the group-testing problem.

TYPE: bool DEFAULT: True

seed

Seed for the random number generator.

TYPE: Seed | None DEFAULT: None

batch_size

The number of samples to draw in each batch. Can be used to reduce parallelization overhead for fast utilities. Defaults to 1.

TYPE: int DEFAULT: 1
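
For example, the underlying CVXPY solver can be configured through solver_options. This is only a sketch; SCS with a raised iteration limit is one possible choice, not a recommendation, and any extra keys are passed on to CVXPY's solve():

```python
valuation = GroupTestingShapleyValuation(
    utility=utility,  # a previously constructed UtilityBase
    n_samples=10_000,
    epsilon=0.1,
    solver_options={"solver": "SCS", "max_iters": 20_000},
    seed=42,
)
```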

Source code in src/pydvl/valuation/methods/gt_shapley.py
def __init__(
    self,
    utility: UtilityBase,
    n_samples: int,
    epsilon: float,
    solver_options: dict | None = None,
    progress: bool = True,
    seed: Seed | None = None,
    batch_size: int = 1,
):
    super().__init__()

    self._utility = utility
    self._n_samples = n_samples
    self._solver_options = solver_options
    self._progress = progress
    self._sampler = GTSampler(batch_size=batch_size, seed=seed)
    self._epsilon = epsilon

values

values(sort: bool = False) -> ValuationResult

Returns a copy of the valuation result.

The valuation must have been run with fit() before calling this method.

PARAMETER DESCRIPTION
sort

Whether to sort the valuation result before returning it.

TYPE: bool DEFAULT: False

Returns: The result of the valuation.

Source code in src/pydvl/valuation/base.py
def values(self, sort: bool = False) -> ValuationResult:
    """Returns a copy of the valuation result.

    The valuation must have been run with `fit()` before calling this method.

    Args:
        sort: Whether to sort the valuation result before returning it.
    Returns:
        The result of the valuation.
    """
    if not self.is_fitted:
        raise NotFittedException(type(self))
    assert self.result is not None

    from copy import copy

    r = copy(self.result)
    if sort:
        r.sort()
    return r
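
For example, after fitting (a sketch; the values attribute of the returned ValuationResult is assumed to expose the estimates as an array):

```python
result = valuation.values(sort=True)  # raises NotFittedException if fit() was not called
print(result.values)
```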

fit

fit(data: Dataset) -> Self

Calculate the group-testing valuation on a dataset.

This method has to be called before calling values().

Calculating the group-testing valuation is a computationally expensive task that can be parallelized. To do so, call the fit() method inside a joblib.parallel_config context manager as follows:

from joblib import parallel_config

with parallel_config(n_jobs=4):
    valuation.fit(data)
Source code in src/pydvl/valuation/methods/gt_shapley.py
def fit(self, data: Dataset) -> Self:
    """Calculate the group-testing valuation on a dataset.

    This method has to be called before calling `values()`.

    Calculating the group-testing valuation is a computationally expensive task that
    can be parallelized. To do so, call the `fit()` method inside a
    `joblib.parallel_config` context manager as follows:

    ```python
    from joblib import parallel_config

    with parallel_config(n_jobs=4):
        valuation.fit(data)
    ```

    """
    self._utility = self._utility.with_dataset(data)

    problem = create_group_testing_problem(
        utility=self._utility,
        sampler=self._sampler,
        n_samples=self._n_samples,
        progress=self._progress,
        epsilon=self._epsilon,
    )

    solution = solve_group_testing_problem(
        problem=problem,
        solver_options=self._solver_options,
        algorithm_name=self.algorithm_name,
        data_names=data.data_names,
    )

    self.result = solution
    return self

GroupTestingProblem

Bases: NamedTuple

Solver-agnostic representation of the group-testing problem, holding the pairwise utility differences, the total utility and the error tolerance epsilon.

GTSampler

GTSampler(batch_size: int = 1, seed: Seed | None = None)

Bases: StochasticSamplerMixin, IndexSampler

Sampler for the group-testing algorithm.

Methods that are specific to semi-values, such as weight and make_strategy, are not implemented.

PARAMETER DESCRIPTION
batch_size

The number of samples to draw in each batch.

TYPE: int DEFAULT: 1

seed

Seed for the random number generator.

TYPE: Seed | None DEFAULT: None

Source code in src/pydvl/valuation/methods/gt_shapley.py
def __init__(self, batch_size: int = 1, seed: Seed | None = None):
    super().__init__(batch_size=batch_size, seed=seed)

generate_batches

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py
def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""

    # create an empty generator if the indices are empty. `generate_batches` is
    # a generator function because it has a yield statement later in its body.
    # Inside generator function, `return` acts like a `break`, which produces an
    # empty generator function. See: https://stackoverflow.com/a/13243870
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self._generate(indices), self.batch_size):
        yield batch
        self._n_samples += len(batch)
        if self._interrupted:
            break

sample_limit

sample_limit(indices: IndexSetT) -> int | None

Number of samples that can be generated from the indices.

Returns None if the number of samples is infinite, which is the case for most stochastic samplers.

Source code in src/pydvl/valuation/samplers/base.py
def sample_limit(self, indices: IndexSetT) -> int | None:
    """Number of samples that can be generated from the indices.

    Returns None if the number of samples is infinite, which is the case for most
    stochastic samplers.
    """
    if len(indices) == 0:
        out = 0
    else:
        out = None
    return out

compute_n_samples

compute_n_samples(epsilon: float, delta: float, n_obs: int) -> int

Compute the minimal sample size with epsilon-delta guarantees.

Based on the formula in Theorem 4 of (Jia, R. et al., 2023)2, which gives a lower bound on the number of samples required to obtain an \((\epsilon/\sqrt{n}, \delta/(N(N-1)))\)-approximation to all pair-wise differences of Shapley values, w.r.t. the \(\ell_2\) norm.

The updated version refines the lower bound of the original paper. Note that the bound is tighter than earlier versions but might still overestimate the number of samples required.

PARAMETER DESCRIPTION
epsilon

The error tolerance.

TYPE: float

delta

The confidence level.

TYPE: float

n_obs

Number of data points.

TYPE: int

RETURNS DESCRIPTION
int

The sample size.

Source code in src/pydvl/valuation/methods/gt_shapley.py
def compute_n_samples(epsilon: float, delta: float, n_obs: int) -> int:
    """Compute the minimal sample size with epsilon-delta guarantees.

    Based on the formula in Theorem 4 of
    (Jia, R. et al., 2023)<sup><a href="#jia_update_2023">2</a></sup>
    which gives a lower bound on the number of samples required to obtain an
    (ε/√n, δ/(N(N-1)))-approximation to all pair-wise differences of Shapley
    values, w.r.t. the $\ell_2$ norm.

    The updated version refines the lower bound of the original paper. Note that the
    bound is tighter than earlier versions but might still overestimate the number of
    samples required.

    Args:
        epsilon: The error tolerance.
        delta: The confidence level.
        n_obs: Number of data points.

    Returns:
        The sample size.

    """
    kk = _create_sample_sizes(n_obs)
    Z = _calculate_z(n_obs)

    q = _create_sampling_probabilities(kk)
    q_tot = (n_obs - 2) / n_obs * q[0] + np.inner(
        q[1:], 1 + 2 * kk[1:] * (kk[1:] - n_obs) / (n_obs * (n_obs - 1))
    )

    def _h(u: float) -> float:
        return cast(float, (1 + u) * np.log(1 + u) - u)

    n_samples = np.log(n_obs * (n_obs - 1) / delta)
    n_samples /= 1 - q_tot
    n_samples /= _h(epsilon / (2 * Z * np.sqrt(n_obs) * (1 - q_tot)))

    return int(n_samples)
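
For example, for a dataset of 100 points (a minimal sketch; the returned sample size is typically very large, which is one reason the method is considered inefficient):

```python
from pydvl.valuation.methods.gt_shapley import compute_n_samples

# Sample size for 100 data points with error tolerance 0.1 and confidence 0.05.
n_samples = compute_n_samples(epsilon=0.1, delta=0.05, n_obs=100)
```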

create_group_testing_problem

create_group_testing_problem(
    utility: UtilityBase,
    sampler: IndexSampler,
    n_samples: int,
    progress: bool,
    epsilon: float,
) -> GroupTestingProblem

Create the feasibility problem that characterizes group-testing Shapley values.

PARAMETER DESCRIPTION
utility

Utility object with model, data and scoring function.

TYPE: UtilityBase

sampler

The sampler to use for the valuation.

TYPE: IndexSampler

n_samples

The number of samples to use.

TYPE: int

progress

Whether to show a progress bar.

TYPE: bool

epsilon

The error tolerance.

TYPE: float

RETURNS DESCRIPTION
GroupTestingProblem

A GroupTestingProblem object.

Source code in src/pydvl/valuation/methods/gt_shapley.py
def create_group_testing_problem(
    utility: UtilityBase,
    sampler: IndexSampler,
    n_samples: int,
    progress: bool,
    epsilon: float,
) -> GroupTestingProblem:
    """Create the feasibility problem that characterizes group testing shapley values.

    Args:
        utility: Utility object with model, data and scoring function.
        sampler: The sampler to use for the valuation.
        n_samples: The number of samples to use.
        progress: Whether to show a progress bar.
        epsilon: The error tolerance.

    Returns:
        A GroupTestingProblem object.

    """
    if utility.training_data is None:
        raise ValueError("Utility object must have training data.")

    u_values, masks = compute_utility_values_and_sample_masks(
        utility=utility,
        sampler=sampler,
        n_samples=n_samples,
        progress=progress,
        extra_samples=[Sample(idx=None, subset=utility.training_data.indices)],
    )

    total_utility = u_values[-1]
    u_values = u_values[:-1]
    masks = masks[:-1]

    u_differences = _calculate_utility_differences(
        utility_values=u_values, masks=masks, n_obs=len(utility.training_data)
    )

    problem = GroupTestingProblem(
        utility_differences=u_differences,
        total_utility=total_utility,
        epsilon=epsilon,
    )

    return problem

solve_group_testing_problem

solve_group_testing_problem(
    problem: GroupTestingProblem,
    solver_options: dict | None,
    algorithm_name: str,
    data_names: NDArray[object_],
) -> ValuationResult

Solve the group testing problem and create a ValuationResult.
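
In the notation of the source below, with \(C_{ij}\) the estimated pairwise utility differences, \(U(N)\) the total utility and \(\epsilon\) the tolerance, the pure feasibility problem being solved is

\[
\text{find } v \in \mathbb{R}^{n} \quad \text{such that} \quad \sum_{i=1}^{n} v_i = U(N), \qquad \left| (v_i - v_j) - C_{ij} \right| \leq \epsilon \quad \text{for all } i < j.
\]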

PARAMETER DESCRIPTION
problem

The group testing problem.

TYPE: GroupTestingProblem

solver_options

Optional dictionary containing a CVXPY solver and options to configure it. For valid values for the "solver" key see https://www.cvxpy.org/tutorial/solvers/index.html#choosing-a-solver. For additional options see https://www.cvxpy.org/tutorial/solvers/index.html#setting-solver-options.

TYPE: dict | None

algorithm_name

The name of the algorithm.

TYPE: str

data_names

The names of the data columns.

TYPE: NDArray[object_]

RETURNS DESCRIPTION
ValuationResult

A ValuationResult object.

Source code in src/pydvl/valuation/methods/gt_shapley.py
def solve_group_testing_problem(
    problem: GroupTestingProblem,
    solver_options: dict | None,
    algorithm_name: str,
    data_names: NDArray[np.object_],
) -> ValuationResult:
    """Solve the group testing problem and create a ValuationResult.

    Args:
        problem: The group testing problem.
        solver_options: Optional dictionary containing a CVXPY solver and options to
            configure it. For valid values to the "solver" key see
            [here](https://www.cvxpy.org/tutorial/solvers/index.html#choosing-a-solver).
            For additional options see [here](https://www.cvxpy.org/tutorial/solvers/index.html#setting-solver-options).
        algorithm_name: The name of the algorithm.
        data_names: The names of the data columns.

    Returns:
        A ValuationResult object.

    """

    solver_options = {} if solver_options is None else solver_options.copy()

    C = problem.utility_differences
    total_utility = problem.total_utility
    epsilon = problem.epsilon
    n_obs = len(C)

    v = cp.Variable(n_obs)
    constraints = [cp.sum(v) == total_utility]
    for i in range(n_obs):
        for j in range(i + 1, n_obs):
            constraints.append(v[i] - v[j] <= epsilon + C[i, j])
            constraints.append(v[j] - v[i] <= epsilon - C[i, j])

    cp_problem = cp.Problem(cp.Minimize(0), constraints)
    solver = solver_options.pop("solver", cp.SCS)
    cp_problem.solve(solver=solver, **solver_options)

    if cp_problem.status != "optimal":
        log.warning(f"cvxpy returned status {cp_problem.status}")
        # Solver failed: fall back to an array of NaNs of length n_obs.
        values = (
            np.full(n_obs, np.nan) if not hasattr(v.value, "__len__") else v.value
        )
        status = Status.Failed
    else:
        values = v.value
        status = Status.Converged

    result = ValuationResult(
        status=status,
        values=values,
        data_names=data_names,
        solver_status=cp_problem.status,
        algorithm=algorithm_name,
    )

    return result
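
The two helpers above can also be combined directly, mirroring what GroupTestingShapleyValuation.fit does internally. The following is a sketch: utility, n_samples and epsilon are assumed to have been prepared beforehand, and the algorithm_name string is merely a label attached to the result:

```python
from pydvl.valuation.methods.gt_shapley import (
    GTSampler,
    create_group_testing_problem,
    solve_group_testing_problem,
)

# `utility` must already have training data attached (see fit() above).
problem = create_group_testing_problem(
    utility=utility,
    sampler=GTSampler(seed=42),
    n_samples=n_samples,
    progress=True,
    epsilon=epsilon,
)
result = solve_group_testing_problem(
    problem=problem,
    solver_options=None,  # defaults to the SCS solver
    algorithm_name="group_testing_shapley",
    data_names=utility.training_data.data_names,
)
```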