pydvl.value.shapley.gt

This module implements Group Testing for the approximation of Shapley values, as introduced in (Jia, R. et al., 2019)1. The sampling of index subsets is done in such a way that an approximation to the true Shapley values can be computed with guarantees.

Warning

This method is very inefficient. Potential improvements to the implementation notwithstanding, convergence seems to be very slow (in terms of evaluations of the utility required). We recommend other Monte Carlo methods instead.

You can read more in the documentation.

New in version 0.4.0

References


  1. Jia, R. et al., 2019. Towards Efficient Data Valuation Based on the Shapley Value. In: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pp. 1167–1176. PMLR. 

num_samples_eps_delta

num_samples_eps_delta(
    eps: float, delta: float, n: int, utility_range: float
) -> int

Implements the formula in Theorem 3 of (Jia, R. et al., 2019)1 which gives a lower bound on the number of samples required to obtain an (ε/√n, δ/(N(N-1)))-approximation to all pairwise differences of Shapley values, with respect to the \(\ell_2\) norm.

Parameters

eps (float): ε
delta (float): δ
n (int): Number of data points
utility_range (float): Range of the Utility function

Returns: Number of samples from \(2^{[n]}\) guaranteeing ε/√n-correct Shapley pairwise differences of values with probability 1-δ/(N(N-1)).

New in version 0.4.0
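For illustration, a minimal usage sketch. The figures are chosen arbitrarily, and the utility is assumed to be bounded in [0, 1] (e.g. accuracy):

from pydvl.value.shapley.gt import num_samples_eps_delta

# Tests required for the (ε, δ) = (0.1, 0.01) guarantee over n=100 data
# points, assuming the utility (e.g. accuracy) lies in [0, 1]:
n_samples = num_samples_eps_delta(eps=0.1, delta=0.01, n=100, utility_range=1.0)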

Source code in src/pydvl/value/shapley/gt.py
def num_samples_eps_delta(
    eps: float, delta: float, n: int, utility_range: float
) -> int:
    r"""Implements the formula in Theorem 3 of (Jia, R. et al., 2019)<sup><a href="#jia_efficient_2019">1</a></sup>
    which gives a lower bound on the number of samples required to obtain an
    (ε/√n,δ/(N(N-1))-approximation to all pair-wise differences of Shapley
    values, wrt. $\ell_2$ norm.

    Args:
        eps: ε
        delta: δ
        n: Number of data points
        utility_range: Range of the [Utility][pydvl.utils.utility.Utility] function
    Returns:
        Number of samples from $2^{[n]}$ guaranteeing ε/√n-correct Shapley
            pair-wise differences of values with probability 1-δ/(N(N-1)).

    !!! tip "New in version 0.4.0"

    """
    constants = _constants(n=n, epsilon=eps, delta=delta, utility_range=utility_range)
    return int(constants.T)

group_testing_shapley

group_testing_shapley(
    u: Utility,
    n_samples: int,
    epsilon: float,
    delta: float,
    *,
    n_jobs: int = 1,
    parallel_backend: Optional[ParallelBackend] = None,
    config: Optional[ParallelConfig] = None,
    progress: bool = False,
    seed: Optional[Seed] = None,
    **options: dict
) -> ValuationResult

Implements group testing for approximation of Shapley values as described in (Jia, R. et al., 2019)1.

Warning

This method is very inefficient. It requires several orders of magnitude more evaluations of the utility than other Monte Carlo methods in the montecarlo module. It also uses several intermediate objects, like the results from the runners and the constraint matrices, which can become rather large.

By picking a specific distribution over subsets, the differences in Shapley values can be approximated with a Monte Carlo sum. These are then used to solve for the individual values in a feasibility problem.
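Concretely, writing \(C_{ij}\) for the Monte Carlo estimate of the difference \(v_i - v_j\) (Eqs. (3) and (4) in the paper) and \(U(N)\) for the total utility, the values are recovered from the feasibility problem

\[ \text{find } v \in \mathbb{R}^n \text{ such that } \sum_{i=1}^n v_i = U(N), \qquad |(v_i - v_j) - C_{ij}| \leq \epsilon \quad \text{for all } i < j, \]

as can be read off the cvxpy constraints in the source code below.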

Parameters

u (Utility): Utility object with model, data, and scoring function.
n_samples (int): Number of tests to perform. Use num_samples_eps_delta to estimate this.
epsilon (float): From the (ε,δ) sample bound. Use the same value as for the estimation of n_samples.
delta (float): From the (ε,δ) sample bound. Use the same value as for the estimation of n_samples.
n_jobs (int, default: 1): Number of parallel jobs to use. Each worker performs a chunk of all tests (i.e. utility evaluations).
parallel_backend (Optional[ParallelBackend], default: None): Parallel backend instance to use for parallelizing computations. If None, the JoblibParallelBackend is used. See the Parallel Backends package for available options.
config (Optional[ParallelConfig], default: None): (DEPRECATED) Object configuring parallel computation, with cluster address, number of CPUs, etc.
progress (bool, default: False): Whether to display progress bars for each job.
seed (Optional[Seed], default: None): Either an instance of a numpy random number generator or a seed for it.
options (dict, default: {}): Additional options to pass to cvxpy.Problem.solve(). E.g. to change the solver (which defaults to cvxpy.SCS) pass solver=cvxpy.CVXOPT.

Returns: ValuationResult object with the data values.

New in version 0.4.0

Changed in version 0.5.0

Changed the solver to cvxpy instead of scipy's linprog. Added the ability to pass arbitrary options to it.

Changed in version 0.9.0

Deprecated config argument and added a parallel_backend argument to allow users to pass the Parallel Backend instance directly.
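A usage sketch tying both functions together. The model, dataset and parameter choices below are purely illustrative, assuming pyDVL's usual Dataset/Utility construction:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Utility
from pydvl.value.shapley.gt import group_testing_shapley, num_samples_eps_delta

# Illustrative setup: any scikit-learn compatible model and dataset will do.
data = Dataset.from_sklearn(load_breast_cancer())
u = Utility(LogisticRegression(max_iter=1000), data)

eps, delta = 0.1, 0.01
n_samples = num_samples_eps_delta(
    eps, delta, n=len(u.data.indices), utility_range=1.0  # accuracy lies in [0, 1]
)
values = group_testing_shapley(
    u, n_samples=n_samples, epsilon=eps, delta=delta, n_jobs=4
)
# Solver options are forwarded to cvxpy.Problem.solve(), e.g.
# group_testing_shapley(..., solver=cvxpy.CVXOPT).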

Source code in src/pydvl/value/shapley/gt.py
@deprecated(
    target=True,
    args_mapping={"config": "config"},
    deprecated_in="0.9.0",
    remove_in="0.10.0",
)
def group_testing_shapley(
    u: Utility,
    n_samples: int,
    epsilon: float,
    delta: float,
    *,
    n_jobs: int = 1,
    parallel_backend: Optional[ParallelBackend] = None,
    config: Optional[ParallelConfig] = None,
    progress: bool = False,
    seed: Optional[Seed] = None,
    **options: dict,
) -> ValuationResult:
    """Implements group testing for approximation of Shapley values as described
    in (Jia, R. et al., 2019)<sup><a href="#jia_efficient_2019">1</a></sup>.

    !!! Warning
        This method is very inefficient. It requires several orders of magnitude
        more evaluations of the utility than other Monte Carlo methods in
        [montecarlo][pydvl.value.shapley.montecarlo]. It also uses several intermediate
        objects like the results from the runners and the constraint matrices
        which can become rather large.

    By picking a specific distribution over subsets, the differences in Shapley
    values can be approximated with a Monte Carlo sum. These are then used to
    solve for the individual values in a feasibility problem.

    Args:
        u: Utility object with model, data, and scoring function
        n_samples: Number of tests to perform. Use
            [num_samples_eps_delta][pydvl.value.shapley.gt.num_samples_eps_delta]
            to estimate this.
        epsilon: From the (ε,δ) sample bound. Use the same value as for the
            estimation of `n_samples`.
        delta: From the (ε,δ) sample bound. Use the same value as for the
            estimation of `n_samples`.
        n_jobs: Number of parallel jobs to use. Each worker performs a chunk
            of all tests (i.e. utility evaluations).
        parallel_backend: Parallel backend instance to use
            for parallelizing computations. If `None`,
            use [JoblibParallelBackend][pydvl.parallel.backends.JoblibParallelBackend] backend.
            See the [Parallel Backends][pydvl.parallel.backends] package
            for available options.
        config: (**DEPRECATED**) Object configuring parallel computation,
            with cluster address, number of cpus, etc.
        progress: Whether to display progress bars for each job.
        seed: Either an instance of a numpy random number generator or a seed for it.
        options: Additional options to pass to
            [cvxpy.Problem.solve()](https://www.cvxpy.org/tutorial/advanced/index.html#solve-method-options).
            E.g. to change the solver (which defaults to `cvxpy.SCS`) pass
            `solver=cvxpy.CVXOPT`.

    Returns:
        Object with the data values.

    !!! tip "New in version 0.4.0"

    !!! tip "Changed in version 0.5.0"
        Changed the solver to cvxpy instead of scipy's linprog. Added the ability
        to pass arbitrary options to it.

    !!! tip "Changed in version 0.9.0"
        Deprecated `config` argument and added a `parallel_backend`
        argument to allow users to pass the Parallel Backend instance
        directly.
    """

    n = len(u.data.indices)

    const = _constants(
        n=n,
        epsilon=epsilon,
        delta=delta,
        utility_range=u.score_range.max() - u.score_range.min(),
    )
    T = n_samples
    if T < const.T:
        log.warning(
            f"n_samples of {T} are below the required {const.T} for the "
            f"ε={epsilon:.02f} guarantee at 1-δ={1 - delta:.02f} probability"
        )

    parallel_backend = _maybe_init_parallel_backend(parallel_backend, config)

    samples_per_job = max(1, n_samples // parallel_backend.effective_n_jobs(n_jobs))

    def reducer(
        results_it: Iterable[Tuple[NDArray, NDArray]]
    ) -> Tuple[NDArray, NDArray]:
        return np.concatenate(list(x[0] for x in results_it)).astype(
            np.float64  # np.float_ was removed in NumPy 2.0
        ), np.concatenate(list(x[1] for x in results_it)).astype(np.int_)

    seed_sequence = ensure_seed_sequence(seed)
    map_reduce_seed_sequence, cvxpy_seed = tuple(seed_sequence.spawn(2))

    map_reduce_job: MapReduceJob[Utility, Tuple[NDArray, NDArray]] = MapReduceJob(
        u,
        map_func=_group_testing_shapley,
        reduce_func=reducer,
        map_kwargs=dict(n_samples=samples_per_job, progress=progress),
        parallel_backend=parallel_backend,
        n_jobs=n_jobs,
    )
    uu, betas = map_reduce_job(seed=map_reduce_seed_sequence)

    # Matrix of estimated differences. See Eqs. (3) and (4) in the paper.
    C = np.zeros(shape=(n, n))
    for i in range(n):
        for j in range(i + 1, n):
            C[i, j] = np.dot(uu, betas[:, i] - betas[:, j])
    C *= const.Z / T
    total_utility = u(u.data.indices)

    ###########################################################################
    # Solution of the constraint problem with cvxpy

    v = cp.Variable(n)
    constraints = [cp.sum(v) == total_utility]
    for i in range(n):
        for j in range(i + 1, n):
            constraints.append(v[i] - v[j] <= epsilon + C[i, j])
            constraints.append(v[j] - v[i] <= epsilon - C[i, j])

    problem = cp.Problem(cp.Minimize(0), constraints)
    solver = options.pop("solver", cp.SCS)
    problem.solve(solver=solver, **options)

    if problem.status != "optimal":
        log.warning(f"cvxpy returned status {problem.status}")
        values = (
            np.nan * np.ones_like(u.data.indices)
            if not hasattr(v.value, "__len__")
            else v.value
        )
        status = Status.Failed
    else:
        values = v.value
        status = Status.Converged

    return ValuationResult(
        algorithm="group_testing_shapley",
        status=status,
        values=values,
        data_names=u.data.data_names,
        solver_status=problem.status,
    )