pydvl.value.shapley.gt ¶

This module implements Group Testing for the approximation of Shapley values, as introduced in (Jia, R. et al., 2019)¹. The sampling of index subsets is done in such a way that an approximation to the true Shapley values can be computed with guarantees.

Warning

This method is very inefficient. Potential improvements to the implementation notwithstanding, convergence seems to be very slow (in terms of evaluations of the utility required). We recommend other Monte Carlo methods instead.

You can read more in the documentation.

New in version 0.4.0

References¶

Jia, R. et al., 2019. Towards Efficient Data Valuation Based on the Shapley Value. In: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pp. 1167–1176. PMLR. ↩

num_samples_eps_delta ¶

num_samples_eps_delta(
    eps: float, delta: float, n: int, utility_range: float
) -> int

Implements the formula in Theorem 3 of (Jia, R. et al., 2019)¹ which gives a lower bound on the number of samples required to obtain an (ε/√n,δ/(N(N-1))-approximation to all pair-wise differences of Shapley values, wrt. \(\ell_2\) norm.

PARAMETER	DESCRIPTION
`eps`	ε TYPE: `float`
`delta`	δ TYPE: `float`
`n`	Number of data points TYPE: `int`
`utility_range`	Range of the Utility function TYPE: `float`

Returns: Number of samples from \(2^{[n]}\) guaranteeing ε/√n-correct Shapley pair-wise differences of values with probability 1-δ/(N(N-1)).

New in version 0.4.0

Source code in src/pydvl/value/shapley/gt.py

def num_samples_eps_delta(
    eps: float, delta: float, n: int, utility_range: float
) -> int:
    r"""Implements the formula in Theorem 3 of (Jia, R. et al., 2019)<sup><a href="#jia_efficient_2019">1</a></sup>
    which gives a lower bound on the number of samples required to obtain an
    (ε/√n,δ/(N(N-1))-approximation to all pair-wise differences of Shapley
    values, wrt. $\ell_2$ norm.

    Args:
        eps: ε
        delta: δ
        n: Number of data points
        utility_range: Range of the [Utility][pydvl.utils.utility.Utility] function
    Returns:
        Number of samples from $2^{[n]}$ guaranteeing ε/√n-correct Shapley
            pair-wise differences of values with probability 1-δ/(N(N-1)).

    !!! tip "New in version 0.4.0"

    """
    constants = _constants(n=n, epsilon=eps, delta=delta, utility_range=utility_range)
    return int(constants.T)

group_testing_shapley ¶

group_testing_shapley(
    u: Utility,
    n_samples: int,
    epsilon: float,
    delta: float,
    *,
    n_jobs: int = 1,
    parallel_backend: Optional[ParallelBackend] = None,
    config: Optional[ParallelConfig] = None,
    progress: bool = False,
    seed: Optional[Seed] = None,
    **options: dict
) -> ValuationResult

Implements group testing for approximation of Shapley values as described in (Jia, R. et al., 2019)¹.

Warning

This method is very inefficient. It requires several orders of magnitude more evaluations of the utility than others in montecarlo. It also uses several intermediate objects like the results from the runners and the constraint matrices which can become rather large.

By picking a specific distribution over subsets, the differences in Shapley values can be approximated with a Monte Carlo sum. These are then used to solve for the individual values in a feasibility problem.

PARAMETER	DESCRIPTION
`u`	Utility object with model, data, and scoring function TYPE: `Utility`
`n_samples`	Number of tests to perform. Use num_samples_eps_delta to estimate this. TYPE: `int`
`epsilon`	From the (ε,δ) sample bound. Use the same as for the estimation of `n_iterations`. TYPE: `float`
`delta`	From the (ε,δ) sample bound. Use the same as for the estimation of `n_iterations`. TYPE: `float`
`n_jobs`	Number of parallel jobs to use. Each worker performs a chunk of all tests (i.e. utility evaluations). TYPE: `int` DEFAULT: `1`
`parallel_backend`	Parallel backend instance to use for parallelizing computations. If `None`, use JoblibParallelBackend backend. See the Parallel Backends package for available options. TYPE: `Optional[ParallelBackend]` DEFAULT: `None`
`config`	(DEPRECATED) Object configuring parallel computation, with cluster address, number of cpus, etc. TYPE: `Optional[ParallelConfig]` DEFAULT: `None`
`progress`	Whether to display progress bars for each job. TYPE: `bool` DEFAULT: `False`
`seed`	Either an instance of a numpy random number generator or a seed for it. TYPE: `Optional[Seed]` DEFAULT: `None`
`options`	Additional options to pass to cvxpy.Problem.solve(). E.g. to change the solver (which defaults to `cvxpy.SCS`) pass `solver=cvxpy.CVXOPT`. TYPE: `dict` DEFAULT: `{}`

RETURNS	DESCRIPTION
`ValuationResult`	Object with the data values.

New in version 0.4.0

Changed in version 0.5.0

Changed the solver to cvxpy instead of scipy's linprog. Added the ability to pass arbitrary options to it.

Changed in version 0.9.0

Deprecated config argument and added a parallel_backend argument to allow users to pass the Parallel Backend instance directly.

Source code in src/pydvl/value/shapley/gt.py

@deprecated(
    target=True,
    args_mapping={"config": "config"},
    deprecated_in="0.9.0",
    remove_in="0.10.0",
)
def group_testing_shapley(
    u: Utility,
    n_samples: int,
    epsilon: float,
    delta: float,
    *,
    n_jobs: int = 1,
    parallel_backend: Optional[ParallelBackend] = None,
    config: Optional[ParallelConfig] = None,
    progress: bool = False,
    seed: Optional[Seed] = None,
    **options: dict,
) -> ValuationResult:
    """Implements group testing for approximation of Shapley values as described
    in (Jia, R. et al., 2019)<sup><a href="#jia_efficient_2019">1</a></sup>.

    !!! Warning
        This method is very inefficient. It requires several orders of magnitude
        more evaluations of the utility than others in
        [montecarlo][pydvl.value.shapley.montecarlo]. It also uses several intermediate
        objects like the results from the runners and the constraint matrices
        which can become rather large.

    By picking a specific distribution over subsets, the differences in Shapley
    values can be approximated with a Monte Carlo sum. These are then used to
    solve for the individual values in a feasibility problem.

    Args:
        u: Utility object with model, data, and scoring function
        n_samples: Number of tests to perform. Use
            [num_samples_eps_delta][pydvl.value.shapley.gt.num_samples_eps_delta]
            to estimate this.
        epsilon: From the (ε,δ) sample bound. Use the same as for the
            estimation of `n_iterations`.
        delta: From the (ε,δ) sample bound. Use the same as for the
            estimation of `n_iterations`.
        n_jobs: Number of parallel jobs to use. Each worker performs a chunk
            of all tests (i.e. utility evaluations).
        parallel_backend: Parallel backend instance to use
            for parallelizing computations. If `None`,
            use [JoblibParallelBackend][pydvl.parallel.backends.JoblibParallelBackend] backend.
            See the [Parallel Backends][pydvl.parallel.backends] package
            for available options.
        config: (**DEPRECATED**) Object configuring parallel computation,
            with cluster address, number of cpus, etc.
        progress: Whether to display progress bars for each job.
        seed: Either an instance of a numpy random number generator or a seed for it.
        options: Additional options to pass to
            [cvxpy.Problem.solve()](https://www.cvxpy.org/tutorial/advanced/index.html#solve-method-options).
            E.g. to change the solver (which defaults to `cvxpy.SCS`) pass
            `solver=cvxpy.CVXOPT`.

    Returns:
        Object with the data values.

    !!! tip "New in version 0.4.0"

    !!! tip "Changed in version 0.5.0"
        Changed the solver to cvxpy instead of scipy's linprog. Added the ability
        to pass arbitrary options to it.

    !!! tip "Changed in version 0.9.0"
        Deprecated `config` argument and added a `parallel_backend`
        argument to allow users to pass the Parallel Backend instance
        directly.
    """

    n = len(u.data.indices)

    const = _constants(
        n=n,
        epsilon=epsilon,
        delta=delta,
        utility_range=u.score_range.max() - u.score_range.min(),
    )
    T = n_samples
    if T < const.T:
        log.warning(
            f"n_samples of {T} are below the required {const.T} for the "
            f"ε={epsilon:.02f} guarantee at δ={1 - delta:.02f} probability"
        )

    parallel_backend = _maybe_init_parallel_backend(parallel_backend, config)

    samples_per_job = max(1, n_samples // parallel_backend.effective_n_jobs(n_jobs))

    def reducer(
        results_it: Iterable[Tuple[NDArray, NDArray]]
    ) -> Tuple[NDArray, NDArray]:
        return np.concatenate(list(x[0] for x in results_it)).astype(
            np.float_
        ), np.concatenate(list(x[1] for x in results_it)).astype(np.int_)

    seed_sequence = ensure_seed_sequence(seed)
    map_reduce_seed_sequence, cvxpy_seed = tuple(seed_sequence.spawn(2))

    map_reduce_job: MapReduceJob[Utility, Tuple[NDArray, NDArray]] = MapReduceJob(
        u,
        map_func=_group_testing_shapley,
        reduce_func=reducer,
        map_kwargs=dict(n_samples=samples_per_job, progress=progress),
        parallel_backend=parallel_backend,
        n_jobs=n_jobs,
    )
    uu, betas = map_reduce_job(seed=map_reduce_seed_sequence)

    # Matrix of estimated differences. See Eqs. (3) and (4) in the paper.
    C = np.zeros(shape=(n, n))
    for i in range(n):
        for j in range(i + 1, n):
            C[i, j] = np.dot(uu, betas[:, i] - betas[:, j])
    C *= const.Z / T
    total_utility = u(u.data.indices)

    ###########################################################################
    # Solution of the constraint problem with cvxpy

    v = cp.Variable(n)
    constraints = [cp.sum(v) == total_utility]
    for i in range(n):
        for j in range(i + 1, n):
            constraints.append(v[i] - v[j] <= epsilon + C[i, j])
            constraints.append(v[j] - v[i] <= epsilon - C[i, j])

    problem = cp.Problem(cp.Minimize(0), constraints)
    solver = options.pop("solver", cp.SCS)
    problem.solve(solver=solver, **options)

    if problem.status != "optimal":
        log.warning(f"cvxpy returned status {problem.status}")
        values = (
            np.nan * np.ones_like(u.data.indices)
            if not hasattr(v.value, "__len__")
            else v.value
        )
        status = Status.Failed
    else:
        values = v.value
        status = Status.Converged

    return ValuationResult(
        algorithm="group_testing_shapley",
        status=status,
        values=values,
        data_names=u.data.data_names,
        solver_status=problem.status,
    )