pydvl.value.shapley.naive ¶

This module implements exact Shapley values using either the combinatorial or permutation definition.

The exact computation of \(n\) values takes \(\mathcal{O}(2^n)\) evaluations of the utility and is therefore only possible for small datasets. For larger datasets, consider using any of the approximations, such as Monte Carlo, or proxy models like kNN.

See Data valuation for details.

permutation_exact_shapley ¶

permutation_exact_shapley(
    u: Utility, *, progress: bool = True
) -> ValuationResult

Computes the exact Shapley value using the formulation with permutations:

\[v_u(x_i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)} [u(\sigma_{i-1} \cup {i}) − u(\sigma_{i})].\]

See Data valuation for details.

When the length of the training set is > 10 this prints a warning since the computation becomes too expensive. Used mostly for internal testing and simple use cases. Please refer to the Monte Carlo approximations for practical applications.

PARAMETER	DESCRIPTION
`u`	Utility object with model, data, and scoring function TYPE: `Utility`
`progress`	Whether to display progress bars for each job. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`ValuationResult`	Object with the data values.

Source code in src/pydvl/value/shapley/naive.py

def permutation_exact_shapley(u: Utility, *, progress: bool = True) -> ValuationResult:
    r"""Computes the exact Shapley value using the formulation with permutations:

    $$v_u(x_i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)} [u(\sigma_{i-1}
    \cup {i}) − u(\sigma_{i})].$$

    See [Data valuation][data-valuation] for details.

    When the length of the training set is > 10 this prints a warning since the
    computation becomes too expensive. Used mostly for internal testing and
    simple use cases. Please refer to the [Monte Carlo
    approximations][pydvl.value.shapley.montecarlo] for practical applications.

    Args:
        u: Utility object with model, data, and scoring function
        progress: Whether to display progress bars for each job.

    Returns:
        Object with the data values.
    """

    n = len(u.data)
    # Note that the cache in utility saves most of the refitting because we
    # use frozenset for the input.
    if n > 10:
        warnings.warn(
            f"Large dataset! Computation requires {n}! calls to utility()",
            RuntimeWarning,
        )

    values = np.zeros(n)
    for p in tqdm(
        permutations(u.data.indices),
        disable=not progress,
        desc="Permutation",
        total=math.factorial(n),
    ):
        for i, idx in enumerate(p):
            values[idx] += u(p[: i + 1]) - u(p[:i])
    values /= math.factorial(n)

    return ValuationResult(
        algorithm="permutation_exact_shapley",
        status=Status.Converged,
        values=values,
        data_names=u.data.data_names,
    )

combinatorial_exact_shapley ¶

combinatorial_exact_shapley(
    u: Utility,
    *,
    n_jobs: int = 1,
    parallel_backend: Optional[ParallelBackend] = None,
    config: Optional[ParallelConfig] = None,
    progress: bool = False
) -> ValuationResult

Computes the exact Shapley value using the combinatorial definition.

\[v_u(i) = \frac{1}{n} \sum_{S \subseteq N \setminus \{i\}} \binom{n-1}{ | S | }^{-1} [u(S \cup \{i\}) − u(S)].\]

See Data valuation for details.

Note

If the length of the training set is > n_jobs*20 this prints a warning because the computation is very expensive. Used mostly for internal testing and simple use cases. Please refer to the Monte Carlo approximations for practical applications.

PARAMETER	DESCRIPTION
`u`	Utility object with model, data, and scoring function TYPE: `Utility`
`n_jobs`	Number of parallel jobs to use TYPE: `int` DEFAULT: `1`
`parallel_backend`	Parallel backend instance to use for parallelizing computations. If `None`, use JoblibParallelBackend backend. See the Parallel Backends package for available options. TYPE: `Optional[ParallelBackend]` DEFAULT: `None`
`config`	(DEPRECATED) Object configuring parallel computation, with cluster address, number of cpus, etc. TYPE: `Optional[ParallelConfig]` DEFAULT: `None`
`progress`	Whether to display progress bars for each job. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`ValuationResult`	Object with the data values.

Changed in version 0.9.0

Deprecated config argument and added a parallel_backend argument to allow users to pass the Parallel Backend instance directly.

Source code in src/pydvl/value/shapley/naive.py

@deprecated(
    target=True,
    args_mapping={"config": "config"},
    deprecated_in="0.9.0",
    remove_in="0.10.0",
)
def combinatorial_exact_shapley(
    u: Utility,
    *,
    n_jobs: int = 1,
    parallel_backend: Optional[ParallelBackend] = None,
    config: Optional[ParallelConfig] = None,
    progress: bool = False,
) -> ValuationResult:
    r"""Computes the exact Shapley value using the combinatorial definition.

    $$v_u(i) = \frac{1}{n} \sum_{S \subseteq N \setminus \{i\}}
    \binom{n-1}{ | S | }^{-1} [u(S \cup \{i\}) − u(S)].$$

    See [Data valuation][data-valuation] for details.

    !!! Note
        If the length of the training set is > n_jobs*20 this prints a warning
        because the computation is very expensive. Used mostly for internal
        testing and simple use cases. Please refer to the
        [Monte Carlo][pydvl.value.shapley.montecarlo] approximations for
        practical applications.

    Args:
        u: Utility object with model, data, and scoring function
        n_jobs: Number of parallel jobs to use
        parallel_backend: Parallel backend instance to use
            for parallelizing computations. If `None`,
            use [JoblibParallelBackend][pydvl.parallel.backends.JoblibParallelBackend] backend.
            See the [Parallel Backends][pydvl.parallel.backends] package
            for available options.
        config: (**DEPRECATED**) Object configuring parallel computation,
            with cluster address, number of cpus, etc.
        progress: Whether to display progress bars for each job.

    Returns:
        Object with the data values.

    !!! tip "Changed in version 0.9.0"
        Deprecated `config` argument and added a `parallel_backend`
        argument to allow users to pass the Parallel Backend instance
        directly.
    """
    # Arbitrary choice, will depend on time required, caching, etc.
    if len(u.data) // n_jobs > 20:
        warnings.warn(
            f"Large dataset! Computation requires 2^{len(u.data)} calls to model.fit()"
        )

    def reduce_fun(results: List[NDArray]) -> NDArray:
        return np.array(results).sum(axis=0)  # type: ignore

    parallel_backend = _maybe_init_parallel_backend(parallel_backend, config)

    map_reduce_job: MapReduceJob[NDArray, NDArray] = MapReduceJob(
        u.data.indices,
        map_func=_combinatorial_exact_shapley,
        map_kwargs=dict(u=u, progress=progress),
        reduce_func=reduce_fun,
        n_jobs=n_jobs,
        parallel_backend=parallel_backend,
    )
    values = map_reduce_job()
    return ValuationResult(
        algorithm="combinatorial_exact_shapley",
        status=Status.Converged,
        values=values,
        data_names=u.data.data_names,
    )