Skip to content

Naive

permutation_exact_shapley(u, *, progress=True)

Computes the exact Shapley value using the formulation with permutations:

\[v_u(x_i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)} [u(\sigma_{i-1} \cup {i}) − u(\sigma_{i})].\]

See Data valuation for details.

When the length of the training set is > 10 this prints a warning since the computation becomes too expensive. Used mostly for internal testing and simple use cases. Please refer to the Monte Carlo approximations for practical applications.

PARAMETER DESCRIPTION
u

Utility object with model, data, and scoring function

TYPE: Utility

progress

Whether to display progress bars for each job.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
ValuationResult

Object with the data values.

Source code in src/pydvl/value/shapley/naive.py
def permutation_exact_shapley(u: Utility, *, progress: bool = True) -> ValuationResult:
    r"""Computes the exact Shapley value using the formulation with permutations:

    $$v_u(x_i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)} [u(\sigma_{i-1} \cup {i}) − u(\sigma_{i})].$$

    See [Data valuation][computing-data-values] for details.

    When the length of the training set is > 10 this prints a warning since the
    computation becomes too expensive. Used mostly for internal testing and
    simple use cases. Please refer to the [Monte Carlo
    approximations][pydvl.value.shapley.montecarlo] for practical applications.

    Args:
        u: Utility object with model, data, and scoring function
        progress: Whether to display progress bars for each job.

    Returns:
        Object with the data values.
    """

    n = len(u.data)
    # Note that the cache in utility saves most of the refitting because we
    # use frozenset for the input.
    if n > 10:
        warnings.warn(
            f"Large dataset! Computation requires {n}! calls to utility()",
            RuntimeWarning,
        )

    values = np.zeros(n)
    for p in tqdm(
        permutations(u.data.indices),
        disable=not progress,
        desc="Permutation",
        total=math.factorial(n),
    ):
        for i, idx in enumerate(p):
            values[idx] += u(p[: i + 1]) - u(p[:i])
    values /= math.factorial(n)

    return ValuationResult(
        algorithm="permutation_exact_shapley",
        status=Status.Converged,
        values=values,
        data_names=u.data.data_names,
    )

combinatorial_exact_shapley(u, *, n_jobs=1, config=ParallelConfig(), progress=False)

Computes the exact Shapley value using the combinatorial definition.

\[v_u(i) = \frac{1}{n} \sum_{S \subseteq N \setminus \{i\}} \binom{n-1}{ | S | }^{-1} [u(S \cup \{i\}) − u(S)].\]

See Data valuation for details.

Note

If the length of the training set is > n_jobs*20 this prints a warning because the computation is very expensive. Used mostly for internal testing and simple use cases. Please refer to the Monte Carlo approximations for practical applications.

PARAMETER DESCRIPTION
u

Utility object with model, data, and scoring function

TYPE: Utility

n_jobs

Number of parallel jobs to use

TYPE: int DEFAULT: 1

config

Object configuring parallel computation, with cluster address, number of cpus, etc.

TYPE: ParallelConfig DEFAULT: ParallelConfig()

progress

Whether to display progress bars for each job.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
ValuationResult

Object with the data values.

Source code in src/pydvl/value/shapley/naive.py
def combinatorial_exact_shapley(
    u: Utility,
    *,
    n_jobs: int = 1,
    config: ParallelConfig = ParallelConfig(),
    progress: bool = False,
) -> ValuationResult:
    r"""Computes the exact Shapley value using the combinatorial definition.

    $$v_u(i) = \frac{1}{n} \sum_{S \subseteq N \setminus \{i\}} \binom{n-1}{ | S | }^{-1} [u(S \cup \{i\}) − u(S)].$$

    See [Data valuation][computing-data-values] for details.

    !!! Note
        If the length of the training set is > n_jobs*20 this prints a warning
        because the computation is very expensive. Used mostly for internal testing
        and simple use cases. Please refer to the
        [Monte Carlo][pydvl.value.shapley.montecarlo] approximations for practical
        applications.

    Args:
        u: Utility object with model, data, and scoring function
        n_jobs: Number of parallel jobs to use
        config: Object configuring parallel computation, with cluster address,
            number of cpus, etc.
        progress: Whether to display progress bars for each job.

    Returns:
        Object with the data values.
    """
    # Arbitrary choice, will depend on time required, caching, etc.
    if len(u.data) // n_jobs > 20:
        warnings.warn(
            f"Large dataset! Computation requires 2^{len(u.data)} calls to model.fit()"
        )

    def reduce_fun(results: List[NDArray]) -> NDArray:
        return np.array(results).sum(axis=0)  # type: ignore

    map_reduce_job: MapReduceJob[NDArray, NDArray] = MapReduceJob(
        u.data.indices,
        map_func=_combinatorial_exact_shapley,
        map_kwargs=dict(u=u, progress=progress),
        reduce_func=reduce_fun,
        n_jobs=n_jobs,
        config=config,
    )
    values = map_reduce_job()
    return ValuationResult(
        algorithm="combinatorial_exact_shapley",
        status=Status.Converged,
        values=values,
        data_names=u.data.data_names,
    )