Semivalues

This module provides the core functionality for the computation of generic semi-values. A semi-value is any valuation function with the form:

\[v_\text{semi}(i) = \sum_{i=1}^n w(k) \sum_{S \subset D_{-i}^{(k)}} [U(S_{+i})-U(S)],\]

where the coefficients \(w(k)\) satisfy the property:

\[\sum_{k=1}^n w(k) = 1.\]

Note

For implementation consistency, we slightly depart from the common definition of semi-values, which includes a factor \(1/n\) in the sum over subsets. Instead, we subsume this factor into the coefficient \(w(k)\).

As such, the computation of a semi-value requires two components:

A subset sampler that generates subsets of the set \(D\) of interest.
A coefficient \(w(k)\) that assigns a weight to each subset size \(k\).

Samplers can be found in sampler, and can be classified into two categories: powerset samplers and permutation samplers. Powerset samplers generate subsets of \(D_{-i}\), while the permutation sampler generates permutations of \(D\). The former conform to the above definition of semi-values, while the latter reformulates it as:

\[ v(i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)} \tilde{w}( | \sigma_{:i} | )[U(\sigma_{:i} \cup \{i\}) − U(\sigma_{:i})], \]

where \(\sigma_{:i}\) denotes the set of indices in permutation sigma before the position where \(i\) appears (see Data valuation for details), and

\[ \tilde{w} (k) = n \binom{n - 1}{k} w (k) \]

is the weight correction due to the reformulation.

Warning

Both PermutationSampler and DeterministicPermutationSampler require caching to be enabled or computation will be doubled wrt. a 'direct' implementation of permutation MC.

There are several pre-defined coefficients, including the Shapley value of (Ghorbani and Zou, 2019)¹, the Banzhaf index of (Wang and Jia)³, and the Beta coefficient of (Kwon and Zou, 2022)². For each of these methods, there is a convenience wrapper function. Respectively, these are: compute_shapley_semivalues, compute_banzhaf_semivalues, and compute_beta_shapley_semivalues. instead.

References¶

Ghorbani, A., Zou, J., 2019. Data Shapley: Equitable Valuation of Data for Machine Learning. In: Proceedings of the 36th International Conference on Machine Learning, PMLR, pp. 2242–2251. ↩
Kwon, Y. and Zou, J., 2022. Beta Shapley: A Unified and Noise-reduced Data Valuation Framework for Machine Learning. In: Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) 2022, Vol. 151. PMLR, Valencia, Spain. ↩
Wang, J.T. and Jia, R., 2022. Data Banzhaf: A Robust Data Valuation Framework for Machine Learning. ArXiv preprint arXiv:2205.15466. ↩

`SVCoefficient` ¶

Bases: Protocol

A coefficient for the computation of semi-values.

`call(n, k)` ¶

Computes the coefficient for a given subset size.

PARAMETER	DESCRIPTION
`n`	Total number of elements in the set. TYPE: `int`
`k`	Size of the subset for which the coefficient is being computed TYPE: `int`

Source code in src/pydvl/value/semivalues.py

def __call__(self, n: int, k: int) -> float:
    """Computes the coefficient for a given subset size.

    Args:
        n: Total number of elements in the set.
        k: Size of the subset for which the coefficient is being computed
    """
    ...

`SemiValueMode` ¶

Bases: str, Enum

Enumeration of semi-value modes.

Deprecation notice

This enum and the associated methods are deprecated and will be removed in 0.8.0.

`compute_generic_semivalues(sampler, u, coefficient, done, *, n_jobs=1, config=ParallelConfig(), progress=False)` ¶

Computes semi-values for a given utility function and subset sampler.

PARAMETER	DESCRIPTION
`sampler`	The subset sampler to use for utility computations. TYPE: `PowersetSampler`
`u`	Utility object with model, data, and scoring function. TYPE: `Utility`
`coefficient`	The semi-value coefficient TYPE: `SVCoefficient`
`done`	Stopping criterion. TYPE: `StoppingCriterion`
`n_jobs`	Number of parallel jobs to use. TYPE: `int` DEFAULT: `1`
`config`	Object configuring parallel computation, with cluster address, number of cpus, etc. TYPE: `ParallelConfig` DEFAULT: `ParallelConfig()`
`progress`	Whether to display a progress bar. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`ValuationResult`	Object with the results.

Source code in src/pydvl/value/semivalues.py

def compute_generic_semivalues(
    sampler: PowersetSampler,
    u: Utility,
    coefficient: SVCoefficient,
    done: StoppingCriterion,
    *,
    n_jobs: int = 1,
    config: ParallelConfig = ParallelConfig(),
    progress: bool = False,
) -> ValuationResult:
    """Computes semi-values for a given utility function and subset sampler.

    Args:
        sampler: The subset sampler to use for utility computations.
        u: Utility object with model, data, and scoring function.
        coefficient: The semi-value coefficient
        done: Stopping criterion.
        n_jobs: Number of parallel jobs to use.
        config: Object configuring parallel computation, with cluster
            address, number of cpus, etc.
        progress: Whether to display a progress bar.

    Returns:
        Object with the results.
    """
    from concurrent.futures import FIRST_COMPLETED, Future, wait

    from pydvl.utils import effective_n_jobs, init_executor, init_parallel_backend

    if isinstance(sampler, PermutationSampler) and not u.enable_cache:
        log.warning(
            "PermutationSampler requires caching to be enabled or computation "
            "will be doubled wrt. a 'direct' implementation of permutation MC"
        )

    result = ValuationResult.zeros(
        algorithm=f"semivalue-{str(sampler)}-{coefficient.__name__}",  # type: ignore
        indices=u.data.indices,
        data_names=u.data.data_names,
    )

    parallel_backend = init_parallel_backend(config)
    u = parallel_backend.put(u)
    correction = parallel_backend.put(
        lambda n, k: coefficient(n, k) * sampler.weight(n, k)
    )

    max_workers = effective_n_jobs(n_jobs, config)
    n_submitted_jobs = 2 * max_workers  # number of jobs in the queue

    sampler_it = iter(sampler)
    pbar = tqdm(disable=not progress, total=100, unit="%")

    with init_executor(
        max_workers=max_workers, config=config, cancel_futures=True
    ) as executor:
        pending: set[Future] = set()
        while True:
            pbar.n = 100 * done.completion()
            pbar.refresh()

            completed, pending = wait(pending, timeout=1, return_when=FIRST_COMPLETED)
            for future in completed:
                idx, marginal = future.result()
                result.update(idx, marginal)
                if done(result):
                    return result

            # Ensure that we always have n_submitted_jobs running
            try:
                for _ in range(n_submitted_jobs - len(pending)):
                    pending.add(
                        executor.submit(
                            _marginal,
                            u=u,
                            coefficient=correction,
                            sample=next(sampler_it),
                        )
                    )
            except StopIteration:
                if len(pending) == 0:
                    return result

`compute_shapley_semivalues(u, *, done=MaxUpdates(100), sampler_t=PermutationSampler, n_jobs=1, config=ParallelConfig(), progress=False, seed=None)` ¶

Computes Shapley values for a given utility function.

This is a convenience wrapper for compute_generic_semivalues with the Shapley coefficient. Use compute_shapley_values for a more flexible interface and additional methods, including TMCS.

PARAMETER	DESCRIPTION
`u`	Utility object with model, data, and scoring function. TYPE: `Utility`
`done`	Stopping criterion. TYPE: `StoppingCriterion` DEFAULT: `MaxUpdates(100)`
`sampler_t`	The sampler type to use. See :mod:`pydvl.value.sampler` for a list. TYPE: `Type[StochasticSampler]` DEFAULT: `PermutationSampler`
`n_jobs`	Number of parallel jobs to use. TYPE: `int` DEFAULT: `1`
`config`	Object configuring parallel computation, with cluster address, number of cpus, etc. TYPE: `ParallelConfig` DEFAULT: `ParallelConfig()`
`seed`	Either an instance of a numpy random number generator or a seed for it. TYPE: `Optional[Seed]` DEFAULT: `None`
`progress`	Whether to display a progress bar. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`ValuationResult`	Object with the results.

Source code in src/pydvl/value/semivalues.py

def compute_shapley_semivalues(
    u: Utility,
    *,
    done: StoppingCriterion = MaxUpdates(100),
    sampler_t: Type[StochasticSampler] = PermutationSampler,
    n_jobs: int = 1,
    config: ParallelConfig = ParallelConfig(),
    progress: bool = False,
    seed: Optional[Seed] = None,
) -> ValuationResult:
    """Computes Shapley values for a given utility function.

    This is a convenience wrapper for
    [compute_generic_semivalues][pydvl.value.semivalues.compute_generic_semivalues]
    with the Shapley coefficient. Use
    [compute_shapley_values][pydvl.value.shapley.common.compute_shapley_values]
    for a more flexible interface and additional methods, including TMCS.

    Args:
        u: Utility object with model, data, and scoring function.
        done: Stopping criterion.
        sampler_t: The sampler type to use. See :mod:`pydvl.value.sampler`
            for a list.
        n_jobs: Number of parallel jobs to use.
        config: Object configuring parallel computation, with cluster
            address, number of cpus, etc.
        seed: Either an instance of a numpy random number generator or a seed for it.
        progress: Whether to display a progress bar.

    Returns:
        Object with the results.
    """
    return compute_generic_semivalues(
        sampler_t(u.data.indices, seed=seed),
        u,
        shapley_coefficient,
        done,
        n_jobs=n_jobs,
        config=config,
        progress=progress,
    )

`compute_banzhaf_semivalues(u, *, done=MaxUpdates(100), sampler_t=PermutationSampler, n_jobs=1, config=ParallelConfig(), progress=False, seed=None)` ¶

Computes Banzhaf values for a given utility function.

This is a convenience wrapper for compute_generic_semivalues with the Banzhaf coefficient.

PARAMETER	DESCRIPTION
`u`	Utility object with model, data, and scoring function. TYPE: `Utility`
`done`	Stopping criterion. TYPE: `StoppingCriterion` DEFAULT: `MaxUpdates(100)`
`sampler_t`	The sampler type to use. See :mod:`pydvl.value.sampler` for a list. TYPE: `Type[StochasticSampler]` DEFAULT: `PermutationSampler`
`n_jobs`	Number of parallel jobs to use. TYPE: `int` DEFAULT: `1`
`seed`	Either an instance of a numpy random number generator or a seed for it. TYPE: `Optional[Seed]` DEFAULT: `None`
`config`	Object configuring parallel computation, with cluster address, number of cpus, etc. TYPE: `ParallelConfig` DEFAULT: `ParallelConfig()`
`progress`	Whether to display a progress bar. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`ValuationResult`	Object with the results.

Source code in src/pydvl/value/semivalues.py

def compute_banzhaf_semivalues(
    u: Utility,
    *,
    done: StoppingCriterion = MaxUpdates(100),
    sampler_t: Type[StochasticSampler] = PermutationSampler,
    n_jobs: int = 1,
    config: ParallelConfig = ParallelConfig(),
    progress: bool = False,
    seed: Optional[Seed] = None,
) -> ValuationResult:
    """Computes Banzhaf values for a given utility function.

    This is a convenience wrapper for
    [compute_generic_semivalues][pydvl.value.semivalues.compute_generic_semivalues]
    with the Banzhaf coefficient.

    Args:
        u: Utility object with model, data, and scoring function.
        done: Stopping criterion.
        sampler_t: The sampler type to use. See :mod:`pydvl.value.sampler` for a
            list.
        n_jobs: Number of parallel jobs to use.
        seed: Either an instance of a numpy random number generator or a seed for it.
        config: Object configuring parallel computation, with cluster address,
            number of cpus, etc.
        progress: Whether to display a progress bar.

    Returns:
        Object with the results.
    """
    return compute_generic_semivalues(
        sampler_t(u.data.indices, seed=seed),
        u,
        banzhaf_coefficient,
        done,
        n_jobs=n_jobs,
        config=config,
        progress=progress,
    )

`compute_beta_shapley_semivalues(u, *, alpha=1, beta=1, done=MaxUpdates(100), sampler_t=PermutationSampler, n_jobs=1, config=ParallelConfig(), progress=False, seed=None)` ¶

Computes Beta Shapley values for a given utility function.

This is a convenience wrapper for compute_generic_semivalues with the Beta Shapley coefficient.

PARAMETER	DESCRIPTION
`u`	Utility object with model, data, and scoring function. TYPE: `Utility`
`alpha`	Alpha parameter of the Beta distribution. TYPE: `float` DEFAULT: `1`
`beta`	Beta parameter of the Beta distribution. TYPE: `float` DEFAULT: `1`
`done`	Stopping criterion. TYPE: `StoppingCriterion` DEFAULT: `MaxUpdates(100)`
`sampler_t`	The sampler type to use. See :mod:`pydvl.value.sampler` for a list. TYPE: `Type[StochasticSampler]` DEFAULT: `PermutationSampler`
`n_jobs`	Number of parallel jobs to use. TYPE: `int` DEFAULT: `1`
`seed`	Either an instance of a numpy random number generator or a seed for it. TYPE: `Optional[Seed]` DEFAULT: `None`
`config`	Object configuring parallel computation, with cluster address, number of cpus, etc. TYPE: `ParallelConfig` DEFAULT: `ParallelConfig()`
`progress`	Whether to display a progress bar. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`ValuationResult`	Object with the results.

Source code in src/pydvl/value/semivalues.py

def compute_beta_shapley_semivalues(
    u: Utility,
    *,
    alpha: float = 1,
    beta: float = 1,
    done: StoppingCriterion = MaxUpdates(100),
    sampler_t: Type[StochasticSampler] = PermutationSampler,
    n_jobs: int = 1,
    config: ParallelConfig = ParallelConfig(),
    progress: bool = False,
    seed: Optional[Seed] = None,
) -> ValuationResult:
    """Computes Beta Shapley values for a given utility function.

    This is a convenience wrapper for
    [compute_generic_semivalues][pydvl.value.semivalues.compute_generic_semivalues]
    with the Beta Shapley coefficient.

    Args:
        u: Utility object with model, data, and scoring function.
        alpha: Alpha parameter of the Beta distribution.
        beta: Beta parameter of the Beta distribution.
        done: Stopping criterion.
        sampler_t: The sampler type to use. See :mod:`pydvl.value.sampler` for a list.
        n_jobs: Number of parallel jobs to use.
        seed: Either an instance of a numpy random number generator or a seed for it.
        config: Object configuring parallel computation, with cluster address, number of
            cpus, etc.
        progress: Whether to display a progress bar.

    Returns:
        Object with the results.
    """
    return compute_generic_semivalues(
        sampler_t(u.data.indices, seed=seed),
        u,
        beta_coefficient(alpha, beta),
        done,
        n_jobs=n_jobs,
        config=config,
        progress=progress,
    )

`compute_semivalues(u, *, done=MaxUpdates(100), mode=SemiValueMode.Shapley, sampler_t=PermutationSampler, n_jobs=1, seed=None, **kwargs)` ¶

Convenience entry point for most common semi-value computations.

Deprecation warning

This method is deprecated and will be replaced in 0.8.0 by the more general implementation of compute_generic_semivalues. Use compute_shapley_semivalues, compute_banzhaf_semivalues, or compute_beta_shapley_semivalues instead.

The modes supported with this interface are the following. For greater flexibility use compute_generic_semivalues directly.

SemiValueMode.Shapley: Shapley values.
[SemiValueMode.BetaShapley][pydvl.value.semivalues.SemiValueMode.BetaShapley]: Implements the Beta Shapley semi-value as introduced in (Kwon and Zou, 2022)¹. Pass additional keyword arguments alpha and beta to set the parameters of the Beta distribution (both default to 1).
[SemiValueMode.Banzhaf][]: Implements the Banzhaf semi-value as introduced in (Wang and Jia, 2022)¹.

See [[data-valuation]] for an overview of valuation. - SemiValueMode.Banzhaf: Implements the Banzhaf semi-value as introduced in [@wang_data_2022].

PARAMETER	DESCRIPTION
`u`	Utility object with model, data, and scoring function. TYPE: `Utility`
`done`	Stopping criterion. TYPE: `StoppingCriterion` DEFAULT: `MaxUpdates(100)`
`mode`	The semi-value mode to use. See SemiValueMode for a list. TYPE: `SemiValueMode` DEFAULT: `Shapley`
`sampler_t`	The sampler type to use. See sampler for a list. TYPE: `Type[StochasticSampler]` DEFAULT: `PermutationSampler`
`n_jobs`	Number of parallel jobs to use. TYPE: `int` DEFAULT: `1`
`seed`	Either an instance of a numpy random number generator or a seed for it. TYPE: `Optional[Seed]` DEFAULT: `None`
`kwargs`	Additional keyword arguments passed to compute_generic_semivalues. DEFAULT: `{}`

RETURNS	DESCRIPTION
`ValuationResult`	Object with the results.

Source code in src/pydvl/value/semivalues.py

@deprecated(target=True, deprecated_in="0.7.0", remove_in="0.8.0")
def compute_semivalues(
    u: Utility,
    *,
    done: StoppingCriterion = MaxUpdates(100),
    mode: SemiValueMode = SemiValueMode.Shapley,
    sampler_t: Type[StochasticSampler] = PermutationSampler,
    n_jobs: int = 1,
    seed: Optional[Seed] = None,
    **kwargs,
) -> ValuationResult:
    """Convenience entry point for most common semi-value computations.

    !!! warning "Deprecation warning"
        This method is deprecated and will be replaced in 0.8.0 by the more
        general implementation of
        [compute_generic_semivalues][pydvl.value.semivalues.compute_generic_semivalues].
        Use
        [compute_shapley_semivalues][pydvl.value.semivalues.compute_shapley_semivalues],
        [compute_banzhaf_semivalues][pydvl.value.semivalues.compute_banzhaf_semivalues],
        or
        [compute_beta_shapley_semivalues][pydvl.value.semivalues.compute_beta_shapley_semivalues]
        instead.

    The modes supported with this interface are the following. For greater
    flexibility use
    [compute_generic_semivalues][pydvl.value.semivalues.compute_generic_semivalues]
    directly.

    - [SemiValueMode.Shapley][pydvl.value.semivalues.SemiValueMode]:
      Shapley values.
    - [SemiValueMode.BetaShapley][pydvl.value.semivalues.SemiValueMode.BetaShapley]:
      Implements the Beta Shapley semi-value as introduced in
      (Kwon and Zou, 2022)<sup><a href="#kwon_beta_2022">1</a></sup>.
      Pass additional keyword arguments `alpha` and `beta` to set the
      parameters of the Beta distribution (both default to 1).
    - [SemiValueMode.Banzhaf][SemiValueMode.Banzhaf]: Implements the Banzhaf
      semi-value as introduced in (Wang and Jia, 2022)<sup><a href="#wang_data_2022">1</a></sup>.

    See [[data-valuation]] for an overview of valuation.
    - [SemiValueMode.Banzhaf][pydvl.value.semivalues.SemiValueMode]: Implements
      the Banzhaf semi-value as introduced in [@wang_data_2022].

    Args:
        u: Utility object with model, data, and scoring function.
        done: Stopping criterion.
        mode: The semi-value mode to use. See
            [SemiValueMode][pydvl.value.semivalues.SemiValueMode] for a list.
        sampler_t: The sampler type to use. See [sampler][pydvl.value.sampler]
            for a list.
        n_jobs: Number of parallel jobs to use.
        seed: Either an instance of a numpy random number generator or a seed for it.
        kwargs: Additional keyword arguments passed to
            [compute_generic_semivalues][pydvl.value.semivalues.compute_generic_semivalues].

    Returns:
        Object with the results.
    """
    if mode == SemiValueMode.Shapley:
        coefficient = shapley_coefficient
    elif mode == SemiValueMode.BetaShapley:
        alpha = kwargs.pop("alpha", 1)
        beta = kwargs.pop("beta", 1)
        coefficient = beta_coefficient(alpha, beta)
    elif mode == SemiValueMode.Banzhaf:
        coefficient = banzhaf_coefficient
    else:
        raise ValueError(f"Unknown mode {mode}")
    coefficient = cast(SVCoefficient, coefficient)
    return compute_generic_semivalues(
        sampler_t(u.data.indices, seed=seed),
        u,
        coefficient,
        done,
        n_jobs=n_jobs,
        **kwargs,
    )

Last update: 2023-09-02
Created: 2023-09-02

Semivalues

References¶

SVCoefficient ¶

__call__(n, k) ¶

SemiValueMode ¶

compute_generic_semivalues(sampler, u, coefficient, done, *, n_jobs=1, config=ParallelConfig(), progress=False) ¶

compute_shapley_semivalues(u, *, done=MaxUpdates(100), sampler_t=PermutationSampler, n_jobs=1, config=ParallelConfig(), progress=False, seed=None) ¶

compute_banzhaf_semivalues(u, *, done=MaxUpdates(100), sampler_t=PermutationSampler, n_jobs=1, config=ParallelConfig(), progress=False, seed=None) ¶

compute_beta_shapley_semivalues(u, *, alpha=1, beta=1, done=MaxUpdates(100), sampler_t=PermutationSampler, n_jobs=1, config=ParallelConfig(), progress=False, seed=None) ¶

compute_semivalues(u, *, done=MaxUpdates(100), mode=SemiValueMode.Shapley, sampler_t=PermutationSampler, n_jobs=1, seed=None, **kwargs) ¶

`SVCoefficient` ¶

`call(n, k)` ¶

`SemiValueMode` ¶

`compute_generic_semivalues(sampler, u, coefficient, done, *, n_jobs=1, config=ParallelConfig(), progress=False)` ¶

`compute_shapley_semivalues(u, *, done=MaxUpdates(100), sampler_t=PermutationSampler, n_jobs=1, config=ParallelConfig(), progress=False, seed=None)` ¶

`compute_banzhaf_semivalues(u, *, done=MaxUpdates(100), sampler_t=PermutationSampler, n_jobs=1, config=ParallelConfig(), progress=False, seed=None)` ¶

`compute_beta_shapley_semivalues(u, *, alpha=1, beta=1, done=MaxUpdates(100), sampler_t=PermutationSampler, n_jobs=1, config=ParallelConfig(), progress=False, seed=None)` ¶

`compute_semivalues(u, *, done=MaxUpdates(100), mode=SemiValueMode.Shapley, sampler_t=PermutationSampler, n_jobs=1, seed=None, **kwargs)` ¶