Skip to content

Semivalues

This module provides the core functionality for the computation of generic semi-values. A semi-value is any valuation function with the form:

\[v_\text{semi}(i) = \sum_{i=1}^n w(k) \sum_{S \subset D_{-i}^{(k)}} [U(S_{+i})-U(S)],\]

where the coefficients \(w(k)\) satisfy the property:

\[\sum_{k=1}^n w(k) = 1.\]

Note

For implementation consistency, we slightly depart from the common definition of semi-values, which includes a factor \(1/n\) in the sum over subsets. Instead, we subsume this factor into the coefficient \(w(k)\).

As such, the computation of a semi-value requires two components:

  1. A subset sampler that generates subsets of the set \(D\) of interest.
  2. A coefficient \(w(k)\) that assigns a weight to each subset size \(k\).

Samplers can be found in sampler, and can be classified into two categories: powerset samplers and permutation samplers. Powerset samplers generate subsets of \(D_{-i}\), while the permutation sampler generates permutations of \(D\). The former conform to the above definition of semi-values, while the latter reformulates it as:

\[ v(i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)} \tilde{w}( | \sigma_{:i} | )[U(\sigma_{:i} \cup \{i\}) − U(\sigma_{:i})], \]

where \(\sigma_{:i}\) denotes the set of indices in permutation sigma before the position where \(i\) appears (see Data valuation for details), and

\[ \tilde{w} (k) = n \binom{n - 1}{k} w (k) \]

is the weight correction due to the reformulation.

Warning

Both PermutationSampler and DeterministicPermutationSampler require caching to be enabled or computation will be doubled wrt. a 'direct' implementation of permutation MC.

There are several pre-defined coefficients, including the Shapley value of (Ghorbani and Zou, 2019)1, the Banzhaf index of (Wang and Jia)3, and the Beta coefficient of (Kwon and Zou, 2022)2. For each of these methods, there is a convenience wrapper function. Respectively, these are: compute_shapley_semivalues, compute_banzhaf_semivalues, and compute_beta_shapley_semivalues. instead.

References


  1. Ghorbani, A., Zou, J., 2019. Data Shapley: Equitable Valuation of Data for Machine Learning. In: Proceedings of the 36th International Conference on Machine Learning, PMLR, pp. 2242–2251. 

  2. Kwon, Y. and Zou, J., 2022. Beta Shapley: A Unified and Noise-reduced Data Valuation Framework for Machine Learning. In: Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) 2022, Vol. 151. PMLR, Valencia, Spain. 

  3. Wang, J.T. and Jia, R., 2022. Data Banzhaf: A Robust Data Valuation Framework for Machine Learning. ArXiv preprint arXiv:2205.15466. 

SVCoefficient

Bases: Protocol

A coefficient for the computation of semi-values.

__call__(n, k)

Computes the coefficient for a given subset size.

PARAMETER DESCRIPTION
n

Total number of elements in the set.

TYPE: int

k

Size of the subset for which the coefficient is being computed

TYPE: int

Source code in src/pydvl/value/semivalues.py
def __call__(self, n: int, k: int) -> float:
    """Computes the coefficient for a given subset size.

    Args:
        n: Total number of elements in the set.
        k: Size of the subset for which the coefficient is being computed
    """
    ...

SemiValueMode

Bases: str, Enum

Enumeration of semi-value modes.

Deprecation notice

This enum and the associated methods are deprecated and will be removed in 0.8.0.

compute_generic_semivalues(sampler, u, coefficient, done, *, n_jobs=1, config=ParallelConfig(), progress=False)

Computes semi-values for a given utility function and subset sampler.

PARAMETER DESCRIPTION
sampler

The subset sampler to use for utility computations.

TYPE: PowersetSampler

u

Utility object with model, data, and scoring function.

TYPE: Utility

coefficient

The semi-value coefficient

TYPE: SVCoefficient

done

Stopping criterion.

TYPE: StoppingCriterion

n_jobs

Number of parallel jobs to use.

TYPE: int DEFAULT: 1

config

Object configuring parallel computation, with cluster address, number of cpus, etc.

TYPE: ParallelConfig DEFAULT: ParallelConfig()

progress

Whether to display a progress bar.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
ValuationResult

Object with the results.

Source code in src/pydvl/value/semivalues.py
def compute_generic_semivalues(
    sampler: PowersetSampler,
    u: Utility,
    coefficient: SVCoefficient,
    done: StoppingCriterion,
    *,
    n_jobs: int = 1,
    config: ParallelConfig = ParallelConfig(),
    progress: bool = False,
) -> ValuationResult:
    """Computes semi-values for a given utility function and subset sampler.

    Args:
        sampler: The subset sampler to use for utility computations.
        u: Utility object with model, data, and scoring function.
        coefficient: The semi-value coefficient
        done: Stopping criterion.
        n_jobs: Number of parallel jobs to use.
        config: Object configuring parallel computation, with cluster
            address, number of cpus, etc.
        progress: Whether to display a progress bar.

    Returns:
        Object with the results.
    """
    from concurrent.futures import FIRST_COMPLETED, Future, wait

    from pydvl.utils import effective_n_jobs, init_executor, init_parallel_backend

    if isinstance(sampler, PermutationSampler) and not u.enable_cache:
        log.warning(
            "PermutationSampler requires caching to be enabled or computation "
            "will be doubled wrt. a 'direct' implementation of permutation MC"
        )

    result = ValuationResult.zeros(
        algorithm=f"semivalue-{str(sampler)}-{coefficient.__name__}",  # type: ignore
        indices=u.data.indices,
        data_names=u.data.data_names,
    )

    parallel_backend = init_parallel_backend(config)
    u = parallel_backend.put(u)
    correction = parallel_backend.put(
        lambda n, k: coefficient(n, k) * sampler.weight(n, k)
    )

    max_workers = effective_n_jobs(n_jobs, config)
    n_submitted_jobs = 2 * max_workers  # number of jobs in the queue

    sampler_it = iter(sampler)
    pbar = tqdm(disable=not progress, total=100, unit="%")

    with init_executor(
        max_workers=max_workers, config=config, cancel_futures=True
    ) as executor:
        pending: set[Future] = set()
        while True:
            pbar.n = 100 * done.completion()
            pbar.refresh()

            completed, pending = wait(pending, timeout=1, return_when=FIRST_COMPLETED)
            for future in completed:
                idx, marginal = future.result()
                result.update(idx, marginal)
                if done(result):
                    return result

            # Ensure that we always have n_submitted_jobs running
            try:
                for _ in range(n_submitted_jobs - len(pending)):
                    pending.add(
                        executor.submit(
                            _marginal,
                            u=u,
                            coefficient=correction,
                            sample=next(sampler_it),
                        )
                    )
            except StopIteration:
                if len(pending) == 0:
                    return result

compute_shapley_semivalues(u, *, done=MaxUpdates(100), sampler_t=PermutationSampler, n_jobs=1, config=ParallelConfig(), progress=False, seed=None)

Computes Shapley values for a given utility function.

This is a convenience wrapper for compute_generic_semivalues with the Shapley coefficient. Use compute_shapley_values for a more flexible interface and additional methods, including TMCS.

PARAMETER DESCRIPTION
u

Utility object with model, data, and scoring function.

TYPE: Utility

done

Stopping criterion.

TYPE: StoppingCriterion DEFAULT: MaxUpdates(100)

sampler_t

The sampler type to use. See :mod:pydvl.value.sampler for a list.

TYPE: Type[StochasticSampler] DEFAULT: PermutationSampler

n_jobs

Number of parallel jobs to use.

TYPE: int DEFAULT: 1

config

Object configuring parallel computation, with cluster address, number of cpus, etc.

TYPE: ParallelConfig DEFAULT: ParallelConfig()

seed

Either an instance of a numpy random number generator or a seed for it.

TYPE: Optional[Seed] DEFAULT: None

progress

Whether to display a progress bar.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
ValuationResult

Object with the results.

Source code in src/pydvl/value/semivalues.py
def compute_shapley_semivalues(
    u: Utility,
    *,
    done: StoppingCriterion = MaxUpdates(100),
    sampler_t: Type[StochasticSampler] = PermutationSampler,
    n_jobs: int = 1,
    config: ParallelConfig = ParallelConfig(),
    progress: bool = False,
    seed: Optional[Seed] = None,
) -> ValuationResult:
    """Computes Shapley values for a given utility function.

    This is a convenience wrapper for
    [compute_generic_semivalues][pydvl.value.semivalues.compute_generic_semivalues]
    with the Shapley coefficient. Use
    [compute_shapley_values][pydvl.value.shapley.common.compute_shapley_values]
    for a more flexible interface and additional methods, including TMCS.

    Args:
        u: Utility object with model, data, and scoring function.
        done: Stopping criterion.
        sampler_t: The sampler type to use. See :mod:`pydvl.value.sampler`
            for a list.
        n_jobs: Number of parallel jobs to use.
        config: Object configuring parallel computation, with cluster
            address, number of cpus, etc.
        seed: Either an instance of a numpy random number generator or a seed for it.
        progress: Whether to display a progress bar.

    Returns:
        Object with the results.
    """
    return compute_generic_semivalues(
        sampler_t(u.data.indices, seed=seed),
        u,
        shapley_coefficient,
        done,
        n_jobs=n_jobs,
        config=config,
        progress=progress,
    )

compute_banzhaf_semivalues(u, *, done=MaxUpdates(100), sampler_t=PermutationSampler, n_jobs=1, config=ParallelConfig(), progress=False, seed=None)

Computes Banzhaf values for a given utility function.

This is a convenience wrapper for compute_generic_semivalues with the Banzhaf coefficient.

PARAMETER DESCRIPTION
u

Utility object with model, data, and scoring function.

TYPE: Utility

done

Stopping criterion.

TYPE: StoppingCriterion DEFAULT: MaxUpdates(100)

sampler_t

The sampler type to use. See :mod:pydvl.value.sampler for a list.

TYPE: Type[StochasticSampler] DEFAULT: PermutationSampler

n_jobs

Number of parallel jobs to use.

TYPE: int DEFAULT: 1

seed

Either an instance of a numpy random number generator or a seed for it.

TYPE: Optional[Seed] DEFAULT: None

config

Object configuring parallel computation, with cluster address, number of cpus, etc.

TYPE: ParallelConfig DEFAULT: ParallelConfig()

progress

Whether to display a progress bar.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
ValuationResult

Object with the results.

Source code in src/pydvl/value/semivalues.py
def compute_banzhaf_semivalues(
    u: Utility,
    *,
    done: StoppingCriterion = MaxUpdates(100),
    sampler_t: Type[StochasticSampler] = PermutationSampler,
    n_jobs: int = 1,
    config: ParallelConfig = ParallelConfig(),
    progress: bool = False,
    seed: Optional[Seed] = None,
) -> ValuationResult:
    """Computes Banzhaf values for a given utility function.

    This is a convenience wrapper for
    [compute_generic_semivalues][pydvl.value.semivalues.compute_generic_semivalues]
    with the Banzhaf coefficient.

    Args:
        u: Utility object with model, data, and scoring function.
        done: Stopping criterion.
        sampler_t: The sampler type to use. See :mod:`pydvl.value.sampler` for a
            list.
        n_jobs: Number of parallel jobs to use.
        seed: Either an instance of a numpy random number generator or a seed for it.
        config: Object configuring parallel computation, with cluster address,
            number of cpus, etc.
        progress: Whether to display a progress bar.

    Returns:
        Object with the results.
    """
    return compute_generic_semivalues(
        sampler_t(u.data.indices, seed=seed),
        u,
        banzhaf_coefficient,
        done,
        n_jobs=n_jobs,
        config=config,
        progress=progress,
    )

compute_beta_shapley_semivalues(u, *, alpha=1, beta=1, done=MaxUpdates(100), sampler_t=PermutationSampler, n_jobs=1, config=ParallelConfig(), progress=False, seed=None)

Computes Beta Shapley values for a given utility function.

This is a convenience wrapper for compute_generic_semivalues with the Beta Shapley coefficient.

PARAMETER DESCRIPTION
u

Utility object with model, data, and scoring function.

TYPE: Utility

alpha

Alpha parameter of the Beta distribution.

TYPE: float DEFAULT: 1

beta

Beta parameter of the Beta distribution.

TYPE: float DEFAULT: 1

done

Stopping criterion.

TYPE: StoppingCriterion DEFAULT: MaxUpdates(100)

sampler_t

The sampler type to use. See :mod:pydvl.value.sampler for a list.

TYPE: Type[StochasticSampler] DEFAULT: PermutationSampler

n_jobs

Number of parallel jobs to use.

TYPE: int DEFAULT: 1

seed

Either an instance of a numpy random number generator or a seed for it.

TYPE: Optional[Seed] DEFAULT: None

config

Object configuring parallel computation, with cluster address, number of cpus, etc.

TYPE: ParallelConfig DEFAULT: ParallelConfig()

progress

Whether to display a progress bar.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
ValuationResult

Object with the results.

Source code in src/pydvl/value/semivalues.py
def compute_beta_shapley_semivalues(
    u: Utility,
    *,
    alpha: float = 1,
    beta: float = 1,
    done: StoppingCriterion = MaxUpdates(100),
    sampler_t: Type[StochasticSampler] = PermutationSampler,
    n_jobs: int = 1,
    config: ParallelConfig = ParallelConfig(),
    progress: bool = False,
    seed: Optional[Seed] = None,
) -> ValuationResult:
    """Computes Beta Shapley values for a given utility function.

    This is a convenience wrapper for
    [compute_generic_semivalues][pydvl.value.semivalues.compute_generic_semivalues]
    with the Beta Shapley coefficient.

    Args:
        u: Utility object with model, data, and scoring function.
        alpha: Alpha parameter of the Beta distribution.
        beta: Beta parameter of the Beta distribution.
        done: Stopping criterion.
        sampler_t: The sampler type to use. See :mod:`pydvl.value.sampler` for a list.
        n_jobs: Number of parallel jobs to use.
        seed: Either an instance of a numpy random number generator or a seed for it.
        config: Object configuring parallel computation, with cluster address, number of
            cpus, etc.
        progress: Whether to display a progress bar.

    Returns:
        Object with the results.
    """
    return compute_generic_semivalues(
        sampler_t(u.data.indices, seed=seed),
        u,
        beta_coefficient(alpha, beta),
        done,
        n_jobs=n_jobs,
        config=config,
        progress=progress,
    )

compute_semivalues(u, *, done=MaxUpdates(100), mode=SemiValueMode.Shapley, sampler_t=PermutationSampler, n_jobs=1, seed=None, **kwargs)

Convenience entry point for most common semi-value computations.

Deprecation warning

This method is deprecated and will be replaced in 0.8.0 by the more general implementation of compute_generic_semivalues. Use compute_shapley_semivalues, compute_banzhaf_semivalues, or compute_beta_shapley_semivalues instead.

The modes supported with this interface are the following. For greater flexibility use compute_generic_semivalues directly.

  • SemiValueMode.Shapley: Shapley values.
  • [SemiValueMode.BetaShapley][pydvl.value.semivalues.SemiValueMode.BetaShapley]: Implements the Beta Shapley semi-value as introduced in (Kwon and Zou, 2022)1. Pass additional keyword arguments alpha and beta to set the parameters of the Beta distribution (both default to 1).
  • [SemiValueMode.Banzhaf][]: Implements the Banzhaf semi-value as introduced in (Wang and Jia, 2022)1.

See [[data-valuation]] for an overview of valuation. - SemiValueMode.Banzhaf: Implements the Banzhaf semi-value as introduced in [@wang_data_2022].

PARAMETER DESCRIPTION
u

Utility object with model, data, and scoring function.

TYPE: Utility

done

Stopping criterion.

TYPE: StoppingCriterion DEFAULT: MaxUpdates(100)

mode

The semi-value mode to use. See SemiValueMode for a list.

TYPE: SemiValueMode DEFAULT: Shapley

sampler_t

The sampler type to use. See sampler for a list.

TYPE: Type[StochasticSampler] DEFAULT: PermutationSampler

n_jobs

Number of parallel jobs to use.

TYPE: int DEFAULT: 1

seed

Either an instance of a numpy random number generator or a seed for it.

TYPE: Optional[Seed] DEFAULT: None

kwargs

Additional keyword arguments passed to compute_generic_semivalues.

DEFAULT: {}

RETURNS DESCRIPTION
ValuationResult

Object with the results.

Source code in src/pydvl/value/semivalues.py
@deprecated(target=True, deprecated_in="0.7.0", remove_in="0.8.0")
def compute_semivalues(
    u: Utility,
    *,
    done: StoppingCriterion = MaxUpdates(100),
    mode: SemiValueMode = SemiValueMode.Shapley,
    sampler_t: Type[StochasticSampler] = PermutationSampler,
    n_jobs: int = 1,
    seed: Optional[Seed] = None,
    **kwargs,
) -> ValuationResult:
    """Convenience entry point for most common semi-value computations.

    !!! warning "Deprecation warning"
        This method is deprecated and will be replaced in 0.8.0 by the more
        general implementation of
        [compute_generic_semivalues][pydvl.value.semivalues.compute_generic_semivalues].
        Use
        [compute_shapley_semivalues][pydvl.value.semivalues.compute_shapley_semivalues],
        [compute_banzhaf_semivalues][pydvl.value.semivalues.compute_banzhaf_semivalues],
        or
        [compute_beta_shapley_semivalues][pydvl.value.semivalues.compute_beta_shapley_semivalues]
        instead.

    The modes supported with this interface are the following. For greater
    flexibility use
    [compute_generic_semivalues][pydvl.value.semivalues.compute_generic_semivalues]
    directly.

    - [SemiValueMode.Shapley][pydvl.value.semivalues.SemiValueMode]:
      Shapley values.
    - [SemiValueMode.BetaShapley][pydvl.value.semivalues.SemiValueMode.BetaShapley]:
      Implements the Beta Shapley semi-value as introduced in
      (Kwon and Zou, 2022)<sup><a href="#kwon_beta_2022">1</a></sup>.
      Pass additional keyword arguments `alpha` and `beta` to set the
      parameters of the Beta distribution (both default to 1).
    - [SemiValueMode.Banzhaf][SemiValueMode.Banzhaf]: Implements the Banzhaf
      semi-value as introduced in (Wang and Jia, 2022)<sup><a href="#wang_data_2022">1</a></sup>.

    See [[data-valuation]] for an overview of valuation.
    - [SemiValueMode.Banzhaf][pydvl.value.semivalues.SemiValueMode]: Implements
      the Banzhaf semi-value as introduced in [@wang_data_2022].

    Args:
        u: Utility object with model, data, and scoring function.
        done: Stopping criterion.
        mode: The semi-value mode to use. See
            [SemiValueMode][pydvl.value.semivalues.SemiValueMode] for a list.
        sampler_t: The sampler type to use. See [sampler][pydvl.value.sampler]
            for a list.
        n_jobs: Number of parallel jobs to use.
        seed: Either an instance of a numpy random number generator or a seed for it.
        kwargs: Additional keyword arguments passed to
            [compute_generic_semivalues][pydvl.value.semivalues.compute_generic_semivalues].

    Returns:
        Object with the results.
    """
    if mode == SemiValueMode.Shapley:
        coefficient = shapley_coefficient
    elif mode == SemiValueMode.BetaShapley:
        alpha = kwargs.pop("alpha", 1)
        beta = kwargs.pop("beta", 1)
        coefficient = beta_coefficient(alpha, beta)
    elif mode == SemiValueMode.Banzhaf:
        coefficient = banzhaf_coefficient
    else:
        raise ValueError(f"Unknown mode {mode}")
    coefficient = cast(SVCoefficient, coefficient)
    return compute_generic_semivalues(
        sampler_t(u.data.indices, seed=seed),
        u,
        coefficient,
        done,
        n_jobs=n_jobs,
        **kwargs,
    )

Last update: 2023-09-02
Created: 2023-09-02