Skip to content

pydvl.valuation.samplers.owen

Samplers for Owen Shapley values.

The Owen Shapley value is a sampling-based method to estimate Shapley values. It samples probability values between 0 and 1 and then draws subsets of the complement of the current index where each element is sampled with the given probability.

Possible configurations

The basic sampler is OwenSampler. It can be configured with a deterministic grid of probability values or a uniform distribution between 0 and 1. The former follows the idea of the original paper (Okhrati and Lipani, 2021)1.

This strategy for the sampling of probability values can be specified with the GridOwenStrategy or UniformOwenStrategy classes.

In addition, and as is the case for all PowerSetSamplers, one can configure the way the sampler iterates over indices to be updated. This can be done with the IndexIteration strategy.

When using infinite index iteration, the sampler can be used with a stopping criterion to estimate Shapley values. This follows more closely the typical usage pattern in PyDVL than the original sampling method described in Okhrati and Lipani (2021)1.

Antithetic Owen sampling

We also provide an AntitheticOwenSampler, which draws probability values \(q\) between 0 and 0.5 again, either deterministically over a discrete grid or at uniformly at random, and then generates two samples for each index, one with the probability \(q\) and another with probability \(1-q\).

It can be configured in the same manner as the regular Owen sampler.

The four main configurations
standard_owen = OwenSampler(
    outer_sampling_strategy=GridOwenStrategy(n_samples_outer=100),
    n_samples_inner=2,
    index_iteration=FiniteSequentialIndexIteration,
    )
antithetic_owen = AntitheticOwenSampler(
    outer_sampling_strategy=GridOwenStrategy(n_samples_outer=100),
    n_samples_inner=2,
    index_iteration=FiniteSequentialIndexIteration,
    )
infinite_owen = OwenSampler(
    outer_sampling_strategy=UniformOwenStrategy(seed=42),
    n_samples_inner=2,
    index_iteration=RandomIndexIteration,
    seed=42
)
infinite_antithetic_owen = AntitheticOwenSampler(
    outer_sampling_strategy=UniformOwenStrategy(seed=42),
    n_samples_inner=2,
    index_iteration=RandomIndexIteration,
    seed=42
)

References


  1. Okhrati, R., Lipani, A., 2021. A Multilinear Sampling Algorithm to Estimate Shapley Values. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 7992–7999. IEEE. 

OwenStrategy

OwenStrategy(n_samples_outer: int)

Bases: ABC

Base class for strategies for the Owen sampler to sample probability values.

Source code in src/pydvl/valuation/samplers/owen.py
def __init__(self, n_samples_outer: int):
    self.n_samples_outer = n_samples_outer

UniformOwenStrategy

UniformOwenStrategy(n_samples_outer: int, seed: Seed | None = None)

Bases: OwenStrategy

A strategy for OwenSampler to sample probability values uniformly between 0 and \(q_ ext{stop}\).

PARAMETER DESCRIPTION
n_samples_outer

The number of probability values \(q\) used for the outer loop. Since samples are taken anew for each index, a high number will delay updating new indices and has no effect on the final accuracy if using an infinite index iteration. In general, it only makes sense to change this number if using a finite index iteration.

TYPE: int

seed

The seed for the random number generator.

TYPE: Seed | None DEFAULT: None

Source code in src/pydvl/valuation/samplers/owen.py
def __init__(self, n_samples_outer: int, seed: Seed | None = None):
    super().__init__(n_samples_outer=n_samples_outer)
    self.rng = np.random.default_rng(seed)

GridOwenStrategy

GridOwenStrategy(n_samples_outer: int)

Bases: OwenStrategy

A strategy for OwenSampler to sample probability values on a linear grid.

PARAMETER DESCRIPTION
n_samples_outer

The number of probability values \(q\) used for the outer loop. These will be linearly spaced between 0 and \(q_ ext{stop}\).

TYPE: int

Source code in src/pydvl/valuation/samplers/owen.py
def __init__(self, n_samples_outer: int):
    super().__init__(n_samples_outer=n_samples_outer)

OwenSampler

OwenSampler(
    outer_sampling_strategy: OwenStrategy,
    n_samples_inner: int = 2,
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
    seed: Seed | None = None,
)

Bases: StochasticSamplerMixin, PowersetSampler

A sampler for semi-values using the Owen method.

For each index \(i\) we sample n_samples_outer probability values \(q_j\) between 0 and 1 and then, for each \(j\) we draw n_samples_inner subsets of the complement of the current index where each element is sampled probability \(q_j\).

The distribution for the outer sampling can be either uniform or deterministic. The default is deterministic on a grid, which is the original method described in Okhrati and Lipani (2021)1. This can be achieved by using the GridOwenStrategy strategy.

Alternatively, the distribution can be uniform between 0 and 1. This can be achieved by using the UniformOwenStrategy strategy.

By combining a UniformOwenStrategy with an infinite IndexIteration strategy, this sampler can be used with a stopping criterion to estimate semi-values. This follows more closely the typical usage pattern in PyDVL than the original sampling method described in Okhrati and Lipani (2021)1.

Example usage
sampler = OwenSampler(
    outer_sampling_strategy=GridOwenStrategy(n_samples_outer=200),
    n_samples_inner=8,
    index_iteration=FiniteSequentialIndexIteration,
)
PARAMETER DESCRIPTION
n_samples_inner

The number of samples drawn for each probability. In the original paper this was fixed to 2 for all experiments.

TYPE: int DEFAULT: 2

batch_size

The batch size of the sampler.

TYPE: int DEFAULT: 1

index_iteration

The index iteration strategy, sequential or random, finite or infinite.

TYPE: Type[IndexIteration] DEFAULT: FiniteSequentialIndexIteration

seed

The seed for the random number generator.

TYPE: Seed | None DEFAULT: None

Source code in src/pydvl/valuation/samplers/owen.py
def __init__(
    self,
    outer_sampling_strategy: OwenStrategy,
    n_samples_inner: int = 2,
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
    seed: Seed | None = None,
):
    super().__init__(
        batch_size=batch_size, index_iteration=index_iteration, seed=seed
    )
    self.n_samples_inner = n_samples_inner
    self.sampling_probabilities = outer_sampling_strategy
    self.q_stop = 1.0

skip_indices property writable

skip_indices

Set of indices to skip in the outer loop.

interrupt

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py
def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

__len__

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES DESCRIPTION
`TypeError`

if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py
def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

generate_batches

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py
def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break

result_updater

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns a callable that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER DESCRIPTION
result

The result to update

TYPE: ValuationResult

Returns: A callable object that updates the result with a value update

Source code in src/pydvl/valuation/samplers/base.py
def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns a callable that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

index_iterator

index_iterator(indices: IndexSetT) -> Generator[IndexT | None, None, None]

Iterates over indices with the method specified at construction.

Source code in src/pydvl/valuation/samplers/powerset.py
def index_iterator(
    self, indices: IndexSetT
) -> Generator[IndexT | None, None, None]:
    """Iterates over indices with the method specified at construction."""
    try:
        self._index_iterator = self._index_iterator_cls(indices, seed=self._rng)  # type: ignore
    except (AttributeError, TypeError):
        self._index_iterator = self._index_iterator_cls(indices)
    for idx in self._index_iterator:
        if idx not in self.skip_indices:
            yield idx

log_weight

log_weight(n: int, subset_len: int) -> float

For each \(q_j, j \in \{1, ..., N\}\) in the outer probabilities, the probability of drawing a subset \(S_k\) of size \(k\) is:

\[ P (| S_{q_j} | = k) = \binom{n}{k} \ q_j^k (1 - q_j)^{n - k}.\]

So, if each \(q_j\) is chosen with equal weight (or more generally with probability \(p_j\)),then by total probability, the overall probability of obtaining a subset of size \(k\) is a mixture of the binomials: $$ P (| S | = k) = \sum_{j = 1}^N p_j \ \binom{n}{k} \ q_j^k (1 - q_j)^{n - k}. $$

In our case \(p_j = 1/N\), so that \(P(|S|=k) = \frac{1}{N} \sum_{j=1}^N P (| S_{q_j} | = k)\). For large enough \(N\) this is

\[ P(|S|=k) \approx \binom{n}{k} \int_0^1 q^k (1 - q)^{n - k} \, dq = \frac{1}{ n+1}, \]

where we computed the integral using the beta function and its expression as products of gamma functions.

Now, given the symmetry wrt. the indices in the sampling procedure, any given set \(S\) of size \(k\) is equally likely to be drawn. So the probability of a set being of size \(k\) must be equally divided by the number of sets of that size, and the weight of a set of size \(k\) is:

\[ P(S) = \frac{1}{n+1} \binom{n}{|S|}^{-1}. \]
PARAMETER DESCRIPTION
n

Size of the index set.

TYPE: int

subset_len

Size of the subset.

TYPE: int

Returns: The logarithm of the weight of a subset of size subset_len.

Source code in src/pydvl/valuation/samplers/owen.py
def log_weight(self, n: int, subset_len: int) -> float:
    r"""For each $q_j, j \in \{1, ..., N\}$ in the outer probabilities, the
    probability of drawing a subset $S_k$ of size $k$ is:

    $$ P (| S_{q_j} | = k) = \binom{n}{k} \  q_j^k  (1 - q_j)^{n - k}.$$

    So, if each $q_j$ is chosen with equal weight (or more generally with
    probability $p_j$),then by total probability, the overall probability of
    obtaining a subset of size $k$ is a mixture of the binomials:
    $$
    P (| S | = k) = \sum_{j = 1}^N p_j \ \binom{n}{k} \ q_j^k  (1 - q_j)^{n - k}.
    $$

    In our case $p_j = 1/N$, so that $P(|S|=k) = \frac{1}{N} \sum_{j=1}^N P (|
    S_{q_j} | = k)$. For large enough $N$ this is

    $$
    P(|S|=k) \approx \binom{n}{k} \int_0^1 q^k (1 - q)^{n - k} \, dq = \frac{1}{
    n+1},
    $$

    where we computed the integral using the beta function and its expression as
    products of gamma functions.

    Now, given the symmetry wrt. the indices in the sampling procedure, any given
    set $S$ of size $k$ is equally likely to be drawn. So the probability of a set
    being of size $k$ must be equally divided by the number of sets of that size,
    and the weight of a set of size $k$ is:

    $$ P(S) = \frac{1}{n+1} \binom{n}{|S|}^{-1}. $$

    Args:
        n: Size of the index set.
        subset_len: Size of the subset.
    Returns:
        The logarithm of the weight of a subset of size `subset_len`.
    """
    m = self._index_iterator_cls.complement_size(n)
    return float(-logcomb(m, subset_len) - np.log(m + 1))

sample_limit

sample_limit(indices: IndexSetT) -> int | None

The number of samples that will be generated by the sampler.

PARAMETER DESCRIPTION
indices

TYPE: IndexSetT

RETURNS DESCRIPTION
int | None

0 if there are no indices, None if there's no limit and the number of

int | None

samples otherwise.

Source code in src/pydvl/valuation/samplers/owen.py
def sample_limit(self, indices: IndexSetT) -> int | None:
    """The number of samples that will be generated by the sampler.

    Args:
        indices:

    Returns:
        0 if there are no indices, `None` if there's no limit and the number of
        samples otherwise.
    """
    if len(indices) == 0:
        return 0
    if not self._index_iterator_cls.is_finite():
        return None

    return (
        cast(int, self._index_iterator_cls.length(len(indices)))
        * self.sampling_probabilities.n_samples_outer
        * self.n_samples_inner
    )

AntitheticOwenSampler

AntitheticOwenSampler(
    outer_sampling_strategy: OwenStrategy,
    n_samples_inner: int = 2,
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
    seed: Seed | None = None,
)

Bases: OwenSampler

A sampler for antithetic Owen shapley values.

For each sample obtained with the method of OwenSampler, a second sample is generated by taking the complement of the first sample.

For the same number of total samples, the antithetic Owen sampler yields usually more precise estimates of shapley values than the regular Owen sampler.

Source code in src/pydvl/valuation/samplers/owen.py
def __init__(
    self,
    outer_sampling_strategy: OwenStrategy,
    n_samples_inner: int = 2,
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
    seed: Seed | None = None,
):
    super().__init__(
        outer_sampling_strategy=outer_sampling_strategy,
        n_samples_inner=n_samples_inner,
        batch_size=batch_size,
        index_iteration=index_iteration,
        seed=seed,
    )
    self.q_stop = 0.5

skip_indices property writable

skip_indices

Set of indices to skip in the outer loop.

interrupt

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py
def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

__len__

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES DESCRIPTION
`TypeError`

if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py
def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

generate_batches

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py
def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break

log_weight

log_weight(n: int, subset_len: int) -> float

For each \(q_j, j \in \{1, ..., N\}\) in the outer probabilities, the probability of drawing a subset \(S_k\) of size \(k\) is:

\[ P (| S_{q_j} | = k) = \binom{n}{k} \ q_j^k (1 - q_j)^{n - k}.\]

So, if each \(q_j\) is chosen with equal weight (or more generally with probability \(p_j\)),then by total probability, the overall probability of obtaining a subset of size \(k\) is a mixture of the binomials: $$ P (| S | = k) = \sum_{j = 1}^N p_j \ \binom{n}{k} \ q_j^k (1 - q_j)^{n - k}. $$

In our case \(p_j = 1/N\), so that \(P(|S|=k) = \frac{1}{N} \sum_{j=1}^N P (| S_{q_j} | = k)\). For large enough \(N\) this is

\[ P(|S|=k) \approx \binom{n}{k} \int_0^1 q^k (1 - q)^{n - k} \, dq = \frac{1}{ n+1}, \]

where we computed the integral using the beta function and its expression as products of gamma functions.

Now, given the symmetry wrt. the indices in the sampling procedure, any given set \(S\) of size \(k\) is equally likely to be drawn. So the probability of a set being of size \(k\) must be equally divided by the number of sets of that size, and the weight of a set of size \(k\) is:

\[ P(S) = \frac{1}{n+1} \binom{n}{|S|}^{-1}. \]
PARAMETER DESCRIPTION
n

Size of the index set.

TYPE: int

subset_len

Size of the subset.

TYPE: int

Returns: The logarithm of the weight of a subset of size subset_len.

Source code in src/pydvl/valuation/samplers/owen.py
def log_weight(self, n: int, subset_len: int) -> float:
    r"""For each $q_j, j \in \{1, ..., N\}$ in the outer probabilities, the
    probability of drawing a subset $S_k$ of size $k$ is:

    $$ P (| S_{q_j} | = k) = \binom{n}{k} \  q_j^k  (1 - q_j)^{n - k}.$$

    So, if each $q_j$ is chosen with equal weight (or more generally with
    probability $p_j$),then by total probability, the overall probability of
    obtaining a subset of size $k$ is a mixture of the binomials:
    $$
    P (| S | = k) = \sum_{j = 1}^N p_j \ \binom{n}{k} \ q_j^k  (1 - q_j)^{n - k}.
    $$

    In our case $p_j = 1/N$, so that $P(|S|=k) = \frac{1}{N} \sum_{j=1}^N P (|
    S_{q_j} | = k)$. For large enough $N$ this is

    $$
    P(|S|=k) \approx \binom{n}{k} \int_0^1 q^k (1 - q)^{n - k} \, dq = \frac{1}{
    n+1},
    $$

    where we computed the integral using the beta function and its expression as
    products of gamma functions.

    Now, given the symmetry wrt. the indices in the sampling procedure, any given
    set $S$ of size $k$ is equally likely to be drawn. So the probability of a set
    being of size $k$ must be equally divided by the number of sets of that size,
    and the weight of a set of size $k$ is:

    $$ P(S) = \frac{1}{n+1} \binom{n}{|S|}^{-1}. $$

    Args:
        n: Size of the index set.
        subset_len: Size of the subset.
    Returns:
        The logarithm of the weight of a subset of size `subset_len`.
    """
    m = self._index_iterator_cls.complement_size(n)
    return float(-logcomb(m, subset_len) - np.log(m + 1))

result_updater

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns a callable that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER DESCRIPTION
result

The result to update

TYPE: ValuationResult

Returns: A callable object that updates the result with a value update

Source code in src/pydvl/valuation/samplers/base.py
def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns a callable that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

index_iterator

index_iterator(indices: IndexSetT) -> Generator[IndexT | None, None, None]

Iterates over indices with the method specified at construction.

Source code in src/pydvl/valuation/samplers/powerset.py
def index_iterator(
    self, indices: IndexSetT
) -> Generator[IndexT | None, None, None]:
    """Iterates over indices with the method specified at construction."""
    try:
        self._index_iterator = self._index_iterator_cls(indices, seed=self._rng)  # type: ignore
    except (AttributeError, TypeError):
        self._index_iterator = self._index_iterator_cls(indices)
    for idx in self._index_iterator:
        if idx not in self.skip_indices:
            yield idx