pydvl.valuation.samplers.owen ¶

Samplers for Owen Shapley values.

The Owen Shapley value is a sampling-based method to estimate Shapley values. It samples probability values between 0 and 1 and then draws subsets of the complement of the current index where each element is sampled with the given probability.

Possible configurations¶

The basic sampler is OwenSampler. It can be configured with a deterministic grid of probability values or a uniform distribution between 0 and 1. The former follows the idea of the original paper (Okhrati and Lipani, 2021)¹.

This strategy for the sampling of probability values can be specified with the GridOwenStrategy or UniformOwenStrategy classes.

In addition, and as is the case for all PowerSetSamplers, one can configure the way the sampler iterates over indices to be updated. This can be done with the IndexIteration strategy.

When using infinite index iteration, the sampler can be used with a stopping criterion to estimate Shapley values. This follows more closely the typical usage pattern in PyDVL than the original sampling method described in Okhrati and Lipani (2021)¹.

Antithetic Owen sampling¶

We also provide an AntitheticOwenSampler, which draws probability values $q$ between 0 and 0.5 again, either deterministically over a discrete grid or at uniformly at random, and then generates two samples for each index, one with the probability $q$ and another with probability $1-q$.

It can be configured in the same manner as the regular Owen sampler.

The four main configurations

standard_owen = OwenSampler(
    outer_sampling_strategy=GridOwenStrategy(n_samples_outer=100),
    n_samples_inner=2,
    index_iteration=FiniteSequentialIndexIteration,
    )
antithetic_owen = AntitheticOwenSampler(
    outer_sampling_strategy=GridOwenStrategy(n_samples_outer=100),
    n_samples_inner=2,
    index_iteration=FiniteSequentialIndexIteration,
    )
infinite_owen = OwenSampler(
    outer_sampling_strategy=UniformOwenStrategy(seed=42),
    n_samples_inner=2,
    index_iteration=RandomIndexIteration,
    seed=42
)
infinite_antithetic_owen = AntitheticOwenSampler(
    outer_sampling_strategy=UniformOwenStrategy(seed=42),
    n_samples_inner=2,
    index_iteration=RandomIndexIteration,
    seed=42
)

References¶

Okhrati, R., Lipani, A., 2021. A Multilinear Sampling Algorithm to Estimate Shapley Values. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 7992–7999. IEEE. ↩↩

AntitheticOwenSampler ¶

AntitheticOwenSampler(
    outer_sampling_strategy: OwenStrategy,
    n_samples_inner: int = 2,
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
    seed: Seed | None = None,
)

Bases: OwenSampler

A sampler for antithetic Owen shapley values.

For each sample obtained with the method of OwenSampler, a second sample is generated by taking the complement of the first sample.

For the same number of total samples, the antithetic Owen sampler yields usually more precise estimates of shapley values than the regular Owen sampler.

Source code in src/pydvl/valuation/samplers/owen.py

def __init__(
    self,
    outer_sampling_strategy: OwenStrategy,
    n_samples_inner: int = 2,
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
    seed: Seed | None = None,
):
    super().__init__(
        outer_sampling_strategy=outer_sampling_strategy,
        n_samples_inner=n_samples_inner,
        batch_size=batch_size,
        index_iteration=index_iteration,
        seed=seed,
    )
    self.q_stop = 0.5

interrupted `property` ¶

interrupted: bool

Whether the sampler has been interrupted.

skip_indices `property` `writable` ¶

skip_indices

Set of indices to skip in the outer loop.

len ¶

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES	DESCRIPTION
`TypeError`	if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py

def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

repr ¶

__repr__() -> str

FIXME: This is not a proper representation of the sampler.

Source code in src/pydvl/valuation/samplers/base.py

def __repr__(self) -> str:
    """FIXME: This is not a proper representation of the sampler."""
    return f"{self.__class__.__name__}"

generate_batches ¶

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py

def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break

index_iterable ¶

index_iterable(indices: IndexSetT) -> Generator[IndexT | None, None, None]

Iterates over indices with the method specified at construction.

Source code in src/pydvl/valuation/samplers/powerset.py

def index_iterable(
    self, indices: IndexSetT
) -> Generator[IndexT | None, None, None]:
    """Iterates over indices with the method specified at construction."""
    try:
        iterable = self._index_iterator_cls(indices, seed=self._rng)  # type: ignore
    except (AttributeError, TypeError):
        iterable = self._index_iterator_cls(indices)
    for idx in iterable:
        if idx not in self.skip_indices:
            yield idx

interrupt ¶

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py

def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

log_weight ¶

log_weight(n: int, subset_len: int) -> float

For each $q_j, j \in \{1, ..., N\}$ in the outer probabilities, the probability of drawing a subset $S_k$ of size $k$ is:

\[ P (| S_{q_j} | = k) = \binom{n}{k} \ q_j^k (1 - q_j)^{n - k}.\]

So, if each $q_j$ is chosen with equal weight (or more generally with probability $p_j$),then by total probability, the overall probability of obtaining a subset of size $k$ is a mixture of the binomials: $$ P (| S | = k) = \sum_{j = 1}^N p_j \ \binom{n}{k} \ q_j^k (1 - q_j)^{n - k}. $$

In our case $p_j = 1/N$, so that $P(|S|=k) = \frac{1}{N} \sum_{j=1}^N P (| S_{q_j} | = k)$. For large enough $N$ this is

\[ P(|S|=k) \approx \binom{n}{k} \int_0^1 q^k (1 - q)^{n - k} \, dq = \frac{1}{ n+1}, \]

where we computed the integral using the beta function and its expression as products of gamma functions.

Now, given the symmetry wrt. the indices in the sampling procedure, any given set $S$ of size $k$ is equally likely to be drawn. So the probability of a set being of size $k$ must be equally divided by the number of sets of that size, and the weight of a set of size $k$ is:

\[ P(S) = \frac{1}{n+1} \binom{n}{|S|}^{-1}. \]

PARAMETER	DESCRIPTION
`n`	Size of the index set. TYPE: `int`
`subset_len`	Size of the subset. TYPE: `int`

Returns: The logarithm of the weight of a subset of size subset_len.

Source code in src/pydvl/valuation/samplers/owen.py

def log_weight(self, n: int, subset_len: int) -> float:
    r"""For each $q_j, j \in \{1, ..., N\}$ in the outer probabilities, the
    probability of drawing a subset $S_k$ of size $k$ is:

    $$ P (| S_{q_j} | = k) = \binom{n}{k} \  q_j^k  (1 - q_j)^{n - k}.$$

    So, if each $q_j$ is chosen with equal weight (or more generally with
    probability $p_j$),then by total probability, the overall probability of
    obtaining a subset of size $k$ is a mixture of the binomials:
    $$
    P (| S | = k) = \sum_{j = 1}^N p_j \ \binom{n}{k} \ q_j^k  (1 - q_j)^{n - k}.
    $$

    In our case $p_j = 1/N$, so that $P(|S|=k) = \frac{1}{N} \sum_{j=1}^N P (|
    S_{q_j} | = k)$. For large enough $N$ this is

    $$
    P(|S|=k) \approx \binom{n}{k} \int_0^1 q^k (1 - q)^{n - k} \, dq = \frac{1}{
    n+1},
    $$

    where we computed the integral using the beta function and its expression as
    products of gamma functions.

    Now, given the symmetry wrt. the indices in the sampling procedure, any given
    set $S$ of size $k$ is equally likely to be drawn. So the probability of a set
    being of size $k$ must be equally divided by the number of sets of that size,
    and the weight of a set of size $k$ is:

    $$ P(S) = \frac{1}{n+1} \binom{n}{|S|}^{-1}. $$

    Args:
        n: Size of the index set.
        subset_len: Size of the subset.
    Returns:
        The logarithm of the weight of a subset of size `subset_len`.
    """
    m = self.complement_size(n)
    return float(-logcomb(m, subset_len) - np.log(m + 1))

result_updater ¶

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns an object that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER	DESCRIPTION
`result`	The result to update TYPE: `ValuationResult`

Returns: A callable object that updates the result with a value update

Source code in src/pydvl/valuation/samplers/base.py

def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns an object that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

GridOwenStrategy ¶

GridOwenStrategy(n_samples_outer: int)

Bases: OwenStrategy

A strategy for OwenSampler to sample probability values on a linear grid.

PARAMETER	DESCRIPTION
`n_samples_outer`	The number of probability values $q$ used for the outer loop. These will be linearly spaced between 0 and $q_ ext{stop}$. TYPE: `int`

Source code in src/pydvl/valuation/samplers/owen.py

def __init__(self, n_samples_outer: int):
    super().__init__(n_samples_outer=n_samples_outer)

OwenSampler ¶

OwenSampler(
    outer_sampling_strategy: OwenStrategy,
    n_samples_inner: int = 2,
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
    seed: Seed | None = None,
)

Bases: StochasticSamplerMixin, PowersetSampler

A sampler for semi-values using the Owen method.

For each index $i$ we sample n_samples_outer probability values $q_j$ between 0 and 1 and then, for each $j$ we draw n_samples_inner subsets of the complement of the current index where each element is sampled probability $q_j$.

The distribution for the outer sampling can be either uniform or deterministic. The default is deterministic on a grid, which is the original method described in Okhrati and Lipani (2021)¹. This can be achieved by using the GridOwenStrategy strategy.

Alternatively, the distribution can be uniform between 0 and 1. This can be achieved by using the UniformOwenStrategy strategy.

By combining a UniformOwenStrategy with an infinite IndexIteration strategy, this sampler can be used with a stopping criterion to estimate semi-values. This follows more closely the typical usage pattern in PyDVL than the original sampling method described in Okhrati and Lipani (2021)¹.

Example usage

sampler = OwenSampler(
    outer_sampling_strategy=GridOwenStrategy(n_samples_outer=200),
    n_samples_inner=8,
    index_iteration=FiniteSequentialIndexIteration,
)

PARAMETER	DESCRIPTION
`n_samples_inner`	The number of samples drawn for each probability. In the original paper this was fixed to 2 for all experiments. TYPE: `int` DEFAULT: `2`
`batch_size`	The batch size of the sampler. TYPE: `int` DEFAULT: `1`
`index_iteration`	The index iteration strategy, sequential or random, finite or infinite. TYPE: `Type[IndexIteration]` DEFAULT: `FiniteSequentialIndexIteration`
`seed`	The seed for the random number generator. TYPE: `Seed \| None` DEFAULT: `None`

Source code in src/pydvl/valuation/samplers/owen.py

def __init__(
    self,
    outer_sampling_strategy: OwenStrategy,
    n_samples_inner: int = 2,
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
    seed: Seed | None = None,
):
    super().__init__(
        batch_size=batch_size, index_iteration=index_iteration, seed=seed
    )
    self.n_samples_inner = n_samples_inner
    self.sampling_probabilities = outer_sampling_strategy
    self.q_stop = 1.0

interrupted `property` ¶

interrupted: bool

Whether the sampler has been interrupted.

skip_indices `property` `writable` ¶

skip_indices

Set of indices to skip in the outer loop.

len ¶

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES	DESCRIPTION
`TypeError`	if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py

def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

repr ¶

__repr__() -> str

FIXME: This is not a proper representation of the sampler.

Source code in src/pydvl/valuation/samplers/base.py

def __repr__(self) -> str:
    """FIXME: This is not a proper representation of the sampler."""
    return f"{self.__class__.__name__}"

generate_batches ¶

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py

def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break

index_iterable ¶

index_iterable(indices: IndexSetT) -> Generator[IndexT | None, None, None]

Iterates over indices with the method specified at construction.

Source code in src/pydvl/valuation/samplers/powerset.py

def index_iterable(
    self, indices: IndexSetT
) -> Generator[IndexT | None, None, None]:
    """Iterates over indices with the method specified at construction."""
    try:
        iterable = self._index_iterator_cls(indices, seed=self._rng)  # type: ignore
    except (AttributeError, TypeError):
        iterable = self._index_iterator_cls(indices)
    for idx in iterable:
        if idx not in self.skip_indices:
            yield idx

interrupt ¶

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py

def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

log_weight ¶

log_weight(n: int, subset_len: int) -> float

For each $q_j, j \in \{1, ..., N\}$ in the outer probabilities, the probability of drawing a subset $S_k$ of size $k$ is:

\[ P (| S_{q_j} | = k) = \binom{n}{k} \ q_j^k (1 - q_j)^{n - k}.\]

So, if each $q_j$ is chosen with equal weight (or more generally with probability $p_j$),then by total probability, the overall probability of obtaining a subset of size $k$ is a mixture of the binomials: $$ P (| S | = k) = \sum_{j = 1}^N p_j \ \binom{n}{k} \ q_j^k (1 - q_j)^{n - k}. $$

In our case $p_j = 1/N$, so that $P(|S|=k) = \frac{1}{N} \sum_{j=1}^N P (| S_{q_j} | = k)$. For large enough $N$ this is

\[ P(|S|=k) \approx \binom{n}{k} \int_0^1 q^k (1 - q)^{n - k} \, dq = \frac{1}{ n+1}, \]

where we computed the integral using the beta function and its expression as products of gamma functions.

Now, given the symmetry wrt. the indices in the sampling procedure, any given set $S$ of size $k$ is equally likely to be drawn. So the probability of a set being of size $k$ must be equally divided by the number of sets of that size, and the weight of a set of size $k$ is:

\[ P(S) = \frac{1}{n+1} \binom{n}{|S|}^{-1}. \]

PARAMETER	DESCRIPTION
`n`	Size of the index set. TYPE: `int`
`subset_len`	Size of the subset. TYPE: `int`

Returns: The logarithm of the weight of a subset of size subset_len.

Source code in src/pydvl/valuation/samplers/owen.py

def log_weight(self, n: int, subset_len: int) -> float:
    r"""For each $q_j, j \in \{1, ..., N\}$ in the outer probabilities, the
    probability of drawing a subset $S_k$ of size $k$ is:

    $$ P (| S_{q_j} | = k) = \binom{n}{k} \  q_j^k  (1 - q_j)^{n - k}.$$

    So, if each $q_j$ is chosen with equal weight (or more generally with
    probability $p_j$),then by total probability, the overall probability of
    obtaining a subset of size $k$ is a mixture of the binomials:
    $$
    P (| S | = k) = \sum_{j = 1}^N p_j \ \binom{n}{k} \ q_j^k  (1 - q_j)^{n - k}.
    $$

    In our case $p_j = 1/N$, so that $P(|S|=k) = \frac{1}{N} \sum_{j=1}^N P (|
    S_{q_j} | = k)$. For large enough $N$ this is

    $$
    P(|S|=k) \approx \binom{n}{k} \int_0^1 q^k (1 - q)^{n - k} \, dq = \frac{1}{
    n+1},
    $$

    where we computed the integral using the beta function and its expression as
    products of gamma functions.

    Now, given the symmetry wrt. the indices in the sampling procedure, any given
    set $S$ of size $k$ is equally likely to be drawn. So the probability of a set
    being of size $k$ must be equally divided by the number of sets of that size,
    and the weight of a set of size $k$ is:

    $$ P(S) = \frac{1}{n+1} \binom{n}{|S|}^{-1}. $$

    Args:
        n: Size of the index set.
        subset_len: Size of the subset.
    Returns:
        The logarithm of the weight of a subset of size `subset_len`.
    """
    m = self.complement_size(n)
    return float(-logcomb(m, subset_len) - np.log(m + 1))

result_updater ¶

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns an object that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER	DESCRIPTION
`result`	The result to update TYPE: `ValuationResult`

Returns: A callable object that updates the result with a value update

Source code in src/pydvl/valuation/samplers/base.py

def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns an object that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

sample_limit ¶

sample_limit(indices: IndexSetT) -> int | None

The number of samples that will be generated by the sampler.

PARAMETER	DESCRIPTION
`indices`	TYPE: `IndexSetT`

RETURNS	DESCRIPTION
`int \| None`	0 if there are no indices, `None` if there's no limit and the number of
`int \| None`	samples otherwise.

Source code in src/pydvl/valuation/samplers/owen.py

def sample_limit(self, indices: IndexSetT) -> int | None:
    """The number of samples that will be generated by the sampler.

    Args:
        indices:

    Returns:
        0 if there are no indices, `None` if there's no limit and the number of
        samples otherwise.
    """
    if len(indices) == 0:
        return 0
    if not self._index_iterator_cls.is_finite():
        return None

    return (
        cast(int, self._index_iterator_cls.length(len(indices)))
        * self.sampling_probabilities.n_samples_outer
        * self.n_samples_inner
    )

OwenStrategy ¶

OwenStrategy(n_samples_outer: int)

Bases: ABC

Base class for strategies for the Owen sampler to sample probability values.

Source code in src/pydvl/valuation/samplers/owen.py

def __init__(self, n_samples_outer: int):
    self.n_samples_outer = n_samples_outer

UniformOwenStrategy ¶

UniformOwenStrategy(n_samples_outer: int, seed: Seed | None = None)

Bases: OwenStrategy

A strategy for OwenSampler to sample probability values uniformly between 0 and $q_ ext{stop}$.

PARAMETER	DESCRIPTION
`n_samples_outer`	The number of probability values $q$ used for the outer loop. Since samples are taken anew for each index, a high number will delay updating new indices and has no effect on the final accuracy if using an infinite index iteration. In general, it only makes sense to change this number if using a finite index iteration. TYPE: `int`
`seed`	The seed for the random number generator. TYPE: `Seed \| None` DEFAULT: `None`

Source code in src/pydvl/valuation/samplers/owen.py

def __init__(self, n_samples_outer: int, seed: Seed | None = None):
    super().__init__(n_samples_outer=n_samples_outer)
    self.rng = np.random.default_rng(seed)

PARAMETER	DESCRIPTION
`n_samples_outer`	The number of probability values \(q\) used for the outer loop. Since samples are taken anew for each index, a high number will delay updating new indices and has no effect on the final accuracy if using an infinite index iteration. In general, it only makes sense to change this number if using a finite index iteration. TYPE: `int`
`seed`	The seed for the random number generator. TYPE: `Seed \| None` DEFAULT: `None`

pydvl.valuation.samplers.owen ¶

Possible configurations¶

Antithetic Owen sampling¶

References¶

AntitheticOwenSampler ¶

interrupted property ¶

skip_indices property writable ¶

__len__ ¶

__repr__ ¶

generate_batches ¶

index_iterable ¶

interrupt ¶

log_weight ¶

result_updater ¶

GridOwenStrategy ¶

OwenSampler ¶

interrupted property ¶

skip_indices property writable ¶

__len__ ¶

__repr__ ¶

generate_batches ¶

index_iterable ¶

interrupt ¶

log_weight ¶

result_updater ¶

sample_limit ¶

OwenStrategy ¶

UniformOwenStrategy ¶

interrupted `property` ¶

skip_indices `property` `writable` ¶

len ¶

repr ¶

interrupted `property` ¶

skip_indices `property` `writable` ¶

len ¶

repr ¶