Skip to content

pydvl.value.sampler

Samplers iterate over subsets of indices.

The classes in this module are used to iterate over indices and subsets of their complement in the whole set, as required for the computation of marginal utility for semi-values. The elements returned when iterating over any subclass of PowersetSampler are tuples of the form (idx, subset), where idx is the index of the element being added to the subset, and subset is the subset of the complement of idx. The classes in this module are used to iterate over an index set \(I\) as required for the computation of marginal utility for semi-values. The elements returned when iterating over any subclass of :class:PowersetSampler are tuples of the form \((i, S)\), where \(i\) is an index of interest, and \(S \subset I \setminus \{i\}\) is a subset of the complement of \(i\).

The iteration happens in two nested loops. An outer loop iterates over \(I\), and an inner loop iterates over the powerset of \(I \setminus \{i\}\). The outer iteration can be either sequential or at random.

Note

This is the natural mode of iteration for the combinatorial definition of semi-values, in particular Shapley value. For the computation using permutations, adhering to this interface is not ideal, but we stick to it for consistency.

The samplers are used in the semivalues module to compute any semi-value, in particular Shapley and Beta values, and Banzhaf indices.

Slicing of samplers

The samplers can be sliced for parallel computation. For those which are embarrassingly parallel, this is done by slicing the set of "outer" indices and returning new samplers over those slices. This includes all truly powerset-based samplers, such as DeterministicUniformSampler and UniformSampler. In contrast, slicing a PermutationSampler creates a new sampler which iterates over the same indices.

References


  1. Mitchell, Rory, Joshua Cooper, Eibe Frank, and Geoffrey Holmes. Sampling Permutations for Shapley Value Estimation. Journal of Machine Learning Research 23, no. 43 (2022): 1–46. 

  2. Wang, J.T. and Jia, R., 2023. Data Banzhaf: A Robust Data Valuation Framework for Machine Learning. In: Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, pp. 6388-6421. 

PowersetSampler

PowersetSampler(
    indices: NDArray[IndexT],
    index_iteration: IndexIteration = IndexIteration.Sequential,
    outer_indices: NDArray[IndexT] | None = None,
    **kwargs
)

Bases: ABC, Iterable[SampleT], Generic[IndexT]

Samplers are custom iterables over subsets of indices.

Calling iter() on a sampler returns an iterator over tuples of the form \((i, S)\), where \(i\) is an index of interest, and \(S \subset I \setminus \{i\}\) is a subset of the complement of \(i\).

This is done in two nested loops, where the outer loop iterates over the set of indices, and the inner loop iterates over subsets of the complement of the current index. The outer iteration can be either sequential or at random.

Note

Samplers are not iterators themselves, so that each call to iter() e.g. in a for loop creates a new iterator.

Example
>>>for idx, s in DeterministicUniformSampler(np.arange(2)):
>>>    print(s, end="")
[][2,][][1,]

Methods required in subclasses

Samplers must implement a weight() function to be used as a multiplier in Monte Carlo sums, so that the limit expectation coincides with the semi-value.

Slicing of samplers

The samplers can be sliced for parallel computation. For those which are embarrassingly parallel, this is done by slicing the set of "outer" indices and returning new samplers over those slices.

index_iteration: the order in which indices are iterated over
outer_indices: The set of items (indices) over which to iterate
    when sampling. Subsets are taken from the complement of each index
    in succession. For embarrassingly parallel computations, this set
    is sliced and the samplers are used to iterate over the slices.
Source code in src/pydvl/value/sampler.py
def __init__(
    self,
    indices: NDArray[IndexT],
    index_iteration: IndexIteration = IndexIteration.Sequential,
    outer_indices: NDArray[IndexT] | None = None,
    **kwargs,
):
    """
    Args:
        indices: The set of items (indices) to sample from.
        index_iteration: the order in which indices are iterated over
        outer_indices: The set of items (indices) over which to iterate
            when sampling. Subsets are taken from the complement of each index
            in succession. For embarrassingly parallel computations, this set
            is sliced and the samplers are used to iterate over the slices.
    """
    self._indices = indices
    self._index_iteration = index_iteration
    self._outer_indices = outer_indices if outer_indices is not None else indices
    self._n = len(indices)
    self._n_samples = 0

iterindices

iterindices() -> Iterator[IndexT]

Iterates over indices in the order specified at construction.

this is probably not very useful, but I couldn't decide

which method is better

Source code in src/pydvl/value/sampler.py
def iterindices(self) -> Iterator[IndexT]:
    """Iterates over indices in the order specified at construction.

    FIXME: this is probably not very useful, but I couldn't decide
      which method is better
    """
    if self._index_iteration is PowersetSampler.IndexIteration.Sequential:
        for idx in self._outer_indices:
            yield idx
    elif self._index_iteration is PowersetSampler.IndexIteration.Random:
        while True:
            yield np.random.choice(self._outer_indices, size=1).item()

__len__

__len__() -> int

Returns the number of outer indices over which the sampler iterates.

Source code in src/pydvl/value/sampler.py
def __len__(self) -> int:
    """Returns the number of outer indices over which the sampler iterates."""
    return len(self._outer_indices)

weight abstractmethod classmethod

weight(n: int, subset_len: int) -> float

Factor by which to multiply Monte Carlo samples, so that the mean converges to the desired expression.

By the Law of Large Numbers, the sample mean of \(\delta_i(S_j)\) converges to the expectation under the distribution from which \(S_j\) is sampled.

\[ \frac{1}{m} \sum_{j = 1}^m \delta_i (S_j) c (S_j) \longrightarrow \underset{S \sim \mathcal{D}_{- i}}{\mathbb{E}} [\delta_i (S) c ( S)]\]

We add a factor \(c(S_j)\) in order to have this expectation coincide with the desired expression.

Source code in src/pydvl/value/sampler.py
@classmethod
@abc.abstractmethod
def weight(cls, n: int, subset_len: int) -> float:
    r"""Factor by which to multiply Monte Carlo samples, so that the
    mean converges to the desired expression.

    By the Law of Large Numbers, the sample mean of $\delta_i(S_j)$
    converges to the expectation under the distribution from which $S_j$ is
    sampled.

    $$ \frac{1}{m}  \sum_{j = 1}^m \delta_i (S_j) c (S_j) \longrightarrow
       \underset{S \sim \mathcal{D}_{- i}}{\mathbb{E}} [\delta_i (S) c (
       S)]$$

    We add a factor $c(S_j)$ in order to have this expectation coincide with
    the desired expression.
    """
    ...

StochasticSamplerMixin

StochasticSamplerMixin(*args, seed: Optional[Seed] = None, **kwargs)

Mixin class for samplers which use a random number generator.

Source code in src/pydvl/value/sampler.py
def __init__(self, *args, seed: Optional[Seed] = None, **kwargs):
    super().__init__(*args, **kwargs)
    self._rng = np.random.default_rng(seed)

DeterministicUniformSampler

DeterministicUniformSampler(indices: NDArray[IndexT], *args, **kwargs)

Bases: PowersetSampler[IndexT]

For every index \(i\), each subset of the complement indices - {i} is returned.

Note

Indices are always iterated over sequentially, irrespective of the value of index_iteration upon construction.

Example
>>> for idx, s in DeterministicUniformSampler(np.arange(2)):
>>>    print(f"{idx} - {s}", end=", ")
1 - [], 1 - [2], 2 - [], 2 - [1],
PARAMETER DESCRIPTION
indices

The set of items (indices) to sample from.

TYPE: NDArray[IndexT]

Source code in src/pydvl/value/sampler.py
def __init__(self, indices: NDArray[IndexT], *args, **kwargs):
    """An iterator to perform uniform deterministic sampling of subsets.

    For every index $i$, each subset of the complement `indices - {i}` is
    returned.

    !!! Note
        Indices are always iterated over sequentially, irrespective of
        the value of `index_iteration` upon construction.

    ??? Example
        ``` pycon
        >>> for idx, s in DeterministicUniformSampler(np.arange(2)):
        >>>    print(f"{idx} - {s}", end=", ")
        1 - [], 1 - [2], 2 - [], 2 - [1],
        ```

    Args:
        indices: The set of items (indices) to sample from.
    """
    # Force sequential iteration
    kwargs.update({"index_iteration": PowersetSampler.IndexIteration.Sequential})
    super().__init__(indices, *args, **kwargs)

iterindices

iterindices() -> Iterator[IndexT]

Iterates over indices in the order specified at construction.

this is probably not very useful, but I couldn't decide

which method is better

Source code in src/pydvl/value/sampler.py
def iterindices(self) -> Iterator[IndexT]:
    """Iterates over indices in the order specified at construction.

    FIXME: this is probably not very useful, but I couldn't decide
      which method is better
    """
    if self._index_iteration is PowersetSampler.IndexIteration.Sequential:
        for idx in self._outer_indices:
            yield idx
    elif self._index_iteration is PowersetSampler.IndexIteration.Random:
        while True:
            yield np.random.choice(self._outer_indices, size=1).item()

__len__

__len__() -> int

Returns the number of outer indices over which the sampler iterates.

Source code in src/pydvl/value/sampler.py
def __len__(self) -> int:
    """Returns the number of outer indices over which the sampler iterates."""
    return len(self._outer_indices)

UniformSampler

UniformSampler(*args, seed: Optional[Seed] = None, **kwargs)

Bases: StochasticSamplerMixin, PowersetSampler[IndexT]

An iterator to perform uniform random sampling of subsets.

Iterating over every index \(i\), either in sequence or at random depending on the value of index_iteration, one subset of the complement indices - {i} is sampled with equal probability \(2^{n-1}\). The iterator never ends.

Example

The code

for idx, s in UniformSampler(np.arange(3)):
   print(f"{idx} - {s}", end=", ")
Produces the output:
0 - [1 4], 1 - [2 3], 2 - [0 1 3], 3 - [], 4 - [2], 0 - [1 3 4], 1 - [0 2]
(...)

Source code in src/pydvl/value/sampler.py
def __init__(self, *args, seed: Optional[Seed] = None, **kwargs):
    super().__init__(*args, **kwargs)
    self._rng = np.random.default_rng(seed)

iterindices

iterindices() -> Iterator[IndexT]

Iterates over indices in the order specified at construction.

this is probably not very useful, but I couldn't decide

which method is better

Source code in src/pydvl/value/sampler.py
def iterindices(self) -> Iterator[IndexT]:
    """Iterates over indices in the order specified at construction.

    FIXME: this is probably not very useful, but I couldn't decide
      which method is better
    """
    if self._index_iteration is PowersetSampler.IndexIteration.Sequential:
        for idx in self._outer_indices:
            yield idx
    elif self._index_iteration is PowersetSampler.IndexIteration.Random:
        while True:
            yield np.random.choice(self._outer_indices, size=1).item()

__len__

__len__() -> int

Returns the number of outer indices over which the sampler iterates.

Source code in src/pydvl/value/sampler.py
def __len__(self) -> int:
    """Returns the number of outer indices over which the sampler iterates."""
    return len(self._outer_indices)

weight classmethod

weight(n: int, subset_len: int) -> float

Correction coming from Monte Carlo integration so that the mean of the marginals converges to the value: the uniform distribution over the powerset of a set with n-1 elements has mass 2^{n-1} over each subset.

Source code in src/pydvl/value/sampler.py
@classmethod
def weight(cls, n: int, subset_len: int) -> float:
    """Correction coming from Monte Carlo integration so that the mean of
    the marginals converges to the value: the uniform distribution over the
    powerset of a set with n-1 elements has mass 2^{n-1} over each subset."""
    return float(2 ** (n - 1)) if n > 0 else 1.0

MSRSampler

MSRSampler(*args, seed: Optional[Seed] = None, **kwargs)

Bases: StochasticSamplerMixin, PowersetSampler[IndexT]

An iterator to perform sampling of random subsets.

This sampler does not return any index, it only returns subsets of the data. This sampler is used in (Wang et. al.)2.

Source code in src/pydvl/value/sampler.py
def __init__(self, *args, seed: Optional[Seed] = None, **kwargs):
    super().__init__(*args, **kwargs)
    self._rng = np.random.default_rng(seed)

iterindices

iterindices() -> Iterator[IndexT]

Iterates over indices in the order specified at construction.

this is probably not very useful, but I couldn't decide

which method is better

Source code in src/pydvl/value/sampler.py
def iterindices(self) -> Iterator[IndexT]:
    """Iterates over indices in the order specified at construction.

    FIXME: this is probably not very useful, but I couldn't decide
      which method is better
    """
    if self._index_iteration is PowersetSampler.IndexIteration.Sequential:
        for idx in self._outer_indices:
            yield idx
    elif self._index_iteration is PowersetSampler.IndexIteration.Random:
        while True:
            yield np.random.choice(self._outer_indices, size=1).item()

__len__

__len__() -> int

Returns the number of outer indices over which the sampler iterates.

Source code in src/pydvl/value/sampler.py
def __len__(self) -> int:
    """Returns the number of outer indices over which the sampler iterates."""
    return len(self._outer_indices)

AntitheticSampler

AntitheticSampler(*args, seed: Optional[Seed] = None, **kwargs)

Bases: StochasticSamplerMixin, PowersetSampler[IndexT]

An iterator to perform uniform random sampling of subsets, and their complements.

Works as UniformSampler, but for every tuple \((i,S)\), it subsequently returns \((i,S^c)\), where \(S^c\) is the complement of the set \(S\) in the set of indices, excluding \(i\).

Source code in src/pydvl/value/sampler.py
def __init__(self, *args, seed: Optional[Seed] = None, **kwargs):
    super().__init__(*args, **kwargs)
    self._rng = np.random.default_rng(seed)

iterindices

iterindices() -> Iterator[IndexT]

Iterates over indices in the order specified at construction.

this is probably not very useful, but I couldn't decide

which method is better

Source code in src/pydvl/value/sampler.py
def iterindices(self) -> Iterator[IndexT]:
    """Iterates over indices in the order specified at construction.

    FIXME: this is probably not very useful, but I couldn't decide
      which method is better
    """
    if self._index_iteration is PowersetSampler.IndexIteration.Sequential:
        for idx in self._outer_indices:
            yield idx
    elif self._index_iteration is PowersetSampler.IndexIteration.Random:
        while True:
            yield np.random.choice(self._outer_indices, size=1).item()

__len__

__len__() -> int

Returns the number of outer indices over which the sampler iterates.

Source code in src/pydvl/value/sampler.py
def __len__(self) -> int:
    """Returns the number of outer indices over which the sampler iterates."""
    return len(self._outer_indices)

PermutationSampler

PermutationSampler(*args, seed: Optional[Seed] = None, **kwargs)

Bases: StochasticSamplerMixin, PowersetSampler[IndexT]

Sample permutations of indices and iterate through each returning increasing subsets, as required for the permutation definition of semi-values.

This sampler does not implement the two loops described in PowersetSampler. Instead, for a permutation (3,1,4,2), it returns in sequence the tuples of index and sets: (3, {}), (1, {3}), (4, {3,1}) and (2, {3,1,4}).

Note that the full index set is never returned.

Warning

This sampler requires caching to be enabled or computation will be doubled wrt. a "direct" implementation of permutation MC

Source code in src/pydvl/value/sampler.py
def __init__(self, *args, seed: Optional[Seed] = None, **kwargs):
    super().__init__(*args, **kwargs)
    self._rng = np.random.default_rng(seed)

iterindices

iterindices() -> Iterator[IndexT]

Iterates over indices in the order specified at construction.

this is probably not very useful, but I couldn't decide

which method is better

Source code in src/pydvl/value/sampler.py
def iterindices(self) -> Iterator[IndexT]:
    """Iterates over indices in the order specified at construction.

    FIXME: this is probably not very useful, but I couldn't decide
      which method is better
    """
    if self._index_iteration is PowersetSampler.IndexIteration.Sequential:
        for idx in self._outer_indices:
            yield idx
    elif self._index_iteration is PowersetSampler.IndexIteration.Random:
        while True:
            yield np.random.choice(self._outer_indices, size=1).item()

__len__

__len__() -> int

Returns the number of outer indices over which the sampler iterates.

Source code in src/pydvl/value/sampler.py
def __len__(self) -> int:
    """Returns the number of outer indices over which the sampler iterates."""
    return len(self._outer_indices)

__getitem__

__getitem__(key: slice | list[int]) -> PowersetSampler[IndexT]

Permutation samplers cannot be split across indices, so we return a copy of the full sampler.

Source code in src/pydvl/value/sampler.py
def __getitem__(self, key: slice | list[int]) -> PowersetSampler[IndexT]:
    """Permutation samplers cannot be split across indices, so we return
    a copy of the full sampler."""
    return super().__getitem__(slice(None))

AntitheticPermutationSampler

AntitheticPermutationSampler(*args, seed: Optional[Seed] = None, **kwargs)

Bases: PermutationSampler[IndexT]

Samples permutations like PermutationSampler, but after each permutation, it returns the same permutation in reverse order.

This sampler was suggested in (Mitchell et al. 2022)1

New in version 0.7.1

Source code in src/pydvl/value/sampler.py
def __init__(self, *args, seed: Optional[Seed] = None, **kwargs):
    super().__init__(*args, **kwargs)
    self._rng = np.random.default_rng(seed)

iterindices

iterindices() -> Iterator[IndexT]

Iterates over indices in the order specified at construction.

this is probably not very useful, but I couldn't decide

which method is better

Source code in src/pydvl/value/sampler.py
def iterindices(self) -> Iterator[IndexT]:
    """Iterates over indices in the order specified at construction.

    FIXME: this is probably not very useful, but I couldn't decide
      which method is better
    """
    if self._index_iteration is PowersetSampler.IndexIteration.Sequential:
        for idx in self._outer_indices:
            yield idx
    elif self._index_iteration is PowersetSampler.IndexIteration.Random:
        while True:
            yield np.random.choice(self._outer_indices, size=1).item()

__getitem__

__getitem__(key: slice | list[int]) -> PowersetSampler[IndexT]

Permutation samplers cannot be split across indices, so we return a copy of the full sampler.

Source code in src/pydvl/value/sampler.py
def __getitem__(self, key: slice | list[int]) -> PowersetSampler[IndexT]:
    """Permutation samplers cannot be split across indices, so we return
    a copy of the full sampler."""
    return super().__getitem__(slice(None))

__len__

__len__() -> int

Returns the number of outer indices over which the sampler iterates.

Source code in src/pydvl/value/sampler.py
def __len__(self) -> int:
    """Returns the number of outer indices over which the sampler iterates."""
    return len(self._outer_indices)

DeterministicPermutationSampler

DeterministicPermutationSampler(*args, seed: Optional[Seed] = None, **kwargs)

Bases: PermutationSampler[IndexT]

Samples all n! permutations of the indices deterministically, and iterates through them, returning sets as required for the permutation-based definition of semi-values.

Warning

This sampler requires caching to be enabled or computation will be doubled wrt. a "direct" implementation of permutation MC

Warning

This sampler is not parallelizable, as it always iterates over the whole set of permutations in the same order. Different processes would always return the same values for all indices.

Source code in src/pydvl/value/sampler.py
def __init__(self, *args, seed: Optional[Seed] = None, **kwargs):
    super().__init__(*args, **kwargs)
    self._rng = np.random.default_rng(seed)

iterindices

iterindices() -> Iterator[IndexT]

Iterates over indices in the order specified at construction.

this is probably not very useful, but I couldn't decide

which method is better

Source code in src/pydvl/value/sampler.py
def iterindices(self) -> Iterator[IndexT]:
    """Iterates over indices in the order specified at construction.

    FIXME: this is probably not very useful, but I couldn't decide
      which method is better
    """
    if self._index_iteration is PowersetSampler.IndexIteration.Sequential:
        for idx in self._outer_indices:
            yield idx
    elif self._index_iteration is PowersetSampler.IndexIteration.Random:
        while True:
            yield np.random.choice(self._outer_indices, size=1).item()

__getitem__

__getitem__(key: slice | list[int]) -> PowersetSampler[IndexT]

Permutation samplers cannot be split across indices, so we return a copy of the full sampler.

Source code in src/pydvl/value/sampler.py
def __getitem__(self, key: slice | list[int]) -> PowersetSampler[IndexT]:
    """Permutation samplers cannot be split across indices, so we return
    a copy of the full sampler."""
    return super().__getitem__(slice(None))

__len__

__len__() -> int

Returns the number of outer indices over which the sampler iterates.

Source code in src/pydvl/value/sampler.py
def __len__(self) -> int:
    """Returns the number of outer indices over which the sampler iterates."""
    return len(self._outer_indices)

RandomHierarchicalSampler

RandomHierarchicalSampler(*args, seed: Optional[Seed] = None, **kwargs)

Bases: StochasticSamplerMixin, PowersetSampler[IndexT]

For every index, sample a set size, then a set of that size.

Todo

This is unnecessary, but a step towards proper stratified sampling.

Source code in src/pydvl/value/sampler.py
def __init__(self, *args, seed: Optional[Seed] = None, **kwargs):
    super().__init__(*args, **kwargs)
    self._rng = np.random.default_rng(seed)

iterindices

iterindices() -> Iterator[IndexT]

Iterates over indices in the order specified at construction.

this is probably not very useful, but I couldn't decide

which method is better

Source code in src/pydvl/value/sampler.py
def iterindices(self) -> Iterator[IndexT]:
    """Iterates over indices in the order specified at construction.

    FIXME: this is probably not very useful, but I couldn't decide
      which method is better
    """
    if self._index_iteration is PowersetSampler.IndexIteration.Sequential:
        for idx in self._outer_indices:
            yield idx
    elif self._index_iteration is PowersetSampler.IndexIteration.Random:
        while True:
            yield np.random.choice(self._outer_indices, size=1).item()

__len__

__len__() -> int

Returns the number of outer indices over which the sampler iterates.

Source code in src/pydvl/value/sampler.py
def __len__(self) -> int:
    """Returns the number of outer indices over which the sampler iterates."""
    return len(self._outer_indices)