pydvl.value.sampler
¶
Samplers iterate over subsets of indices.
The classes in this module are used to iterate over indices and subsets of their
complement in the whole set, as required for the computation of marginal utility
for semi-values. The elements returned when iterating over any subclass of
PowersetSampler are tuples of the form
(idx, subset)
, where idx
is the index of the element being added to the
subset, and subset
is the subset of the complement of idx
.
The classes in this module are used to iterate over an index set \(I\) as required
for the computation of marginal utility for semi-values. The elements returned
when iterating over any subclass of :class:PowersetSampler
are tuples of the
form \((i, S)\), where \(i\) is an index of interest, and \(S \subset I \setminus \{i\}\)
is a subset of the complement of \(i\).
The iteration happens in two nested loops. An outer loop iterates over \(I\), and an inner loop iterates over the powerset of \(I \setminus \{i\}\). The outer iteration can be either sequential or at random.
Note
This is the natural mode of iteration for the combinatorial definition of semi-values, in particular Shapley value. For the computation using permutations, adhering to this interface is not ideal, but we stick to it for consistency.
The samplers are used in the semivalues module to compute any semi-value, in particular Shapley and Beta values, and Banzhaf indices.
Slicing of samplers¶
The samplers can be sliced for parallel computation. For those which are embarrassingly parallel, this is done by slicing the set of "outer" indices and returning new samplers over those slices. This includes all truly powerset-based samplers, such as DeterministicUniformSampler and UniformSampler. In contrast, slicing a PermutationSampler creates a new sampler which iterates over the same indices.
References¶
-
Mitchell, Rory, Joshua Cooper, Eibe Frank, and Geoffrey Holmes. Sampling Permutations for Shapley Value Estimation. Journal of Machine Learning Research 23, no. 43 (2022): 1–46. ↩
-
Wang, J.T. and Jia, R., 2023. Data Banzhaf: A Robust Data Valuation Framework for Machine Learning. In: Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, pp. 6388-6421. ↩
PowersetSampler
¶
PowersetSampler(
indices: NDArray[IndexT],
index_iteration: IndexIteration = IndexIteration.Sequential,
outer_indices: NDArray[IndexT] | None = None,
**kwargs
)
Bases: ABC
, Iterable[SampleT]
, Generic[IndexT]
Samplers are custom iterables over subsets of indices.
Calling iter()
on a sampler returns an iterator over tuples of the form
\((i, S)\), where \(i\) is an index of interest, and \(S \subset I \setminus \{i\}\)
is a subset of the complement of \(i\).
This is done in two nested loops, where the outer loop iterates over the set of indices, and the inner loop iterates over subsets of the complement of the current index. The outer iteration can be either sequential or at random.
Note
Samplers are not iterators themselves, so that each call to iter()
e.g. in a for loop creates a new iterator.
Example
Methods required in subclasses¶
Samplers must implement a weight() function to be used as a multiplier in Monte Carlo sums, so that the limit expectation coincides with the semi-value.
Slicing of samplers¶
The samplers can be sliced for parallel computation. For those which are embarrassingly parallel, this is done by slicing the set of "outer" indices and returning new samplers over those slices.
index_iteration: the order in which indices are iterated over
outer_indices: The set of items (indices) over which to iterate
when sampling. Subsets are taken from the complement of each index
in succession. For embarrassingly parallel computations, this set
is sliced and the samplers are used to iterate over the slices.
Source code in src/pydvl/value/sampler.py
iterindices
¶
iterindices() -> Iterator[IndexT]
Iterates over indices in the order specified at construction.
this is probably not very useful, but I couldn't decide
which method is better
Source code in src/pydvl/value/sampler.py
weight
abstractmethod
classmethod
¶
Factor by which to multiply Monte Carlo samples, so that the mean converges to the desired expression.
By the Law of Large Numbers, the sample mean of \(\delta_i(S_j)\) converges to the expectation under the distribution from which \(S_j\) is sampled.
We add a factor \(c(S_j)\) in order to have this expectation coincide with the desired expression.
Source code in src/pydvl/value/sampler.py
DeterministicUniformSampler
¶
DeterministicUniformSampler(indices: NDArray[IndexT], *args, **kwargs)
Bases: PowersetSampler[IndexT]
For every index \(i\), each subset of the complement indices - {i}
is
returned.
Note
Indices are always iterated over sequentially, irrespective of
the value of index_iteration
upon construction.
Example
PARAMETER | DESCRIPTION |
---|---|
indices |
The set of items (indices) to sample from.
TYPE:
|
Source code in src/pydvl/value/sampler.py
iterindices
¶
iterindices() -> Iterator[IndexT]
Iterates over indices in the order specified at construction.
this is probably not very useful, but I couldn't decide
which method is better
Source code in src/pydvl/value/sampler.py
UniformSampler
¶
UniformSampler(*args, seed: Optional[Seed] = None, **kwargs)
Bases: StochasticSamplerMixin
, PowersetSampler[IndexT]
An iterator to perform uniform random sampling of subsets.
Iterating over every index \(i\), either in sequence or at random depending on
the value of index_iteration
, one subset of the complement
indices - {i}
is sampled with equal probability \(2^{n-1}\). The
iterator never ends.
Example
The code
Produces the output:Source code in src/pydvl/value/sampler.py
iterindices
¶
iterindices() -> Iterator[IndexT]
Iterates over indices in the order specified at construction.
this is probably not very useful, but I couldn't decide
which method is better
Source code in src/pydvl/value/sampler.py
weight
classmethod
¶
Correction coming from Monte Carlo integration so that the mean of the marginals converges to the value: the uniform distribution over the powerset of a set with n-1 elements has mass 2^{n-1} over each subset.
Source code in src/pydvl/value/sampler.py
MSRSampler
¶
MSRSampler(*args, seed: Optional[Seed] = None, **kwargs)
Bases: StochasticSamplerMixin
, PowersetSampler[IndexT]
An iterator to perform sampling of random subsets.
This sampler does not return any index, it only returns subsets of the data. This sampler is used in (Wang et. al.)2.
Source code in src/pydvl/value/sampler.py
iterindices
¶
iterindices() -> Iterator[IndexT]
Iterates over indices in the order specified at construction.
this is probably not very useful, but I couldn't decide
which method is better
Source code in src/pydvl/value/sampler.py
AntitheticSampler
¶
AntitheticSampler(*args, seed: Optional[Seed] = None, **kwargs)
Bases: StochasticSamplerMixin
, PowersetSampler[IndexT]
An iterator to perform uniform random sampling of subsets, and their complements.
Works as UniformSampler, but for every tuple \((i,S)\), it subsequently returns \((i,S^c)\), where \(S^c\) is the complement of the set \(S\) in the set of indices, excluding \(i\).
Source code in src/pydvl/value/sampler.py
iterindices
¶
iterindices() -> Iterator[IndexT]
Iterates over indices in the order specified at construction.
this is probably not very useful, but I couldn't decide
which method is better
Source code in src/pydvl/value/sampler.py
PermutationSampler
¶
PermutationSampler(*args, seed: Optional[Seed] = None, **kwargs)
Bases: StochasticSamplerMixin
, PowersetSampler[IndexT]
Sample permutations of indices and iterate through each returning increasing subsets, as required for the permutation definition of semi-values.
This sampler does not implement the two loops described in
PowersetSampler. Instead, for a
permutation (3,1,4,2)
, it returns in sequence the tuples of index and
sets: (3, {})
, (1, {3})
, (4, {3,1})
and (2, {3,1,4})
.
Note that the full index set is never returned.
Warning
This sampler requires caching to be enabled or computation will be doubled wrt. a "direct" implementation of permutation MC
Source code in src/pydvl/value/sampler.py
iterindices
¶
iterindices() -> Iterator[IndexT]
Iterates over indices in the order specified at construction.
this is probably not very useful, but I couldn't decide
which method is better
Source code in src/pydvl/value/sampler.py
__getitem__
¶
__getitem__(key: slice | list[int]) -> PowersetSampler[IndexT]
Permutation samplers cannot be split across indices, so we return a copy of the full sampler.
AntitheticPermutationSampler
¶
AntitheticPermutationSampler(*args, seed: Optional[Seed] = None, **kwargs)
Bases: PermutationSampler[IndexT]
Samples permutations like PermutationSampler, but after each permutation, it returns the same permutation in reverse order.
This sampler was suggested in (Mitchell et al. 2022)1
New in version 0.7.1
Source code in src/pydvl/value/sampler.py
iterindices
¶
iterindices() -> Iterator[IndexT]
Iterates over indices in the order specified at construction.
this is probably not very useful, but I couldn't decide
which method is better
Source code in src/pydvl/value/sampler.py
__getitem__
¶
__getitem__(key: slice | list[int]) -> PowersetSampler[IndexT]
Permutation samplers cannot be split across indices, so we return a copy of the full sampler.
DeterministicPermutationSampler
¶
DeterministicPermutationSampler(*args, seed: Optional[Seed] = None, **kwargs)
Bases: PermutationSampler[IndexT]
Samples all n! permutations of the indices deterministically, and iterates through them, returning sets as required for the permutation-based definition of semi-values.
Warning
This sampler requires caching to be enabled or computation will be doubled wrt. a "direct" implementation of permutation MC
Warning
This sampler is not parallelizable, as it always iterates over the whole set of permutations in the same order. Different processes would always return the same values for all indices.
Source code in src/pydvl/value/sampler.py
iterindices
¶
iterindices() -> Iterator[IndexT]
Iterates over indices in the order specified at construction.
this is probably not very useful, but I couldn't decide
which method is better
Source code in src/pydvl/value/sampler.py
__getitem__
¶
__getitem__(key: slice | list[int]) -> PowersetSampler[IndexT]
Permutation samplers cannot be split across indices, so we return a copy of the full sampler.
RandomHierarchicalSampler
¶
RandomHierarchicalSampler(*args, seed: Optional[Seed] = None, **kwargs)
Bases: StochasticSamplerMixin
, PowersetSampler[IndexT]
For every index, sample a set size, then a set of that size.
Todo
This is unnecessary, but a step towards proper stratified sampling.
Source code in src/pydvl/value/sampler.py
iterindices
¶
iterindices() -> Iterator[IndexT]
Iterates over indices in the order specified at construction.
this is probably not very useful, but I couldn't decide
which method is better