pydvl.valuation.samplers.stratified
¶
This module implements stratified samplers.
Stratified samplers change the subset sampling distribution to be a function of set size with the goal of reducing the variance of the Monte Carlo estimate of the marginal utility. This is essentially importance sampling, with the heuristic that the utility's variance is a function of the training set size.
In the simplest case, StratifiedSampler takes a strategy with a fixed number of samples \(m_k\) for each set size \(k \in [0, n],\) where \(n\) is the total number of indices in the index set \(N.\) It iterates through all indices exactly once (using FiniteSequentialIndexIteration) and for each index \(i \in N\), iterates through all set sizes \(k\), then samples exactly \(m_k\) subsets \(S \subset N_{-i}\) of size \(k\).
This is the procedure used e.g. in Variance Reduced Stratified Sampling (VRDS), introduced by Wu et al. (2023)2, with a simple heuristic setting \(m_k\) to a decreasing function of \(k.\) Their heuristic is a generalization of the ideas in Maleki et al. (2014)3.
Constructing a VRDS
To create a sampler as in Wu et al. (2023)2, one would use the following code:
Available strategies¶
All components described below can be mixed in most ways, but some specific configurations appear in the literature as follows:
-
Constant sample sizes \(m_k = c\), but restricting \(m_k = 0\) if \(k \notin [l_n, u_n]\) for lower and upper bounds \(l_n\) and \(u_n\) determined as functions of \(n,\) the total number of indices. This sampling method was introduced by Watson et al. (2023)1 for the computation of Shapley values as \(\delta\)-Shapley. ??? Example "Constructing a sampler for \(\delta\)-Shapley"
-
Sample sizes decreasing with a power law. Use PowerLawSampleSize for the strategy. This was also proposed in Wu et al. (2023)2. Empirically they found an exponent between -1 and -0.5 to perform well. ??? Example "Power law heuristic" ```python sampler = StratifiedSampler( sample_sizes=PowerLawSampleSize(n_samples=1000, exponent=-0.5), sample_sizes_iteration=RandomSizeIteration, index_iteration=RandomIndexIteration, )
-
Group Testing Sample Size. This heuristic is used for the stratified sampling required by GroupTestingShapleyValuation.
Iterating over indices and its effect on n_samples
¶
As any other sampler,
StratifiedSampler can iterate
over indices finitely or infinitely many times. It can also use
NoIndexIteration to sample from
the whole powerset. This is configured with the parameter index_iteration
.
In the case of finite iterations, the sampler must distribute a finite total number of
samples among all indices. This is done by the
SampleSizeStrategy, which
therefore requires an argument n_samples
to be set to the number of samples per
index.
Warning
On the other hand, if the sampler iterates over the indices indefinitely,
n_indices
can be set, but only relative frequencies will matter. As we see next,
there is another component that will affect how the sampler behaves.
Iterating over set sizes¶
Additionally, StratifiedSampler must iterate over sample sizes \(k \in [0, n]\), and this can be done in multiple ways, configured via subclasses of SampleSizeIteration.
- DeterministicSizeIteration
will generate exactly \(m_k\) samples for each \(k\) before moving to the next \(k.\) This
implies that
n_samples
must be large enough for the computed \(m_k\) to be valid. - RandomSizeIteration will
sample a set size \(k\) according to the distribution of sizes given by the strategy.
When using this in conjunction with an infinite index iteration for the sampler,
n_samples
can be safely set to 1 since \(m_k\) will be interpreted as a probability. - RoundRobinIteration will iterate over set sizes \(k\) and generate one sample each time, until reaching \(m_k.\)
Choosing set size heuristics¶
Optimal sampling (leading to minimal variance estimators) involves a dynamic choice of the number \(m_k\) of samples at size \(k\) based on the variance of the Monte Carlo integrand, but Wu et al. (2023)2 show that there exist methods applicable to semi-values which precompute these sizes while still providing reasonable performance.
The number of sets of size \(k\)
Recall that uniform sampling from the powerset \(2^{N_{-i}}\) produces a binomial distribution of set sizes: the number of sets of size \(k\) is \(m_k = \binom{n-1}{k},\) which is the (inverse of the) Shapley coefficient. Therefore, setting for instance \(m_k = C\) for some constant will drastically reduce the number of sets of size \(\sym n/2\) while increasing the number of sets of size 1 or \(n-1.\) This will then have stark implications on the Monte Carlo estimate of semi-values, depending on how the marginal utility (i.e. the training of the model) is affected by the size of the training set.
This heuristic is configured with the argument sample_size_strategy
of
StratifiedSampler, which is an
instance of
SampleSizeStrategy.
References¶
-
Watson, Lauren, Zeno Kujawa, Rayna Andreeva, Hao-Tsung Yang, Tariq Elahi, and Rik Sarkar. Accelerated Shapley Value Approximation for Data Evaluation. arXiv, 9 November 2023. ↩
-
Wu, Mengmeng, Ruoxi Jia, Changle Lin, Wei Huang, and Xiangyu Chang. Variance Reduced Shapley Value Estimation for Trustworthy Data Valuation. Computers & Operations Research 159 (1 November 2023): 106305. ↩↩↩↩
-
Maleki, Sasan, Long Tran-Thanh, Greg Hines, Talal Rahwan, and Alex Rogers. Bounding the Estimation Error of Sampling-Based Shapley Value Approximation. arXiv:1306.4265 [Cs], 12 February 2014. ↩
SampleSizeStrategy
¶
SampleSizeStrategy(n_samples: int)
Bases: ABC
An object to compute the number of samples to take for a given set size. Based on Wu et al. (2023)1, Theorem 4.2.
To be used with StratifiedSampler.
Sets the number of sets at size \(k\) to be
for some choice of \(f.\) Implementations of this base class must override the
method fun()
. It is provided both the size \(k\) and the total number of indices \(n\)
as arguments.
PARAMETER | DESCRIPTION |
---|---|
n_samples
|
Number of samples for the stratified sampler to generate, per index. If the sampler uses NoIndexIteration, then this will coincide with the total number of samples.
TYPE:
|
Source code in src/pydvl/valuation/samplers/stratified.py
fun
abstractmethod
¶
The function \(f\) to use in the heuristic. Args: n_indices: Size of the index set. subset_len: Size of the subset.
sample_sizes
cached
¶
Precomputes the number of samples to take for each set size, from 0 up to
n_indices
inclusive.
This method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.
Note
A naive implementation with e.g.
would not respect the total number of samples, and would not distribute remainders correctly.PARAMETER | DESCRIPTION |
---|---|
n_indices
|
number of indices in the index set from which to sample. This is
typically
TYPE:
|
quantize
|
Whether to perform the remainder distribution. If
TYPE:
|
Returns:
The exact (integer) number of samples to take for each set size, if
quantize
is True
. Otherwise, the fractional number of samples.
Source code in src/pydvl/valuation/samplers/stratified.py
ConstantSampleSize
¶
Bases: SampleSizeStrategy
Use a constant number of samples for each set size between two (optional) bounds. The total number of samples (per index) is respected.
PARAMETER | DESCRIPTION |
---|---|
n_samples
|
Total number of samples to generate per index.
TYPE:
|
lower_bound
|
Lower bound for the set size. If the set size is smaller than this, the probability of sampling is 0.
TYPE:
|
upper_bound
|
Upper bound for the set size. If the set size is larger than this,
the probability of sampling is 0. If
TYPE:
|
Source code in src/pydvl/valuation/samplers/stratified.py
sample_sizes
cached
¶
Precomputes the number of samples to take for each set size, from 0 up to
n_indices
inclusive.
This method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.
Note
A naive implementation with e.g.
would not respect the total number of samples, and would not distribute remainders correctly.PARAMETER | DESCRIPTION |
---|---|
n_indices
|
number of indices in the index set from which to sample. This is
typically
TYPE:
|
quantize
|
Whether to perform the remainder distribution. If
TYPE:
|
Returns:
The exact (integer) number of samples to take for each set size, if
quantize
is True
. Otherwise, the fractional number of samples.
Source code in src/pydvl/valuation/samplers/stratified.py
GroupTestingSampleSize
¶
GroupTestingSampleSize(n_samples: int = 1)
Bases: SampleSizeStrategy
Heuristic choice of samples per set size used for Group Testing.
GroupTestingShapleyValuation uses this strategy for the stratified sampling of samples with which to construct the linear problem it requires.
This heuristic sets the number of sets at size \(k\) to be
for a total number of samples \(m\) and:
For GT Shapley, \(m=1\) and \(m_k\) is interpreted as a probability of sampling size \(k.\)
Source code in src/pydvl/valuation/samplers/stratified.py
sample_sizes
cached
¶
Precomputes the number of samples to take for each set size, from 0 up to
n_indices
inclusive.
This method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.
Note
A naive implementation with e.g.
would not respect the total number of samples, and would not distribute remainders correctly.PARAMETER | DESCRIPTION |
---|---|
n_indices
|
number of indices in the index set from which to sample. This is
typically
TYPE:
|
quantize
|
Whether to perform the remainder distribution. If
TYPE:
|
Returns:
The exact (integer) number of samples to take for each set size, if
quantize
is True
. Otherwise, the fractional number of samples.
Source code in src/pydvl/valuation/samplers/stratified.py
HarmonicSampleSize
¶
HarmonicSampleSize(n_samples: int)
Bases: SampleSizeStrategy
Heuristic choice of samples per set size for VRDS.
Sets the number of sets at size \(k\) to be
for a total number of samples \(m\) and:
PARAMETER | DESCRIPTION |
---|---|
n_samples
|
Number of samples for the stratified sampler to generate, per index. If the sampler uses NoIndexIteration, then this will coincide with the total number of samples.
TYPE:
|
Source code in src/pydvl/valuation/samplers/stratified.py
sample_sizes
cached
¶
Precomputes the number of samples to take for each set size, from 0 up to
n_indices
inclusive.
This method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.
Note
A naive implementation with e.g.
would not respect the total number of samples, and would not distribute remainders correctly.PARAMETER | DESCRIPTION |
---|---|
n_indices
|
number of indices in the index set from which to sample. This is
typically
TYPE:
|
quantize
|
Whether to perform the remainder distribution. If
TYPE:
|
Returns:
The exact (integer) number of samples to take for each set size, if
quantize
is True
. Otherwise, the fractional number of samples.
Source code in src/pydvl/valuation/samplers/stratified.py
PowerLawSampleSize
¶
Bases: SampleSizeStrategy
Heuristic choice of samples per set size for VRDS.
Sets the number of sets at size \(k\) to be
for a total number of samples \(m\) and:
and some exponent \(a.\) With \(a=1\) one recovers the HarmonicSampleSize heuristic.
PARAMETER | DESCRIPTION |
---|---|
n_samples
|
Total number of samples to generate per index.
TYPE:
|
exponent
|
The exponent to use. Recommended values are between -1 and -0.5.
TYPE:
|
Source code in src/pydvl/valuation/samplers/stratified.py
sample_sizes
cached
¶
Precomputes the number of samples to take for each set size, from 0 up to
n_indices
inclusive.
This method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.
Note
A naive implementation with e.g.
would not respect the total number of samples, and would not distribute remainders correctly.PARAMETER | DESCRIPTION |
---|---|
n_indices
|
number of indices in the index set from which to sample. This is
typically
TYPE:
|
quantize
|
Whether to perform the remainder distribution. If
TYPE:
|
Returns:
The exact (integer) number of samples to take for each set size, if
quantize
is True
. Otherwise, the fractional number of samples.
Source code in src/pydvl/valuation/samplers/stratified.py
SampleSizeIteration
¶
SampleSizeIteration(strategy: SampleSizeStrategy, n_indices: int)
Bases: ABC
Given a strategy and the number of indices, yield tuples (k, count) that the sampler loop will use. Args: strategy: The strategy to use for computing the number of samples to take. n_indices: The number of indices in the index set from which samples are taken.
Source code in src/pydvl/valuation/samplers/stratified.py
DeterministicSizeIteration
¶
DeterministicSizeIteration(strategy: SampleSizeStrategy, n_indices: int)
Bases: SampleSizeIteration
Generates exactly \(m_k\) samples for each set size \(k\) before moving to the next.
Source code in src/pydvl/valuation/samplers/stratified.py
RandomSizeIteration
¶
RandomSizeIteration(
strategy: SampleSizeStrategy, n_indices: int, seed: Seed | None = None
)
Bases: SampleSizeIteration
Draws a set size \(k\) following the distribution of sizes given by the strategy.
Source code in src/pydvl/valuation/samplers/stratified.py
RoundRobinIteration
¶
RoundRobinIteration(strategy: SampleSizeStrategy, n_indices: int)
Bases: SampleSizeIteration
Generates one sample for each set size \(k\) before moving to the next.
This continues yielding until every size \(k\) has been emitted exactly \(m_k\) times.
For example, if strategy.sample_sizes() == [2, 3, 1]
then we want the sequence:
(0,1), (1,1), (2,1), (0,1), (1,1), (1,1)
Source code in src/pydvl/valuation/samplers/stratified.py
StratifiedSampler
¶
StratifiedSampler(
sample_sizes: SampleSizeStrategy,
sample_sizes_iteration: Type[
SampleSizeIteration
] = DeterministicSizeIteration,
batch_size: int = 1,
index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
seed: Seed | None = None,
)
Bases: StochasticSamplerMixin
, PowersetSampler
A sampler stratified by coalition size with variable number of samples per set size.
Variance Reduced Stratified Sampler (VRDS)¶
Stratified sampling was introduced at least as early as Maleki et al. (2014)3. Wu et al. 20232, introduced heuristics adequate for ML tasks.
Choosing the number of samples per set size¶
The idea of VRDS is to allow per-set-size configuration of the total number of samples in order to reduce the variance coming from the marginal utility evaluations.
It is known (Wu et al. (2023), Theorem 4.2) that a minimum variance estimator of
Shapley values samples a number \(m_k\) of sets of size \(k\) based on the variance of
the marginal utility at that set size. However, this quantity is unknown in
practice, so the authors propose a simple heuristic. This function
(sample_sizes
in the arguments) is deterministic, and in particular does
not depend on run-time variance estimates, as an adaptive method might do. Section 4
of Wu et al. (2023) shows a good default choice is based on the harmonic function
of the set size \(k\) (see
HarmonicSampleSize).
PARAMETER | DESCRIPTION |
---|---|
sample_sizes
|
An object which returns the number of samples to
take for a given set size. If
TYPE:
|
sample_sizes_iteration
|
How to loop over sample sizes. The main modes are:
* deterministically. For every k generate m_k samples before moving to k+1.
* stochastically. Sample sizes k according to the distribution given by
TYPE:
|
batch_size
|
The number of samples to generate per batch. Batches are processed together by each subprocess when working in parallel.
TYPE:
|
index_iteration
|
the strategy to use for iterating over indices to update. Note that anything other than returning index exactly once will break the weight computation.
TYPE:
|
seed
|
The seed for the random number generator.
TYPE:
|
New in version 0.10.0
Source code in src/pydvl/valuation/samplers/stratified.py
interrupt
¶
__len__
¶
__len__() -> int
Returns the length of the current sample generation in generate_batches.
RAISES | DESCRIPTION |
---|---|
`TypeError`
|
if the sampler is infinite or generate_batches has not been called yet. |
Source code in src/pydvl/valuation/samplers/base.py
generate_batches
¶
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/base.py
result_updater
¶
result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]
Returns a callable that updates a valuation result with a value update.
Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.
PARAMETER | DESCRIPTION |
---|---|
result
|
The result to update
TYPE:
|
Returns: A callable object that updates the result with a value update
Source code in src/pydvl/valuation/samplers/base.py
index_iterator
¶
index_iterator(indices: IndexSetT) -> Generator[IndexT | None, None, None]
Iterates over indices with the method specified at construction.
Source code in src/pydvl/valuation/samplers/powerset.py
log_weight
¶
The probability of sampling a set of size k is 1/(n choose k) times the probability of choosing size k, which is the number of samples for that size divided by the total number of samples for all sizes:
where \(m_k\) is the number of samples of size \(k\) and \(m\) is the total number of samples.
PARAMETER | DESCRIPTION |
---|---|
n
|
Size of the index set.
TYPE:
|
subset_len
|
Size of the subset.
TYPE:
|
Returns:
The logarithm of the probability of having sampled a set of size subset_len
.