pydvl.valuation.samplers.stratified
¶
This module implements stratified samplers.
Stratified samplers change the subset sampling distribution to be a function of set size with the goal of reducing the variance of the Monte Carlo estimate of the marginal utility. They key assumption / heuristic is that the utility's variance is a function of the training set size.
Stratified sampling was introduced at least as early as Maleki et al. (2014)3. Later on, Wu et al. 20232, extended these heuristics and proposed some for ML tasks, which they called VRDS (see below). Watson et al. (2023)1 used error estimates for certain model classes to propose a different heuristic. See below and \(\delta\)-Shapley.
All stratified samplers in pyDVL are implemented by configuring (or subclassing) the classes StratifiedSampler and StratifiedPermutationSampler.
In the simplest case, StratifiedSampler employs some strategy with a fixed number of samples \(m_k\) for each set size \(k \in [0, n],\) where \(n\) is the total number of indices in the index set \(N.\) It iterates through all indices exactly (e.g. exactly once, if using FiniteSequentialIndexIteration) and for each index \(i \in N\), iterates through all set sizes \(k\), then samples exactly \(m_k\) subsets \(S \subset N_{-i}\) of size \(k\). The iteration over set sizes is configured with SampleSizeIteration.
Choosing set size heuristics¶
Optimal sampling (leading to minimal variance estimators) involves a dynamic choice of the number \(m_k\) of samples at size \(k\) based on the variance of the Monte Carlo integrand, but Wu et al. (2023)2 show that there exist methods applicable to semi-values which precompute these sizes while still providing reasonable performance.
The number of sets of size \(k\)
Recall that uniform sampling from the powerset \(2^{N_{-i}}\) produces a binomial distribution of set sizes: the number of sets of size \(k\) is \(m_k = \binom{n-1}{k},\) which is the (inverse of the) Shapley coefficient. Therefore, setting for instance \(m_k = C\) for some constant will drastically reduce the number of sets of size \(\sim n/2\) while increasing the number of sets of size 1 or \(n-1.\) This will then have stark implications on the Monte Carlo estimate of semi-values, depending on how the marginal utility (i.e. the training of the model) is affected by the size of the training set.
This heuristic is configured with the argument sample_size
of
StratifiedSampler, which is an
instance of
SampleSizeStrategy.
Variance Reduced Stratified Sampler (VRDS)¶
It is known (Wu et al. (2023), Theorem 4.2)2 that a minimum variance estimator of Shapley values samples \(m_k\) sets of size \(k\) based on the variance of the marginal utility at that set size. However, this quantity is unknown in practice, so the authors propose a simple deterministic heuristic, which in particular does not depend on run-time variance estimates, as an adaptive method might do. Section 4 of Wu et al. (2023)2 shows a good default choice is based on the harmonic function of the set size \(k\).
This sampler is available through VRDSSampler.
Constructing a VRDS
It is possible to "manually" replicate VRDSSampler with:
Iterating over indices and its effect on 'n_samples'¶
As any other sampler,
StratifiedSampler can iterate
over indices finitely or infinitely many times. It can also use
NoIndexIteration to sample from
the whole powerset. This is configured with the parameter index_iteration
.
In the case of finite iterations, the sampler must distribute a finite total number of
samples among all indices. This is done by the
SampleSizeStrategy, which
therefore requires an argument n_samples
to be set to the number of samples per
index.
Warning
On the other hand, if the sampler iterates over the indices indefinitely,
n_indices
can be set, but only relative frequencies will matter. As we see next,
there is another component that will affect how the sampler behaves.
Iterating over set sizes¶
Additionally, StratifiedSampler must iterate over sample sizes \(k \in [0, n]\), and this can be done in multiple ways, configured via subclasses of SampleSizeIteration.
- FiniteSequentialSizeIteration
will generate exactly \(m_k\) samples for each \(k\) before moving to the next \(k.\) This
implies that
n_samples
must be large enough for the computed \(m_k\) to be valid. Alternatively, and preferably, some strategies allown_samples = None
to signal them to compute the total number of samples. - RandomSizeIteration will
sample a set size \(k\) according to the distribution of sizes given by the strategy.
When using this in conjunction with an infinite index iteration for the sampler,
n_samples
can be safely left to its defaultNone
since \(m_k\) will be interpreted as a probability. - RoundRobinSizeIteration will iterate over set sizes \(k\) and generate one sample each time, until reaching \(m_k.\)
Other known strategies¶
All components described above can be mixed in most ways, but some specific configurations besides VRDS appear in the literature as follows:
- Sample sizes given by stability bounds related to the algorithm, to ensure good approximation of per-set-size marginal utilities. This sampling method was introduced by Watson et al. (2023)1 for the computation of Shapley values as \(\delta\)-Shapley.
DeltaShapleyNSGDSampleSize implements the choice of \(m_k\) corresponding to non-convex losses minimized with SGD. Alas, it requires many parameters to be set which are hard to estimate in practice. An alternative is to use PowerLawSampleSize (see below) with an exponent of -2, which is the order of \(m_k\) in \(\delta\)-Shapley.
??? Example "Constructing an alternative sampler for DeltaShapley"
```python
config = DeltaShapleyNCSGDConfig(...) # See the docs / paper
sampler = StratifiedSampler(
sample_sizes=DeltaShapleyNSGDSampleSize(config, lower_bound, upper_bound),
sample_sizes_iteration=FiniteSequentialSizeIteration,
index_iteration=SequentialIndexIteration,
)
```
!!! Note "Other bounds"
Implementing the other bounds from the paper is just a matter of subclassing
[SampleSizeStrategy][pydvl.valuation.samplers.stratified.SampleSizeStrategy] and
implementing the `fun` method, as done in
[DeltaShapleyNCSGDSampleSize][pydvl.valuation.samplers.stratified.DeltaShapleyNCSGDSampleSize].
-
Sample sizes decreasing with a power law. Use PowerLawSampleSize for the strategy. This was also proposed in Wu et al. (2023)2. Empirically they found an exponent between -1 and -0.5 to perform well.
-
Group Testing Sample Size. This heuristic is used for the stratified sampling required by GroupTestingShapleyValuation.
References¶
-
Watson, Lauren, Zeno Kujawa, Rayna Andreeva, Hao-Tsung Yang, Tariq Elahi, and Rik Sarkar. Accelerated Shapley Value Approximation for Data Evaluation. arXiv, 9 November 2023. ↩↩
-
Wu, Mengmeng, Ruoxi Jia, Changle Lin, Wei Huang, and Xiangyu Chang. Variance Reduced Shapley Value Estimation for Trustworthy Data Valuation. Computers & Operations Research 159 (1 November 2023): 106305. ↩↩↩↩↩
-
Maleki, Sasan, Long Tran-Thanh, Greg Hines, Talal Rahwan, and Alex Rogers. Bounding the Estimation Error of Sampling-Based Shapley Value Approximation. arXiv:1306.4265 [Cs], 12 February 2014. ↩
ConstantSampleSize
¶
ConstantSampleSize(
n_samples: int | None = None,
lower_bound: int | None = None,
upper_bound: int | None = None,
)
Bases: SampleSizeStrategy
Use a constant number of samples for each set size.
PARAMETER | DESCRIPTION |
---|---|
n_samples
|
Number of samples for the stratified sampler to generate. It can be
TYPE:
|
lower_bound
|
Lower bound for the set sizes.
TYPE:
|
upper_bound
|
Upper bound for the set sizes.
TYPE:
|
Source code in src/pydvl/valuation/samplers/stratified.py
effective_bounds
¶
Returns the effective bounds for the sample sizes, given the number of
indices n
from which sets are sampled.
Note
The number of indices n
will typically be complement_size(len(train))
,
i.e. what we sometimes denote as effective_n
.
PARAMETER | DESCRIPTION |
---|---|
n
|
The number of indices from which subsets are drawn.
TYPE:
|
Returns: A tuple of [lower, upper] bounds for sample sizes (inclusive).
Source code in src/pydvl/valuation/samplers/stratified.py
n_samples_per_index
¶
Returns the total number of samples to take for the given number of indices,
or None
if no limit was set and samples are generated with probabilities.
Source code in src/pydvl/valuation/samplers/stratified.py
sample_sizes
cached
¶
Precomputes the number of samples to take for each set size, from 0 up to
n_indices
inclusive.
If probs
is True
, the result is a vector of floats, where each element
is the probability of sampling a set of size \(k.\) This is useful e.g. for
RandomSizeIteration
where one needs frequencies. In this case self.n_samples_per_index
can be
None
.
If probs
is False
, the result is a vector of integers, where each element
\(k\) is the number of samples to take for set size \(k.\) The sum of all elements
will depend on the value of n_samples
upon construction. It will be
- equal to
n_samples
ifn_samples > 0
, - the sum of the values of
fun
for all set sizes ifn_samples == 0
, - the sum after a normalization such that the smallest non-zero value is 1, if
n_samples == None
.
This somewhat complex behavior is necessary to ensure that the total number of
samples is respected when provided either globally upon construction, or
implicitly by the sum of the values of fun
when the strategy computes absolute
numbers, but also when the strategy has a fun
that returns frequencies. The
special handling of the case n_samples == None
is used implicitly by
StratifiedPermutationSampler
(yeah, it's not pretty).
When probs
is False
, this method corrects rounding errors taking into
account the fractional parts so that the total number of samples is respected,
while allocating remainders in a way that follows the relative sizes of the
fractional parts.
PARAMETER | DESCRIPTION |
---|---|
n_indices
|
number of indices in the index set from which to sample. This is
typically
TYPE:
|
probs
|
Whether to perform the remainder distribution. If
TYPE:
|
Returns:
The exact (integer) number of samples to take for each set size, if
probs
is False
. Otherwise, the fractional number of samples.
Source code in src/pydvl/valuation/samplers/stratified.py
341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 |
|
DeltaShapleyNCSGDConfig
dataclass
¶
DeltaShapleyNCSGDConfig(
max_loss: float,
lipschitz_loss: float,
lipschitz_grad: float,
lr_factor: float,
n_sgd_iter: int,
n_val: int,
n_train: int,
eps: float = 0.01,
delta: float = 0.05,
version: Literal["theorem7", "code"] = "theorem7",
)
Configuration for Delta-Shapley non-convex SGD sampling.
See Watson et al. (2023)1 for details. Given that it can be difficult to estimate these constants, an alternative which has a similar decay rate of \(O(1/k^2)\) is to use a PowerLawSampleSize strategy.
PARAMETER | DESCRIPTION |
---|---|
max_loss
|
Maximum of the loss.
TYPE:
|
lipschitz_loss
|
Lipschitz constant of the loss
TYPE:
|
lipschitz_grad
|
Lipschitz constant of the gradient of the loss
TYPE:
|
lr_factor
|
Learning rate factor c, assuming it has the form \(lpha_t = c/t.\)
TYPE:
|
n_sgd_iter
|
Number of SGD iterations.
TYPE:
|
n_val
|
Number of test samples.
TYPE:
|
n_train
|
Number of training samples.
TYPE:
|
eps
|
Epsilon value in the epsilon-delta guarantee, i.e. the distance to the true value.
TYPE:
|
delta
|
Delta value in the epsilon-delta guarantee, i.e. the probability of failure.
TYPE:
|
version
|
Version of the bound to use: either the one from the paper or the one in the code.
TYPE:
|
DeltaShapleyNCSGDSampleSize
¶
DeltaShapleyNCSGDSampleSize(
config: DeltaShapleyNCSGDConfig,
lower_bound: int | None = None,
upper_bound: int | None = None,
)
Bases: SampleSizeStrategy
Heuristic choice of samples per set size for \(\delta\)-Shapley.
This implements the non-convex SGD bound from Watson et al. (2023)1.
Note
This strategy does not accept an n_samples
parameter because it provides the
absolute number of samples to take for each set size based on the constants
provided in the configuration.
PARAMETER | DESCRIPTION |
---|---|
config
|
Configuration for the sample bound.
TYPE:
|
lower_bound
|
Lower bound for the set sizes. If the set size is smaller than this,
the probability of sampling is 0. If
TYPE:
|
upper_bound
|
Upper bound for the set size. If the set size is larger than this,
the probability of sampling is 0. If
TYPE:
|
Source code in src/pydvl/valuation/samplers/stratified.py
effective_bounds
¶
Returns the effective bounds for the sample sizes, given the number of
indices n
from which sets are sampled.
Note
The number of indices n
will typically be complement_size(len(train))
,
i.e. what we sometimes denote as effective_n
.
PARAMETER | DESCRIPTION |
---|---|
n
|
The number of indices from which subsets are drawn.
TYPE:
|
Returns: A tuple of [lower, upper] bounds for sample sizes (inclusive).
Source code in src/pydvl/valuation/samplers/stratified.py
fun
¶
Computes the number of samples for the non-convex SGD bound.
Source code in src/pydvl/valuation/samplers/stratified.py
n_samples_per_index
¶
Returns the total number of samples to take for the given number of indices,
or None
if no limit was set and samples are generated with probabilities.
Source code in src/pydvl/valuation/samplers/stratified.py
sample_sizes
cached
¶
Precomputes the number of samples to take for each set size, from 0 up to
n_indices
inclusive.
If probs
is True
, the result is a vector of floats, where each element
is the probability of sampling a set of size \(k.\) This is useful e.g. for
RandomSizeIteration
where one needs frequencies. In this case self.n_samples_per_index
can be
None
.
If probs
is False
, the result is a vector of integers, where each element
\(k\) is the number of samples to take for set size \(k.\) The sum of all elements
will depend on the value of n_samples
upon construction. It will be
- equal to
n_samples
ifn_samples > 0
, - the sum of the values of
fun
for all set sizes ifn_samples == 0
, - the sum after a normalization such that the smallest non-zero value is 1, if
n_samples == None
.
This somewhat complex behavior is necessary to ensure that the total number of
samples is respected when provided either globally upon construction, or
implicitly by the sum of the values of fun
when the strategy computes absolute
numbers, but also when the strategy has a fun
that returns frequencies. The
special handling of the case n_samples == None
is used implicitly by
StratifiedPermutationSampler
(yeah, it's not pretty).
When probs
is False
, this method corrects rounding errors taking into
account the fractional parts so that the total number of samples is respected,
while allocating remainders in a way that follows the relative sizes of the
fractional parts.
PARAMETER | DESCRIPTION |
---|---|
n_indices
|
number of indices in the index set from which to sample. This is
typically
TYPE:
|
probs
|
Whether to perform the remainder distribution. If
TYPE:
|
Returns:
The exact (integer) number of samples to take for each set size, if
probs
is False
. Otherwise, the fractional number of samples.
Source code in src/pydvl/valuation/samplers/stratified.py
341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 |
|
FiniteSequentialSizeIteration
¶
FiniteSequentialSizeIteration(strategy: SampleSizeStrategy, n_indices: int)
Bases: SampleSizeIteration
Generates exactly \(m_k\) samples for each set size \(k\) before moving to the next.
Only for deterministic sample sizes
This iteration is only valid for deterministic sample sizes. In particular,
n_samples
must be set to the total number of samples.
Source code in src/pydvl/valuation/samplers/stratified.py
GroupTestingSampleSize
¶
GroupTestingSampleSize(
n_samples: int | None = None,
lower_bound: int | None = None,
upper_bound: int | None = None,
)
Bases: SampleSizeStrategy
Heuristic choice of samples per set size used for Group Testing.
GroupTestingShapleyValuation uses this strategy for the stratified sampling of samples with which to construct the linear problem it requires.
This heuristic sets the number of sets at size \(k\) to be
for a total number of samples \(m\) and:
For GT Shapley, \(m=1\) and \(m_k\) is interpreted as a probability of sampling size \(k.\)
Source code in src/pydvl/valuation/samplers/stratified.py
effective_bounds
¶
Returns the effective bounds for the sample sizes, given the number of
indices n
from which sets are sampled.
Note
The number of indices n
will typically be complement_size(len(train))
,
i.e. what we sometimes denote as effective_n
.
PARAMETER | DESCRIPTION |
---|---|
n
|
The number of indices from which subsets are drawn.
TYPE:
|
Returns: A tuple of [lower, upper] bounds for sample sizes (inclusive).
Source code in src/pydvl/valuation/samplers/stratified.py
n_samples_per_index
¶
Returns the total number of samples to take for the given number of indices,
or None
if no limit was set and samples are generated with probabilities.
Source code in src/pydvl/valuation/samplers/stratified.py
sample_sizes
cached
¶
Precomputes the number of samples to take for each set size, from 0 up to
n_indices
inclusive.
If probs
is True
, the result is a vector of floats, where each element
is the probability of sampling a set of size \(k.\) This is useful e.g. for
RandomSizeIteration
where one needs frequencies. In this case self.n_samples_per_index
can be
None
.
If probs
is False
, the result is a vector of integers, where each element
\(k\) is the number of samples to take for set size \(k.\) The sum of all elements
will depend on the value of n_samples
upon construction. It will be
- equal to
n_samples
ifn_samples > 0
, - the sum of the values of
fun
for all set sizes ifn_samples == 0
, - the sum after a normalization such that the smallest non-zero value is 1, if
n_samples == None
.
This somewhat complex behavior is necessary to ensure that the total number of
samples is respected when provided either globally upon construction, or
implicitly by the sum of the values of fun
when the strategy computes absolute
numbers, but also when the strategy has a fun
that returns frequencies. The
special handling of the case n_samples == None
is used implicitly by
StratifiedPermutationSampler
(yeah, it's not pretty).
When probs
is False
, this method corrects rounding errors taking into
account the fractional parts so that the total number of samples is respected,
while allocating remainders in a way that follows the relative sizes of the
fractional parts.
PARAMETER | DESCRIPTION |
---|---|
n_indices
|
number of indices in the index set from which to sample. This is
typically
TYPE:
|
probs
|
Whether to perform the remainder distribution. If
TYPE:
|
Returns:
The exact (integer) number of samples to take for each set size, if
probs
is False
. Otherwise, the fractional number of samples.
Source code in src/pydvl/valuation/samplers/stratified.py
341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 |
|
HarmonicSampleSize
¶
HarmonicSampleSize(
n_samples: int | None = None,
lower_bound: int | None = None,
upper_bound: int | None = None,
)
Bases: SampleSizeStrategy
Heuristic choice of samples per set size for VRDS.
Sets the number of sets at size \(k\) to be
for a total number of samples \(m\) and:
Source code in src/pydvl/valuation/samplers/stratified.py
effective_bounds
¶
Returns the effective bounds for the sample sizes, given the number of
indices n
from which sets are sampled.
Note
The number of indices n
will typically be complement_size(len(train))
,
i.e. what we sometimes denote as effective_n
.
PARAMETER | DESCRIPTION |
---|---|
n
|
The number of indices from which subsets are drawn.
TYPE:
|
Returns: A tuple of [lower, upper] bounds for sample sizes (inclusive).
Source code in src/pydvl/valuation/samplers/stratified.py
n_samples_per_index
¶
Returns the total number of samples to take for the given number of indices,
or None
if no limit was set and samples are generated with probabilities.
Source code in src/pydvl/valuation/samplers/stratified.py
sample_sizes
cached
¶
Precomputes the number of samples to take for each set size, from 0 up to
n_indices
inclusive.
If probs
is True
, the result is a vector of floats, where each element
is the probability of sampling a set of size \(k.\) This is useful e.g. for
RandomSizeIteration
where one needs frequencies. In this case self.n_samples_per_index
can be
None
.
If probs
is False
, the result is a vector of integers, where each element
\(k\) is the number of samples to take for set size \(k.\) The sum of all elements
will depend on the value of n_samples
upon construction. It will be
- equal to
n_samples
ifn_samples > 0
, - the sum of the values of
fun
for all set sizes ifn_samples == 0
, - the sum after a normalization such that the smallest non-zero value is 1, if
n_samples == None
.
This somewhat complex behavior is necessary to ensure that the total number of
samples is respected when provided either globally upon construction, or
implicitly by the sum of the values of fun
when the strategy computes absolute
numbers, but also when the strategy has a fun
that returns frequencies. The
special handling of the case n_samples == None
is used implicitly by
StratifiedPermutationSampler
(yeah, it's not pretty).
When probs
is False
, this method corrects rounding errors taking into
account the fractional parts so that the total number of samples is respected,
while allocating remainders in a way that follows the relative sizes of the
fractional parts.
PARAMETER | DESCRIPTION |
---|---|
n_indices
|
number of indices in the index set from which to sample. This is
typically
TYPE:
|
probs
|
Whether to perform the remainder distribution. If
TYPE:
|
Returns:
The exact (integer) number of samples to take for each set size, if
probs
is False
. Otherwise, the fractional number of samples.
Source code in src/pydvl/valuation/samplers/stratified.py
341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 |
|
LinearSampleSize
¶
LinearSampleSize(
scale: float,
offset: int = 0,
n_samples: int | None = None,
lower_bound: int | None = None,
upper_bound: int | None = None,
)
Bases: SampleSizeStrategy
Use a linear function of the set size to determine the number of samples to take.
This is mostly intended for testing purposes, as it is not a very useful heuristic.
PARAMETER | DESCRIPTION |
---|---|
scale
|
Slope of the linear function.
TYPE:
|
offset
|
Offset of the linear function.
TYPE:
|
n_samples
|
Number of samples for the stratified sampler to generate. It can be
TYPE:
|
lower_bound
|
Lower bound for the set sizes.
TYPE:
|
upper_bound
|
Upper bound for the set sizes.
TYPE:
|
Source code in src/pydvl/valuation/samplers/stratified.py
effective_bounds
¶
Returns the effective bounds for the sample sizes, given the number of
indices n
from which sets are sampled.
Note
The number of indices n
will typically be complement_size(len(train))
,
i.e. what we sometimes denote as effective_n
.
PARAMETER | DESCRIPTION |
---|---|
n
|
The number of indices from which subsets are drawn.
TYPE:
|
Returns: A tuple of [lower, upper] bounds for sample sizes (inclusive).
Source code in src/pydvl/valuation/samplers/stratified.py
n_samples_per_index
¶
Returns the total number of samples to take for the given number of indices,
or None
if no limit was set and samples are generated with probabilities.
Source code in src/pydvl/valuation/samplers/stratified.py
sample_sizes
cached
¶
Precomputes the number of samples to take for each set size, from 0 up to
n_indices
inclusive.
If probs
is True
, the result is a vector of floats, where each element
is the probability of sampling a set of size \(k.\) This is useful e.g. for
RandomSizeIteration
where one needs frequencies. In this case self.n_samples_per_index
can be
None
.
If probs
is False
, the result is a vector of integers, where each element
\(k\) is the number of samples to take for set size \(k.\) The sum of all elements
will depend on the value of n_samples
upon construction. It will be
- equal to
n_samples
ifn_samples > 0
, - the sum of the values of
fun
for all set sizes ifn_samples == 0
, - the sum after a normalization such that the smallest non-zero value is 1, if
n_samples == None
.
This somewhat complex behavior is necessary to ensure that the total number of
samples is respected when provided either globally upon construction, or
implicitly by the sum of the values of fun
when the strategy computes absolute
numbers, but also when the strategy has a fun
that returns frequencies. The
special handling of the case n_samples == None
is used implicitly by
StratifiedPermutationSampler
(yeah, it's not pretty).
When probs
is False
, this method corrects rounding errors taking into
account the fractional parts so that the total number of samples is respected,
while allocating remainders in a way that follows the relative sizes of the
fractional parts.
PARAMETER | DESCRIPTION |
---|---|
n_indices
|
number of indices in the index set from which to sample. This is
typically
TYPE:
|
probs
|
Whether to perform the remainder distribution. If
TYPE:
|
Returns:
The exact (integer) number of samples to take for each set size, if
probs
is False
. Otherwise, the fractional number of samples.
Source code in src/pydvl/valuation/samplers/stratified.py
341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 |
|
PowerLawSampleSize
¶
PowerLawSampleSize(
exponent: float,
n_samples: int | None = None,
lower_bound: int | None = None,
upper_bound: int | None = None,
)
Bases: SampleSizeStrategy
Heuristic choice of samples per set size for VRDS.
Sets the number of sets at size \(k\) to be
for a total number of samples \(m\) and:
and some exponent \(a.\) With \(a=-1\) one recovers the HarmonicSampleSize heuristic. With \(a=-2\) one has the same asymptotic behaviour as the \(\delta\)-Shapley strategies.
PARAMETER | DESCRIPTION |
---|---|
n_samples
|
Total number of samples to generate per index.
TYPE:
|
exponent
|
The exponent to use. Recommended values are between -1 and -0.5.
TYPE:
|
Source code in src/pydvl/valuation/samplers/stratified.py
effective_bounds
¶
Returns the effective bounds for the sample sizes, given the number of
indices n
from which sets are sampled.
Note
The number of indices n
will typically be complement_size(len(train))
,
i.e. what we sometimes denote as effective_n
.
PARAMETER | DESCRIPTION |
---|---|
n
|
The number of indices from which subsets are drawn.
TYPE:
|
Returns: A tuple of [lower, upper] bounds for sample sizes (inclusive).
Source code in src/pydvl/valuation/samplers/stratified.py
n_samples_per_index
¶
Returns the total number of samples to take for the given number of indices,
or None
if no limit was set and samples are generated with probabilities.
Source code in src/pydvl/valuation/samplers/stratified.py
sample_sizes
cached
¶
Precomputes the number of samples to take for each set size, from 0 up to
n_indices
inclusive.
If probs
is True
, the result is a vector of floats, where each element
is the probability of sampling a set of size \(k.\) This is useful e.g. for
RandomSizeIteration
where one needs frequencies. In this case self.n_samples_per_index
can be
None
.
If probs
is False
, the result is a vector of integers, where each element
\(k\) is the number of samples to take for set size \(k.\) The sum of all elements
will depend on the value of n_samples
upon construction. It will be
- equal to
n_samples
ifn_samples > 0
, - the sum of the values of
fun
for all set sizes ifn_samples == 0
, - the sum after a normalization such that the smallest non-zero value is 1, if
n_samples == None
.
This somewhat complex behavior is necessary to ensure that the total number of
samples is respected when provided either globally upon construction, or
implicitly by the sum of the values of fun
when the strategy computes absolute
numbers, but also when the strategy has a fun
that returns frequencies. The
special handling of the case n_samples == None
is used implicitly by
StratifiedPermutationSampler
(yeah, it's not pretty).
When probs
is False
, this method corrects rounding errors taking into
account the fractional parts so that the total number of samples is respected,
while allocating remainders in a way that follows the relative sizes of the
fractional parts.
PARAMETER | DESCRIPTION |
---|---|
n_indices
|
number of indices in the index set from which to sample. This is
typically
TYPE:
|
probs
|
Whether to perform the remainder distribution. If
TYPE:
|
Returns:
The exact (integer) number of samples to take for each set size, if
probs
is False
. Otherwise, the fractional number of samples.
Source code in src/pydvl/valuation/samplers/stratified.py
341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 |
|
RandomSizeIteration
¶
RandomSizeIteration(
strategy: SampleSizeStrategy, n_indices: int, seed: Seed | None = None
)
Bases: SampleSizeIteration
Draws a set size \(k\) following the distribution of sizes given by the strategy.
Source code in src/pydvl/valuation/samplers/stratified.py
RoundRobinSizeIteration
¶
RoundRobinSizeIteration(strategy: SampleSizeStrategy, n_indices: int)
Bases: SampleSizeIteration
Generates one sample for each set size \(k\) before moving to the next.
This continues yielding until every size \(k\) has been emitted exactly \(m_k\) times.
For example, if strategy.sample_sizes() == [2, 3, 1]
then we want the sequence:
(0,1), (1,1), (2,1), (0,1), (1,1), (1,1)
Only for deterministic sample sizes
This iteration is only valid for deterministic sample sizes. In particular,
n_samples
must be set to the total number of samples.
Source code in src/pydvl/valuation/samplers/stratified.py
SampleSizeIteration
¶
SampleSizeIteration(strategy: SampleSizeStrategy, n_indices: int)
Bases: ABC
Given a strategy and the number of indices, yield tuples (k, count) that the sampler loop will use. Args: strategy: The strategy to use for computing the number of samples to take. n_indices: The number of indices in the index set from which samples are taken.
Source code in src/pydvl/valuation/samplers/stratified.py
SampleSizeStrategy
¶
SampleSizeStrategy(
n_samples: int | None = None,
lower_bound: int | None = None,
upper_bound: int | None = None,
)
Bases: ABC
An object to compute the number of samples to take for a given set size.
To be used with StratifiedSampler.
Following the structure proposed in Wu et al. (2023),2 this sets the number of sets at size \(k\) to be:
for some choice of \(f.\) Implementations of this base class must override the
method fun()
implementing \(f\). It is provided both the size \(k\) and the total
number of indices \(n\) as arguments.
The argument n_samples
can be:
- Fixed to a positive integer. For powerset samplers, this is the number of samples
per index that will be generated, i.e. if the sampler iterates over each index
exactly once, e.g.
FiniteSequentialIndexIteration,
then the total number of samples will be
n_samples * n_indices
. - Fixed to
0
to indicate the strategy produces an absolute number of samples, this only used when subclassing and implementing afun
that does not return frequencies. None
if only sampling probabilities are required, e.g. when using a stochastic sampler and a RandomSizeIteration.
PARAMETER | DESCRIPTION |
---|---|
n_samples
|
Number of samples for the stratified sampler to generate. It can be
TYPE:
|
lower_bound
|
Lower bound for the set sizes. If the set size is smaller than this,
the probability of sampling is 0. If
TYPE:
|
upper_bound
|
Upper bound for the set size. If the set size is larger than this,
the probability of sampling is 0. If
TYPE:
|
Source code in src/pydvl/valuation/samplers/stratified.py
effective_bounds
¶
Returns the effective bounds for the sample sizes, given the number of
indices n
from which sets are sampled.
Note
The number of indices n
will typically be complement_size(len(train))
,
i.e. what we sometimes denote as effective_n
.
PARAMETER | DESCRIPTION |
---|---|
n
|
The number of indices from which subsets are drawn.
TYPE:
|
Returns: A tuple of [lower, upper] bounds for sample sizes (inclusive).
Source code in src/pydvl/valuation/samplers/stratified.py
fun
abstractmethod
¶
The function \(f\) to use in the heuristic. Args: n_indices: Size of the index set. subset_len: Size of the subset.
n_samples_per_index
¶
Returns the total number of samples to take for the given number of indices,
or None
if no limit was set and samples are generated with probabilities.
Source code in src/pydvl/valuation/samplers/stratified.py
sample_sizes
cached
¶
Precomputes the number of samples to take for each set size, from 0 up to
n_indices
inclusive.
If probs
is True
, the result is a vector of floats, where each element
is the probability of sampling a set of size \(k.\) This is useful e.g. for
RandomSizeIteration
where one needs frequencies. In this case self.n_samples_per_index
can be
None
.
If probs
is False
, the result is a vector of integers, where each element
\(k\) is the number of samples to take for set size \(k.\) The sum of all elements
will depend on the value of n_samples
upon construction. It will be
- equal to
n_samples
ifn_samples > 0
, - the sum of the values of
fun
for all set sizes ifn_samples == 0
, - the sum after a normalization such that the smallest non-zero value is 1, if
n_samples == None
.
This somewhat complex behavior is necessary to ensure that the total number of
samples is respected when provided either globally upon construction, or
implicitly by the sum of the values of fun
when the strategy computes absolute
numbers, but also when the strategy has a fun
that returns frequencies. The
special handling of the case n_samples == None
is used implicitly by
StratifiedPermutationSampler
(yeah, it's not pretty).
When probs
is False
, this method corrects rounding errors taking into
account the fractional parts so that the total number of samples is respected,
while allocating remainders in a way that follows the relative sizes of the
fractional parts.
PARAMETER | DESCRIPTION |
---|---|
n_indices
|
number of indices in the index set from which to sample. This is
typically
TYPE:
|
probs
|
Whether to perform the remainder distribution. If
TYPE:
|
Returns:
The exact (integer) number of samples to take for each set size, if
probs
is False
. Otherwise, the fractional number of samples.
Source code in src/pydvl/valuation/samplers/stratified.py
341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 |
|
StratifiedPermutation
dataclass
¶
StratifiedPermutation(
idx: IndexT | None,
subset: NDArray[IndexT],
lower_bound: int,
upper_bound: int,
)
Bases: Sample
A sample for the stratified permutation sampling strategy.
This is a subclass of Sample which adds information about the set sizes to sample. It is used by StratifiedPermutationEvaluationStrategy to clip permutations to the required lengths.
__hash__
¶
This type must be hashable for the utility caching to work. We use hashlib.sha256 which is about 4-5x faster than hash(), and returns the same value in all processes, as opposed to hash() which is salted in each process
Source code in src/pydvl/valuation/types.py
with_idx
¶
Return a copy of sample with idx changed.
Returns the original sample if idx is the same.
PARAMETER | DESCRIPTION |
---|---|
idx
|
New value for idx.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Sample
|
A copy of the sample with idx changed.
TYPE:
|
Source code in src/pydvl/valuation/types.py
with_idx_in_subset
¶
Return a copy of sample with idx added to the subset.
Returns the original sample if idx was already part of the subset.
RETURNS | DESCRIPTION |
---|---|
Sample
|
A copy of the sample with idx added to the subset.
TYPE:
|
RAISES | DESCRIPTION |
---|---|
ValueError
|
If idx is None. |
Source code in src/pydvl/valuation/types.py
with_subset
¶
with_subset(subset: NDArray[IndexT]) -> Self
Return a copy of sample with subset changed.
Returns the original sample if subset is the same.
PARAMETER | DESCRIPTION |
---|---|
subset
|
New value for subset.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Sample
|
A copy of the sample with subset changed.
TYPE:
|
Source code in src/pydvl/valuation/types.py
StratifiedPermutationEvaluationStrategy
¶
StratifiedPermutationEvaluationStrategy(
sampler: PermutationSamplerBase,
utility: UtilityBase,
coefficient: SemivalueCoefficient | None,
)
Bases: PermutationEvaluationStrategy
Evaluation strategy for the StratifiedPermutationSampler.
Experimental
Source code in src/pydvl/valuation/samplers/permutation.py
StratifiedPermutationSampler
¶
StratifiedPermutationSampler(
sample_sizes: SampleSizeStrategy,
truncation: TruncationPolicy | None = None,
seed: Seed | None = None,
batch_size: int = 1,
)
Bases: PermutationSampler
A stratified permutation sampler.
Experimental
This is just an approximation for now. The number of set sizes generated is only roughly equal to that specified by the SampleSizeStrategy. In particular, there is a single counter of sizes for all indices.
PARAMETER | DESCRIPTION |
---|---|
sample_sizes
|
An object which returns the number of samples to take for a given set size. If the strategy fixes a total number of samples for each set size, then this sampler will produce (approximately) that amount of samples for each index. If it does not, then the sampler will generate infinitely many samples, with set sizes following the distribution set by the strategy.
TYPE:
|
truncation
|
A policy to stop the permutation early.
TYPE:
|
seed
|
Seed for the random number generator.
TYPE:
|
batch_size
|
The number of samples (full permutations) to generate at once.
TYPE:
|
Source code in src/pydvl/valuation/samplers/stratified.py
__len__
¶
__len__() -> int
Returns the length of the current sample generation in generate_batches.
RAISES | DESCRIPTION |
---|---|
`TypeError`
|
if the sampler is infinite or generate_batches has not been called yet. |
Source code in src/pydvl/valuation/samplers/base.py
complement_size
¶
Size of the complement of an index wrt. set size n
.
Required in certain coefficient computations. Even though we are sampling permutations, updates are always done per-index and the size of the complement is always \(n-1\).
Source code in src/pydvl/valuation/samplers/permutation.py
generate
¶
generate(indices: IndexSetT) -> SampleGenerator[StratifiedPermutation]
Generates the permutation samples.
These samples include information as to what sample sizes can be taken from the permutation by the strategy.
Info
This generator ignores skip_indices
.
PARAMETER | DESCRIPTION |
---|---|
indices
|
The indices to sample from. If empty, no samples are generated.
TYPE:
|
Source code in src/pydvl/valuation/samplers/stratified.py
generate_batches
¶
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/base.py
interrupt
¶
log_weight
¶
The probability of sampling a set of size subset_len
from n
indices.
See StratifiedSampler.log_weight()
PARAMETER | DESCRIPTION |
---|---|
n
|
Size of the index set.
TYPE:
|
subset_len
|
Size of the subset.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
The logarithm of the probability of having sampled a set of size
|
Source code in src/pydvl/valuation/samplers/stratified.py
result_updater
¶
result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]
Returns an object that updates a valuation result with a value update.
Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.
PARAMETER | DESCRIPTION |
---|---|
result
|
The result to update
TYPE:
|
Returns: A callable object that updates the result with a value update
Source code in src/pydvl/valuation/samplers/base.py
StratifiedSampler
¶
StratifiedSampler(
sample_sizes: SampleSizeStrategy,
sample_sizes_iteration: Type[
SampleSizeIteration
] = FiniteSequentialSizeIteration,
batch_size: int = 1,
index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
seed: Seed | None = None,
)
Bases: StochasticSamplerMixin
, PowersetSampler
A sampler stratified by coalition size with variable number of samples per set size.
PARAMETER | DESCRIPTION |
---|---|
sample_sizes
|
An object which returns the number of samples to
take for a given set size. If
TYPE:
|
sample_sizes_iteration
|
How to loop over sample sizes. The main modes are:
* deterministically. For every k generate m_k samples before moving to k+1.
* stochastically. Sample sizes k according to the distribution given by
TYPE:
|
batch_size
|
The number of samples to generate per batch. Batches are processed together by each subprocess when working in parallel.
TYPE:
|
index_iteration
|
the strategy to use for iterating over indices to update.
TYPE:
|
seed
|
The seed for the random number generator.
TYPE:
|
New in version 0.10.0
Source code in src/pydvl/valuation/samplers/stratified.py
__len__
¶
__len__() -> int
Returns the length of the current sample generation in generate_batches.
RAISES | DESCRIPTION |
---|---|
`TypeError`
|
if the sampler is infinite or generate_batches has not been called yet. |
Source code in src/pydvl/valuation/samplers/base.py
generate_batches
¶
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/base.py
index_iterable
¶
index_iterable(indices: IndexSetT) -> Generator[IndexT | None, None, None]
Iterates over indices with the method specified at construction.
Source code in src/pydvl/valuation/samplers/powerset.py
interrupt
¶
log_weight
¶
The probability of sampling a set of size k is 1/(n choose k) times the probability of the set having size k, which is the number of samples for that size divided by the total number of samples for all sizes:
where \(m_k\) is the number of samples of size \(k\) and \(m\) is the total number of samples.
PARAMETER | DESCRIPTION |
---|---|
n
|
Size of the index set.
TYPE:
|
subset_len
|
Size of the subset.
TYPE:
|
Returns:
The logarithm of the probability of having sampled a set of size subset_len
.
Source code in src/pydvl/valuation/samplers/stratified.py
result_updater
¶
result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]
Returns an object that updates a valuation result with a value update.
Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.
PARAMETER | DESCRIPTION |
---|---|
result
|
The result to update
TYPE:
|
Returns: A callable object that updates the result with a value update
Source code in src/pydvl/valuation/samplers/base.py
VRDSSampler
¶
Bases: StratifiedSampler
A sampler stratified by coalition size with variable number of samples per set size.
This sampler iterates once per index and generates a fixed mount of subsets of each size in its complement.
This is a convenience subclass of StratifiedSampler which implements the VRDS heuristic from Wu et al. (2023)2.
It is functionally equivalent to a StratifiedSampler with HarmonicSampleSize, FiniteSequentialSizeIteration, and FiniteSequentialIndexIteration.
PARAMETER | DESCRIPTION |
---|---|
n_samples_per_index
|
The number of samples to generate per index. To compute with the (ε,δ) bound from the paper, use min_samples(). The distribution per set size will follow a harmonic function, as defined in HarmonicSampleSize.
TYPE:
|
batch_size
|
The number of samples to generate per batch. Batches are processed together by each subprocess when working in parallel.
TYPE:
|
seed
|
The seed for the random number generator.
TYPE:
|
Source code in src/pydvl/valuation/samplers/stratified.py
__len__
¶
__len__() -> int
Returns the length of the current sample generation in generate_batches.
RAISES | DESCRIPTION |
---|---|
`TypeError`
|
if the sampler is infinite or generate_batches has not been called yet. |
Source code in src/pydvl/valuation/samplers/base.py
generate_batches
¶
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/base.py
index_iterable
¶
index_iterable(indices: IndexSetT) -> Generator[IndexT | None, None, None]
Iterates over indices with the method specified at construction.
Source code in src/pydvl/valuation/samplers/powerset.py
interrupt
¶
log_weight
¶
The probability of sampling a set of size k is 1/(n choose k) times the probability of the set having size k, which is the number of samples for that size divided by the total number of samples for all sizes:
where \(m_k\) is the number of samples of size \(k\) and \(m\) is the total number of samples.
PARAMETER | DESCRIPTION |
---|---|
n
|
Size of the index set.
TYPE:
|
subset_len
|
Size of the subset.
TYPE:
|
Returns:
The logarithm of the probability of having sampled a set of size subset_len
.
Source code in src/pydvl/valuation/samplers/stratified.py
min_samples
staticmethod
¶
Computes the minimal amount of samples for an (ε,δ)-approximation of data Shapley.
This is the bound shown in Theorem 4.3 of Wu et al. (2023)2.
PARAMETER | DESCRIPTION |
---|---|
n_indices
|
The number of indices in the index set.
TYPE:
|
eps
|
The epsilon value in the epsilon-delta guarantee, i.e. the distance to the true value.
TYPE:
|
delta
|
The delta value in the epsilon-delta guarantee, i.e. the probability of failure.
TYPE:
|
Returns: \((2 \log(2/\delta) / \epsilon^2) \log(n + 1)^2.\)
Source code in src/pydvl/valuation/samplers/stratified.py
result_updater
¶
result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]
Returns an object that updates a valuation result with a value update.
Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.
PARAMETER | DESCRIPTION |
---|---|
result
|
The result to update
TYPE:
|
Returns: A callable object that updates the result with a value update