pydvl.valuation.samplers.base
Base classes for samplers and evaluation strategies.
Read pydvl.valuation.samplers for an architectural overview of the samplers and their evaluation strategies.
For an explanation of the interactions between sampler weights, semi-value coefficients and importance sampling, read [[semi-values-sampling]].
EvaluationStrategy
EvaluationStrategy(
    utility: UtilityBase, log_coefficient: SemivalueCoefficient | None
)
Bases: ABC, Generic[SamplerT, ValueUpdateT]
An evaluation strategy for samplers.
An evaluation strategy is used to process the sample batches generated by a sampler and compute value updates. It's the main loop of the workers.
Different sampling schemes require different strategies for the evaluation of the utilities. This class defines the common interface.
For instance PermutationEvaluationStrategy evaluates the samples from PermutationSampler in sequence to save computation, and MSREvaluationStrategy keeps track of update signs for the two sums required by the MSR method.
Usage pattern in valuation methods
    def fit(self, data: Dataset):
        self.utility = self.utility.with_dataset(data)
        strategy = self.sampler.make_strategy(self.utility, self.log_coefficient)
        delayed_batches = Parallel()(
            delayed(strategy.process)(batch=list(batch), is_interrupted=flag)
            for batch in self.sampler
        )
        for batch in delayed_batches:
            for evaluation in batch:
                self._result.update(evaluation.idx, evaluation.update)
            if self.is_done(self._result):
                flag.set()
                break
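The same control flow can be tried out without pyDVL or joblib. The following is a minimal, sequential sketch under stated assumptions: `ToyUpdate`, `toy_utility` and `process` are hypothetical stand-ins rather than pyDVL's classes, a `threading.Event` plays the role of `flag`, and a fixed batch count replaces `is_done()`.

```python
import threading
from typing import Callable, NamedTuple


class ToyUpdate(NamedTuple):
    """Hypothetical stand-in for a value update: an index and a marginal."""

    idx: int
    update: float


def toy_utility(subset: frozenset) -> float:
    """Hypothetical utility: the utility of a set is simply its size."""
    return float(len(subset))


def process(batch, is_interrupted: Callable[[], bool]) -> list:
    """Evaluate the marginal u(S + {i}) - u(S) for each sample (i, S) in the batch."""
    updates = []
    for idx, subset in batch:
        marginal = toy_utility(subset | {idx}) - toy_utility(subset)
        updates.append(ToyUpdate(idx, marginal))
        if is_interrupted():  # stop early once the main loop has signalled
            break
    return updates


flag = threading.Event()
batches = [[(0, frozenset({1}))], [(1, frozenset({0}))], [(0, frozenset())]]
totals = {}
processed = 0
for batch in batches:
    for evaluation in process(batch, flag.is_set):
        totals[evaluation.idx] = totals.get(evaluation.idx, 0.0) + evaluation.update
    processed += 1
    if processed >= 2:  # toy stopping criterion, in place of is_done()
        flag.set()
        break
```

The third batch is never processed because the stopping criterion sets the flag after the second one, which mirrors how the main loop above stops consuming results.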
PARAMETER | DESCRIPTION |
---|---|
utility | Required to set up some strategies and to process the samples. Since this contains the training data, it is expensive to pickle and send to workers. |
log_coefficient | An additional coefficient to multiply marginals with. This depends on the valuation method, hence the delayed setup. If |
Source code in src/pydvl/valuation/samplers/base.py
process (abstractmethod)
process(
    batch: SampleBatch, is_interrupted: NullaryPredicate
) -> list[ValueUpdateT]
Processes batches of samples using the evaluator, with the strategy required for the sampler.
Warning
This method is intended to be used by the evaluator to process the samples in one batch, which means it might be sent to another process. Be careful with the objects you use here, as they will be pickled and sent over the wire.
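The pickling caveat is general Python behaviour and can be checked with the standard library alone. The sketch below uses a hypothetical helper, `is_picklable`, and plain tuples as stand-ins for samples; it makes no claim about which objects pyDVL actually sends to workers.

```python
import pickle
import threading


def is_picklable(obj) -> bool:
    """Return True if pickle.dumps succeeds on obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:  # pickle raises different exception types for different objects
        return False


# A batch made of plain tuples and frozensets round-trips without trouble.
batch = [(0, frozenset({1, 2})), (1, frozenset({0}))]
restored = pickle.loads(pickle.dumps(batch))

# Lambdas and lock-holding objects such as threading.Event cannot be pickled.
lambda_ok = is_picklable(lambda: False)
event_ok = is_picklable(threading.Event())
```

Anything stored on a strategy, or passed to `process`, is subject to the same constraint once the call crosses a process boundary.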
PARAMETER | DESCRIPTION |
---|---|
batch | A batch of samples to process. |
is_interrupted | A predicate that returns True if the processing should be interrupted. |
RETURNS | DESCRIPTION |
---|---|
list[ValueUpdateT] | Updates to values as tuples (idx, update). |
IndexSampler
IndexSampler(batch_size: int = 1)
Bases: ABC, Generic[SampleT, ValueUpdateT]
Samplers are custom iterables over batches of subsets of indices.
Calling generate_batches(indices) on a sampler returns a generator over batches of Samples. Each batch is a list of samples, and each Sample is a tuple of the form \((i, S)\), where \(i\) is an index of interest, and \(S \subset I \setminus \{i\}\) is a subset of the complement of \(i\) in \(I\).
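To make the shape of a sample concrete, the sketch below enumerates every pair \((i, S)\) for a small index set. `enumerate_samples` is a hypothetical helper written for illustration, not a pyDVL sampler, and it treats \(\subset\) as non-strict, so the full complement is included.

```python
from itertools import combinations


def enumerate_samples(indices):
    """Yield every pair (i, S) with i in indices and S a subset of indices without i."""
    for i in indices:
        complement = [j for j in indices if j != i]
        for k in range(len(complement) + 1):
            for subset in combinations(complement, k):
                yield i, frozenset(subset)


samples = list(enumerate_samples([0, 1, 2]))
```

With three indices there are 3 choices for \(i\) and \(2^2\) subsets of the remaining two indices, so the enumeration produces 12 samples.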
Warning
Samplers are not iterators themselves, so each call to generate_batches(), e.g. in a new for loop, creates a new iterator.
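The distinction between an iterable and an iterator is plain Python. A minimal sketch, assuming a hypothetical `ToySampler` whose `generate_batches()` takes no arguments (unlike the real method):

```python
class ToySampler:
    """Hypothetical iterable: hands out a fresh generator on every call."""

    def __init__(self, indices):
        self.indices = list(indices)

    def generate_batches(self):
        # A generator function: each call starts again from the beginning.
        for i in self.indices:
            yield [i]

    def __iter__(self):
        return self.generate_batches()


sampler = ToySampler([0, 1, 2])
first_pass = list(sampler)
second_pass = list(sampler)  # not exhausted: a new iterator is created

one_iterator = sampler.generate_batches()
consumed = list(one_iterator)
leftover = list(one_iterator)  # the same iterator is now exhausted
```

Looping over the sampler twice yields the same batches both times, whereas reusing a single iterator yields nothing on the second pass.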
Subclassing IndexSampler
Derived samplers must implement several methods, most importantly log_weight() and generate(). See the module's documentation for more details.
Interrupting samplers
Calling interrupt() on a sampler will stop the batched generator after the current batch has been yielded.
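One way to obtain this behaviour is to check a flag only between batches. The sketch below is an assumption about how such a generator could be written; `InterruptibleBatcher` is hypothetical and draws from an endless stream of integers instead of real samples.

```python
from itertools import count, islice


class InterruptibleBatcher:
    """Hypothetical batcher over an endless stream of integers."""

    def __init__(self, batch_size: int = 1):
        self.batch_size = batch_size
        self._interrupted = False

    def interrupt(self) -> None:
        self._interrupted = True

    def generate_batches(self):
        self._interrupted = False
        stream = count()
        while not self._interrupted:
            # The flag is only checked between batches, so a batch is never cut short.
            yield list(islice(stream, self.batch_size))


batcher = InterruptibleBatcher(batch_size=2)
received = []
for batch in batcher.generate_batches():
    received.append(batch)
    if len(received) == 3:
        batcher.interrupt()
```

The loop ends after the third batch because the generator sees the flag when it is asked for a fourth one.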
PARAMETER | DESCRIPTION |
---|---|
batch_size | The number of samples to generate per batch. Batches are processed by EvaluationStrategy so that individual valuations in a batch are guaranteed to be received in the right sequence. |
skip_indices (property, writable)
Indices being skipped in the sampler. The exact behaviour is sampler-dependent, so setting this property is disabled by default.
__len__
__len__() -> int
Returns the length of the current sample generation in generate_batches.
RAISES | DESCRIPTION |
---|---|
`TypeError` | if the sampler is infinite or generate_batches has not been called yet. |
generate (abstractmethod)
Generates single samples.
IndexSampler.generate_batches() will batch these samples according to the batch size set upon construction.
PARAMETER | DESCRIPTION |
---|---|
indices | |
YIELDS | DESCRIPTION |
---|---|
SampleGenerator | A tuple (idx, subset) for each sample. |
generate_batches
Batches the samples and yields them.
interrupt
log_weight (abstractmethod)
Log probability of sampling a set S.
We assume that every sampler allows computing \(p(S)\) as a function of the size of the index set \(n\) and the size \(k\) of the subset being sampled.
For details on weighting, importance sampling and usage with semi-values, see [[semi-values-sampling]].
Log-space computation
Because the weight is a probability that can be arbitrarily small, we compute it in log-space for numerical stability.
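The need for log-space shows up already with modest set sizes. The sketch below computes log-probabilities for two illustrative sampling schemes over `m` available elements. Both functions are hypothetical and are not pyDVL's `log_weight` implementations; how `m` relates to the index set size \(n\) of a real sampler is left open.

```python
import math


def log_weight_uniform_powerset(m: int) -> float:
    """Log-probability of one subset drawn uniformly from the 2**m subsets of m elements."""
    return -m * math.log(2)


def log_weight_uniform_size(m: int, k: int) -> float:
    """Log-probability of a subset of size k when the size is drawn uniformly
    from 0..m and the subset uniformly among those of that size:
    p(S) = 1 / ((m + 1) * C(m, k))."""
    log_binom = math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
    return -math.log(m + 1) - log_binom


in_log_space = log_weight_uniform_powerset(2000)  # roughly -1386.29, still usable
direct = math.exp(in_log_space)  # underflows to 0.0 in double precision
```

Using `math.lgamma` for the binomial coefficient keeps the second function stable for large `m`, where the coefficient itself would be astronomically large.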
PARAMETER | DESCRIPTION |
---|---|
n | The size of the index set. |
subset_len | The size of the subset being sampled |
RETURNS | DESCRIPTION |
---|---|
float | The natural logarithm of the probability of sampling a set of the given size, when the index set has size |
make_strategy (abstractmethod)
make_strategy(
    utility: UtilityBase, log_coefficient: SemivalueCoefficient | None
) -> EvaluationStrategy
Returns the strategy for this sampler.
The evaluation strategy is used to process the samples generated by the sampler and compute value updates. It's the main loop of the workers.
PARAMETER | DESCRIPTION |
---|---|
utility | The utility to use for the evaluation strategy. |
log_coefficient | An additional coefficient to multiply marginals with. This depends on the valuation method, hence the delayed setup. If |
result_updater
result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]
Returns an object that updates a valuation result with a value update.
Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.
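For intuition about running moments, the sketch below uses Welford's online algorithm on plain floats. This is a substitute technique chosen for illustration: it does not work in log-space and is not pyDVL's `ResultUpdater`. `RunningMoments` is a hypothetical name.

```python
class RunningMoments:
    """Welford's online algorithm for the running mean and variance of a stream."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def __call__(self, update: float) -> None:
        self.count += 1
        delta = update - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (update - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; 0.0 until at least two updates have arrived."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


moments = RunningMoments()
for marginal in [1.0, 2.0, 3.0, 4.0]:
    moments(marginal)
```

An updater that receives log-space updates has to carry extra state to maintain the same two moments, which is the bookkeeping the paragraph above refers to.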
PARAMETER | DESCRIPTION |
---|---|
result | The result to update |
RETURNS | DESCRIPTION |
---|---|
ResultUpdater[ValueUpdateT] | A callable object that updates the result with a value update |
sample_limit (abstractmethod)
sample_limit(indices: IndexSetT) -> int | None
Number of samples that can be generated from the indices.
PARAMETER | DESCRIPTION |
---|---|
indices | The indices used in the sampler. |
RETURNS | DESCRIPTION |
---|---|
int | None | The maximum number of samples that will be generated, or |
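As an illustration of the two kinds of return value, the sketch below contrasts a finite and an unbounded case. Both functions are hypothetical, and reading `None` as "no limit" is an assumption based on the `int | None` return type, since the description above is cut off.

```python
from typing import Optional


def finite_sample_limit(n_indices: int) -> Optional[int]:
    """Hypothetical limit for a sampler that enumerates every pair (i, S):
    n choices for i times 2**(n - 1) subsets of the remaining indices."""
    if n_indices == 0:
        return 0
    return n_indices * 2 ** (n_indices - 1)


def unbounded_sample_limit(n_indices: int) -> Optional[int]:
    """Hypothetical limit for a sampler that keeps drawing at random.
    Assumption: None stands for 'no limit'."""
    return None


finite = finite_sample_limit(3)
unbounded = unbounded_sample_limit(3)
```

A caller can then branch on `None` to decide whether a progress bar or a stopping criterion based on sample counts makes sense.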