pydvl.valuation.samplers.msr
This module implements Maximum Sample Re-use (MSR) sampling for valuation, as described in (Wang and Jia, 2023) [1].
The idea behind MSR is to update all indices in the dataset with every evaluation of the utility function on a sample. Updates are divided into positive, if the index is in the sample, and negative, if it is not. The two running means are later combined into a final result.
Note that this requires defining a special evaluation strategy and result updater, as returned by the make_strategy and result_updater methods, respectively.
For more on the general architecture of samplers see pydvl.valuation.samplers.
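The update scheme described above can be illustrated with a minimal, standard-library sketch. It is not the module's implementation: the function name `msr_banzhaf`, the toy additive utility, and the choice to work with plain (not log-space) sums are all assumptions made for illustration.

```python
import random
from typing import Callable, FrozenSet, List


def msr_banzhaf(
    utility: Callable[[FrozenSet[int]], float],
    n: int,
    n_samples: int,
    seed: int = 0,
) -> List[float]:
    """Toy MSR estimator (hypothetical helper, not part of pyDVL)."""
    rng = random.Random(seed)
    pos_sum, neg_sum = [0.0] * n, [0.0] * n
    pos_cnt, neg_cnt = [0] * n, [0] * n

    for _ in range(n_samples):
        # Draw S uniformly from the power set: each index is included w.p. 1/2.
        sample = frozenset(i for i in range(n) if rng.random() < 0.5)
        u = utility(sample)  # the only utility evaluation for this sample
        for i in range(n):
            if i in sample:  # "positive" update: index is in the sample
                pos_sum[i] += u
                pos_cnt[i] += 1
            else:  # "negative" update: index is not in the sample
                neg_sum[i] += u
                neg_cnt[i] += 1

    values = []
    for i in range(n):
        if pos_cnt[i] == 0 or neg_cnt[i] == 0:
            # One of the two running means never received an update.
            values.append(float("nan"))
        else:
            # Combine the two running means into the final estimate.
            values.append(pos_sum[i] / pos_cnt[i] - neg_sum[i] / neg_cnt[i])
    return values


# Additive toy utility: the exact Banzhaf value of index i is weights[i].
weights = [1.0, 2.0, 3.0, 4.0]
estimates = msr_banzhaf(lambda s: sum(weights[i] for i in s), n=4, n_samples=20_000)
```

With an additive utility the estimates should approach the weights as the number of samples grows, which makes the sketch easy to sanity-check.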
References
[1] Wang, J.T. and Jia, R., 2023. Data Banzhaf: A Robust Data Valuation Framework for Machine Learning. In: Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, pp. 6388-6421.
MSRValueUpdate
dataclass
Bases: ValueUpdate
Update for Maximum Sample Re-use (MSR) valuation (in log space).
Attributes:
in_sample: Whether the index to be updated was in the sample.
Source code in src/pydvl/valuation/samplers/msr.py
MSRResultUpdater
MSRResultUpdater(result: ValuationResult)
Bases: ResultUpdater[MSRValueUpdate]
Update running means for MSR valuation (in log-space).
This class is used to update two running means for positive and negative updates separately. The two running means are later combined into a final result.
Since values computed with MSR are not a mean over marginals, both the variances of the marginals and the update counts are ill-defined. We use the following conventions:
-
The counts are defined as the minimum of the two counts. This definition enables us to ensure a minimal number of updates for both running means via stopping criteria and correctly detects that no actual update has taken place if one of the counts is zero.
-
We reverse engineer the variances so that they yield correct standard errors given our convention for the counts and the normal calculation of standard errors in the valuation result.
Note that we cannot use the normal addition or subtraction defined by the ValuationResult because it is weighted with counts. If we were to simply subtract the negative result from the positive we would get wrong variance estimates, misleading update counts and even wrong values if no further precaution is taken.
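The two conventions above can be sketched as follows. This is a hedged illustration, not the class's code: `RunningMean` and `combine` are hypothetical names, the sketch works with plain values rather than in log-space, it assumes the two running means are independent, and it assumes that the "normal calculation" of a standard error is `sqrt(variance / count)`.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningMean:
    """Hypothetical container for one running mean (not a pyDVL class)."""

    mean: float
    variance: float  # sample variance of the individual updates
    count: int


def combine(pos: RunningMean, neg: RunningMean) -> RunningMean:
    # Convention 1: the combined count is the minimum of the two counts,
    # so it is zero whenever one of the means has not been updated yet.
    count = min(pos.count, neg.count)
    if count == 0:
        # Assumption: report the value as undefined when nothing can be combined.
        return RunningMean(mean=float("nan"), variance=float("nan"), count=0)

    value = pos.mean - neg.mean
    # Squared standard error of a difference of two independent means.
    stderr_sq = pos.variance / pos.count + neg.variance / neg.count
    # Convention 2: pick the variance so that sqrt(variance / count)
    # reproduces this standard error for the combined count.
    variance = stderr_sq * count
    return RunningMean(mean=value, variance=variance, count=count)


combined = combine(RunningMean(2.0, 4.0, 100), RunningMean(0.5, 1.0, 50))
stderr = math.sqrt(combined.variance / combined.count)
```

A plain count-weighted subtraction would instead mix the two counts into the value itself, which is the failure mode the note above warns about.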
Source code in src/pydvl/valuation/samplers/msr.py
combine_results
combine_results() -> ValuationResult
Combine the positive and negative running means into a final result. Returns: The combined valuation result.
Todo: Verify that the two running means are statistically independent (which is assumed in the aggregation of variances).
Source code in src/pydvl/valuation/samplers/msr.py
MSRSampler
MSRSampler(batch_size: int = 1, seed: Seed | None = None)
Bases: StochasticSamplerMixin, IndexSampler[MSRValueUpdate]
Sampler for unweighted Maximum Sample Re-use (MSR) valuation.
The sampling is similar to that of a UniformSampler but without an outer index. However, the MSR sampler uses a special evaluation strategy and result updater, as returned by the make_strategy() and result_updater() methods, respectively.
Two running means are updated separately for positive and negative updates. The two running means are later combined into a final result.
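As a rough picture of what "uniform sampling without an outer index" means, the following standard-library sketch yields batches of subsets drawn uniformly from the power set of the indices. The generator name `uniform_subset_batches` is hypothetical and the code does not use or reproduce the sampler's actual interface; only the roles of `batch_size` and `seed` mirror the parameters documented here.

```python
import random
from itertools import islice
from typing import FrozenSet, Iterator, List, Optional, Sequence


def uniform_subset_batches(
    indices: Sequence[int],
    batch_size: int = 1,
    seed: Optional[int] = None,
) -> Iterator[List[FrozenSet[int]]]:
    """Hypothetical sketch: endless batches of uniformly drawn subsets."""
    rng = random.Random(seed)
    while True:
        # No outer index: each sample is just a subset S of all indices,
        # with every index included independently with probability 1/2.
        yield [
            frozenset(i for i in indices if rng.random() < 0.5)
            for _ in range(batch_size)
        ]


# Take two batches of three samples each from the infinite generator.
batches = list(islice(uniform_subset_batches(range(5), batch_size=3, seed=42), 2))
```

Because no index is singled out when a subset is drawn, the same subset can later be reused to update every index.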
PARAMETER | DESCRIPTION |
---|---|
batch_size | Number of samples to generate in each batch. TYPE: int |
seed | Seed for the random number generator. TYPE: Seed \| None |
Source code in src/pydvl/valuation/samplers/msr.py
skip_indices
property
writable
Indices being skipped in the sampler. The exact behaviour will be sampler-dependent, so that setting this property is disabled by default.
interrupt
__len__
__len__() -> int
Returns the length of the current sample generation in generate_batches.
RAISES | DESCRIPTION |
---|---|
`TypeError` | if the sampler is infinite or generate_batches has not been called yet. |
Source code in src/pydvl/valuation/samplers/base.py
generate_batches
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/base.py
log_weight
Probability of sampling a set of size k.
In the MSR scheme, the sampling is done from the full power set \(2^N\) (each set \(S \subseteq N\) with probability \(1 / 2^n\)), and then for each data point \(i\) one partitions the set of samples \(\mathcal{S}\) into:
* \(\mathcal{S}_{\ni i} = \{S \in \mathcal{S}: i \in S\}\), and
* \(\mathcal{S}_{\not\ni i} = \{S \in \mathcal{S}: i \notin S\}\).
When we condition on the event \(i \in S\), the remaining part \(S_{- i}\) is uniformly distributed over \(2^{N_{- i}}\). In other words, the act of partitioning recovers the uniform distribution on \(2^{N_{- i}}\) "for free" because
\[P(S_{- i} = T \mid i \in S) = \frac{1}{2^{n - 1}},\]
for each \(T \subseteq N_{- i}\).
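The conditioning argument can be checked by brute force on a small index set. The sketch below is only an illustration (the helper `conditional_distribution` is hypothetical): it enumerates the power set of \(n = 4\) indices, conditions on \(i \in S\), and computes the distribution of \(S_{- i}\) with exact fractions. The last line takes the logarithm of that conditional probability, \(-(n - 1) \log 2\); whether this is exactly the quantity returned by log_weight is an assumption not confirmed by the text here.

```python
import math
from fractions import Fraction
from itertools import combinations
from typing import Dict, FrozenSet


def conditional_distribution(n: int, i: int) -> Dict[FrozenSet[int], Fraction]:
    """P(S_{-i} = T | i in S) for S uniform on the power set of range(n)."""
    power_set = [
        frozenset(c) for k in range(n + 1) for c in combinations(range(n), k)
    ]
    p_each = Fraction(1, len(power_set))  # 1 / 2^n for every subset S
    containing_i = [s for s in power_set if i in s]
    p_i_in_s = p_each * len(containing_i)  # P(i in S) = 1/2
    # Bayes: P(S_{-i} = T | i in S) = P(S = T + {i}) / P(i in S)
    return {s - {i}: p_each / p_i_in_s for s in containing_i}


dist = conditional_distribution(n=4, i=2)
# Logarithm of the (constant) conditional probability 1 / 2^(n - 1).
log_prob = math.log(float(next(iter(dist.values()))))
```

Every subset \(T\) of the remaining three indices appears once, each with probability \(1/8\), independently of its size.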
PARAMETER | DESCRIPTION |
---|---|
n | Size of the index set. |
subset_len | Size of the subset. |

RETURNS | DESCRIPTION |
---|---|
float | The logarithm of the probability of having sampled a set of the given size. |
Source code in src/pydvl/valuation/samplers/msr.py
make_strategy
make_strategy(
utility: UtilityBase, coefficient: Callable[[int, int], float] | None = None
) -> MSREvaluationStrategy
Returns the strategy for this sampler.
PARAMETER | DESCRIPTION |
---|---|
utility | Utility function to evaluate. TYPE: UtilityBase |
coefficient | Coefficient function for the utility function. |
Source code in src/pydvl/valuation/samplers/msr.py
result_updater
result_updater(result: ValuationResult) -> ResultUpdater
Returns a callable that updates a valuation result with an MSR value update.
MSR updates two running means for positive and negative updates separately. The two running means are later combined into a final result.
PARAMETER | DESCRIPTION |
---|---|
result | The valuation result to update with each call of the returned callable. TYPE: ValuationResult |

Returns: A callable object that updates the valuation result with every MSRValueUpdate.
Source code in src/pydvl/valuation/samplers/msr.py
MSREvaluationStrategy
MSREvaluationStrategy(
sampler: SamplerT,
utility: UtilityBase,
log_coefficient: Callable[[int, int], float] | None = None,
)
Bases: EvaluationStrategy[MSRSampler, MSRValueUpdate]
Evaluation strategy for Maximum Sample Re-use (MSR) valuation in log space.
The MSR evaluation strategy makes one utility evaluation per sample but generates n_indices many updates from it. The updates will be used to update two running means that will later be combined into a final value. We use the ValueUpdate.in_sample field to inform MSRResultUpdater of which of the two running means must be updated.
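The fan-out from one utility evaluation to n_indices updates can be sketched as below. This is a simplified illustration under stated assumptions: `ToyMSRUpdate` and `process_sample` are hypothetical names, and the update carries the plain utility value instead of a log-space representation.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class ToyMSRUpdate:
    """Hypothetical stand-in for an MSR value update (plain value, no log)."""

    idx: int
    update: float
    in_sample: bool  # tells the updater which running mean to touch


def process_sample(
    sample: FrozenSet[int],
    utility: Callable[[FrozenSet[int]], float],
    n_indices: int,
) -> List[ToyMSRUpdate]:
    u = utility(sample)  # exactly one utility evaluation per sample
    # One update per index: positive if the index is in the sample,
    # negative otherwise.
    return [ToyMSRUpdate(i, u, i in sample) for i in range(n_indices)]


calls = []


def counting_utility(s: FrozenSet[int]) -> float:
    calls.append(s)  # record each evaluation to show there is only one
    return float(len(s))


updates = process_sample(frozenset({0, 2}), counting_utility, n_indices=4)
```

Here a single call to the utility yields four updates, two flagged as in-sample and two as out-of-sample.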