
pydvl.valuation.samplers.base

Base classes for samplers and evaluation strategies.

See pydvl.valuation.samplers for details.

ResultUpdater

ResultUpdater(result: ValuationResult)

Bases: Protocol[ValueUpdateT]

Protocol for result updaters.

A result updater is a strategy to update a valuation result with a value update.

Source code in src/pydvl/valuation/samplers/base.py
def __init__(self, result: ValuationResult): ...

IndexSampler

IndexSampler(batch_size: int = 1)

Bases: ABC, Generic[ValueUpdateT]

Samplers are custom iterables over batches of subsets of indices.

Calling generate_batches(indices) on a sampler returns a generator over batches of Samples. A Sample is a tuple of the form \((i, S)\), where \(i\) is an index of interest, and \(S \subset I \setminus \{i\}\) is a subset of the complement of \(i\) in \(I\).

Note

Samplers are not iterators themselves: each call to generate_batches(data), e.g. in a new for loop, creates a new iterator.

Derived samplers must implement log_weight() and generate(). See the module's documentation for more on these.

Interrupting samplers

Calling interrupt() on a sampler will stop the batched generator after the current batch has been yielded.

PARAMETER DESCRIPTION
batch_size

The number of samples to generate per batch. Batches are processed by EvaluationStrategy so that the individual valuations in a batch are guaranteed to be received in the right sequence.

TYPE: int DEFAULT: 1

Example
>>> from pydvl.valuation.samplers import DeterministicUniformSampler
>>> import numpy as np
>>> sampler = DeterministicUniformSampler()
>>> for idx, s in sampler.generate_batches(np.arange(2)):
...     print(s, end="")
[][2,][][1,]
Source code in src/pydvl/valuation/samplers/base.py
def __init__(self, batch_size: int = 1):
    """
    Args:
        batch_size: The number of samples to generate per batch. Batches are
            processed by the
            [EvaluationStrategy][pydvl.valuation.samplers.base.EvaluationStrategy]
    """
    self._batch_size = batch_size
    self._n_samples = 0
    self._interrupted = False
    self._skip_indices = np.empty(0, dtype=bool)
    self._len: int | None = None

skip_indices property writable

skip_indices: IndexSetT

Indices being skipped in the sampler. The exact behaviour is sampler-dependent; setting this property is therefore disabled by default.

interrupt

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py
def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True
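
For example, a caller can stop consumption of samples from outside the generator. A minimal sketch, using the DeterministicUniformSampler from the example above (the cutoff of four batches is arbitrary and only for illustration):

import numpy as np
from pydvl.valuation.samplers import DeterministicUniformSampler

sampler = DeterministicUniformSampler()
batches = []
for n_batches, batch in enumerate(sampler.generate_batches(np.arange(4)), start=1):
    batches.append(batch)
    if n_batches >= 4:  # arbitrary stopping condition, for illustration only
        # The generator stops after the batch currently being yielded.
        sampler.interrupt()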

__len__

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES DESCRIPTION
`TypeError`

if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py
def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

generate_batches

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py
def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break
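
As a rough illustration of how batching groups samples, assuming that DeterministicUniformSampler forwards batch_size to IndexSampler and that Sample exposes idx and subset fields:

import numpy as np
from pydvl.valuation.samplers import DeterministicUniformSampler

# Assumption: the batch_size keyword is forwarded to IndexSampler.__init__.
sampler = DeterministicUniformSampler(batch_size=2)
for batch in sampler.generate_batches(np.arange(2)):
    # Each batch is a sequence of at most two Samples.
    for sample in batch:
        print(sample.idx, sample.subset)  # assumed Sample fields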

sample_limit abstractmethod

sample_limit(indices: IndexSetT) -> int | None

Number of samples that can be generated from the indices.

PARAMETER DESCRIPTION
indices

The indices used in the sampler.

TYPE: IndexSetT

RETURNS DESCRIPTION
int | None

The maximum number of samples that will be generated, or None if the number of samples is infinite. This will depend, among other things, on the type of IndexIteration.

Source code in src/pydvl/valuation/samplers/base.py
@abstractmethod
def sample_limit(self, indices: IndexSetT) -> int | None:
    """Number of samples that can be generated from the indices.

    Args:
        indices: The indices used in the sampler.

    Returns:
        The maximum number of samples that will be generated, or  `None` if the
            number of samples is infinite. This will depend, among other things,
            on the type of [IndexIteration][pydvl.valuation.samplers.IndexIteration].
    """
    ...

generate abstractmethod

generate(indices: IndexSetT) -> SampleGenerator

Generates single samples.

IndexSampler.generate_batches() will batch these samples according to the batch size set upon construction.

PARAMETER DESCRIPTION
indices

TYPE: IndexSetT

YIELDS DESCRIPTION
SampleGenerator

A tuple (idx, subset) for each sample.

Source code in src/pydvl/valuation/samplers/base.py
@abstractmethod
def generate(self, indices: IndexSetT) -> SampleGenerator:
    """Generates single samples.

    `IndexSampler.generate_batches()` will batch these samples according to the
    batch size set upon construction.

    Args:
        indices:

    Yields:
        A tuple (idx, subset) for each sample.
    """
    ...

log_weight abstractmethod

log_weight(n: int, subset_len: int) -> float

Factor by which to multiply Monte Carlo samples, so that the mean converges to the desired expression.

Log-space computation

Because the weight is a probability that can be arbitrarily small, we compute it in log-space for numerical stability.

By the Law of Large Numbers, the sample mean of \(f(S_j)\) converges to the expectation under the distribution from which \(S_j\) is sampled.

\[
\begin{eqnarray}
    \frac{1}{m} \sum_{j = 1}^m f (S_j) w (S_j) & \longrightarrow &
        \underset{S \sim \mathcal{D}_{- i}}{\mathbb{E}} [f (S) w (S)] \\
    & & = \sum_{S \subseteq N_{- i}} f (S) w (S)
        \mathbb{P}_{\mathcal{D}_{- i}} (S).
\end{eqnarray}
\]

We add the factor \(w(S_j)\) in order to have this expectation coincide with the desired expression, by cancelling out \(\mathbb{P} (S)\).

PARAMETER DESCRIPTION
n

The size of the index set. Note that the actual size of the set being sampled will often be n-1, as one index might be removed from the set. See IndexIteration for more.

TYPE: int

subset_len

The size of the subset being sampled.

TYPE: int

RETURNS DESCRIPTION
float

The natural logarithm of the probability of sampling a set of the given size, when the index set has size n, under the IndexIteration given upon construction.

Source code in src/pydvl/valuation/samplers/base.py
@abstractmethod
def log_weight(self, n: int, subset_len: int) -> float:
    r"""Factor by which to multiply Monte Carlo samples, so that the
    mean converges to the desired expression.

    !!! Info "Log-space computation"
        Because the weight is a probability that can be arbitrarily small, we
        compute it in log-space for numerical stability.

    By the Law of Large Numbers, the sample mean of $f(S_j)$ converges to the
    expectation under the distribution from which $S_j$ is sampled.

    $$
    \begin{eqnarray}
        \frac{1}{m} \sum_{j = 1}^m f (S_j) w (S_j) & \longrightarrow &
            \underset{S \sim \mathcal{D}_{- i}}{\mathbb{E}} [f (S) w (S)] \\
        &  & = \sum_{S \subseteq N_{- i}} f (S) w (S)
            \mathbb{P}_{\mathcal{D}_{- i}} (S)
    \end{eqnarray}.
    $$

    We add the factor $w(S_j)$ in order to have this expectation coincide with the
    desired expression, by cancelling out $\mathbb{P} (S)$.

    Args:
        n: The size of the index set. Note that the actual size of the set being
            sampled will often be n-1, as one index might be removed from the set.
            See [IndexIteration][pydvl.valuation.samplers.IndexIteration] for more.
        subset_len: The size of the subset being sampled

    Returns:
        The natural logarithm of the probability of sampling a set of the given
            size, when the index set has size `n`, under the
            [IndexIteration][pydvl.valuation.samplers.IndexIteration] given upon
            construction.
    """
    ...
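
For instance, if a sampler drew \(S\) uniformly from the powerset of \(N_{-i}\), every particular subset would have probability \(2^{-(n-1)}\), so the log-weight would be constant in subset_len. A sketch of such an implementation (illustrative only; pyDVL's own samplers also account for the IndexIteration in use):

import math

def log_weight(self, n: int, subset_len: int) -> float:
    # Uniform sampling over the powerset of the complement of {i}: each of the
    # 2^(n-1) subsets is equally likely, regardless of its size.
    return -(n - 1) * math.log(2) if n > 0 else 0.0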

make_strategy abstractmethod

make_strategy(
    utility: UtilityBase,
    log_coefficient: Callable[[int, int], float] | None = None,
) -> EvaluationStrategy

Returns the strategy for this sampler.

Source code in src/pydvl/valuation/samplers/base.py
@abstractmethod
def make_strategy(
    self,
    utility: UtilityBase,
    log_coefficient: Callable[[int, int], float] | None = None,
) -> EvaluationStrategy:
    """Returns the strategy for this sampler."""
    ...  # return SomeLogEvaluationStrategy(self)

result_updater

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns a callable that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER DESCRIPTION
result

The result to update

TYPE: ValuationResult

RETURNS DESCRIPTION
ResultUpdater[ValueUpdateT]

A callable object that updates the result with a value update.

Source code in src/pydvl/valuation/samplers/base.py
def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns a callable that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

LogResultUpdater

LogResultUpdater(result: ValuationResult)

Bases: ResultUpdater[ValueUpdateT]

Updates a valuation result with a value update in log-space.

Source code in src/pydvl/valuation/samplers/base.py
def __init__(self, result: ValuationResult):
    self.result = result
    self._log_sum_positive = np.full_like(result.values, -np.inf)
    self._log_sum_negative = np.full_like(result.values, -np.inf)
    self._log_sum2 = np.full_like(result.values, -np.inf)
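
The attributes above suggest a signed log-space accumulation: positive and negative updates are kept as separate running log-sums (initialised to \(-\infty\), i.e. the log of zero) and only combined when the running mean is needed. A sketch of the technique for a single scalar (illustrative only, not pyDVL's exact implementation):

import numpy as np

def signed_log_update(log_pos: float, log_neg: float, value: float) -> tuple[float, float]:
    # Fold `value` into separate log-sums of its positive and negative parts.
    if value > 0:
        log_pos = float(np.logaddexp(log_pos, np.log(value)))
    elif value < 0:
        log_neg = float(np.logaddexp(log_neg, np.log(-value)))
    return log_pos, log_neg

# Running mean after m updates: (exp(log_pos) - exp(log_neg)) / m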

EvaluationStrategy

EvaluationStrategy(
    sampler: SamplerT,
    utility: UtilityBase,
    log_coefficient: Callable[[int, int], float] | None = None,
)

Bases: ABC, Generic[SamplerT, ValueUpdateT]

An evaluation strategy for samplers.

Implements the processing strategy for batches returned by an IndexSampler.

Different sampling schemes require different strategies for the evaluation of the utilities. For instance permutations generated by PermutationSampler must be evaluated in sequence to save computation, see PermutationEvaluationStrategy.

This class defines the common interface.

Usage pattern in valuation methods
    def fit(self, data: Dataset):
        self.utility = self.utility.with_dataset(data)
        strategy = self.sampler.make_strategy(self.utility, self.log_coefficient)
        delayed_batches = Parallel()(
            delayed(strategy.process)(batch=list(batch), is_interrupted=flag)
            for batch in self.sampler.generate_batches(data.indices)
        )
        for batch in delayed_batches:
            for evaluation in batch:
                self.result.update(evaluation.idx, evaluation.update)
            if self.is_done(self.result):
                flag.set()
                break
PARAMETER DESCRIPTION
sampler

Required to set up some strategies.

TYPE: SamplerT

utility

Required to set up some strategies and to process the samples. Since this contains the training data, it is expensive to pickle and send to workers.

TYPE: UtilityBase

log_coefficient

An additional coefficient to multiply marginals with. This depends on the valuation method, hence the delayed setup.

TYPE: Callable[[int, int], float] | None DEFAULT: None

Source code in src/pydvl/valuation/samplers/base.py
def __init__(
    self,
    sampler: SamplerT,
    utility: UtilityBase,
    log_coefficient: Callable[[int, int], float] | None = None,
):
    self.utility = utility
    # Used by the decorator suppress_warnings:
    self.show_warnings = getattr(utility, "show_warnings", False)
    self.n_indices = (
        len(utility.training_data) if utility.training_data is not None else 0
    )

    if log_coefficient is not None:

        def correction_fun(n: int, subset_len: int) -> float:
            return log_coefficient(n, subset_len) - sampler.log_weight(
                n, subset_len
            )

        self.log_correction = correction_fun
    else:
        self.log_correction = lambda n, subset_len: 0.0
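
Since log_correction is the difference between the method's log-coefficient and the sampler's log-weight, exponentiating it and multiplying a sampled utility by the result cancels the sampling probability and installs the desired coefficient in the Monte Carlo mean. A hypothetical helper to make this concrete (the names u and subset_len are placeholders, not part of the pyDVL API):

import math

def corrected_update(strategy: EvaluationStrategy, u: float, n: int, subset_len: int) -> float:
    # Re-weight one utility evaluation before emitting it as a value update.
    return u * math.exp(strategy.log_correction(n, subset_len))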

process abstractmethod

process(
    batch: SampleBatch, is_interrupted: NullaryPredicate
) -> list[ValueUpdateT]

Processes batches of samples using the evaluator, with the strategy required for the sampler.

Warning

This method is intended to be used by the evaluator to process the samples in one batch, which means it might be sent to another process. Be careful with the objects you use here, as they will be pickled and sent over the wire.

PARAMETER DESCRIPTION
batch

A batch of samples to process.

TYPE: SampleBatch

is_interrupted

A predicate that returns True if the processing should be interrupted.

TYPE: NullaryPredicate

YIELDS DESCRIPTION
list[ValueUpdateT]

Updates to values as tuples (idx, update)

Source code in src/pydvl/valuation/samplers/base.py
@abstractmethod
def process(
    self, batch: SampleBatch, is_interrupted: NullaryPredicate
) -> list[ValueUpdateT]:
    """Processes batches of samples using the evaluator, with the strategy
    required for the sampler.

    !!! Warning
        This method is intended to be used by the evaluator to process the samples
        in one batch, which means it might be sent to another process. Be careful
        with the objects you use here, as they will be pickled and sent over the
        wire.

    Args:
        batch: A batch of samples to process.
        is_interrupted: A predicate that returns True if the processing should be
            interrupted.

    Yields:
        Updates to values as tuples (idx, update)
    """
    ...
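
To make the contract concrete, a concrete strategy's process() might look roughly like the sketch below. It assumes the utility can be called directly on a Sample and that updates are plain (idx, update) tuples; pyDVL's actual strategies (e.g. for marginal or permutation evaluation) are more involved.

import math

class SketchEvaluationStrategy(EvaluationStrategy):
    # Illustrative only: not part of pyDVL.
    def process(self, batch, is_interrupted):
        updates = []
        for sample in batch:
            u = self.utility(sample)  # assumed: utilities accept a Sample
            # Cancel the sampling probability and apply the method's coefficient.
            w = math.exp(self.log_correction(self.n_indices, len(sample.subset)))
            updates.append((sample.idx, u * w))
            if is_interrupted():  # allow early exit between samples
                break
        return updates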