
pydvl.valuation.samplers

Samplers iterate over subsets of indices.

The classes in this module are used to iterate over indices, and subsets of their complement in the whole set, as required for the computation of marginal utilities for semi-values and other marginal-utility based methods.

These samplers are used by all game-theoretic valuation methods, as well as by LOO and any other marginal-contribution-based method that iterates over subsets of the training data. Because these algorithms are intertwined with the sampling, there are several strategies to choose from when constructing a sampler.

Index iteration

Subclasses of IndexSampler are iterables over batches of Samples. These are typically of the form \((i, S)\), where \(i\) is an index of interest, and \(S \subset I \setminus \{i\}\) is a subset of the complement of \(i\).

This type of iteration over indices \(i\) and their complements is configured upon construction of the sampler with the classes SequentialIndexIteration, RandomIndexIteration, or their finite counterparts, when each index must be visited just once (albeit possibly generating many samples per index).

However, some valuation schemes require iteration over subsets of the whole set (as opposed to iterating over complements of individual indices). For this purpose, one can use NoIndexIteration or its finite counterpart.
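The two iteration modes described above can be illustrated with a small standalone sketch. It does not use pyDVL: the `Sample` alias and the functions `powerset`, `iterate_with_index` and `iterate_without_index` are hypothetical names chosen for this example, and the enumeration is exhaustive rather than random.

```python
from itertools import chain, combinations
from typing import Iterator, Optional, Sequence, Tuple

# Hypothetical stand-in for a sample (i, S); not pyDVL's Sample class.
Sample = Tuple[Optional[int], Tuple[int, ...]]


def powerset(items: Sequence[int]) -> Iterator[Tuple[int, ...]]:
    """All subsets of items, from the empty set up to the full set."""
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )


def iterate_with_index(indices: Sequence[int]) -> Iterator[Sample]:
    """Finite sequential iteration: pair each index with subsets of its complement."""
    for i in indices:
        complement = [j for j in indices if j != i]
        for subset in powerset(complement):
            yield i, subset


def iterate_without_index(indices: Sequence[int]) -> Iterator[Sample]:
    """No index of interest: subsets are taken from the whole index set."""
    for subset in powerset(indices):
        yield None, subset


samples = list(iterate_with_index([0, 1, 2]))
whole_set_samples = list(iterate_without_index([0, 1, 2]))
```

With three indices, the first mode produces \(3 \cdot 2^2\) samples and the second \(2^3\).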

Sampler evaluation

Different samplers imply different strategies for processing samples, i.e. for evaluating the utility of the subsets. For instance, permutation samplers generate increasing subsets of permutations, allowing semi-value calculations to benefit from an incremental evaluation of the utility that reuses the previous computation.
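As an illustration of the incremental evaluation mentioned for permutations, here is a minimal sketch that is independent of pyDVL. The function `permutation_marginals` and the toy utility are assumptions made for this example. Each index costs one new utility evaluation because the utility of the preceding prefix is reused.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def permutation_marginals(
    permutation: Sequence[int],
    utility: Callable[[FrozenSet[int]], float],
) -> List[Tuple[int, float]]:
    """Marginal contribution of each index to the indices preceding it."""
    marginals = []
    prefix: FrozenSet[int] = frozenset()
    previous = utility(prefix)  # evaluated once, then reused
    for idx in permutation:
        prefix = prefix | {idx}
        current = utility(prefix)  # one new evaluation per index
        marginals.append((idx, current - previous))
        previous = current
    return marginals


call_log: List[FrozenSet[int]] = []


def toy_utility(subset: FrozenSet[int]) -> float:
    """Hypothetical utility: squared subset size. Records every call."""
    call_log.append(subset)
    return float(len(subset) ** 2)


result = permutation_marginals([2, 0, 1], toy_utility)
```

For a permutation of length \(N\) this uses \(N + 1\) utility evaluations, instead of the \(2N\) needed when each marginal is computed from scratch.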

This behaviour is communicated to the valuation method through the EvaluationStrategy class. The basic usage pattern inside a valuation method is the following (see below for info on the updater):

    def fit(self, data: Dataset):

        ...

        strategy = self.sampler.make_strategy(self.utility, self.log_coefficient)
        processor = delayed(strategy.process)
        updater = self.sampler.result_updater(self.result)

        delayed_batches = Parallel()(
            processor(batch=list(batch), is_interrupted=flag) for batch in self.sampler
        )
        for batch in delayed_batches:
            for evaluation in batch:
                self.result = updater(evaluation)
            ...

Updating the result

Yet another behaviour that depends on the sampling scheme is the way that results are updated. For instance, the MSRSampler requires tracking updates to two sequences of samples which are then merged in a specific way. This strategy is declared by the sampler through the factory method result_updater(), which returns a callable that updates the result with a single evaluation.
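The idea of a callable that folds one evaluation at a time into a result can be sketched as follows. The classes `ToyUpdate`, `ToyResult` and `RunningMeanUpdater` are hypothetical and much simpler than pyDVL's ValuationResult and its updaters. They only show the shape of the pattern: construct the updater with a result, then call it once per evaluation.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyUpdate:
    """Hypothetical value update for a single index."""
    idx: int
    value: float


@dataclass
class ToyResult:
    """Hypothetical result holding a running mean and a count per index."""
    means: Dict[int, float] = field(default_factory=dict)
    counts: Dict[int, int] = field(default_factory=dict)


class RunningMeanUpdater:
    """Callable that updates a ToyResult with one ToyUpdate at a time."""

    def __init__(self, result: ToyResult):
        self.result = result

    def __call__(self, update: ToyUpdate) -> ToyResult:
        n = self.result.counts.get(update.idx, 0) + 1
        mean = self.result.means.get(update.idx, 0.0)
        # Incremental mean: no need to store past updates
        self.result.means[update.idx] = mean + (update.value - mean) / n
        self.result.counts[update.idx] = n
        return self.result


updater = RunningMeanUpdater(ToyResult())
for evaluation in [ToyUpdate(0, 1.0), ToyUpdate(0, 3.0), ToyUpdate(1, 5.0)]:
    result = updater(evaluation)
```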

Creating custom samplers

To create a custom sampler, subclass either PowersetSampler or PermutationSamplerBase, or implement the IndexSampler interface directly.

There are three main methods to implement (and others that can be overridden):

  • generate(), which yields samples of the form \((i, S)\). These will be batched together by __iter__ for parallel processing. Note that, if the index set has size \(N\), for PermutationSampler, a batch size of \(B\) implies \(O(B*N)\) evaluations of the utility in one process, since single permutations are always processed in one go.
  • log_weight() to provide a factor by which to multiply Monte Carlo samples in stochastic methods, so that the mean converges to the desired expression. This will typically be the logarithm of the inverse probability of sampling a given subset.
  • make_strategy() to create an evaluation strategy that processes the samples. This is typically a subclass of EvaluationStrategy that computes utilities and weights them with coefficients and sampler weights. One can also use any of the predefined strategies, like the successive marginal evaluations of PowersetEvaluationStrategy or the successive evaluations of PermutationEvaluationStrategy.

Finally, if the sampler requires a dedicated result updater, you must override result_updater() to return a callable that updates a ValuationResult with one evaluation ValueUpdate. This is used e.g. for the MSRSampler which uses two running means for positive and negative updates.

Changed in version 0.10.0

All the samplers in this module have been changed to work with the new evaluation strategies.

References


  1. Mitchell, Rory, Joshua Cooper, Eibe Frank, and Geoffrey Holmes. Sampling Permutations for Shapley Value Estimation. Journal of Machine Learning Research 23, no. 43 (2022): 1–46. 

  2. Watson, Lauren, Zeno Kujawa, Rayna Andreeva, Hao-Tsung Yang, Tariq Elahi, and Rik Sarkar. Accelerated Shapley Value Approximation for Data Evaluation. arXiv, 9 November 2023. 

ResultUpdater

ResultUpdater(result: ValuationResult)

Bases: Protocol[ValueUpdateT]

Protocol for result updaters.

A result updater is a strategy to update a valuation result with a value update.

Source code in src/pydvl/valuation/samplers/base.py
def __init__(self, result: ValuationResult): ...

IndexSampler

IndexSampler(batch_size: int = 1)

Bases: ABC, Generic[ValueUpdateT]

Samplers are custom iterables over batches of subsets of indices.

Calling generate_batches(indices) on a sampler returns a generator over batches of Samples. A Sample is a tuple of the form \((i, S)\), where \(i\) is an index of interest, and \(S \subset I \setminus \{i\}\) is a subset of the complement of \(i\) in \(I\).

Note

Samplers are not iterators themselves, so each call to generate_batches(indices), e.g. in a new for loop, creates a new iterator.

Derived samplers must implement log_weight() and generate(). See the module's documentation for more on these.

Interrupting samplers

Calling interrupt() on a sampler will stop the batched generator after the current batch has been yielded.
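The following standalone sketch mimics this behaviour with the standard library only. `ToyBatcher` is a hypothetical class, not part of pyDVL, and it uses `itertools.islice` for chunking. The interruption flag is checked only after a batch has been yielded, so the batch in flight is always delivered.

```python
from itertools import islice
from typing import Iterable, Iterator, List


class ToyBatcher:
    """Hypothetical batched generator that can be interrupted between batches."""

    def __init__(self, batch_size: int = 1):
        self.batch_size = batch_size
        self._interrupted = False

    def interrupt(self) -> None:
        self._interrupted = True

    def generate_batches(self, samples: Iterable[int]) -> Iterator[List[int]]:
        self._interrupted = False
        iterator = iter(samples)
        while True:
            batch = list(islice(iterator, self.batch_size))
            if not batch:
                return
            yield batch
            if self._interrupted:  # checked only after the batch was handed out
                return


batcher = ToyBatcher(batch_size=2)
received = []
for batch in batcher.generate_batches(range(10)):
    received.append(batch)
    if len(received) == 2:
        batcher.interrupt()
```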

PARAMETER DESCRIPTION
batch_size

The number of samples to generate per batch. Batches are processed by EvaluationStrategy so that individual valuations in a batch are guaranteed to be received in the right sequence.

TYPE: int DEFAULT: 1

Example
>>> from pydvl.valuation.samplers import DeterministicUniformSampler
>>> import numpy as np
>>> sampler = DeterministicUniformSampler()
>>> for idx, s in sampler.generate_batches(np.arange(2)):
...     print(s, end="")
[][2,][][1,]
Source code in src/pydvl/valuation/samplers/base.py
def __init__(self, batch_size: int = 1):
    """
    Args:
        batch_size: The number of samples to generate per batch. Batches are
            processed by the
            [EvaluationStrategy][pydvl.valuation.samplers.base.EvaluationStrategy]
    """
    self._batch_size = batch_size
    self._n_samples = 0
    self._interrupted = False
    self._skip_indices = np.empty(0, dtype=bool)
    self._len: int | None = None

skip_indices property writable

skip_indices: IndexSetT

Indices being skipped in the sampler. The exact behaviour will be sampler-dependent, so that setting this property is disabled by default.

interrupt

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py
def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

__len__

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES DESCRIPTION
`TypeError`

if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py
def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

generate_batches

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py
def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break

sample_limit abstractmethod

sample_limit(indices: IndexSetT) -> int | None

Number of samples that can be generated from the indices.

PARAMETER DESCRIPTION
indices

The indices used in the sampler.

TYPE: IndexSetT

RETURNS DESCRIPTION
int | None

The maximum number of samples that will be generated, or None if the number of samples is infinite. This will depend, among other things, on the type of IndexIteration.

Source code in src/pydvl/valuation/samplers/base.py
@abstractmethod
def sample_limit(self, indices: IndexSetT) -> int | None:
    """Number of samples that can be generated from the indices.

    Args:
        indices: The indices used in the sampler.

    Returns:
        The maximum number of samples that will be generated, or  `None` if the
            number of samples is infinite. This will depend, among other things,
            on the type of [IndexIteration][pydvl.valuation.samplers.IndexIteration].
    """
    ...

generate abstractmethod

generate(indices: IndexSetT) -> SampleGenerator

Generates single samples.

IndexSampler.generate_batches() will batch these samples according to the batch size set upon construction.

PARAMETER DESCRIPTION
indices

TYPE: IndexSetT

YIELDS DESCRIPTION
SampleGenerator

A tuple (idx, subset) for each sample.

Source code in src/pydvl/valuation/samplers/base.py
@abstractmethod
def generate(self, indices: IndexSetT) -> SampleGenerator:
    """Generates single samples.

    `IndexSampler.generate_batches()` will batch these samples according to the
    batch size set upon construction.

    Args:
        indices:

    Yields:
        A tuple (idx, subset) for each sample.
    """
    ...

log_weight abstractmethod

log_weight(n: int, subset_len: int) -> float

Factor by which to multiply Monte Carlo samples, so that the mean converges to the desired expression.

Log-space computation

Because the weight is a probability that can be arbitrarily small, we compute it in log-space for numerical stability.

By the Law of Large Numbers, the sample mean of \(f(S_j)\) converges to the expectation under the distribution from which \(S_j\) is sampled.

\[ \begin{eqnarray} \frac{1}{m} \sum_{j = 1}^m f (S_j) w (S_j) & \longrightarrow & \underset{S \sim \mathcal{D}_{- i}}{\mathbb{E}} [f (S) w (S)] \\ & & = \sum_{S \subseteq N_{- i}} f (S) w (S) \mathbb{P}_{\mathcal{D}_{- i}} (S) \end{eqnarray}. \]

We add the factor \(w(S_j)\) in order to have this expectation coincide with the desired expression, by cancelling out \(\mathbb{P} (S)\).
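To make the cancellation concrete, the sketch below computes the expectation exactly by enumerating all subsets, instead of sampling. It assumes uniform sampling from the powerset of the \(n - 1\) remaining indices and uses the Shapley coefficient as the desired expression. The function `f` is a hypothetical marginal utility, and the helper names are chosen for this example only. The weight is applied in log-space as the difference between the log-coefficient and the log-probability, mirroring the correction computed in EvaluationStrategy.

```python
import math
from itertools import combinations

n = 4  # size of the full index set; subsets come from the other n - 1 indices
others = range(n - 1)


def f(subset) -> float:
    """Hypothetical marginal utility of the index of interest given `subset`."""
    return 1.0 / (1 + len(subset))


def log_coefficient(n: int, k: int) -> float:
    """Log of the Shapley coefficient 1 / (n * C(n-1, k))."""
    return -math.log(n) - math.log(math.comb(n - 1, k))


def log_probability(n: int, k: int) -> float:
    """Log-probability of one given subset under uniform powerset sampling."""
    return -(n - 1) * math.log(2)


subsets = [s for k in range(n) for s in combinations(others, k)]

# Desired expression: sum over S of c(|S|) f(S)
target = sum(math.exp(log_coefficient(n, len(s))) * f(s) for s in subsets)

# Expectation of f(S) w(S) under P, with log w = log c - log P
expectation = sum(
    f(s)
    * math.exp(log_coefficient(n, len(s)) - log_probability(n, len(s)))
    * math.exp(log_probability(n, len(s)))
    for s in subsets
)
```

Both quantities agree, which is what the Monte Carlo mean converges to when the subsets are sampled rather than enumerated.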

PARAMETER DESCRIPTION
n

The size of the index set. Note that the actual size of the set being sampled will often be n-1, as one index might be removed from the set. See IndexIteration for more.

TYPE: int

subset_len

The size of the subset being sampled

TYPE: int

RETURNS DESCRIPTION
float

The natural logarithm of the probability of sampling a set of the given size, when the index set has size n, under the IndexIteration given upon construction.

Source code in src/pydvl/valuation/samplers/base.py
@abstractmethod
def log_weight(self, n: int, subset_len: int) -> float:
    r"""Factor by which to multiply Monte Carlo samples, so that the
    mean converges to the desired expression.

    !!! Info "Log-space computation"
        Because the weight is a probability that can be arbitrarily small, we
        compute it in log-space for numerical stability.

    By the Law of Large Numbers, the sample mean of $f(S_j)$ converges to the
    expectation under the distribution from which $S_j$ is sampled.

    $$
    \begin{eqnarray}
        \frac{1}{m} \sum_{j = 1}^m f (S_j) w (S_j) & \longrightarrow &
            \underset{S \sim \mathcal{D}_{- i}}{\mathbb{E}} [f (S) w (S)] \\
        &  & = \sum_{S \subseteq N_{- i}} f (S) w (S)
            \mathbb{P}_{\mathcal{D}_{- i}} (S)
    \end{eqnarray}.
    $$

    We add the factor $w(S_j)$ in order to have this expectation coincide with the
    desired expression, by cancelling out $\mathbb{P} (S)$.

    Args:
        n: The size of the index set. Note that the actual size of the set being
            sampled will often be n-1, as one index might be removed from the set.
            See [IndexIteration][pydvl.valuation.samplers.IndexIteration] for more.
        subset_len: The size of the subset being sampled

    Returns:
        The natural logarithm of the probability of sampling a set of the given
            size, when the index set has size `n`, under the
            [IndexIteration][pydvl.valuation.samplers.IndexIteration] given upon
            construction.
    """
    ...

make_strategy abstractmethod

make_strategy(
    utility: UtilityBase,
    log_coefficient: Callable[[int, int], float] | None = None,
) -> EvaluationStrategy

Returns the strategy for this sampler.

Source code in src/pydvl/valuation/samplers/base.py
@abstractmethod
def make_strategy(
    self,
    utility: UtilityBase,
    log_coefficient: Callable[[int, int], float] | None = None,
) -> EvaluationStrategy:
    """Returns the strategy for this sampler."""
    ...  # return SomeLogEvaluationStrategy(self)

result_updater

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns a callable that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.
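One way to maintain such moments from log-space updates is sketched below. This is an illustration under assumptions, not the implementation of LogResultUpdater: it supposes that each update arrives as the logarithm of its magnitude together with its sign, and the class `LogRunningMoments` and helper `logaddexp` are hypothetical names. Positive and negative parts are accumulated separately because the logarithm of a negative sum is undefined.

```python
import math
from typing import Tuple


def logaddexp(a: float, b: float) -> float:
    """log(exp(a) + exp(b)) without overflow; -inf stands for log(0)."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


class LogRunningMoments:
    """Accumulates signed values given as (log|x|, sign)."""

    def __init__(self) -> None:
        self.count = 0
        self.log_sum_pos = -math.inf  # log of the sum of positive values
        self.log_sum_neg = -math.inf  # log of the sum of |negative values|
        self.log_sum_sq = -math.inf  # log of the sum of squares

    def update(self, log_abs: float, sign: int) -> None:
        self.count += 1
        if sign >= 0:
            self.log_sum_pos = logaddexp(self.log_sum_pos, log_abs)
        else:
            self.log_sum_neg = logaddexp(self.log_sum_neg, log_abs)
        self.log_sum_sq = logaddexp(self.log_sum_sq, 2 * log_abs)

    def moments(self) -> Tuple[float, float]:
        """Returns the mean and the population variance."""
        total = math.exp(self.log_sum_pos) - math.exp(self.log_sum_neg)
        mean = total / self.count
        variance = math.exp(self.log_sum_sq) / self.count - mean**2
        return mean, variance


acc = LogRunningMoments()
for value in [2.0, -1.0, 4.0]:
    acc.update(math.log(abs(value)), 1 if value >= 0 else -1)
mean, variance = acc.moments()
```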

PARAMETER DESCRIPTION
result

The result to update

TYPE: ValuationResult

Returns: A callable object that updates the result with a value update

Source code in src/pydvl/valuation/samplers/base.py
def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns a callable that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

EvaluationStrategy

EvaluationStrategy(
    sampler: SamplerT,
    utility: UtilityBase,
    log_coefficient: Callable[[int, int], float] | None = None,
)

Bases: ABC, Generic[SamplerT, ValueUpdateT]

An evaluation strategy for samplers.

Implements the processing strategy for batches returned by an IndexSampler.

Different sampling schemes require different strategies for the evaluation of the utilities. For instance permutations generated by PermutationSampler must be evaluated in sequence to save computation, see PermutationEvaluationStrategy.

This class defines the common interface.

Usage pattern in valuation methods
    def fit(self, data: Dataset):
        self.utility = self.utility.with_dataset(data)
        strategy = self.sampler.make_strategy(self.utility, self.log_coefficient)
        delayed_batches = Parallel()(
            delayed(strategy.process)(batch=list(batch), is_interrupted=flag)
            for batch in self.sampler
        )
        for batch in delayed_batches:
            for evaluation in batch:
                self.result.update(evaluation.idx, evaluation.update)
            if self.is_done(self.result):
                flag.set()
                break
PARAMETER DESCRIPTION
sampler

Required to set up some strategies.

TYPE: SamplerT

utility

Required to set up some strategies and to process the samples. Since this contains the training data, it is expensive to pickle and send to workers.

TYPE: UtilityBase

log_coefficient

An additional coefficient to multiply marginals with. This depends on the valuation method, hence the delayed setup.

TYPE: Callable[[int, int], float] | None DEFAULT: None

Source code in src/pydvl/valuation/samplers/base.py
def __init__(
    self,
    sampler: SamplerT,
    utility: UtilityBase,
    log_coefficient: Callable[[int, int], float] | None = None,
):
    self.utility = utility
    # Used by the decorator suppress_warnings:
    self.show_warnings = getattr(utility, "show_warnings", False)
    self.n_indices = (
        len(utility.training_data) if utility.training_data is not None else 0
    )

    if log_coefficient is not None:

        def correction_fun(n: int, subset_len: int) -> float:
            return log_coefficient(n, subset_len) - sampler.log_weight(
                n, subset_len
            )

        self.log_correction = correction_fun
    else:
        self.log_correction = lambda n, subset_len: 0.0

process abstractmethod

process(
    batch: SampleBatch, is_interrupted: NullaryPredicate
) -> list[ValueUpdateT]

Processes batches of samples using the evaluator, with the strategy required for the sampler.

Warning

This method is intended to be used by the evaluator to process the samples in one batch, which means it might be sent to another process. Be careful with the objects you use here, as they will be pickled and sent over the wire.

PARAMETER DESCRIPTION
batch

A batch of samples to process.

TYPE: SampleBatch

is_interrupted

A predicate that returns True if the processing should be interrupted.

TYPE: NullaryPredicate

RETURNS DESCRIPTION
list[ValueUpdateT]

Updates to values as tuples (idx, update)

Source code in src/pydvl/valuation/samplers/base.py
@abstractmethod
def process(
    self, batch: SampleBatch, is_interrupted: NullaryPredicate
) -> list[ValueUpdateT]:
    """Processes batches of samples using the evaluator, with the strategy
    required for the sampler.

    !!! Warning
        This method is intended to be used by the evaluator to process the samples
        in one batch, which means it might be sent to another process. Be careful
        with the objects you use here, as they will be pickled and sent over the
        wire.

    Args:
        batch: A batch of samples to process.
        is_interrupted: A predicate that returns True if the processing should be
            interrupted.

    Yields:
        Updates to values as tuples (idx, update)
    """
    ...

ClasswiseSampler

ClasswiseSampler(
    in_class: IndexSampler,
    out_of_class: PowersetSampler,
    *,
    min_elements_per_label: int = 1,
    batch_size: int = 1,
)

Bases: IndexSampler

A sampler that samples elements from a dataset in two steps, based on the labels.

It proceeds by sampling out-of-class indices (training points with a different label to the point of interest), and in-class indices (training points with the same label as the point of interest), in the complement.
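A rough standalone sketch of this two-step scheme follows. It is not the library's implementation: the function `classwise_samples`, the fixed inclusion probability of 0.5 and the nesting order of the loops are assumptions made for illustration, and `min_elements_per_label` is ignored.

```python
import random
from typing import Dict, Iterator, List, Sequence, Tuple


def classwise_samples(
    labels: Sequence[int], n_out: int, n_in: int, seed: int = 0
) -> Iterator[Tuple[int, int, List[int], List[int]]]:
    """Yields (label, idx, in_class_subset, out_of_class_subset)."""
    rng = random.Random(seed)
    by_label: Dict[int, List[int]] = {}
    for position, label in enumerate(labels):
        by_label.setdefault(label, []).append(position)

    for label, in_class in by_label.items():
        out_of_class = [j for j in range(len(labels)) if labels[j] != label]
        for _ in range(n_out):
            # Step 1: random subset of the points with a different label
            ooc_subset = [j for j in out_of_class if rng.random() < 0.5]
            for idx in in_class:
                complement = [j for j in in_class if j != idx]
                for _ in range(n_in):
                    # Step 2: random subset of same-label points, without idx
                    ic_subset = [j for j in complement if rng.random() < 0.5]
                    yield label, idx, ic_subset, ooc_subset


labels = [0, 0, 1, 1, 1]
samples = list(classwise_samples(labels, n_out=2, n_in=3))
```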

Used by the class-wise Shapley valuation method.

PARAMETER DESCRIPTION
in_class

Sampling scheme for elements of a given label.

TYPE: IndexSampler

out_of_class

Sampling scheme for elements of different labels, i.e., the complement set.

TYPE: PowersetSampler

min_elements_per_label

Minimum number of elements per label to sample from the complement set, i.e., out of class elements.

TYPE: int DEFAULT: 1

Source code in src/pydvl/valuation/samplers/classwise.py
def __init__(
    self,
    in_class: IndexSampler,
    out_of_class: PowersetSampler,
    *,
    min_elements_per_label: int = 1,
    batch_size: int = 1,
):
    super().__init__(batch_size=batch_size)
    self.in_class = in_class
    self.out_of_class = out_of_class
    self.min_elements_per_label = min_elements_per_label

skip_indices property writable

skip_indices: IndexSetT

Indices being skipped in the sampler. The exact behaviour will be sampler-dependent, so that setting this property is disabled by default.

__len__

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES DESCRIPTION
`TypeError`

if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py
def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

generate_batches

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py
def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break

result_updater

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns a callable that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER DESCRIPTION
result

The result to update

TYPE: ValuationResult

Returns: A callable object that updates the result with a value update

Source code in src/pydvl/valuation/samplers/base.py
def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns a callable that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

interrupt

interrupt() -> None

Interrupts the current sampler as well as the in-class and out-of-class samplers passed in at construction.

Source code in src/pydvl/valuation/samplers/classwise.py
def interrupt(self) -> None:
    """Interrupts the current sampler as well as the passed in samplers"""
    super().interrupt()
    self.in_class.interrupt()
    self.out_of_class.interrupt()

MSRSampler

MSRSampler(batch_size: int = 1, seed: Seed | None = None)

Bases: StochasticSamplerMixin, IndexSampler[MSRValueUpdate]

Sampler for unweighted Maximum Sample Re-use (MSR) valuation.

The sampling is similar to a UniformSampler but without an outer index. However, the MSR sampler uses a special evaluation strategy and result updater, as returned by the make_strategy() and result_updater() methods, respectively.

Two running means are updated separately for positive and negative updates. The two running means are later combined into a final result.
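The two running means can be sketched without pyDVL as follows. The function `msr_values` is a hypothetical name, and the subsets are passed in explicitly so that the example is deterministic. Here the full powerset is used, whereas the actual sampler draws subsets at random. Each utility evaluation is reused for every index, which is the point of the method.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List


def msr_values(
    n: int,
    utility: Callable[[FrozenSet[int]], float],
    subsets: Iterable[FrozenSet[int]],
) -> List[float]:
    """Per index: mean utility of subsets with i minus mean of subsets without i."""
    mean_with = [0.0] * n
    mean_without = [0.0] * n
    count_with = [0] * n
    count_without = [0] * n
    for subset in subsets:
        u = utility(subset)  # a single evaluation is reused for every index
        for i in range(n):
            if i in subset:
                count_with[i] += 1
                mean_with[i] += (u - mean_with[i]) / count_with[i]
            else:
                count_without[i] += 1
                mean_without[i] += (u - mean_without[i]) / count_without[i]
    return [mean_with[i] - mean_without[i] for i in range(n)]


weights = [1.0, 2.0, 4.0]


def additive_utility(subset: FrozenSet[int]) -> float:
    """Hypothetical utility: sum of fixed per-index weights."""
    return sum(weights[i] for i in subset)


all_subsets = [frozenset(c) for k in range(4) for c in combinations(range(3), k)]
values = msr_values(3, additive_utility, all_subsets)
```

For an additive utility evaluated on the full powerset, the difference of means recovers each index's weight.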

PARAMETER DESCRIPTION
batch_size

Number of samples to generate in each batch.

TYPE: int DEFAULT: 1

seed

Seed for the random number generator.

TYPE: Seed | None DEFAULT: None

Source code in src/pydvl/valuation/samplers/msr.py
def __init__(self, batch_size: int = 1, seed: Seed | None = None):
    super().__init__(batch_size=batch_size, seed=seed)

skip_indices property writable

skip_indices: IndexSetT

Indices being skipped in the sampler. The exact behaviour will be sampler-dependent, so that setting this property is disabled by default.

interrupt

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py
def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

__len__

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES DESCRIPTION
`TypeError`

if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py
def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

generate_batches

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py
def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break

log_weight

log_weight(n: int, subset_len: int) -> float

Log-probability of sampling a set of size subset_len.

In the MSR scheme, the sampling is done from the full power set \(2^N\) (each set \(S \subseteq N\) with probability \(1 / 2^n\)), and then for each data point \(i\) one partitions the sample into:

  • \(\mathcal{S}_{\ni i} = \{S \in \mathcal{S}: i \in S\},\) and
  • \(\mathcal{S}_{\not\ni i} = \{S \in \mathcal{S}: i \notin S\}.\)

When we condition on the event \(i \in S\), the remaining part \(S_{- i}\) is uniformly distributed over \(2^{N_{- i}}\). In other words, the act of partitioning recovers the uniform distribution on \(2^{N_{- i}}\) "for free" because

\[P (S_{- i} = T \mid i \in S) = \frac{1}{2^{n - 1}},\]

for each \(T \subseteq N_{- i}\).
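The identity can be checked by brute force for a small index set. The sketch below uses exact rational arithmetic from the standard library. The choices \(n = 4\) and \(i = 0\) are arbitrary.

```python
from fractions import Fraction
from itertools import combinations

n = 4
index_set = range(n)
i = 0

# Every subset of the full index set has probability 1 / 2^n
all_subsets = [
    frozenset(c) for k in range(n + 1) for c in combinations(index_set, k)
]
p_subset = Fraction(1, 2**n)

# Condition on the event that i belongs to the sampled subset
containing_i = [s for s in all_subsets if i in s]
p_i_in_s = p_subset * len(containing_i)

# Distribution of the remaining part S_{-i} given that i is in S
conditional = {s - {i}: p_subset / p_i_in_s for s in containing_i}
```

Every one of the \(2^{n-1}\) subsets of the complement receives the same conditional probability, \(1 / 2^{n-1}\).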

PARAMETER DESCRIPTION
n

Size of the index set.

TYPE: int

subset_len

Size of the subset.

TYPE: int

RETURNS DESCRIPTION
float

The logarithm of the probability of having sampled a set of size subset_len.

Source code in src/pydvl/valuation/samplers/msr.py
def log_weight(self, n: int, subset_len: int) -> float:
    r"""Probability of sampling a set of size k.

    In the **MSR scheme**, the sampling is done from the full power set $2^N$ (each
    set $S \subseteq N$ with probability $1 / 2^n$), and then for each data point
    $i$ one partitions the sample into:

        * $\mathcal{S}_{\ni i} = \{S \in \mathcal{S}: i \in S\},$ and
        * $\mathcal{S}_{\not\ni i} = \{S \in \mathcal{S}: i \notin S\}.$

    When we condition on the event $i \in S$, the remaining part $S_{- i}$ is
    uniformly distributed over $2^{N_{- i}}$. In other words, the act of
    partitioning recovers the uniform distribution on $2^{N_{- i}}$ "for free"
    because

    $$P (S_{- i} = T \mid i \in S) = \frac{1}{2^{n - 1}},$$

    for each $T \subseteq N_{- i}$.

    Args:
        n: Size of the index set.
        subset_len: Size of the subset.

    Returns:
        The logarithm of the probability of having sampled a set of size
            `subset_len`.
    """
    return float(-(n - 1) * np.log(2)) if n > 0 else 0.0

make_strategy

make_strategy(
    utility: UtilityBase, coefficient: Callable[[int, int], float] | None = None
) -> MSREvaluationStrategy

Returns the strategy for this sampler.

PARAMETER DESCRIPTION
utility

Utility function to evaluate.

TYPE: UtilityBase

coefficient

Coefficient function for the utility function.

TYPE: Callable[[int, int], float] | None DEFAULT: None

Source code in src/pydvl/valuation/samplers/msr.py
def make_strategy(
    self,
    utility: UtilityBase,
    coefficient: Callable[[int, int], float] | None = None,
) -> MSREvaluationStrategy:
    """Returns the strategy for this sampler.

    Args:
        utility: Utility function to evaluate.
        coefficient: Coefficient function for the utility function.
    """
    assert coefficient is not None
    return MSREvaluationStrategy(self, utility, coefficient)

result_updater

result_updater(result: ValuationResult) -> ResultUpdater

Returns a callable that updates a valuation result with an MSR value update.

MSR updates two running means for positive and negative updates separately. The two running means are later combined into a final result.

PARAMETER DESCRIPTION
result

The valuation result to update with each call of the returned callable.

TYPE: ValuationResult

Returns: A callable object that updates the valuation result with every MSRValueUpdate.

Source code in src/pydvl/valuation/samplers/msr.py
def result_updater(self, result: ValuationResult) -> ResultUpdater:
    """Returns a callable that updates a valuation result with an MSR value update.

    MSR updates two running means for positive and negative updates separately. The
    two running means are later combined into a final result.

    Args:
        result: The valuation result to update with each call of the returned
            callable.
    Returns:
        A callable object that updates the valuation result with every
            [MSRValueUpdate][pydvl.valuation.samplers.msr.MSRValueUpdate].
    """
    return MSRResultUpdater(result)

OwenStrategy

OwenStrategy(n_samples_outer: int)

Bases: ABC

Base class for strategies for the Owen sampler to sample probability values.

Source code in src/pydvl/valuation/samplers/owen.py
def __init__(self, n_samples_outer: int):
    self.n_samples_outer = n_samples_outer

UniformOwenStrategy

UniformOwenStrategy(n_samples_outer: int, seed: Seed | None = None)

Bases: OwenStrategy

A strategy for OwenSampler to sample probability values uniformly between 0 and \(q_\text{stop}\).

PARAMETER DESCRIPTION
n_samples_outer

The number of probability values \(q\) used for the outer loop. Since samples are taken anew for each index, a high number will delay updating new indices and has no effect on the final accuracy if using an infinite index iteration. In general, it only makes sense to change this number if using a finite index iteration.

TYPE: int

seed

The seed for the random number generator.

TYPE: Seed | None DEFAULT: None

Source code in src/pydvl/valuation/samplers/owen.py
def __init__(self, n_samples_outer: int, seed: Seed | None = None):
    super().__init__(n_samples_outer=n_samples_outer)
    self.rng = np.random.default_rng(seed)

GridOwenStrategy

GridOwenStrategy(n_samples_outer: int)

Bases: OwenStrategy

A strategy for OwenSampler to sample probability values on a linear grid.

PARAMETER DESCRIPTION
n_samples_outer

The number of probability values \(q\) used for the outer loop. These will be linearly spaced between 0 and \(q_\text{stop}\).

TYPE: int

Source code in src/pydvl/valuation/samplers/owen.py
def __init__(self, n_samples_outer: int):
    super().__init__(n_samples_outer=n_samples_outer)

OwenSampler

OwenSampler(
    outer_sampling_strategy: OwenStrategy,
    n_samples_inner: int = 2,
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
    seed: Seed | None = None,
)

Bases: StochasticSamplerMixin, PowersetSampler

A sampler for semi-values using the Owen method.

For each index \(i\) we sample n_samples_outer probability values \(q_j\) between 0 and 1. Then, for each \(j\), we draw n_samples_inner subsets of the complement of the current index, where each element is included with probability \(q_j\).

The distribution for the outer sampling can be either uniform or deterministic. Deterministic sampling on a grid is the original method described in Okhrati and Lipani (2021)1, and can be achieved by using the GridOwenStrategy strategy.

Alternatively, the distribution can be uniform between 0 and 1. This can be achieved by using the UniformOwenStrategy strategy.

By combining a UniformOwenStrategy with an infinite IndexIteration strategy, this sampler can be used with a stopping criterion to estimate semi-values. This follows more closely the typical usage pattern in PyDVL than the original sampling method described in Okhrati and Lipani (2021)1.

Example usage
sampler = OwenSampler(
    outer_sampling_strategy=GridOwenStrategy(n_samples_outer=200),
    n_samples_inner=8,
    index_iteration=FiniteSequentialIndexIteration,
)
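
The two-level procedure described above can be sketched in plain Python as follows. This is a toy model under stated assumptions, not the library's implementation: `owen_samples` is a hypothetical name, samples are plain `(index, subset)` tuples, and batching and skipping of indices are left out.

```python
import random
from typing import Iterator, List, Optional, Sequence, Tuple


def owen_samples(
    indices: Sequence[int],
    outer_probabilities: Sequence[float],
    n_samples_inner: int = 2,
    seed: Optional[int] = None,
) -> Iterator[Tuple[int, List[int]]]:
    """Yield (index, subset) pairs; subset elements are kept with probability q."""
    rng = random.Random(seed)
    for idx in indices:  # index iteration
        complement = [j for j in indices if j != idx]
        for q in outer_probabilities:  # outer loop over probabilities
            for _ in range(n_samples_inner):  # inner loop over subsets
                subset = [j for j in complement if rng.random() < q]
                yield idx, subset
```

With a finite index iteration, the number of samples is the product of the number of indices, the number of outer probabilities and n_samples_inner.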
PARAMETER DESCRIPTION
n_samples_inner

The number of samples drawn for each probability. In the original paper this was fixed to 2 for all experiments.

TYPE: int DEFAULT: 2

batch_size

The batch size of the sampler.

TYPE: int DEFAULT: 1

index_iteration

The index iteration strategy, sequential or random, finite or infinite.

TYPE: Type[IndexIteration] DEFAULT: FiniteSequentialIndexIteration

seed

The seed for the random number generator.

TYPE: Seed | None DEFAULT: None

Source code in src/pydvl/valuation/samplers/owen.py
def __init__(
    self,
    outer_sampling_strategy: OwenStrategy,
    n_samples_inner: int = 2,
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
    seed: Seed | None = None,
):
    super().__init__(
        batch_size=batch_size, index_iteration=index_iteration, seed=seed
    )
    self.n_samples_inner = n_samples_inner
    self.sampling_probabilities = outer_sampling_strategy
    self.q_stop = 1.0

skip_indices property writable

skip_indices

Set of indices to skip in the outer loop.

interrupt

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py
def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

__len__

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES DESCRIPTION
`TypeError`

if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py
def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

generate_batches

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py
def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break
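
The interplay between batching and interrupt() can be illustrated with a self-contained toy. The class `ToySampler` and the local `chunked` helper are hypothetical stand-ins written for this sketch; the sample stream here is just an infinite counter.

```python
from itertools import count, islice
from typing import Iterable, Iterator, List


def chunked(iterable: Iterable[int], size: int) -> Iterator[List[int]]:
    """Split an iterable into lists of at most `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


class ToySampler:
    def __init__(self, batch_size: int):
        self.batch_size = batch_size
        self._interrupted = False
        self.n_samples = 0

    def interrupt(self) -> None:
        self._interrupted = True

    def generate_batches(self) -> Iterator[List[int]]:
        self._interrupted = False
        self.n_samples = 0
        for batch in chunked(count(), self.batch_size):  # infinite stream
            self.n_samples += len(batch)
            yield batch
            # The flag is checked only after the consumer resumes the generator,
            # so the current batch is always delivered in full.
            if self._interrupted:
                break


sampler = ToySampler(batch_size=3)
seen = []
for batch in sampler.generate_batches():
    seen.append(batch)
    if len(seen) == 2:
        sampler.interrupt()
```

Because the check happens after the `yield`, calling interrupt() while processing a batch stops the generator before a further batch is produced.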

result_updater

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns a callable that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER DESCRIPTION
result

The result to update

TYPE: ValuationResult

Returns: A callable object that updates the result with a value update

Source code in src/pydvl/valuation/samplers/base.py
def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns a callable that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

index_iterator

index_iterator(indices: IndexSetT) -> Generator[IndexT | None, None, None]

Iterates over indices with the method specified at construction.

Source code in src/pydvl/valuation/samplers/powerset.py
def index_iterator(
    self, indices: IndexSetT
) -> Generator[IndexT | None, None, None]:
    """Iterates over indices with the method specified at construction."""
    try:
        self._index_iterator = self._index_iterator_cls(indices, seed=self._rng)  # type: ignore
    except (AttributeError, TypeError):
        self._index_iterator = self._index_iterator_cls(indices)
    for idx in self._index_iterator:
        if idx not in self.skip_indices:
            yield idx

log_weight

log_weight(n: int, subset_len: int) -> float

For each \(q_j, j \in \{1, ..., N\}\) in the outer probabilities, the probability of drawing a subset \(S_k\) of size \(k\) is:

\[ P (| S_{q_j} | = k) = \binom{n}{k} \ q_j^k (1 - q_j)^{n - k}.\]

So, if each \(q_j\) is chosen with equal weight (or more generally with probability \(p_j\)), then by total probability, the overall probability of obtaining a subset of size \(k\) is a mixture of the binomials:

\[ P (| S | = k) = \sum_{j = 1}^N p_j \ \binom{n}{k} \ q_j^k (1 - q_j)^{n - k}. \]

In our case \(p_j = 1/N\), so that \(P(|S|=k) = \frac{1}{N} \sum_{j=1}^N P (| S_{q_j} | = k)\). For large enough \(N\) this is

\[ P(|S|=k) \approx \binom{n}{k} \int_0^1 q^k (1 - q)^{n - k} \, dq = \frac{1}{ n+1}, \]

where we computed the integral using the beta function and its expression as products of gamma functions.

Now, given the symmetry wrt. the indices in the sampling procedure, any given set \(S\) of size \(k\) is equally likely to be drawn. So the probability of a set being of size \(k\) must be equally divided by the number of sets of that size, and the weight of a set of size \(k\) is:

\[ P(S) = \frac{1}{n+1} \binom{n}{|S|}^{-1}. \]
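
The closed form can be checked numerically. The sketch below evaluates the beta integral through factorials, \(\int_0^1 q^k (1-q)^{n-k} \, dq = \frac{k! \, (n-k)!}{(n+1)!}\), and confirms that the resulting subset weights sum to one over all subsets. The function names are hypothetical, and `n` denotes the size of the set from which subsets are drawn.

```python
import math


def size_probability(n: int, k: int) -> float:
    """binom(n, k) times the beta integral B(k + 1, n - k + 1)."""
    beta = math.factorial(k) * math.factorial(n - k) / math.factorial(n + 1)
    return math.comb(n, k) * beta


def log_subset_weight(n: int, subset_len: int) -> float:
    """log P(S) = -log(n + 1) - log binom(n, |S|)."""
    return -math.log(n + 1) - math.log(math.comb(n, subset_len))


n = 10
# Every subset size is equally likely, with probability 1 / (n + 1).
sizes = [size_probability(n, k) for k in range(n + 1)]
# Summing P(S) over all subsets, grouped by size, recovers total mass 1.
total = sum(math.comb(n, k) * math.exp(log_subset_weight(n, k)) for k in range(n + 1))
```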
PARAMETER DESCRIPTION
n

Size of the index set.

TYPE: int

subset_len

Size of the subset.

TYPE: int

Returns: The logarithm of the weight of a subset of size subset_len.

Source code in src/pydvl/valuation/samplers/owen.py
def log_weight(self, n: int, subset_len: int) -> float:
    r"""For each $q_j, j \in \{1, ..., N\}$ in the outer probabilities, the
    probability of drawing a subset $S_k$ of size $k$ is:

    $$ P (| S_{q_j} | = k) = \binom{n}{k} \  q_j^k  (1 - q_j)^{n - k}.$$

    So, if each $q_j$ is chosen with equal weight (or more generally with
    probability $p_j$), then by total probability, the overall probability of
    obtaining a subset of size $k$ is a mixture of the binomials:
    $$
    P (| S | = k) = \sum_{j = 1}^N p_j \ \binom{n}{k} \ q_j^k  (1 - q_j)^{n - k}.
    $$

    In our case $p_j = 1/N$, so that $P(|S|=k) = \frac{1}{N} \sum_{j=1}^N P (|
    S_{q_j} | = k)$. For large enough $N$ this is

    $$
    P(|S|=k) \approx \binom{n}{k} \int_0^1 q^k (1 - q)^{n - k} \, dq = \frac{1}{
    n+1},
    $$

    where we computed the integral using the beta function and its expression as
    products of gamma functions.

    Now, given the symmetry wrt. the indices in the sampling procedure, any given
    set $S$ of size $k$ is equally likely to be drawn. So the probability of a set
    being of size $k$ must be equally divided by the number of sets of that size,
    and the weight of a set of size $k$ is:

    $$ P(S) = \frac{1}{n+1} \binom{n}{|S|}^{-1}. $$

    Args:
        n: Size of the index set.
        subset_len: Size of the subset.
    Returns:
        The logarithm of the weight of a subset of size `subset_len`.
    """
    m = self._index_iterator_cls.complement_size(n)
    return float(-logcomb(m, subset_len) - np.log(m + 1))

sample_limit

sample_limit(indices: IndexSetT) -> int | None

The number of samples that will be generated by the sampler.

PARAMETER DESCRIPTION
indices

TYPE: IndexSetT

RETURNS DESCRIPTION
int | None

0 if there are no indices, None if there's no limit, and the number of samples otherwise.

Source code in src/pydvl/valuation/samplers/owen.py
def sample_limit(self, indices: IndexSetT) -> int | None:
    """The number of samples that will be generated by the sampler.

    Args:
        indices:

    Returns:
        0 if there are no indices, `None` if there's no limit and the number of
        samples otherwise.
    """
    if len(indices) == 0:
        return 0
    if not self._index_iterator_cls.is_finite():
        return None

    return (
        cast(int, self._index_iterator_cls.length(len(indices)))
        * self.sampling_probabilities.n_samples_outer
        * self.n_samples_inner
    )

AntitheticOwenSampler

AntitheticOwenSampler(
    outer_sampling_strategy: OwenStrategy,
    n_samples_inner: int = 2,
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
    seed: Seed | None = None,
)

Bases: OwenSampler

A sampler for antithetic Owen Shapley values.

For each sample obtained with the method of OwenSampler, a second sample is generated by taking the complement of the first sample.

For the same total number of samples, the antithetic Owen sampler usually yields more precise estimates of Shapley values than the regular Owen sampler.
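
A minimal sketch of the antithetic pairing, assuming subsets are plain lists and using a hypothetical helper `antithetic_pair`: the mirrored sample is the complement of the first one within the complement of the current index. Since the mirror of a subset drawn with probability \(q\) is distributed like a subset drawn with probability \(1 - q\), one plausible reading of q_stop = 0.5 in the constructor is that the outer probabilities only need to cover half of the unit interval.

```python
import random
from typing import List, Tuple


def antithetic_pair(
    complement: List[int], q: float, rng: random.Random
) -> Tuple[List[int], List[int]]:
    """Draw a subset with inclusion probability q and return it with its mirror."""
    subset = [j for j in complement if rng.random() < q]
    chosen = set(subset)
    mirrored = [j for j in complement if j not in chosen]
    return subset, mirrored


rng = random.Random(0)
subset, mirrored = antithetic_pair([1, 2, 3, 4], q=0.3, rng=rng)
```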

Source code in src/pydvl/valuation/samplers/owen.py
def __init__(
    self,
    outer_sampling_strategy: OwenStrategy,
    n_samples_inner: int = 2,
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
    seed: Seed | None = None,
):
    super().__init__(
        outer_sampling_strategy=outer_sampling_strategy,
        n_samples_inner=n_samples_inner,
        batch_size=batch_size,
        index_iteration=index_iteration,
        seed=seed,
    )
    self.q_stop = 0.5

skip_indices property writable

skip_indices

Set of indices to skip in the outer loop.

interrupt

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py
def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

__len__

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES DESCRIPTION
`TypeError`

if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py
def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

generate_batches

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py
def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break

log_weight

log_weight(n: int, subset_len: int) -> float

For each \(q_j, j \in \{1, ..., N\}\) in the outer probabilities, the probability of drawing a subset \(S_k\) of size \(k\) is:

\[ P (| S_{q_j} | = k) = \binom{n}{k} \ q_j^k (1 - q_j)^{n - k}.\]

So, if each \(q_j\) is chosen with equal weight (or more generally with probability \(p_j\)), then by total probability, the overall probability of obtaining a subset of size \(k\) is a mixture of the binomials:

\[ P (| S | = k) = \sum_{j = 1}^N p_j \ \binom{n}{k} \ q_j^k (1 - q_j)^{n - k}. \]

In our case \(p_j = 1/N\), so that \(P(|S|=k) = \frac{1}{N} \sum_{j=1}^N P (| S_{q_j} | = k)\). For large enough \(N\) this is

\[ P(|S|=k) \approx \binom{n}{k} \int_0^1 q^k (1 - q)^{n - k} \, dq = \frac{1}{ n+1}, \]

where we computed the integral using the beta function and its expression as products of gamma functions.

Now, given the symmetry wrt. the indices in the sampling procedure, any given set \(S\) of size \(k\) is equally likely to be drawn. So the probability of a set being of size \(k\) must be equally divided by the number of sets of that size, and the weight of a set of size \(k\) is:

\[ P(S) = \frac{1}{n+1} \binom{n}{|S|}^{-1}. \]
PARAMETER DESCRIPTION
n

Size of the index set.

TYPE: int

subset_len

Size of the subset.

TYPE: int

Returns: The logarithm of the weight of a subset of size subset_len.

Source code in src/pydvl/valuation/samplers/owen.py
def log_weight(self, n: int, subset_len: int) -> float:
    r"""For each $q_j, j \in \{1, ..., N\}$ in the outer probabilities, the
    probability of drawing a subset $S_k$ of size $k$ is:

    $$ P (| S_{q_j} | = k) = \binom{n}{k} \  q_j^k  (1 - q_j)^{n - k}.$$

    So, if each $q_j$ is chosen with equal weight (or more generally with
    probability $p_j$), then by total probability, the overall probability of
    obtaining a subset of size $k$ is a mixture of the binomials:
    $$
    P (| S | = k) = \sum_{j = 1}^N p_j \ \binom{n}{k} \ q_j^k  (1 - q_j)^{n - k}.
    $$

    In our case $p_j = 1/N$, so that $P(|S|=k) = \frac{1}{N} \sum_{j=1}^N P (|
    S_{q_j} | = k)$. For large enough $N$ this is

    $$
    P(|S|=k) \approx \binom{n}{k} \int_0^1 q^k (1 - q)^{n - k} \, dq = \frac{1}{
    n+1},
    $$

    where we computed the integral using the beta function and its expression as
    products of gamma functions.

    Now, given the symmetry wrt. the indices in the sampling procedure, any given
    set $S$ of size $k$ is equally likely to be drawn. So the probability of a set
    being of size $k$ must be equally divided by the number of sets of that size,
    and the weight of a set of size $k$ is:

    $$ P(S) = \frac{1}{n+1} \binom{n}{|S|}^{-1}. $$

    Args:
        n: Size of the index set.
        subset_len: Size of the subset.
    Returns:
        The logarithm of the weight of a subset of size `subset_len`.
    """
    m = self._index_iterator_cls.complement_size(n)
    return float(-logcomb(m, subset_len) - np.log(m + 1))

result_updater

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns a callable that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER DESCRIPTION
result

The result to update

TYPE: ValuationResult

Returns: A callable object that updates the result with a value update

Source code in src/pydvl/valuation/samplers/base.py
def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns a callable that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

index_iterator

index_iterator(indices: IndexSetT) -> Generator[IndexT | None, None, None]

Iterates over indices with the method specified at construction.

Source code in src/pydvl/valuation/samplers/powerset.py
def index_iterator(
    self, indices: IndexSetT
) -> Generator[IndexT | None, None, None]:
    """Iterates over indices with the method specified at construction."""
    try:
        self._index_iterator = self._index_iterator_cls(indices, seed=self._rng)  # type: ignore
    except (AttributeError, TypeError):
        self._index_iterator = self._index_iterator_cls(indices)
    for idx in self._index_iterator:
        if idx not in self.skip_indices:
            yield idx

PermutationSampler

PermutationSampler(
    truncation: TruncationPolicy | None = None,
    seed: Seed | None = None,
    batch_size: int = 1,
)

Bases: StochasticSamplerMixin, PermutationSamplerBase

Samples permutations of indices.

Batching

Even though this sampler supports batching, it is not recommended to use it since the PermutationEvaluationStrategy processes whole permutations in one go, effectively batching the computation of up to n-1 marginal utilities in one process.

PARAMETER DESCRIPTION
truncation

A policy to stop the permutation early.

TYPE: TruncationPolicy | None DEFAULT: None

seed

Seed for the random number generator.

TYPE: Seed | None DEFAULT: None

Source code in src/pydvl/valuation/samplers/permutation.py
def __init__(
    self,
    truncation: TruncationPolicy | None = None,
    seed: Seed | None = None,
    batch_size: int = 1,
):
    super().__init__(seed=seed, truncation=truncation, batch_size=batch_size)

interrupt

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py
def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

__len__

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES DESCRIPTION
`TypeError`

if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py
def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

generate_batches

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py
def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break

result_updater

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns a callable that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER DESCRIPTION
result

The result to update

TYPE: ValuationResult

Returns: A callable object that updates the result with a value update

Source code in src/pydvl/valuation/samplers/base.py
def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns a callable that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

generate

generate(indices: IndexSetT) -> SampleGenerator

Generates the permutation samples.

PARAMETER DESCRIPTION
indices

The indices to sample from. If empty, no samples are generated. If skip_indices is set, these indices are removed from the set before generating the permutation.

TYPE: IndexSetT

Source code in src/pydvl/valuation/samplers/permutation.py
def generate(self, indices: IndexSetT) -> SampleGenerator:
    """Generates the permutation samples.

    Args:
        indices: The indices to sample from. If empty, no samples are generated. If
            [skip_indices][pydvl.valuation.samplers.base.IndexSampler.skip_indices]
            is set, these indices are removed from the set before generating the
            permutation.
    """
    if len(indices) == 0:
        return
    while True:
        _indices = np.setdiff1d(indices, self.skip_indices)
        yield Sample(None, self._rng.permutation(_indices))

AntitheticPermutationSampler

AntitheticPermutationSampler(
    truncation: TruncationPolicy | None = None,
    seed: Seed | None = None,
    batch_size: int = 1,
)

Bases: PermutationSampler

Samples permutations like PermutationSampler, but after each permutation, it returns the same permutation in reverse order.

This sampler was suggested by Mitchell et al. (2022)1.

New in version 0.7.1
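
The pairing of each permutation with its reverse can be sketched as below. The generator `antithetic_permutations` is a hypothetical, finite stand-in that works on plain lists; truncation and batching are omitted.

```python
import random
from typing import Iterator, List, Optional, Sequence


def antithetic_permutations(
    indices: Sequence[int], n_pairs: int, seed: Optional[int] = None
) -> Iterator[List[int]]:
    """Yield n_pairs random permutations, each followed by its reverse."""
    rng = random.Random(seed)
    for _ in range(n_pairs):
        perm = list(indices)
        rng.shuffle(perm)
        yield perm
        yield perm[::-1]


perms = list(antithetic_permutations([0, 1, 2, 3], n_pairs=2, seed=1))
```

An index that appears early in one permutation appears late in its mirror, so it is evaluated against both small and large preceding subsets.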

Source code in src/pydvl/valuation/samplers/permutation.py
def __init__(
    self,
    truncation: TruncationPolicy | None = None,
    seed: Seed | None = None,
    batch_size: int = 1,
):
    super().__init__(seed=seed, truncation=truncation, batch_size=batch_size)

interrupt

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py
def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

__len__

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES DESCRIPTION
`TypeError`

if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py
def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

generate_batches

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py
def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break

result_updater

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns a callable that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER DESCRIPTION
result

The result to update

TYPE: ValuationResult

Returns: A callable object that updates the result with a value update

Source code in src/pydvl/valuation/samplers/base.py
def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns a callable that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

DeterministicPermutationSampler

DeterministicPermutationSampler(
    *args,
    truncation: TruncationPolicy | None = None,
    batch_size: int = 1,
    **kwargs,
)

Bases: PermutationSamplerBase

Samples all n! permutations of the indices deterministically, and iterates through them, returning sets as required for the permutation-based definition of semi-values.
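
To give a sense of the cost, the following sketch enumerates all permutations of a small index set with `itertools.permutations`. The helper name is hypothetical and the enumeration order used by the library may differ.

```python
from itertools import permutations
from math import factorial
from typing import Iterator, List, Sequence


def all_permutations(indices: Sequence[int]) -> Iterator[List[int]]:
    """Yield every permutation of `indices` exactly once."""
    for perm in permutations(indices):
        yield list(perm)


perms = list(all_permutations([0, 1, 2]))
n_for_ten = factorial(10)  # number of permutations for only 10 indices
```

Already for 10 indices there are 3628800 permutations, so exhaustive enumeration is only practical for very small sets.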

Source code in src/pydvl/valuation/samplers/permutation.py
def __init__(
    self,
    *args,
    truncation: TruncationPolicy | None = None,
    batch_size: int = 1,
    **kwargs,
):
    super().__init__(batch_size=batch_size)
    self.truncation = truncation or NoTruncation()

skip_indices property writable

skip_indices: IndexSetT

Indices being skipped in the sampler. The exact behaviour will be sampler-dependent, so that setting this property is disabled by default.

interrupt

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py
def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

__len__

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES DESCRIPTION
`TypeError`

if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py
def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

generate_batches

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py
def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break

result_updater

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns a callable that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER DESCRIPTION
result

The result to update

TYPE: ValuationResult

Returns: A callable object that updates the result with a value update

Source code in src/pydvl/valuation/samplers/base.py
def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns a callable that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

PermutationEvaluationStrategy

PermutationEvaluationStrategy(
    sampler: PermutationSamplerBase,
    utility: UtilityBase,
    coefficient: Callable[[int, int], float] | None = None,
)

Bases: EvaluationStrategy[PermutationSamplerBase, ValueUpdate]

Computes marginal values for permutation sampling schemes in log-space.

This strategy iterates over permutations from left to right, computing the marginal utility wrt. the previous one at each step to save computation.
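
The incremental evaluation can be sketched as follows. This toy version works with raw utilities on frozensets and leaves out the log-space arithmetic, the coefficients and the truncation policy; `permutation_marginals` and the example utility are hypothetical.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def permutation_marginals(
    permutation: Sequence[int], utility: Callable[[FrozenSet[int]], float]
) -> List[Tuple[int, float]]:
    """Marginal utility of each index w.r.t. the indices preceding it."""
    marginals = []
    prefix: FrozenSet[int] = frozenset()
    previous = utility(prefix)
    for idx in permutation:
        prefix = prefix | {idx}
        current = utility(prefix)  # one utility evaluation per step
        marginals.append((idx, current - previous))
        previous = current  # reused as the baseline for the next index
    return marginals


result = permutation_marginals([2, 0, 1], lambda s: float(len(s) ** 2))
```

Each step costs a single utility evaluation because the previous value serves as the baseline, and the marginals telescope to the utility of the full set minus that of the empty set.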

Source code in src/pydvl/valuation/samplers/permutation.py
def __init__(
    self,
    sampler: PermutationSamplerBase,
    utility: UtilityBase,
    coefficient: Callable[[int, int], float] | None = None,
):
    super().__init__(sampler, utility, coefficient)
    self.truncation = copy(sampler.truncation)
    self.truncation.reset(utility)  # Perform initial setup (e.g. total_utility)

IndexIteration

IndexIteration(indices: IndexSetT)

Bases: ABC

Source code in src/pydvl/valuation/samplers/powerset.py
def __init__(self, indices: IndexSetT):
    self._indices = indices

length abstractmethod staticmethod

length(n_indices: int) -> int | None

Returns the length of the iteration over the index set

PARAMETER DESCRIPTION
n_indices

The number of indices in the set.

TYPE: int

RETURNS DESCRIPTION
int | None

The length of the iteration. It can be:

- a non-negative integer, if the iteration is finite
- None if the iteration never ends.

Source code in src/pydvl/valuation/samplers/powerset.py
@staticmethod
@abstractmethod
def length(n_indices: int) -> int | None:
    """Returns the length of the iteration over the index set

    Args:
        n_indices: The number of indices in the set.

    Returns:
        The length of the iteration. It can be:
            - a non-negative integer, if the iteration is finite
            - `None` if the iteration never ends.
    """
    ...

complement_size abstractmethod staticmethod

complement_size(n: int) -> int

Returns the size of complements of sets of size n, with respect to the indices returned by the iteration.

If the iteration returns single indices, this is n-1; if it returns no indices, it is n; if it returned tuples, it would be n-2, and so on.

Source code in src/pydvl/valuation/samplers/powerset.py
@staticmethod
@abstractmethod
def complement_size(n: int) -> int:
    """Returns the size of complements of sets of size n, with respect to the
    indices returned by the iteration.

    If the iteration returns single indices, then this is n-1, if it returns no
    indices, then it is n. If it returned tuples, then n-2, etc.
    """
    ...
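
A toy pair of iterations may make the two static methods more concrete. The classes below are hypothetical and do not derive from IndexIteration; that a finite sequential iteration has length n_indices is an assumption of this sketch.

```python
from typing import Iterator, Optional, Sequence


class ToyFiniteSequential:
    """Visits every index once; complements exclude the current index."""

    def __init__(self, indices: Sequence[int]):
        self._indices = indices

    def __iter__(self) -> Iterator[int]:
        yield from self._indices

    @staticmethod
    def length(n_indices: int) -> Optional[int]:
        return n_indices

    @staticmethod
    def complement_size(n: int) -> int:
        return n - 1


class ToyFiniteNoIndex:
    """Yields None once; subsets are drawn from the whole set."""

    def __init__(self, indices: Sequence[int]):
        self._indices = indices

    def __iter__(self) -> Iterator[None]:
        yield None

    @staticmethod
    def length(n_indices: int) -> Optional[int]:
        return 1

    @staticmethod
    def complement_size(n: int) -> int:
        return n
```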

SequentialIndexIteration

SequentialIndexIteration(indices: IndexSetT)

Bases: InfiniteIterationMixin, IndexIteration

Samples indices sequentially, indefinitely.

Source code in src/pydvl/valuation/samplers/powerset.py
def __init__(self, indices: IndexSetT):
    self._indices = indices

FiniteSequentialIndexIteration

FiniteSequentialIndexIteration(indices: IndexSetT)

Bases: FiniteIterationMixin, SequentialIndexIteration

Samples indices sequentially, once.

Source code in src/pydvl/valuation/samplers/powerset.py
def __init__(self, indices: IndexSetT):
    self._indices = indices

RandomIndexIteration

RandomIndexIteration(indices: NDArray[IndexT], seed: Seed)

Bases: InfiniteIterationMixin, StochasticSamplerMixin, IndexIteration

Samples indices at random, indefinitely.

Source code in src/pydvl/valuation/samplers/powerset.py
def __init__(self, indices: NDArray[IndexT], seed: Seed):
    super().__init__(indices, seed=seed)

FiniteRandomIndexIteration

FiniteRandomIndexIteration(indices: NDArray[IndexT], seed: Seed)

Bases: FiniteIterationMixin, RandomIndexIteration

Samples indices at random, once.

Source code in src/pydvl/valuation/samplers/powerset.py
def __init__(self, indices: NDArray[IndexT], seed: Seed):
    super().__init__(indices, seed=seed)

NoIndexIteration

NoIndexIteration(indices: IndexSetT)

Bases: InfiniteIterationMixin, IndexIteration

An infinite iteration over no indices.

Source code in src/pydvl/valuation/samplers/powerset.py
def __init__(self, indices: IndexSetT):
    self._indices = indices

FiniteNoIndexIteration

FiniteNoIndexIteration(indices: IndexSetT)

Bases: FiniteIterationMixin, NoIndexIteration

A finite iteration over no indices. The iterator will yield None once and then stop.

Source code in src/pydvl/valuation/samplers/powerset.py
def __init__(self, indices: IndexSetT):
    self._indices = indices

length staticmethod

length(n_indices: int) -> int | None

Returns 1, as the iteration yields exactly one item (None)

Source code in src/pydvl/valuation/samplers/powerset.py
@staticmethod
def length(n_indices: int) -> int | None:
    """Returns 1, as the iteration yields exactly one item (None)"""
    return 1
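
The index iterations above differ only in what they yield and for how long. The following is a minimal standalone sketch in plain Python of the three behaviours (once, indefinitely, and "no index"). The function names are hypothetical and this is not pyDVL's implementation; it only illustrates the semantics described in the docstrings.

```python
from itertools import islice
from typing import Iterator, Optional, Sequence


def sequential_once(indices: Sequence[int]) -> Iterator[int]:
    """Finite sequential iteration: visit every index exactly once."""
    yield from indices


def sequential_forever(indices: Sequence[int]) -> Iterator[int]:
    """Infinite sequential iteration: restart after the last index."""
    while True:
        yield from indices


def no_index_once(indices: Sequence[int]) -> Iterator[Optional[int]]:
    """Finite 'no index' iteration: yield None once, then stop."""
    yield None


indices = [0, 1, 2]
finite = list(sequential_once(indices))
# An infinite iteration has no length, so only a prefix can be materialized.
infinite_head = list(islice(sequential_forever(indices), 7))
no_index = list(no_index_once(indices))
print(finite, infinite_head, no_index)
```

In this sketch the "no index" iteration has length 1 regardless of the number of indices, which matches the value returned by FiniteNoIndexIteration.length above.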

PowersetSampler

PowersetSampler(
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = SequentialIndexIteration,
)

Bases: IndexSampler, ABC

An abstract class for samplers which iterate over the powerset of the complement of an index in the training set.

This is done in two nested loops, where the outer loop iterates over the set of indices, and the inner loop iterates over subsets of the complement of the current index. The outer iteration can be either sequential or at random.

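PARAMETER DESCRIPTION
batch_size

The number of samples to generate per batch. Batches are processed together by UtilityEvaluator.

TYPE: int DEFAULT: 1

index_iteration

The strategy to use for iterating over indices to update.

TYPE: Type[IndexIteration] DEFAULT: SequentialIndexIteration
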
Source code in src/pydvl/valuation/samplers/powerset.py
def __init__(
    self,
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = SequentialIndexIteration,
):
    """
    Args:
        batch_size: The number of samples to generate per batch. Batches are
            processed together by
            [UtilityEvaluator][pydvl.valuation.utility.evaluator.UtilityEvaluator].
        index_iteration: the strategy to use for iterating over indices to update
    """
    super().__init__(batch_size)
    self._index_iterator_cls = index_iteration
    self._index_iterator: IndexIteration | None = None
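
The two nested loops described for PowersetSampler can be sketched in plain Python as follows. This is an illustration under simplifying assumptions (a finite, sequential outer loop and an exhaustive inner loop); the helper names powerset and nested_powerset_samples are hypothetical and not part of pyDVL.

```python
from itertools import chain, combinations
from typing import Iterator, Sequence, Tuple


def powerset(items: Sequence[int]) -> Iterator[Tuple[int, ...]]:
    """All subsets of items, ordered by increasing size."""
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )


def nested_powerset_samples(
    indices: Sequence[int],
) -> Iterator[Tuple[int, Tuple[int, ...]]]:
    """Outer loop over indices, inner loop over subsets of the complement."""
    for i in indices:  # outer loop: sequential, visits each index once
        complement = [j for j in indices if j != i]
        for subset in powerset(complement):  # inner loop
            yield i, subset


samples = list(nested_powerset_samples([0, 1, 2]))
print(len(samples))  # 3 indices times 2**2 subsets of each complement
```

Concrete samplers replace the exhaustive inner loop with their own strategy, for example drawing a single random subset per index.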

skip_indices property writable

skip_indices

Set of indices to skip in the outer loop.

interrupt

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py
def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

__len__

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES DESCRIPTION
`TypeError`

if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py
def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

generate_batches

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py
def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break
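
The interplay between batching and interrupt() can be reproduced without pyDVL. The sketch below uses a stdlib stand-in for chunked and a hypothetical class BatchingSketch that mirrors the control flow of the listing above; it is meant to show that an interrupt only takes effect after the current batch has been delivered.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(iterable: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items (stdlib stand-in)."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


class BatchingSketch:
    """Hypothetical stand-in mirroring the batching and interrupt logic."""

    def __init__(self, batch_size: int):
        self.batch_size = batch_size
        self._interrupted = False
        self.n_samples = 0

    def interrupt(self) -> None:
        self._interrupted = True

    def generate_batches(self, samples: Iterable[int]) -> Iterator[List[int]]:
        self._interrupted = False
        self.n_samples = 0
        for batch in chunked(samples, self.batch_size):
            self.n_samples += len(batch)
            yield batch
            if self._interrupted:  # checked only after the batch was consumed
                break


sampler = BatchingSketch(batch_size=3)
seen = []
for batch in sampler.generate_batches(range(10)):
    seen.append(batch)
    if len(seen) == 2:
        sampler.interrupt()  # the batch being processed is still complete
print(seen, sampler.n_samples)
```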

sample_limit abstractmethod

sample_limit(indices: IndexSetT) -> int | None

Number of samples that can be generated from the indices.

PARAMETER DESCRIPTION
indices

The indices used in the sampler.

TYPE: IndexSetT

RETURNS DESCRIPTION
int | None

The maximum number of samples that will be generated, or None if the number of samples is infinite. This will depend, among other things, on the type of IndexIteration.

Source code in src/pydvl/valuation/samplers/base.py
@abstractmethod
def sample_limit(self, indices: IndexSetT) -> int | None:
    """Number of samples that can be generated from the indices.

    Args:
        indices: The indices used in the sampler.

    Returns:
        The maximum number of samples that will be generated, or  `None` if the
            number of samples is infinite. This will depend, among other things,
            on the type of [IndexIteration][pydvl.valuation.samplers.IndexIteration].
    """
    ...

result_updater

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns a callable that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER DESCRIPTION
result

The result to update

TYPE: ValuationResult

RETURNS DESCRIPTION
ResultUpdater[ValueUpdateT]

A callable object that updates the result with a value update.

Source code in src/pydvl/valuation/samplers/base.py
def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns a callable that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

index_iterator

index_iterator(indices: IndexSetT) -> Generator[IndexT | None, None, None]

Iterates over indices with the method specified at construction.

Source code in src/pydvl/valuation/samplers/powerset.py
def index_iterator(
    self, indices: IndexSetT
) -> Generator[IndexT | None, None, None]:
    """Iterates over indices with the method specified at construction."""
    try:
        self._index_iterator = self._index_iterator_cls(indices, seed=self._rng)  # type: ignore
    except (AttributeError, TypeError):
        self._index_iterator = self._index_iterator_cls(indices)
    for idx in self._index_iterator:
        if idx not in self.skip_indices:
            yield idx
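
The construction pattern in index_iterator (try a seeded constructor, fall back to an unseeded one, then filter skipped indices) can be illustrated with two toy iteration classes. SequentialOnce, ShuffledOnce and iterate are hypothetical names used only for this sketch.

```python
import random
from typing import Iterator, Sequence, Set


class SequentialOnce:
    """Hypothetical iteration whose constructor takes no seed."""

    def __init__(self, indices: Sequence[int]):
        self._indices = list(indices)

    def __iter__(self) -> Iterator[int]:
        return iter(self._indices)


class ShuffledOnce:
    """Hypothetical iteration whose constructor requires a seed."""

    def __init__(self, indices: Sequence[int], seed: int):
        self._indices = list(indices)
        random.Random(seed).shuffle(self._indices)

    def __iter__(self) -> Iterator[int]:
        return iter(self._indices)


def iterate(
    cls, indices: Sequence[int], skip: Set[int], seed: int = 0
) -> Iterator[int]:
    """Try the seeded constructor first, then fall back to the unseeded one."""
    try:
        iteration = cls(indices, seed=seed)
    except TypeError:  # the class does not accept a seed argument
        iteration = cls(indices)
    for idx in iteration:
        if idx not in skip:  # skipped indices never reach the inner loop
            yield idx


sequential = list(iterate(SequentialOnce, [0, 1, 2, 3], skip={2}))
shuffled = sorted(iterate(ShuffledOnce, [0, 1, 2, 3], skip={2}))
print(sequential, shuffled)
```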

generate abstractmethod

generate(indices: IndexSetT) -> SampleGenerator

Generates samples over the powerset of indices

Each PowersetSampler defines its own way to generate the subsets by implementing this method. The outer loop is handled by the index_iterator. Batching is handled by the generate_batches method.

PARAMETER DESCRIPTION
indices

The set from which to generate samples.

TYPE: IndexSetT

Source code in src/pydvl/valuation/samplers/powerset.py
@abstractmethod
def generate(self, indices: IndexSetT) -> SampleGenerator:
    """Generates samples over the powerset of `indices`

    Each `PowersetSampler` defines its own way to generate the subsets by
    implementing this method. The outer loop is handled by the `index_iterator`.
    Batching is handled by the `generate_batches` method.

    Args:
        indices: The set from which to generate samples.
    """
    ...

log_weight

log_weight(n: int, subset_len: int) -> float

Correction coming from Monte Carlo integration so that the mean of the marginals converges to the value: the uniform distribution over the powerset of a set with n-1 elements has mass 1/2^{n-1} over each subset.

Source code in src/pydvl/valuation/samplers/powerset.py
def log_weight(self, n: int, subset_len: int) -> float:
    """Correction coming from Monte Carlo integration so that the mean of
    the marginals converges to the value: the uniform distribution over the
    powerset of a set with n-1 elements has mass 1/2^{n-1} over each subset."""
    m = self._index_iterator_cls.complement_size(n)
    return float(-m * np.log(2))
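
As a sanity check of the log-weight above, the following standalone sketch verifies that a mass of 1/2^{n-1} per subset sums to one over the powerset of the n-1 complement indices. The functions complement_size and log_weight are re-implemented here for illustration, with a hypothetical yields_index flag standing in for the choice of index iteration.

```python
import math
from itertools import chain, combinations


def complement_size(n: int, yields_index: bool = True) -> int:
    """n - 1 when the outer loop yields single indices, n when it yields none."""
    return n - 1 if yields_index else n


def log_weight(n: int, yields_index: bool = True) -> float:
    """Log-probability of any one subset under uniform powerset sampling."""
    return -complement_size(n, yields_index) * math.log(2)


n = 4
complement = range(n - 1)  # the n - 1 indices other than the one of interest
subsets = list(
    chain.from_iterable(
        combinations(complement, r) for r in range(len(complement) + 1)
    )
)
total_mass = sum(math.exp(log_weight(n)) for _ in subsets)
print(len(subsets), total_mass)
```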

LOOSampler

LOOSampler(
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
    seed: Seed | None = None,
)

Bases: PowersetSampler

Leave-One-Out sampler.

In this special case of a powerset sampler, for every index \(i\) in the set \(S\), the sample \((i, S_{-i})\) is returned.

PARAMETER DESCRIPTION
batch_size

The number of samples to generate per batch. Batches are processed together by each subprocess when working in parallel.

TYPE: int DEFAULT: 1

index_iteration

The strategy to use for iterating over indices to update. By default, a finite sequential index iteration is used, which is what LOOValuation expects.

TYPE: Type[IndexIteration] DEFAULT: FiniteSequentialIndexIteration

seed

The seed for the random number generator used in case the index iteration is random.

TYPE: Seed | None DEFAULT: None

New in version 0.10.0

Source code in src/pydvl/valuation/samplers/powerset.py
def __init__(
    self,
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
    seed: Seed | None = None,
):
    super().__init__(batch_size, index_iteration)
    if not self._index_iterator_cls.is_proper():
        raise ValueError("LOO samplers require a proper index iteration strategy")
    self._rng = np.random.default_rng(seed)

skip_indices property writable

skip_indices

Set of indices to skip in the outer loop.

interrupt

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py
def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

__len__

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES DESCRIPTION
`TypeError`

if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py
def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

generate_batches

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py
def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break

result_updater

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns a callable that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER DESCRIPTION
result

The result to update

TYPE: ValuationResult

RETURNS DESCRIPTION
ResultUpdater[ValueUpdateT]

A callable object that updates the result with a value update.

Source code in src/pydvl/valuation/samplers/base.py
def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns a callable that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

index_iterator

index_iterator(indices: IndexSetT) -> Generator[IndexT | None, None, None]

Iterates over indices with the method specified at construction.

Source code in src/pydvl/valuation/samplers/powerset.py
def index_iterator(
    self, indices: IndexSetT
) -> Generator[IndexT | None, None, None]:
    """Iterates over indices with the method specified at construction."""
    try:
        self._index_iterator = self._index_iterator_cls(indices, seed=self._rng)  # type: ignore
    except (AttributeError, TypeError):
        self._index_iterator = self._index_iterator_cls(indices)
    for idx in self._index_iterator:
        if idx not in self.skip_indices:
            yield idx

log_weight

log_weight(n: int, subset_len: int) -> float

This sampler returns only sets of size n-1. There are n such sets, so the probability of drawing one is 1/n, or 0 if subset_len != n-1.

Source code in src/pydvl/valuation/samplers/powerset.py
def log_weight(self, n: int, subset_len: int) -> float:
    """This sampler returns only sets of size n-1. There are n such sets, so the
    probability of drawing one is 1/n, or 0 if subset_len != n-1."""
    return float(-np.log(n if subset_len == n - 1 else 0))
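
The leave-one-out scheme and the probabilities stated in the docstring can be sketched in plain Python. The names loo_samples and loo_log_probability are hypothetical. This sketch encodes the zero-probability case as log(0) = -inf, which is an assumption; the expression in the listing above evaluates `-np.log(0)` in that branch, which NumPy computes as +inf, so the sign convention there differs.

```python
import math
from typing import Iterator, List, Sequence, Tuple


def loo_samples(indices: Sequence[int]) -> Iterator[Tuple[int, List[int]]]:
    """For every index i, yield (i, all other indices)."""
    for i in indices:
        yield i, [j for j in indices if j != i]


def loo_log_probability(n: int, subset_len: int) -> float:
    """log(1/n) for subsets of size n - 1, and -inf for any other size."""
    return -math.log(n) if subset_len == n - 1 else -math.inf


samples = list(loo_samples([0, 1, 2]))
print(samples)
print(loo_log_probability(3, 2), loo_log_probability(3, 1))
```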

DeterministicUniformSampler

DeterministicUniformSampler(
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
)

Bases: PowersetSampler

An iterator to perform uniform deterministic sampling of subsets.

For every index \(i\), each subset of the complement indices - {i} is returned.

PARAMETER DESCRIPTION
batch_size

The number of samples to generate per batch. Batches are processed together by each subprocess when working in parallel.

TYPE: int DEFAULT: 1

index_iteration

The strategy to use for iterating over indices to update. This iteration can be either finite or infinite.

TYPE: Type[IndexIteration] DEFAULT: FiniteSequentialIndexIteration

Example

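The code:

from pydvl.valuation.samplers import DeterministicUniformSampler
import numpy as np
sampler = DeterministicUniformSampler()
# generate_batches yields batches (lists) of samples; with batch_size=1 each holds one.
# The Sample fields idx and subset are assumed here.
for batch in sampler.generate_batches(np.arange(2)):
    for sample in batch:
        print(f"{sample.idx} - {sample.subset}", end=", ")

Should produce the output:

0 - [], 0 - [1], 1 - [], 1 - [0],
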
Source code in src/pydvl/valuation/samplers/powerset.py
def __init__(
    self,
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
):
    super().__init__(batch_size=batch_size, index_iteration=index_iteration)

skip_indices property writable

skip_indices

Set of indices to skip in the outer loop.

interrupt

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py
def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

__len__

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES DESCRIPTION
`TypeError`

if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py
def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

generate_batches

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py
def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break

log_weight

log_weight(n: int, subset_len: int) -> float

Correction coming from Monte Carlo integration so that the mean of the marginals converges to the value: the uniform distribution over the powerset of a set with n-1 elements has mass 1/2^{n-1} over each subset.

Source code in src/pydvl/valuation/samplers/powerset.py
def log_weight(self, n: int, subset_len: int) -> float:
    """Correction coming from Monte Carlo integration so that the mean of
    the marginals converges to the value: the uniform distribution over the
    powerset of a set with n-1 elements has mass 1/2^{n-1} over each subset."""
    m = self._index_iterator_cls.complement_size(n)
    return float(-m * np.log(2))

result_updater

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns a callable that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER DESCRIPTION
result

The result to update

TYPE: ValuationResult

RETURNS DESCRIPTION
ResultUpdater[ValueUpdateT]

A callable object that updates the result with a value update.

Source code in src/pydvl/valuation/samplers/base.py
def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns a callable that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

index_iterator

index_iterator(indices: IndexSetT) -> Generator[IndexT | None, None, None]

Iterates over indices with the method specified at construction.

Source code in src/pydvl/valuation/samplers/powerset.py
def index_iterator(
    self, indices: IndexSetT
) -> Generator[IndexT | None, None, None]:
    """Iterates over indices with the method specified at construction."""
    try:
        self._index_iterator = self._index_iterator_cls(indices, seed=self._rng)  # type: ignore
    except (AttributeError, TypeError):
        self._index_iterator = self._index_iterator_cls(indices)
    for idx in self._index_iterator:
        if idx not in self.skip_indices:
            yield idx

UniformSampler

UniformSampler(
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = SequentialIndexIteration,
    seed: Seed | None = None,
)

Bases: StochasticSamplerMixin, PowersetSampler

Draws random samples uniformly from the powerset of the index set.

Iterating over every index \(i\), either in sequence or at random depending on the value of index_iteration, one subset of the complement indices - {i} is sampled with equal probability \(1/2^{n-1}\).

PARAMETER DESCRIPTION
batch_size

The number of samples to generate per batch. Batches are processed together by each subprocess when working in parallel.

TYPE: int DEFAULT: 1

index_iteration

The strategy to use for iterating over indices to update. This iteration can be either finite or infinite.

TYPE: Type[IndexIteration] DEFAULT: SequentialIndexIteration

seed

The seed for the random number generator.

TYPE: Seed | None DEFAULT: None

Example

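The code

sampler = UniformSampler()
for batch in sampler.generate_batches(np.arange(5)):
    for sample in batch:  # Sample fields idx and subset are assumed here
        print(f"{sample.idx} - {sample.subset}", end=", ")

Produces output such as:

0 - [1 4], 1 - [2 3], 2 - [0 1 3], 3 - [], 4 - [2], 0 - [1 3 4], 1 - [0 2]
(...)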

Source code in src/pydvl/valuation/samplers/powerset.py
def __init__(
    self,
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = SequentialIndexIteration,
    seed: Seed | None = None,
):
    super().__init__(
        batch_size=batch_size, index_iteration=index_iteration, seed=seed
    )
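
One simple way to draw a subset uniformly from the powerset of the complement is to include each element independently with probability 1/2, which gives every subset probability 1/2^{n-1}. The sketch below checks this empirically with the standard library only; uniform_subset is a hypothetical helper, and pyDVL's own sampling routine may be implemented differently.

```python
import random
from collections import Counter
from typing import Sequence, Tuple


def uniform_subset(
    complement: Sequence[int], rng: random.Random
) -> Tuple[int, ...]:
    """Include each element independently with probability 1/2."""
    return tuple(j for j in complement if rng.random() < 0.5)


rng = random.Random(42)
indices = [0, 1, 2]
i = 0  # index of interest
complement = [j for j in indices if j != i]

n_draws = 20_000
counts = Counter(uniform_subset(complement, rng) for _ in range(n_draws))
frequencies = {s: c / n_draws for s, c in counts.items()}
print(frequencies)  # each of the 4 subsets should appear with frequency near 1/4
```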

skip_indices property writable

skip_indices

Set of indices to skip in the outer loop.

interrupt

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py
def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

__len__

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES DESCRIPTION
`TypeError`

if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py
def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

generate_batches

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py
def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break

log_weight

log_weight(n: int, subset_len: int) -> float

Correction coming from Monte Carlo integration so that the mean of the marginals converges to the value: the uniform distribution over the powerset of a set with n-1 elements has mass 1/2^{n-1} over each subset.

Source code in src/pydvl/valuation/samplers/powerset.py
def log_weight(self, n: int, subset_len: int) -> float:
    """Correction coming from Monte Carlo integration so that the mean of
    the marginals converges to the value: the uniform distribution over the
    powerset of a set with n-1 elements has mass 1/2^{n-1} over each subset."""
    m = self._index_iterator_cls.complement_size(n)
    return float(-m * np.log(2))

result_updater

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns a callable that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER DESCRIPTION
result

The result to update

TYPE: ValuationResult

RETURNS DESCRIPTION
ResultUpdater[ValueUpdateT]

A callable object that updates the result with a value update.

Source code in src/pydvl/valuation/samplers/base.py
def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns a callable that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

index_iterator

index_iterator(indices: IndexSetT) -> Generator[IndexT | None, None, None]

Iterates over indices with the method specified at construction.

Source code in src/pydvl/valuation/samplers/powerset.py
def index_iterator(
    self, indices: IndexSetT
) -> Generator[IndexT | None, None, None]:
    """Iterates over indices with the method specified at construction."""
    try:
        self._index_iterator = self._index_iterator_cls(indices, seed=self._rng)  # type: ignore
    except (AttributeError, TypeError):
        self._index_iterator = self._index_iterator_cls(indices)
    for idx in self._index_iterator:
        if idx not in self.skip_indices:
            yield idx

AntitheticSampler

AntitheticSampler(*args, seed: Seed | None = None, **kwargs)

Bases: StochasticSamplerMixin, PowersetSampler

A sampler that draws samples uniformly and their complements.

Works as UniformSampler, but for every tuple \((i,S)\), it subsequently returns \((i,S^c)\), where \(S^c\) is the complement of the set \(S\) in the set of indices, excluding \(i\).

Source code in src/pydvl/valuation/samplers/utils.py
def __init__(self, *args, seed: Seed | None = None, **kwargs):
    super().__init__(*args, **kwargs)
    self._rng = np.random.default_rng(seed)
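
The antithetic scheme described for AntitheticSampler, where each \((i, S)\) is followed by \((i, S^c)\), can be sketched as below. This is a plain-Python illustration with a hypothetical function antithetic_samples; the subset is drawn by independent coin flips, which is an assumption about how the uniform draw is made.

```python
import random
from typing import Iterator, Sequence, Tuple


def antithetic_samples(
    indices: Sequence[int], n_rounds: int, seed: int = 0
) -> Iterator[Tuple[int, Tuple[int, ...]]]:
    """Yield (i, S) followed by (i, S^c), with S^c taken within indices - {i}."""
    rng = random.Random(seed)
    for _ in range(n_rounds):
        for i in indices:
            complement = [j for j in indices if j != i]
            subset = tuple(j for j in complement if rng.random() < 0.5)
            anti = tuple(j for j in complement if j not in subset)
            yield i, subset
            yield i, anti


all_indices = [0, 1, 2, 3]
pairs = list(antithetic_samples(all_indices, n_rounds=1))
for (i, s), (_, s_c) in zip(pairs[::2], pairs[1::2]):
    print(i, s, s_c)
```

Each pair partitions the complement of \(i\), so the sizes of \(S\) and \(S^c\) always add up to n-1.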

skip_indices property writable

skip_indices

Set of indices to skip in the outer loop.

interrupt

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py
def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

__len__

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES DESCRIPTION
`TypeError`

if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py
def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

generate_batches

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py
def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break

log_weight

log_weight(n: int, subset_len: int) -> float

Correction coming from Monte Carlo integration so that the mean of the marginals converges to the value: the uniform distribution over the powerset of a set with n-1 elements has mass 1/2^{n-1} over each subset.

Source code in src/pydvl/valuation/samplers/powerset.py
def log_weight(self, n: int, subset_len: int) -> float:
    """Correction coming from Monte Carlo integration so that the mean of
    the marginals converges to the value: the uniform distribution over the
    powerset of a set with n-1 elements has mass 1/2^{n-1} over each subset."""
    m = self._index_iterator_cls.complement_size(n)
    return float(-m * np.log(2))

result_updater

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns a callable that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER DESCRIPTION
result

The result to update

TYPE: ValuationResult

RETURNS DESCRIPTION
ResultUpdater[ValueUpdateT]

A callable object that updates the result with a value update.

Source code in src/pydvl/valuation/samplers/base.py
def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns a callable that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

index_iterator

index_iterator(indices: IndexSetT) -> Generator[IndexT | None, None, None]

Iterates over indices with the method specified at construction.

Source code in src/pydvl/valuation/samplers/powerset.py
def index_iterator(
    self, indices: IndexSetT
) -> Generator[IndexT | None, None, None]:
    """Iterates over indices with the method specified at construction."""
    try:
        self._index_iterator = self._index_iterator_cls(indices, seed=self._rng)  # type: ignore
    except (AttributeError, TypeError):
        self._index_iterator = self._index_iterator_cls(indices)
    for idx in self._index_iterator:
        if idx not in self.skip_indices:
            yield idx

SampleSizeStrategy

SampleSizeStrategy(n_samples: int)

Bases: ABC

An object to compute the number of samples to take for a given set size. Based on Wu et al. (2023) [1], Theorem 4.2.

To be used with StratifiedSampler.

Sets the number of subsets to sample at size \(k\) to

\[m(k) = m \frac{f(k)}{\sum_{j=0}^{n} f(j)},\]

for some choice of \(f,\) where \(m\) is the number of samples n_samples and \(n\) is the total number of indices. Implementations of this base class must override the method fun(), which receives both the total number of indices \(n\) and the size \(k\) as arguments.

PARAMETER DESCRIPTION
n_samples

Number of samples for the stratified sampler to generate, per index. If the sampler uses NoIndexIteration, then this will coincide with the total number of samples.

TYPE: int

Source code in src/pydvl/valuation/samplers/stratified.py
def __init__(self, n_samples: int):
    """Construct a heuristic for the given number of samples.

    Args:
        n_samples: Number of samples for the stratified sampler to generate,
            **per index**. If the sampler uses
            [NoIndexIteration][pydvl.valuation.samplers.NoIndexIteration], then this
            will coincide with the total number of samples.
    """
    self.n_samples = n_samples
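
The allocation \(m(k) = m f(k) / \sum_j f(j)\) can be computed directly. The sketch below is standalone and does not use pyDVL; fractional_sample_sizes and the two example choices of \(f\) (constant and harmonic) are hypothetical names chosen for illustration, with the argument order of fun (n_indices first, then subset_len).

```python
from typing import Callable, List


def fractional_sample_sizes(
    n_samples: int, n: int, f: Callable[[int, int], float]
) -> List[float]:
    """m(k) = m * f(k) / sum_j f(j) for k = 0, ..., n, without rounding."""
    weights = [f(n, k) for k in range(n + 1)]
    total = sum(weights)
    return [n_samples * w / total for w in weights]


def constant(n: int, k: int) -> float:
    """Hypothetical choice of f: the same weight for every subset size."""
    return 1.0


def harmonic(n: int, k: int) -> float:
    """Hypothetical choice of f: favour small subsets."""
    return 1.0 / (1 + k)


uniform_sizes = fractional_sample_sizes(100, 4, constant)
harmonic_sizes = fractional_sample_sizes(100, 4, harmonic)
print(uniform_sizes)
print(harmonic_sizes)
```

The results are in general not integers, which is why a separate quantization step is needed before sampling.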

fun abstractmethod

fun(n_indices: int, subset_len: int) -> float

The function \(f\) to use in the heuristic.

PARAMETER DESCRIPTION
n_indices

Size of the index set.

TYPE: int

subset_len

Size of the subset.

TYPE: int

Source code in src/pydvl/valuation/samplers/stratified.py
@abstractmethod
def fun(self, n_indices: int, subset_len: int) -> float:
    """The function $f$ to use in the heuristic.
    Args:
        n_indices: Size of the index set.
        subset_len: Size of the subset.
    """
    ...

sample_sizes cached

sample_sizes(
    n_indices: int, quantize: bool = True
) -> NDArray[int_] | NDArray[float_]

Precomputes the number of samples to take for each set size, from 0 up to n_indices inclusive.

This method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.

Note

A naive implementation with e.g.

m_k = [max(1, int(round(m * f(k)/sum(f(j) for j in range(n)), 0)))
        for k in range(n)]
would not respect the total number of samples, and would not distribute remainders correctly.

PARAMETER DESCRIPTION
n_indices

number of indices in the index set from which to sample. This is typically len(dataset) - 1 with the usual index iterations.

TYPE: int

quantize

Whether to perform the remainder distribution. If False, the raw floating point values are returned. Useful e.g. for RandomSizeIteration where one needs frequencies. In this case n_samples can be 1.

TYPE: bool DEFAULT: True

Returns: The exact (integer) number of samples to take for each set size, if quantize is True. Otherwise, the fractional number of samples.
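
To see why the remainder distribution matters, the sketch below compares naive rounding with the floor-and-redistribute scheme described above, on a small harmonic allocation. It is a standalone approximation written for this page, not the library code: the function name `quantize_largest_remainder` and the values of `n` and `m` are assumptions for the example.

```python
import numpy as np


def quantize_largest_remainder(values: np.ndarray, total: int) -> np.ndarray:
    """Floor every entry, then hand the leftover units to the entries with
    the largest fractional parts so that the result sums to `total`."""
    floors = np.floor(values).astype(int)
    remainder = total - int(floors.sum())
    fractional = values - floors
    winners = np.argsort(-fractional)[:remainder]
    floors[winners] += 1
    return floors


n, m = 4, 10                          # set sizes 0..n, m samples in total
f = 1.0 / (1.0 + np.arange(n + 1))    # harmonic f(k) = 1 / (1 + k)
raw = m * f / f.sum()                 # fractional sample counts

naive = np.maximum(1, np.round(raw)).astype(int)
exact = quantize_largest_remainder(raw, m)

print("naive:", naive, "sum =", naive.sum())
print("exact:", exact, "sum =", exact.sum())
```

In this example the naive rounding loses one sample, while the redistributed counts add up to `m` exactly.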

Source code in src/pydvl/valuation/samplers/stratified.py
@lru_cache
def sample_sizes(
    self, n_indices: int, quantize: bool = True
) -> NDArray[np.int_] | NDArray[np.float_]:
    """Precomputes the number of samples to take for each set size, from 0 up to
    `n_indices` inclusive.

    This method corrects rounding errors taking into account the fractional parts
    so that the total number of samples is respected, while allocating remainders
    in a way that follows the relative sizes of the fractional parts.

    ??? Note
        A naive implementation with e.g.
        ```python
        m_k = [max(1, int(round(m * f(k)/sum(f(j) for j in range(n)), 0)))
                for k in range(n)]
        ```
        would not respect the total number of samples, and would not distribute
        remainders correctly.

    Args:
        n_indices: number of indices in the index set from which to sample. This is
            typically `len(dataset) - 1` with the usual index iterations.
        quantize: Whether to perform the remainder distribution. If `False`, the raw
            floating point values are returned. Useful e.g. for
            [RandomSizeIteration][pydvl.valuation.samplers.stratified.RandomSizeIteration]
            where one needs frequencies. In this case `n_samples` can
            be 1.
    Returns:
        The exact (integer) number of samples to take for each set size, if
        `quantize` is `True`. Otherwise, the fractional number of samples.
    """

    # m_k = m * f(k) / sum_j f(j)
    values = np.empty(n_indices + 1, dtype=float)
    s = 0.0

    for k in range(n_indices + 1):
        val = self.fun(n_indices, k)
        values[k] = val
        s += val

    values *= self.n_samples / s
    if not quantize:
        return values

    # Round down and distribute remainder by adjusting the largest fractional parts
    int_values: NDArray[np.int_] = np.floor(values).astype(np.int_)
    remainder = self.n_samples - np.sum(int_values)
    fractional_parts = values - int_values
    fractional_parts_indices = np.argsort(-fractional_parts)[:remainder]
    int_values[fractional_parts_indices] += 1
    return int_values

ConstantSampleSize

ConstantSampleSize(
    n_samples: int, lower_bound: int = 0, upper_bound: int | None = None
)

Bases: SampleSizeStrategy

Use a constant number of samples for each set size between two (optional) bounds. The total number of samples (per index) is respected.

PARAMETER DESCRIPTION
n_samples

Total number of samples to generate per index.

TYPE: int

lower_bound

Lower bound for the set size. If the set size is smaller than this, the probability of sampling is 0.

TYPE: int DEFAULT: 0

upper_bound

Upper bound for the set size. If the set size is larger than this, the probability of sampling is 0. If None, the upper bound is set to the number of indices.

TYPE: int | None DEFAULT: None
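
A minimal sketch of the weighting this strategy implies, assuming that both bounds are inclusive (the documentation above does not state this explicitly). The helper `constant_f` is hypothetical and independent of pyDVL.

```python
from typing import Optional

import numpy as np


def constant_f(
    subset_len: int,
    n_indices: int,
    lower_bound: int = 0,
    upper_bound: Optional[int] = None,
) -> float:
    """f(k) = 1 inside [lower_bound, upper_bound] and 0 outside.

    Assumption: both bounds are inclusive; None means "up to n_indices".
    """
    upper = n_indices if upper_bound is None else upper_bound
    return 1.0 if lower_bound <= subset_len <= upper else 0.0


n, m = 5, 6
f = np.array([constant_f(k, n, lower_bound=1, upper_bound=3) for k in range(n + 1)])
sizes = m * f / f.sum()
print(sizes)
```

Sizes outside the bounds receive no samples, and the budget `m` is split evenly among the remaining sizes.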

Source code in src/pydvl/valuation/samplers/stratified.py
def __init__(
    self,
    n_samples: int,
    lower_bound: int = 0,
    upper_bound: int | None = None,
):
    super().__init__(n_samples)
    self.lower_bound = lower_bound
    self.upper_bound = upper_bound

sample_sizes cached

sample_sizes(
    n_indices: int, quantize: bool = True
) -> NDArray[int_] | NDArray[float_]

Precomputes the number of samples to take for each set size, from 0 up to n_indices inclusive.

This method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.

Note

A naive implementation with e.g.

m_k = [max(1, int(round(m * f(k)/sum(f(j) for j in range(n)), 0)))
        for k in range(n)]
would not respect the total number of samples, and would not distribute remainders correctly.

PARAMETER DESCRIPTION
n_indices

number of indices in the index set from which to sample. This is typically len(dataset) - 1 with the usual index iterations.

TYPE: int

quantize

Whether to perform the remainder distribution. If False, the raw floating point values are returned. Useful e.g. for RandomSizeIteration where one needs frequencies. In this case n_samples can be 1.

TYPE: bool DEFAULT: True

Returns: The exact (integer) number of samples to take for each set size, if quantize is True. Otherwise, the fractional number of samples.

Source code in src/pydvl/valuation/samplers/stratified.py
@lru_cache
def sample_sizes(
    self, n_indices: int, quantize: bool = True
) -> NDArray[np.int_] | NDArray[np.float_]:
    """Precomputes the number of samples to take for each set size, from 0 up to
    `n_indices` inclusive.

    This method corrects rounding errors taking into account the fractional parts
    so that the total number of samples is respected, while allocating remainders
    in a way that follows the relative sizes of the fractional parts.

    ??? Note
        A naive implementation with e.g.
        ```python
        m_k = [max(1, int(round(m * f(k)/sum(f(j) for j in range(n)), 0)))
                for k in range(n)]
        ```
        would not respect the total number of samples, and would not distribute
        remainders correctly.

    Args:
        n_indices: number of indices in the index set from which to sample. This is
            typically `len(dataset) - 1` with the usual index iterations.
        quantize: Whether to perform the remainder distribution. If `False`, the raw
            floating point values are returned. Useful e.g. for
            [RandomSizeIteration][pydvl.valuation.samplers.stratified.RandomSizeIteration]
            where one needs frequencies. In this case `n_samples` can
            be 1.
    Returns:
        The exact (integer) number of samples to take for each set size, if
        `quantize` is `True`. Otherwise, the fractional number of samples.
    """

    # m_k = m * f(k) / sum_j f(j)
    values = np.empty(n_indices + 1, dtype=float)
    s = 0.0

    for k in range(n_indices + 1):
        val = self.fun(n_indices, k)
        values[k] = val
        s += val

    values *= self.n_samples / s
    if not quantize:
        return values

    # Round down and distribute remainder by adjusting the largest fractional parts
    int_values: NDArray[np.int_] = np.floor(values).astype(np.int_)
    remainder = self.n_samples - np.sum(int_values)
    fractional_parts = values - int_values
    fractional_parts_indices = np.argsort(-fractional_parts)[:remainder]
    int_values[fractional_parts_indices] += 1
    return int_values

GroupTestingSampleSize

GroupTestingSampleSize(n_samples: int = 1)

Bases: SampleSizeStrategy

Heuristic choice of samples per set size used for Group Testing.

GroupTestingShapleyValuation uses this strategy for the stratified sampling of samples with which to construct the linear problem it requires.

This heuristic sets the number of sets at size \(k\) to be

\[m_k = m \frac{f(k)}{\sum_{j=0}^{n-1} f(j)},\]

for a total number of samples \(m\) and:

\[ f(k) = \frac{1}{k} + \frac{1}{n-k}, \quad \text{for } k \in \{1, \dots, n-1\}. \]

For GT Shapley, \(m=1\) and \(m_k\) is interpreted as a probability of sampling size \(k.\)
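
The following sketch evaluates this distribution numerically for a small \(n.\) It assumes that \(f\) is zero at the two sizes where the formula is undefined (\(k=0\) and \(k=n\)), which the text above does not spell out. The function name is hypothetical and the code is independent of pyDVL.

```python
import numpy as np


def group_testing_probabilities(n: int) -> np.ndarray:
    """Probabilities over sizes 0..n with f(k) = 1/k + 1/(n-k) for 1 <= k <= n-1.

    Assumption: f(0) = f(n) = 0, since the formula is undefined there.
    Requires n >= 2.
    """
    k = np.arange(1, n)
    f = np.zeros(n + 1)
    f[1:n] = 1.0 / k + 1.0 / (n - k)
    return f / f.sum()


probs = group_testing_probabilities(5)
print(probs)
```

The resulting distribution is symmetric in \(k\) and \(n-k\) and puts most mass on very small and very large sets.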

Source code in src/pydvl/valuation/samplers/stratified.py
def __init__(self, n_samples: int = 1):
    super().__init__(n_samples)

sample_sizes cached

sample_sizes(
    n_indices: int, quantize: bool = True
) -> NDArray[int_] | NDArray[float_]

Precomputes the number of samples to take for each set size, from 0 up to n_indices inclusive.

This method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.

Note

A naive implementation with e.g.

m_k = [max(1, int(round(m * f(k)/sum(f(j) for j in range(n)), 0)))
        for k in range(n)]
would not respect the total number of samples, and would not distribute remainders correctly.

PARAMETER DESCRIPTION
n_indices

number of indices in the index set from which to sample. This is typically len(dataset) - 1 with the usual index iterations.

TYPE: int

quantize

Whether to perform the remainder distribution. If False, the raw floating point values are returned. Useful e.g. for RandomSizeIteration where one needs frequencies. In this case n_samples can be 1.

TYPE: bool DEFAULT: True

Returns: The exact (integer) number of samples to take for each set size, if quantize is True. Otherwise, the fractional number of samples.

Source code in src/pydvl/valuation/samplers/stratified.py
@lru_cache
def sample_sizes(
    self, n_indices: int, quantize: bool = True
) -> NDArray[np.int_] | NDArray[np.float_]:
    """Precomputes the number of samples to take for each set size, from 0 up to
    `n_indices` inclusive.

    This method corrects rounding errors taking into account the fractional parts
    so that the total number of samples is respected, while allocating remainders
    in a way that follows the relative sizes of the fractional parts.

    ??? Note
        A naive implementation with e.g.
        ```python
        m_k = [max(1, int(round(m * f(k)/sum(f(j) for j in range(n)), 0)))
                for k in range(n)]
        ```
        would not respect the total number of samples, and would not distribute
        remainders correctly.

    Args:
        n_indices: number of indices in the index set from which to sample. This is
            typically `len(dataset) - 1` with the usual index iterations.
        quantize: Whether to perform the remainder distribution. If `False`, the raw
            floating point values are returned. Useful e.g. for
            [RandomSizeIteration][pydvl.valuation.samplers.stratified.RandomSizeIteration]
            where one needs frequencies. In this case `n_samples` can
            be 1.
    Returns:
        The exact (integer) number of samples to take for each set size, if
        `quantize` is `True`. Otherwise, the fractional number of samples.
    """

    # m_k = m * f(k) / sum_j f(j)
    values = np.empty(n_indices + 1, dtype=float)
    s = 0.0

    for k in range(n_indices + 1):
        val = self.fun(n_indices, k)
        values[k] = val
        s += val

    values *= self.n_samples / s
    if not quantize:
        return values

    # Round down and distribute remainder by adjusting the largest fractional parts
    int_values: NDArray[np.int_] = np.floor(values).astype(np.int_)
    remainder = self.n_samples - np.sum(int_values)
    fractional_parts = values - int_values
    fractional_parts_indices = np.argsort(-fractional_parts)[:remainder]
    int_values[fractional_parts_indices] += 1
    return int_values

HarmonicSampleSize

HarmonicSampleSize(n_samples: int)

Bases: SampleSizeStrategy

Heuristic choice of samples per set size for VRDS.

Sets the number of sets at size \(k\) to be

\[m_k = m \frac{f(k)}{\sum_{j=0}^{n-1} f(j)},\]

for a total number of samples \(m\) and:

\[f(k) = \frac{1}{1+k}.\]

PARAMETER DESCRIPTION
n_samples

Number of samples for the stratified sampler to generate, per index. If the sampler uses NoIndexIteration, then this will coincide with the total number of samples.

TYPE: int

Source code in src/pydvl/valuation/samplers/stratified.py
def __init__(self, n_samples: int):
    """Construct a heuristic for the given number of samples.

    Args:
        n_samples: Number of samples for the stratified sampler to generate,
            **per index**. If the sampler uses
            [NoIndexIteration][pydvl.valuation.samplers.NoIndexIteration], then this
            will coincide with the total number of samples.
    """
    self.n_samples = n_samples

sample_sizes cached

sample_sizes(
    n_indices: int, quantize: bool = True
) -> NDArray[int_] | NDArray[float_]

Precomputes the number of samples to take for each set size, from 0 up to n_indices inclusive.

This method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.

Note

A naive implementation with e.g.

m_k = [max(1, int(round(m * f(k)/sum(f(j) for j in range(n)), 0)))
        for k in range(n)]
would not respect the total number of samples, and would not distribute remainders correctly.

PARAMETER DESCRIPTION
n_indices

number of indices in the index set from which to sample. This is typically len(dataset) - 1 with the usual index iterations.

TYPE: int

quantize

Whether to perform the remainder distribution. If False, the raw floating point values are returned. Useful e.g. for RandomSizeIteration where one needs frequencies. In this case n_samples can be 1.

TYPE: bool DEFAULT: True

Returns: The exact (integer) number of samples to take for each set size, if quantize is True. Otherwise, the fractional number of samples.

Source code in src/pydvl/valuation/samplers/stratified.py
@lru_cache
def sample_sizes(
    self, n_indices: int, quantize: bool = True
) -> NDArray[np.int_] | NDArray[np.float_]:
    """Precomputes the number of samples to take for each set size, from 0 up to
    `n_indices` inclusive.

    This method corrects rounding errors taking into account the fractional parts
    so that the total number of samples is respected, while allocating remainders
    in a way that follows the relative sizes of the fractional parts.

    ??? Note
        A naive implementation with e.g.
        ```python
        m_k = [max(1, int(round(m * f(k)/sum(f(j) for j in range(n)), 0)))
                for k in range(n)]
        ```
        would not respect the total number of samples, and would not distribute
        remainders correctly.

    Args:
        n_indices: number of indices in the index set from which to sample. This is
            typically `len(dataset) - 1` with the usual index iterations.
        quantize: Whether to perform the remainder distribution. If `False`, the raw
            floating point values are returned. Useful e.g. for
            [RandomSizeIteration][pydvl.valuation.samplers.stratified.RandomSizeIteration]
            where one needs frequencies. In this case `n_samples` can
            be 1.
    Returns:
        The exact (integer) number of samples to take for each set size, if
        `quantize` is `True`. Otherwise, the fractional number of samples.
    """

    # m_k = m * f(k) / sum_j f(j)
    values = np.empty(n_indices + 1, dtype=float)
    s = 0.0

    for k in range(n_indices + 1):
        val = self.fun(n_indices, k)
        values[k] = val
        s += val

    values *= self.n_samples / s
    if not quantize:
        return values

    # Round down and distribute remainder by adjusting the largest fractional parts
    int_values: NDArray[np.int_] = np.floor(values).astype(np.int_)
    remainder = self.n_samples - np.sum(int_values)
    fractional_parts = values - int_values
    fractional_parts_indices = np.argsort(-fractional_parts)[:remainder]
    int_values[fractional_parts_indices] += 1
    return int_values

PowerLawSampleSize

PowerLawSampleSize(n_samples: int, exponent: float)

Bases: SampleSizeStrategy

Heuristic choice of samples per set size for VRDS.

Sets the number of sets at size \(k\) to be

\[m_k = m \frac{f(k)}{\sum_{j=0}^{n-1} f(j)},\]

for a total number of samples \(m\) and:

\[f(k) = (1+k)^a, \]

and some exponent \(a.\) With \(a=-1\) one recovers the HarmonicSampleSize heuristic.

PARAMETER DESCRIPTION
n_samples

Total number of samples to generate per index.

TYPE: int

exponent

The exponent to use. Recommended values are between -1 and -0.5.

TYPE: float
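
As a quick numerical check of how the exponent shapes the allocation, this standalone sketch compares \(f(k) = (1+k)^a\) for two exponents in the recommended range. The helper names are hypothetical and nothing here calls pyDVL.

```python
import numpy as np


def power_law_f(k: np.ndarray, exponent: float) -> np.ndarray:
    """f(k) = (1 + k) ** exponent."""
    return (1.0 + k) ** exponent


def harmonic_f(k: np.ndarray) -> np.ndarray:
    """f(k) = 1 / (1 + k)."""
    return 1.0 / (1.0 + k)


def shares(f_values: np.ndarray) -> np.ndarray:
    """Fraction of the sample budget assigned to each set size."""
    return f_values / f_values.sum()


k = np.arange(6)
steep = shares(power_law_f(k, -1.0))   # coincides with the harmonic weights
flat = shares(power_law_f(k, -0.5))    # spreads samples more evenly

print(steep)
print(flat)
```

An exponent of -1 reproduces the harmonic weights, while values closer to 0 move part of the budget towards larger sets.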

Source code in src/pydvl/valuation/samplers/stratified.py
def __init__(self, n_samples: int, exponent: float):
    super().__init__(n_samples)
    self.exponent = exponent

sample_sizes cached

sample_sizes(
    n_indices: int, quantize: bool = True
) -> NDArray[int_] | NDArray[float_]

Precomputes the number of samples to take for each set size, from 0 up to n_indices inclusive.

This method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.

Note

A naive implementation with e.g.

m_k = [max(1, int(round(m * f(k)/sum(f(j) for j in range(n)), 0)))
        for k in range(n)]
would not respect the total number of samples, and would not distribute remainders correctly.

PARAMETER DESCRIPTION
n_indices

number of indices in the index set from which to sample. This is typically len(dataset) - 1 with the usual index iterations.

TYPE: int

quantize

Whether to perform the remainder distribution. If False, the raw floating point values are returned. Useful e.g. for RandomSizeIteration where one needs frequencies. In this case n_samples can be 1.

TYPE: bool DEFAULT: True

Returns: The exact (integer) number of samples to take for each set size, if quantize is True. Otherwise, the fractional number of samples.

Source code in src/pydvl/valuation/samplers/stratified.py
@lru_cache
def sample_sizes(
    self, n_indices: int, quantize: bool = True
) -> NDArray[np.int_] | NDArray[np.float_]:
    """Precomputes the number of samples to take for each set size, from 0 up to
    `n_indices` inclusive.

    This method corrects rounding errors taking into account the fractional parts
    so that the total number of samples is respected, while allocating remainders
    in a way that follows the relative sizes of the fractional parts.

    ??? Note
        A naive implementation with e.g.
        ```python
        m_k = [max(1, int(round(m * f(k)/sum(f(j) for j in range(n)), 0)))
                for k in range(n)]
        ```
        would not respect the total number of samples, and would not distribute
        remainders correctly.

    Args:
        n_indices: number of indices in the index set from which to sample. This is
            typically `len(dataset) - 1` with the usual index iterations.
        quantize: Whether to perform the remainder distribution. If `False`, the raw
            floating point values are returned. Useful e.g. for
            [RandomSizeIteration][pydvl.valuation.samplers.stratified.RandomSizeIteration]
            where one needs frequencies. In this case `n_samples` can
            be 1.
    Returns:
        The exact (integer) number of samples to take for each set size, if
        `quantize` is `True`. Otherwise, the fractional number of samples.
    """

    # m_k = m * f(k) / sum_j f(j)
    values = np.empty(n_indices + 1, dtype=float)
    s = 0.0

    for k in range(n_indices + 1):
        val = self.fun(n_indices, k)
        values[k] = val
        s += val

    values *= self.n_samples / s
    if not quantize:
        return values

    # Round down and distribute remainder by adjusting the largest fractional parts
    int_values: NDArray[np.int_] = np.floor(values).astype(np.int_)
    remainder = self.n_samples - np.sum(int_values)
    fractional_parts = values - int_values
    fractional_parts_indices = np.argsort(-fractional_parts)[:remainder]
    int_values[fractional_parts_indices] += 1
    return int_values

SampleSizeIteration

SampleSizeIteration(strategy: SampleSizeStrategy, n_indices: int)

Bases: ABC

Given a strategy and the number of indices, yield tuples (k, count) that the sampler loop will use.

PARAMETER DESCRIPTION
strategy

The strategy to use for computing the number of samples to take.

TYPE: SampleSizeStrategy

n_indices

The number of indices in the index set from which samples are taken.

TYPE: int

Source code in src/pydvl/valuation/samplers/stratified.py
def __init__(self, strategy: SampleSizeStrategy, n_indices: int):
    self.strategy = strategy
    self.n_indices = n_indices

DeterministicSizeIteration

DeterministicSizeIteration(strategy: SampleSizeStrategy, n_indices: int)

Bases: SampleSizeIteration

Generates exactly \(m_k\) samples for each set size \(k\) before moving to the next.
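
The order of iteration can be pictured with a small generator that takes precomputed integer counts and emits one (k, count) tuple per size. This is a sketch under assumptions, not the library implementation: the name `deterministic_sizes` is hypothetical, and skipping sizes with zero samples is a choice made here for brevity.

```python
from typing import Iterator, Sequence, Tuple


def deterministic_sizes(sample_sizes: Sequence[int]) -> Iterator[Tuple[int, int]]:
    """Yield (k, m_k) for each set size k in increasing order.

    Assumption: sizes with m_k == 0 are skipped.
    """
    for k, m_k in enumerate(sample_sizes):
        if m_k > 0:
            yield k, m_k


schedule = list(deterministic_sizes([2, 0, 3]))
print(schedule)
```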

Source code in src/pydvl/valuation/samplers/stratified.py
def __init__(self, strategy: SampleSizeStrategy, n_indices: int):
    self.strategy = strategy
    self.n_indices = n_indices

RandomSizeIteration

RandomSizeIteration(
    strategy: SampleSizeStrategy, n_indices: int, seed: Seed | None = None
)

Bases: SampleSizeIteration

Draws a set size \(k\) following the distribution of sizes given by the strategy.
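
A minimal sketch of drawing sizes from a frequency vector with NumPy's `Generator.choice`. The function `random_sizes` and its `n_draws` argument are hypothetical: the sketch stops after a fixed number of draws only to keep the example finite, and makes no claim about how the library bounds the iteration.

```python
from typing import Iterator, Optional, Sequence, Tuple

import numpy as np


def random_sizes(
    frequencies: Sequence[float], n_draws: int, seed: Optional[int] = None
) -> Iterator[Tuple[int, int]]:
    """Yield (k, 1) with k drawn proportionally to frequencies[k]."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(frequencies, dtype=float)
    probs = probs / probs.sum()          # normalize to a probability vector
    for _ in range(n_draws):
        k = int(rng.choice(len(probs), p=probs))
        yield k, 1


draws = list(random_sizes([0.5, 0.0, 0.3, 0.2], n_draws=20, seed=42))
print(draws[:5])
```

Because only relative frequencies matter, the strategy can be queried with `quantize=False`, which is why a total of one sample is enough in this mode.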

Source code in src/pydvl/valuation/samplers/stratified.py
def __init__(
    self, strategy: SampleSizeStrategy, n_indices: int, seed: Seed | None = None
):
    super().__init__(strategy, n_indices)
    self._rng = np.random.default_rng(seed)

RoundRobinIteration

RoundRobinIteration(strategy: SampleSizeStrategy, n_indices: int)

Bases: SampleSizeIteration

Generates one sample for each set size \(k\) before moving to the next.

This continues yielding until every size \(k\) has been emitted exactly \(m_k\) times. For example, if strategy.sample_sizes() == [2, 3, 1], the iteration yields the following sequence of (k, count) tuples: (0,1), (1,1), (2,1), (0,1), (1,1), (1,1)
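
The round-robin order described above can be reproduced with a short generator. This is an illustrative sketch with a hypothetical name (`round_robin_sizes`), working on a plain list of counts rather than on a strategy object.

```python
from typing import Iterator, Sequence, Tuple


def round_robin_sizes(sample_sizes: Sequence[int]) -> Iterator[Tuple[int, int]]:
    """Cycle over set sizes, yielding (k, 1) until size k was emitted m_k times."""
    remaining = list(sample_sizes)
    while any(count > 0 for count in remaining):
        for k, count in enumerate(remaining):
            if count > 0:
                remaining[k] -= 1
                yield k, 1


sequence = list(round_robin_sizes([2, 3, 1]))
print(sequence)
```

Compared with the deterministic mode, interrupting this iteration early leaves the sizes roughly equally represented.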

Source code in src/pydvl/valuation/samplers/stratified.py
def __init__(self, strategy: SampleSizeStrategy, n_indices: int):
    self.strategy = strategy
    self.n_indices = n_indices

StratifiedSampler

StratifiedSampler(
    sample_sizes: SampleSizeStrategy,
    sample_sizes_iteration: Type[
        SampleSizeIteration
    ] = DeterministicSizeIteration,
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
    seed: Seed | None = None,
)

Bases: StochasticSamplerMixin, PowersetSampler

A sampler stratified by coalition size with variable number of samples per set size.

Variance Reduced Stratified Sampler (VRDS)

Stratified sampling was introduced at least as early as Maleki et al. (2014). Wu et al. (2023) introduced heuristics adequate for ML tasks.

Choosing the number of samples per set size

The idea of VRDS is to allow per-set-size configuration of the total number of samples in order to reduce the variance coming from the marginal utility evaluations.

It is known (Wu et al. (2023), Theorem 4.2) that a minimum variance estimator of Shapley values samples a number \(m_k\) of sets of size \(k\) based on the variance of the marginal utility at that set size. However, this quantity is unknown in practice, so the authors propose a simple heuristic. This function (sample_sizes in the arguments) is deterministic, and in particular does not depend on run-time variance estimates, as an adaptive method might do. Section 4 of Wu et al. (2023) shows a good default choice is based on the harmonic function of the set size \(k\) (see HarmonicSampleSize).

PARAMETER DESCRIPTION
sample_sizes

An object which returns the number of samples to take for a given set size. If index_iteration below is finite, then the sampler will generate exactly as many samples of each size as returned by this object. If the iteration is infinite, then the sample_sizes will be used as probabilities of sampling.

TYPE: SampleSizeStrategy

sample_sizes_iteration

How to loop over sample sizes. The main modes are:

* deterministically. For every k generate m_k samples before moving to k+1.
* stochastically. Sample sizes k according to the distribution given by sample_sizes.
* round-robin. Iterate over k, and generate 1 sample each time, until reaching m_k.

More modes can be created by subclassing SampleSizeIteration.

TYPE: Type[SampleSizeIteration] DEFAULT: DeterministicSizeIteration

batch_size

The number of samples to generate per batch. Batches are processed together by each subprocess when working in parallel.

TYPE: int DEFAULT: 1

index_iteration

The strategy to use for iterating over indices to update. Note that any iteration that does not return each index exactly once will break the weight computation.

TYPE: Type[IndexIteration] DEFAULT: FiniteSequentialIndexIteration

seed

The seed for the random number generator.

TYPE: Seed | None DEFAULT: None

New in version 0.10.0

Source code in src/pydvl/valuation/samplers/stratified.py
def __init__(
    self,
    sample_sizes: SampleSizeStrategy,
    sample_sizes_iteration: Type[SampleSizeIteration] = DeterministicSizeIteration,
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
    seed: Seed | None = None,
):
    super().__init__(
        batch_size=batch_size, index_iteration=index_iteration, seed=seed
    )
    self.sample_sizes_strategy = sample_sizes
    self.sample_sizes_iteration = sample_sizes_iteration

skip_indices property writable

skip_indices

Set of indices to skip in the outer loop.

interrupt

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py
def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

__len__

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES DESCRIPTION
`TypeError`

if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py
def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

generate_batches

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py
def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break
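
The interplay between batching and `interrupt()` can be tried out in isolation. The sketch below is a simplified stand-in, not the pyDVL class: `TinyBatcher` is hypothetical, and `chunked` is reimplemented here with `itertools.islice` so that the example has no third-party dependencies.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(iterable: Iterable[int], size: int) -> Iterator[List[int]]:
    """Split an iterable into lists of at most `size` items."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


class TinyBatcher:
    """Hypothetical stand-in that mimics the batching and interrupt logic."""

    def __init__(self, batch_size: int):
        self.batch_size = batch_size
        self._interrupted = False

    def interrupt(self) -> None:
        self._interrupted = True

    def generate_batches(self, samples: Iterable[int]) -> Iterator[List[int]]:
        self._interrupted = False
        for batch in chunked(samples, self.batch_size):
            yield batch
            # The flag is only checked after the consumer resumes the generator,
            # so the batch being processed is always delivered in full.
            if self._interrupted:
                break


batcher = TinyBatcher(batch_size=2)
seen = []
for batch in batcher.generate_batches(range(7)):
    seen.append(batch)
    if len(seen) == 2:
        batcher.interrupt()
print(seen)
```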

result_updater

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns a callable that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER DESCRIPTION
result

The result to update

TYPE: ValuationResult

Returns: A callable object that updates the result with a value update

Source code in src/pydvl/valuation/samplers/base.py
def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns a callable that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

index_iterator

index_iterator(indices: IndexSetT) -> Generator[IndexT | None, None, None]

Iterates over indices with the method specified at construction.

Source code in src/pydvl/valuation/samplers/powerset.py
def index_iterator(
    self, indices: IndexSetT
) -> Generator[IndexT | None, None, None]:
    """Iterates over indices with the method specified at construction."""
    try:
        self._index_iterator = self._index_iterator_cls(indices, seed=self._rng)  # type: ignore
    except (AttributeError, TypeError):
        self._index_iterator = self._index_iterator_cls(indices)
    for idx in self._index_iterator:
        if idx not in self.skip_indices:
            yield idx

log_weight

log_weight(n: int, subset_len: int) -> float

The probability of sampling a set of size k is 1/(n choose k) times the probability of choosing size k, which is the number of samples for that size divided by the total number of samples for all sizes:

\[P(S) = \binom{n}{k}^{-1} \ \frac{m_k}{m},\]

where \(m_k\) is the number of samples of size \(k\) and \(m\) is the total number of samples.

PARAMETER DESCRIPTION
n

Size of the index set.

TYPE: int

subset_len

Size of the subset.

TYPE: int

Returns: The logarithm of the probability of having sampled a set of size subset_len.
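
The formula for \(P(S)\) can be checked numerically with a few lines. This sketch is independent of pyDVL and covers only the two factors in the formula above: it omits the additional index-iteration term that the library adds in its implementation. The names `log_comb` and `log_probability` are hypothetical.

```python
import math
from typing import Sequence


def log_comb(n: int, k: int) -> float:
    """Logarithm of the binomial coefficient (n choose k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_probability(n: int, subset_len: int, f_values: Sequence[float]) -> float:
    """log P(S) = -log C(n, k) + log f(k) - log sum_j f(j), with k = subset_len."""
    total = sum(f_values)
    return -log_comb(n, subset_len) + math.log(f_values[subset_len]) - math.log(total)


n = 3
f_values = [1.0, 1.0, 1.0, 1.0]          # constant f over sizes 0..n
p_single = math.exp(log_probability(n, 1, f_values))

# Summing P(S) over all subsets of all sizes should give 1.
p_total = sum(
    math.comb(n, k) * math.exp(log_probability(n, k, f_values))
    for k in range(n + 1)
)
print(p_single, p_total)
```

Since \(m_k / m = f(k) / \sum_j f(j)\), the total number of samples cancels out and only \(f\) is needed.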

Source code in src/pydvl/valuation/samplers/stratified.py
def log_weight(self, n: int, subset_len: int) -> float:
    r"""The probability of sampling a set of size k is 1/(n choose k) times the
    probability of choosing size k, which is the number of samples for that size
    divided by the total number of samples for all sizes:

    $$P(S) = \binom{n}{k}^{-1} \ \frac{m_k}{m},$$

    where $m_k$ is the number of samples of size $k$ and $m$ is the total number
    of samples.

    Args:
        n: Size of the index set.
        subset_len: Size of the subset.
    Returns:
        The logarithm of the probability of having sampled a set of size `subset_len`.
    """

    n = self._index_iterator_cls.complement_size(n)
    # Depending on whether we sample from complements or not, the total number of
    # samples passed to the heuristic has a different interpretation.
    index_iteration_length = self._index_iterator_cls.length(n)  # type: ignore
    if index_iteration_length is None:
        index_iteration_length = 1
    index_iteration_length = max(1, index_iteration_length)

    # Note that we can simplify the quotient
    # $$ \frac{m_k}{m} =
    #    \frac{m \frac{f (k)}{\sum_j f (j)}}{m} = \frac{f(k)}{\sum_j f (j)} $$
    # so that in the weight computation we can use the function $f$ directly from
    # the strategy, or equivalently, call `sample_sizes(n, quantize=False)`.
    # This is useful for the stochastic iteration, where we have frequencies
    # and m is possibly 1, so that quantization would yield a bunch of zeros.
    funs = self.sample_sizes_strategy.sample_sizes(n, quantize=False)
    total = np.sum(funs)

    return float(
        -logcomb(n, subset_len)
        + np.log(index_iteration_length)
        + np.log(funs[subset_len])
        - np.log(total)
    )

TruncationPolicy

TruncationPolicy()

Bases: ABC

A policy for deciding whether to stop computation of a batch of samples.

Statistics are kept on the total number of calls and truncations as n_calls and n_truncations respectively.

ATTRIBUTE DESCRIPTION
n_calls

Number of calls to the policy.

TYPE: int

n_truncations

Number of truncations made by the policy.

TYPE: int

Todo

Because the policy objects are copied to the workers, the statistics are not accessible from the coordinating process. We need to add methods for this.

Source code in src/pydvl/valuation/samplers/truncation.py
def __init__(self) -> None:
    self.n_calls: int = 0
    self.n_truncations: int = 0

reset abstractmethod

reset(utility: UtilityBase)

(Re)set the policy to a state ready for a new permutation.

Source code in src/pydvl/valuation/samplers/truncation.py
@abstractmethod
def reset(self, utility: UtilityBase):
    """(Re)set the policy to a state ready for a new permutation."""
    ...

__call__

__call__(idx: IndexT, score: float, batch_size: int) -> bool

Check whether the computation should be interrupted.

PARAMETER DESCRIPTION
idx

Position in the batch currently being computed.

TYPE: IndexT

score

Last utility computed.

TYPE: float

batch_size

Size of the batch being computed.

TYPE: int

RETURNS DESCRIPTION
bool

True if the computation should be interrupted.

Source code in src/pydvl/valuation/samplers/truncation.py
def __call__(self, idx: IndexT, score: float, batch_size: int) -> bool:
    """Check whether the computation should be interrupted.

    Args:
        idx: Position in the batch currently being computed.
        score: Last utility computed.
        batch_size: Size of the batch being computed.

    Returns:
        `True` if the computation should be interrupted.
    """

    ret = self._check(idx, score, batch_size)
    self.n_calls += 1
    self.n_truncations += 1 if ret else 0
    return ret
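
The bookkeeping in `__call__` can be illustrated with a standalone sketch that mimics the base class instead of importing it. `PolicySketch` and `ScoreThreshold` are hypothetical names, and the assumption that subclasses supply a `_check` method is taken from the call visible in the source above.

```python
from abc import ABC, abstractmethod


class PolicySketch(ABC):
    """Stand-in for the base class, with the same statistics."""

    def __init__(self) -> None:
        self.n_calls = 0
        self.n_truncations = 0

    @abstractmethod
    def _check(self, idx: int, score: float, batch_size: int) -> bool: ...

    @abstractmethod
    def reset(self, utility=None) -> None: ...

    def __call__(self, idx: int, score: float, batch_size: int) -> bool:
        ret = self._check(idx, score, batch_size)
        self.n_calls += 1
        self.n_truncations += 1 if ret else 0
        return ret


class ScoreThreshold(PolicySketch):
    """Hypothetical policy: interrupt once the score reaches a threshold."""

    def __init__(self, threshold: float) -> None:
        super().__init__()
        self.threshold = threshold

    def _check(self, idx: int, score: float, batch_size: int) -> bool:
        return score >= self.threshold

    def reset(self, utility=None) -> None:
        pass  # this policy keeps no per-permutation state


policy = ScoreThreshold(0.9)
scores = [0.2, 0.5, 0.95, 0.97]
# A real caller would stop at the first True; here every score is queried
# only to show how the counters evolve.
decisions = [policy(i, s, len(scores)) for i, s in enumerate(scores)]
```

After the loop, `policy.n_calls` counts all four queries and `policy.n_truncations` counts the two that returned `True`.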

NoTruncation

NoTruncation()

Bases: TruncationPolicy

A policy which never interrupts the computation.

Source code in src/pydvl/valuation/samplers/truncation.py
def __init__(self) -> None:
    self.n_calls: int = 0
    self.n_truncations: int = 0

FixedTruncation

FixedTruncation(fraction: float)

Bases: TruncationPolicy

Break a computation after a fixed number of updates.

The experiments in Appendix B of (Ghorbani and Zou, 2019) show that when the training set size is large enough, one can simply truncate the iteration over permutations after a fixed number of steps. This happens because beyond a certain number of samples in a training set, the model becomes insensitive to new ones. Alas, this strongly depends on the data distribution and the model and there is no automatic way of estimating this number.

PARAMETER DESCRIPTION
fraction

Fraction of updates in a batch to compute before stopping (e.g. 0.5 to compute half of the marginals in a permutation).

TYPE: float

Source code in src/pydvl/valuation/samplers/truncation.py
def __init__(self, fraction: float):
    super().__init__()
    if fraction <= 0 or fraction > 1:
        raise ValueError("fraction must be in (0, 1]")
    self.fraction = fraction
    self.count = 0  # within-permutation count
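
How `fraction` and the within-permutation `count` might interact is sketched below. The check shown (`count >= fraction * batch_size`) is an assumption about the policy's internals, not taken from the source above, and `FixedTruncationSketch` is a hypothetical standalone class.

```python
class FixedTruncationSketch:
    """Hypothetical stand-in: stop after a fixed fraction of a batch."""

    def __init__(self, fraction: float) -> None:
        if fraction <= 0 or fraction > 1:
            raise ValueError("fraction must be in (0, 1]")
        self.fraction = fraction
        self.count = 0  # within-permutation count

    def reset(self) -> None:
        """Prepare for a new permutation."""
        self.count = 0

    def check(self, batch_size: int) -> bool:
        # Assumed rule: truncate once enough updates have been seen.
        self.count += 1
        return self.count >= self.fraction * batch_size


policy = FixedTruncationSketch(0.5)
steps = 0
for _ in range(10):  # a permutation of 10 indices
    steps += 1
    if policy.check(batch_size=10):
        break

policy.reset()  # the counter starts over for the next permutation
```

Under this assumed rule, a fraction of 0.5 on a permutation of 10 indices computes five marginals before stopping.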

RelativeTruncation

RelativeTruncation(rtol: float, burn_in_fraction: float = 0.0)

Bases: TruncationPolicy

Break a computation if the utility is close enough to the total utility.

This is called "performance tolerance" in (Ghorbani and Zou, 2019).

Warning

Initialization and reset() of this policy imply the computation of the total utility for the dataset, which can be expensive!

PARAMETER DESCRIPTION
rtol

Relative tolerance. The permutation is broken if the last computed utility is within this tolerance of the total utility.

TYPE: float

burn_in_fraction

Fraction of samples within a permutation to process before the policy starts checking.

TYPE: float DEFAULT: 0.0

Source code in src/pydvl/valuation/samplers/truncation.py
def __init__(self, rtol: float, burn_in_fraction: float = 0.0):
    super().__init__()
    assert 0 <= burn_in_fraction <= 1
    self.burn_in_fraction = burn_in_fraction
    self.rtol = rtol
    self.total_utility = 0.0
    self.count = 0  # within-permutation count
    self._is_setup = False
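
The role of `rtol` and `burn_in_fraction` can be sketched as a single function. The exact comparison the policy performs is not shown above, so the rule below (absolute difference at most `rtol` times the total utility, skipped during burn-in) is an assumption, and `should_truncate` is a hypothetical name.

```python
def should_truncate(
    score: float,
    total_utility: float,
    rtol: float,
    count: int,
    batch_size: int,
    burn_in_fraction: float = 0.0,
) -> bool:
    """Assumed relative-tolerance rule with a burn-in period."""
    # During burn-in the policy never interrupts the permutation.
    if count < burn_in_fraction * batch_size:
        return False
    return abs(score - total_utility) <= rtol * abs(total_utility)


# Total utility 0.9 with a 5% tolerance: scores within 0.045 of it trigger.
close = should_truncate(0.86, 0.9, rtol=0.05, count=6, batch_size=10)
far = should_truncate(0.80, 0.9, rtol=0.05, count=6, batch_size=10)
# With a 50% burn-in, the third sample is not checked at all.
early = should_truncate(
    0.9, 0.9, rtol=0.05, count=3, batch_size=10, burn_in_fraction=0.5
)
```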

DeviationTruncation

DeviationTruncation(sigmas: float, burn_in_fraction: float = 0.0)

Bases: TruncationPolicy

Break a computation if the last computed utility is close to the total utility.

This is essentially the same as RelativeTruncation, but with the tolerance determined by a multiple of the standard deviation of the utilities.

Danger

This policy can break early if the utility function has high variance. This can lead to gross underestimation of values. Use with caution.

Warning

Initialization and reset() of this policy imply the computation of the total utility for the dataset, which can be expensive!

PARAMETER DESCRIPTION
burn_in_fraction

Fraction of samples within a permutation to process before the policy starts checking.

TYPE: float DEFAULT: 0.0

sigmas

Number of standard deviations to use as a threshold.

TYPE: float

Source code in src/pydvl/valuation/samplers/truncation.py
def __init__(self, sigmas: float, burn_in_fraction: float = 0.0):
    super().__init__()
    assert 0 <= burn_in_fraction <= 1

    self.burn_in_fraction = burn_in_fraction
    self.total_utility = 0.0
    self.count = 0  # within-permutation count
    self.variance = 0.0
    self.mean = 0.0
    self.sigmas = sigmas
    self._is_setup = False
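
A minimal sketch of a deviation-based check follows, using Welford's running update for the mean and variance of the utilities seen so far. The stopping rule (`score >= total_utility - sigmas * std`) and the use of the population variance are assumptions, and `DeviationSketch` is a hypothetical standalone class that takes the total utility as an argument instead of computing it.

```python
import math


class DeviationSketch:
    """Hypothetical stand-in for a standard-deviation based policy."""

    def __init__(
        self, sigmas: float, total_utility: float, burn_in_fraction: float = 0.0
    ) -> None:
        self.sigmas = sigmas
        self.total_utility = total_utility
        self.burn_in_fraction = burn_in_fraction
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford)

    def check(self, score: float, batch_size: int) -> bool:
        # Update the running moments with the new utility.
        self.count += 1
        delta = score - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (score - self.mean)
        if self.count < self.burn_in_fraction * batch_size:
            return False
        std = math.sqrt(self.m2 / self.count)
        # Assumed rule: stop when within `sigmas` deviations of the total.
        return score >= self.total_utility - self.sigmas * std


policy = DeviationSketch(sigmas=1.0, total_utility=1.0)
results = [policy.check(s, batch_size=4) for s in [0.1, 0.5, 0.9]]
```

In this run the third utility (0.9) falls within one standard deviation (about 0.33) of the total, so the check fires. The same mechanism explains the danger noted above: widely scattered utilities inflate the standard deviation and widen the stopping band.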

get_unique_labels

get_unique_labels(array: NDArray) -> NDArray

Returns unique labels in a categorical dataset.

PARAMETER DESCRIPTION
array

The input array to find unique labels from. It should have a categorical dtype: Object, String, Unicode, Unsigned integer, Signed integer, or Boolean.

TYPE: NDArray

RETURNS DESCRIPTION
NDArray

An array of unique labels.

RAISES DESCRIPTION
ValueError

If the input array is not of a categorical type.

Source code in src/pydvl/valuation/samplers/classwise.py
def get_unique_labels(array: NDArray) -> NDArray:
    """Returns unique labels in a categorical dataset.

    Args:
        array: The input array to find unique labels from. It should be of
               categorical types such as Object, String, Unicode, Unsigned
               integer, Signed integer, or Boolean.

    Returns:
        An array of unique labels.

    Raises:
        ValueError: If the input array is not of a categorical type.
    """
    # Object, String, Unicode, Unsigned integer, Signed integer, boolean
    if array.dtype.kind in "OSUiub":
        return cast(NDArray, np.unique(array))
    raise ValueError(
        f"Input array has an unsupported data type for categorical labels: {array.dtype}. "
        "Expected types: Object, String, Unicode, Unsigned integer, Signed integer, or Boolean."
    )
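
The dtype check relies on NumPy's one-character `dtype.kind` codes. A short sketch of the behaviour on a few inputs, using a local copy of the logic under the hypothetical name `unique_labels_sketch` so that it runs without pyDVL installed:

```python
import numpy as np


def unique_labels_sketch(array: np.ndarray) -> np.ndarray:
    """Local copy of the check above, for illustration only."""
    # O: object, S: byte string, U: unicode, i: signed int,
    # u: unsigned int, b: boolean
    if array.dtype.kind in "OSUiub":
        return np.unique(array)
    raise ValueError(f"unsupported dtype for categorical labels: {array.dtype}")


# String and integer labels are accepted and returned sorted and de-duplicated.
labels = unique_labels_sketch(np.array(["cat", "dog", "cat"]))
ints = unique_labels_sketch(np.array([2, 0, 2, 1]))

# Floating-point arrays have kind "f" and are rejected.
try:
    unique_labels_sketch(np.array([0.5, 1.5]))
    float_rejected = False
except ValueError:
    float_rejected = True
```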