pydvl.valuation.samplers.stratified ¶

This module implements stratified samplers.

Stratified samplers change the subset sampling distribution to be a function of set size with the goal of reducing the variance of the Monte Carlo estimate of the marginal utility. They key assumption / heuristic is that the utility's variance is a function of the training set size.

Stratified sampling was introduced at least as early as Maleki et al. (2014)³. Later on, Wu et al. 2023², extended these heuristics and proposed some for ML tasks, which they called VRDS (see below). Watson et al. (2023)¹ used error estimates for certain model classes to propose a different heuristic. See below and \(\delta\)-Shapley.

All stratified samplers in pyDVL are implemented by configuring (or subclassing) the classes StratifiedSampler and StratifiedPermutationSampler.

In the simplest case, StratifiedSampler employs some strategy with a fixed number of samples \(m_k\) for each set size \(k \in [0, n],\) where \(n\) is the total number of indices in the index set \(N.\) It iterates through all indices exactly (e.g. exactly once, if using FiniteSequentialIndexIteration) and for each index \(i \in N\), iterates through all set sizes \(k\), then samples exactly \(m_k\) subsets \(S \subset N_{-i}\) of size \(k\). The iteration over set sizes is configured with SampleSizeIteration.

Choosing set size heuristics¶

Optimal sampling (leading to minimal variance estimators) involves a dynamic choice of the number \(m_k\) of samples at size \(k\) based on the variance of the Monte Carlo integrand, but Wu et al. (2023)² show that there exist methods applicable to semi-values which precompute these sizes while still providing reasonable performance.

The number of sets of size \(k\)

Recall that uniform sampling from the powerset \(2^{N_{-i}}\) produces a binomial distribution of set sizes: the number of sets of size \(k\) is \(m_k = \binom{n-1}{k},\) which is the (inverse of the) Shapley coefficient. Therefore, setting for instance \(m_k = C\) for some constant will drastically reduce the number of sets of size \(\sim n/2\) while increasing the number of sets of size 1 or \(n-1.\) This will then have stark implications on the Monte Carlo estimate of semi-values, depending on how the marginal utility (i.e. the training of the model) is affected by the size of the training set.

This heuristic is configured with the argument sample_size of StratifiedSampler, which is an instance of SampleSizeStrategy.

Variance Reduced Stratified Sampler (VRDS)¶

It is known (Wu et al. (2023), Theorem 4.2)² that a minimum variance estimator of Shapley values samples \(m_k\) sets of size \(k\) based on the variance of the marginal utility at that set size. However, this quantity is unknown in practice, so the authors propose a simple deterministic heuristic, which in particular does not depend on run-time variance estimates, as an adaptive method might do. Section 4 of Wu et al. (2023)² shows a good default choice is based on the harmonic function of the set size \(k\).

This sampler is available through VRDSSampler.

Constructing a VRDS

It is possible to "manually" replicate VRDSSampler with:

n_samples_per_index = 1000  # Total number of samples is: n_indices times this
sampler = StratifiedSampler(
    sample_sizes=HarmonicSampleSize(n_samples=1000),
    sample_sizes_iteration=FiniteSequentialSizeIteration,
    index_iteration=FiniteSequentialIndexIteration,
    )

Iterating over indices and its effect on 'n_samples'¶

As any other sampler, StratifiedSampler can iterate over indices finitely or infinitely many times. It can also use NoIndexIteration to sample from the whole powerset. This is configured with the parameter index_iteration.

In the case of finite iterations, the sampler must distribute a finite total number of samples among all indices. This is done by the SampleSizeStrategy, which therefore requires an argument n_samples to be set to the number of samples per index.

Warning

On the other hand, if the sampler iterates over the indices indefinitely, n_indices can be set, but only relative frequencies will matter. As we see next, there is another component that will affect how the sampler behaves.

Iterating over set sizes¶

Additionally, StratifiedSampler must iterate over sample sizes \(k \in [0, n]\), and this can be done in multiple ways, configured via subclasses of SampleSizeIteration.

FiniteSequentialSizeIteration will generate exactly \(m_k\) samples for each \(k\) before moving to the next \(k.\) This implies that n_samples must be large enough for the computed \(m_k\) to be valid. Alternatively, and preferably, some strategies allow n_samples = None to signal them to compute the total number of samples.
RandomSizeIteration will sample a set size \(k\) according to the distribution of sizes given by the strategy. When using this in conjunction with an infinite index iteration for the sampler, n_samples can be safely left to its default None since \(m_k\) will be interpreted as a probability.
RoundRobinSizeIteration will iterate over set sizes \(k\) and generate one sample each time, until reaching \(m_k.\)

Other known strategies¶

All components described above can be mixed in most ways, but some specific configurations besides VRDS appear in the literature as follows:

Sample sizes given by stability bounds related to the algorithm, to ensure good approximation of per-set-size marginal utilities. This sampling method was introduced by Watson et al. (2023)¹ for the computation of Shapley values as \(\delta\)-Shapley.

DeltaShapleyNSGDSampleSize implements the choice of \(m_k\) corresponding to non-convex losses minimized with SGD. Alas, it requires many parameters to be set which are hard to estimate in practice. An alternative is to use PowerLawSampleSize (see below) with an exponent of -2, which is the order of \(m_k\) in \(\delta\)-Shapley.

??? Example "Constructing an alternative sampler for DeltaShapley"
    ```python
    config = DeltaShapleyNCSGDConfig(...)  # See the docs / paper
    sampler = StratifiedSampler(
        sample_sizes=DeltaShapleyNSGDSampleSize(config, lower_bound, upper_bound),
        sample_sizes_iteration=FiniteSequentialSizeIteration,
        index_iteration=SequentialIndexIteration,
        )
    ```
!!! Note "Other bounds"
    Implementing the other bounds from the paper is just a matter of subclassing
    [SampleSizeStrategy][pydvl.valuation.samplers.stratified.SampleSizeStrategy] and
    implementing the  `fun` method, as done in
    [DeltaShapleyNCSGDSampleSize][pydvl.valuation.samplers.stratified.DeltaShapleyNCSGDSampleSize].

Sample sizes decreasing with a power law. Use PowerLawSampleSize for the strategy. This was also proposed in Wu et al. (2023)². Empirically they found an exponent between -1 and -0.5 to perform well.
Power law heuristic
```
sampler = StratifiedSampler(
      sample_sizes=PowerLawSampleSize(exponent=-0.5),
      sample_sizes_iteration=RandomSizeIteration,
      index_iteration=RandomIndexIteration,
      )
```
Group Testing Sample Size. This heuristic is used for the stratified sampling required by GroupTestingShapleyValuation.

References¶

Watson, Lauren, Zeno Kujawa, Rayna Andreeva, Hao-Tsung Yang, Tariq Elahi, and Rik Sarkar. Accelerated Shapley Value Approximation for Data Evaluation. arXiv, 9 November 2023. ↩↩
Wu, Mengmeng, Ruoxi Jia, Changle Lin, Wei Huang, and Xiangyu Chang. Variance Reduced Shapley Value Estimation for Trustworthy Data Valuation. Computers & Operations Research 159 (1 November 2023): 106305. ↩↩↩↩↩
Maleki, Sasan, Long Tran-Thanh, Greg Hines, Talal Rahwan, and Alex Rogers. Bounding the Estimation Error of Sampling-Based Shapley Value Approximation. arXiv:1306.4265 [Cs], 12 February 2014. ↩

ConstantSampleSize ¶

ConstantSampleSize(
    n_samples: int | None = None,
    lower_bound: int | None = None,
    upper_bound: int | None = None,
)

Bases: SampleSizeStrategy

Use a constant number of samples for each set size.

PARAMETER	DESCRIPTION
`n_samples`	Number of samples for the stratified sampler to generate. It can be `None`, 0 or a positive integer. See the documentation of SampleSizeStrategy for details. TYPE: `int \| None` DEFAULT: `None`
`lower_bound`	Lower bound for the set sizes. TYPE: `int \| None` DEFAULT: `None`
`upper_bound`	Upper bound for the set sizes. TYPE: `int \| None` DEFAULT: `None`

Source code in src/pydvl/valuation/samplers/stratified.py

def __init__(
    self,
    n_samples: int | None = None,
    lower_bound: int | None = None,
    upper_bound: int | None = None,
):
    if n_samples is not None:
        n_samples = validate_number("n_samples", n_samples, int, lower=0)
    if lower_bound is not None:
        lower_bound = validate_number("lower_bound", lower_bound, int, lower=0)
    if upper_bound is not None:
        upper_bound = validate_number(
            "upper_bound", upper_bound, int, lower=lower_bound
        )
    self._n_samples_per_index = n_samples
    self.lower_bound = lower_bound
    self.upper_bound = upper_bound

effective_bounds ¶

effective_bounds(n: int) -> tuple[int, int]

Returns the effective bounds for the sample sizes, given the number of indices n from which sets are sampled.

Note

The number of indices n will typically be complement_size(len(train)), i.e. what we sometimes denote as effective_n.

PARAMETER	DESCRIPTION
`n`	The number of indices from which subsets are drawn. TYPE: `int`

Returns: A tuple of [lower, upper] bounds for sample sizes (inclusive).

Source code in src/pydvl/valuation/samplers/stratified.py

def effective_bounds(self, n: int) -> tuple[int, int]:
    """Returns the effective bounds for the sample sizes, given the number of
    indices `n` from which sets are sampled.

    !!! note
        The number of indices `n` will typically be `complement_size(len(train))`,
        i.e. what we sometimes denote as `effective_n`.

    Args:
        n: The number of indices from which subsets are drawn.
    Returns:
        A tuple of [lower, upper] bounds for sample sizes (inclusive).
    """
    lower = 0 if self.lower_bound is None else self.lower_bound
    upper = n if self.upper_bound is None else self.upper_bound
    lower = min(lower, n)
    upper = min(upper, n)

    return lower, upper

n_samples_per_index ¶

n_samples_per_index(n_indices: int) -> int | None

Returns the total number of samples to take for the given number of indices, or None if no limit was set and samples are generated with probabilities.

Source code in src/pydvl/valuation/samplers/stratified.py

def n_samples_per_index(self, n_indices: int) -> int | None:
    """Returns the total number of samples to take for the given number of indices,
    or `None` if no limit was set and samples are generated with probabilities."""
    if self._n_samples_per_index is None:
        return None
    sizes = self.sample_sizes(n_indices, probs=False)
    return int(np.sum(sizes).item())

sample_sizes `cached` ¶

sample_sizes(
    n_indices: int, probs: bool = True
) -> NDArray[int64] | NDArray[float64]

Precomputes the number of samples to take for each set size, from 0 up to n_indices inclusive.

If probs is True, the result is a vector of floats, where each element is the probability of sampling a set of size \(k.\) This is useful e.g. for RandomSizeIteration where one needs frequencies. In this case self.n_samples_per_index can be None.

If probs is False, the result is a vector of integers, where each element \(k\) is the number of samples to take for set size \(k.\) The sum of all elements will depend on the value of n_samples upon construction. It will be

equal to n_samples if n_samples > 0,
the sum of the values of fun for all set sizes if n_samples == 0,
the sum after a normalization such that the smallest non-zero value is 1, if n_samples == None.

This somewhat complex behavior is necessary to ensure that the total number of samples is respected when provided either globally upon construction, or implicitly by the sum of the values of fun when the strategy computes absolute numbers, but also when the strategy has a fun that returns frequencies. The special handling of the case n_samples == None is used implicitly by StratifiedPermutationSampler (yeah, it's not pretty).

When probs is False, this method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.

PARAMETER	DESCRIPTION
`n_indices`	number of indices in the index set from which to sample. This is typically `len(dataset) - 1` with the usual index iterations. TYPE: `int`
`probs`	Whether to perform the remainder distribution. If `True`, sampling probabilities are returned. If `False`, then `n_samples_per_index` is used to compute the actual number of samples and the values are rounded down to the nearest integer, and the remainder is distributed to maintain the relative frequencies. TYPE: `bool` DEFAULT: `True`

Returns: The exact (integer) number of samples to take for each set size, if probs is False. Otherwise, the fractional number of samples.

Source code in src/pydvl/valuation/samplers/stratified.py

@cache
def sample_sizes(
    self, n_indices: int, probs: bool = True
) -> NDArray[np.int64] | NDArray[np.float64]:
    """Precomputes the number of samples to take for each set size, from 0 up to
    `n_indices` inclusive.

    If `probs` is `True`, the result is a vector of floats, where each element
    is the probability of sampling a set of size $k.$ This is useful e.g. for
    [RandomSizeIteration][pydvl.valuation.samplers.stratified.RandomSizeIteration]
    where one needs frequencies. In this case `self.n_samples_per_index` can be
    `None`.

    If `probs` is `False`, the result is a vector of integers, where each element
    $k$ is the number of samples to take for set size $k.$ The sum of all elements
    will depend on the value of `n_samples` upon construction. It will be

    * equal to `n_samples` if `n_samples > 0`,
    * the sum of the values of `fun` for all set sizes if `n_samples == 0`,
    * the sum after a normalization such that the smallest non-zero value is 1, if
      `n_samples == None`.

    This somewhat complex behavior is necessary to ensure that the total number of
    samples is respected when provided either globally upon construction, or
    implicitly by the sum of the values of `fun` when the strategy computes absolute
    numbers, but also when the strategy has a `fun` that returns frequencies. The
    special handling of the case `n_samples == None` is used implicitly by
    [StratifiedPermutationSampler][pydvl.valuation.samplers.stratified.StratifiedPermutationSampler]
    (yeah, it's not pretty).

    When `probs` is `False`, this method corrects rounding errors taking into
    account the fractional parts so that the total number of samples is respected,
    while allocating remainders in a way that follows the relative sizes of the
    fractional parts.

    Args:
        n_indices: number of indices in the index set from which to sample. This is
            typically `len(dataset) - 1` with the usual index iterations.
        probs: Whether to perform the remainder distribution. If `True`, sampling
            probabilities are returned. If `False`, then `n_samples_per_index` is
            used to compute the actual number of samples and the values are rounded
            down to the nearest integer, and the remainder is distributed to
            maintain the relative frequencies.
    Returns:
        The exact (integer) number of samples to take for each set size, if
        `probs` is `False`. Otherwise, the fractional number of samples.
    """

    # m_k = m * f(k) / sum_j f(j)
    values = np.zeros(n_indices + 1, dtype=float)
    s = 0.0
    lb, ub = self.effective_bounds(n_indices)

    for k in range(lb, ub + 1):
        val = self.fun(n_indices, k)
        values[k] = val
        s += val

    assert n_indices == 0 or s > 0, "Sum of sample sizes must be positive"
    values /= s

    if probs:
        values.setflags(write=False)  # avoid accidental modification of cache
        return values  # m_k / m

    if self._n_samples_per_index is None:
        normalization = min(values[values > 0])
        n_samples = np.sum(values / normalization).astype(int)  # min m_k = 1
    elif self._n_samples_per_index == 0:
        n_samples = np.ceil(s).astype(int)  # sum m_k = sum f_k
    else:
        n_samples = self._n_samples_per_index  # sum m_k = n_samples

    values *= n_samples

    # Round down and distribute remainder by adjusting the largest fractional parts
    # A naive implementation with e.g.
    #
    # m_k = [max(1, int(round(m * f(k)/sum(f(j) for j in range(n)), 0)))
    #         for k in range(n)]
    #
    # would not respect the total number of samples, and would not distribute
    # remainders correctly
    int_values: NDArray[np.int64] = np.floor(values).astype(np.int64)
    remainder = n_samples - np.sum(int_values)
    fractional_parts = values - int_values
    fractional_parts_indices = np.argsort(-fractional_parts, kind="stable")[
        :remainder
    ]
    int_values[fractional_parts_indices] += 1
    int_values.setflags(write=False)  # avoid accidental modification of cache
    return int_values

DeltaShapleyNCSGDConfig `dataclass` ¶

DeltaShapleyNCSGDConfig(
    max_loss: float,
    lipschitz_loss: float,
    lipschitz_grad: float,
    lr_factor: float,
    n_sgd_iter: int,
    n_val: int,
    n_train: int,
    eps: float = 0.01,
    delta: float = 0.05,
    version: Literal["theorem7", "code"] = "theorem7",
)

Configuration for Delta-Shapley non-convex SGD sampling.

See Watson et al. (2023)¹ for details. Given that it can be difficult to estimate these constants, an alternative which has a similar decay rate of \(O(1/k^2)\) is to use a PowerLawSampleSize strategy.

PARAMETER	DESCRIPTION
`max_loss`	Maximum of the loss. TYPE: `float`
`lipschitz_loss`	Lipschitz constant of the loss TYPE: `float`
`lipschitz_grad`	Lipschitz constant of the gradient of the loss TYPE: `float`
`lr_factor`	Learning rate factor c, assuming it has the form \(lpha_t = c/t.\) TYPE: `float`
`n_sgd_iter`	Number of SGD iterations. TYPE: `int`
`n_val`	Number of test samples. TYPE: `int`
`n_train`	Number of training samples. TYPE: `int`
`eps`	Epsilon value in the epsilon-delta guarantee, i.e. the distance to the true value. TYPE: `float` DEFAULT: `0.01`
`delta`	Delta value in the epsilon-delta guarantee, i.e. the probability of failure. TYPE: `float` DEFAULT: `0.05`
`version`	Version of the bound to use: either the one from the paper or the one in the code. TYPE: `Literal['theorem7', 'code']` DEFAULT: `'theorem7'`

DeltaShapleyNCSGDSampleSize ¶

DeltaShapleyNCSGDSampleSize(
    config: DeltaShapleyNCSGDConfig,
    lower_bound: int | None = None,
    upper_bound: int | None = None,
)

Bases: SampleSizeStrategy

Heuristic choice of samples per set size for \(\delta\)-Shapley.

This implements the non-convex SGD bound from Watson et al. (2023)¹.

Note

This strategy does not accept an n_samples parameter because it provides the absolute number of samples to take for each set size based on the constants provided in the configuration.

PARAMETER	DESCRIPTION
`config`	Configuration for the sample bound. TYPE: `DeltaShapleyNCSGDConfig`
`lower_bound`	Lower bound for the set sizes. If the set size is smaller than this, the probability of sampling is 0. If `None`, the lower bound is set to 0. TYPE: `int \| None` DEFAULT: `None`
`upper_bound`	Upper bound for the set size. If the set size is larger than this, the probability of sampling is 0. If `None`, the upper bound is set to the number of indices. TYPE: `int \| None` DEFAULT: `None`

Source code in src/pydvl/valuation/samplers/stratified.py

def __init__(
    self,
    config: DeltaShapleyNCSGDConfig,
    lower_bound: int | None = None,
    upper_bound: int | None = None,
):
    super().__init__(n_samples=0, lower_bound=lower_bound, upper_bound=upper_bound)
    self.config = config

effective_bounds ¶

effective_bounds(n: int) -> tuple[int, int]

Returns the effective bounds for the sample sizes, given the number of indices n from which sets are sampled.

Note

The number of indices n will typically be complement_size(len(train)), i.e. what we sometimes denote as effective_n.

PARAMETER	DESCRIPTION
`n`	The number of indices from which subsets are drawn. TYPE: `int`

Returns: A tuple of [lower, upper] bounds for sample sizes (inclusive).

Source code in src/pydvl/valuation/samplers/stratified.py

def effective_bounds(self, n: int) -> tuple[int, int]:
    """Returns the effective bounds for the sample sizes, given the number of
    indices `n` from which sets are sampled.

    !!! note
        The number of indices `n` will typically be `complement_size(len(train))`,
        i.e. what we sometimes denote as `effective_n`.

    Args:
        n: The number of indices from which subsets are drawn.
    Returns:
        A tuple of [lower, upper] bounds for sample sizes (inclusive).
    """
    lower = 0 if self.lower_bound is None else self.lower_bound
    upper = n if self.upper_bound is None else self.upper_bound
    lower = min(lower, n)
    upper = min(upper, n)

    return lower, upper

fun ¶

fun(n_indices: int, subset_size: int) -> int

Computes the number of samples for the non-convex SGD bound.

Source code in src/pydvl/valuation/samplers/stratified.py

def fun(self, n_indices: int, subset_size: int) -> int:
    """Computes the number of samples for the non-convex SGD bound."""
    q = self.config.lipschitz_grad * self.config.lr_factor
    H_1 = self.config.max_loss ** (q / (q + 1))
    H_2 = (2 * self.config.lr_factor * (self.config.lipschitz_loss**2)) ** (
        1 / (q + 1)
    )
    H_3 = self.config.n_sgd_iter ** (q / (q + 1))
    H_4 = (1 + (1 / q)) / (max(subset_size - 1, 1))
    H = H_1 * H_2 * H_3 * H_4

    if self.config.version == "code":
        return int(
            np.ceil(
                2
                * np.log((2 * n_indices) / self.config.delta)
                * (
                    (
                        (H**2 / self.config.n_val)
                        + 2 * self.config.max_loss * self.config.eps / 3
                    )
                    / self.config.eps**2
                )
            )
        )
    elif self.config.version == "theorem7":
        return int(
            np.ceil(
                2
                * np.log((2 * n_indices) / self.config.delta)
                * (
                    (
                        2 * H**2
                        + 2 * self.config.max_loss * H
                        + 4 * self.config.max_loss * self.config.eps / 3
                    )
                    / self.config.eps**2
                )
            )
        )
    else:
        raise ValueError(f"Unknown version: {self.config.version}")

n_samples_per_index ¶

n_samples_per_index(n_indices: int) -> int | None

Returns the total number of samples to take for the given number of indices, or None if no limit was set and samples are generated with probabilities.

Source code in src/pydvl/valuation/samplers/stratified.py

def n_samples_per_index(self, n_indices: int) -> int | None:
    """Returns the total number of samples to take for the given number of indices,
    or `None` if no limit was set and samples are generated with probabilities."""
    if self._n_samples_per_index is None:
        return None
    sizes = self.sample_sizes(n_indices, probs=False)
    return int(np.sum(sizes).item())

sample_sizes `cached` ¶

sample_sizes(
    n_indices: int, probs: bool = True
) -> NDArray[int64] | NDArray[float64]

Precomputes the number of samples to take for each set size, from 0 up to n_indices inclusive.

If probs is True, the result is a vector of floats, where each element is the probability of sampling a set of size \(k.\) This is useful e.g. for RandomSizeIteration where one needs frequencies. In this case self.n_samples_per_index can be None.

If probs is False, the result is a vector of integers, where each element \(k\) is the number of samples to take for set size \(k.\) The sum of all elements will depend on the value of n_samples upon construction. It will be

equal to n_samples if n_samples > 0,
the sum of the values of fun for all set sizes if n_samples == 0,
the sum after a normalization such that the smallest non-zero value is 1, if n_samples == None.

This somewhat complex behavior is necessary to ensure that the total number of samples is respected when provided either globally upon construction, or implicitly by the sum of the values of fun when the strategy computes absolute numbers, but also when the strategy has a fun that returns frequencies. The special handling of the case n_samples == None is used implicitly by StratifiedPermutationSampler (yeah, it's not pretty).

When probs is False, this method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.

PARAMETER	DESCRIPTION
`n_indices`	number of indices in the index set from which to sample. This is typically `len(dataset) - 1` with the usual index iterations. TYPE: `int`
`probs`	Whether to perform the remainder distribution. If `True`, sampling probabilities are returned. If `False`, then `n_samples_per_index` is used to compute the actual number of samples and the values are rounded down to the nearest integer, and the remainder is distributed to maintain the relative frequencies. TYPE: `bool` DEFAULT: `True`

Returns: The exact (integer) number of samples to take for each set size, if probs is False. Otherwise, the fractional number of samples.

Source code in src/pydvl/valuation/samplers/stratified.py

@cache
def sample_sizes(
    self, n_indices: int, probs: bool = True
) -> NDArray[np.int64] | NDArray[np.float64]:
    """Precomputes the number of samples to take for each set size, from 0 up to
    `n_indices` inclusive.

    If `probs` is `True`, the result is a vector of floats, where each element
    is the probability of sampling a set of size $k.$ This is useful e.g. for
    [RandomSizeIteration][pydvl.valuation.samplers.stratified.RandomSizeIteration]
    where one needs frequencies. In this case `self.n_samples_per_index` can be
    `None`.

    If `probs` is `False`, the result is a vector of integers, where each element
    $k$ is the number of samples to take for set size $k.$ The sum of all elements
    will depend on the value of `n_samples` upon construction. It will be

    * equal to `n_samples` if `n_samples > 0`,
    * the sum of the values of `fun` for all set sizes if `n_samples == 0`,
    * the sum after a normalization such that the smallest non-zero value is 1, if
      `n_samples == None`.

    This somewhat complex behavior is necessary to ensure that the total number of
    samples is respected when provided either globally upon construction, or
    implicitly by the sum of the values of `fun` when the strategy computes absolute
    numbers, but also when the strategy has a `fun` that returns frequencies. The
    special handling of the case `n_samples == None` is used implicitly by
    [StratifiedPermutationSampler][pydvl.valuation.samplers.stratified.StratifiedPermutationSampler]
    (yeah, it's not pretty).

    When `probs` is `False`, this method corrects rounding errors taking into
    account the fractional parts so that the total number of samples is respected,
    while allocating remainders in a way that follows the relative sizes of the
    fractional parts.

    Args:
        n_indices: number of indices in the index set from which to sample. This is
            typically `len(dataset) - 1` with the usual index iterations.
        probs: Whether to perform the remainder distribution. If `True`, sampling
            probabilities are returned. If `False`, then `n_samples_per_index` is
            used to compute the actual number of samples and the values are rounded
            down to the nearest integer, and the remainder is distributed to
            maintain the relative frequencies.
    Returns:
        The exact (integer) number of samples to take for each set size, if
        `probs` is `False`. Otherwise, the fractional number of samples.
    """

    # m_k = m * f(k) / sum_j f(j)
    values = np.zeros(n_indices + 1, dtype=float)
    s = 0.0
    lb, ub = self.effective_bounds(n_indices)

    for k in range(lb, ub + 1):
        val = self.fun(n_indices, k)
        values[k] = val
        s += val

    assert n_indices == 0 or s > 0, "Sum of sample sizes must be positive"
    values /= s

    if probs:
        values.setflags(write=False)  # avoid accidental modification of cache
        return values  # m_k / m

    if self._n_samples_per_index is None:
        normalization = min(values[values > 0])
        n_samples = np.sum(values / normalization).astype(int)  # min m_k = 1
    elif self._n_samples_per_index == 0:
        n_samples = np.ceil(s).astype(int)  # sum m_k = sum f_k
    else:
        n_samples = self._n_samples_per_index  # sum m_k = n_samples

    values *= n_samples

    # Round down and distribute remainder by adjusting the largest fractional parts
    # A naive implementation with e.g.
    #
    # m_k = [max(1, int(round(m * f(k)/sum(f(j) for j in range(n)), 0)))
    #         for k in range(n)]
    #
    # would not respect the total number of samples, and would not distribute
    # remainders correctly
    int_values: NDArray[np.int64] = np.floor(values).astype(np.int64)
    remainder = n_samples - np.sum(int_values)
    fractional_parts = values - int_values
    fractional_parts_indices = np.argsort(-fractional_parts, kind="stable")[
        :remainder
    ]
    int_values[fractional_parts_indices] += 1
    int_values.setflags(write=False)  # avoid accidental modification of cache
    return int_values

FiniteSequentialSizeIteration ¶

FiniteSequentialSizeIteration(strategy: SampleSizeStrategy, n_indices: int)

Bases: SampleSizeIteration

Generates exactly \(m_k\) samples for each set size \(k\) before moving to the next.

Only for deterministic sample sizes

This iteration is only valid for deterministic sample sizes. In particular, n_samples must be set to the total number of samples.

Source code in src/pydvl/valuation/samplers/stratified.py

def __init__(self, strategy: SampleSizeStrategy, n_indices: int):
    self.strategy = strategy
    self.n_indices = n_indices

GroupTestingSampleSize ¶

GroupTestingSampleSize(
    n_samples: int | None = None,
    lower_bound: int | None = None,
    upper_bound: int | None = None,
)

Bases: SampleSizeStrategy

Heuristic choice of samples per set size used for Group Testing.

GroupTestingShapleyValuation uses this strategy for the stratified sampling of samples with which to construct the linear problem it requires.

This heuristic sets the number of sets at size \(k\) to be

\[m_k = m \frac{f(k)}{\sum_{j=0}^{n-1} f(j)},\]

for a total number of samples \(m\) and:

\[ f(k) = \frac{1}{k} + \frac{1}{n-k}, \text{for} k \in \{1, n-1\}. \]

For GT Shapley, \(m=1\) and \(m_k\) is interpreted as a probability of sampling size \(k.\)

Source code in src/pydvl/valuation/samplers/stratified.py

def __init__(
    self,
    n_samples: int | None = None,
    lower_bound: int | None = None,
    upper_bound: int | None = None,
):
    if n_samples is not None:
        n_samples = validate_number("n_samples", n_samples, int, lower=0)
    if lower_bound is not None:
        lower_bound = validate_number("lower_bound", lower_bound, int, lower=0)
    if upper_bound is not None:
        upper_bound = validate_number(
            "upper_bound", upper_bound, int, lower=lower_bound
        )
    self._n_samples_per_index = n_samples
    self.lower_bound = lower_bound
    self.upper_bound = upper_bound

effective_bounds ¶

effective_bounds(n: int) -> tuple[int, int]

Returns the effective bounds for the sample sizes, given the number of indices n from which sets are sampled.

Note

The number of indices n will typically be complement_size(len(train)), i.e. what we sometimes denote as effective_n.

PARAMETER	DESCRIPTION
`n`	The number of indices from which subsets are drawn. TYPE: `int`

Returns: A tuple of [lower, upper] bounds for sample sizes (inclusive).

Source code in src/pydvl/valuation/samplers/stratified.py

def effective_bounds(self, n: int) -> tuple[int, int]:
    """Returns the effective bounds for the sample sizes, given the number of
    indices `n` from which sets are sampled.

    !!! note
        The number of indices `n` will typically be `complement_size(len(train))`,
        i.e. what we sometimes denote as `effective_n`.

    Args:
        n: The number of indices from which subsets are drawn.
    Returns:
        A tuple of [lower, upper] bounds for sample sizes (inclusive).
    """
    lower = 0 if self.lower_bound is None else self.lower_bound
    upper = n if self.upper_bound is None else self.upper_bound
    lower = min(lower, n)
    upper = min(upper, n)

    return lower, upper

n_samples_per_index ¶

n_samples_per_index(n_indices: int) -> int | None

Returns the total number of samples to take for the given number of indices, or None if no limit was set and samples are generated with probabilities.

Source code in src/pydvl/valuation/samplers/stratified.py

def n_samples_per_index(self, n_indices: int) -> int | None:
    """Returns the total number of samples to take for the given number of indices,
    or `None` if no limit was set and samples are generated with probabilities."""
    if self._n_samples_per_index is None:
        return None
    sizes = self.sample_sizes(n_indices, probs=False)
    return int(np.sum(sizes).item())

sample_sizes `cached` ¶

sample_sizes(
    n_indices: int, probs: bool = True
) -> NDArray[int64] | NDArray[float64]

Precomputes the number of samples to take for each set size, from 0 up to n_indices inclusive.

If probs is True, the result is a vector of floats, where each element is the probability of sampling a set of size \(k.\) This is useful e.g. for RandomSizeIteration where one needs frequencies. In this case self.n_samples_per_index can be None.

If probs is False, the result is a vector of integers, where each element \(k\) is the number of samples to take for set size \(k.\) The sum of all elements will depend on the value of n_samples upon construction. It will be

equal to n_samples if n_samples > 0,
the sum of the values of fun for all set sizes if n_samples == 0,
the sum after a normalization such that the smallest non-zero value is 1, if n_samples == None.

This somewhat complex behavior is necessary to ensure that the total number of samples is respected when provided either globally upon construction, or implicitly by the sum of the values of fun when the strategy computes absolute numbers, but also when the strategy has a fun that returns frequencies. The special handling of the case n_samples == None is used implicitly by StratifiedPermutationSampler (yeah, it's not pretty).

When probs is False, this method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.

PARAMETER	DESCRIPTION
`n_indices`	number of indices in the index set from which to sample. This is typically `len(dataset) - 1` with the usual index iterations. TYPE: `int`
`probs`	Whether to perform the remainder distribution. If `True`, sampling probabilities are returned. If `False`, then `n_samples_per_index` is used to compute the actual number of samples and the values are rounded down to the nearest integer, and the remainder is distributed to maintain the relative frequencies. TYPE: `bool` DEFAULT: `True`

Returns: The exact (integer) number of samples to take for each set size, if probs is False. Otherwise, the fractional number of samples.

Source code in src/pydvl/valuation/samplers/stratified.py

@cache
def sample_sizes(
    self, n_indices: int, probs: bool = True
) -> NDArray[np.int64] | NDArray[np.float64]:
    """Precomputes the number of samples to take for each set size, from 0 up to
    `n_indices` inclusive.

    If `probs` is `True`, the result is a vector of floats, where each element
    is the probability of sampling a set of size $k.$ This is useful e.g. for
    [RandomSizeIteration][pydvl.valuation.samplers.stratified.RandomSizeIteration]
    where one needs frequencies. In this case `self.n_samples_per_index` can be
    `None`.

    If `probs` is `False`, the result is a vector of integers, where each element
    $k$ is the number of samples to take for set size $k.$ The sum of all elements
    will depend on the value of `n_samples` upon construction. It will be

    * equal to `n_samples` if `n_samples > 0`,
    * the sum of the values of `fun` for all set sizes if `n_samples == 0`,
    * the sum after a normalization such that the smallest non-zero value is 1, if
      `n_samples == None`.

    This somewhat complex behavior is necessary to ensure that the total number of
    samples is respected when provided either globally upon construction, or
    implicitly by the sum of the values of `fun` when the strategy computes absolute
    numbers, but also when the strategy has a `fun` that returns frequencies. The
    special handling of the case `n_samples == None` is used implicitly by
    [StratifiedPermutationSampler][pydvl.valuation.samplers.stratified.StratifiedPermutationSampler]
    (yeah, it's not pretty).

    When `probs` is `False`, this method corrects rounding errors taking into
    account the fractional parts so that the total number of samples is respected,
    while allocating remainders in a way that follows the relative sizes of the
    fractional parts.

    Args:
        n_indices: number of indices in the index set from which to sample. This is
            typically `len(dataset) - 1` with the usual index iterations.
        probs: Whether to perform the remainder distribution. If `True`, sampling
            probabilities are returned. If `False`, then `n_samples_per_index` is
            used to compute the actual number of samples and the values are rounded
            down to the nearest integer, and the remainder is distributed to
            maintain the relative frequencies.
    Returns:
        The exact (integer) number of samples to take for each set size, if
        `probs` is `False`. Otherwise, the fractional number of samples.
    """

    # m_k = m * f(k) / sum_j f(j)
    values = np.zeros(n_indices + 1, dtype=float)
    s = 0.0
    lb, ub = self.effective_bounds(n_indices)

    for k in range(lb, ub + 1):
        val = self.fun(n_indices, k)
        values[k] = val
        s += val

    assert n_indices == 0 or s > 0, "Sum of sample sizes must be positive"
    values /= s

    if probs:
        values.setflags(write=False)  # avoid accidental modification of cache
        return values  # m_k / m

    if self._n_samples_per_index is None:
        normalization = min(values[values > 0])
        n_samples = np.sum(values / normalization).astype(int)  # min m_k = 1
    elif self._n_samples_per_index == 0:
        n_samples = np.ceil(s).astype(int)  # sum m_k = sum f_k
    else:
        n_samples = self._n_samples_per_index  # sum m_k = n_samples

    values *= n_samples

    # Round down and distribute remainder by adjusting the largest fractional parts
    # A naive implementation with e.g.
    #
    # m_k = [max(1, int(round(m * f(k)/sum(f(j) for j in range(n)), 0)))
    #         for k in range(n)]
    #
    # would not respect the total number of samples, and would not distribute
    # remainders correctly
    int_values: NDArray[np.int64] = np.floor(values).astype(np.int64)
    remainder = n_samples - np.sum(int_values)
    fractional_parts = values - int_values
    fractional_parts_indices = np.argsort(-fractional_parts, kind="stable")[
        :remainder
    ]
    int_values[fractional_parts_indices] += 1
    int_values.setflags(write=False)  # avoid accidental modification of cache
    return int_values

HarmonicSampleSize ¶

HarmonicSampleSize(
    n_samples: int | None = None,
    lower_bound: int | None = None,
    upper_bound: int | None = None,
)

Bases: SampleSizeStrategy

Heuristic choice of samples per set size for VRDS.

Sets the number of sets at size \(k\) to be

\[m_k = m \frac{f(k)}{\sum_{j=0}^{n-1} f(j)},\]

for a total number of samples \(m\) and:

\[f(k) = \frac{1}{1+k}.\]

Source code in src/pydvl/valuation/samplers/stratified.py

def __init__(
    self,
    n_samples: int | None = None,
    lower_bound: int | None = None,
    upper_bound: int | None = None,
):
    if n_samples is not None:
        n_samples = validate_number("n_samples", n_samples, int, lower=0)
    if lower_bound is not None:
        lower_bound = validate_number("lower_bound", lower_bound, int, lower=0)
    if upper_bound is not None:
        upper_bound = validate_number(
            "upper_bound", upper_bound, int, lower=lower_bound
        )
    self._n_samples_per_index = n_samples
    self.lower_bound = lower_bound
    self.upper_bound = upper_bound

effective_bounds ¶

effective_bounds(n: int) -> tuple[int, int]

Returns the effective bounds for the sample sizes, given the number of indices n from which sets are sampled.

Note

The number of indices n will typically be complement_size(len(train)), i.e. what we sometimes denote as effective_n.

PARAMETER	DESCRIPTION
`n`	The number of indices from which subsets are drawn. TYPE: `int`

Returns: A tuple of [lower, upper] bounds for sample sizes (inclusive).

Source code in src/pydvl/valuation/samplers/stratified.py

def effective_bounds(self, n: int) -> tuple[int, int]:
    """Returns the effective bounds for the sample sizes, given the number of
    indices `n` from which sets are sampled.

    !!! note
        The number of indices `n` will typically be `complement_size(len(train))`,
        i.e. what we sometimes denote as `effective_n`.

    Args:
        n: The number of indices from which subsets are drawn.
    Returns:
        A tuple of [lower, upper] bounds for sample sizes (inclusive).
    """
    lower = 0 if self.lower_bound is None else self.lower_bound
    upper = n if self.upper_bound is None else self.upper_bound
    lower = min(lower, n)
    upper = min(upper, n)

    return lower, upper

n_samples_per_index ¶

n_samples_per_index(n_indices: int) -> int | None

Returns the total number of samples to take for the given number of indices, or None if no limit was set and samples are generated with probabilities.

Source code in src/pydvl/valuation/samplers/stratified.py

def n_samples_per_index(self, n_indices: int) -> int | None:
    """Returns the total number of samples to take for the given number of indices,
    or `None` if no limit was set and samples are generated with probabilities."""
    if self._n_samples_per_index is None:
        return None
    sizes = self.sample_sizes(n_indices, probs=False)
    return int(np.sum(sizes).item())

sample_sizes `cached` ¶

sample_sizes(
    n_indices: int, probs: bool = True
) -> NDArray[int64] | NDArray[float64]

Precomputes the number of samples to take for each set size, from 0 up to n_indices inclusive.

If probs is True, the result is a vector of floats, where each element is the probability of sampling a set of size \(k.\) This is useful e.g. for RandomSizeIteration where one needs frequencies. In this case self.n_samples_per_index can be None.

If probs is False, the result is a vector of integers, where each element \(k\) is the number of samples to take for set size \(k.\) The sum of all elements will depend on the value of n_samples upon construction. It will be

equal to n_samples if n_samples > 0,
the sum of the values of fun for all set sizes if n_samples == 0,
the sum after a normalization such that the smallest non-zero value is 1, if n_samples == None.

This somewhat complex behavior is necessary to ensure that the total number of samples is respected when provided either globally upon construction, or implicitly by the sum of the values of fun when the strategy computes absolute numbers, but also when the strategy has a fun that returns frequencies. The special handling of the case n_samples == None is used implicitly by StratifiedPermutationSampler (yeah, it's not pretty).

When probs is False, this method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.

PARAMETER	DESCRIPTION
`n_indices`	number of indices in the index set from which to sample. This is typically `len(dataset) - 1` with the usual index iterations. TYPE: `int`
`probs`	Whether to perform the remainder distribution. If `True`, sampling probabilities are returned. If `False`, then `n_samples_per_index` is used to compute the actual number of samples and the values are rounded down to the nearest integer, and the remainder is distributed to maintain the relative frequencies. TYPE: `bool` DEFAULT: `True`

Returns: The exact (integer) number of samples to take for each set size, if probs is False. Otherwise, the fractional number of samples.

Source code in src/pydvl/valuation/samplers/stratified.py

@cache
def sample_sizes(
    self, n_indices: int, probs: bool = True
) -> NDArray[np.int64] | NDArray[np.float64]:
    """Precomputes the number of samples to take for each set size, from 0 up to
    `n_indices` inclusive.

    If `probs` is `True`, the result is a vector of floats, where each element
    is the probability of sampling a set of size $k.$ This is useful e.g. for
    [RandomSizeIteration][pydvl.valuation.samplers.stratified.RandomSizeIteration]
    where one needs frequencies. In this case `self.n_samples_per_index` can be
    `None`.

    If `probs` is `False`, the result is a vector of integers, where each element
    $k$ is the number of samples to take for set size $k.$ The sum of all elements
    will depend on the value of `n_samples` upon construction. It will be

    * equal to `n_samples` if `n_samples > 0`,
    * the sum of the values of `fun` for all set sizes if `n_samples == 0`,
    * the sum after a normalization such that the smallest non-zero value is 1, if
      `n_samples == None`.

    This somewhat complex behavior is necessary to ensure that the total number of
    samples is respected when provided either globally upon construction, or
    implicitly by the sum of the values of `fun` when the strategy computes absolute
    numbers, but also when the strategy has a `fun` that returns frequencies. The
    special handling of the case `n_samples == None` is used implicitly by
    [StratifiedPermutationSampler][pydvl.valuation.samplers.stratified.StratifiedPermutationSampler]
    (yeah, it's not pretty).

    When `probs` is `False`, this method corrects rounding errors taking into
    account the fractional parts so that the total number of samples is respected,
    while allocating remainders in a way that follows the relative sizes of the
    fractional parts.

    Args:
        n_indices: number of indices in the index set from which to sample. This is
            typically `len(dataset) - 1` with the usual index iterations.
        probs: Whether to perform the remainder distribution. If `True`, sampling
            probabilities are returned. If `False`, then `n_samples_per_index` is
            used to compute the actual number of samples and the values are rounded
            down to the nearest integer, and the remainder is distributed to
            maintain the relative frequencies.
    Returns:
        The exact (integer) number of samples to take for each set size, if
        `probs` is `False`. Otherwise, the fractional number of samples.
    """

    # m_k = m * f(k) / sum_j f(j)
    values = np.zeros(n_indices + 1, dtype=float)
    s = 0.0
    lb, ub = self.effective_bounds(n_indices)

    for k in range(lb, ub + 1):
        val = self.fun(n_indices, k)
        values[k] = val
        s += val

    assert n_indices == 0 or s > 0, "Sum of sample sizes must be positive"
    values /= s

    if probs:
        values.setflags(write=False)  # avoid accidental modification of cache
        return values  # m_k / m

    if self._n_samples_per_index is None:
        normalization = min(values[values > 0])
        n_samples = np.sum(values / normalization).astype(int)  # min m_k = 1
    elif self._n_samples_per_index == 0:
        n_samples = np.ceil(s).astype(int)  # sum m_k = sum f_k
    else:
        n_samples = self._n_samples_per_index  # sum m_k = n_samples

    values *= n_samples

    # Round down and distribute remainder by adjusting the largest fractional parts
    # A naive implementation with e.g.
    #
    # m_k = [max(1, int(round(m * f(k)/sum(f(j) for j in range(n)), 0)))
    #         for k in range(n)]
    #
    # would not respect the total number of samples, and would not distribute
    # remainders correctly
    int_values: NDArray[np.int64] = np.floor(values).astype(np.int64)
    remainder = n_samples - np.sum(int_values)
    fractional_parts = values - int_values
    fractional_parts_indices = np.argsort(-fractional_parts, kind="stable")[
        :remainder
    ]
    int_values[fractional_parts_indices] += 1
    int_values.setflags(write=False)  # avoid accidental modification of cache
    return int_values

LinearSampleSize ¶

LinearSampleSize(
    scale: float,
    offset: int = 0,
    n_samples: int | None = None,
    lower_bound: int | None = None,
    upper_bound: int | None = None,
)

Bases: SampleSizeStrategy

Use a linear function of the set size to determine the number of samples to take.

This is mostly intended for testing purposes, as it is not a very useful heuristic.

PARAMETER	DESCRIPTION
`scale`	Slope of the linear function. TYPE: `float`
`offset`	Offset of the linear function. TYPE: `int` DEFAULT: `0`
`n_samples`	Number of samples for the stratified sampler to generate. It can be `None`, 0 or a positive integer. See the documentation of SampleSizeStrategy for details. TYPE: `int \| None` DEFAULT: `None`
`lower_bound`	Lower bound for the set sizes. TYPE: `int \| None` DEFAULT: `None`
`upper_bound`	Upper bound for the set sizes. TYPE: `int \| None` DEFAULT: `None`

Source code in src/pydvl/valuation/samplers/stratified.py

def __init__(
    self,
    scale: float,
    offset: int = 0,
    n_samples: int | None = None,
    lower_bound: int | None = None,
    upper_bound: int | None = None,
):
    super().__init__(
        n_samples=n_samples, lower_bound=lower_bound, upper_bound=upper_bound
    )
    self.scale = validate_number("scale", scale, float, lower=0.0)
    self.offset = validate_number("offset", offset, int, lower=0)

effective_bounds ¶

effective_bounds(n: int) -> tuple[int, int]

Returns the effective bounds for the sample sizes, given the number of indices n from which sets are sampled.

Note

The number of indices n will typically be complement_size(len(train)), i.e. what we sometimes denote as effective_n.

PARAMETER	DESCRIPTION
`n`	The number of indices from which subsets are drawn. TYPE: `int`

Returns: A tuple of [lower, upper] bounds for sample sizes (inclusive).

Source code in src/pydvl/valuation/samplers/stratified.py

def effective_bounds(self, n: int) -> tuple[int, int]:
    """Returns the effective bounds for the sample sizes, given the number of
    indices `n` from which sets are sampled.

    !!! note
        The number of indices `n` will typically be `complement_size(len(train))`,
        i.e. what we sometimes denote as `effective_n`.

    Args:
        n: The number of indices from which subsets are drawn.
    Returns:
        A tuple of [lower, upper] bounds for sample sizes (inclusive).
    """
    lower = 0 if self.lower_bound is None else self.lower_bound
    upper = n if self.upper_bound is None else self.upper_bound
    lower = min(lower, n)
    upper = min(upper, n)

    return lower, upper

n_samples_per_index ¶

n_samples_per_index(n_indices: int) -> int | None

Returns the total number of samples to take for the given number of indices, or None if no limit was set and samples are generated with probabilities.

Source code in src/pydvl/valuation/samplers/stratified.py

def n_samples_per_index(self, n_indices: int) -> int | None:
    """Returns the total number of samples to take for the given number of indices,
    or `None` if no limit was set and samples are generated with probabilities."""
    if self._n_samples_per_index is None:
        return None
    sizes = self.sample_sizes(n_indices, probs=False)
    return int(np.sum(sizes).item())

sample_sizes `cached` ¶

sample_sizes(
    n_indices: int, probs: bool = True
) -> NDArray[int64] | NDArray[float64]

Precomputes the number of samples to take for each set size, from 0 up to n_indices inclusive.

If probs is True, the result is a vector of floats, where each element is the probability of sampling a set of size \(k.\) This is useful e.g. for RandomSizeIteration where one needs frequencies. In this case self.n_samples_per_index can be None.

If probs is False, the result is a vector of integers, where each element \(k\) is the number of samples to take for set size \(k.\) The sum of all elements will depend on the value of n_samples upon construction. It will be

equal to n_samples if n_samples > 0,
the sum of the values of fun for all set sizes if n_samples == 0,
the sum after a normalization such that the smallest non-zero value is 1, if n_samples == None.

This somewhat complex behavior is necessary to ensure that the total number of samples is respected when provided either globally upon construction, or implicitly by the sum of the values of fun when the strategy computes absolute numbers, but also when the strategy has a fun that returns frequencies. The special handling of the case n_samples == None is used implicitly by StratifiedPermutationSampler (yeah, it's not pretty).

When probs is False, this method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.

PARAMETER	DESCRIPTION
`n_indices`	number of indices in the index set from which to sample. This is typically `len(dataset) - 1` with the usual index iterations. TYPE: `int`
`probs`	Whether to perform the remainder distribution. If `True`, sampling probabilities are returned. If `False`, then `n_samples_per_index` is used to compute the actual number of samples and the values are rounded down to the nearest integer, and the remainder is distributed to maintain the relative frequencies. TYPE: `bool` DEFAULT: `True`

Returns: The exact (integer) number of samples to take for each set size, if probs is False. Otherwise, the fractional number of samples.

Source code in src/pydvl/valuation/samplers/stratified.py

@cache
def sample_sizes(
    self, n_indices: int, probs: bool = True
) -> NDArray[np.int64] | NDArray[np.float64]:
    """Precomputes the number of samples to take for each set size, from 0 up to
    `n_indices` inclusive.

    If `probs` is `True`, the result is a vector of floats, where each element
    is the probability of sampling a set of size $k.$ This is useful e.g. for
    [RandomSizeIteration][pydvl.valuation.samplers.stratified.RandomSizeIteration]
    where one needs frequencies. In this case `self.n_samples_per_index` can be
    `None`.

    If `probs` is `False`, the result is a vector of integers, where each element
    $k$ is the number of samples to take for set size $k.$ The sum of all elements
    will depend on the value of `n_samples` upon construction. It will be

    * equal to `n_samples` if `n_samples > 0`,
    * the sum of the values of `fun` for all set sizes if `n_samples == 0`,
    * the sum after a normalization such that the smallest non-zero value is 1, if
      `n_samples == None`.

    This somewhat complex behavior is necessary to ensure that the total number of
    samples is respected when provided either globally upon construction, or
    implicitly by the sum of the values of `fun` when the strategy computes absolute
    numbers, but also when the strategy has a `fun` that returns frequencies. The
    special handling of the case `n_samples == None` is used implicitly by
    [StratifiedPermutationSampler][pydvl.valuation.samplers.stratified.StratifiedPermutationSampler]
    (yeah, it's not pretty).

    When `probs` is `False`, this method corrects rounding errors taking into
    account the fractional parts so that the total number of samples is respected,
    while allocating remainders in a way that follows the relative sizes of the
    fractional parts.

    Args:
        n_indices: number of indices in the index set from which to sample. This is
            typically `len(dataset) - 1` with the usual index iterations.
        probs: Whether to perform the remainder distribution. If `True`, sampling
            probabilities are returned. If `False`, then `n_samples_per_index` is
            used to compute the actual number of samples and the values are rounded
            down to the nearest integer, and the remainder is distributed to
            maintain the relative frequencies.
    Returns:
        The exact (integer) number of samples to take for each set size, if
        `probs` is `False`. Otherwise, the fractional number of samples.
    """

    # m_k = m * f(k) / sum_j f(j)
    values = np.zeros(n_indices + 1, dtype=float)
    s = 0.0
    lb, ub = self.effective_bounds(n_indices)

    for k in range(lb, ub + 1):
        val = self.fun(n_indices, k)
        values[k] = val
        s += val

    assert n_indices == 0 or s > 0, "Sum of sample sizes must be positive"
    values /= s

    if probs:
        values.setflags(write=False)  # avoid accidental modification of cache
        return values  # m_k / m

    if self._n_samples_per_index is None:
        normalization = min(values[values > 0])
        n_samples = np.sum(values / normalization).astype(int)  # min m_k = 1
    elif self._n_samples_per_index == 0:
        n_samples = np.ceil(s).astype(int)  # sum m_k = sum f_k
    else:
        n_samples = self._n_samples_per_index  # sum m_k = n_samples

    values *= n_samples

    # Round down and distribute remainder by adjusting the largest fractional parts
    # A naive implementation with e.g.
    #
    # m_k = [max(1, int(round(m * f(k)/sum(f(j) for j in range(n)), 0)))
    #         for k in range(n)]
    #
    # would not respect the total number of samples, and would not distribute
    # remainders correctly
    int_values: NDArray[np.int64] = np.floor(values).astype(np.int64)
    remainder = n_samples - np.sum(int_values)
    fractional_parts = values - int_values
    fractional_parts_indices = np.argsort(-fractional_parts, kind="stable")[
        :remainder
    ]
    int_values[fractional_parts_indices] += 1
    int_values.setflags(write=False)  # avoid accidental modification of cache
    return int_values

PowerLawSampleSize ¶

PowerLawSampleSize(
    exponent: float,
    n_samples: int | None = None,
    lower_bound: int | None = None,
    upper_bound: int | None = None,
)

Bases: SampleSizeStrategy

Heuristic choice of samples per set size for VRDS.

Sets the number of sets at size \(k\) to be

\[m_k = m \frac{f(k)}{\sum_{j=0}^{n-1} f(j)},\]

for a total number of samples \(m\) and:

\[f(k) = (1+k)^a, \]

and some exponent \(a.\) With \(a=-1\) one recovers the HarmonicSampleSize heuristic. With \(a=-2\) one has the same asymptotic behaviour as the \(\delta\)-Shapley strategies.

PARAMETER	DESCRIPTION
`n_samples`	Total number of samples to generate per index. TYPE: `int \| None` DEFAULT: `None`
`exponent`	The exponent to use. Recommended values are between -1 and -0.5. TYPE: `float`

Source code in src/pydvl/valuation/samplers/stratified.py

def __init__(
    self,
    exponent: float,
    n_samples: int | None = None,
    lower_bound: int | None = None,
    upper_bound: int | None = None,
):
    super().__init__(n_samples, lower_bound, upper_bound)
    self.exponent = exponent

effective_bounds ¶

effective_bounds(n: int) -> tuple[int, int]

Returns the effective bounds for the sample sizes, given the number of indices n from which sets are sampled.

Note

The number of indices n will typically be complement_size(len(train)), i.e. what we sometimes denote as effective_n.

PARAMETER	DESCRIPTION
`n`	The number of indices from which subsets are drawn. TYPE: `int`

Returns: A tuple of [lower, upper] bounds for sample sizes (inclusive).

Source code in src/pydvl/valuation/samplers/stratified.py

def effective_bounds(self, n: int) -> tuple[int, int]:
    """Returns the effective bounds for the sample sizes, given the number of
    indices `n` from which sets are sampled.

    !!! note
        The number of indices `n` will typically be `complement_size(len(train))`,
        i.e. what we sometimes denote as `effective_n`.

    Args:
        n: The number of indices from which subsets are drawn.
    Returns:
        A tuple of [lower, upper] bounds for sample sizes (inclusive).
    """
    lower = 0 if self.lower_bound is None else self.lower_bound
    upper = n if self.upper_bound is None else self.upper_bound
    lower = min(lower, n)
    upper = min(upper, n)

    return lower, upper

n_samples_per_index ¶

n_samples_per_index(n_indices: int) -> int | None

Returns the total number of samples to take for the given number of indices, or None if no limit was set and samples are generated with probabilities.

Source code in src/pydvl/valuation/samplers/stratified.py

def n_samples_per_index(self, n_indices: int) -> int | None:
    """Returns the total number of samples to take for the given number of indices,
    or `None` if no limit was set and samples are generated with probabilities."""
    if self._n_samples_per_index is None:
        return None
    sizes = self.sample_sizes(n_indices, probs=False)
    return int(np.sum(sizes).item())

sample_sizes `cached` ¶

sample_sizes(
    n_indices: int, probs: bool = True
) -> NDArray[int64] | NDArray[float64]

Precomputes the number of samples to take for each set size, from 0 up to n_indices inclusive.

If probs is True, the result is a vector of floats, where each element is the probability of sampling a set of size \(k.\) This is useful e.g. for RandomSizeIteration where one needs frequencies. In this case self.n_samples_per_index can be None.

If probs is False, the result is a vector of integers, where each element \(k\) is the number of samples to take for set size \(k.\) The sum of all elements will depend on the value of n_samples upon construction. It will be

equal to n_samples if n_samples > 0,
the sum of the values of fun for all set sizes if n_samples == 0,
the sum after a normalization such that the smallest non-zero value is 1, if n_samples == None.

This somewhat complex behavior is necessary to ensure that the total number of samples is respected when provided either globally upon construction, or implicitly by the sum of the values of fun when the strategy computes absolute numbers, but also when the strategy has a fun that returns frequencies. The special handling of the case n_samples == None is used implicitly by StratifiedPermutationSampler (yeah, it's not pretty).

When probs is False, this method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.

PARAMETER	DESCRIPTION
`n_indices`	number of indices in the index set from which to sample. This is typically `len(dataset) - 1` with the usual index iterations. TYPE: `int`
`probs`	Whether to perform the remainder distribution. If `True`, sampling probabilities are returned. If `False`, then `n_samples_per_index` is used to compute the actual number of samples and the values are rounded down to the nearest integer, and the remainder is distributed to maintain the relative frequencies. TYPE: `bool` DEFAULT: `True`

Returns: The exact (integer) number of samples to take for each set size, if probs is False. Otherwise, the fractional number of samples.

Source code in src/pydvl/valuation/samplers/stratified.py

@cache
def sample_sizes(
    self, n_indices: int, probs: bool = True
) -> NDArray[np.int64] | NDArray[np.float64]:
    """Precomputes the number of samples to take for each set size, from 0 up to
    `n_indices` inclusive.

    If `probs` is `True`, the result is a vector of floats, where each element
    is the probability of sampling a set of size $k.$ This is useful e.g. for
    [RandomSizeIteration][pydvl.valuation.samplers.stratified.RandomSizeIteration]
    where one needs frequencies. In this case `self.n_samples_per_index` can be
    `None`.

    If `probs` is `False`, the result is a vector of integers, where each element
    $k$ is the number of samples to take for set size $k.$ The sum of all elements
    will depend on the value of `n_samples` upon construction. It will be

    * equal to `n_samples` if `n_samples > 0`,
    * the sum of the values of `fun` for all set sizes if `n_samples == 0`,
    * the sum after a normalization such that the smallest non-zero value is 1, if
      `n_samples == None`.

    This somewhat complex behavior is necessary to ensure that the total number of
    samples is respected when provided either globally upon construction, or
    implicitly by the sum of the values of `fun` when the strategy computes absolute
    numbers, but also when the strategy has a `fun` that returns frequencies. The
    special handling of the case `n_samples == None` is used implicitly by
    [StratifiedPermutationSampler][pydvl.valuation.samplers.stratified.StratifiedPermutationSampler]
    (yeah, it's not pretty).

    When `probs` is `False`, this method corrects rounding errors taking into
    account the fractional parts so that the total number of samples is respected,
    while allocating remainders in a way that follows the relative sizes of the
    fractional parts.

    Args:
        n_indices: number of indices in the index set from which to sample. This is
            typically `len(dataset) - 1` with the usual index iterations.
        probs: Whether to perform the remainder distribution. If `True`, sampling
            probabilities are returned. If `False`, then `n_samples_per_index` is
            used to compute the actual number of samples and the values are rounded
            down to the nearest integer, and the remainder is distributed to
            maintain the relative frequencies.
    Returns:
        The exact (integer) number of samples to take for each set size, if
        `probs` is `False`. Otherwise, the fractional number of samples.
    """

    # m_k = m * f(k) / sum_j f(j)
    values = np.zeros(n_indices + 1, dtype=float)
    s = 0.0
    lb, ub = self.effective_bounds(n_indices)

    for k in range(lb, ub + 1):
        val = self.fun(n_indices, k)
        values[k] = val
        s += val

    assert n_indices == 0 or s > 0, "Sum of sample sizes must be positive"
    values /= s

    if probs:
        values.setflags(write=False)  # avoid accidental modification of cache
        return values  # m_k / m

    if self._n_samples_per_index is None:
        normalization = min(values[values > 0])
        n_samples = np.sum(values / normalization).astype(int)  # min m_k = 1
    elif self._n_samples_per_index == 0:
        n_samples = np.ceil(s).astype(int)  # sum m_k = sum f_k
    else:
        n_samples = self._n_samples_per_index  # sum m_k = n_samples

    values *= n_samples

    # Round down and distribute remainder by adjusting the largest fractional parts
    # A naive implementation with e.g.
    #
    # m_k = [max(1, int(round(m * f(k)/sum(f(j) for j in range(n)), 0)))
    #         for k in range(n)]
    #
    # would not respect the total number of samples, and would not distribute
    # remainders correctly
    int_values: NDArray[np.int64] = np.floor(values).astype(np.int64)
    remainder = n_samples - np.sum(int_values)
    fractional_parts = values - int_values
    fractional_parts_indices = np.argsort(-fractional_parts, kind="stable")[
        :remainder
    ]
    int_values[fractional_parts_indices] += 1
    int_values.setflags(write=False)  # avoid accidental modification of cache
    return int_values

RandomSizeIteration ¶

RandomSizeIteration(
    strategy: SampleSizeStrategy, n_indices: int, seed: Seed | None = None
)

Bases: SampleSizeIteration

Draws a set size \(k\) following the distribution of sizes given by the strategy.

Source code in src/pydvl/valuation/samplers/stratified.py

def __init__(
    self, strategy: SampleSizeStrategy, n_indices: int, seed: Seed | None = None
):
    super().__init__(strategy, n_indices)
    self._rng = np.random.default_rng(seed)

RoundRobinSizeIteration ¶

RoundRobinSizeIteration(strategy: SampleSizeStrategy, n_indices: int)

Bases: SampleSizeIteration

Generates one sample for each set size \(k\) before moving to the next.

This continues yielding until every size \(k\) has been emitted exactly \(m_k\) times. For example, if strategy.sample_sizes() == [2, 3, 1] then we want the sequence: (0,1), (1,1), (2,1), (0,1), (1,1), (1,1)

Only for deterministic sample sizes

This iteration is only valid for deterministic sample sizes. In particular, n_samples must be set to the total number of samples.

Source code in src/pydvl/valuation/samplers/stratified.py

def __init__(self, strategy: SampleSizeStrategy, n_indices: int):
    self.strategy = strategy
    self.n_indices = n_indices

SampleSizeIteration ¶

SampleSizeIteration(strategy: SampleSizeStrategy, n_indices: int)

Bases: ABC

Given a strategy and the number of indices, yield tuples (k, count) that the sampler loop will use. Args: strategy: The strategy to use for computing the number of samples to take. n_indices: The number of indices in the index set from which samples are taken.

Source code in src/pydvl/valuation/samplers/stratified.py

def __init__(self, strategy: SampleSizeStrategy, n_indices: int):
    self.strategy = strategy
    self.n_indices = n_indices

SampleSizeStrategy ¶

SampleSizeStrategy(
    n_samples: int | None = None,
    lower_bound: int | None = None,
    upper_bound: int | None = None,
)

Bases: ABC

An object to compute the number of samples to take for a given set size.

To be used with StratifiedSampler.

Following the structure proposed in Wu et al. (2023),² this sets the number of sets at size \(k\) to be:

\[m(k) = m \frac{f(k)}{\sum_{j=0}^{n} f(j)},\]

for some choice of \(f.\) Implementations of this base class must override the method fun() implementing \(f\). It is provided both the size \(k\) and the total number of indices \(n\) as arguments.

The argument n_samples can be:

Fixed to a positive integer. For powerset samplers, this is the number of samples per index that will be generated, i.e. if the sampler iterates over each index exactly once, e.g. FiniteSequentialIndexIteration, then the total number of samples will be n_samples * n_indices.
Fixed to 0 to indicate the strategy produces an absolute number of samples, this only used when subclassing and implementing a fun that does not return frequencies.
None if only sampling probabilities are required, e.g. when using a stochastic sampler and a RandomSizeIteration.

PARAMETER	DESCRIPTION
`n_samples`	Number of samples for the stratified sampler to generate. It can be `None`, 0 or a positive integer. See the class documentation for details. TYPE: `int \| None` DEFAULT: `None`
`lower_bound`	Lower bound for the set sizes. If the set size is smaller than this, the probability of sampling is 0. If `None`, the lower bound is set to 0. TYPE: `int \| None` DEFAULT: `None`
`upper_bound`	Upper bound for the set size. If the set size is larger than this, the probability of sampling is 0. If `None`, the upper bound is set to the number of indices. TYPE: `int \| None` DEFAULT: `None`

Source code in src/pydvl/valuation/samplers/stratified.py

def __init__(
    self,
    n_samples: int | None = None,
    lower_bound: int | None = None,
    upper_bound: int | None = None,
):
    if n_samples is not None:
        n_samples = validate_number("n_samples", n_samples, int, lower=0)
    if lower_bound is not None:
        lower_bound = validate_number("lower_bound", lower_bound, int, lower=0)
    if upper_bound is not None:
        upper_bound = validate_number(
            "upper_bound", upper_bound, int, lower=lower_bound
        )
    self._n_samples_per_index = n_samples
    self.lower_bound = lower_bound
    self.upper_bound = upper_bound

effective_bounds ¶

effective_bounds(n: int) -> tuple[int, int]

Returns the effective bounds for the sample sizes, given the number of indices n from which sets are sampled.

Note

The number of indices n will typically be complement_size(len(train)), i.e. what we sometimes denote as effective_n.

PARAMETER	DESCRIPTION
`n`	The number of indices from which subsets are drawn. TYPE: `int`

Returns: A tuple of [lower, upper] bounds for sample sizes (inclusive).

Source code in src/pydvl/valuation/samplers/stratified.py

def effective_bounds(self, n: int) -> tuple[int, int]:
    """Returns the effective bounds for the sample sizes, given the number of
    indices `n` from which sets are sampled.

    !!! note
        The number of indices `n` will typically be `complement_size(len(train))`,
        i.e. what we sometimes denote as `effective_n`.

    Args:
        n: The number of indices from which subsets are drawn.
    Returns:
        A tuple of [lower, upper] bounds for sample sizes (inclusive).
    """
    lower = 0 if self.lower_bound is None else self.lower_bound
    upper = n if self.upper_bound is None else self.upper_bound
    lower = min(lower, n)
    upper = min(upper, n)

    return lower, upper

fun `abstractmethod` ¶

fun(n_indices: int, subset_len: int) -> float

The function \(f\) to use in the heuristic. Args: n_indices: Size of the index set. subset_len: Size of the subset.

Source code in src/pydvl/valuation/samplers/stratified.py

@abstractmethod
def fun(self, n_indices: int, subset_len: int) -> float:
    """The function $f$ to use in the heuristic.
    Args:
        n_indices: Size of the index set.
        subset_len: Size of the subset.
    """
    ...

n_samples_per_index ¶

n_samples_per_index(n_indices: int) -> int | None

Returns the total number of samples to take for the given number of indices, or None if no limit was set and samples are generated with probabilities.

Source code in src/pydvl/valuation/samplers/stratified.py

def n_samples_per_index(self, n_indices: int) -> int | None:
    """Returns the total number of samples to take for the given number of indices,
    or `None` if no limit was set and samples are generated with probabilities."""
    if self._n_samples_per_index is None:
        return None
    sizes = self.sample_sizes(n_indices, probs=False)
    return int(np.sum(sizes).item())

sample_sizes `cached` ¶

sample_sizes(
    n_indices: int, probs: bool = True
) -> NDArray[int64] | NDArray[float64]

Precomputes the number of samples to take for each set size, from 0 up to n_indices inclusive.

If probs is True, the result is a vector of floats, where each element is the probability of sampling a set of size \(k.\) This is useful e.g. for RandomSizeIteration where one needs frequencies. In this case self.n_samples_per_index can be None.

If probs is False, the result is a vector of integers, where each element \(k\) is the number of samples to take for set size \(k.\) The sum of all elements will depend on the value of n_samples upon construction. It will be

equal to n_samples if n_samples > 0,
the sum of the values of fun for all set sizes if n_samples == 0,
the sum after a normalization such that the smallest non-zero value is 1, if n_samples == None.

This somewhat complex behavior is necessary to ensure that the total number of samples is respected when provided either globally upon construction, or implicitly by the sum of the values of fun when the strategy computes absolute numbers, but also when the strategy has a fun that returns frequencies. The special handling of the case n_samples == None is used implicitly by StratifiedPermutationSampler (yeah, it's not pretty).

When probs is False, this method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.

PARAMETER	DESCRIPTION
`n_indices`	number of indices in the index set from which to sample. This is typically `len(dataset) - 1` with the usual index iterations. TYPE: `int`
`probs`	Whether to perform the remainder distribution. If `True`, sampling probabilities are returned. If `False`, then `n_samples_per_index` is used to compute the actual number of samples and the values are rounded down to the nearest integer, and the remainder is distributed to maintain the relative frequencies. TYPE: `bool` DEFAULT: `True`

Returns: The exact (integer) number of samples to take for each set size, if probs is False. Otherwise, the fractional number of samples.

Source code in src/pydvl/valuation/samplers/stratified.py

@cache
def sample_sizes(
    self, n_indices: int, probs: bool = True
) -> NDArray[np.int64] | NDArray[np.float64]:
    """Precomputes the number of samples to take for each set size, from 0 up to
    `n_indices` inclusive.

    If `probs` is `True`, the result is a vector of floats, where each element
    is the probability of sampling a set of size $k.$ This is useful e.g. for
    [RandomSizeIteration][pydvl.valuation.samplers.stratified.RandomSizeIteration]
    where one needs frequencies. In this case `self.n_samples_per_index` can be
    `None`.

    If `probs` is `False`, the result is a vector of integers, where each element
    $k$ is the number of samples to take for set size $k.$ The sum of all elements
    will depend on the value of `n_samples` upon construction. It will be

    * equal to `n_samples` if `n_samples > 0`,
    * the sum of the values of `fun` for all set sizes if `n_samples == 0`,
    * the sum after a normalization such that the smallest non-zero value is 1, if
      `n_samples == None`.

    This somewhat complex behavior is necessary to ensure that the total number of
    samples is respected when provided either globally upon construction, or
    implicitly by the sum of the values of `fun` when the strategy computes absolute
    numbers, but also when the strategy has a `fun` that returns frequencies. The
    special handling of the case `n_samples == None` is used implicitly by
    [StratifiedPermutationSampler][pydvl.valuation.samplers.stratified.StratifiedPermutationSampler]
    (yeah, it's not pretty).

    When `probs` is `False`, this method corrects rounding errors taking into
    account the fractional parts so that the total number of samples is respected,
    while allocating remainders in a way that follows the relative sizes of the
    fractional parts.

    Args:
        n_indices: number of indices in the index set from which to sample. This is
            typically `len(dataset) - 1` with the usual index iterations.
        probs: Whether to perform the remainder distribution. If `True`, sampling
            probabilities are returned. If `False`, then `n_samples_per_index` is
            used to compute the actual number of samples and the values are rounded
            down to the nearest integer, and the remainder is distributed to
            maintain the relative frequencies.
    Returns:
        The exact (integer) number of samples to take for each set size, if
        `probs` is `False`. Otherwise, the fractional number of samples.
    """

    # m_k = m * f(k) / sum_j f(j)
    values = np.zeros(n_indices + 1, dtype=float)
    s = 0.0
    lb, ub = self.effective_bounds(n_indices)

    for k in range(lb, ub + 1):
        val = self.fun(n_indices, k)
        values[k] = val
        s += val

    assert n_indices == 0 or s > 0, "Sum of sample sizes must be positive"
    values /= s

    if probs:
        values.setflags(write=False)  # avoid accidental modification of cache
        return values  # m_k / m

    if self._n_samples_per_index is None:
        normalization = min(values[values > 0])
        n_samples = np.sum(values / normalization).astype(int)  # min m_k = 1
    elif self._n_samples_per_index == 0:
        n_samples = np.ceil(s).astype(int)  # sum m_k = sum f_k
    else:
        n_samples = self._n_samples_per_index  # sum m_k = n_samples

    values *= n_samples

    # Round down and distribute remainder by adjusting the largest fractional parts
    # A naive implementation with e.g.
    #
    # m_k = [max(1, int(round(m * f(k)/sum(f(j) for j in range(n)), 0)))
    #         for k in range(n)]
    #
    # would not respect the total number of samples, and would not distribute
    # remainders correctly
    int_values: NDArray[np.int64] = np.floor(values).astype(np.int64)
    remainder = n_samples - np.sum(int_values)
    fractional_parts = values - int_values
    fractional_parts_indices = np.argsort(-fractional_parts, kind="stable")[
        :remainder
    ]
    int_values[fractional_parts_indices] += 1
    int_values.setflags(write=False)  # avoid accidental modification of cache
    return int_values

StratifiedPermutation `dataclass` ¶

StratifiedPermutation(
    idx: IndexT | None,
    subset: NDArray[int_],
    lower_bound: int,
    upper_bound: int,
)

Bases: Sample

A sample for the stratified permutation sampling strategy.

This is a subclass of Sample which adds information about the set sizes to sample. It is used by StratifiedPermutationEvaluationStrategy to clip permutations to the required lengths.

idx `instance-attribute` ¶

idx: IndexT | None

Index of current sample

lower_bound `instance-attribute` ¶

lower_bound: int

The lower bound for the set sizes.

subset `instance-attribute` ¶

subset: NDArray[int_]

Indices of current sample

upper_bound `instance-attribute` ¶

upper_bound: int

The upper bound for the set sizes.

hash ¶

__hash__()

This type must be hashable for the utility caching to work. We use hashlib.sha256 which is about 4-5x faster than hash(), and returns the same value in all processes, as opposed to hash() which is salted in each process

Source code in src/pydvl/valuation/types.py

def __hash__(self):
    """This type must be hashable for the utility caching to work.
    We use hashlib.sha256 which is about 4-5x faster than hash(), and returns the
    same value in all processes, as opposed to hash() which is salted in each
    process
    """
    sha256_hash = hashlib.sha256(self.subset.tobytes()).hexdigest()
    return int(sha256_hash, base=16)

__post_init__ ¶

__post_init__()

Ensure that the subset is a numpy array of integers.

Source code in src/pydvl/valuation/types.py

def __post_init__(self):
    """Ensure that the subset is a numpy array of integers."""
    try:
        self.__dict__["subset"] = to_numpy(self.subset)
    except Exception:
        raise TypeError(
            f"subset must be a numpy array, got {type(self.subset).__name__}"
        )
    if self.subset.size == 0:
        self.__dict__["subset"] = self.subset.astype(int)
    if not np.issubdtype(self.subset.dtype, np.integer):
        raise TypeError(
            f"subset must be a numpy array of integers, got {self.subset.dtype}"
        )

with_idx ¶

with_idx(idx: IndexT) -> Self

Return a copy of sample with idx changed.

Returns the original sample if idx is the same.

PARAMETER	DESCRIPTION
`idx`	New value for idx. TYPE: `IndexT`

RETURNS	DESCRIPTION
`Sample`	A copy of the sample with idx changed. TYPE: `Self`

Source code in src/pydvl/valuation/types.py

def with_idx(self, idx: IndexT) -> Self:
    """Return a copy of sample with idx changed.

    Returns the original sample if idx is the same.

    Args:
        idx: New value for idx.

    Returns:
        Sample: A copy of the sample with idx changed.
    """
    if self.idx == idx:
        return self

    return replace(self, idx=idx)

with_idx_in_subset ¶

with_idx_in_subset() -> Self

Return a copy of sample with idx added to the subset.

Returns the original sample if idx was already part of the subset.

RETURNS	DESCRIPTION
`Sample`	A copy of the sample with idx added to the subset. TYPE: `Self`

RAISES	DESCRIPTION
`ValueError`	If idx is None.

Source code in src/pydvl/valuation/types.py

def with_idx_in_subset(self) -> Self:
    """Return a copy of sample with idx added to the subset.

    Returns the original sample if idx was already part of the subset.

    Returns:
        Sample: A copy of the sample with idx added to the subset.

    Raises:
        ValueError: If idx is None.
    """
    if self.idx in self.subset:
        return self

    if self.idx is None:
        raise ValueError("Cannot add idx to subset if idx is None.")

    new_subset = array_concatenate([self.subset, np.array([self.idx])])
    return replace(self, subset=new_subset)

with_subset ¶

with_subset(subset: Array[IndexT]) -> Self

Return a copy of sample with the subset changed.

PARAMETER	DESCRIPTION
`subset`	New value for subset. TYPE: `Array[IndexT]`

RETURNS	DESCRIPTION
`Self`	A copy of the sample with subset changed.

Source code in src/pydvl/valuation/types.py

def with_subset(self, subset: Array[IndexT]) -> Self:
    """Return a copy of sample with the subset changed.

    Args:
        subset: New value for subset.

    Returns:
        A copy of the sample with subset changed.
    """
    return replace(self, subset=to_numpy(subset))

StratifiedPermutationEvaluationStrategy ¶

StratifiedPermutationEvaluationStrategy(
    sampler: PermutationSamplerBase,
    utility: UtilityBase,
    coefficient: SemivalueCoefficient | None,
)

Bases: PermutationEvaluationStrategy

Evaluation strategy for the StratifiedPermutationSampler.

Experimental

Source code in src/pydvl/valuation/samplers/permutation.py

def __init__(
    self,
    sampler: PermutationSamplerBase,
    utility: UtilityBase,
    coefficient: SemivalueCoefficient | None,
):
    super().__init__(utility, coefficient)
    self.truncation = copy(sampler.truncation)
    self.truncation.reset(utility)  # Perform initial setup (e.g. total_utility)

StratifiedPermutationSampler ¶

StratifiedPermutationSampler(
    sample_sizes: SampleSizeStrategy,
    truncation: TruncationPolicy | None = None,
    seed: Seed | None = None,
    batch_size: int = 1,
)

Bases: PermutationSampler

A stratified permutation sampler.

Experimental

This is just an approximation for now. The number of set sizes generated is only roughly equal to that specified by the SampleSizeStrategy. In particular, there is a single counter of sizes for all indices.

PARAMETER	DESCRIPTION
`sample_sizes`	An object which returns the number of samples to take for a given set size. If the strategy fixes a total number of samples for each set size, then this sampler will produce (approximately) that amount of samples for each index. If it does not, then the sampler will generate infinitely many samples, with set sizes following the distribution set by the strategy. TYPE: `SampleSizeStrategy`
`truncation`	A policy to stop the permutation early. TYPE: `TruncationPolicy \| None` DEFAULT: `None`
`seed`	Seed for the random number generator. TYPE: `Seed \| None` DEFAULT: `None`
`batch_size`	The number of samples (full permutations) to generate at once. TYPE: `int` DEFAULT: `1`

Source code in src/pydvl/valuation/samplers/stratified.py

def __init__(
    self,
    sample_sizes: SampleSizeStrategy,
    truncation: TruncationPolicy | None = None,
    seed: Seed | None = None,
    batch_size: int = 1,
):
    super().__init__(truncation, seed, batch_size)
    self.sample_sizes_strategy = sample_sizes
    logger.warning(
        "StratifiedPermutationSampler is experimental and inexact. "
        "Please use another sampler if you are benchmarking methods."
    )

interrupted `property` ¶

interrupted: bool

Whether the sampler has been interrupted.

len ¶

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES	DESCRIPTION
`TypeError`	if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py

def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

repr ¶

__repr__() -> str

FIXME: This is not a proper representation of the sampler.

Source code in src/pydvl/valuation/samplers/base.py

def __repr__(self) -> str:
    """FIXME: This is not a proper representation of the sampler."""
    return f"{self.__class__.__name__}"

complement_size ¶

complement_size(n: int) -> int

Size of the complement of an index wrt. set size n.

Required in certain coefficient computations. Even though we are sampling permutations, updates are always done per-index and the size of the complement is always \(n-1\).

Source code in src/pydvl/valuation/samplers/permutation.py

def complement_size(self, n: int) -> int:
    """Size of the complement of an index wrt. set size `n`.

    Required in certain coefficient computations. Even though we are sampling
    permutations, updates are always done per-index and the size of the complement
    is always $n-1$.
    """
    return n - 1

generate ¶

generate(indices: IndexSetT) -> SampleGenerator[StratifiedPermutation]

Generates the permutation samples.

These samples include information as to what sample sizes can be taken from the permutation by the strategy.

Info

This generator ignores skip_indices.

PARAMETER	DESCRIPTION
`indices`	The indices to sample from. If empty, no samples are generated. TYPE: `IndexSetT`

Source code in src/pydvl/valuation/samplers/stratified.py

def generate(self, indices: IndexSetT) -> SampleGenerator[StratifiedPermutation]:
    """Generates the permutation samples.

    These samples include information as to what sample sizes can be taken from the
    permutation by the strategy.

    !!! info
        This generator ignores `skip_indices`.

    Args:
        indices: The indices to sample from. If empty, no samples are generated.
    """
    n = len(indices)
    if n == 0:
        return

    sizes = self.sample_sizes_strategy.sample_sizes(n, probs=False)
    # FIXME: This is just an approximation. On expectation we should produce roughly
    #   the correct number of sizes per index, but we should probably keep track
    #   separately.
    sizes = sizes * n  # Need a copy because the method is @cached

    while True:
        # Can't have skip indices: if the index set is smaller than the lower bound
        # for the set sizes, the strategy's process() will always return empty
        # evaluations and the method might never stop with criteria depending on the
        # number of updates
        # _indices = np.setdiff1d(indices, self.skip_indices)

        positive = np.where(sizes > 0)[0]
        if len(positive) == 0:  # Restart
            sizes = self.sample_sizes_strategy.sample_sizes(n, probs=False)
            sizes = sizes * n
            continue
        lb, ub = int(positive[0]), int(positive[-1])
        # FIXME: do we really want to restrict the strategies to be monotonic?
        assert all(sizes[lb : ub + 1] > 0), "Sample size function must be monotonic"
        sizes = np.maximum(sizes - 1, 0)

        yield StratifiedPermutation(
            idx=None,
            subset=self._rng.permutation(indices),
            lower_bound=lb,
            upper_bound=ub,
        )

generate_batches ¶

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py

def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break

interrupt ¶

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py

def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

log_weight ¶

log_weight(n: int, subset_len: int) -> float

The probability of sampling a set of size subset_len from n indices.

See StratifiedSampler.log_weight()

PARAMETER	DESCRIPTION
`n`	Size of the index set. TYPE: `int`
`subset_len`	Size of the subset. TYPE: `int`

RETURNS	DESCRIPTION
`float`	The logarithm of the probability of having sampled a set of size `subset_len`.

Source code in src/pydvl/valuation/samplers/stratified.py

def log_weight(self, n: int, subset_len: int) -> float:
    """The probability of sampling a set of size `subset_len` from `n` indices.

    See
    [StratifiedSampler.log_weight()][pydvl.valuation.samplers.stratified.StratifiedSampler.log_weight]

    Args:
        n:  Size of the index set.
        subset_len: Size of the subset.

    Returns:
        The logarithm of the probability of having sampled a set of size
            `subset_len`.
    """
    effective_n = self.complement_size(n)
    p = self.sample_sizes_strategy.sample_sizes(effective_n, probs=True)
    p_k = p[subset_len]
    if p_k == 0:
        return -np.inf

    return float(np.log(p_k) - logcomb(effective_n, subset_len))

result_updater ¶

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns an object that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER	DESCRIPTION
`result`	The result to update TYPE: `ValuationResult`

Returns: A callable object that updates the result with a value update

Source code in src/pydvl/valuation/samplers/base.py

def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns an object that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

StratifiedSampler ¶

StratifiedSampler(
    sample_sizes: SampleSizeStrategy,
    sample_sizes_iteration: Type[
        SampleSizeIteration
    ] = FiniteSequentialSizeIteration,
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
    seed: Seed | None = None,
)

Bases: StochasticSamplerMixin, PowersetSampler

A sampler stratified by coalition size with variable number of samples per set size.

PARAMETER	DESCRIPTION
`sample_sizes`	An object which returns the number of samples to take for a given set size. If `index_iteration` below is finite, then the sampler will generate exactly as many samples of each size as returned by this object. If the iteration is infinite, then the `sample_sizes` will be used as probabilities of sampling. TYPE: `SampleSizeStrategy`
`sample_sizes_iteration`	How to loop over sample sizes. The main modes are: * deterministically. For every k generate m_k samples before moving to k+1. * stochastically. Sample sizes k according to the distribution given by `sample_sizes`. * round-robin. Iterate over k, and generate 1 sample each time, until reaching m_k. But more can be created by subclassing SampleSizeIteration. TYPE: `Type[SampleSizeIteration]` DEFAULT: `FiniteSequentialSizeIteration`
`batch_size`	The number of samples to generate per batch. Batches are processed together by each subprocess when working in parallel. TYPE: `int` DEFAULT: `1`
`index_iteration`	the strategy to use for iterating over indices to update. TYPE: `Type[IndexIteration]` DEFAULT: `FiniteSequentialIndexIteration`
`seed`	The seed for the random number generator. TYPE: `Seed \| None` DEFAULT: `None`

New in version 0.10.0

Source code in src/pydvl/valuation/samplers/stratified.py

def __init__(
    self,
    sample_sizes: SampleSizeStrategy,
    sample_sizes_iteration: Type[
        SampleSizeIteration
    ] = FiniteSequentialSizeIteration,
    batch_size: int = 1,
    index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
    seed: Seed | None = None,
):
    super().__init__(
        batch_size=batch_size, index_iteration=index_iteration, seed=seed
    )
    self.sample_sizes_strategy = sample_sizes
    self.sample_sizes_iteration = maybe_add_argument(sample_sizes_iteration, "seed")

interrupted `property` ¶

interrupted: bool

Whether the sampler has been interrupted.

skip_indices `property` `writable` ¶

skip_indices

Set of indices to skip in the outer loop.

len ¶

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES	DESCRIPTION
`TypeError`	if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py

def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

repr ¶

__repr__() -> str

FIXME: This is not a proper representation of the sampler.

Source code in src/pydvl/valuation/samplers/base.py

def __repr__(self) -> str:
    """FIXME: This is not a proper representation of the sampler."""
    return f"{self.__class__.__name__}"

generate_batches ¶

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py

def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break

index_iterable ¶

index_iterable(indices: IndexSetT) -> Generator[IndexT | None, None, None]

Iterates over indices with the method specified at construction.

Source code in src/pydvl/valuation/samplers/powerset.py

def index_iterable(
    self, indices: IndexSetT
) -> Generator[IndexT | None, None, None]:
    """Iterates over indices with the method specified at construction."""
    try:
        iterable = self._index_iterator_cls(indices, seed=self._rng)  # type: ignore
    except (AttributeError, TypeError):
        iterable = self._index_iterator_cls(indices)
    for idx in iterable:
        if idx not in self.skip_indices:
            yield idx

interrupt ¶

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py

def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

log_weight ¶

log_weight(n: int, subset_len: int) -> float

The probability of sampling a set of size k is 1/(n choose k) times the probability of the set having size k, which is the number of samples for that size divided by the total number of samples for all sizes:

\[P(S) = \binom{n}{k}^{-1} \ \frac{m_k}{m},\]

where \(m_k\) is the number of samples of size \(k\) and \(m\) is the total number of samples.

PARAMETER	DESCRIPTION
`n`	Size of the index set. TYPE: `int`
`subset_len`	Size of the subset. TYPE: `int`

Returns: The logarithm of the probability of having sampled a set of size subset_len.

Source code in src/pydvl/valuation/samplers/stratified.py

def log_weight(self, n: int, subset_len: int) -> float:
    r"""The probability of sampling a set of size k is 1/(n choose k) times the
    probability of the set having size k, which is the number of samples for that
    size divided by the total number of samples for all sizes:

    $$P(S) = \binom{n}{k}^{-1} \ \frac{m_k}{m},$$

    where $m_k$ is the number of samples of size $k$ and $m$ is the total number
    of samples.

    Args:
        n: Size of the index set.
        subset_len: Size of the subset.
    Returns:
        The logarithm of the probability of having sampled a set of size `subset_len`.
    """

    effective_n = self.complement_size(n)

    # Note that we can simplify the quotient
    # $$ \frac{m_k}{m} =
    #    \frac{m \frac{f (k)}{\sum_j f (j)}}{m} = \frac{f(k)}{\sum_j f (j)} $$
    # so that in the weight computation we can use the function $f$ directly from
    # the strategy, or equivalently, call `sample_sizes(n, probs=True)`.
    # This is useful for the stochastic iteration, where we are given sampling
    # frequencies for each size instead of counts, and the total number of samples
    # m is 1, so that quantization would yield a bunch of zeros.
    p = self.sample_sizes_strategy.sample_sizes(effective_n, probs=True)
    p_k = p[subset_len]  # also m_k / m
    if p_k == 0:
        return -np.inf

    return float(np.log(p_k) - logcomb(effective_n, subset_len))

result_updater ¶

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns an object that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER	DESCRIPTION
`result`	The result to update TYPE: `ValuationResult`

Returns: A callable object that updates the result with a value update

Source code in src/pydvl/valuation/samplers/base.py

def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns an object that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

VRDSSampler ¶

VRDSSampler(
    n_samples_per_index: int, batch_size: int = 1, seed: Seed | None = None
)

Bases: StratifiedSampler

A sampler stratified by coalition size with variable number of samples per set size.

This sampler iterates once per index and generates a fixed mount of subsets of each size in its complement.

This is a convenience subclass of StratifiedSampler which implements the VRDS heuristic from Wu et al. (2023)².

It is functionally equivalent to a StratifiedSampler with HarmonicSampleSize, FiniteSequentialSizeIteration, and FiniteSequentialIndexIteration.

PARAMETER	DESCRIPTION
`n_samples_per_index`	The number of samples to generate per index. To compute with the (ε,δ) bound from the paper, use min_samples(). The distribution per set size will follow a harmonic function, as defined in HarmonicSampleSize. TYPE: `int`
`batch_size`	The number of samples to generate per batch. Batches are processed together by each subprocess when working in parallel. TYPE: `int` DEFAULT: `1`
`seed`	The seed for the random number generator. TYPE: `Seed \| None` DEFAULT: `None`

Source code in src/pydvl/valuation/samplers/stratified.py

def __init__(
    self,
    n_samples_per_index: int,
    batch_size: int = 1,
    seed: Seed | None = None,
):
    super().__init__(
        sample_sizes=HarmonicSampleSize(n_samples=n_samples_per_index),
        sample_sizes_iteration=FiniteSequentialSizeIteration,
        batch_size=batch_size,
        index_iteration=FiniteSequentialIndexIteration,
        seed=seed,
    )

interrupted `property` ¶

interrupted: bool

Whether the sampler has been interrupted.

skip_indices `property` `writable` ¶

skip_indices

Set of indices to skip in the outer loop.

len ¶

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES	DESCRIPTION
`TypeError`	if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py

def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len

repr ¶

__repr__() -> str

FIXME: This is not a proper representation of the sampler.

Source code in src/pydvl/valuation/samplers/base.py

def __repr__(self) -> str:
    """FIXME: This is not a proper representation of the sampler."""
    return f"{self.__class__.__name__}"

generate_batches ¶

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py

def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break

index_iterable ¶

index_iterable(indices: IndexSetT) -> Generator[IndexT | None, None, None]

Iterates over indices with the method specified at construction.

Source code in src/pydvl/valuation/samplers/powerset.py

def index_iterable(
    self, indices: IndexSetT
) -> Generator[IndexT | None, None, None]:
    """Iterates over indices with the method specified at construction."""
    try:
        iterable = self._index_iterator_cls(indices, seed=self._rng)  # type: ignore
    except (AttributeError, TypeError):
        iterable = self._index_iterator_cls(indices)
    for idx in iterable:
        if idx not in self.skip_indices:
            yield idx

interrupt ¶

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py

def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

log_weight ¶

log_weight(n: int, subset_len: int) -> float

The probability of sampling a set of size k is 1/(n choose k) times the probability of the set having size k, which is the number of samples for that size divided by the total number of samples for all sizes:

\[P(S) = \binom{n}{k}^{-1} \ \frac{m_k}{m},\]

where \(m_k\) is the number of samples of size \(k\) and \(m\) is the total number of samples.

PARAMETER	DESCRIPTION
`n`	Size of the index set. TYPE: `int`
`subset_len`	Size of the subset. TYPE: `int`

Returns: The logarithm of the probability of having sampled a set of size subset_len.

Source code in src/pydvl/valuation/samplers/stratified.py

def log_weight(self, n: int, subset_len: int) -> float:
    r"""The probability of sampling a set of size k is 1/(n choose k) times the
    probability of the set having size k, which is the number of samples for that
    size divided by the total number of samples for all sizes:

    $$P(S) = \binom{n}{k}^{-1} \ \frac{m_k}{m},$$

    where $m_k$ is the number of samples of size $k$ and $m$ is the total number
    of samples.

    Args:
        n: Size of the index set.
        subset_len: Size of the subset.
    Returns:
        The logarithm of the probability of having sampled a set of size `subset_len`.
    """

    effective_n = self.complement_size(n)

    # Note that we can simplify the quotient
    # $$ \frac{m_k}{m} =
    #    \frac{m \frac{f (k)}{\sum_j f (j)}}{m} = \frac{f(k)}{\sum_j f (j)} $$
    # so that in the weight computation we can use the function $f$ directly from
    # the strategy, or equivalently, call `sample_sizes(n, probs=True)`.
    # This is useful for the stochastic iteration, where we are given sampling
    # frequencies for each size instead of counts, and the total number of samples
    # m is 1, so that quantization would yield a bunch of zeros.
    p = self.sample_sizes_strategy.sample_sizes(effective_n, probs=True)
    p_k = p[subset_len]  # also m_k / m
    if p_k == 0:
        return -np.inf

    return float(np.log(p_k) - logcomb(effective_n, subset_len))

min_samples `staticmethod` ¶

min_samples(n_indices: int, eps: float = 0.01, delta: float = 0.05) -> int

Computes the minimal amount of samples for an (ε,δ)-approximation of data Shapley.

This is the bound shown in Theorem 4.3 of Wu et al. (2023)².

PARAMETER	DESCRIPTION
`n_indices`	The number of indices in the index set. TYPE: `int`
`eps`	The epsilon value in the epsilon-delta guarantee, i.e. the distance to the true value. TYPE: `float` DEFAULT: `0.01`
`delta`	The delta value in the epsilon-delta guarantee, i.e. the probability of failure. TYPE: `float` DEFAULT: `0.05`

Returns: \((2 \log(2/\delta) / \epsilon^2) \log(n + 1)^2.\)

Source code in src/pydvl/valuation/samplers/stratified.py

@staticmethod
def min_samples(n_indices: int, eps: float = 0.01, delta: float = 0.05) -> int:
    r"""Computes the minimal amount of samples for an (ε,δ)-approximation of data
    Shapley.

    This is the bound shown in Theorem 4.3 of Wu et al. (2023)<sup><a
    href="#wu_variance_2023">2</a></sup>.

    Args:
        n_indices: The number of indices in the index set.
        eps: The epsilon value in the epsilon-delta guarantee, i.e. the distance to the
            true value.
        delta: The delta value in the epsilon-delta guarantee, i.e. the probability of
            failure.
    Returns:
        $(2 \log(2/\delta) / \epsilon^2) \log(n + 1)^2.$
    """
    m = 2 * (np.log(2) - np.log(delta)) / eps**2
    m *= (np.log(n_indices) + 1) ** 2
    return int(np.ceil(m).item())

result_updater ¶

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns an object that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER	DESCRIPTION
`result`	The result to update TYPE: `ValuationResult`

Returns: A callable object that updates the result with a value update

Source code in src/pydvl/valuation/samplers/base.py

def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns an object that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)

pydvl.valuation.samplers.stratified ¶

Choosing set size heuristics¶

Variance Reduced Stratified Sampler (VRDS)¶

Iterating over indices and its effect on 'n_samples'¶

Iterating over set sizes¶

Other known strategies¶

References¶

ConstantSampleSize ¶

effective_bounds ¶

n_samples_per_index ¶

sample_sizes cached ¶

DeltaShapleyNCSGDConfig dataclass ¶

DeltaShapleyNCSGDSampleSize ¶

effective_bounds ¶

fun ¶

n_samples_per_index ¶

sample_sizes cached ¶

FiniteSequentialSizeIteration ¶

GroupTestingSampleSize ¶

effective_bounds ¶

n_samples_per_index ¶

sample_sizes cached ¶

HarmonicSampleSize ¶

effective_bounds ¶

n_samples_per_index ¶

sample_sizes cached ¶

LinearSampleSize ¶

effective_bounds ¶

n_samples_per_index ¶

sample_sizes cached ¶

PowerLawSampleSize ¶

effective_bounds ¶

n_samples_per_index ¶

sample_sizes cached ¶

RandomSizeIteration ¶

RoundRobinSizeIteration ¶

SampleSizeIteration ¶

SampleSizeStrategy ¶

effective_bounds ¶

fun abstractmethod ¶

n_samples_per_index ¶

sample_sizes cached ¶

StratifiedPermutation dataclass ¶

idx instance-attribute ¶

lower_bound instance-attribute ¶

subset instance-attribute ¶

upper_bound instance-attribute ¶

__hash__ ¶

__post_init__ ¶

with_idx ¶

with_idx_in_subset ¶

with_subset ¶

StratifiedPermutationEvaluationStrategy ¶

StratifiedPermutationSampler ¶

interrupted property ¶

__len__ ¶

__repr__ ¶

complement_size ¶

generate ¶

generate_batches ¶

interrupt ¶

log_weight ¶

result_updater ¶

StratifiedSampler ¶

interrupted property ¶

skip_indices property writable ¶

__len__ ¶

__repr__ ¶

generate_batches ¶

index_iterable ¶

interrupt ¶

log_weight ¶

result_updater ¶

VRDSSampler ¶

interrupted property ¶

skip_indices property writable ¶

__len__ ¶

__repr__ ¶

generate_batches ¶

index_iterable ¶

sample_sizes `cached` ¶

DeltaShapleyNCSGDConfig `dataclass` ¶

sample_sizes `cached` ¶

sample_sizes `cached` ¶

sample_sizes `cached` ¶

sample_sizes `cached` ¶

sample_sizes `cached` ¶

fun `abstractmethod` ¶

sample_sizes `cached` ¶

StratifiedPermutation `dataclass` ¶

idx `instance-attribute` ¶

lower_bound `instance-attribute` ¶

subset `instance-attribute` ¶

upper_bound `instance-attribute` ¶

hash ¶

interrupted `property` ¶

len ¶

repr ¶

interrupted `property` ¶

skip_indices `property` `writable` ¶

len ¶

repr ¶

interrupted `property` ¶

skip_indices `property` `writable` ¶

len ¶

repr ¶

min_samples `staticmethod` ¶