pydvl.valuation.samplers.classwise ¶

Class-wise sampler for the class-wise Shapley valuation method.

The class-wise Shapley method, introduced by Schoch et al., 2022¹, uses a so-called set-conditional marginal Shapley value that requires selectively sampling subsets of data points with the same or a different class from that of the data point of interest.

This sampling scheme is divided into an outer and an inner sampler.

The outer sampler is any subclass of PowersetSampler. It generates subsets of the complement set of the data point of interest, restricted to points with a different label (so-called "out-of-class" samples, denoted by \(S_{-y_i}\) in this documentation).
The inner sampler is any subclass of IndexSampler, typically (and in the paper) a PermutationSampler. It returns so-called "in-class" samples (denoted by \(S_{y_i}\) in this documentation) from the set \(N_{y_i}\), i.e., the set of all indices with the same label as the data point of interest.

Restricting the number of inner samples

Because of the nested sampling procedure, the number of in-class samples must be limited to a finite number. This is done by setting the max_in_class_samples parameter. For finite samplers, it can be left as None to let them run until completion.
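The nested scheme described above can be illustrated with a small standard-library sketch. It does not use pyDVL: the function name `classwise_samples`, the coin-flip subset draw, and the tuple layout are assumptions made for illustration only. The real samplers are configurable and yield ClasswiseSample objects.

```python
import random
from itertools import islice


def classwise_samples(labels, max_in_class_samples, seed=0):
    """Yield (idx, label, in_class_subset, out_of_class_subset) tuples.

    `labels` holds one class label per data point; positions are the indices.
    This is a hypothetical stand-in, not pyDVL's implementation.
    """
    rng = random.Random(seed)
    indices = range(len(labels))
    for label in sorted(set(labels)):
        in_class = [i for i in indices if labels[i] == label]  # N_{y_i}
        out_of_class = [i for i in indices if labels[i] != label]
        # Outer step: draw one random out-of-class subset S_{-y_i}.
        ooc_subset = [i for i in out_of_class if rng.random() < 0.5]
        # Inner step: walk a random permutation of the in-class indices.
        # The prefix before each index plays the role of S_{y_i}.
        permutation = rng.sample(in_class, len(in_class))
        inner = ((idx, permutation[:pos]) for pos, idx in enumerate(permutation))
        # Cap the inner samples; islice(..., None) runs to completion.
        for idx, ic_subset in islice(inner, max_in_class_samples):
            yield idx, label, ic_subset, ooc_subset
```

With `max_in_class_samples=None` the finite inner walk runs to completion, which mirrors the remark about finite samplers above.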

Info

For more information on the class-wise Shapley method, as well as a summary of the reproduction results by Semmler and de Benito Delgado (2024)² see the main documentation for the method.

References¶

1. Schoch, Stephanie, Haifeng Xu, and Yangfeng Ji. CS-Shapley: Class-wise Shapley Values for Data Valuation in Classification. In Proc. of the Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS). New Orleans, Louisiana, USA, 2022.
2. Semmler, Markus, and Miguel de Benito Delgado. [Re] Classwise-Shapley Values for Data Valuation. Transactions on Machine Learning Research, July 2024.

ClasswiseSampler ¶

ClasswiseSampler(
    in_class: IndexSampler,
    out_of_class: PowersetSampler,
    *,
    max_in_class_samples: int | None = None,
    min_elements_per_label: int = 1,
    batch_size: int = 1,
)

Bases: IndexSampler[ClasswiseSample, ValueUpdate]

A sampler that samples elements from a dataset in two steps, based on the labels.

It first samples out-of-class indices (training points with a different label from the point of interest), and then, for each out-of-class sample, in-class indices (training points with the same label as the point of interest).

Used by the class-wise Shapley valuation method.

PARAMETER	DESCRIPTION
`in_class`	Sampling scheme for elements of a given label (inner sampler). Typically, a PermutationSampler. TYPE: `IndexSampler`
`out_of_class`	Sampling scheme for elements of different labels (outer sampler). E.g. a UniformSampler or a VRDSSampler. This sampler must use NoIndexIteration or any subclass thereof. TYPE: `PowersetSampler`
`max_in_class_samples`	Maximum number of in-class samples to generate per outer iteration. Leave as `None` to sample all in-class samples when using finite samplers. This must be set to a positive integer when using an infinite in-class sampler. TYPE: `int \| None` DEFAULT: `None`
`min_elements_per_label`	Minimum number of elements per label to sample from the complement set, i.e., out of class elements. TYPE: `int` DEFAULT: `1`
`batch_size`	Number of samples to generate in each batch. TYPE: `int` DEFAULT: `1`

Source code in src/pydvl/valuation/samplers/classwise.py

def __init__(
    self,
    in_class: IndexSampler,
    out_of_class: PowersetSampler,
    *,
    max_in_class_samples: int | None = None,
    min_elements_per_label: int = 1,
    batch_size: int = 1,
):
    super().__init__(batch_size=batch_size)
    # By default, powerset samplers remove the index from the generated
    # subset but in this case we want all indices.
    # The index for which we compute the value will be removed by
    # the in_class sampler instead.
    if not issubclass(out_of_class._index_iterator_cls, NoIndexIteration):
        raise ValueError(
            "The out-of-class sampler must use NoIndexIteration or any subclass. "
            f"It is currently {out_of_class._index_iterator_cls.__name__}."
        )

    self.in_class = in_class
    self.out_of_class = out_of_class
    self.min_elements_per_label = min_elements_per_label
    self.max_in_class_samples = max_in_class_samples
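The constructor's precondition on the outer sampler can be tried in isolation. The sketch below uses hypothetical stand-in classes (`ToyPowersetSampler`, `check_out_of_class`, and local copies of the iteration class names) rather than pyDVL's own, and only mirrors the `issubclass` check from the listing above.

```python
class SequentialIndexIteration:
    """Stand-in for a strategy that visits every index in turn."""


class NoIndexIteration:
    """Stand-in for the strategy that does not iterate over indices."""


class ToyPowersetSampler:
    """Hypothetical outer sampler that only records its iteration strategy."""

    def __init__(self, index_iteration=SequentialIndexIteration):
        self._index_iterator_cls = index_iteration


def check_out_of_class(sampler):
    """Reject outer samplers that would remove an index from their subsets."""
    if not issubclass(sampler._index_iterator_cls, NoIndexIteration):
        raise ValueError(
            "The out-of-class sampler must use NoIndexIteration or any subclass. "
            f"It is currently {sampler._index_iterator_cls.__name__}."
        )
    return sampler
```

Under these assumptions, a sampler built with the default strategy is rejected, while one built with `index_iteration=NoIndexIteration` passes the check.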

interrupted `property` ¶

interrupted: bool

Whether the sampler has been interrupted.

skip_indices `property` `writable` ¶

skip_indices: IndexSetT

Indices being skipped in the sampler. The exact behaviour will be sampler-dependent, so that setting this property is disabled by default.

__len__ ¶

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES	DESCRIPTION
`TypeError`	if the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py

def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len
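Because `__len__` raises `TypeError` when no length is known, callers that want an optional total (for a progress bar, say) may prefer to guard the call. A minimal sketch, where `ToySampler` and `length_or_none` are hypothetical names and not part of pyDVL:

```python
class ToySampler:
    """Hypothetical sampler that mimics the length behaviour shown above."""

    def __init__(self):
        self._len = None  # unknown until sample generation has started

    def __len__(self):
        if self._len is None:
            raise TypeError(f"This {self.__class__.__name__} has no length")
        return self._len


def length_or_none(sampler):
    """Return len(sampler), or None if the sampler reports no length."""
    try:
        return len(sampler)
    except TypeError:
        return None
```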

__repr__ ¶

__repr__() -> str

FIXME: This is not a proper representation of the sampler.

Source code in src/pydvl/valuation/samplers/base.py

def __repr__(self) -> str:
    """FIXME: This is not a proper representation of the sampler."""
    return f"{self.__class__.__name__}"

batches_from_data ¶

batches_from_data(data: Dataset) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/classwise.py

def batches_from_data(self, data: Dataset) -> BatchGenerator:
    """Batches the samples and yields them."""
    try:
        self._len = self.sample_limit(data.indices)
    except AttributeError:
        pass

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(data) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.samples_from_data(data), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self.interrupted:
            break
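The batching loop above stops only after the current batch has been handed to the consumer. The following standard-library sketch reproduces that control flow. `ToyBatcher` is a hypothetical class, and the local `chunked` helper is an assumption standing in for the helper of the same name used in the listing.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, size)):
        yield batch


class ToyBatcher:
    """Hypothetical batcher with the same interrupt semantics as above."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.interrupted = False
        self.n_samples = 0

    def interrupt(self):
        self.interrupted = True

    def batches(self, samples):
        # Reset state at the start of every generation run.
        self.interrupted = False
        self.n_samples = 0
        for batch in chunked(samples, self.batch_size):
            self.n_samples += len(batch)
            yield batch
            # The flag is checked only after the batch was consumed.
            if self.interrupted:
                break
```

Calling `interrupt()` while processing a batch therefore lets that batch complete and prevents the next one from being generated.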

generate ¶

generate(indices: IndexSetT) -> SampleGenerator

Not supported: this sampler is driven through batches_from_data (called from_data in the docstring) instead of generate_batches, so it cannot sample from indices directly. Calling this method raises AttributeError.

Source code in src/pydvl/valuation/samplers/classwise.py

def generate(self, indices: IndexSetT) -> SampleGenerator:
    """This is not needed because this sampler is used by calling the `from_data`
    method instead of the `generate_batches` method."""
    raise AttributeError("Cannot sample from indices directly.")

interrupt ¶

interrupt()

Signals the sampler to stop generating samples after the current batch.

Source code in src/pydvl/valuation/samplers/base.py

def interrupt(self):
    """Signals the sampler to stop generating samples after the current batch."""
    self._interrupted = True

log_weight ¶

log_weight(n: int, subset_len: int) -> float

CW-Shapley uses the evaluation strategy from the in-class sampler, so this method should never be called.

Source code in src/pydvl/valuation/samplers/classwise.py

def log_weight(self, n: int, subset_len: int) -> float:
    """CW-Shapley uses the evaluation strategy from the in-class sampler, so this
    method should never be called."""
    raise AttributeError("The weight should come from the in-class sampler")

result_updater ¶

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns an object that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER	DESCRIPTION
`result`	The result to update TYPE: `ValuationResult`

RETURNS	DESCRIPTION
`ResultUpdater[ValueUpdateT]`	A callable object that updates the result with a value update.

Source code in src/pydvl/valuation/samplers/base.py

def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns an object that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)
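To give an idea of what maintaining running first and second moments involves, here is a sketch of Welford's online algorithm. This is a swapped-in, linear-space technique chosen for illustration. It is not the log-space LogResultUpdater, and the class name `RunningMoments` is hypothetical.

```python
class RunningMoments:
    """Welford's online mean and variance (illustrative, linear space)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        # Uses the mean both before and after the update.
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Unbiased sample variance; 0.0 with fewer than two updates."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0
```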

samples_from_data ¶

samples_from_data(data: Dataset) -> SampleGenerator

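Generates class-wise samples from the dataset. Batching is done by batches_from_data.

PARAMETER	DESCRIPTION
`data`	The dataset to sample from. TYPE: `Dataset`

RAISES	DESCRIPTION
`ValueError`	If the in-class sampler is infinite and max_in_class_samples is None.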

Source code in src/pydvl/valuation/samplers/classwise.py

def samples_from_data(self, data: Dataset) -> SampleGenerator:
    """Generates batches of class-wise samples from the dataset.
    Args:
        data: The dataset to sample from.
    """
    labels = get_unique_labels(data.data().y)
    n_labels = len(labels)

    if (
        self.max_in_class_samples is None
        and self.in_class.sample_limit(data.indices) is None
    ):
        raise ValueError(
            f"When using an infinite in-class sampler ({str(self.in_class)}), "
            f"ClasswiseSampler.max_in_class_samples must be set to a positive integer "
            f"upon construction."
        )

    out_of_class_batch_generators = {}

    for label in labels:
        without_label = np.where(data.data().y != label)[0]
        out_of_class_batch_generators[label] = self.out_of_class.generate_batches(
            without_label
        )

    for label, ooc_batch in roundrobin(out_of_class_batch_generators):
        if self.interrupted:
            return
        for ooc_sample in ooc_batch:
            if self.min_elements_per_label > 0:
                # We make sure that we have at least
                # `min_elements_per_label` elements per label per sample
                n_unique_sample_labels = len(
                    get_unique_labels(data.data().y[ooc_sample.subset])
                )
                if n_unique_sample_labels < n_labels - 1:
                    continue

            with_label = np.where(data.data().y == label)[0]
            for ic_sample in flatten(self.in_class.generate_batches(with_label)):
                yield ClasswiseSample(
                    idx=ic_sample.idx,
                    label=label,
                    subset=ic_sample.subset,
                    ooc_subset=ooc_sample.subset,
                )
                if (
                    self.max_in_class_samples is not None
                    and self.in_class.n_samples >= self.max_in_class_samples
                ) or self.in_class.interrupted:
                    break
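The filter applied to each out-of-class sample in the listing above compares the number of distinct labels in the subset with `n_labels - 1`. A stand-alone sketch of that comparison follows; the function name `covers_other_labels` and the list-based label lookup are assumptions for illustration.

```python
def covers_other_labels(labels, subset, label):
    """Check whether `subset` holds every label other than `label`.

    `labels` maps positions (indices) to class labels, and `subset` is
    assumed to contain only out-of-class indices, as in the listing above.
    """
    other_labels = set(labels) - {label}
    subset_labels = {labels[i] for i in subset}
    # Same comparison as `n_unique_sample_labels < n_labels - 1`, negated.
    return len(subset_labels) >= len(other_labels)
```

As in the listing, only the number of distinct labels is compared. Subsets that miss one of the other labels are skipped before any in-class sample is drawn.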

get_unique_labels ¶

get_unique_labels(arr: Array[DT]) -> Array[DT]

Returns unique labels in a categorical dataset.

PARAMETER	DESCRIPTION
`arr`	The input array to find unique labels from. It should be of categorical types such as Object, String, Unicode, Unsigned integer, Signed integer, or Boolean. TYPE: `Array[DT]`

RETURNS	DESCRIPTION
`Array[DT]`	An array of unique labels.

RAISES	DESCRIPTION
`ValueError`	If the input array is not of a categorical type.

Source code in src/pydvl/valuation/samplers/classwise.py

def get_unique_labels(arr: Array[DT]) -> Array[DT]:
    """Returns unique labels in a categorical dataset.

    Args:
        arr: The input array to find unique labels from. It should be of
             categorical types such as Object, String, Unicode, Unsigned
             integer, Signed integer, or Boolean.

    Returns:
        An array of unique labels.

    Raises:
        ValueError: If the input array is not of a categorical type.
    """
    if is_categorical(arr):
        return cast(Array[DT], array_unique(arr))
    else:
        raise ValueError(
            f"Input array has an unsupported data type for categorical labels: {type(arr)}. "
            "Expected types: Object, String, Unicode, Unsigned integer, Signed integer, or Boolean."
        )

roundrobin ¶

roundrobin(
    batch_generators: Mapping[U, Iterable[V]],
) -> Generator[tuple[U, V], None, None]

Take samples from batch generators in order until all of them are exhausted.

This was heavily inspired by the roundrobin recipe in the official Python documentation for the itertools package.

Examples:

>>> from pydvl.valuation.samplers.classwise import roundrobin
>>> list(roundrobin({"A": "123", "B": "456"}))
[('A', '1'), ('B', '4'), ('A', '2'), ('B', '5'), ('A', '3'), ('B', '6')]

PARAMETER	DESCRIPTION
`batch_generators`	dictionary mapping labels to batch generators. TYPE: `Mapping[U, Iterable[V]]`

RETURNS	DESCRIPTION
`Generator[tuple[U, V], None, None]`	Combined generators, yielding (label, item) pairs.

Source code in src/pydvl/valuation/samplers/classwise.py

def roundrobin(
    batch_generators: Mapping[U, Iterable[V]],
) -> Generator[tuple[U, V], None, None]:
    """Take samples from batch generators in order until all of them are exhausted.

    This was heavily inspired by the roundrobin recipe
    in the official Python documentation for the itertools package.

    Examples:
        >>> from pydvl.valuation.samplers.classwise import roundrobin
        >>> list(roundrobin({"A": "123", "B": "456"}))
        [('A', '1'), ('B', '4'), ('A', '2'), ('B', '5'), ('A', '3'), ('B', '6')]

    Args:
        batch_generators: dictionary mapping labels to batch generators.

    Returns:
        Combined generators
    """
    n_active = len(batch_generators)
    remaining_generators = cycle(
        (label, iter(it).__next__) for label, it in batch_generators.items()
    )
    while n_active:
        try:
            for label, next_generator in remaining_generators:
                yield label, next_generator()
        except StopIteration:
            # Remove the iterator we just exhausted from the cycle.
            n_active -= 1
            remaining_generators = cycle(islice(remaining_generators, n_active))
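The behaviour with iterables of different lengths is easiest to see on a small input. The sketch below is a self-contained copy of the same logic under the hypothetical name `roundrobin_sketch`, so that it runs without pyDVL installed.

```python
from itertools import cycle, islice


def roundrobin_sketch(batch_generators):
    """Yield (label, item) pairs, taking one item per label in turn."""
    n_active = len(batch_generators)
    remaining = cycle(
        (label, iter(it).__next__) for label, it in batch_generators.items()
    )
    while n_active:
        try:
            for label, next_item in remaining:
                yield label, next_item()
        except StopIteration:
            # Drop the iterator that was just exhausted from the cycle.
            n_active -= 1
            remaining = cycle(islice(remaining, n_active))


pairs = list(roundrobin_sketch({"A": "1", "B": "23", "C": "456"}))
# pairs == [("A", "1"), ("B", "2"), ("C", "4"), ("B", "3"), ("C", "5"), ("C", "6")]
```

Exhausted labels drop out of the rotation while the remaining ones keep their relative order.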
