pydvl.valuation.samplers.classwise

Class-wise sampler for the class-wise Shapley valuation method.

The class-wise Shapley method, introduced by Schoch et al. (2022) [1], uses a so-called set-conditional marginal Shapley value. Computing it requires selectively sampling subsets of data points that have either the same class as the data point of interest or a different one.

This sampling scheme is divided into an outer and an inner sampler. The outer one is any subclass of PowersetSampler that generates subsets of the complement set of the data point of interest. The inner sampler is any subclass of IndexSampler, typically (and in the paper) a PermutationSampler.

References


  1. Schoch, Stephanie, Haifeng Xu, and Yangfeng Ji. CS-Shapley: Class-wise Shapley Values for Data Valuation in Classification. In Proc. of the Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS). New Orleans, Louisiana, USA, 2022. 

ClasswiseSampler

ClasswiseSampler(
    in_class: IndexSampler,
    out_of_class: PowersetSampler,
    *,
    min_elements_per_label: int = 1,
    batch_size: int = 1,
)

Bases: IndexSampler

A sampler that samples elements from a dataset in two steps, based on the labels.

It samples out-of-class indices (training points whose label differs from that of the point of interest) and in-class indices (training points with the same label as the point of interest), both taken from the complement of the point of interest.

Used by the class-wise Shapley valuation method.
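
The two-step idea can be illustrated with a minimal, standard-library sketch. It is not pyDVL's implementation: the helper name `classwise_samples`, the uniform choice of subset size, and the use of `random` are assumptions made purely for illustration, and `min_elements_per_label` is not modelled.

```python
import random


def classwise_samples(labels, seed=0):
    """Hypothetical helper: for each label, yield
    (label, out-of-class subset, in-class permutation)."""
    rng = random.Random(seed)

    # Group the indices by label, keeping first-seen label order.
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)

    for label, in_class in by_label.items():
        out_of_class = [i for i, other in enumerate(labels) if other != label]

        # Outer step (assumed scheme): a random subset of out-of-class indices.
        size = rng.randint(0, len(out_of_class))
        subset = sorted(rng.sample(out_of_class, size))

        # Inner step: a random permutation of the in-class indices.
        permutation = list(in_class)
        rng.shuffle(permutation)

        yield label, subset, permutation


labels = ["a", "a", "b", "b", "b"]
for label, subset, permutation in classwise_samples(labels, seed=1):
    print(label, subset, permutation)
```

Each yielded triple pairs one out-of-class subset with one ordering of the in-class points, which is the kind of sample a set-conditional marginal value is evaluated on.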

PARAMETER DESCRIPTION
in_class

Sampling scheme for elements of a given label.

TYPE: IndexSampler

out_of_class

Sampling scheme for elements of different labels, i.e., the complement set.

TYPE: PowersetSampler

min_elements_per_label

Minimum number of elements per label to sample from the complement set, i.e., from the out-of-class elements.

TYPE: int DEFAULT: 1

Source code in src/pydvl/valuation/samplers/classwise.py
def __init__(
    self,
    in_class: IndexSampler,
    out_of_class: PowersetSampler,
    *,
    min_elements_per_label: int = 1,
    batch_size: int = 1,
):
    super().__init__(batch_size=batch_size)
    self.in_class = in_class
    self.out_of_class = out_of_class
    self.min_elements_per_label = min_elements_per_label

skip_indices property writable

skip_indices: IndexSetT

Indices being skipped in the sampler. The exact behaviour is sampler-dependent, so setting this property is disabled by default.

__len__

__len__() -> int

Returns the length of the current sample generation in generate_batches.

RAISES DESCRIPTION
TypeError

If the sampler is infinite or generate_batches has not been called yet.

Source code in src/pydvl/valuation/samplers/base.py
def __len__(self) -> int:
    """Returns the length of the current sample generation in generate_batches.

    Raises:
        `TypeError`: if the sampler is infinite or
            [generate_batches][pydvl.valuation.samplers.IndexSampler.generate_batches]
            has not been called yet.
    """
    if self._len is None:
        raise TypeError(f"This {self.__class__.__name__} has no length")
    return self._len
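
Because `len()` raises `TypeError` until a length is known, calling code may want to guard the call. The sketch below uses a hypothetical `ToySampler` that mimics only the `_len` bookkeeping shown above; the `start` method and the `safe_len` helper are illustrative names, not part of pyDVL.

```python
class ToySampler:
    """Hypothetical stand-in reproducing only the `_len` logic."""

    def __init__(self):
        self._len = None  # unknown until generation starts

    def start(self, n_samples):
        # Stand-in for generate_batches, which sets the length.
        # Passing None models a sampler without a finite length.
        self._len = n_samples

    def __len__(self) -> int:
        if self._len is None:
            raise TypeError(f"This {self.__class__.__name__} has no length")
        return self._len


def safe_len(sampler):
    """Return len(sampler), or None when the sampler has no length."""
    try:
        return len(sampler)
    except TypeError:
        return None


sampler = ToySampler()
print(safe_len(sampler))  # None: generation has not started
sampler.start(8)
print(safe_len(sampler))  # 8
```

A guard of this kind is useful, for example, when deciding whether a progress indicator can show a total.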

generate_batches

generate_batches(indices: IndexSetT) -> BatchGenerator

Batches the samples and yields them.

Source code in src/pydvl/valuation/samplers/base.py
def generate_batches(self, indices: IndexSetT) -> BatchGenerator:
    """Batches the samples and yields them."""
    self._len = self.sample_limit(indices)

    # Create an empty generator if the indices are empty: `return` acts like a
    # `break`, and produces an empty generator.
    if len(indices) == 0:
        return

    self._interrupted = False
    self._n_samples = 0
    for batch in chunked(self.generate(indices), self.batch_size):
        self._n_samples += len(batch)
        yield batch
        if self._interrupted:
            break
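
The batching and interruption logic can be tried in isolation with the sketch below. `ToyBatcher` is a hypothetical class, and the local `chunked` is a standard-library stand-in assumed to behave like `more_itertools.chunked`; the sketch also leaves out the `sample_limit` bookkeeping.

```python
from itertools import islice


def chunked(iterable, n):
    """Stand-in for a chunking helper: yields lists of up to n items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch


class ToyBatcher:
    """Hypothetical sampler that simply yields the indices it is given."""

    def __init__(self, batch_size=1):
        self.batch_size = batch_size
        self._interrupted = False
        self._n_samples = 0

    def interrupt(self):
        self._interrupted = True

    def generate(self, indices):
        yield from indices

    def generate_batches(self, indices):
        if len(indices) == 0:
            return  # empty generator
        self._interrupted = False
        self._n_samples = 0
        for batch in chunked(self.generate(indices), self.batch_size):
            self._n_samples += len(batch)
            yield batch
            # Checked only after the consumer has processed the batch.
            if self._interrupted:
                break


batcher = ToyBatcher(batch_size=2)
print(list(batcher.generate_batches([0, 1, 2, 3, 4])))  # [[0, 1], [2, 3], [4]]
```

Since the flag is checked after each `yield`, an interrupt issued while a batch is being processed stops generation before the next batch is produced.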

result_updater

result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]

Returns a callable that updates a valuation result with a value update.

Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.

PARAMETER DESCRIPTION
result

The result to update

TYPE: ValuationResult

RETURNS DESCRIPTION
ResultUpdater[ValueUpdateT]

A callable object that updates the result with a value update.

Source code in src/pydvl/valuation/samplers/base.py
def result_updater(self, result: ValuationResult) -> ResultUpdater[ValueUpdateT]:
    """Returns a callable that updates a valuation result with a value update.

    Because we use log-space computation for numerical stability, the default result
    updater keeps track of several quantities required to maintain accurate running
    1st and 2nd moments.

    Args:
        result: The result to update
    Returns:
        A callable object that updates the result with a value update
    """
    return LogResultUpdater(result)
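
To give an idea of what log-space accumulation looks like, here is a minimal sketch of a weighted running mean kept entirely in log space. It is not the `LogResultUpdater` implementation: the class `LogRunningMean`, the helper `logaddexp`, and the split into positive and negative parts are assumptions for illustration, and only the first moment is tracked.

```python
import math


def logaddexp(a, b):
    """log(exp(a) + exp(b)) without overflow; -inf stands for log(0)."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


class LogRunningMean:
    """Hypothetical accumulator for a weighted mean with log-space weights."""

    def __init__(self):
        self.log_pos = -math.inf     # log of the weighted sum of positive values
        self.log_neg = -math.inf     # log of the weighted sum of |negative values|
        self.log_weight = -math.inf  # log of the sum of weights

    def update(self, value, log_weight=0.0):
        self.log_weight = logaddexp(self.log_weight, log_weight)
        if value > 0:
            self.log_pos = logaddexp(self.log_pos, log_weight + math.log(value))
        elif value < 0:
            self.log_neg = logaddexp(self.log_neg, log_weight + math.log(-value))

    def mean(self):
        if self.log_weight == -math.inf:
            return 0.0  # no updates yet
        return math.exp(self.log_pos - self.log_weight) - math.exp(
            self.log_neg - self.log_weight
        )


acc = LogRunningMean()
for value in (1.0, -2.0, 4.0):
    # exp(1000) would overflow a float, but the weight is never exponentiated.
    acc.update(value, log_weight=1000.0)
print(acc.mean())  # close to 1.0
```

Signs are tracked separately because the logarithm is only defined for positive numbers; the difference is taken only at the end, after normalising by the total weight.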

interrupt

interrupt() -> None

Interrupts this sampler as well as the in_class and out_of_class samplers it wraps.

Source code in src/pydvl/valuation/samplers/classwise.py
def interrupt(self) -> None:
    """Interrupts this sampler and the wrapped in_class and out_of_class samplers."""
    super().interrupt()
    self.in_class.interrupt()
    self.out_of_class.interrupt()

roundrobin

roundrobin(
    batch_generators: Mapping[U, Iterable[V]],
) -> Generator[tuple[U, V], None, None]

Take samples from batch generators in order until all of them are exhausted.

This was heavily inspired by the roundrobin recipe in the official Python documentation for the itertools package.

Examples:

>>> from pydvl.valuation.samplers.classwise import roundrobin
>>> list(roundrobin({"A": "123", "B": "456"}))
[('A', '1'), ('B', '4'), ('A', '2'), ('B', '5'), ('A', '3'), ('B', '6')]

PARAMETER DESCRIPTION
batch_generators

dictionary mapping labels to batch generators.

TYPE: Mapping[U, Iterable[V]]

RETURNS DESCRIPTION
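Generator[tuple[U, V], None, None]

A generator of (label, item) pairs that cycles through the batch generators until all are exhausted.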

Source code in src/pydvl/valuation/samplers/classwise.py
def roundrobin(
    batch_generators: Mapping[U, Iterable[V]],
) -> Generator[tuple[U, V], None, None]:
    """Take samples from batch generators in order until all of them are exhausted.

    This was heavily inspired by the roundrobin recipe
    in the official Python documentation for the itertools package.

    Examples:
        >>> from pydvl.valuation.samplers.classwise import roundrobin
        >>> list(roundrobin({"A": "123", "B": "456"}))
        [('A', '1'), ('B', '4'), ('A', '2'), ('B', '5'), ('A', '3'), ('B', '6')]

    Args:
        batch_generators: dictionary mapping labels to batch generators.

    Returns:
        Combined generators
    """
    n_active = len(batch_generators)
    remaining_generators = cycle(
        (label, iter(it).__next__) for label, it in batch_generators.items()
    )
    while n_active:
        try:
            for label, next_generator in remaining_generators:
                yield label, next_generator()
        except StopIteration:
            # Remove the iterator we just exhausted from the cycle.
            n_active -= 1
            remaining_generators = cycle(islice(remaining_generators, n_active))
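
The listing above relies on `cycle` and `islice` from `itertools`. The following self-contained copy, with type hints dropped and a variable renamed for brevity, shows what happens when the inputs have different lengths: once one iterable is exhausted, the remaining ones keep being served in order.

```python
from itertools import cycle, islice


def roundrobin(batch_generators):
    """Self-contained copy of the function above, without type hints."""
    n_active = len(batch_generators)
    remaining = cycle(
        (label, iter(it).__next__) for label, it in batch_generators.items()
    )
    while n_active:
        try:
            for label, next_item in remaining:
                yield label, next_item()
        except StopIteration:
            # Drop the exhausted iterator and continue with the rest.
            n_active -= 1
            remaining = cycle(islice(remaining, n_active))


result = list(roundrobin({"A": "12", "B": "3456"}))
print(result)
# prints [('A', '1'), ('B', '3'), ('A', '2'), ('B', '4'), ('B', '5'), ('B', '6')]
```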

get_unique_labels

get_unique_labels(array: NDArray) -> NDArray

Returns unique labels in a categorical dataset.

PARAMETER DESCRIPTION
array

The input array to find unique labels from. It should be of categorical types such as Object, String, Unicode, Unsigned integer, Signed integer, or Boolean.

TYPE: NDArray

RETURNS DESCRIPTION
NDArray

An array of unique labels.

RAISES DESCRIPTION
ValueError

If the input array is not of a categorical type.

Source code in src/pydvl/valuation/samplers/classwise.py
def get_unique_labels(array: NDArray) -> NDArray:
    """Returns unique labels in a categorical dataset.

    Args:
        array: The input array to find unique labels from. It should be of
               categorical types such as Object, String, Unicode, Unsigned
               integer, Signed integer, or Boolean.

    Returns:
        An array of unique labels.

    Raises:
        ValueError: If the input array is not of a categorical type.
    """
    # Object, String, Unicode, Unsigned integer, Signed integer, boolean
    if array.dtype.kind in "OSUiub":
        return cast(NDArray, np.unique(array))
    raise ValueError(
        f"Input array has an unsupported data type for categorical labels: {array.dtype}. "
        "Expected types: Object, String, Unicode, Unsigned integer, Signed integer, or Boolean."
    )