pydvl.valuation.samplers.classwise
Class-wise sampler for the class-wise Shapley valuation method.
The class-wise Shapley method, introduced by Schoch et al. (2022) [1], uses a so-called set-conditional marginal Shapley value. Computing it requires selectively sampling subsets of data points whose class is either the same as, or different from, that of the data point of interest.
This sampling scheme is divided into an outer and an inner sampler.
- The outer sampler is any subclass of PowersetSampler. It generates subsets of the complement set of the data point of interest, restricted to points with a different label (so-called "out-of-class" samples, denoted by \(S_{-y_i}\) in this documentation).
- The inner sampler is any subclass of IndexSampler, typically (and in the paper) a PermutationSampler. It returns so-called "in-class" samples (denoted by \(S_{y_i}\) in this documentation) from the set \(N_{y_i}\), i.e., the set of all indices with the same label as the data point of interest.
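The split into in-class and out-of-class indices can be illustrated with a minimal, library-free sketch. The helper names `split_by_label` and `random_subset` are hypothetical and not part of pyDVL; the uniform subset draw merely stands in for whatever outer sampler is configured.

```python
import random


def split_by_label(labels, label):
    """Split dataset indices into in-class and out-of-class lists."""
    in_class = [i for i, y in enumerate(labels) if y == label]
    out_of_class = [i for i, y in enumerate(labels) if y != label]
    return in_class, out_of_class


def random_subset(indices, rng):
    """Draw a uniform random subset by keeping each index with probability 1/2."""
    return [i for i in indices if rng.random() < 0.5]


labels = ["a", "b", "a", "c", "b", "a"]  # toy labels, one per data point
rng = random.Random(0)

# The point of interest has label "a", so in_class plays the role of N_{y_i}.
in_class, out_of_class = split_by_label(labels, "a")
s_out = random_subset(out_of_class, rng)           # stands in for S_{-y_i}
permutation = rng.sample(in_class, len(in_class))  # one in-class ordering
```

In the actual sampler these two roles are filled by the configured outer and inner samplers rather than by ad hoc helpers.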
Restricting the number of inner samples
Because of the nested sampling procedure, the number of in-class samples must be limited to a finite value. This is done by setting the max_in_class_samples parameter. For finite samplers, it can be left as None to let them run until completion.
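Why the cap is needed can be seen in a small sketch of a nested loop. Everything here is an assumption made for illustration: `endless_permutations` is a hypothetical infinite inner sampler and `nested_samples` a hypothetical driver, neither taken from pyDVL. The cap is applied with `itertools.islice`.

```python
import random
from itertools import islice


def endless_permutations(indices, rng):
    """Hypothetical infinite inner sampler: yields random orderings forever."""
    while True:
        yield tuple(rng.sample(indices, len(indices)))


def nested_samples(outer_subsets, in_class, max_in_class_samples, rng):
    """Pair each outer subset with at most max_in_class_samples inner samples."""
    for s_out in outer_subsets:
        inner = endless_permutations(in_class, rng)
        # With max_in_class_samples=None, islice would never stop here,
        # because `inner` is infinite. A finite inner sampler would terminate.
        for s_in in islice(inner, max_in_class_samples):
            yield s_out, s_in


rng = random.Random(0)
pairs = list(nested_samples([(1,), (3, 4)], [0, 2, 5], 2, rng))
```

Without the cap, the first outer subset would never finish and the second would never be reached.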
Info
For more information on the class-wise Shapley method, as well as a summary of the reproduction results by Semmler and de Benito Delgado (2024) [2], see the main documentation for the method.
References
1. Schoch, Stephanie, Haifeng Xu, and Yangfeng Ji. CS-Shapley: Class-wise Shapley Values for Data Valuation in Classification. In Proc. of the Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS). New Orleans, Louisiana, USA, 2022.
2. Semmler, Markus, and Miguel de Benito Delgado. [Re] Classwise-Shapley Values for Data Valuation. Transactions on Machine Learning Research, July 2024.
ClasswiseSampler
ClasswiseSampler(
    in_class: IndexSampler,
    out_of_class: PowersetSampler,
    *,
    max_in_class_samples: int | None = None,
    min_elements_per_label: int = 1,
    batch_size: int = 1,
)
Bases: IndexSampler[ClasswiseSample, ValueUpdate]
A sampler that samples elements from a dataset in two steps, based on the labels.
It proceeds by sampling out-of-class indices (training points with a different label from the point of interest) and in-class indices (training points with the same label as the point of interest).
Used by the class-wise Shapley valuation method.
PARAMETER | DESCRIPTION |
---|---|
in_class | Sampling scheme for elements of a given label (inner sampler). Typically, a PermutationSampler. TYPE: IndexSampler |
out_of_class | Sampling scheme for elements of different labels (outer sampler), e.g. a UniformSampler or a VRDSSampler. This sampler must use NoIndexIteration or any subclass thereof. TYPE: PowersetSampler |
max_in_class_samples | Maximum number of in-class samples to generate per outer iteration. Leave as None to let a finite in-class sampler run until completion. TYPE: int or None, DEFAULT: None |
min_elements_per_label | Minimum number of elements per label to sample from the complement set, i.e., out-of-class elements. TYPE: int, DEFAULT: 1 |
batch_size | Number of samples to generate in each batch. TYPE: int, DEFAULT: 1 |
Source code in src/pydvl/valuation/samplers/classwise.py
skip_indices
property, writable
Indices being skipped in the sampler. The exact behaviour is sampler-dependent, so setting this property is disabled by default.
__len__
__len__() -> int
Returns the length of the current sample generation in generate_batches.
RAISES | DESCRIPTION |
---|---|
TypeError | If the sampler is infinite or generate_batches has not been called yet. |
Source code in src/pydvl/valuation/samplers/base.py
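Because `__len__` raises `TypeError` when the length is undefined, callers may want to guard calls to `len()`. The sketch below uses a hypothetical stand-in class, `SketchSampler`, that only mimics the documented contract. It is not the pyDVL implementation, and `safe_len` is likewise an illustrative helper.

```python
class SketchSampler:
    """Hypothetical stand-in that mimics the documented __len__ contract."""

    def __init__(self, n_samples=None):
        self._n_samples = n_samples  # None models an infinite sampler
        self._len = None             # unknown until generation starts

    def generate_batches(self):
        """Record the length (if finite) and return an iterator over dummy batches."""
        self._len = self._n_samples
        # A real infinite sampler would keep yielding; that part is omitted here.
        return iter(range(self._n_samples or 0))

    def __len__(self):
        if self._len is None:
            raise TypeError("length is undefined for this sampler state")
        return self._len


def safe_len(sampler):
    """Return len(sampler), or None when the length is undefined."""
    try:
        return len(sampler)
    except TypeError:
        return None
```

With this stand-in, `safe_len` returns None before `generate_batches` is called and for samplers created without a sample count.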
batches_from_data
batches_from_data(data: Dataset) -> BatchGenerator
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/classwise.py
generate
This is not needed because this sampler is used by calling the from_data method instead of the generate_batches method.
Source code in src/pydvl/valuation/samplers/classwise.py
interrupt
log_weight
CW-Shapley uses the evaluation strategy from the in-class sampler, so this method should never be called.
Source code in src/pydvl/valuation/samplers/classwise.py
result_updater
result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]
Returns an object that updates a valuation result with a value update.
Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.
PARAMETER | DESCRIPTION |
---|---|
result | The result to update. TYPE: ValuationResult |
RETURNS | DESCRIPTION |
---|---|
ResultUpdater[ValueUpdateT] | A callable object that updates the result with a value update. |
Source code in src/pydvl/valuation/samplers/base.py
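To make "running 1st and 2nd moments" concrete, here is a sketch of Welford's online algorithm. Note that this is a swapped-in, linear-space technique chosen for illustration: it does not reproduce the log-space bookkeeping described above, and the `RunningMoments` class is hypothetical rather than part of pyDVL.

```python
import math


class RunningMoments:
    """Welford's online algorithm for a running mean and variance."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Unbiased sample variance; NaN with fewer than two observations."""
        return self._m2 / (self.count - 1) if self.count > 1 else math.nan


moments = RunningMoments()
for value_update in [1.0, 2.0, 3.0, 4.0]:
    moments.update(value_update)
```

Updating incrementally avoids storing all past updates and is more stable than accumulating raw sums of squares.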
samples_from_data
samples_from_data(data: Dataset) -> SampleGenerator
Generates batches of class-wise samples from the dataset.
PARAMETER | DESCRIPTION |
---|---|
data | The dataset to sample from. TYPE: Dataset |
Source code in src/pydvl/valuation/samplers/classwise.py
get_unique_labels
Returns unique labels in a categorical dataset.
PARAMETER | DESCRIPTION |
---|---|
array | The input array to find unique labels from. It should be of a categorical type such as Object, String, Unicode, Unsigned integer, Signed integer, or Boolean. |
RETURNS | DESCRIPTION |
---|---|
NDArray | An array of unique labels. |
RAISES | DESCRIPTION |
---|---|
ValueError | If the input array is not of a categorical type. |
Source code in src/pydvl/valuation/samplers/classwise.py
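The categorical types listed above correspond to NumPy dtype kind codes, which suggests one way such a check could look. This is a sketch under that assumption, with a hypothetical function name; it is not copied from the library source.

```python
import numpy as np


def unique_labels_sketch(array):
    """Return the sorted unique labels of a categorical array (illustrative only)."""
    # NumPy dtype kinds: O=object, S=byte string, U=unicode,
    # u=unsigned integer, i=signed integer, b=boolean.
    if array.dtype.kind in "OSUuib":
        return np.unique(array)
    raise ValueError(
        f"Unsupported dtype {array.dtype}: expected a categorical type."
    )
```

Floating point arrays have dtype kind `f` and are therefore rejected, which matches the documented ValueError for non-categorical input.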
roundrobin
Take samples from batch generators in order until all of them are exhausted.
This was heavily inspired by the roundrobin recipe in the official Python documentation for the itertools package.
Examples:
>>> from pydvl.valuation.samplers.classwise import roundrobin
>>> list(roundrobin({"A": "123", "B": "456"}))
[('A', '1'), ('B', '4'), ('A', '2'), ('B', '5'), ('A', '3'), ('B', '6')]
PARAMETER | DESCRIPTION |
---|---|
batch_generators | Dictionary mapping labels to batch generators. |
RETURNS | DESCRIPTION |
---|---|
None | Combined generators |
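For readers unfamiliar with the itertools recipe, the following sketch adapts it to a dictionary of labelled iterables. It assumes a single mapping argument, as in the parameter table above, and uses the hypothetical name `roundrobin_sketch` to make clear that it is not the library function itself.

```python
from itertools import cycle, islice


def roundrobin_sketch(batch_generators):
    """Yield (label, item) pairs, taking one item per label in turn."""
    n_active = len(batch_generators)
    nexts = cycle(
        (label, iter(gen).__next__) for label, gen in batch_generators.items()
    )
    while n_active:
        try:
            for label, next_item in nexts:
                yield label, next_item()
        except StopIteration:
            # Drop the iterator that was just exhausted from the rotation.
            n_active -= 1
            nexts = cycle(islice(nexts, n_active))


# Iterables of unequal length: "A" runs out first, "B" continues alone.
result = list(roundrobin_sketch({"A": "12", "B": "3456"}))
```

Once a label's iterable is exhausted, the remaining labels keep rotating until all of them are done.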