pydvl.value
¶
This module implements algorithms for the exact and approximate computation of values and semi-values.
See Data valuation for an introduction to the concepts and methods implemented here.
PowersetSampler
¶
PowersetSampler(
indices: NDArray[IndexT],
index_iteration: IndexIteration = Sequential,
outer_indices: NDArray[IndexT] | None = None,
**kwargs,
)
Bases: ABC
, Iterable[SampleT]
, Generic[IndexT]
Samplers are custom iterables over subsets of indices.
Calling iter()
on a sampler returns an iterator over tuples of the form
\((i, S)\), where \(i\) is an index of interest, and \(S \subset I \setminus \{i\}\)
is a subset of the complement of \(i\).
This is done in two nested loops, where the outer loop iterates over the set of indices, and the inner loop iterates over subsets of the complement of the current index. The outer iteration can be either sequential or at random.
Note
Samplers are not iterators themselves, so each call to iter(), e.g. in a for loop, creates a new iterator.
Example
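A minimal sketch of this contract, iterating a concrete subclass (DeterministicUniformSampler) directly; the printed format is illustrative:

```python
import numpy as np

from pydvl.value.sampler import DeterministicUniformSampler

sampler = DeterministicUniformSampler(np.arange(2))
# Each call to iter() (e.g. implicitly in a for loop) starts a fresh iterator
# over tuples (i, S), with S a subset of the complement of {i}.
for idx, subset in sampler:
    print(f"({idx}, {list(subset)})")
```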
Methods required in subclasses¶
Samplers must implement a weight() function to be used as a multiplier in Monte Carlo sums, so that the limit expectation coincides with the semi-value.
Slicing of samplers¶
The samplers can be sliced for parallel computation. For those which are embarrassingly parallel, this is done by slicing the set of "outer" indices and returning new samplers over those slices.
index_iteration: The order in which indices are iterated over.
outer_indices: The set of items (indices) over which to iterate
when sampling. Subsets are taken from the complement of each index
in succession. For embarrassingly parallel computations, this set
is sliced and the samplers are used to iterate over the slices.
Source code in src/pydvl/value/sampler.py
iterindices
¶
iterindices() -> Iterator[IndexT]
Iterates over indices in the order specified at construction.
Source code in src/pydvl/value/sampler.py
weight
abstractmethod
classmethod
¶
Factor by which to multiply Monte Carlo samples, so that the mean converges to the desired expression.
By the Law of Large Numbers, the sample mean of \(\delta_i(S_j)\) converges to the expectation under the distribution from which \(S_j\) is sampled.
We add a factor \(c(S_j)\) in order to have this expectation coincide with the desired expression.
Source code in src/pydvl/value/sampler.py
DeterministicUniformSampler
¶
DeterministicUniformSampler(indices: NDArray[IndexT], *args, **kwargs)
Bases: PowersetSampler[IndexT]
For every index \(i\), each subset of the complement indices - {i} is returned.
Note
Indices are always iterated over sequentially, irrespective of
the value of index_iteration
upon construction.
Example
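A minimal sketch: for \(n = 3\) indices the sampler yields all \(n \cdot 2^{n-1} = 12\) index/subset pairs and then stops:

```python
import numpy as np

from pydvl.value.sampler import DeterministicUniformSampler

sampler = DeterministicUniformSampler(np.arange(3))
samples = [(idx, tuple(subset)) for idx, subset in sampler]
assert len(samples) == 12  # n * 2^(n-1) pairs in total
```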
PARAMETER | DESCRIPTION
---|---
`indices` | The set of items (indices) to sample from.
Source code in src/pydvl/value/sampler.py
iterindices
¶
iterindices() -> Iterator[IndexT]
Iterates over indices in the order specified at construction.
Source code in src/pydvl/value/sampler.py
UniformSampler
¶
UniformSampler(*args, seed: Optional[Seed] = None, **kwargs)
Bases: StochasticSamplerMixin
, PowersetSampler[IndexT]
An iterator to perform uniform random sampling of subsets.
Iterating over every index \(i\), either in sequence or at random depending on the value of index_iteration, one subset of the complement indices - {i} is sampled, each with equal probability \(2^{-(n-1)}\). The iterator never ends.
Example
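A minimal sketch, assuming a seed for reproducibility; since the iterator never ends, it must be stopped manually:

```python
import numpy as np

from pydvl.value.sampler import UniformSampler

sampler = UniformSampler(np.arange(4), seed=42)
for i, (idx, subset) in enumerate(sampler):
    print(f"({idx}, {sorted(subset)})")
    if i >= 3:  # the iterator never ends: stop after a few samples
        break
```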
Source code in src/pydvl/value/sampler.py
iterindices
¶
iterindices() -> Iterator[IndexT]
Iterates over indices in the order specified at construction.
Source code in src/pydvl/value/sampler.py
weight
classmethod
¶
Correction coming from Monte Carlo integration so that the mean of the marginals converges to the value: the uniform distribution over the powerset of a set with n-1 elements has mass \(2^{-(n-1)}\) over each subset, hence the correction factor is \(2^{n-1}\).
Source code in src/pydvl/value/sampler.py
MSRSampler
¶
MSRSampler(*args, seed: Optional[Seed] = None, **kwargs)
Bases: StochasticSamplerMixin
, PowersetSampler[IndexT]
An iterator to perform sampling of random subsets.
This sampler does not return any index; it only returns subsets of the data. It is used in (Wang et al.)2.
Source code in src/pydvl/value/sampler.py
iterindices
¶
iterindices() -> Iterator[IndexT]
Iterates over indices in the order specified at construction.
Source code in src/pydvl/value/sampler.py
AntitheticSampler
¶
AntitheticSampler(*args, seed: Optional[Seed] = None, **kwargs)
Bases: StochasticSamplerMixin
, PowersetSampler[IndexT]
An iterator to perform uniform random sampling of subsets, and their complements.
Works as UniformSampler, but for every tuple \((i,S)\), it subsequently returns \((i,S^c)\), where \(S^c\) is the complement of the set \(S\) in the set of indices, excluding \(i\).
Source code in src/pydvl/value/sampler.py
iterindices
¶
iterindices() -> Iterator[IndexT]
Iterates over indices in the order specified at construction.
Source code in src/pydvl/value/sampler.py
PermutationSampler
¶
PermutationSampler(*args, seed: Optional[Seed] = None, **kwargs)
Bases: StochasticSamplerMixin
, PowersetSampler[IndexT]
Sample permutations of indices and iterate through each returning increasing subsets, as required for the permutation definition of semi-values.
This sampler does not implement the two loops described in PowersetSampler. Instead, for a permutation (3,1,4,2), it returns in sequence the tuples of index and sets: (3, {}), (1, {3}), (4, {3,1}) and (2, {3,1,4}).
Note that the full index set is never returned.
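A minimal sketch of the tuples produced, assuming a seeded sampler over four indices:

```python
import numpy as np

from pydvl.value.sampler import PermutationSampler

sampler = PermutationSampler(np.arange(4), seed=42)
for i, (idx, subset) in enumerate(sampler):
    print(f"({idx}, {list(subset)})")  # subsets grow along the permutation
    if i >= 3:  # one permutation of 4 indices yields 4 tuples
        break
```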
Warning
This sampler requires caching to be enabled, or computation will be doubled with respect to a "direct" implementation of permutation Monte Carlo.
Source code in src/pydvl/value/sampler.py
iterindices
¶
iterindices() -> Iterator[IndexT]
Iterates over indices in the order specified at construction.
Source code in src/pydvl/value/sampler.py
__getitem__
¶
__getitem__(key: slice | list[int]) -> PowersetSampler[IndexT]
Permutation samplers cannot be split across indices, so we return a copy of the full sampler.
DeterministicPermutationSampler
¶
DeterministicPermutationSampler(*args, seed: Optional[Seed] = None, **kwargs)
Bases: PermutationSampler[IndexT]
Samples all n! permutations of the indices deterministically, and iterates through them, returning sets as required for the permutation-based definition of semi-values.
Warning
This sampler requires caching to be enabled, or computation will be doubled with respect to a "direct" implementation of permutation Monte Carlo.
Warning
This sampler is not parallelizable, as it always iterates over the whole set of permutations in the same order. Different processes would always return the same values for all indices.
Source code in src/pydvl/value/sampler.py
iterindices
¶
iterindices() -> Iterator[IndexT]
Iterates over indices in the order specified at construction.
Source code in src/pydvl/value/sampler.py
__getitem__
¶
__getitem__(key: slice | list[int]) -> PowersetSampler[IndexT]
Permutation samplers cannot be split across indices, so we return a copy of the full sampler.
RandomHierarchicalSampler
¶
RandomHierarchicalSampler(*args, seed: Optional[Seed] = None, **kwargs)
Bases: StochasticSamplerMixin
, PowersetSampler[IndexT]
For every index, sample a set size, then a set of that size.
Todo
This is unnecessary, but a step towards proper stratified sampling.
Source code in src/pydvl/value/sampler.py
iterindices
¶
iterindices() -> Iterator[IndexT]
Iterates over indices in the order specified at construction.
Source code in src/pydvl/value/sampler.py
SemiValueMode
¶
ValueItem
dataclass
¶
ValueItem(
index: IndexT,
name: NameT,
value: float,
variance: Optional[float],
count: Optional[int],
)
Bases: Generic[IndexT, NameT]
The result of a value computation for one datum.
ValueItems
can be compared with the usual operators, forming a total
order. Comparisons take only the value
into account.
Todo
Maybe have a mode of comparing similar to np.isclose, or taking the variance into account.
ATTRIBUTE | DESCRIPTION
---|---
`index` | Index of the sample with this value in the original Dataset.
`name` | Name of the sample if it was provided, otherwise the index.
`value` | The value.
`variance` | Variance of the value if it was computed with an approximate method.
`count` | Number of updates for this value.
ValuationResult
¶
ValuationResult(
*,
values: NDArray[float64],
variances: Optional[NDArray[float64]] = None,
counts: Optional[NDArray[int_]] = None,
indices: Optional[NDArray[IndexT]] = None,
data_names: Optional[Sequence[NameT] | NDArray[NameT]] = None,
algorithm: str = "",
status: Status = Pending,
sort: bool = False,
**extra_values: Any,
)
Bases: Sequence
, Iterable[ValueItem[IndexT, NameT]]
, Generic[IndexT, NameT]
Objects of this class hold the results of valuation algorithms.
These include indices in the original Dataset,
any data names (e.g. group names in GroupedDataset),
the values themselves, and variance of the computation in the case of Monte
Carlo methods. ValuationResults
can be iterated over like any Sequence
:
iter(valuation_result)
returns a generator of
ValueItem in the order in which the object
is sorted.
Indexing¶
Indexing can be position-based, when accessing any of the attributes values, variances, counts and indices, as well as when iterating over the object, or using the item access operator, both getter and setter. The "position" is either the original sequence in which the data was passed to the constructor, or the sequence in which the object is sorted, see below.
Alternatively, indexing can be data-based, i.e. using the indices in the original dataset. This is the case for the methods get() and update().
Sorting¶
Results can be sorted in-place with sort(), or alternatively using Python's standard sorted() and reversed().
Note that sorting values
affects how iterators and the object itself as Sequence
behave:
values[0]
returns a ValueItem with the highest or lowest
ranking point if this object is sorted by descending or ascending value,
respectively. If unsorted, values[0]
returns the ValueItem
at
position 0, which has data index indices[0]
in the
Dataset.
The same applies to direct indexing of the ValuationResult
: the index
is positional, according to the sorting. It does not refer to the "data
index". To sort according to data index, use sort() with
key="index"
.
In order to access ValueItem objects by their data index, use get().
Operating on results¶
Results can be added to each other with the +
operator. Means and
variances are correctly updated, using the counts
attribute.
Results can also be updated with new values using update(). Means and variances are updated accordingly using the Welford algorithm.
Empty objects behave in a special way, see empty().
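A minimal sketch of these operations (the algorithm name "demo" and the values are only illustrative):

```python
import numpy as np

from pydvl.value.result import ValuationResult

r1 = ValuationResult(values=np.array([0.1, 0.2]), algorithm="demo")
r2 = ValuationResult(values=np.array([0.3, 0.4]), algorithm="demo")

combined = r1 + r2       # means and variances merged using the counts
combined.update(0, 0.5)  # running mean/variance update for data index 0

combined.sort(reverse=True)  # in-place sort by descending value
print(combined.values, combined.indices)
```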
PARAMETER | DESCRIPTION
---|---
`values` | An array of values. If omitted, defaults to an empty array or to an array of zeros if `indices` are given.
`indices` | An optional array of indices in the original dataset. If omitted, defaults to `np.arange(len(values))`.
`variances` | An optional array of variances in the computation of each value.
`counts` | An optional array with the number of updates for each value. Defaults to an array of ones.
`data_names` | Names for the data points. Defaults to index numbers if not set.
`algorithm` | The method used.
`status` | The end status of the algorithm.
`sort` | Whether to sort the indices by ascending value. See above how this affects usage as an iterable or sequence.
`extra_values` | Additional values that can be passed as keyword arguments. This can contain, for example, the least core value.

RAISES | DESCRIPTION
---|---
`ValueError` | If input arrays have mismatching lengths.
Source code in src/pydvl/value/result.py
indices
property
¶
indices: NDArray[IndexT]
The indices for the values, possibly sorted.
If the object is unsorted, then these are the same as declared at
construction or np.arange(len(values))
if none were passed.
names
property
¶
names: NDArray[NameT]
The names for the values, possibly sorted.
If the object is unsorted, then these are the same as declared at
construction or np.arange(len(values))
if none were passed.
sort
¶
sort(
reverse: bool = False,
key: Literal["value", "variance", "index", "name"] = "value",
) -> None
Sorts the indices in place by `key`.
Once sorted, iteration over the results, and indexing of all the properties ValuationResult.values, ValuationResult.variances, ValuationResult.counts, ValuationResult.indices and ValuationResult.names will follow the same order.
PARAMETER | DESCRIPTION
---|---
`reverse` | Whether to sort in descending order by value.
`key` | The key to sort by. Defaults to ValueItem.value.
Source code in src/pydvl/value/result.py
__getattr__
¶
Allows access to extra values as if they were properties of the instance.
Source code in src/pydvl/value/result.py
__iter__
¶
Iterate over the results returning ValueItem objects. To sort in place before iteration, use sort().
Source code in src/pydvl/value/result.py
__add__
¶
__add__(
other: ValuationResult[IndexT, NameT],
) -> ValuationResult[IndexT, NameT]
Adds two ValuationResults.
The values must have been computed with the same algorithm. An exception to this is if one argument has empty values, in which case the other argument is returned.
Warning
Abusing this will introduce numerical errors.
Means and standard errors are correctly handled. Statuses are added with bit-wise &, see Status. data_names are taken from the left summand, or if unavailable from the right one. The algorithm string is carried over if both terms have the same one, otherwise the two are concatenated.
It is possible to add ValuationResults of different lengths, and with different or overlapping indices. The result will have the union of indices, and the values.
Warning
FIXME: Arbitrary extra_values
aren't handled.
Source code in src/pydvl/value/result.py
update
¶
update(idx: int, new_value: float) -> ValuationResult[IndexT, NameT]
Updates the result in place with a new value, using running mean and variance.
PARAMETER | DESCRIPTION
---|---
`idx` | Data index of the value to update.
`new_value` | New value to add to the result.

RETURNS | DESCRIPTION
---|---
`ValuationResult[IndexT, NameT]` | A reference to the same, modified result.

RAISES | DESCRIPTION
---|---
`IndexError` | If the index is not found.
Source code in src/pydvl/value/result.py
scale
¶
Scales the values and variances of the result by a coefficient.
PARAMETER | DESCRIPTION
---|---
`factor` | Factor to scale by.
`indices` | Indices to scale. If None, all values are scaled.
Source code in src/pydvl/value/result.py
get
¶
Retrieves a ValueItem by data index, as opposed to sort index, like the indexing operator.
RAISES | DESCRIPTION
---|---
`IndexError` | If the index is not found.
Source code in src/pydvl/value/result.py
to_dataframe
¶
Returns values as a dataframe.
PARAMETER | DESCRIPTION
---|---
`column` | Name for the column holding the data value. Defaults to the name of the algorithm used.
`use_names` | Whether to use data names instead of indices for the DataFrame's index.

RETURNS | DESCRIPTION
---|---
`DataFrame` | A dataframe with two columns, one for the values, with name given as explained in `column`, and another for the standard errors of approximate methods.
Source code in src/pydvl/value/result.py
from_random
classmethod
¶
from_random(
size: int,
total: Optional[float] = None,
seed: Optional[Seed] = None,
**kwargs: Any,
) -> "ValuationResult"
Creates a ValuationResult object and fills it with an array of random values from a uniform distribution in [-1,1]. The values can be made to sum up to a given total number (doing so will change their range).
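A minimal sketch:

```python
from pydvl.value.result import ValuationResult

result = ValuationResult.from_random(size=5, total=1.0, seed=42)
assert len(result) == 5
print(sum(result.values))  # close to 1.0 due to the normalization to `total`
```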
PARAMETER | DESCRIPTION
---|---
`size` | Number of values to generate.
`total` | If set, the values are normalized to sum to this number ("efficiency" property of Shapley values).
`kwargs` | Additional options to pass to the constructor of ValuationResult. Use to override status, names, etc.

RETURNS | DESCRIPTION
---|---
`ValuationResult` | A valuation result with its status set to Status.Converged by default.

RAISES | DESCRIPTION
---|---
`ValueError` | If `size` is less than 1.
Changed in version 0.6.0
Added parameter `total`. Check for zero `size`.
Source code in src/pydvl/value/result.py
empty
classmethod
¶
empty(
algorithm: str = "",
indices: Optional[Sequence[IndexT] | NDArray[IndexT]] = None,
data_names: Optional[Sequence[NameT] | NDArray[NameT]] = None,
n_samples: int = 0,
) -> ValuationResult
Creates an empty ValuationResult object.
Empty results are characterised by having an empty array of values. When another result is added to an empty one, the empty one is discarded.
PARAMETER | DESCRIPTION
---|---
`algorithm` | Name of the algorithm used to compute the values.
`indices` | Optional sequence or array of indices.
`data_names` | Optional sequence or array of names for the data points. Defaults to index numbers if not set.
`n_samples` | Number of valuation result entries.

RETURNS | DESCRIPTION
---|---
`ValuationResult` | Object with the results.
Source code in src/pydvl/value/result.py
zeros
classmethod
¶
zeros(
algorithm: str = "",
indices: Optional[Sequence[IndexT] | NDArray[IndexT]] = None,
data_names: Optional[Sequence[NameT] | NDArray[NameT]] = None,
n_samples: int = 0,
) -> ValuationResult
Creates an empty ValuationResult object.
Empty results are characterised by having an empty array of values. When another result is added to an empty one, the empty one is ignored.
PARAMETER | DESCRIPTION
---|---
`algorithm` | Name of the algorithm used to compute the values.
`indices` | Data indices to use. A copy will be made. If not given, the indices will be set to the range `[0, n_samples)`.
`data_names` | Data names to use. A copy will be made. If not given, the names will be set to the string representation of the indices.
`n_samples` | Number of data points whose values are computed. If not given, the length of `indices` is used.

RETURNS | DESCRIPTION
---|---
`ValuationResult` | Object with the results.
Source code in src/pydvl/value/result.py
ClasswiseScorer
¶
ClasswiseScorer(
scoring: Union[str, ScorerCallable] = "accuracy",
default: float = 0.0,
range: Tuple[float, float] = (0, 1),
in_class_discount_fn: Callable[[float], float] = lambda x: x,
out_of_class_discount_fn: Callable[[float], float] = exp,
initial_label: Optional[int] = None,
name: Optional[str] = None,
)
Bases: Scorer
A Scorer designed for evaluation in classification problems. Its value is computed from an in-class and an out-of-class "inner score" (Schoch et al., 2022) 1. Let \(S\) be the training set and \(D\) be the valuation set. For each label \(c\), \(D\) is factorized into two disjoint sets: \(D_c\) for in-class instances and \(D_{-c}\) for out-of-class instances. The score combines an in-class metric of performance, adjusted by a discounted out-of-class metric. These inner scores must be provided upon construction or default to accuracy. They are combined into:

$$ u(S_{y_i}) = f(a_S(D_{y_i}))\, g(a_S(D_{-y_i})), $$

where \(f\) and \(g\) are continuous, monotonic functions. For a detailed explanation, refer to section four of (Schoch et al., 2022) 1.
Warning
Metrics must support multiple class labels if you intend to apply them
to a multi-class problem. For instance, the metric 'accuracy' supports
multiple classes, but the metric f1
does not. For a two-class
classification problem, using f1_weighted
is essentially equivalent to
using accuracy
.
PARAMETER | DESCRIPTION
---|---
`scoring` | Name of the scoring function or a callable that can be passed to Scorer.
`default` | Score to use when a model fails to provide a number, e.g. when too little was used to train it, or errors arise.
`range` | Numerical range of the score function. Some Monte Carlo methods can use this to estimate the number of samples required for a certain quality of approximation. If not provided, it can be read from the `scoring` object if it provides it.
`in_class_discount_fn` | Continuous, monotonic increasing function used to discount the in-class score.
`out_of_class_discount_fn` | Continuous, monotonic increasing function used to discount the out-of-class score.
`initial_label` | Set initial label (for the first iteration).
`name` | Name of the scorer. If not provided, the name of the inner scoring function will be prefixed by `classwise `.
New in version 0.7.1
Source code in src/pydvl/value/shapley/classwise.py
estimate_in_class_and_out_of_class_score
¶
estimate_in_class_and_out_of_class_score(
model: SupervisedModel,
x_test: NDArray[float64],
y_test: NDArray[int_],
rescale_scores: bool = True,
) -> Tuple[float, float]
Computes in-class and out-of-class scores using the provided inner scoring function, returning the pair of scores.
In this context, for label \(c\) calculations are executed twice: once for \(D_c\) and once for \(D_{-c}\) to determine the in-class and out-of-class scores, respectively. By default, the raw scores are multiplied by \(\frac{|D_c|}{|D|}\) and \(\frac{|D_{-c}|}{|D|}\), respectively. This is done to ensure that both scores are of the same order of magnitude. This normalization is particularly useful when the inner score function \(a_S\) is calculated by an estimator of the form \(\frac{1}{N} \sum_i x_i\), e.g. the accuracy.
PARAMETER | DESCRIPTION
---|---
`model` | Model used for computing the score on the validation set.
`x_test` | Array containing the features of the classification problem.
`y_test` | Array containing the labels of the classification problem.
`rescale_scores` | If set to True, the scores will be denormalized. This is particularly useful when the inner score function \(a_S\) is calculated by an estimator of the form \(\frac{1}{N} \sum_i x_i\).

RETURNS | DESCRIPTION
---|---
`Tuple[float, float]` | Tuple containing the in-class and out-of-class scores.
Source code in src/pydvl/value/shapley/classwise.py
OwenAlgorithm
¶
Bases: Enum
Choices for the Owen sampling method.
ATTRIBUTE | DESCRIPTION
---|---
`Standard` | Use \(q \in [0, 1]\).
`Antithetic` | Use \(q \in [0, 0.5]\) and correlated samples.
TruncationPolicy
¶
Bases: ABC
A policy for deciding whether to stop computing marginals in a permutation.
Statistics are kept on the number of calls and truncations as n_calls and n_truncations respectively.
ATTRIBUTE | DESCRIPTION
---|---
`n_calls` | Number of calls to the policy.
`n_truncations` | Number of truncations made by the policy.
Todo
Because the policy objects are copied to the workers, the statistics are not accessible from the coordinating process. We need to add methods for this.
Source code in src/pydvl/value/shapley/truncated.py
reset
abstractmethod
¶
__call__
¶
Check whether the computation should be interrupted.
PARAMETER | DESCRIPTION
---|---
`idx` | Position in the permutation currently being computed.
`score` | Last utility computed.

RETURNS | DESCRIPTION
---|---
`bool` | `True` if the computation should be interrupted.
Source code in src/pydvl/value/shapley/truncated.py
NoTruncation
¶
Bases: TruncationPolicy
A policy which never interrupts the computation.
Source code in src/pydvl/value/shapley/truncated.py
__call__
¶
Check whether the computation should be interrupted.
PARAMETER | DESCRIPTION
---|---
`idx` | Position in the permutation currently being computed.
`score` | Last utility computed.

RETURNS | DESCRIPTION
---|---
`bool` | `True` if the computation should be interrupted.
Source code in src/pydvl/value/shapley/truncated.py
FixedTruncation
¶
Bases: TruncationPolicy
Break a permutation after computing a fixed number of marginals.
The experiments in Appendix B of (Ghorbani and Zou, 2019)1 show that when the training set size is large enough, one can simply truncate the iteration over permutations after a fixed number of steps. This happens because beyond a certain number of samples in a training set, the model becomes insensitive to new ones. Alas, this strongly depends on the data distribution and the model and there is no automatic way of estimating this number.
PARAMETER | DESCRIPTION
---|---
`u` | Utility object with model, data, and scoring function.
`fraction` | Fraction of marginals in a permutation to compute before stopping (e.g. 0.5 to compute half of the marginals).
Source code in src/pydvl/value/shapley/truncated.py
__call__
¶
Check whether the computation should be interrupted.
PARAMETER | DESCRIPTION
---|---
`idx` | Position in the permutation currently being computed.
`score` | Last utility computed.

RETURNS | DESCRIPTION
---|---
`bool` | `True` if the computation should be interrupted.
Source code in src/pydvl/value/shapley/truncated.py
RelativeTruncation
¶
Bases: TruncationPolicy
Break a permutation if the marginal utility is too low.
This is called "performance tolerance" in (Ghorbani and Zou, 2019)1.
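A minimal sketch of using this policy with truncated Monte Carlo Shapley; the model, dataset and rtol value are only illustrative, and the imports assume these names are re-exported by pydvl.value as listed on this page:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Utility
from pydvl.value import MaxUpdates, ShapleyMode, compute_shapley_values
from pydvl.value.shapley.truncated import RelativeTruncation

u = Utility(LogisticRegression(max_iter=500), Dataset.from_sklearn(load_iris()))
values = compute_shapley_values(
    u,
    mode=ShapleyMode.TruncatedMontecarlo,
    done=MaxUpdates(100),
    truncation=RelativeTruncation(u, rtol=0.05),  # break permutations early
)
```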
PARAMETER | DESCRIPTION
---|---
`u` | Utility object with model, data, and scoring function.
`rtol` | Relative tolerance. The permutation is broken if the last computed utility is less than `total_utility * rtol`.
Source code in src/pydvl/value/shapley/truncated.py
__call__
¶
Check whether the computation should be interrupted.
PARAMETER | DESCRIPTION
---|---
`idx` | Position in the permutation currently being computed.
`score` | Last utility computed.

RETURNS | DESCRIPTION
---|---
`bool` | `True` if the computation should be interrupted.
Source code in src/pydvl/value/shapley/truncated.py
BootstrapTruncation
¶
Bases: TruncationPolicy
Break a permutation if the last computed utility is close to the total utility, measured as a multiple of the standard deviation of the utilities.
PARAMETER | DESCRIPTION
---|---
`u` | Utility object with model, data, and scoring function.
`n_samples` | Number of bootstrap samples to use to compute the variance of the utilities.
`sigmas` | Number of standard deviations to use as a threshold.
Source code in src/pydvl/value/shapley/truncated.py
__call__
¶
Check whether the computation should be interrupted.
PARAMETER | DESCRIPTION
---|---
`idx` | Position in the permutation currently being computed.
`score` | Last utility computed.

RETURNS | DESCRIPTION
---|---
`bool` | `True` if the computation should be interrupted.
Source code in src/pydvl/value/shapley/truncated.py
ShapleyMode
¶
StoppingCriterion
¶
StoppingCriterion(modify_result: bool = True)
Bases: ABC
A composable callable object to determine whether a computation must stop.
A StoppingCriterion
is a callable taking a
ValuationResult and returning a
Status. It also keeps track of individual
convergence of values with
converged, and reports
the overall completion of the computation with
completion.
Instances of StoppingCriterion
can be composed with the binary operators
&
(and), and |
(or), following the truth tables of
Status. The unary operator ~
(not) is
also supported. These boolean operations act according to the following
rules:
- The results of check() are combined with the operator. See Status for the truth tables.
- The results of converged are combined with the operator (returning another boolean array).
- The completion method returns the min, max, or the complement to 1 of the completions of the operands, for AND, OR and NOT respectively. This is required for cases where one of the criteria does not keep track of the convergence of single values, e.g. MaxUpdates, because completion by default returns the mean of the boolean convergence array.
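A minimal sketch of composition (the threshold values are only illustrative):

```python
from pydvl.value.stopping import AbsoluteStandardError, MaxTime, MaxUpdates

# Stop when standard errors fall below 0.01, or after 1000 updates per
# index, or after one hour, whichever happens first.
done = AbsoluteStandardError(threshold=0.01) | MaxUpdates(1000) | MaxTime(3600)
```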
Subclassing¶
Subclassing this class requires implementing a check()
method that
returns a Status object based on a given
ValuationResult. This method should
update the attribute _converged
, which is a boolean array indicating
whether the value for each index has converged.
When this does not make sense for a particular stopping criterion,
completion should be
overridden to provide an overall completion value, since its default
implementation attempts to compute the mean of _converged
.
PARAMETER | DESCRIPTION
---|---
`modify_result` | If `True`, the status of the input ValuationResult is modified in place after the call.
Source code in src/pydvl/value/stopping.py
converged
property
¶
completion
¶
completion() -> float
Returns a value between 0 and 1 indicating the completion of the computation.
__call__
¶
__call__(result: ValuationResult) -> Status
Calls check()
, maybe updating the result.
Source code in src/pydvl/value/stopping.py
AbsoluteStandardError
¶
AbsoluteStandardError(
threshold: float,
fraction: float = 1.0,
burn_in: int = 4,
modify_result: bool = True,
)
Bases: StoppingCriterion
Determine convergence based on the standard error of the values.
If \(s_i\) is the standard error for datum \(i\), then this criterion returns Converged if \(s_i < \epsilon\) for all \(i\) and a threshold value \(\epsilon \gt 0\).
PARAMETER | DESCRIPTION
---|---
`threshold` | A value is considered to have converged if the standard error is below this threshold. A way of choosing it is to pick some percentage of the range of the values. For Shapley values this is the difference between the maximum and minimum of the utility function (to see this, substitute the maximum and minimum values of the utility into the marginal contribution formula).
`fraction` | The fraction of values that must have converged for the criterion to return Converged.
`burn_in` | The number of iterations to ignore before checking for convergence. This is required because computations typically start with zero variance, as a result of using zeros(). The default is set to an arbitrary minimum which is usually enough but may need to be increased.
Source code in src/pydvl/value/stopping.py
converged
property
¶
completion
¶
completion() -> float
Returns a value between 0 and 1 indicating the completion of the computation.
__call__
¶
__call__(result: ValuationResult) -> Status
Calls check()
, maybe updating the result.
Source code in src/pydvl/value/stopping.py
MaxChecks
¶
Bases: StoppingCriterion
Terminate as soon as the number of checks exceeds the threshold.
A "check" is one call to the criterion.
PARAMETER | DESCRIPTION
---|---
`n_checks` | Threshold: if `None`, no check is performed, effectively creating a criterion that never stops.
Source code in src/pydvl/value/stopping.py
converged
property
¶
__call__
¶
__call__(result: ValuationResult) -> Status
Calls check()
, maybe updating the result.
Source code in src/pydvl/value/stopping.py
MaxUpdates
¶
Bases: StoppingCriterion
Terminate if any number of value updates exceeds or equals the given threshold.
Note
If you want to ensure that all values have been updated, you probably want MinUpdates instead.
This checks the counts
field of a
ValuationResult, i.e. the number of
times that each index has been updated. For powerset samplers, the maximum
of this number coincides with the maximum number of subsets sampled. For
permutation samplers, it coincides with the number of permutations sampled.
PARAMETER | DESCRIPTION
---|---
`n_updates` | Threshold: if `None`, no check is performed, effectively creating a criterion that never stops.
Source code in src/pydvl/value/stopping.py
converged
property
¶
__call__
¶
__call__(result: ValuationResult) -> Status
Calls check()
, maybe updating the result.
Source code in src/pydvl/value/stopping.py
MinUpdates
¶
Bases: StoppingCriterion
Terminate as soon as all value updates exceed or equal the given threshold.
This checks the counts
field of a
ValuationResult, i.e. the number of times that
each index has been updated. For powerset samplers, the minimum of this
number is a lower bound for the number of subsets sampled. For
permutation samplers, it lower-bounds the number of permutations sampled.
PARAMETER | DESCRIPTION
---|---
`n_updates` | Threshold: if `None`, no check is performed, effectively creating a criterion that never stops.
Source code in src/pydvl/value/stopping.py
converged
property
¶
__call__
¶
__call__(result: ValuationResult) -> Status
Calls check()
, maybe updating the result.
Source code in src/pydvl/value/stopping.py
MaxTime
¶
Bases: StoppingCriterion
Terminate if the computation time exceeds the given number of seconds.
Checks the elapsed time since construction
PARAMETER | DESCRIPTION
---|---
`seconds` | Threshold: the computation is terminated if the elapsed time between object construction and a check exceeds this value. If `None`, no check is performed.
Source code in src/pydvl/value/stopping.py
converged
property
¶
__call__
¶
__call__(result: ValuationResult) -> Status
Calls check()
, maybe updating the result.
Source code in src/pydvl/value/stopping.py
HistoryDeviation
¶
HistoryDeviation(
n_steps: int,
rtol: float,
pin_converged: bool = True,
modify_result: bool = True,
)
Bases: StoppingCriterion
A simple check for relative distance to a previous step in the computation.
The method used by (Ghorbani and Zou, 2019)1 computes the relative distances between the current values \(v_i^t\) and the values at the previous checkpoint \(v_i^{t-\tau}\). The computation is terminated when their sum falls below a given threshold:

$$\sum_{i=1}^n \frac{\left| v_i^t - v_i^{t-\tau} \right|}{v_i^t} < \epsilon.$$

When the denominator is zero, the summand is set to the value of \(v_i^{t-\tau}\).
This implementation is slightly generalised to allow for different number of updates to individual indices, as happens with powerset samplers instead of permutations. Every subset of indices that is found to converge can be pinned to that state. Once all indices have converged the method has converged.
Warning
This criterion is meant for the reproduction of the results in the paper, but we do not recommend using it in practice.
PARAMETER | DESCRIPTION
---|---
`n_steps` | Checkpoint values every so many updates and use these saved values to compare.
`rtol` | Relative tolerance for convergence (\(\epsilon\) in the formula).
`pin_converged` | If `True`, once an index has converged, it stays converged.
Source code in src/pydvl/value/stopping.py
converged
property
¶
completion
¶
completion() -> float
Returns a value between 0 and 1 indicating the completion of the computation.
__call__
¶
__call__(result: ValuationResult) -> Status
Calls check()
, maybe updating the result.
Source code in src/pydvl/value/stopping.py
RankCorrelation
¶
Bases: StoppingCriterion
A check for stability of Spearman correlation between checks.
When the change in rank correlation between two successive iterations is below a given threshold, the computation is terminated. The criterion computes the Spearman correlation between two successive iterations. The Spearman correlation uses the ordering indices of the given values and correlates them, so it focuses on the order of the elements instead of their exact values. If the order stops changing (meaning the semi-value estimates converge), the criterion stops the algorithm. This criterion is used in (Wang et al.)2.
PARAMETER | DESCRIPTION
---|---
`rtol` | Relative tolerance for convergence (\(\epsilon\) in the formula).
`modify_result` | If `True`, the status of the input ValuationResult is modified in place after the call.
`burn_in` | The minimum number of iterations before checking for convergence. This is required because the first correlation is meaningless.
Added in 0.9.0
Source code in src/pydvl/value/stopping.py
converged
property
¶
__call__
¶
__call__(result: ValuationResult) -> Status
Calls check()
, maybe updating the result.
Source code in src/pydvl/value/stopping.py
compute_least_core_values
¶
compute_least_core_values(
u: Utility,
*,
n_jobs: int = 1,
n_iterations: Optional[int] = None,
mode: LeastCoreMode = MonteCarlo,
non_negative_subsidy: bool = False,
solver_options: Optional[dict] = None,
progress: bool = False,
**kwargs,
) -> ValuationResult
Umbrella method to compute Least Core values with any of the available algorithms.
See Data valuation for an overview.
The following algorithms are available. Note that the exact method can only work with very small datasets and is thus intended only for testing.

- `exact`: uses the complete powerset of the training set for the constraints. Implemented in exact_least_core().
- `montecarlo`: uses the approximate Monte Carlo Least Core algorithm. Implemented in montecarlo_least_core().
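A minimal sketch of the Monte Carlo variant; the model, dataset and parameters are only illustrative, and the imports assume these names are exported as documented here:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Utility
from pydvl.value import compute_least_core_values
from pydvl.value.least_core import LeastCoreMode

u = Utility(LogisticRegression(max_iter=500), Dataset.from_sklearn(load_iris()))
values = compute_least_core_values(
    u, mode=LeastCoreMode.MonteCarlo, n_iterations=500, n_jobs=4
)
```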
PARAMETER | DESCRIPTION
---|---
`u` | Utility object with model, data, and scoring function.
`n_jobs` | Number of jobs to run in parallel. Only used for Monte Carlo Least Core.
`n_iterations` | Number of subsets to sample and evaluate the utility on. Only used for Monte Carlo Least Core.
`mode` | Algorithm to use. See LeastCoreMode for available options.
`non_negative_subsidy` | If True, the least core subsidy \(e\) is constrained to be non-negative.
`solver_options` | Optional dictionary of options passed to the solvers.

RETURNS | DESCRIPTION
---|---
`ValuationResult` | Object with the computed values.
New in version 0.5.0
Source code in src/pydvl/value/least_core/__init__.py
compute_loo
¶
compute_loo(
u: Utility,
*,
n_jobs: int = 1,
parallel_backend: Optional[ParallelBackend] = None,
config: Optional[ParallelConfig] = None,
progress: bool = True,
) -> ValuationResult
Computes leave-one-out values:

$$v_u(i) = u(D) - u(D \setminus \{i\})$$
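A minimal sketch (the model and dataset are only illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Utility
from pydvl.value import compute_loo

u = Utility(LogisticRegression(max_iter=500), Dataset.from_sklearn(load_iris()))
values = compute_loo(u, n_jobs=4)
```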
PARAMETER | DESCRIPTION
---|---
`u` | Utility object with model, data, and scoring function.
`n_jobs` | Number of parallel jobs to use.
`parallel_backend` | Parallel backend instance to use for parallelizing computations. If `None`, the default joblib backend is used.
`config` | (DEPRECATED) Object configuring parallel computation, with cluster address, number of cpus, etc.
`progress` | If True, display a progress bar.

RETURNS | DESCRIPTION
---|---
`ValuationResult` | Object with the data values.
New in version 0.7.0
Renamed from naive_loo
and added parallel computation.
Changed in version 0.9.0
Deprecated config
argument and added a parallel_backend
argument to allow users to pass the Parallel Backend instance
directly.
Source code in src/pydvl/value/loo/loo.py
compute_data_oob
¶
compute_data_oob(
u: Utility,
*,
n_est: int = 10,
max_samples: float = 0.8,
loss: Optional[PointwiseScore] = None,
n_jobs: Optional[int] = None,
seed: Optional[Seed] = None,
progress: bool = False,
) -> ValuationResult
Computes Data out of bag values
This implements the method described in (Kwon and Zou, 2023)1. It fits several base estimators provided through u.model through a bagging process. The point value corresponds to the average loss of estimators which were not fit on it.
The value of a point is given by:

$$\psi((x_i, y_i)) = \frac{\sum_{b=1}^{B} \mathbb{1}(w_{bi} = 0)\, T(y_i, \hat{f}_b(x_i))}{\sum_{b=1}^{B} \mathbb{1}(w_{bi} = 0)},$$

where \(w_{bj} \in \mathbb{Z}\) is the number of times the j-th datum \((x_j, y_j)\) is selected in the b-th bootstrap dataset, and \(T\) is a score function that represents the goodness of a weak learner \(\hat{f}_b\) at the i-th datum \((x_i, y_i)\).
n_est and max_samples must be tuned jointly to ensure that every sample is out-of-bag at least once; otherwise the result could include a NaN value for that datum.
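A minimal sketch, assuming a bagging-compatible scikit-learn model (all choices are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

from pydvl.utils import Dataset, Utility
from pydvl.value import compute_data_oob

u = Utility(DecisionTreeClassifier(), Dataset.from_sklearn(load_iris()))
# 50 weak learners, each trained on 80% of the data, so that every point
# is likely out-of-bag at least once.
values = compute_data_oob(u, n_est=50, max_samples=0.8, seed=42)
```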
PARAMETER | DESCRIPTION
---|---
`u` | Utility object with model, data, and scoring function.
`n_est` | Number of estimators used in the bagging procedure.
`max_samples` | The fraction of samples to draw to train each base estimator.
`loss` | A function taking as parameters model prediction and corresponding data labels (y_true, y_pred) and returning an array of point-wise errors.
`n_jobs` | The number of jobs to run in parallel, used in the bagging procedure for both fit and predict.
`seed` | Either an instance of a numpy random number generator or a seed for it.
`progress` | If True, display a progress bar.

RETURNS | DESCRIPTION
---|---
`ValuationResult` | Object with the data values.
Source code in src/pydvl/value/oob/oob.py
compute_generic_semivalues
¶
compute_generic_semivalues(
sampler: PowersetSampler[IndexT],
u: Utility,
coefficient: SVCoefficient,
done: StoppingCriterion,
*,
marginal: MarginalFunction = DefaultMarginal(),
future_processor: FutureProcessor = FutureProcessor(),
batch_size: int = 1,
skip_converged: bool = False,
n_jobs: int = 1,
parallel_backend: Optional[ParallelBackend] = None,
config: Optional[ParallelConfig] = None,
progress: bool = False,
) -> ValuationResult
Computes semi-values for a given utility function and subset sampler.
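A minimal sketch using Banzhaf semi-values; the model, dataset and stopping criterion are illustrative, and `banzhaf_coefficient` is assumed to be the coefficient helper defined in pydvl.value.semivalues:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Utility
from pydvl.value import MaxUpdates, compute_generic_semivalues
from pydvl.value.sampler import PermutationSampler
from pydvl.value.semivalues import banzhaf_coefficient

u = Utility(LogisticRegression(max_iter=500), Dataset.from_sklearn(load_iris()))
values = compute_generic_semivalues(
    PermutationSampler(u.data.indices, seed=42),
    u,
    banzhaf_coefficient,
    MaxUpdates(100),
)
```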
PARAMETER | DESCRIPTION
---|---
`sampler` | The subset sampler to use for utility computations.
`u` | Utility object with model, data, and scoring function.
`coefficient` | The semi-value coefficient.
`done` | Stopping criterion.
`marginal` | Marginal function to be used for computing the semivalues.
`future_processor` | Additional postprocessing steps required for some algorithms.
`batch_size` | Number of marginal evaluations per single parallel job.
`skip_converged` | Whether to skip marginal evaluations for indices that have already converged. CAUTION: This is only entirely safe if the stopping criterion is MaxUpdates. For any other stopping criterion, the convergence status of indices may change during the computation, or they may be marked as having converged even though in fact the estimated values are far from the true values (e.g. for AbsoluteStandardError, you will probably have to carefully adjust the threshold).
`n_jobs` | Number of parallel jobs to use.
`parallel_backend` | Parallel backend instance to use for parallelizing computations. If `None`, the default joblib backend is used.
`config` | (DEPRECATED) Object configuring parallel computation, with cluster address, number of cpus, etc.
`progress` | Whether to display a progress bar.

RETURNS | DESCRIPTION
---|---
`ValuationResult` | Object with the results.
Deprecation notice
Parameter batch_size
is for experimental use and will be removed in
future versions.
Changed in version 0.9.0
Deprecated config
argument and added a parallel_backend
argument to allow users to pass the Parallel Backend instance
directly.
Source code in src/pydvl/value/semivalues.py
compute_shapley_semivalues
¶
compute_shapley_semivalues(
u: Utility,
*,
done: StoppingCriterion,
sampler_t: Type[StochasticSampler] = PermutationSampler,
batch_size: int = 1,
n_jobs: int = 1,
parallel_backend: Optional[ParallelBackend] = None,
config: Optional[ParallelConfig] = None,
progress: bool = False,
seed: Optional[Seed] = None,
) -> ValuationResult
Computes Shapley values for a given utility function.
This is a convenience wrapper for compute_generic_semivalues with the Shapley coefficient. Use compute_shapley_values for a more flexible interface and additional methods, including TMCS.
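A minimal sketch (the model, dataset and stopping criterion are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Utility
from pydvl.value import MaxUpdates, compute_shapley_semivalues

u = Utility(LogisticRegression(max_iter=500), Dataset.from_sklearn(load_iris()))
values = compute_shapley_semivalues(u, done=MaxUpdates(500), n_jobs=4)
```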
PARAMETER | DESCRIPTION
---|---
`u` | Utility object with model, data, and scoring function.
`done` | Stopping criterion.
`sampler_t` | The sampler type to use. See the sampler module for a list.
`batch_size` | Number of marginal evaluations per single parallel job.
`n_jobs` | Number of parallel jobs to use.
`parallel_backend` | Parallel backend instance to use for parallelizing computations. If `None`, the default joblib backend is used.
`config` | (DEPRECATED) Object configuring parallel computation, with cluster address, number of cpus, etc.
`seed` | Either an instance of a numpy random number generator or a seed for it.
`progress` | Whether to display a progress bar.

RETURNS | DESCRIPTION
---|---
`ValuationResult` | Object with the results.
Deprecation notice
Parameter batch_size
is for experimental use and will be removed in
future versions.
Changed in version 0.9.0
Deprecated config
argument and added a parallel_backend
argument to allow users to pass the Parallel Backend instance
directly.
Source code in src/pydvl/value/semivalues.py
compute_banzhaf_semivalues
¶
compute_banzhaf_semivalues(
u: Utility,
*,
done: StoppingCriterion,
sampler_t: Type[StochasticSampler] = PermutationSampler,
batch_size: int = 1,
n_jobs: int = 1,
parallel_backend: Optional[ParallelBackend] = None,
config: Optional[ParallelConfig] = None,
progress: bool = False,
seed: Optional[Seed] = None,
) -> ValuationResult
Computes Banzhaf values for a given utility function.
This is a convenience wrapper for compute_generic_semivalues with the Banzhaf coefficient.
PARAMETER | DESCRIPTION
---|---
`u` | Utility object with model, data, and scoring function.
`done` | Stopping criterion.
`sampler_t` | The sampler type to use. See the sampler module for a list.
`batch_size` | Number of marginal evaluations per single parallel job.
`n_jobs` | Number of parallel jobs to use.
`seed` | Either an instance of a numpy random number generator or a seed for it.
`parallel_backend` | Parallel backend instance to use for parallelizing computations. If `None`, the default joblib backend is used.
`config` | (DEPRECATED) Object configuring parallel computation, with cluster address, number of cpus, etc.
`progress` | Whether to display a progress bar.

RETURNS | DESCRIPTION
---|---
`ValuationResult` | Object with the results.
Deprecation notice
Parameter batch_size
is for experimental use and will be removed in
future versions.
Changed in version 0.9.0
Deprecated config
argument and added a parallel_backend
argument to allow users to pass the Parallel Backend instance
directly.
Source code in src/pydvl/value/semivalues.py
compute_msr_banzhaf_semivalues
¶
compute_msr_banzhaf_semivalues(
u: Utility,
*,
done: StoppingCriterion,
sampler_t: Type[StochasticSampler] = MSRSampler,
batch_size: int = 1,
n_jobs: int = 1,
parallel_backend: Optional[ParallelBackend] = None,
config: Optional[ParallelConfig] = None,
progress: bool = False,
seed: Optional[Seed] = None,
) -> ValuationResult
Computes MSR sampled Banzhaf values for a given utility function.
This is a convenience wrapper for compute_generic_semivalues with the Banzhaf coefficient and MSR sampling.
This algorithm works by sampling random subsets and then evaluating the utility on each subset only once. Based on the evaluation and the subset indices, the MSRFutureProcessor then computes the marginal updates as in the paper (Wang et al.)3. Their approach updates the semivalues for all data points every time a new evaluation is computed. This increases sample efficiency compared to normal Monte Carlo updates.
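A minimal sketch (the model, dataset and stopping criterion are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Utility
from pydvl.value import RankCorrelation, compute_msr_banzhaf_semivalues

u = Utility(LogisticRegression(max_iter=500), Dataset.from_sklearn(load_iris()))
# One utility evaluation per sampled subset updates all indices at once.
values = compute_msr_banzhaf_semivalues(
    u, done=RankCorrelation(rtol=1e-4, burn_in=100), seed=42
)
```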
PARAMETER | DESCRIPTION
---|---
`u` | Utility object with model, data, and scoring function.
`done` | Stopping criterion.
`sampler_t` | The sampler type to use. See the sampler module for a list.
`batch_size` | Number of marginal evaluations per single parallel job.
`n_jobs` | Number of parallel jobs to use.
`seed` | Either an instance of a numpy random number generator or a seed for it.
`config` | Object configuring parallel computation, with cluster address, number of cpus, etc.
`progress` | Whether to display a progress bar.

RETURNS | DESCRIPTION
---|---
`ValuationResult` | Object with the results.
Deprecation notice
Parameter batch_size
is for experimental use and will be removed in
future versions.
Source code in src/pydvl/value/semivalues.py
compute_beta_shapley_semivalues
¶
compute_beta_shapley_semivalues(
u: Utility,
*,
alpha: float = 1,
beta: float = 1,
done: StoppingCriterion,
sampler_t: Type[StochasticSampler] = PermutationSampler,
batch_size: int = 1,
n_jobs: int = 1,
parallel_backend: Optional[ParallelBackend] = None,
config: Optional[ParallelConfig] = None,
progress: bool = False,
seed: Optional[Seed] = None,
) -> ValuationResult
Computes Beta Shapley values for a given utility function.
This is a convenience wrapper for compute_generic_semivalues with the Beta Shapley coefficient.
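A minimal sketch (the model, dataset, stopping criterion and Beta parameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Utility
from pydvl.value import MaxUpdates, compute_beta_shapley_semivalues

u = Utility(LogisticRegression(max_iter=500), Dataset.from_sklearn(load_iris()))
# alpha and beta parametrize the Beta distribution weighting subset sizes.
values = compute_beta_shapley_semivalues(u, alpha=16, beta=1, done=MaxUpdates(500))
```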
PARAMETER | DESCRIPTION
---|---
`u` | Utility object with model, data, and scoring function.
`alpha` | Alpha parameter of the Beta distribution.
`beta` | Beta parameter of the Beta distribution.
`done` | Stopping criterion.
`sampler_t` | The sampler type to use. See the sampler module for a list.
`batch_size` | Number of marginal evaluations per (parallelized) task.
`n_jobs` | Number of parallel jobs to use.
`seed` | Either an instance of a numpy random number generator or a seed for it.
`parallel_backend` | Parallel backend instance to use for parallelizing computations. If `None`, the default joblib backend is used.
`config` | (DEPRECATED) Object configuring parallel computation, with cluster address, number of cpus, etc.
`progress` | Whether to display a progress bar.

RETURNS | DESCRIPTION
---|---
`ValuationResult` | Object with the results.
Deprecation notice
Parameter batch_size
is for experimental use and will be removed in
future versions.
Changed in version 0.9.0
Deprecated config
argument and added a parallel_backend
argument to allow users to pass the Parallel Backend instance
directly.
Source code in src/pydvl/value/semivalues.py
compute_semivalues
¶
compute_semivalues(
u: Utility,
*,
done: StoppingCriterion,
mode: SemiValueMode = Shapley,
sampler_t: Type[StochasticSampler] = PermutationSampler,
batch_size: int = 1,
n_jobs: int = 1,
seed: Optional[Seed] = None,
**kwargs: Any,
) -> ValuationResult
Convenience entry point for most common semi-value computations.
Deprecation warning
This method is deprecated and will be replaced in 0.8.0 by the more general implementation of compute_generic_semivalues. Use compute_shapley_semivalues, compute_banzhaf_semivalues, or compute_beta_shapley_semivalues instead.
The modes supported with this interface are the following. For greater flexibility use compute_generic_semivalues directly.

- SemiValueMode.Shapley: Shapley values.
- SemiValueMode.BetaShapley: implements the Beta Shapley semi-value as introduced in (Kwon and Zou, 2022)1. Pass additional keyword arguments `alpha` and `beta` to set the parameters of the Beta distribution (both default to 1).
- SemiValueMode.Banzhaf: implements the Banzhaf semi-value as introduced in (Wang and Jia, 2022)1.
See Data valuation for an overview of valuation.
PARAMETER | DESCRIPTION
---|---
`u` | Utility object with model, data, and scoring function.
`done` | Stopping criterion.
`mode` | The semi-value mode to use. See SemiValueMode for a list.
`sampler_t` | The sampler type to use. See sampler for a list.
`batch_size` | Number of marginal evaluations per (parallelized) task.
`n_jobs` | Number of parallel jobs to use.
`seed` | Either an instance of a numpy random number generator or a seed for it.
`kwargs` | Additional keyword arguments passed to compute_generic_semivalues.

RETURNS | DESCRIPTION
---|---
`ValuationResult` | Object with the results.
Deprecation notice
Parameter batch_size
is for experimental use and will be removed in
future versions.
Source code in src/pydvl/value/semivalues.py
compute_classwise_shapley_values
¶
compute_classwise_shapley_values(
u: Utility,
*,
done: StoppingCriterion,
truncation: TruncationPolicy,
done_sample_complements: Optional[StoppingCriterion] = None,
normalize_values: bool = True,
use_default_scorer_value: bool = True,
min_elements_per_label: int = 1,
n_jobs: int = 1,
parallel_backend: Optional[ParallelBackend] = None,
config: Optional[ParallelConfig] = None,
progress: bool = False,
seed: Optional[Seed] = None,
) -> ValuationResult
Computes an approximate Class-wise Shapley value by sampling independent permutations of the index set for each label, and index sets sampled from the powerset of the complement (with respect to the currently evaluated label), approximating the sum of marginal contributions \(u(\sigma_{:i} \cup \{i\} \mid S) - u(\sigma_{:i} \mid S)\), where \(\sigma_{:i}\) denotes the set of indices in permutation \(\sigma\) before the position where \(i\) appears and \(S\) is a subset of the index set of all other labels (see the main documentation for details).
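A minimal sketch; the model, dataset, stopping criteria and truncation policy below are illustrative choices, and the imports assume these names are exported as documented here:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Utility
from pydvl.value import (
    ClasswiseScorer,
    HistoryDeviation,
    MaxChecks,
    compute_classwise_shapley_values,
)
from pydvl.value.shapley.truncated import RelativeTruncation

scorer = ClasswiseScorer("accuracy", initial_label=0)
u = Utility(
    LogisticRegression(max_iter=500), Dataset.from_sklearn(load_iris()), scorer
)
values = compute_classwise_shapley_values(
    u,
    done=HistoryDeviation(n_steps=500, rtol=1e-3),
    truncation=RelativeTruncation(u, rtol=0.01),
    done_sample_complements=MaxChecks(1),
    normalize_values=True,
)
```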
PARAMETER | DESCRIPTION
---|---
`u` | Utility object containing model, data, and scoring function. The scorer must be of type ClasswiseScorer.
`done` | Function that checks whether the computation needs to stop.
`truncation` | Callable function that decides whether to interrupt processing a permutation and set subsequent marginals to zero.
`done_sample_complements` | Function checking whether computation needs to stop. Otherwise, it will resample conditional sets until the stopping criterion is met.
`normalize_values` | Indicates whether to normalize the values by the variation in each class times their in-class accuracy.
`use_default_scorer_value` | The first set of indices is the sampled complement set. Unless otherwise specified, the default scorer value is used for this. If set to false, the base score is calculated from the utility.
`min_elements_per_label` | The minimum number of elements for each opposite label.
`n_jobs` | Number of parallel jobs to run.
`parallel_backend` | Parallel backend instance to use for parallelizing computations. If `None`, the default joblib backend is used.
`config` | (DEPRECATED) Object configuring parallel computation, with cluster address, number of cpus, etc.
`progress` | Whether to display a progress bar.
`seed` | Either an instance of a numpy random number generator or a seed for it.

RETURNS | DESCRIPTION
---|---
`ValuationResult` | Object containing the computed data values.
New in version 0.7.1
Source code in src/pydvl/value/shapley/classwise.py
compute_shapley_values
¶
compute_shapley_values(
u: Utility,
*,
done: StoppingCriterion = MaxChecks(None),
mode: ShapleyMode = TruncatedMontecarlo,
n_jobs: int = 1,
seed: Optional[Seed] = None,
**kwargs,
) -> ValuationResult
Umbrella method to compute Shapley values with any of the available algorithms.
See Data valuation for an overview.
The following algorithms are available. Note that the exact methods can only work with very small datasets and are thus intended only for testing. Some algorithms also accept additional arguments, please refer to the documentation of each particular method.
- `combinatorial_exact`: uses the combinatorial implementation of data Shapley. Implemented in combinatorial_exact_shapley().
- `combinatorial_montecarlo`: uses the approximate Monte Carlo implementation of combinatorial data Shapley. Implemented in combinatorial_montecarlo_shapley().
- `permutation_exact`: uses the permutation-based implementation of data Shapley. Computation is not parallelized. Implemented in permutation_exact_shapley().
- `permutation_montecarlo`: uses the approximate Monte Carlo implementation of permutation data Shapley. Accepts a TruncationPolicy to stop computing marginals. Implemented in permutation_montecarlo_shapley().
- `owen_sampling`: uses the Owen continuous extension of the utility function to the unit cube. Implemented in owen_sampling_shapley(). This method does not take a StoppingCriterion but instead requires a parameter `q_max` for the number of subdivisions of the unit interval to use for integration, and another parameter `n_samples` for the number of subsets to sample for each \(q\).
- `owen_halved`: same as `owen_sampling` but uses correlated samples in the expectation. Implemented in owen_sampling_shapley(). This method requires an additional parameter `q_max` for the number of subdivisions of the interval [0, 0.5] to use for integration, and another parameter `n_samples` for the number of subsets to sample for each \(q\).
- `group_testing`: estimates differences of Shapley values and solves a constraint satisfaction problem. High sample complexity, not recommended. Implemented in group_testing_shapley(). This method does not take a StoppingCriterion but instead requires a parameter `n_samples` for the number of iterations to run.

Additionally, one can use model-specific methods:

- `knn`: exact method for K-nearest-neighbour models. Implemented in knn_shapley().
PARAMETER | DESCRIPTION |
---|---|
u | Utility object with model, data, and scoring function. |
done | Object used to determine when to stop the computation for Monte Carlo methods. The default is to stop after 100 iterations. See the available criteria in stopping. It is possible to combine several of them using boolean operators. Some methods ignore this argument, others require specific subtypes. |
n_jobs | Number of parallel jobs (available only to some methods). |
seed | Either an instance of a numpy random number generator or a seed for it. |
mode | Choose which Shapley algorithm to use. See ShapleyMode for a list of allowed values. |
RETURNS | DESCRIPTION |
---|---|
ValuationResult | Object with the results. |
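Example
A minimal sketch of calling the umbrella method (dataset, model and budget are illustrative; import paths assume the public pydvl.value namespace):
```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Utility
from pydvl.value import compute_shapley_values
from pydvl.value.shapley import ShapleyMode
from pydvl.value.stopping import MaxUpdates

# Wrap model, data and scorer into a Utility
data = Dataset.from_sklearn(load_breast_cancer(), train_size=0.8)
u = Utility(LogisticRegression(), data, "accuracy")

# Permutation Monte Carlo with a budget of 100 updates per value
result = compute_shapley_values(
    u, mode=ShapleyMode.PermutationMontecarlo, done=MaxUpdates(100)
)
print(result.values)  # array with one estimated value per training point
```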
Source code in src/pydvl/value/shapley/common.py
num_samples_eps_delta
¶
num_samples_eps_delta(
    eps: float, delta: float, n: int, utility_range: float
) -> int
Implements the formula in Theorem 3 of (Jia, R. et al., 2019)1, which gives a lower bound on the number of samples required to obtain an (ε/√n, δ/(N(N-1)))-approximation to all pair-wise differences of Shapley values, w.r.t. the \(\ell_2\) norm.
PARAMETER | DESCRIPTION |
---|---|
eps | ε |
delta | δ |
n | Number of data points. |
utility_range | Range of the utility function. |
RETURNS | DESCRIPTION |
---|---|
int | Number of samples from \(2^{[n]}\) guaranteeing ε/√n-correct Shapley pair-wise differences of values with probability 1-δ/(N(N-1)). |
New in version 0.4.0
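Example
As a rough illustration (the numbers are arbitrary), assuming a utility bounded in [0, 1] such as accuracy:
```python
from pydvl.value.shapley.gt import num_samples_eps_delta

# Lower bound on the number of subsets to draw from 2^[n] for 100 data points
n_samples = num_samples_eps_delta(eps=0.1, delta=0.05, n=100, utility_range=1.0)
print(n_samples)
```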
Source code in src/pydvl/value/shapley/gt.py
group_testing_shapley
¶
group_testing_shapley(
u: Utility,
n_samples: int,
epsilon: float,
delta: float,
*,
n_jobs: int = 1,
parallel_backend: Optional[ParallelBackend] = None,
config: Optional[ParallelConfig] = None,
progress: bool = False,
seed: Optional[Seed] = None,
**options: dict,
) -> ValuationResult
Implements group testing for approximation of Shapley values as described in (Jia, R. et al., 2019)1.
Warning
This method is very inefficient. It requires several orders of magnitude more evaluations of the utility than others in montecarlo. It also uses several intermediate objects like the results from the runners and the constraint matrices which can become rather large.
By picking a specific distribution over subsets, the differences in Shapley values can be approximated with a Monte Carlo sum. These are then used to solve for the individual values in a feasibility problem.
PARAMETER | DESCRIPTION |
---|---|
u | Utility object with model, data, and scoring function. |
n_samples | Number of tests to perform. Use num_samples_eps_delta to estimate this. |
epsilon | From the (ε,δ) sample bound. Use the same value as for the estimation of n_samples. |
delta | From the (ε,δ) sample bound. Use the same value as for the estimation of n_samples. |
n_jobs | Number of parallel jobs to use. Each worker performs a chunk of all tests (i.e. utility evaluations). |
parallel_backend | Parallel backend instance to use for parallelizing computations. |
config | (DEPRECATED) Object configuring parallel computation, with cluster address, number of cpus, etc. |
progress | Whether to display progress bars for each job. |
seed | Either an instance of a numpy random number generator or a seed for it. |
options | Additional options to pass to cvxpy.Problem.solve(), e.g. to change the solver. |
RETURNS | DESCRIPTION |
---|---|
ValuationResult | Object with the data values. |
New in version 0.4.0
Changed in version 0.5.0
Changed the solver to cvxpy instead of scipy's linprog. Added the ability to pass arbitrary options to it.
Changed in version 0.9.0
Deprecated config
argument and added a parallel_backend
argument to allow users to pass the Parallel Backend instance
directly.
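Example
A usage sketch, assuming u is a Utility as in the examples above with a score bounded in [0, 1] (so that utility_range=1.0 holds):
```python
from pydvl.value.shapley.gt import group_testing_shapley, num_samples_eps_delta

eps, delta = 0.1, 0.05
# Size the number of tests from the (ε,δ) sample bound
n_samples = num_samples_eps_delta(eps, delta, n=len(u.data), utility_range=1.0)
values = group_testing_shapley(
    u, n_samples=n_samples, epsilon=eps, delta=delta, n_jobs=4
)
```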
Source code in src/pydvl/value/shapley/gt.py
knn_shapley
¶
knn_shapley(u: Utility, *, progress: bool = True) -> ValuationResult
Computes exact Shapley values for a KNN classifier.
This implements the method described in (Jia, R. et al., 2019)1. It exploits the local structure of K-Nearest Neighbours to reduce the value computation to sorting the training points by distance to each test point and applying a recursive formula, bringing the computation time down to \(O(n_{test}\, n_{train} \log n_{train})\).
PARAMETER | DESCRIPTION |
---|---|
u | Utility with a KNN model to extract parameters from. The object will not be modified nor used other than to call get_params(). |
progress | Whether to display a progress bar. |
RETURNS | DESCRIPTION |
---|---|
ValuationResult | Object with the data values. |
RAISES | DESCRIPTION |
---|---|
TypeError | If the model in the utility is not a sklearn.neighbors.KNeighborsClassifier. |
New in version 0.1.0
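Example
A minimal sketch (dataset and hyperparameters are illustrative):
```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

from pydvl.utils import Dataset, Utility
from pydvl.value.shapley.knn import knn_shapley

data = Dataset.from_sklearn(load_iris(), train_size=0.8)
# The utility only needs the model to read its parameters (e.g. n_neighbors)
u = Utility(KNeighborsClassifier(n_neighbors=5), data)
values = knn_shapley(u, progress=False)
```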
Source code in src/pydvl/value/shapley/knn.py
permutation_montecarlo_shapley
¶
permutation_montecarlo_shapley(
u: Utility,
done: StoppingCriterion,
*,
truncation: TruncationPolicy = NoTruncation(),
n_jobs: int = 1,
parallel_backend: Optional[ParallelBackend] = None,
config: Optional[ParallelConfig] = None,
progress: bool = False,
seed: Optional[Seed] = None,
) -> ValuationResult
Computes an approximate Shapley value by sampling independent permutations of the index set, approximating the sum:

\[
v_u(i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)} [u(\sigma_{:i} \cup \{i\}) - u(\sigma_{:i})],
\]

where \(\sigma_{:i}\) denotes the set of indices in permutation \(\sigma\) before the position where \(i\) appears (see Data valuation for details).
This implements the method described in (Ghorbani and Zou, 2019)1 with a double stopping criterion.
Todo
Think of how to add Robin-Gelman or some other more principled stopping criterion.
Instead of naively implementing the expectation, we sequentially add points to coalitions from a permutation and incrementally compute marginal utilities. We stop computing marginals for a given permutation based on a TruncationPolicy. (Ghorbani and Zou, 2019)1 mention two policies: one that stops after a certain fraction of marginals are computed, implemented in FixedTruncation, and one that stops if the last computed utility ("score") is close to the total utility using the standard deviation of the utility as a measure of proximity, implemented in BootstrapTruncation.
We keep sampling permutations and updating all Shapley values until the StoppingCriterion returns `True`.
PARAMETER | DESCRIPTION |
---|---|
u | Utility object with model, data, and scoring function. |
done | Function checking whether computation must stop. |
truncation | An optional callable which decides whether to interrupt processing a permutation and set all subsequent marginals to zero. Typically used to stop computation when the marginal is small. |
n_jobs | Number of jobs across which to distribute the computation. |
parallel_backend | Parallel backend instance to use for parallelizing computations. |
config | (DEPRECATED) Object configuring parallel computation, with cluster address, number of cpus, etc. |
progress | Whether to display a progress bar. |
seed | Either an instance of a numpy random number generator or a seed for it. |
RETURNS | DESCRIPTION |
---|---|
ValuationResult | Object with the data values. |
Changed in version 0.9.0
Deprecated config
argument and added a parallel_backend
argument to allow users to pass the Parallel Backend instance
directly.
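Example
A sketch combining a stopping criterion with a truncation policy; the composition via | and the RelativeTruncation parameters are illustrative, and u is assumed to be a Utility as in the examples above:
```python
from pydvl.value.shapley.montecarlo import permutation_montecarlo_shapley
from pydvl.value.shapley.truncated import RelativeTruncation
from pydvl.value.stopping import HistoryDeviation, MaxUpdates

# Stop either when values stabilize (HistoryDeviation) or after a fixed
# budget of updates, whichever comes first.
done = HistoryDeviation(n_steps=100, rtol=0.05) | MaxUpdates(500)
values = permutation_montecarlo_shapley(
    u,
    done=done,
    truncation=RelativeTruncation(u, rtol=0.01),  # skip negligible marginals
    n_jobs=4,
)
```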
Source code in src/pydvl/value/shapley/montecarlo.py
combinatorial_montecarlo_shapley
¶
combinatorial_montecarlo_shapley(
u: Utility,
done: StoppingCriterion,
*,
n_jobs: int = 1,
parallel_backend: Optional[ParallelBackend] = None,
config: Optional[ParallelConfig] = None,
progress: bool = False,
seed: Optional[Seed] = None,
) -> ValuationResult
Computes an approximate Shapley value using the combinatorial definition:

\[
v_u(i) = \frac{1}{n} \sum_{S \subseteq N \setminus \{i\}} \binom{n-1}{|S|}^{-1} [u(S \cup \{i\}) - u(S)].
\]

This consists of randomly sampling subsets of the power set of the training indices in u.data, and computing their marginal utilities. See Data valuation for details.
Note that because sampling is done with replacement, the approximation is poor even for \(2^{m}\) subsets with \(m>n\), even though there are \(2^{n-1}\) subsets for each \(i\). Prefer permutation_montecarlo_shapley().
Parallelization is done by splitting the set of indices across processes and computing the sum over subsets \(S \subseteq N \setminus \{i\}\) separately.
PARAMETER | DESCRIPTION |
---|---|
u | Utility object with model, data, and scoring function. |
done | Stopping criterion for the computation. |
n_jobs | Number of parallel jobs across which to distribute the computation. Each worker receives a chunk of indices. |
parallel_backend | Parallel backend instance to use for parallelizing computations. |
config | (DEPRECATED) Object configuring parallel computation, with cluster address, number of cpus, etc. |
progress | Whether to display progress bars for each job. |
seed | Either an instance of a numpy random number generator or a seed for it. |
RETURNS | DESCRIPTION |
---|---|
ValuationResult | Object with the data values. |
Changed in version 0.9.0
Deprecated config
argument and added a parallel_backend
argument to allow users to pass the Parallel Backend instance
directly.
Source code in src/pydvl/value/shapley/montecarlo.py
permutation_exact_shapley
¶
permutation_exact_shapley(
u: Utility, *, progress: bool = True
) -> ValuationResult
Computes the exact Shapley value using the formulation with permutations:

\[
v_u(i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)} [u(\sigma_{:i} \cup \{i\}) - u(\sigma_{:i})].
\]
See Data valuation for details.
When the length of the training set is > 10 this prints a warning since the computation becomes too expensive. Used mostly for internal testing and simple use cases. Please refer to the Monte Carlo approximations for practical applications.
PARAMETER | DESCRIPTION |
---|---|
u | Utility object with model, data, and scoring function. |
progress | Whether to display progress bars for each job. |
RETURNS | DESCRIPTION |
---|---|
ValuationResult | Object with the data values. |
Source code in src/pydvl/value/shapley/naive.py
combinatorial_exact_shapley
¶
combinatorial_exact_shapley(
u: Utility,
*,
n_jobs: int = 1,
parallel_backend: Optional[ParallelBackend] = None,
config: Optional[ParallelConfig] = None,
progress: bool = False,
) -> ValuationResult
Computes the exact Shapley value using the combinatorial definition.
See Data valuation for details.
Note
If the length of the training set is > n_jobs*20 this prints a warning because the computation is very expensive. Used mostly for internal testing and simple use cases. Please refer to the Monte Carlo approximations for practical applications.
PARAMETER | DESCRIPTION |
---|---|
u | Utility object with model, data, and scoring function. |
n_jobs | Number of parallel jobs to use. |
parallel_backend | Parallel backend instance to use for parallelizing computations. |
config | (DEPRECATED) Object configuring parallel computation, with cluster address, number of cpus, etc. |
progress | Whether to display progress bars for each job. |
RETURNS | DESCRIPTION |
---|---|
ValuationResult | Object with the data values. |
Changed in version 0.9.0
Deprecated config
argument and added a parallel_backend
argument to allow users to pass the Parallel Backend instance
directly.
Source code in src/pydvl/value/shapley/naive.py
owen_sampling_shapley
¶
owen_sampling_shapley(
u: Utility,
n_samples: int,
max_q: int,
*,
method: OwenAlgorithm = Standard,
n_jobs: int = 1,
parallel_backend: Optional[ParallelBackend] = None,
config: Optional[ParallelConfig] = None,
progress: bool = False,
seed: Optional[Seed] = None,
) -> ValuationResult
Owen sampling of Shapley values as described in (Okhrati and Lipani, 2021)1.
This function computes a Monte Carlo approximation to

\[
v_u(i) = \int_0^1 \mathbb{E}_{S \sim P_q(D_{\setminus \{i\}})} [u(S \cup \{i\}) - u(S)]\, dq,
\]

using one of two methods. The first one, selected with the argument mode = OwenAlgorithm.Standard, approximates the integral with:

\[
\hat{v}_u(i) = \frac{1}{Q M} \sum_{j=1}^{Q} \sum_{m=1}^{M} [u(S^{(q_j)}_m \cup \{i\}) - u(S^{(q_j)}_m)],
\]

where \(q_j = \frac{j}{Q} \in [0,1]\) and the sets \(S^{(q_j)}\) are such that a sample \(x \in S^{(q_j)}\) if a draw from a \(Ber(q_j)\) distribution is 1.

The second method, selected with the argument mode = OwenAlgorithm.Antithetic, uses correlated samples in the inner sum to reduce the variance:

\[
\hat{v}_u(i) = \frac{1}{2 Q M} \sum_{j=1}^{Q} \sum_{m=1}^{M} [u(S^{(q_j)}_m \cup \{i\}) - u(S^{(q_j)}_m) + u((S^{(q_j)}_m)^c \cup \{i\}) - u((S^{(q_j)}_m)^c)],
\]

where now \(q_j = \frac{j}{2Q} \in [0,\frac{1}{2}]\), and \(S^c\) is the complement of \(S\).
Note
The outer integration could be done instead with a quadrature rule.
PARAMETER | DESCRIPTION |
---|---|
u | Utility object holding data, model and scoring function. |
n_samples | Number of sets to sample for each value of q. |
max_q | Number of subdivisions for q ∈ [0,1] (the element sampling probability) used to approximate the outer integral. |
method | Selects the algorithm to use, see the description. Either OwenAlgorithm.Standard for \(q \in [0,1]\) or OwenAlgorithm.Antithetic for \(q \in [0,0.5]\) and correlated samples. |
n_jobs | Number of parallel jobs to use. Each worker receives a chunk of the total number of samples. |
parallel_backend | Parallel backend instance to use for parallelizing computations. |
config | (DEPRECATED) Object configuring parallel computation, with cluster address, number of cpus, etc. |
progress | Whether to display progress bars for each job. |
seed | Either an instance of a numpy random number generator or a seed for it. |
RETURNS | DESCRIPTION |
---|---|
ValuationResult | Object with the data values. |
New in version 0.3.0
Changed in version 0.5.0
Support for parallel computation and enable antithetic sampling.
Changed in version 0.9.0
Deprecated config
argument and added a parallel_backend
argument to allow users to pass the Parallel Backend instance
directly.
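Example
A usage sketch with antithetic sampling, assuming u is a Utility as in the examples above (sample counts are illustrative):
```python
from pydvl.value.shapley.owen import OwenAlgorithm, owen_sampling_shapley

# Antithetic sampling halves the q-range and pairs each sampled set with
# its complement to reduce variance.
values = owen_sampling_shapley(
    u, n_samples=200, max_q=200, method=OwenAlgorithm.Antithetic, n_jobs=4
)
```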
Source code in src/pydvl/value/shapley/owen.py
make_criterion
¶
make_criterion(
fun: StoppingCriterionCallable,
converged: Callable[[], NDArray[bool_]] | None = None,
completion: Callable[[], float] | None = None,
name: str | None = None,
) -> Type[StoppingCriterion]
Create a new StoppingCriterion from a function. Use this to enable simpler functions to be composed with bitwise operators.
PARAMETER | DESCRIPTION |
---|---|
fun | The callable to wrap. |
converged | A callable that returns a boolean array indicating which values have converged. |
completion | A callable that returns a value between 0 and 1 indicating the rate of completion of the computation. If not provided, the fraction of converged values is used. |
name | The name of the new criterion. If not provided, the name of fun is used. |
RETURNS | DESCRIPTION |
---|---|
Type[StoppingCriterion] | A new subclass of StoppingCriterion. |
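Example
A sketch of wrapping a simple function; the criterion itself is illustrative, and Status is assumed importable from pydvl.utils.status:
```python
import numpy as np

from pydvl.utils.status import Status
from pydvl.value.result import ValuationResult
from pydvl.value.stopping import MaxChecks, make_criterion

def min_ten_updates(result: ValuationResult) -> Status:
    """Converge once every value has been updated at least ten times."""
    return Status.Converged if np.all(result.counts >= 10) else Status.Pending

MinTenUpdates = make_criterion(min_ten_updates, name="MinTenUpdates")

# Instances of the new class compose with other criteria via bitwise operators
done = MinTenUpdates() | MaxChecks(10_000)
```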