pydvl.valuation
¶
This module collects methods for data valuation mostly based on marginal utility computation, approximations thereof or other game-theoretic methods.
Additionally, it includes subset sampling schemes, dataset utilities and utility learning techniques.
RawData
dataclass
¶
A view on a dataset's raw data. This is not a copy.
GroupedDataset
¶
GroupedDataset(
x: NDArray,
y: NDArray,
data_groups: Sequence[int] | NDArray[int_],
feature_names: Sequence[str] | NDArray[str_] | None = None,
target_names: Sequence[str] | NDArray[str_] | None = None,
data_names: Sequence[str] | NDArray[str_] | None = None,
group_names: Sequence[str] | NDArray[str_] | None = None,
description: str | None = None,
**kwargs: Any,
)
Bases: Dataset
Used for calculating values of subsets of the data considered as logical units. For instance, one can group by value of a categorical feature, by bin into which a continuous feature falls, or by label.
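For instance, grouping by label can be done as in this minimal sketch (the grouping choice is hypothetical; any array of group ids of the same length as the data works):

```python
import numpy as np
from pydvl.valuation.dataset import GroupedDataset

x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
# Group the four points by their label, so that values are computed
# for the two label groups instead of for individual points.
dataset = GroupedDataset(x=x, y=y, data_groups=y.tolist())
```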
PARAMETER | DESCRIPTION
---|---
`x` | training data
`y` | labels of training data
`data_groups` | Sequence of the same length as `x` holding the group index or name for each data point
`feature_names` | names of the covariates' features
`target_names` | names of the labels or targets `y`
`data_names` | names of the data points. For example, if the dataset is a time series, each entry can be a timestamp.
`group_names` | names of the groups. If not provided, the numerical group ids from `data_groups` are used.
`description` | A textual description of the dataset
`kwargs` | Additional keyword arguments to pass to the Dataset constructor.
Changed in version 0.6.0
Added group_names
and forwarding of kwargs
Changed in version 0.10.0
No longer holds split data, but only x, y and group information. Added methods to retrieve indices for groups and vice versa.
Source code in src/pydvl/valuation/dataset.py
feature
¶
Returns a slice for the feature with the given name.
Source code in src/pydvl/valuation/dataset.py
data
¶
Returns the data and labels of all samples in the given groups.
PARAMETER | DESCRIPTION
---|---
`indices` | group indices whose elements to return. If `None`, all data from all groups are returned.

RETURNS | DESCRIPTION
---|---
`RawData` | Tuple of training data and labels for the given groups.
Source code in src/pydvl/valuation/dataset.py
data_indices
¶
data_indices(
indices: int | slice | Sequence[int] | NDArray[int_] | None = None,
) -> NDArray[int_]
Returns the indices of the samples in the given groups.
PARAMETER | DESCRIPTION
---|---
`indices` | group indices whose elements to return. If `None`, the indices for all groups are returned.

RETURNS | DESCRIPTION
---|---
`NDArray[int_]` | Indices of the samples in the given groups.
Source code in src/pydvl/valuation/dataset.py
logical_indices
¶
Returns the group indices for the given data indices.
PARAMETER | DESCRIPTION
---|---
`indices` | indices of the data points in the data array. If `None`, the group indices for all data points are returned.

RETURNS | DESCRIPTION
---|---
`NDArray[int_]` | Group indices for the given data indices.
Source code in src/pydvl/valuation/dataset.py
from_sklearn
classmethod
¶
from_sklearn(
data: Bunch,
train_size: float = 0.8,
random_state: int | None = None,
stratify_by_target: bool = False,
**kwargs,
) -> tuple[GroupedDataset, GroupedDataset]
from_sklearn(
data: Bunch,
train_size: float = 0.8,
random_state: int | None = None,
stratify_by_target: bool = False,
data_groups: Sequence[int] | None = None,
**kwargs,
) -> tuple[GroupedDataset, GroupedDataset]
from_sklearn(
data: Bunch,
train_size: int | float = 0.8,
random_state: int | None = None,
stratify_by_target: bool = False,
data_groups: Sequence[int] | None = None,
**kwargs: dict[str, Any],
) -> tuple[GroupedDataset, GroupedDataset]
Constructs a GroupedDataset object and an ungrouped Dataset object from a sklearn.utils.Bunch, as returned by the load_* functions in scikit-learn toy datasets, and groups it.
Example
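A minimal sketch (grouping by label is a hypothetical choice):

```python
from sklearn.datasets import load_iris
from pydvl.valuation.dataset import GroupedDataset

bunch = load_iris()
# Group points by their class label; any array of group ids of the
# same length as the data would do.
train, test = GroupedDataset.from_sklearn(bunch, data_groups=bunch.target.tolist())
```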
PARAMETER | DESCRIPTION
---|---
`data` | scikit-learn Bunch object. The attributes `data`, `target`, `feature_names`, `target_names` and `DESCR` are supported.
`train_size` | size of the training dataset. Used in `train_test_split`.
`random_state` | seed for train / test split.
`stratify_by_target` | If `True`, data is split in a stratified fashion, using the target variable as labels.
`data_groups` | an array holding the group index or name for each data point. The length of this array must be equal to the number of data points in the dataset.
`kwargs` | Additional keyword arguments to pass to the Dataset constructor.

RETURNS | DESCRIPTION
---|---
`tuple[GroupedDataset, GroupedDataset]` | Datasets with the selected sklearn data, split into training and test sets.
Changed in version 0.10.0
Returns a tuple of two GroupedDataset objects.
Source code in src/pydvl/valuation/dataset.py
from_arrays
classmethod
¶
from_arrays(
X: NDArray,
y: NDArray,
train_size: float = 0.8,
random_state: int | None = None,
stratify_by_target: bool = False,
**kwargs,
) -> tuple[GroupedDataset, GroupedDataset]
from_arrays(
X: NDArray,
y: NDArray,
train_size: float = 0.8,
random_state: int | None = None,
stratify_by_target: bool = False,
data_groups: Sequence[int] | None = None,
**kwargs,
) -> tuple[GroupedDataset, GroupedDataset]
from_arrays(
X: NDArray,
y: NDArray,
train_size: float = 0.8,
random_state: int | None = None,
stratify_by_target: bool = False,
data_groups: Sequence[int] | None = None,
**kwargs: Any,
) -> tuple[GroupedDataset, GroupedDataset]
Constructs a GroupedDataset object and an ungrouped Dataset object from X and y numpy arrays, as returned by the make_* functions in scikit-learn generated datasets.
Example
>>> from sklearn.datasets import make_classification
>>> from pydvl.valuation.dataset import GroupedDataset
>>> X, y = make_classification(
... n_samples=100,
... n_features=4,
... n_informative=2,
... n_redundant=0,
... random_state=0,
... shuffle=False
... )
>>> data_groups = X[:, 0] // 0.5
>>> train, test = GroupedDataset.from_arrays(X, y, data_groups=data_groups)
PARAMETER | DESCRIPTION
---|---
`X` | array of shape (n_samples, n_features)
`y` | array of shape (n_samples,)
`train_size` | size of the training dataset. Used in `train_test_split`.
`random_state` | seed for train / test split.
`stratify_by_target` | If `True`, data is split in a stratified fashion, using the target variable as labels.
`data_groups` | an array holding the group index or name for each data point. The length of this array must be equal to the number of data points in the dataset.
`kwargs` | Additional keyword arguments that will be passed to the GroupedDataset constructor.

RETURNS | DESCRIPTION
---|---
`tuple[GroupedDataset, GroupedDataset]` | Dataset with the passed X and y arrays split across training and test sets.
New in version 0.4.0
Changed in version 0.6.0
Added kwargs to pass to the GroupedDataset constructor.
Changed in version 0.10.0
Returns a tuple of two GroupedDataset objects.
Source code in src/pydvl/valuation/dataset.py
from_dataset
classmethod
¶
from_dataset(
data: Dataset,
data_groups: Sequence[int] | NDArray[int_],
group_names: Sequence[str] | NDArray[str_] | None = None,
**kwargs: Any,
) -> GroupedDataset
Creates a GroupedDataset object from a Dataset object and a mapping of data groups.
Example
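A minimal sketch, assuming RawData exposes the labels as `y` (grouping by label is a hypothetical choice):

```python
from sklearn.datasets import make_classification
from pydvl.valuation.dataset import Dataset, GroupedDataset

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
train, test = Dataset.from_arrays(X, y, random_state=0)
# Group the training points by their label.
grouped = GroupedDataset.from_dataset(train, data_groups=train.data().y.tolist())
```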
PARAMETER | DESCRIPTION
---|---
`data` | The original data.
`data_groups` | An array holding the group index or name for each data point. The length of this array must be equal to the number of data points in the dataset.
`group_names` | Names of the groups. If not provided, the numerical group ids from `data_groups` are used.
`kwargs` | Additional arguments to be passed to the GroupedDataset constructor.

RETURNS | DESCRIPTION
---|---
`GroupedDataset` | A GroupedDataset with the initial Dataset grouped by `data_groups`.
Source code in src/pydvl/valuation/dataset.py
BetaShapleyValuation
¶
BetaShapleyValuation(
utility: UtilityBase,
sampler: IndexSampler,
is_done: StoppingCriterion,
alpha: float,
beta: float,
progress: bool = False,
skip_converged: bool = False,
)
Bases: SemivalueValuation
Computes Beta-Shapley values.
PARAMETER | DESCRIPTION
---|---
`utility` | Object to compute utilities.
`sampler` | Sampling scheme to use.
`is_done` | Stopping criterion to use.
`skip_converged` | Whether to skip converged indices. Convergence is determined by the stopping criterion's `converged` array.
`alpha` | The alpha parameter of the Beta distribution.
`beta` | The beta parameter of the Beta distribution.
`progress` | Whether to show a progress bar. If a dictionary, it is passed to `tqdm` as keyword arguments.
Source code in src/pydvl/valuation/methods/beta_shapley.py
values
¶
values(sort: bool = False) -> ValuationResult
Returns a copy of the valuation result.
The valuation must have been run with fit()
before calling this method.
PARAMETER | DESCRIPTION
---|---
`sort` | Whether to sort the valuation result by value before returning it.
Returns: The result of the valuation.
Source code in src/pydvl/valuation/base.py
ClasswiseShapleyValuation
¶
ClasswiseShapleyValuation(
utility: ClasswiseModelUtility,
sampler: ClasswiseSampler,
is_done: StoppingCriterion,
progress: dict[str, Any] | bool = False,
*,
normalize_values: bool = True,
)
Bases: Valuation
Class to compute Class-wise Shapley values.
PARAMETER | DESCRIPTION
---|---
`utility` | Class-wise utility object with model and class-wise scoring function.
`sampler` | Class-wise sampling scheme to use.
`is_done` | Stopping criterion to use.
`progress` | Whether to show a progress bar.
`normalize_values` | Whether to normalize values after valuation.
Source code in src/pydvl/valuation/methods/classwise_shapley.py
values
¶
values(sort: bool = False) -> ValuationResult
Returns a copy of the valuation result.
The valuation must have been run with fit()
before calling this method.
PARAMETER | DESCRIPTION
---|---
`sort` | Whether to sort the valuation result by value before returning it.
Returns: The result of the valuation.
Source code in src/pydvl/valuation/base.py
DataBanzhafValuation
¶
DataBanzhafValuation(
utility: UtilityBase,
sampler: IndexSampler,
is_done: StoppingCriterion,
skip_converged: bool = False,
show_warnings: bool = True,
progress: dict[str, Any] | bool = False,
)
Bases: SemivalueValuation
Computes Banzhaf values.
Source code in src/pydvl/valuation/methods/semivalue.py
values
¶
values(sort: bool = False) -> ValuationResult
Returns a copy of the valuation result.
The valuation must have been run with fit()
before calling this method.
PARAMETER | DESCRIPTION
---|---
`sort` | Whether to sort the valuation result by value before returning it.
Returns: The result of the valuation.
Source code in src/pydvl/valuation/base.py
BaggingModel
¶
Bases: Protocol
Any model with the attributes `n_estimators` and `max_samples` is considered a bagging model.
fit
¶
Valuation
¶
Bases: ABC
Source code in src/pydvl/valuation/base.py
values
¶
values(sort: bool = False) -> ValuationResult
Returns a copy of the valuation result.
The valuation must have been run with fit()
before calling this method.
PARAMETER | DESCRIPTION
---|---
`sort` | Whether to sort the valuation result by value before returning it.
Returns: The result of the valuation.
Source code in src/pydvl/valuation/base.py
ValuationResult
¶
ValuationResult(
*,
values: Sequence[float64] | NDArray[float64],
variances: Sequence[float64] | NDArray[float64] | None = None,
counts: Sequence[int_] | NDArray[int_] | None = None,
indices: Sequence[IndexT] | NDArray[IndexT] | None = None,
data_names: Sequence[NameT] | NDArray[NameT] | None = None,
algorithm: str = "",
status: Status = Pending,
sort: bool | None = None,
**extra_values: Any,
)
Bases: Sequence, Iterable[ValueItem]
Objects of this class hold the results of valuation algorithms.
These include indices in the original Dataset, any data names (e.g. group names in GroupedDataset), the values themselves, and variance of the computation in the case of Monte Carlo methods. ValuationResults can be iterated over like any Sequence: iter(valuation_result) returns a generator of ValueItem in the order in which the object is sorted.
Indexing¶
Indexing can be position-based, when accessing any of the attributes values, variances, counts and indices, as well as when iterating over the object, or using the item access operator, both getter and setter. The "position" is either the original sequence in which the data was passed to the constructor, or the sequence in which the object is sorted, see below. One can retrieve the position for a given data index using the method positions().
Some methods use data indices instead. This is the case for get() and update().
Sorting¶
Results can be sorted in-place with sort(), or alternatively using python's standard sorted() and reversed().
Note that sorting values affects how iterators and the object itself as Sequence behave: values[0] returns a ValueItem with the highest or lowest ranking point if this object is sorted by descending or ascending value, respectively. If unsorted, values[0] returns the ValueItem at position 0, which has data index indices[0] in the Dataset.
The same applies to direct indexing of the ValuationResult: the index is positional, according to the sorting. It does not refer to the "data index". To sort according to data index, use sort() with key="index".
In order to access ValueItem objects by their data index, use get(), or use positions() to convert data indices to positions.
Converting back and forth between data indices and positions: data_indices = result.indices[result.positions(data_indices)] is a no-op.
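For example (a minimal sketch):

```python
import numpy as np
from pydvl.valuation.result import ValuationResult

result = ValuationResult(
    values=np.array([0.5, 0.1, 0.4]), indices=np.array([7, 8, 9])
)
result.sort()  # sort in place, ascending by value
data_indices = np.array([9, 7])
# Round trip: positions() maps data indices to positions, and indexing
# the indices property maps them back.
assert np.all(result.indices[result.positions(data_indices)] == data_indices)
```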
Operating on results¶
Results can be added to each other with the + operator. Means and variances are correctly updated, using the counts attribute.
Results can also be updated with new values using update(). Means and variances are updated accordingly using the Welford algorithm.
Empty objects behave in a special way, see empty().
PARAMETER | DESCRIPTION
---|---
`values` | An array of values. If omitted, defaults to an empty array, or to an array of zeros if `indices` are given.
`indices` | An optional array of indices in the original dataset. If omitted, defaults to `np.arange(len(values))`.
`variances` | An optional array of variances of the marginals from which the values are computed.
`counts` | An optional array with the number of updates for each value. Defaults to an array of ones.
`data_names` | Names for the data points. Defaults to index numbers if not set.
`algorithm` | The method used.
`status` | The end status of the algorithm.
`sort` | Whether to sort the indices. Defaults to `None` for no sorting.
`extra_values` | Additional values that can be passed as keyword arguments. This can contain, for example, the least core value.

RAISES | DESCRIPTION
---|---
`ValueError` | If input arrays have mismatching lengths.
Changed in 0.10.0
Changed the behaviour of the sort
argument.
Source code in src/pydvl/valuation/result.py
variances
property
¶
Variances of the marginals from which values were computed, possibly sorted.
Note that this is not the variance of the value estimate, but the sample variance of the marginals used to compute it.
indices
property
¶
indices: NDArray[IndexT]
The indices for the values, possibly sorted.
If the object is unsorted, then these are the same as declared at
construction or np.arange(len(values))
if none were passed.
names
property
¶
names: NDArray[NameT]
The names for the values, possibly sorted.
If the object is unsorted, then these are the same as declared at
construction or np.arange(len(values))
if none were passed.
sort
¶
sort(
reverse: bool = False,
key: Literal["value", "variance", "index", "name"] = "value",
) -> None
Sorts the indices in place by `key`.
Once sorted, iteration over the results, and indexing of all the properties ValuationResult.values, ValuationResult.variances, ValuationResult.stderr, ValuationResult.counts, ValuationResult.indices and ValuationResult.names will follow the same order.
PARAMETER | DESCRIPTION
---|---
`reverse` | Whether to sort in descending order by value.
`key` | The key to sort by. Defaults to ValueItem.value.
Source code in src/pydvl/valuation/result.py
positions
¶
positions(data_indices: IndexSetT | list[IndexT]) -> IndexSetT
Return the location (indices) within the ValuationResult for the given data indices.
Sorting is taken into account. This operation is the inverse of indexing the indices property.
Source code in src/pydvl/valuation/result.py
copy
¶
copy() -> ValuationResult
Returns a copy of the object.
Source code in src/pydvl/valuation/result.py
__getattr__
¶
Allows access to extra values as if they were properties of the instance.
Source code in src/pydvl/valuation/result.py
__iter__
¶
Iterate over the results returning ValueItem objects. To sort in place before iteration, use sort().
Source code in src/pydvl/valuation/result.py
__add__
¶
__add__(other: ValuationResult) -> ValuationResult
Adds two ValuationResults.
The values must have been computed with the same algorithm. An exception to this is if one argument has empty values, in which case the other argument is returned.
Warning
Abusing this will introduce numerical errors.
Means and standard errors are correctly handled. Statuses are added with bit-wise &, see Status. data_names are taken from the left summand, or if unavailable from the right one. The algorithm string is carried over if both terms have the same one, otherwise the two are concatenated.
It is possible to add ValuationResults of different lengths, and with different or overlapping indices. The result will have the union of indices, with values combined accordingly.
Warning
FIXME: Arbitrary extra_values
aren't handled.
Source code in src/pydvl/valuation/result.py
update
¶
update(data_idx: int | IndexT, new_value: float) -> ValuationResult
Updates the result in place with a new value, using running mean and variance.
The variance computation uses Bessel's correction for sample estimates of the variance.
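A minimal sketch of repeated updates (assuming a zero-initialized result):

```python
from pydvl.valuation.result import ValuationResult

result = ValuationResult.zeros(algorithm="demo", n_samples=3)
for marginal in (1.0, 2.0, 3.0):
    result.update(0, marginal)  # running mean and variance for data index 0
print(result.values[0])  # 2.0, the mean of the three marginals
```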
PARAMETER | DESCRIPTION
---|---
`data_idx` | Data index of the value to update.
`new_value` | New value to add to the result.

RETURNS | DESCRIPTION
---|---
`ValuationResult` | A reference to the same, modified result.

RAISES | DESCRIPTION
---|---
`IndexError` | If the index is not found.
Source code in src/pydvl/valuation/result.py
scale
¶
Scales the values and variances of the result by a coefficient.
PARAMETER | DESCRIPTION
---|---
`factor` | Factor to scale by.
`data_indices` | Data indices to scale. If `None`, all values are scaled.
Source code in src/pydvl/valuation/result.py
get
¶
Retrieves a ValueItem by data index, as opposed to the positional sort index used by the indexing operator.
PARAMETER | DESCRIPTION
---|---
`data_idx` | Data index of the value to retrieve.

RAISES | DESCRIPTION
---|---
`IndexError` | If the index is not found.
Source code in src/pydvl/valuation/result.py
to_dataframe
¶
Returns values as a dataframe.
PARAMETER | DESCRIPTION
---|---
`column` | Name for the column holding the data value. Defaults to the name of the algorithm used.
`use_names` | Whether to use data names instead of indices for the DataFrame's index.

RETURNS | DESCRIPTION
---|---
`DataFrame` | A dataframe with three columns: the values, their variances and their update counts.
Source code in src/pydvl/valuation/result.py
from_random
classmethod
¶
from_random(
size: int,
total: float | None = None,
seed: Seed | None = None,
**kwargs: dict[str, Any],
) -> "ValuationResult"
Creates a ValuationResult object and fills it with an array of random values from a uniform distribution in [-1,1]. The values can be made to sum up to a given total number (doing so will change their range).
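For example (a minimal sketch):

```python
import numpy as np
from pydvl.valuation.result import ValuationResult

result = ValuationResult.from_random(size=10, total=1.0, seed=42)
assert np.isclose(np.sum(result.values), 1.0)  # the "efficiency" property holds
```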
PARAMETER | DESCRIPTION
---|---
`size` | Number of values to generate
`total` | If set, the values are normalized to sum to this number ("efficiency" property of Shapley values).
`seed` | Random seed to use
`kwargs` | Additional options to pass to the constructor of ValuationResult. Use to override status, names, etc.

RETURNS | DESCRIPTION
---|---
`ValuationResult` | A valuation result with its status set to Status.Converged by default.

RAISES | DESCRIPTION
---|---
`ValueError` | If `size` is less than 1.
Changed in version 0.6.0
Added parameter total
. Check for zero size
Source code in src/pydvl/valuation/result.py
empty
classmethod
¶
empty(
algorithm: str = "",
indices: IndexSetT | None = None,
data_names: Sequence[NameT] | NDArray[NameT] | None = None,
n_samples: int = 0,
) -> ValuationResult
Creates an empty ValuationResult object.
Empty results are characterised by having an empty array of values. When another result is added to an empty one, the empty one is discarded.
PARAMETER | DESCRIPTION
---|---
`algorithm` | Name of the algorithm used to compute the values
`indices` | Optional sequence or array of indices.
`data_names` | Optional sequence or array of names for the data points. Defaults to index numbers if not set.
`n_samples` | Number of valuation result entries.

RETURNS | DESCRIPTION
---|---
`ValuationResult` | Object with the results.
Source code in src/pydvl/valuation/result.py
zeros
classmethod
¶
zeros(
algorithm: str = "",
indices: IndexSetT | None = None,
data_names: Sequence[NameT] | NDArray[NameT] | None = None,
n_samples: int = 0,
) -> ValuationResult
Creates an empty ValuationResult object.
Empty results are characterised by having an empty array of values. When another result is added to an empty one, the empty one is ignored.
PARAMETER | DESCRIPTION
---|---
`algorithm` | Name of the algorithm used to compute the values
`indices` | Data indices to use. A copy will be made. If not given, the indices will be set to the range `[0, n_samples)`.
`data_names` | Data names to use. A copy will be made. If not given, the names will be set to the string representation of the indices.
`n_samples` | Number of data points whose values are computed. If not given, the length of `indices` will be used.

RETURNS | DESCRIPTION
---|---
`ValuationResult` | Object with the results.
Source code in src/pydvl/valuation/result.py
DataOOBValuation
¶
DataOOBValuation(model: BaggingModel, score: PointwiseScore | None = None)
Bases: Valuation
Computes Data Out-Of-Bag values.
This class implements the method described in (Kwon and Zou, 2023)1.
PARAMETER | DESCRIPTION
---|---
`model` | A fitted bagging model. Bagging models in sklearn include BaggingClassifier, BaggingRegressor, IsolationForest, RandomForest, ExtraTrees, or any model which defines the attributes `n_estimators` and `max_samples`.
`score` | A callable for point-wise comparison of true values with the predictions. If `None`, point-wise accuracy is used for classifiers and the negative \(l_2\) distance for regressors.
Source code in src/pydvl/valuation/methods/data_oob.py
values
¶
values(sort: bool = False) -> ValuationResult
Returns a copy of the valuation result.
The valuation must have been run with fit()
before calling this method.
PARAMETER | DESCRIPTION
---|---
`sort` | Whether to sort the valuation result by value before returning it.
Returns: The result of the valuation.
Source code in src/pydvl/valuation/base.py
fit
¶
fit(data: Dataset) -> Self
Compute the Data-OOB values.
This requires the bagging model passed upon construction to be fitted.
PARAMETER | DESCRIPTION
---|---
`data` | Data for which to compute values

RETURNS | DESCRIPTION
---|---
`Self` | The fitted object.
Source code in src/pydvl/valuation/methods/data_oob.py
DeltaShapleyValuation
¶
DeltaShapleyValuation(
utility: UtilityBase,
is_done: StoppingCriterion,
lower_bound: int,
upper_bound: int,
seed: Seed | None = None,
skip_converged: bool = False,
progress: bool = False,
)
Bases: SemivalueValuation
Computes \(\delta\)-Shapley values.
\(\delta\)-Shapley does not accept custom samplers. Instead, it uses a StratifiedSampler with a lower and upper bound on the size of the sets to sample from.
PARAMETER | DESCRIPTION
---|---
`utility` | Object to compute utilities.
`is_done` | Stopping criterion to use.
`lower_bound` | The lower bound of the size of the subsets to sample from.
`upper_bound` | The upper bound of the size of the subsets to sample from.
`seed` | The seed for the random number generator used by the sampler.
`progress` | Whether to show a progress bar. If a dictionary, it is passed to `tqdm` as keyword arguments.
`skip_converged` | Whether to skip converged indices, as determined by the stopping criterion's `converged` array.
Source code in src/pydvl/valuation/methods/delta_shapley.py
values
¶
values(sort: bool = False) -> ValuationResult
Returns a copy of the valuation result.
The valuation must have been run with fit()
before calling this method.
PARAMETER | DESCRIPTION
---|---
`sort` | Whether to sort the valuation result by value before returning it.
Returns: The result of the valuation.
Source code in src/pydvl/valuation/base.py
GroupTestingShapleyValuation
¶
GroupTestingShapleyValuation(
utility: UtilityBase,
n_samples: int,
epsilon: float,
solver_options: dict | None = None,
progress: bool = True,
seed: Seed | None = None,
batch_size: int = 1,
)
Bases: Valuation
Class to calculate the group-testing approximation to Shapley values.
See Data valuation for an overview.
Warning
This method is very inefficient. Potential improvements to the implementation notwithstanding, convergence seems to be very slow (in terms of evaluations of the utility required). We recommend other Monte Carlo methods instead.
PARAMETER | DESCRIPTION
---|---
`utility` | Utility object with model, data and scoring function.
`n_samples` | The number of samples to use. A sample size with theoretical guarantees can be computed using compute_n_samples().
`epsilon` | The error tolerance.
`solver_options` | Optional dictionary containing a CVXPY solver and options to configure it. For valid values of the "solver" key see this tutorial. For additional options see cvxpy's documentation.
`progress` | Whether to show a progress bar during the construction of the group-testing problem.
`seed` | Seed for the random number generator.
`batch_size` | The number of samples to draw in each batch. Can be used to reduce parallelization overhead for fast utilities. Defaults to 1.
Source code in src/pydvl/valuation/methods/gt_shapley.py
values
¶
values(sort: bool = False) -> ValuationResult
Returns a copy of the valuation result.
The valuation must have been run with fit()
before calling this method.
PARAMETER | DESCRIPTION
---|---
`sort` | Whether to sort the valuation result by value before returning it.
Returns: The result of the valuation.
Source code in src/pydvl/valuation/base.py
fit
¶
fit(data: Dataset) -> Self
Calculate the group-testing valuation on a dataset.
This method has to be called before calling values().
Calculating the group-testing valuation is a computationally expensive task that can be parallelized. To do so, call the fit() method inside a joblib.parallel_config context manager as follows:
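A sketch of the pattern (the valuation and data names are placeholders):

```python
from joblib import parallel_config

with parallel_config(n_jobs=8):
    valuation.fit(training_data)
```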
Source code in src/pydvl/valuation/methods/gt_shapley.py
Status
¶
Bases: Enum
Status of a computation.
Statuses can be combined using bitwise or (|) and bitwise and (&) to get the status of a combined computation. For example, if we have two computations, one that has converged and one that has failed, then the combined status is Status.Converged | Status.Failed == Status.Converged, but Status.Converged & Status.Failed == Status.Failed.
OR¶
The result of bitwise or-ing two valuation statuses with |
is given
by the following table:
P | C | F | |
---|---|---|---|
P | P | C | P |
C | C | C | C |
F | P | C | F |
where P = Pending, C = Converged, F = Failed.
AND¶
The result of bitwise and-ing two valuation statuses with &
is given
by the following table:
P | C | F | |
---|---|---|---|
P | P | P | F |
C | P | C | F |
F | F | F | F |
where P = Pending, C = Converged, F = Failed.
NOT¶
The result of bitwise negation of a Status with ~ is Failed if the status is Converged, or Converged otherwise.
Boolean casting¶
A Status evaluates to True iff it's Converged or Failed.
Warning
These truth values are inconsistent with the usual boolean
operations. In particular the XOR of two instances of Status
is not
the same as the XOR of their boolean values.
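For example (a minimal sketch; Status is assumed importable from pydvl.utils.status):

```python
from pydvl.utils.status import Status

assert Status.Converged | Status.Failed == Status.Converged
assert Status.Converged & Status.Failed == Status.Failed
assert ~Status.Pending == Status.Converged
assert not bool(Status.Pending)  # only Converged and Failed are truthy
```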
KNNShapleyValuation
¶
KNNShapleyValuation(
model: KNeighborsClassifier,
test_data: Dataset,
progress: bool = True,
clone_before_fit: bool = True,
)
Bases: Valuation
Computes exact Shapley values for a KNN classifier.
This implements the method described in (Jia, R. et al., 2019)1. It exploits the local structure of K-Nearest Neighbours to reduce the number of calls to the utility function to a constant number per index, thus reducing computation time to \(O(n)\).
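A minimal usage sketch (assuming the class is exported from pydvl.valuation):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from pydvl.valuation import Dataset, KNNShapleyValuation

train, test = Dataset.from_sklearn(load_iris(), random_state=0)
model = KNeighborsClassifier(n_neighbors=5)
valuation = KNNShapleyValuation(model, test_data=test, progress=False)
result = valuation.fit(train).values()
```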
PARAMETER | DESCRIPTION
---|---
`model` | KNeighborsClassifier model to use for valuation
`test_data` | Dataset containing test data for valuation
`progress` | Whether to display a progress bar.
`clone_before_fit` | Whether to clone the model before fitting.
Source code in src/pydvl/valuation/methods/knn_shapley.py
values
¶
values(sort: bool = False) -> ValuationResult
Returns a copy of the valuation result.
The valuation must have been run with fit()
before calling this method.
PARAMETER | DESCRIPTION
---|---
`sort` | Whether to sort the valuation result by value before returning it.
Returns: The result of the valuation.
Source code in src/pydvl/valuation/base.py
fit
¶
fit(data: Dataset) -> Self
Calculate exact Shapley values for a KNN model on a dataset.
This fit method bypasses direct evaluations of the utility function and calculates the Shapley values directly.
In contrast to other data valuation methods, the runtime increases linearly with the size of the dataset.
Calculating the KNN valuation is a computationally expensive task that can be parallelized. To do so, call the fit() method inside a joblib.parallel_config context manager, as shown in the example for GroupTestingShapleyValuation.fit() above.
Source code in src/pydvl/valuation/methods/knn_shapley.py
LeastCoreValuation
¶
LeastCoreValuation(
utility: UtilityBase,
sampler: PowersetSampler,
n_samples: int | None = None,
non_negative_subsidy: bool = False,
solver_options: dict | None = None,
progress: bool = True,
)
Bases: Valuation
Umbrella class to calculate least-core values with multiple sampling methods.
See Data valuation for an overview.
Different samplers correspond to different least-core methods from the literature. For those, we provide convenience subclasses of LeastCoreValuation: see ExactLeastCoreValuation and MonteCarloLeastCoreValuation below.
Other samplers allow you to create your own method and might yield computational gains over a standard Monte Carlo method.
PARAMETER | DESCRIPTION
---|---
`utility` | Utility object with model, data and scoring function.
`sampler` | The sampler to use for the valuation.
`n_samples` | The number of samples to use for the valuation. If None, it will be set to the sample limit of the chosen sampler (for finite samplers), or to 1000 times the dataset size (for infinite samplers).
`non_negative_subsidy` | If True, the least core subsidy \(e\) is constrained to be non-negative.
`solver_options` | Optional dictionary containing a CVXPY solver and options to configure it. For valid values of the "solver" key see here. For additional options see here.
`progress` | Whether to show a progress bar during the construction of the least-core problem.
Source code in src/pydvl/valuation/methods/least_core.py
values
¶
values(sort: bool = False) -> ValuationResult
Returns a copy of the valuation result.
The valuation must have been run with fit()
before calling this method.
PARAMETER | DESCRIPTION
---|---
`sort` | Whether to sort the valuation result by value before returning it.
Returns: The result of the valuation.
Source code in src/pydvl/valuation/base.py
fit
¶
fit(data: Dataset) -> Self
Calculate the least core valuation on a dataset.
This method has to be called before calling values().
Calculating the least core valuation is a computationally expensive task that can be parallelized. To do so, call the fit() method inside a joblib.parallel_config context manager, as shown in the example for GroupTestingShapleyValuation.fit() above.
Source code in src/pydvl/valuation/methods/least_core.py
ExactLeastCoreValuation
¶
ExactLeastCoreValuation(
utility: UtilityBase,
non_negative_subsidy: bool = False,
solver_options: dict | None = None,
progress: bool = True,
batch_size: int = 1,
)
Bases: LeastCoreValuation
Class to calculate exact least-core values.
Equivalent to constructing a
LeastCoreValuation with a
DeterministicUniformSampler
and n_samples=None
.
The definition of the exact least-core valuation is:

$$
\begin{array}{lll}
\text{minimize} & e & \\
\text{subject to} & \sum_{i \in N} x_i = v(N) & \\
& \sum_{i \in S} x_i + e \geq v(S) & \forall S \subseteq N \\
\end{array}
$$

Where \(N = \{1, 2, \dots, n\}\) are the training set's indices.
PARAMETER | DESCRIPTION
---|---
`utility` | Utility object with model, data and scoring function.
`non_negative_subsidy` | If True, the least core subsidy \(e\) is constrained to be non-negative.
`solver_options` | Optional dictionary containing a CVXPY solver and options to configure it. For valid values of the "solver" key see here. For additional options see here.
`progress` | Whether to show a progress bar during the construction of the least-core problem.
Source code in src/pydvl/valuation/methods/least_core.py
fit
¶
fit(data: Dataset) -> Self
Calculate the least core valuation on a dataset.
This method has to be called before calling values().
Calculating the least core valuation is a computationally expensive task that can be parallelized. To do so, call the fit() method inside a joblib.parallel_config context manager, as shown in the example for GroupTestingShapleyValuation.fit() above.
Source code in src/pydvl/valuation/methods/least_core.py
values
¶
values(sort: bool = False) -> ValuationResult
Returns a copy of the valuation result.
The valuation must have been run with fit()
before calling this method.
PARAMETER | DESCRIPTION
---|---
`sort` | Whether to sort the valuation result by value before returning it.
Returns: The result of the valuation.
Source code in src/pydvl/valuation/base.py
MonteCarloLeastCoreValuation
¶
MonteCarloLeastCoreValuation(
utility: UtilityBase,
n_samples: int,
non_negative_subsidy: bool = False,
solver_options: dict | None = None,
progress: bool = True,
seed: Seed | None = None,
batch_size: int = 1,
)
Bases: LeastCoreValuation
Class to calculate Monte Carlo approximations to least-core values.
Equivalent to calling LeastCoreValuation
with a UniformSampler
.
The definition of the Monte Carlo least-core valuation is:

$$
\begin{array}{lll}
\text{minimize} & e & \\
\text{subject to} & \sum_{i \in N} x_i = v(N) & \\
& \sum_{i \in S_j} x_i + e \geq v(S_j) & \forall S_j \sim U(2^N), \ j = 1, \dots, m \\
\end{array}
$$

Where:
- \(U(2^N)\) is the uniform distribution over the powerset of \(N\).
- \(m\) is the number of subsets that will be sampled and whose utility will be computed and used to compute the data values.
PARAMETER | DESCRIPTION
---|---
`utility` | Utility object with model, data and scoring function.
`n_samples` | The number of samples to use for the valuation. If None, it will be set to 1000 times the dataset size.
`non_negative_subsidy` | If True, the least core subsidy \(e\) is constrained to be non-negative.
`solver_options` | Optional dictionary containing a CVXPY solver and options to configure it. For valid values of the "solver" key see here. For additional options see here.
`progress` | Whether to show a progress bar during the construction of the least-core problem.
Source code in src/pydvl/valuation/methods/least_core.py
fit
¶
fit(data: Dataset) -> Self
Calculate the least core valuation on a dataset.
This method has to be called before calling values().
Calculating the least core valuation is a computationally expensive task that can be parallelized. To do so, call the fit() method inside a joblib.parallel_config context manager, as shown in the example for GroupTestingShapleyValuation.fit() above.
Source code in src/pydvl/valuation/methods/least_core.py
values
¶
values(sort: bool = False) -> ValuationResult
Returns a copy of the valuation result.
The valuation must have been run with fit()
before calling this method.
PARAMETER | DESCRIPTION
---|---
`sort` | Whether to sort the valuation result by value before returning it.
Returns: The result of the valuation.
Source code in src/pydvl/valuation/base.py
LOOValuation
¶
LOOValuation(utility: UtilityBase, progress: bool = False)
Bases: SemivalueValuation
Computes LOO values for a dataset.
Source code in src/pydvl/valuation/methods/loo.py
values
¶
values(sort: bool = False) -> ValuationResult
Returns a copy of the valuation result.
The valuation must have been run with fit()
before calling this method.
PARAMETER | DESCRIPTION
---|---
`sort` | Whether to sort the valuation result by value before returning it.
Returns: The result of the valuation.
Source code in src/pydvl/valuation/base.py
log_coefficient
¶
The LOOSampler returns only complements of {idx}, so the weight is either 1/n (the probability of a set of size n-1) or 0 if k != n-1. We cancel this out here so that the final coefficient is either 1 if k == n-1 or 0 otherwise.
Source code in src/pydvl/valuation/methods/loo.py
SemivalueValuation
¶
SemivalueValuation(
utility: UtilityBase,
sampler: IndexSampler,
is_done: StoppingCriterion,
skip_converged: bool = False,
show_warnings: bool = True,
progress: dict[str, Any] | bool = False,
)
Bases: Valuation
Abstract class to define semi-values.
Implementations must only provide the log_coefficient()
method, corresponding
to the semi-value coefficient.
Note
For implementation consistency, we slightly depart from the common definition of semi-values, which includes a factor \(1/n\) in the sum over subsets. Instead, we subsume this factor into the coefficient \(w(k)\).
PARAMETER | DESCRIPTION
---|---
`utility` | Object to compute utilities.
`sampler` | Sampling scheme to use.
`is_done` | Stopping criterion to use.
`skip_converged` | Whether to skip converged indices, as determined by the stopping criterion's `converged` array.
`show_warnings` | Whether to show warnings.
`progress` | Whether to show a progress bar. If a dictionary, it is passed to `tqdm` as keyword arguments.
Source code in src/pydvl/valuation/methods/semivalue.py
values
¶
values(sort: bool = False) -> ValuationResult
Returns a copy of the valuation result.
The valuation must have been run with fit()
before calling this method.
PARAMETER | DESCRIPTION
---|---
`sort` | Whether to sort the valuation result by value before returning it.
Returns: The result of the valuation.
Source code in src/pydvl/valuation/base.py
log_coefficient
abstractmethod
¶
The semi-value coefficient in log-space.
The semi-value coefficient is a function of the number of elements in the set, and the size of the subset for which the coefficient is being computed. Because both coefficients and sampler weights can be very large or very small, we perform all computations in log-space to avoid numerical issues.
PARAMETER | DESCRIPTION
---|---
`n` | Total number of elements in the set.
`k` | Size of the subset for which the coefficient is being computed

RETURNS | DESCRIPTION
---|---
`float` | The natural logarithm of the semi-value coefficient.
Source code in src/pydvl/valuation/methods/semivalue.py
ShapleyValuation
¶
ShapleyValuation(
utility: UtilityBase,
sampler: IndexSampler,
is_done: StoppingCriterion,
skip_converged: bool = False,
show_warnings: bool = True,
progress: dict[str, Any] | bool = False,
)
Bases: SemivalueValuation
Computes Shapley values.
Source code in src/pydvl/valuation/methods/semivalue.py
values
¶
values(sort: bool = False) -> ValuationResult
Returns a copy of the valuation result.
The valuation must have been run with fit()
before calling this method.
PARAMETER | DESCRIPTION
---|---
`sort` | Whether to sort the valuation result by value before returning it.
Returns: The result of the valuation.
Source code in src/pydvl/valuation/base.py
ResultUpdater
¶
ResultUpdater(result: ValuationResult)
IndexSampler
¶
IndexSampler(batch_size: int = 1)
Bases: ABC
, Generic[ValueUpdateT]
Samplers are custom iterables over batches of subsets of indices.
Calling from_indices(indexset)
on a sampler returns a generator over batches
of Samples
. A Sample is a tuple of the form
\((i, S)\), where \(i\) is an index of interest, and \(S \subset I \setminus \{i\}\) is a
subset of the complement of \(i\) in \(I\).
Note
Samplers are not iterators themselves, so that each call to
from_indices(data)
e.g. in a new for loop creates a new iterator.
Derived samplers must implement log_weight() and generate(). See the module's documentation for more on these.
Interrupting samplers¶
Calling interrupt() on a sampler will stop the batched generator after the current batch has been yielded.
PARAMETER | DESCRIPTION
---|---
`batch_size` | The number of samples to generate per batch. Batches are processed by EvaluationStrategy so that individual valuations in a batch are guaranteed to be received in the right sequence.
Source code in src/pydvl/valuation/samplers/base.py
skip_indices
property
writable
¶
Indices being skipped in the sampler. The exact behaviour will be sampler-dependent, so that setting this property is disabled by default.
interrupt
¶
__len__
¶
__len__() -> int
Returns the length of the current sample generation in generate_batches.
RAISES | DESCRIPTION
---|---
`TypeError` | if the sampler is infinite or generate_batches has not been called yet.
Source code in src/pydvl/valuation/samplers/base.py
generate_batches
¶
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/base.py
sample_limit
abstractmethod
¶
sample_limit(indices: IndexSetT) -> int | None
Number of samples that can be generated from the indices.
PARAMETER | DESCRIPTION
---|---
`indices` | The indices used in the sampler.

RETURNS | DESCRIPTION
---|---
`int | None` | The maximum number of samples that will be generated, or `None` if the sampler is infinite.
Source code in src/pydvl/valuation/samplers/base.py
generate
abstractmethod
¶
Generates single samples.
IndexSampler.generate_batches()
will batch these samples according to the
batch size set upon construction.
PARAMETER | DESCRIPTION
---|---
`indices` | The indices to sample from.

YIELDS | DESCRIPTION
---|---
`SampleGenerator` | A tuple (idx, subset) for each sample.
Source code in src/pydvl/valuation/samplers/base.py
log_weight
abstractmethod
¶
Factor by which to multiply Monte Carlo samples, so that the mean converges to the desired expression.
Log-space computation
Because the weight is a probability that can be arbitrarily small, we compute it in log-space for numerical stability.
By the Law of Large Numbers, the sample mean of \(f(S_j)\) converges to the expectation under the distribution from which \(S_j\) is sampled.
We add the factor \(w(S_j)\) in order to have this expectation coincide with the desired expression, by cancelling out \(\mathbb{P} (S)\).
PARAMETER | DESCRIPTION
---|---
`n` | The size of the index set. Note that the actual size of the set being sampled will often be n-1, as one index might be removed from the set. See IndexIteration for more.
`subset_len` | The size of the subset being sampled

RETURNS | DESCRIPTION
---|---
`float` | The natural logarithm of the probability of sampling a set of the given size, when the index set has size `n`.
Source code in src/pydvl/valuation/samplers/base.py
make_strategy
abstractmethod
¶
make_strategy(
utility: UtilityBase,
log_coefficient: Callable[[int, int], float] | None = None,
) -> EvaluationStrategy
Returns the strategy for this sampler.
Source code in src/pydvl/valuation/samplers/base.py
result_updater
¶
result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]
Returns a callable that updates a valuation result with a value update.
Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.
PARAMETER | DESCRIPTION
---|---
`result` | The result to update
Returns: A callable object that updates the result with a value update
Source code in src/pydvl/valuation/samplers/base.py
EvaluationStrategy
¶
EvaluationStrategy(
sampler: SamplerT,
utility: UtilityBase,
log_coefficient: Callable[[int, int], float] | None = None,
)
Bases: ABC
, Generic[SamplerT, ValueUpdateT]
An evaluation strategy for samplers.
Implements the processing strategy for batches returned by an IndexSampler.
Different sampling schemes require different strategies for the evaluation of the utilities. For instance permutations generated by PermutationSampler must be evaluated in sequence to save computation, see PermutationEvaluationStrategy.
This class defines the common interface.
Usage pattern in valuation methods
```python
def fit(self, data: Dataset):
    self.utility = self.utility.with_dataset(data)
    strategy = self.sampler.strategy(self.utility, self.log_coefficient)
    delayed_batches = Parallel()(
        delayed(strategy.process)(batch=list(batch), is_interrupted=flag)
        for batch in self.sampler
    )
    for batch in delayed_batches:
        for evaluation in batch:
            self.result.update(evaluation.idx, evaluation.update)
        if self.is_done(self.result):
            flag.set()
            break
```
PARAMETER | DESCRIPTION
---|---
`sampler` | Required to set up some strategies.
`utility` | Required to set up some strategies and to process the samples. Since this contains the training data, it is expensive to pickle and send to workers.
`log_coefficient` | An additional coefficient to multiply marginals with. This depends on the valuation method, hence the delayed setup.
Source code in src/pydvl/valuation/samplers/base.py
process
abstractmethod
¶
process(
batch: SampleBatch, is_interrupted: NullaryPredicate
) -> list[ValueUpdateT]
Processes batches of samples using the evaluator, with the strategy required for the sampler.
Warning
This method is intended to be used by the evaluator to process the samples in one batch, which means it might be sent to another process. Be careful with the objects you use here, as they will be pickled and sent over the wire.
PARAMETER | DESCRIPTION
---|---
`batch` | A batch of samples to process.
`is_interrupted` | A predicate that returns True if the processing should be interrupted.

YIELDS | DESCRIPTION
---|---
`list[ValueUpdateT]` | Updates to values as tuples (idx, update)
Source code in src/pydvl/valuation/samplers/base.py
ClasswiseSampler
¶
ClasswiseSampler(
in_class: IndexSampler,
out_of_class: PowersetSampler,
*,
min_elements_per_label: int = 1,
batch_size: int = 1,
)
Bases: IndexSampler
A sampler that samples elements from a dataset in two steps, based on the labels.
It proceeds by sampling out-of-class indices (training points with a different label to the point of interest), and in-class indices (training points with the same label as the point of interest), in the complement.
Used by the class-wise Shapley valuation method.
PARAMETER | DESCRIPTION
---|---
`in_class` | Sampling scheme for elements of a given label.
`out_of_class` | Sampling scheme for elements of different labels, i.e., the complement set.
`min_elements_per_label` | Minimum number of elements per label to sample from the complement set, i.e., out of class elements.
Source code in src/pydvl/valuation/samplers/classwise.py
skip_indices
property
writable
¶
Indices being skipped in the sampler. The exact behaviour will be sampler-dependent, so that setting this property is disabled by default.
__len__
¶
__len__() -> int
Returns the length of the current sample generation in generate_batches.
RAISES | DESCRIPTION
---|---
`TypeError` | if the sampler is infinite or generate_batches has not been called yet.
Source code in src/pydvl/valuation/samplers/base.py
generate_batches
¶
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/base.py
result_updater
¶
result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]
Returns a callable that updates a valuation result with a value update.
Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.
PARAMETER | DESCRIPTION
---|---
`result` | The result to update
Returns: A callable object that updates the result with a value update
Source code in src/pydvl/valuation/samplers/base.py
interrupt
¶
Interrupts the current sampler, as well as the passed-in samplers.
MSRSampler
¶
MSRSampler(batch_size: int = 1, seed: Seed | None = None)
Bases: StochasticSamplerMixin
, IndexSampler[MSRValueUpdate]
Sampler for unweighted Maximum Sample Re-use (MSR) valuation.
The sampling is similar to a UniformSampler but without an outer index. However, the MSR sampler uses a special evaluation strategy and result updater, as returned by the make_strategy() and result_updater() methods, respectively.
Two running means are updated separately for positive and negative updates. The two running means are later combined into a final result.
PARAMETER | DESCRIPTION
---|---
`batch_size` | Number of samples to generate in each batch.
`seed` | Seed for the random number generator.
Source code in src/pydvl/valuation/samplers/msr.py
skip_indices
property
writable
¶
Indices being skipped in the sampler. The exact behaviour will be sampler-dependent, so that setting this property is disabled by default.
interrupt
¶
__len__
¶
__len__() -> int
Returns the length of the current sample generation in generate_batches.
RAISES | DESCRIPTION
---|---
`TypeError` | if the sampler is infinite or generate_batches has not been called yet.
Source code in src/pydvl/valuation/samplers/base.py
generate_batches
¶
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/base.py
log_weight
¶
Probability of sampling a set of size k.
In the MSR scheme, the sampling is done from the full power set \(2^N\) (each set \(S \subseteq N\) with probability \(1 / 2^n\)), and then for each data point \(i\) one partitions the sample into:

* \(\mathcal{S}_{\ni i} = \{S \in \mathcal{S}: i \in S\},\) and
* \(\mathcal{S}_{\not\ni i} = \{S \in \mathcal{S}: i \notin S\}.\)

When we condition on the event \(i \in S\), the remaining part \(S_{-i}\) is uniformly distributed over \(2^{N_{-i}}\). In other words, the act of partitioning recovers the uniform distribution on \(2^{N_{-i}}\) "for free" because

$$\mathbb{P}(S_{-i} = T \mid i \in S) = \frac{\mathbb{P}(S = T \cup \{i\})}{\mathbb{P}(i \in S)} = \frac{2^{-n}}{2^{-1}} = \frac{1}{2^{n-1}},$$

for each \(T \subseteq N_{-i}\).
PARAMETER | DESCRIPTION
---|---
`n` | Size of the index set.
`subset_len` | Size of the subset.

RETURNS | DESCRIPTION
---|---
`float` | The logarithm of the probability of having sampled a set of size `subset_len`.
Source code in src/pydvl/valuation/samplers/msr.py
make_strategy
¶
make_strategy(
utility: UtilityBase, coefficient: Callable[[int, int], float] | None = None
) -> MSREvaluationStrategy
Returns the strategy for this sampler.
PARAMETER | DESCRIPTION
---|---
`utility` | Utility function to evaluate.
`coefficient` | Coefficient function for the utility function.
Source code in src/pydvl/valuation/samplers/msr.py
result_updater
¶
result_updater(result: ValuationResult) -> ResultUpdater
Returns a callable that updates a valuation result with an MSR value update.
MSR updates two running means for positive and negative updates separately. The two running means are later combined into a final result.
PARAMETER | DESCRIPTION |
---|---|
result
|
The valuation result to update with each call of the returned callable.
TYPE:
|
Returns: A callable object that updates the valuation result with every MSRValueUpdate.
Source code in src/pydvl/valuation/samplers/msr.py
UniformOwenStrategy
¶
UniformOwenStrategy(n_samples_outer: int, seed: Seed | None = None)
Bases: OwenStrategy
A strategy for OwenSampler to sample probability values uniformly between 0 and \(q_{\text{stop}}\).
PARAMETER | DESCRIPTION
---|---
`n_samples_outer` | The number of probability values \(q\) used for the outer loop. Since samples are taken anew for each index, a high number will delay updating new indices and has no effect on the final accuracy if using an infinite index iteration. In general, it only makes sense to change this number if using a finite index iteration.
`seed` | The seed for the random number generator.
Source code in src/pydvl/valuation/samplers/owen.py
GridOwenStrategy
¶
GridOwenStrategy(n_samples_outer: int)
Bases: OwenStrategy
A strategy for OwenSampler to sample probability values on a linear grid.
PARAMETER | DESCRIPTION
---|---
`n_samples_outer` | The number of probability values \(q\) used for the outer loop. These will be linearly spaced between 0 and \(q_{\text{stop}}\).
Source code in src/pydvl/valuation/samplers/owen.py
OwenSampler
¶
OwenSampler(
outer_sampling_strategy: OwenStrategy,
n_samples_inner: int = 2,
batch_size: int = 1,
index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
seed: Seed | None = None,
)
Bases: StochasticSamplerMixin
, PowersetSampler
A sampler for semi-values using the Owen method.
For each index \(i\) we sample n_samples_outer probability values \(q_j\) between 0 and 1 and then, for each \(j\), we draw n_samples_inner subsets of the complement of the current index, where each element is sampled with probability \(q_j\).
The distribution for the outer sampling can be either uniform or deterministic. The default is deterministic on a grid, which is the original method described in Okhrati and Lipani (2021)1. This can be achieved by using the GridOwenStrategy strategy.
Alternatively, the distribution can be uniform between 0 and 1. This can be achieved by using the UniformOwenStrategy strategy.
By combining a UniformOwenStrategy with an infinite IndexIteration strategy, this sampler can be used with a stopping criterion to estimate semi-values. This follows more closely the typical usage pattern in PyDVL than the original sampling method described in Okhrati and Lipani (2021)1.
Example usage
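A minimal sketch of constructing the sampler (the import path is an assumption):

```python
from pydvl.valuation.samplers import GridOwenStrategy, OwenSampler

# The original method: a deterministic grid of 200 outer probabilities,
# with two inner subsets drawn per probability value.
sampler = OwenSampler(
    outer_sampling_strategy=GridOwenStrategy(n_samples_outer=200),
    n_samples_inner=2,
)
```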
PARAMETER | DESCRIPTION
---|---
`n_samples_inner` | The number of samples drawn for each probability. In the original paper this was fixed to 2 for all experiments.
`batch_size` | The batch size of the sampler.
`index_iteration` | The index iteration strategy, sequential or random, finite or infinite.
`seed` | The seed for the random number generator.
Source code in src/pydvl/valuation/samplers/owen.py
interrupt
¶
__len__
¶
__len__() -> int
Returns the length of the current sample generation in generate_batches.
RAISES | DESCRIPTION
---|---
`TypeError` | if the sampler is infinite or generate_batches has not been called yet.
Source code in src/pydvl/valuation/samplers/base.py
generate_batches
¶
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/base.py
result_updater
¶
result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]
Returns a callable that updates a valuation result with a value update.
Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.
PARAMETER | DESCRIPTION
---|---
`result` | The result to update
Returns: A callable object that updates the result with a value update
Source code in src/pydvl/valuation/samplers/base.py
index_iterator
¶
index_iterator(indices: IndexSetT) -> Generator[IndexT | None, None, None]
Iterates over indices with the method specified at construction.
Source code in src/pydvl/valuation/samplers/powerset.py
log_weight
¶
For each \(q_j, j \in \{1, ..., N\}\) in the outer probabilities, the probability of drawing a subset \(S_k\) of size \(k\) is:

$$ P (| S_{q_j} | = k) = \binom{n}{k} \ q_j^k (1 - q_j)^{n - k}. $$

So, if each \(q_j\) is chosen with equal weight (or more generally with probability \(p_j\)), then by total probability, the overall probability of obtaining a subset of size \(k\) is a mixture of the binomials: $$ P (| S | = k) = \sum_{j = 1}^N p_j \ \binom{n}{k} \ q_j^k (1 - q_j)^{n - k}. $$

In our case \(p_j = 1/N\), so that \(P(|S|=k) = \frac{1}{N} \sum_{j=1}^N P (| S_{q_j} | = k)\). For large enough \(N\) this is

$$ P (| S | = k) \approx \int_0^1 \binom{n}{k} \ q^k (1 - q)^{n - k} \ dq = \frac{1}{n + 1}, $$

where we computed the integral using the beta function and its expression as products of gamma functions.

Now, given the symmetry wrt. the indices in the sampling procedure, any given set \(S\) of size \(k\) is equally likely to be drawn. So the probability of a set being of size \(k\) must be equally divided by the number of sets of that size, and the weight of a set of size \(k\) is:

$$ P (S) = \frac{1}{n + 1} \binom{n}{k}^{-1}. $$
| PARAMETER | DESCRIPTION |
|---|---|
| `n` | Size of the index set. TYPE: `int` |
| `subset_len` | Size of the subset. TYPE: `int` |
Returns: The logarithm of the weight of a subset of size `subset_len`.
Source code in src/pydvl/valuation/samplers/owen.py
sample_limit
¶
sample_limit(indices: IndexSetT) -> int | None
The number of samples that will be generated by the sampler.
| PARAMETER | DESCRIPTION |
|---|---|
| `indices` | The indices used in the sampler. TYPE: `IndexSetT` |

| RETURNS | DESCRIPTION |
|---|---|
| `int \| None` | 0 if there are no indices, `None` if the sampler is infinite, or the total number of samples otherwise. |
Source code in src/pydvl/valuation/samplers/owen.py
AntitheticOwenSampler
¶
AntitheticOwenSampler(
outer_sampling_strategy: OwenStrategy,
n_samples_inner: int = 2,
batch_size: int = 1,
index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
seed: Seed | None = None,
)
Bases: OwenSampler
A sampler for antithetic Owen Shapley values.
For each sample obtained with the method of OwenSampler, a second sample is generated by taking the complement of the first sample.
For the same total number of samples, the antithetic Owen sampler usually yields more precise estimates of Shapley values than the regular Owen sampler.
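For instance (a sketch; the `n_samples_outer` argument of UniformOwenStrategy is an assumption, the remaining arguments are those of the signature above):
```python
from pydvl.valuation.samplers import AntitheticOwenSampler, UniformOwenStrategy

# Every Owen sample (i, S) is followed by its antithetic pair (i, S^c),
# so utility evaluations come in negatively correlated pairs.
sampler = AntitheticOwenSampler(
    outer_sampling_strategy=UniformOwenStrategy(n_samples_outer=100),
    n_samples_inner=2,
    seed=42,
)
```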
Source code in src/pydvl/valuation/samplers/owen.py
interrupt
¶
__len__
¶
__len__() -> int
Returns the length of the current sample generation in generate_batches.
| RAISES | DESCRIPTION |
|---|---|
| `TypeError` | If the sampler is infinite or `generate_batches` has not been called yet. |
Source code in src/pydvl/valuation/samplers/base.py
generate_batches
¶
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/base.py
log_weight
¶
For each \(q_j\), \(j \in \{1, ..., N\}\), in the outer probabilities, the probability of drawing a subset \(S_{q_j}\) of size \(k\) is:
$$ P(|S_{q_j}| = k) = \binom{n}{k} \ q_j^k (1 - q_j)^{n - k}. $$
So, if each \(q_j\) is chosen with equal weight (or more generally with probability \(p_j\)), then by total probability, the overall probability of obtaining a subset of size \(k\) is a mixture of the binomials: $$ P (| S | = k) = \sum_{j = 1}^N p_j \ \binom{n}{k} \ q_j^k (1 - q_j)^{n - k}. $$
In our case \(p_j = 1/N\), so that \(P(|S|=k) = \frac{1}{N} \sum_{j=1}^N P (| S_{q_j} | = k)\). For large enough \(N\) this is approximately
$$ P(|S|=k) \approx \int_0^1 \binom{n}{k} \ q^k (1 - q)^{n - k} \, dq = \frac{1}{n+1}, $$
where we computed the integral using the beta function and its expression as products of gamma functions.
Now, given the symmetry wrt. the indices in the sampling procedure, any given set \(S\) of size \(k\) is equally likely to be drawn. So the probability of a set being of size \(k\) must be equally divided by the number of sets of that size, and the weight of a set of size \(k\) is:
$$ P(S) = \frac{1}{n+1} \binom{n}{k}^{-1}. $$
| PARAMETER | DESCRIPTION |
|---|---|
| `n` | Size of the index set. TYPE: `int` |
| `subset_len` | Size of the subset. TYPE: `int` |
Returns: The logarithm of the weight of a subset of size `subset_len`.
Source code in src/pydvl/valuation/samplers/owen.py
result_updater
¶
result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]
Returns a callable that updates a valuation result with a value update.
Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.
| PARAMETER | DESCRIPTION |
|---|---|
| `result` | The result to update. TYPE: `ValuationResult` |
Returns: A callable object that updates the result with a value update
Source code in src/pydvl/valuation/samplers/base.py
index_iterator
¶
index_iterator(indices: IndexSetT) -> Generator[IndexT | None, None, None]
Iterates over indices with the method specified at construction.
Source code in src/pydvl/valuation/samplers/powerset.py
PermutationSampler
¶
PermutationSampler(
truncation: TruncationPolicy | None = None,
seed: Seed | None = None,
batch_size: int = 1,
)
Bases: StochasticSamplerMixin, PermutationSamplerBase
Samples permutations of indices.
Batching
Even though this sampler supports batching, it is not recommended, since the PermutationEvaluationStrategy processes whole permutations in one go, effectively batching the computation of up to n-1 marginal utilities in one process.
| PARAMETER | DESCRIPTION |
|---|---|
| `truncation` | A policy to stop the permutation early. TYPE: `TruncationPolicy \| None` |
| `seed` | Seed for the random number generator. TYPE: `Seed \| None` |
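A construction sketch (NoTruncation, documented further below, disables early stopping within permutations):
```python
from pydvl.valuation.samplers import PermutationSampler
from pydvl.valuation.samplers.truncation import NoTruncation

# batch_size is left at its default of 1: whole permutations are already
# processed in one go by PermutationEvaluationStrategy (see note above).
sampler = PermutationSampler(truncation=NoTruncation(), seed=42)
```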
Source code in src/pydvl/valuation/samplers/permutation.py
interrupt
¶
__len__
¶
__len__() -> int
Returns the length of the current sample generation in generate_batches.
| RAISES | DESCRIPTION |
|---|---|
| `TypeError` | If the sampler is infinite or `generate_batches` has not been called yet. |
Source code in src/pydvl/valuation/samplers/base.py
generate_batches
¶
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/base.py
result_updater
¶
result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]
Returns a callable that updates a valuation result with a value update.
Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.
| PARAMETER | DESCRIPTION |
|---|---|
| `result` | The result to update. TYPE: `ValuationResult` |
Returns: A callable object that updates the result with a value update
Source code in src/pydvl/valuation/samplers/base.py
generate
¶
Generates the permutation samples.
| PARAMETER | DESCRIPTION |
|---|---|
| `indices` | The indices to sample from. If empty, no samples are generated. If `skip_indices` is set, these indices are removed from the set before generating the permutation. |
Source code in src/pydvl/valuation/samplers/permutation.py
AntitheticPermutationSampler
¶
AntitheticPermutationSampler(
truncation: TruncationPolicy | None = None,
seed: Seed | None = None,
batch_size: int = 1,
)
Bases: PermutationSampler
Samples permutations like PermutationSampler, but after each permutation, it returns the same permutation in reverse order.
This sampler was suggested in (Mitchell et al. 2022)1
New in version 0.7.1
Source code in src/pydvl/valuation/samplers/permutation.py
interrupt
¶
__len__
¶
__len__() -> int
Returns the length of the current sample generation in generate_batches.
| RAISES | DESCRIPTION |
|---|---|
| `TypeError` | If the sampler is infinite or `generate_batches` has not been called yet. |
Source code in src/pydvl/valuation/samplers/base.py
generate_batches
¶
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/base.py
result_updater
¶
result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]
Returns a callable that updates a valuation result with a value update.
Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.
| PARAMETER | DESCRIPTION |
|---|---|
| `result` | The result to update. TYPE: `ValuationResult` |
Returns: A callable object that updates the result with a value update
Source code in src/pydvl/valuation/samplers/base.py
DeterministicPermutationSampler
¶
DeterministicPermutationSampler(
*args,
truncation: TruncationPolicy | None = None,
batch_size: int = 1,
**kwargs,
)
Bases: PermutationSamplerBase
Samples all n! permutations of the indices deterministically, and iterates through them, returning sets as required for the permutation-based definition of semi-values.
Source code in src/pydvl/valuation/samplers/permutation.py
skip_indices
property
writable
¶
Indices being skipped in the sampler. The exact behaviour is sampler-dependent, so setting this property is disabled by default.
interrupt
¶
__len__
¶
__len__() -> int
Returns the length of the current sample generation in generate_batches.
| RAISES | DESCRIPTION |
|---|---|
| `TypeError` | If the sampler is infinite or `generate_batches` has not been called yet. |
Source code in src/pydvl/valuation/samplers/base.py
generate_batches
¶
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/base.py
result_updater
¶
result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]
Returns a callable that updates a valuation result with a value update.
Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.
| PARAMETER | DESCRIPTION |
|---|---|
| `result` | The result to update. TYPE: `ValuationResult` |
Returns: A callable object that updates the result with a value update
Source code in src/pydvl/valuation/samplers/base.py
PermutationEvaluationStrategy
¶
PermutationEvaluationStrategy(
sampler: PermutationSamplerBase,
utility: UtilityBase,
coefficient: Callable[[int, int], float] | None = None,
)
Bases: EvaluationStrategy[PermutationSamplerBase, ValueUpdate]
Computes marginal values for permutation sampling schemes in log-space.
This strategy iterates over permutations from left to right, computing the marginal utility wrt. the previous one at each step to save computation.
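Schematically, the left-to-right pass computes all marginals of one permutation with a single utility evaluation per element (an illustrative sketch, not pyDVL's internals; `u` is a hypothetical utility on lists of indices):
```python
def marginals_along(permutation: list[int], u) -> dict[int, float]:
    marginals: dict[int, float] = {}
    coalition: list[int] = []
    prev = u(coalition)  # utility of the empty coalition
    for idx in permutation:
        coalition.append(idx)
        curr = u(coalition)
        marginals[idx] = curr - prev  # marginal contribution of idx
        prev = curr
    return marginals
```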
Source code in src/pydvl/valuation/samplers/permutation.py
IndexIteration
¶
Bases: ABC
Source code in src/pydvl/valuation/samplers/powerset.py
length
abstractmethod
staticmethod
¶
Returns the length of the iteration over the index set.

| PARAMETER | DESCRIPTION |
|---|---|
| `n_indices` | The number of indices in the set. TYPE: `int` |

| RETURNS | DESCRIPTION |
|---|---|
| `int \| None` | The length of the iteration: a non-negative integer if the iteration is finite, or `None` if it is infinite. |
Source code in src/pydvl/valuation/samplers/powerset.py
complement_size
abstractmethod
staticmethod
¶
Returns the size of complements of sets of size n, with respect to the indices returned by the iteration.
If the iteration returns single indices, this is n-1; if it returns no indices, it is n; if it returned tuples, it would be n-2, and so on.
Source code in src/pydvl/valuation/samplers/powerset.py
SequentialIndexIteration
¶
Bases: InfiniteIterationMixin, IndexIteration
Samples indices sequentially, indefinitely.
Source code in src/pydvl/valuation/samplers/powerset.py
FiniteSequentialIndexIteration
¶
Bases: FiniteIterationMixin, SequentialIndexIteration
Samples indices sequentially, once.
Source code in src/pydvl/valuation/samplers/powerset.py
RandomIndexIteration
¶
RandomIndexIteration(indices: NDArray[IndexT], seed: Seed)
Bases: InfiniteIterationMixin, StochasticSamplerMixin, IndexIteration
Samples indices at random, indefinitely
Source code in src/pydvl/valuation/samplers/powerset.py
FiniteRandomIndexIteration
¶
FiniteRandomIndexIteration(indices: NDArray[IndexT], seed: Seed)
Bases: FiniteIterationMixin, RandomIndexIteration
Samples indices at random, once
Source code in src/pydvl/valuation/samplers/powerset.py
NoIndexIteration
¶
Bases: InfiniteIterationMixin, IndexIteration
An infinite iteration over no indices.
Source code in src/pydvl/valuation/samplers/powerset.py
FiniteNoIndexIteration
¶
Bases: FiniteIterationMixin, NoIndexIteration
A finite iteration over no indices. The iterator will yield None once and then stop.
Source code in src/pydvl/valuation/samplers/powerset.py
length
staticmethod
¶
PowersetSampler
¶
PowersetSampler(
batch_size: int = 1,
index_iteration: Type[IndexIteration] = SequentialIndexIteration,
)
Bases: IndexSampler, ABC
An abstract class for samplers which iterate over the powerset of the complement of an index in the training set.
This is done in two nested loops, where the outer loop iterates over the set of indices, and the inner loop iterates over subsets of the complement of the current index. The outer iteration can be either sequential or at random.
| PARAMETER | DESCRIPTION |
|---|---|
| `batch_size` | The number of samples to generate per batch. Batches are processed together by [UtilityEvaluator][pydvl.valuation.utility.evaluator.UtilityEvaluator]. TYPE: `int` |
| `index_iteration` | The strategy to use for iterating over indices to update. TYPE: `Type[IndexIteration]` |
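Schematically, the two nested loops look as follows (illustrative pseudocode; `subsets_of` and `Sample` stand in for the sampler-specific generation and pyDVL's sample type):
```python
def generate(self, indices):
    for i in self.index_iterator(indices):          # outer loop over indices
        complement = [j for j in indices if j != i]
        for subset in subsets_of(complement):       # inner, sampler-specific loop
            yield Sample(i, subset)
```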
Source code in src/pydvl/valuation/samplers/powerset.py
interrupt
¶
__len__
¶
__len__() -> int
Returns the length of the current sample generation in generate_batches.
| RAISES | DESCRIPTION |
|---|---|
| `TypeError` | If the sampler is infinite or `generate_batches` has not been called yet. |
Source code in src/pydvl/valuation/samplers/base.py
generate_batches
¶
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/base.py
sample_limit
abstractmethod
¶
sample_limit(indices: IndexSetT) -> int | None
Number of samples that can be generated from the indices.
| PARAMETER | DESCRIPTION |
|---|---|
| `indices` | The indices used in the sampler. TYPE: `IndexSetT` |

| RETURNS | DESCRIPTION |
|---|---|
| `int \| None` | The maximum number of samples that will be generated, or `None` if the sampler is infinite. |
Source code in src/pydvl/valuation/samplers/base.py
result_updater
¶
result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]
Returns a callable that updates a valuation result with a value update.
Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.
| PARAMETER | DESCRIPTION |
|---|---|
| `result` | The result to update. TYPE: `ValuationResult` |
Returns: A callable object that updates the result with a value update
Source code in src/pydvl/valuation/samplers/base.py
index_iterator
¶
index_iterator(indices: IndexSetT) -> Generator[IndexT | None, None, None]
Iterates over indices with the method specified at construction.
Source code in src/pydvl/valuation/samplers/powerset.py
generate
abstractmethod
¶
Generates samples over the powerset of indices.
Each PowersetSampler defines its own way to generate the subsets by implementing this method. The outer loop is handled by the index_iterator. Batching is handled by the generate_batches method.
| PARAMETER | DESCRIPTION |
|---|---|
| `indices` | The set from which to generate samples. TYPE: `IndexSetT` |
Source code in src/pydvl/valuation/samplers/powerset.py
log_weight
¶
Correction coming from Monte Carlo integration so that the mean of the marginals converges to the value: the uniform distribution over the powerset of a set with n-1 elements has mass 1/2^{n-1} over each subset.
Source code in src/pydvl/valuation/samplers/powerset.py
LOOSampler
¶
LOOSampler(
batch_size: int = 1,
index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
seed: Seed | None = None,
)
Bases: PowersetSampler
Leave-One-Out sampler.
In this special case of a powerset sampler, for every index \(i\) in the set \(S\), the sample \((i, S_{-i})\) is returned.
| PARAMETER | DESCRIPTION |
|---|---|
| `batch_size` | The number of samples to generate per batch. Batches are processed together by each subprocess when working in parallel. TYPE: `int` |
| `index_iteration` | The strategy to use for iterating over indices to update. By default, a finite sequential index iteration is used, which is what LOOValuation expects. TYPE: `Type[IndexIteration]` |
| `seed` | The seed for the random number generator used in case the index iteration is random. TYPE: `Seed \| None` |
New in version 0.10.0
Source code in src/pydvl/valuation/samplers/powerset.py
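For three indices, the generated samples are exactly \((0, \{1, 2\})\), \((1, \{0, 2\})\) and \((2, \{0, 1\})\). A sketch, mirroring the iteration pattern of the DeterministicUniformSampler example below:
```python
from pydvl.valuation.samplers import LOOSampler
import numpy as np

sampler = LOOSampler()
for idx, s in sampler.generate_batches(np.arange(3)):
    print(f"{idx} - {s}", end=", ")  # roughly: 0 - [1 2], 1 - [0 2], 2 - [0 1],
```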
interrupt
¶
__len__
¶
__len__() -> int
Returns the length of the current sample generation in generate_batches.
| RAISES | DESCRIPTION |
|---|---|
| `TypeError` | If the sampler is infinite or `generate_batches` has not been called yet. |
Source code in src/pydvl/valuation/samplers/base.py
generate_batches
¶
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/base.py
result_updater
¶
result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]
Returns a callable that updates a valuation result with a value update.
Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.
| PARAMETER | DESCRIPTION |
|---|---|
| `result` | The result to update. TYPE: `ValuationResult` |
Returns: A callable object that updates the result with a value update
Source code in src/pydvl/valuation/samplers/base.py
index_iterator
¶
index_iterator(indices: IndexSetT) -> Generator[IndexT | None, None, None]
Iterates over indices with the method specified at construction.
Source code in src/pydvl/valuation/samplers/powerset.py
log_weight
¶
This sampler returns only sets of size n-1. There are n such sets, so the probability of drawing one is 1/n, or 0 if subset_len != n-1.
Source code in src/pydvl/valuation/samplers/powerset.py
DeterministicUniformSampler
¶
DeterministicUniformSampler(
batch_size: int = 1,
index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
)
Bases: PowersetSampler
An iterator to perform uniform deterministic sampling of subsets.
For every index \(i\), each subset of the complement `indices - {i}` is returned.
| PARAMETER | DESCRIPTION |
|---|---|
| `batch_size` | The number of samples to generate per batch. Batches are processed together by each subprocess when working in parallel. TYPE: `int` |
| `index_iteration` | The strategy to use for iterating over indices to update. This iteration can be either finite or infinite. TYPE: `Type[IndexIteration]` |
Example
The code:
```python
from pydvl.valuation.samplers import DeterministicUniformSampler
import numpy as np

sampler = DeterministicUniformSampler()
for idx, s in sampler.generate_batches(np.arange(2)):
    print(f"{idx} - {s}", end=", ")
```
Should produce the output:
```
0 - [], 0 - [1], 1 - [], 1 - [0],
```
Source code in src/pydvl/valuation/samplers/powerset.py
interrupt
¶
__len__
¶
__len__() -> int
Returns the length of the current sample generation in generate_batches.
| RAISES | DESCRIPTION |
|---|---|
| `TypeError` | If the sampler is infinite or `generate_batches` has not been called yet. |
Source code in src/pydvl/valuation/samplers/base.py
generate_batches
¶
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/base.py
log_weight
¶
Correction coming from Monte Carlo integration so that the mean of the marginals converges to the value: the uniform distribution over the powerset of a set with n-1 elements has mass 1/2^{n-1} over each subset.
Source code in src/pydvl/valuation/samplers/powerset.py
result_updater
¶
result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]
Returns a callable that updates a valuation result with a value update.
Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.
| PARAMETER | DESCRIPTION |
|---|---|
| `result` | The result to update. TYPE: `ValuationResult` |
Returns: A callable object that updates the result with a value update
Source code in src/pydvl/valuation/samplers/base.py
index_iterator
¶
index_iterator(indices: IndexSetT) -> Generator[IndexT | None, None, None]
Iterates over indices with the method specified at construction.
Source code in src/pydvl/valuation/samplers/powerset.py
UniformSampler
¶
UniformSampler(
batch_size: int = 1,
index_iteration: Type[IndexIteration] = SequentialIndexIteration,
seed: Seed | None = None,
)
Bases: StochasticSamplerMixin, PowersetSampler
Draws random samples uniformly from the powerset of the index set.
Iterating over every index \(i\), either in sequence or at random depending on the value of `index_iteration`, one subset of the complement `indices - {i}` is sampled with equal probability \(2^{-(n-1)}\).
| PARAMETER | DESCRIPTION |
|---|---|
| `batch_size` | The number of samples to generate per batch. Batches are processed together by each subprocess when working in parallel. TYPE: `int` |
| `index_iteration` | The strategy to use for iterating over indices to update. This iteration can be either finite or infinite. TYPE: `Type[IndexIteration]` |
| `seed` | The seed for the random number generator. TYPE: `Seed \| None` |
Example
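A sketch mirroring the DeterministicUniformSampler example above, but with subsets drawn at random; the output shown is illustrative, since it depends on the seed:
```python
from pydvl.valuation.samplers import UniformSampler
import numpy as np

sampler = UniformSampler(seed=42)
# The default index iteration is infinite, so take only a few samples.
for (idx, s), _ in zip(sampler.generate_batches(np.arange(2)), range(4)):
    print(f"{idx} - {s}", end=", ")
# e.g.: 0 - [1], 0 - [], 1 - [0], 1 - [],
```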
Source code in src/pydvl/valuation/samplers/powerset.py
interrupt
¶
__len__
¶
__len__() -> int
Returns the length of the current sample generation in generate_batches.
| RAISES | DESCRIPTION |
|---|---|
| `TypeError` | If the sampler is infinite or `generate_batches` has not been called yet. |
Source code in src/pydvl/valuation/samplers/base.py
generate_batches
¶
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/base.py
log_weight
¶
Correction coming from Monte Carlo integration so that the mean of the marginals converges to the value: the uniform distribution over the powerset of a set with n-1 elements has mass 1/2^{n-1} over each subset.
Source code in src/pydvl/valuation/samplers/powerset.py
result_updater
¶
result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]
Returns a callable that updates a valuation result with a value update.
Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.
| PARAMETER | DESCRIPTION |
|---|---|
| `result` | The result to update. TYPE: `ValuationResult` |
Returns: A callable object that updates the result with a value update
Source code in src/pydvl/valuation/samplers/base.py
index_iterator
¶
index_iterator(indices: IndexSetT) -> Generator[IndexT | None, None, None]
Iterates over indices with the method specified at construction.
Source code in src/pydvl/valuation/samplers/powerset.py
AntitheticSampler
¶
Bases: StochasticSamplerMixin, PowersetSampler
A sampler that draws subsets uniformly at random, together with their complements.
Works as UniformSampler, but for every tuple \((i,S)\), it subsequently returns \((i,S^c)\), where \(S^c\) is the complement of the set \(S\) in the set of indices, excluding \(i\).
Source code in src/pydvl/valuation/samplers/utils.py
interrupt
¶
__len__
¶
__len__() -> int
Returns the length of the current sample generation in generate_batches.
| RAISES | DESCRIPTION |
|---|---|
| `TypeError` | If the sampler is infinite or `generate_batches` has not been called yet. |
Source code in src/pydvl/valuation/samplers/base.py
generate_batches
¶
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/base.py
log_weight
¶
Correction coming from Monte Carlo integration so that the mean of the marginals converges to the value: the uniform distribution over the powerset of a set with n-1 elements has mass 1/2^{n-1} over each subset.
Source code in src/pydvl/valuation/samplers/powerset.py
result_updater
¶
result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]
Returns a callable that updates a valuation result with a value update.
Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.
| PARAMETER | DESCRIPTION |
|---|---|
| `result` | The result to update. TYPE: `ValuationResult` |
Returns: A callable object that updates the result with a value update
Source code in src/pydvl/valuation/samplers/base.py
index_iterator
¶
index_iterator(indices: IndexSetT) -> Generator[IndexT | None, None, None]
Iterates over indices with the method specified at construction.
Source code in src/pydvl/valuation/samplers/powerset.py
SampleSizeStrategy
¶
SampleSizeStrategy(n_samples: int)
Bases: ABC
An object to compute the number of samples to take for a given set size. Based on Wu et al. (2023)1, Theorem 4.2.
To be used with StratifiedSampler.
Sets the number of sets of size \(k\) to be
$$ m_k = m \frac{f(k)}{\sum_{j=0}^n f(j)}, $$
for a total number of samples \(m\) and some choice of \(f\). Implementations of this base class must override the method fun(). It is provided both the size \(k\) and the total number of indices \(n\) as arguments.
| PARAMETER | DESCRIPTION |
|---|---|
| `n_samples` | Number of samples for the stratified sampler to generate, per index. If the sampler uses NoIndexIteration, then this will coincide with the total number of samples. TYPE: `int` |
Source code in src/pydvl/valuation/samplers/stratified.py
fun
abstractmethod
¶
The function \(f\) to use in the heuristic.

| PARAMETER | DESCRIPTION |
|---|---|
| `n_indices` | Size of the index set. |
| `subset_len` | Size of the subset. |
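A minimal sketch of a custom strategy, assuming fun() is an instance method taking the two arguments above:
```python
from pydvl.valuation.samplers.stratified import SampleSizeStrategy

class QuadraticSampleSize(SampleSizeStrategy):
    """Hypothetical heuristic f(k) = 1/(1+k)^2, concentrating samples
    on small set sizes more strongly than HarmonicSampleSize."""

    def fun(self, n_indices: int, subset_len: int) -> float:
        return 1.0 / (1 + subset_len) ** 2
```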
sample_sizes
cached
¶
Precomputes the number of samples to take for each set size, from 0 up to `n_indices` inclusive.
This method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.
Note
A naive implementation, e.g. rounding each fractional count independently, would not respect the total number of samples, nor would it distribute remainders correctly.

| PARAMETER | DESCRIPTION |
|---|---|
| `n_indices` | Number of indices in the index set from which to sample. TYPE: `int` |
| `quantize` | Whether to perform the remainder distribution. If `False`, the fractional number of samples per size is returned. TYPE: `bool` |

Returns: The exact (integer) number of samples to take for each set size, if `quantize` is `True`. Otherwise, the fractional number of samples.
Source code in src/pydvl/valuation/samplers/stratified.py
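One way to implement the correction described above is largest-remainder rounding (a sketch, not pyDVL's actual code):
```python
import numpy as np

def quantize(fractional: np.ndarray, total: int) -> np.ndarray:
    """Floor all counts, then hand the leftover samples to the entries
    with the largest fractional parts, so that the sum equals `total`."""
    floored = np.floor(fractional).astype(int)
    remainder = int(total - floored.sum())
    order = np.argsort(fractional - floored)[::-1]  # largest remainders first
    floored[order[:remainder]] += 1
    return floored

quantize(np.array([43.8, 21.9, 14.6, 10.9, 8.8]), 100)  # array([44, 22, 14, 11, 9])
```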
ConstantSampleSize
¶
Bases: SampleSizeStrategy
Use a constant number of samples for each set size between two (optional) bounds. The total number of samples (per index) is respected.
| PARAMETER | DESCRIPTION |
|---|---|
| `n_samples` | Total number of samples to generate per index. TYPE: `int` |
| `lower_bound` | Lower bound for the set size. If the set size is smaller than this, the probability of sampling is 0. |
| `upper_bound` | Upper bound for the set size. If the set size is larger than this, the probability of sampling is 0. If `None`, there is no upper bound. |
Source code in src/pydvl/valuation/samplers/stratified.py
sample_sizes
cached
¶
Precomputes the number of samples to take for each set size, from 0 up to `n_indices` inclusive.
This method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.
Note
A naive implementation, e.g. rounding each fractional count independently, would not respect the total number of samples, nor would it distribute remainders correctly.

| PARAMETER | DESCRIPTION |
|---|---|
| `n_indices` | Number of indices in the index set from which to sample. TYPE: `int` |
| `quantize` | Whether to perform the remainder distribution. If `False`, the fractional number of samples per size is returned. TYPE: `bool` |

Returns: The exact (integer) number of samples to take for each set size, if `quantize` is `True`. Otherwise, the fractional number of samples.
Source code in src/pydvl/valuation/samplers/stratified.py
GroupTestingSampleSize
¶
GroupTestingSampleSize(n_samples: int = 1)
Bases: SampleSizeStrategy
Heuristic choice of samples per set size used for Group Testing.
GroupTestingShapleyValuation uses this strategy for the stratified sampling of samples with which to construct the linear problem it requires.
This heuristic sets the number of sets of size \(k\) to be
$$ m_k = m \frac{f(k)}{\sum_{j} f(j)}, $$
for a total number of samples \(m\) and an \(f\) matching the distribution over set sizes used by Group Testing Shapley.
For GT Shapley, \(m=1\), and \(m_k\) is interpreted as the probability of sampling size \(k\).
Source code in src/pydvl/valuation/samplers/stratified.py
sample_sizes
cached
¶
Precomputes the number of samples to take for each set size, from 0 up to `n_indices` inclusive.
This method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.
Note
A naive implementation, e.g. rounding each fractional count independently, would not respect the total number of samples, nor would it distribute remainders correctly.

| PARAMETER | DESCRIPTION |
|---|---|
| `n_indices` | Number of indices in the index set from which to sample. TYPE: `int` |
| `quantize` | Whether to perform the remainder distribution. If `False`, the fractional number of samples per size is returned. TYPE: `bool` |

Returns: The exact (integer) number of samples to take for each set size, if `quantize` is `True`. Otherwise, the fractional number of samples.
Source code in src/pydvl/valuation/samplers/stratified.py
HarmonicSampleSize
¶
HarmonicSampleSize(n_samples: int)
Bases: SampleSizeStrategy
Heuristic choice of samples per set size for VRDS.
Sets the number of sets of size \(k\) to be
$$ m_k = m \frac{f(k)}{\sum_{j=0}^n f(j)}, $$
for a total number of samples \(m\) and
$$ f(k) = \frac{1}{1+k}. $$
| PARAMETER | DESCRIPTION |
|---|---|
| `n_samples` | Number of samples for the stratified sampler to generate, per index. If the sampler uses NoIndexIteration, then this will coincide with the total number of samples. TYPE: `int` |
Source code in src/pydvl/valuation/samplers/stratified.py
sample_sizes
cached
¶
Precomputes the number of samples to take for each set size, from 0 up to `n_indices` inclusive.
This method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.
Note
A naive implementation, e.g. rounding each fractional count independently, would not respect the total number of samples, nor would it distribute remainders correctly.

| PARAMETER | DESCRIPTION |
|---|---|
| `n_indices` | Number of indices in the index set from which to sample. TYPE: `int` |
| `quantize` | Whether to perform the remainder distribution. If `False`, the fractional number of samples per size is returned. TYPE: `bool` |

Returns: The exact (integer) number of samples to take for each set size, if `quantize` is `True`. Otherwise, the fractional number of samples.
Source code in src/pydvl/valuation/samplers/stratified.py
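For instance, with \(n=4\) indices and \(m=100\) samples per index, \(f(k) = 1/(1+k)\) yields fractional counts of roughly \((43.8, 21.9, 14.6, 10.9, 8.8)\) for \(k = 0, \dots, 4\), which are then quantized as described above (a sketch, assuming the class is re-exported from the samplers package):
```python
from pydvl.valuation.samplers import HarmonicSampleSize

strategy = HarmonicSampleSize(n_samples=100)
strategy.sample_sizes(n_indices=4)  # roughly array([44, 22, 14, 11, 9])
```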
PowerLawSampleSize
¶
Bases: SampleSizeStrategy
Heuristic choice of samples per set size for VRDS.
Sets the number of sets of size \(k\) to be
$$ m_k = m \frac{f(k)}{\sum_{j=0}^n f(j)}, $$
for a total number of samples \(m\) and
$$ f(k) = (1+k)^a, $$
for some exponent \(a\). With \(a=-1\) one recovers the HarmonicSampleSize heuristic.
| PARAMETER | DESCRIPTION |
|---|---|
| `n_samples` | Total number of samples to generate per index. TYPE: `int` |
| `exponent` | The exponent to use. Recommended values are between -1 and -0.5. TYPE: `float` |
Source code in src/pydvl/valuation/samplers/stratified.py
sample_sizes
cached
¶
Precomputes the number of samples to take for each set size, from 0 up to `n_indices` inclusive.
This method corrects rounding errors taking into account the fractional parts so that the total number of samples is respected, while allocating remainders in a way that follows the relative sizes of the fractional parts.
Note
A naive implementation, e.g. rounding each fractional count independently, would not respect the total number of samples, nor would it distribute remainders correctly.

| PARAMETER | DESCRIPTION |
|---|---|
| `n_indices` | Number of indices in the index set from which to sample. TYPE: `int` |
| `quantize` | Whether to perform the remainder distribution. If `False`, the fractional number of samples per size is returned. TYPE: `bool` |

Returns: The exact (integer) number of samples to take for each set size, if `quantize` is `True`. Otherwise, the fractional number of samples.
Source code in src/pydvl/valuation/samplers/stratified.py
SampleSizeIteration
¶
SampleSizeIteration(strategy: SampleSizeStrategy, n_indices: int)
Bases: ABC
Given a strategy and the number of indices, yields tuples `(k, count)` that the sampler loop will use.

| PARAMETER | DESCRIPTION |
|---|---|
| `strategy` | The strategy to use for computing the number of samples to take. TYPE: `SampleSizeStrategy` |
| `n_indices` | The number of indices in the index set from which samples are taken. TYPE: `int` |
Source code in src/pydvl/valuation/samplers/stratified.py
DeterministicSizeIteration
¶
DeterministicSizeIteration(strategy: SampleSizeStrategy, n_indices: int)
Bases: SampleSizeIteration
Generates exactly \(m_k\) samples for each set size \(k\) before moving to the next.
Source code in src/pydvl/valuation/samplers/stratified.py
RandomSizeIteration
¶
RandomSizeIteration(
strategy: SampleSizeStrategy, n_indices: int, seed: Seed | None = None
)
Bases: SampleSizeIteration
Draws a set size \(k\) following the distribution of sizes given by the strategy.
Source code in src/pydvl/valuation/samplers/stratified.py
RoundRobinIteration
¶
RoundRobinIteration(strategy: SampleSizeStrategy, n_indices: int)
Bases: SampleSizeIteration
Generates one sample for each set size \(k\) before moving to the next.
This continues yielding until every size \(k\) has been emitted exactly \(m_k\) times.
For example, if `strategy.sample_sizes() == [2, 3, 1]`, then we want the sequence: `(0,1), (1,1), (2,1), (0,1), (1,1), (1,1)`.
Source code in src/pydvl/valuation/samplers/stratified.py
StratifiedSampler
¶
StratifiedSampler(
sample_sizes: SampleSizeStrategy,
sample_sizes_iteration: Type[
SampleSizeIteration
] = DeterministicSizeIteration,
batch_size: int = 1,
index_iteration: Type[IndexIteration] = FiniteSequentialIndexIteration,
seed: Seed | None = None,
)
Bases: StochasticSamplerMixin, PowersetSampler
A sampler stratified by coalition size with variable number of samples per set size.
Variance Reduced Stratified Sampler (VRDS)¶
Stratified sampling was introduced at least as early as Maleki et al. (2014)3. Wu et al. (2023)2 introduced heuristics adequate for ML tasks.
Choosing the number of samples per set size¶
The idea of VRDS is to allow per-set-size configuration of the total number of samples in order to reduce the variance coming from the marginal utility evaluations.
It is known (Wu et al. (2023), Theorem 4.2) that a minimum variance estimator of
Shapley values samples a number \(m_k\) of sets of size \(k\) based on the variance of
the marginal utility at that set size. However, this quantity is unknown in
practice, so the authors propose a simple heuristic. This function
(sample_sizes
in the arguments) is deterministic, and in particular does
not depend on run-time variance estimates, as an adaptive method might do. Section 4
of Wu et al. (2023) shows a good default choice is based on the harmonic function
of the set size \(k\) (see
HarmonicSampleSize).
| PARAMETER | DESCRIPTION |
|---|---|
| `sample_sizes` | An object which returns the number of samples to take for a given set size. TYPE: `SampleSizeStrategy` |
| `sample_sizes_iteration` | How to loop over sample sizes. The main modes are: deterministically (for every k, generate m_k samples before moving to k+1), or stochastically (sample sizes k according to the distribution given by `sample_sizes`). TYPE: `Type[SampleSizeIteration]` |
| `batch_size` | The number of samples to generate per batch. Batches are processed together by each subprocess when working in parallel. TYPE: `int` |
| `index_iteration` | The strategy to use for iterating over indices to update. Note that anything other than returning each index exactly once will break the weight computation. TYPE: `Type[IndexIteration]` |
| `seed` | The seed for the random number generator. TYPE: `Seed \| None` |
New in version 0.10.0
Source code in src/pydvl/valuation/samplers/stratified.py
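A construction sketch with the names from the signature above (the re-exports from the samplers package are an assumption):
```python
from pydvl.valuation.samplers import (
    HarmonicSampleSize,
    RandomSizeIteration,
    StratifiedSampler,
)

# Draw set sizes k at random following the distribution induced by m_k,
# i.e. the stochastic mode described for sample_sizes_iteration.
sampler = StratifiedSampler(
    sample_sizes=HarmonicSampleSize(n_samples=64),
    sample_sizes_iteration=RandomSizeIteration,
    seed=42,
)
```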
interrupt
¶
__len__
¶
__len__() -> int
Returns the length of the current sample generation in generate_batches.
| RAISES | DESCRIPTION |
|---|---|
| `TypeError` | If the sampler is infinite or `generate_batches` has not been called yet. |
Source code in src/pydvl/valuation/samplers/base.py
generate_batches
¶
Batches the samples and yields them.
Source code in src/pydvl/valuation/samplers/base.py
result_updater
¶
result_updater(result: ValuationResult) -> ResultUpdater[ValueUpdateT]
Returns a callable that updates a valuation result with a value update.
Because we use log-space computation for numerical stability, the default result updater keeps track of several quantities required to maintain accurate running 1st and 2nd moments.
| PARAMETER | DESCRIPTION |
|---|---|
| `result` | The result to update. TYPE: `ValuationResult` |
Returns: A callable object that updates the result with a value update
Source code in src/pydvl/valuation/samplers/base.py
index_iterator
¶
index_iterator(indices: IndexSetT) -> Generator[IndexT | None, None, None]
Iterates over indices with the method specified at construction.
Source code in src/pydvl/valuation/samplers/powerset.py
log_weight
¶
The probability of sampling a set of size \(k\) is \(1/\binom{n}{k}\) times the probability of choosing size \(k\), which is the number of samples for that size divided by the total number of samples for all sizes:
$$ P(S) = \binom{n}{k}^{-1} \frac{m_k}{m}, $$
where \(m_k\) is the number of samples of size \(k\) and \(m\) is the total number of samples.
| PARAMETER | DESCRIPTION |
|---|---|
| `n` | Size of the index set. TYPE: `int` |
| `subset_len` | Size of the subset. TYPE: `int` |
Returns: The logarithm of the probability of having sampled a set of size `subset_len`.
Source code in src/pydvl/valuation/samplers/stratified.py
TruncationPolicy
¶
Bases: ABC
A policy for deciding whether to stop computation of a batch of samples.
Statistics are kept on the total number of calls and truncations as `n_calls` and `n_truncations` respectively.

| ATTRIBUTE | DESCRIPTION |
|---|---|
| `n_calls` | Number of calls to the policy. TYPE: `int` |
| `n_truncations` | Number of truncations made by the policy. TYPE: `int` |
Todo
Because the policy objects are copied to the workers, the statistics are not accessible from the coordinating process. We need to add methods for this.
Source code in src/pydvl/valuation/samplers/truncation.py
reset
abstractmethod
¶
reset(utility: UtilityBase)
__call__
¶
Check whether the computation should be interrupted.
| PARAMETER | DESCRIPTION |
|---|---|
| `idx` | Position in the batch currently being computed. TYPE: `int` |
| `score` | Last utility computed. TYPE: `float` |
| `batch_size` | Size of the batch being computed. TYPE: `int` |

| RETURNS | DESCRIPTION |
|---|---|
| `bool` | `True` if the computation should be interrupted, `False` otherwise. |
Source code in src/pydvl/valuation/samplers/truncation.py
NoTruncation
¶
Bases: TruncationPolicy
A policy which never interrupts the computation.
Source code in src/pydvl/valuation/samplers/truncation.py
__call__
¶
Check whether the computation should be interrupted.
| PARAMETER | DESCRIPTION |
|---|---|
| `idx` | Position in the batch currently being computed. TYPE: `int` |
| `score` | Last utility computed. TYPE: `float` |
| `batch_size` | Size of the batch being computed. TYPE: `int` |

| RETURNS | DESCRIPTION |
|---|---|
| `bool` | `True` if the computation should be interrupted, `False` otherwise. |
Source code in src/pydvl/valuation/samplers/truncation.py
FixedTruncation
¶
FixedTruncation(fraction: float)
Bases: TruncationPolicy
Break a computation after a fixed number of updates.
The experiments in Appendix B of (Ghorbani and Zou, 2019)1 show that when the training set size is large enough, one can simply truncate the iteration over permutations after a fixed number of steps. This happens because beyond a certain number of samples in a training set, the model becomes insensitive to new ones. Alas, this strongly depends on the data distribution and the model and there is no automatic way of estimating this number.
| PARAMETER | DESCRIPTION |
|---|---|
| `fraction` | Fraction of updates in a batch to compute before stopping (e.g. 0.5 to compute half of the marginals in a permutation). TYPE: `float` |
Source code in src/pydvl/valuation/samplers/truncation.py
__call__
¶
Check whether the computation should be interrupted.
| PARAMETER | DESCRIPTION |
|---|---|
| `idx` | Position in the batch currently being computed. TYPE: `int` |
| `score` | Last utility computed. TYPE: `float` |
| `batch_size` | Size of the batch being computed. TYPE: `int` |

| RETURNS | DESCRIPTION |
|---|---|
| `bool` | `True` if the computation should be interrupted, `False` otherwise. |
Source code in src/pydvl/valuation/samplers/truncation.py
RelativeTruncation
¶
Bases: TruncationPolicy
Break a computation if the utility is close enough to the total utility.
This is called "performance tolerance" in (Ghorbani and Zou, 2019)1.
Warning
Initialization and reset() of this policy imply the computation of the total utility for the dataset, which can be expensive!
| PARAMETER | DESCRIPTION |
|---|---|
| `rtol` | Relative tolerance. The permutation is broken if the last computed utility is within this tolerance of the total utility. TYPE: `float` |
| `burn_in_fraction` | Fraction of samples within a permutation to wait until actually checking. TYPE: `float` |
Source code in src/pydvl/valuation/samplers/truncation.py
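A sketch contrasting truncation policies on a permutation sampler (constructor arguments as documented above):
```python
from pydvl.valuation.samplers import PermutationSampler
from pydvl.valuation.samplers.truncation import FixedTruncation, RelativeTruncation

# Stop after half of the marginals in each permutation:
fixed = PermutationSampler(truncation=FixedTruncation(fraction=0.5))

# Stop once the last utility is within 5% of the total utility, after a
# 30% burn-in. Constructing / resetting this policy computes the total
# utility, which can be expensive.
relative = PermutationSampler(
    truncation=RelativeTruncation(rtol=0.05, burn_in_fraction=0.3)
)
```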
DeviationTruncation
¶
Bases: TruncationPolicy
Break a computation if the last computed utility is close to the total utility.
This is essentially the same as RelativeTruncation, but with the tolerance determined by a multiple of the standard deviation of the utilities.
Danger
This policy can break early if the utility function has high variance. This can lead to gross underestimation of values. Use with caution.
Warning
Initialization and reset() of this policy imply the computation of the total utility for the dataset, which can be expensive!
| PARAMETER | DESCRIPTION |
|---|---|
| `burn_in_fraction` | Fraction of samples within a permutation to wait until actually checking. TYPE: `float` |
| `sigmas` | Number of standard deviations to use as a threshold. TYPE: `float` |
Source code in src/pydvl/valuation/samplers/truncation.py
Dataset
¶
Dataset(
x: NDArray,
y: NDArray,
feature_names: Sequence[str] | NDArray[str_] | None = None,
target_names: Sequence[str] | NDArray[str_] | None = None,
data_names: Sequence[str] | NDArray[str_] | None = None,
description: str | None = None,
multi_output: bool = False,
)
A convenience class to handle datasets.
It holds a dataset, together with info on feature names, target names, and data names. It is used to pass data around to valuation methods.
The underlying data arrays can be accessed via Dataset.data(), which returns the tuple `(X, y)` as a read-only RawData object. The data can be accessed by indexing the object directly, e.g. `dataset[0]` will return the data point corresponding to index 0 in `dataset`. For this base class, this is the same as `dataset.data([0])`, which is the first point in the data array, but derived classes can behave differently.
| PARAMETER | DESCRIPTION |
|---|---|
| `x` | training data. TYPE: `NDArray` |
| `y` | labels for training data. TYPE: `NDArray` |
| `feature_names` | names of the features of x data |
| `target_names` | names of the features of y data |
| `data_names` | names assigned to data points. For example, if the dataset is a time series, each entry can be a timestamp which can be referenced directly instead of using a row number. |
| `description` | A textual description of the dataset. TYPE: `str \| None` |
| `multi_output` | set to `True` if the labels `y` are multi-dimensional. TYPE: `bool` |
Changed in version 0.10.0
No longer holds split data, but only x, y.
Changed in version 0.10.0
Slicing now returns a new `Dataset` object, not raw data.
Source code in src/pydvl/valuation/dataset.py
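A short sketch of the access patterns just described (assuming Dataset is importable at the package level):
```python
import numpy as np
from pydvl.valuation import Dataset

dataset = Dataset(x=np.arange(10).reshape(-1, 1), y=np.arange(10))
raw = dataset.data([0, 1])  # read-only RawData with the first two points
first = dataset[0]          # a new Dataset for index 0 (slicing, v0.10+)
```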
indices
property
¶
Index of positions in data.x_train.
Contiguous integers from 0 to len(Dataset).
names
property
¶
Names of each individual datapoint.
Used for reporting Shapley values.
feature
¶
Returns a slice for the feature with the given name.
Source code in src/pydvl/valuation/dataset.py
data
¶
Given a set of indices, returns the training data that refer to those indices, as a read-only tuple-like structure.
This is used mainly by subclasses of UtilityBase to retrieve subsets of the data from indices.
| PARAMETER | DESCRIPTION |
|---|---|
| `indices` | Optional indices that will be used to select points from the training data. If `None`, the entire dataset is returned. |

| RETURNS | DESCRIPTION |
|---|---|
| `RawData` | If `indices` is `None`, the entire data as a read-only tuple-like structure; otherwise the data points corresponding to the given indices. |
Source code in src/pydvl/valuation/dataset.py
data_indices
¶
Returns a subset of indices.
This is equivalent to using `Dataset.indices[logical_indices]`, but allows subclasses to define special behaviour, e.g. when indices in `Dataset` do not match the indices in the data.
For `Dataset`, this is a simple pass-through.
| PARAMETER | DESCRIPTION |
|---|---|
| `indices` | A set of indices held by this object |

| RETURNS | DESCRIPTION |
|---|---|
| `NDArray[int_]` | The indices of the data points in the data array. |
Source code in src/pydvl/valuation/dataset.py
logical_indices
¶
Returns the indices in this `Dataset` for the given indices in the data array.
This is equivalent to using `Dataset.indices[data_indices]`, but allows subclasses to define special behaviour, e.g. when indices in `Dataset` do not match the indices in the data.
| PARAMETER | DESCRIPTION |
|---|---|
| `indices` | A set of indices in the data array. |

| RETURNS | DESCRIPTION |
|---|---|
| `NDArray[int_]` | The abstract indices for the given data indices. |
Source code in src/pydvl/valuation/dataset.py
from_sklearn
classmethod
¶
from_sklearn(
data: Bunch,
train_size: int | float = 0.8,
random_state: int | None = None,
stratify_by_target: bool = False,
**kwargs,
) -> tuple[Dataset, Dataset]
Constructs two Dataset objects from a sklearn.utils.Bunch, as returned by the `load_*` functions in scikit-learn toy datasets.
Example
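A sketch of the intended usage:
```python
from sklearn.datasets import load_iris
from pydvl.valuation import Dataset

train, test = Dataset.from_sklearn(load_iris(), train_size=0.8, random_state=16)
```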
| PARAMETER | DESCRIPTION |
|---|---|
| `data` | scikit-learn Bunch object. The following attributes are supported: `data`, `target`, `feature_names` (optional), `target_names` (optional), `DESCR` (optional). TYPE: `Bunch` |
| `train_size` | size of the training dataset. Used in `train_test_split`; the value is automatically set to the complement of the test size. |
| `random_state` | seed for train / test split |
| `stratify_by_target` | If `True`, data is split in a stratified fashion, using the target variable as labels. Read more in scikit-learn's user guide. |
| `kwargs` | Additional keyword arguments to pass to the Dataset constructor. Use this to pass e.g. `is_multi_output`. |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple[Dataset, Dataset]` | Two objects with the sklearn dataset split into training and test sets. |
Changed in version 0.6.0
Added kwargs to pass to the Dataset constructor.
Changed in version 0.10.0
Returns a tuple of two Dataset objects.
Source code in src/pydvl/valuation/dataset.py
from_arrays
classmethod
¶
from_arrays(
X: NDArray,
y: NDArray,
train_size: float = 0.8,
random_state: int | None = None,
stratify_by_target: bool = False,
**kwargs: Any,
) -> tuple[Dataset, Dataset]
Constructs a Dataset object from X and y numpy arrays, as returned by the `make_*` functions in sklearn generated datasets.
Example
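A sketch of the intended usage:
```python
from sklearn.datasets import make_classification
from pydvl.valuation import Dataset

X, y = make_classification(n_samples=100, n_features=4, random_state=16)
train, test = Dataset.from_arrays(X, y, train_size=0.8, stratify_by_target=True)
```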
| PARAMETER | DESCRIPTION |
|---|---|
| `X` | numpy array of shape (n_samples, n_features). TYPE: `NDArray` |
| `y` | numpy array of shape (n_samples,). TYPE: `NDArray` |
| `train_size` | size of the training dataset. Used in `train_test_split`. TYPE: `float` |
| `random_state` | seed for train / test split. TYPE: `int \| None` |
| `stratify_by_target` | If `True`, data is split in a stratified fashion, using the target variable as labels. Read more in scikit-learn's user guide. TYPE: `bool` |
| `kwargs` | Additional keyword arguments to pass to the Dataset constructor. Use this to pass e.g. `is_multi_output`. TYPE: `Any` |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple[Dataset, Dataset]` | Objects with the passed X and y arrays split across training and test sets. |
New in version 0.4.0
Changed in version 0.6.0
Added kwargs to pass to the Dataset constructor.
Changed in version 0.10.0
Returns a tuple of two Dataset objects.
Source code in src/pydvl/valuation/dataset.py
Scorer
¶
Bases: ABC
A scoring callable that takes a model and returns a scalar.
Added in version 0.10.0
ABC added
SupervisedScorer
¶
SupervisedScorer(
scoring: str
| SupervisedScorerCallable[SupervisedModelT]
| SupervisedModelT,
test_data: Dataset,
default: float,
range: tuple[float, float] = (-inf, inf),
name: str | None = None,
)
Bases: Generic[SupervisedModelT]
, Scorer
A scoring callable that takes a model, data, and labels and returns a scalar.
| PARAMETER | DESCRIPTION |
|---|---|
| `scoring` | Either a string or callable that can be passed to get_scorer. TYPE: `str \| SupervisedScorerCallable[SupervisedModelT] \| SupervisedModelT` |
| `test_data` | Dataset where the score will be evaluated. TYPE: `Dataset` |
| `default` | score to be used when a model cannot be fit, e.g. when too little data is passed, or errors arise. TYPE: `float` |
| `range` | numerical range of the score function. Some Monte Carlo methods can use this to estimate the number of samples required for a certain quality of approximation. If not provided, it can be read from the scoring object, if it provides one. TYPE: `tuple[float, float]` |
| `name` | The name of the scorer. If not provided, the name of the function passed will be used. TYPE: `str \| None` |
New in version 0.5.0
Changed in version 0.10.0
This is now SupervisedScorer and holds the test data used to evaluate the model.
Source code in src/pydvl/valuation/scorers/supervised.py
ClasswiseSupervisedScorer
¶
ClasswiseSupervisedScorer(
scoring: str
| SupervisedScorerCallable[SupervisedModelT]
| SupervisedModelT,
test_data: Dataset,
default: float = 0.0,
range: tuple[float, float] = (0, 1),
in_class_discount_fn: Callable[[float], float] = lambda x: x,
out_of_class_discount_fn: Callable[[float], float] = exp,
rescale_scores: bool = True,
name: str | None = None,
)
Bases: SupervisedScorer[SupervisedModelT]
A Scorer designed for evaluation in classification problems.
The final score is the combination of the in-class and out-of-class scores, which are e.g. the accuracy of the trained model over the instances of the test set with the same, and different, labels, respectively. See the module's documentation for more on this.
These two scores are computed with an "inner" scoring function, which must be provided upon construction.
Multi-class support
The inner score must support multiple class labels if you intend to apply them to a multi-class problem. For instance, 'accuracy' supports multiple classes, but `f1` does not. For a two-class classification problem, using `f1_weighted` is essentially equivalent to using `accuracy`.
| PARAMETER | DESCRIPTION |
|---|---|
| `scoring` | Name of the scoring function or a callable that can be passed to SupervisedScorer. |
| `default` | Score to use when a model fails to provide a number, e.g. when too little was used to train it, or errors arise. TYPE: `float` |
| `range` | Numerical range of the score function. Some Monte Carlo methods can use this to estimate the number of samples required for a certain quality of approximation. If not provided, it can be read from the scoring object, if it provides one. |
| `in_class_discount_fn` | Continuous, monotonic increasing function used to discount the in-class score. |
| `out_of_class_discount_fn` | Continuous, monotonic increasing function used to discount the out-of-class score. |
| `rescale_scores` | If set to `True`, the scores will be denormalized. This is particularly useful when the inner score function \(a_S\) is calculated by an estimator of the form \(\frac{1}{N} \sum_i x_i\). TYPE: `bool` |
| `name` | Name of the scorer. If not provided, the name of the inner scoring function will be prefixed by `classwise`. TYPE: `str \| None` |
New in version 0.7.1
Source code in src/pydvl/valuation/scorers/classwise.py
compute_in_and_out_of_class_scores
¶
compute_in_and_out_of_class_scores(
model: SupervisedModelT, rescale_scores: bool = True
) -> tuple[float, float]
Computes in-class and out-of-class scores using the provided inner scoring function. The result is the pair of both scores.
In this context, for label \(c\) calculations are executed twice: once for \(D_c\) and once for \(D_{-c}\) to determine the in-class and out-of-class scores, respectively. By default, the raw scores are multiplied by \(\frac{|D_c|}{|D|}\) and \(\frac{|D_{-c}|}{|D|}\), respectively. This is done to ensure that both scores are of the same order of magnitude. This normalization is particularly useful when the inner score function \(a_S\) is calculated by an estimator of the form \(\frac{1}{N} \sum_i x_i\), e.g. the accuracy.
| PARAMETER | DESCRIPTION |
|---|---|
| `model` | Model used for computing the score on the validation set. TYPE: `SupervisedModelT` |
| `rescale_scores` | If set to `True`, the scores will be denormalized. This is particularly useful when the inner score function \(a_S\) is calculated by an estimator of the form \(\frac{1}{N} \sum_i x_i\). TYPE: `bool` |

| RETURNS | DESCRIPTION |
|---|---|
| `tuple[float, float]` | Tuple containing the in-class and out-of-class scores. |
Source code in src/pydvl/valuation/scorers/classwise.py
StoppingCriterion
¶
StoppingCriterion(modify_result: bool = True)
Bases: ABC
A composable callable object to determine whether a computation must stop.
A `StoppingCriterion` is a callable taking a ValuationResult and returning a Status. It also keeps track of individual convergence of values with converged, and reports the overall completion of the computation with completion.
Instances of `StoppingCriterion` can be composed with the binary operators `&` (and) and `|` (or), following the truth tables of Status. The unary operator `~` (not) is also supported. These boolean operations act according to the following rules, illustrated in the example after the list:
- The results of check() are combined with the operator. See Status for the truth tables.
- The results of converged are combined with the operator (returning another boolean array).
- The completion method returns the min, max, or the complement to 1 of the completions of the operands, for AND, OR and NOT respectively. This is required for cases where one of the criteria does not keep track of the convergence of single values, e.g. MaxUpdates, because completion by default returns the mean of the boolean convergence array.
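For example (a sketch using criteria documented below):
```python
from pydvl.valuation.stopping import AbsoluteStandardError, MaxTime, MinUpdates

# Converge when standard errors are small AND every value has seen at least
# 100 updates, but stop unconditionally after one hour:
done = (AbsoluteStandardError(threshold=0.01, burn_in=16) & MinUpdates(100)) | MaxTime(seconds=3600)
```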
Subclassing¶
Subclassing this class requires implementing a check() method that returns a Status object based on a given ValuationResult. This method should update the attribute `_converged`, which is a boolean array indicating whether the value for each index has converged. When this does not make sense for a particular stopping criterion, completion should be overridden to provide an overall completion value, since its default implementation attempts to compute the mean of `_converged`.
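A minimal sketch following these rules (the import paths for Status and ValuationResult are assumptions):
```python
import numpy as np

from pydvl.utils.status import Status               # assumed import path
from pydvl.valuation.result import ValuationResult  # assumed import path
from pydvl.valuation.stopping import StoppingCriterion

class NonNegativeValues(StoppingCriterion):
    """Hypothetical criterion: converge once every value estimate is >= 0."""

    def check(self, result: ValuationResult) -> Status:
        self._converged = np.asarray(result.values) >= 0
        return Status.Converged if self._converged.all() else Status.Pending
```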
| PARAMETER | DESCRIPTION |
|---|---|
| `modify_result` | If `True`, the status of the input ValuationResult is modified in place after the call. TYPE: `bool` |
Source code in src/pydvl/valuation/stopping.py
converged
property
¶
completion
¶
completion() -> float
Returns a value between 0 and 1 indicating the completion of the computation.
__call__
¶
__call__(result: ValuationResult) -> Status
Calls check(), maybe updating the result.
Source code in src/pydvl/valuation/stopping.py
AbsoluteStandardError
¶
AbsoluteStandardError(
threshold: float,
fraction: float = 1.0,
burn_in: int = 4,
modify_result: bool = True,
)
Bases: StoppingCriterion
Determine convergence based on the standard error of the values.
If \(s_i\) is the standard error for datum \(i\), then this criterion returns Converged if \(s_i < \epsilon\) for all \(i\) and a threshold value \(\epsilon \gt 0\).
Warning
This criterion should be used with care. The standard error is a measure of the uncertainty of the estimate, but it does not guarantee that the estimate is close to the true value. For example, if the utility function is very noisy, the standard error might be very low, but the estimate might be far from the true value. In this case, one might want to use a RankCorrelation instead, which checks whether the rank of the values is stable.
| PARAMETER | DESCRIPTION |
|---|---|
| `threshold` | A value is considered to have converged if the standard error is below this threshold. A way of choosing it is to pick some percentage of the range of the values. For Shapley values this is the difference between the maximum and minimum of the utility function (to see this substitute the maximum and minimum values of the utility into the marginal contribution formula). TYPE: `float` |
| `fraction` | The fraction of values that must have converged for the criterion to return Converged. TYPE: `float` |
| `burn_in` | The number of iterations to ignore before checking for convergence. This is required because computations typically start with zero variance, as a result of using zeros(). The default is set to an arbitrary minimum which is usually enough but may need to be increased. TYPE: `int` |
| `modify_result` | If `True`, the status of the input ValuationResult is modified in place after the call. TYPE: `bool` |
Source code in src/pydvl/valuation/stopping.py
converged
property
¶
completion
¶
completion() -> float
Returns a value between 0 and 1 indicating the completion of the computation.
__call__
¶
__call__(result: ValuationResult) -> Status
Calls check(), maybe updating the result.
Source code in src/pydvl/valuation/stopping.py
MaxChecks
¶
Bases: StoppingCriterion
Terminate as soon as the number of checks exceeds the threshold.
A "check" is one call to the criterion. Note that this might have different
interpretations depending on the sampler. For example,
MSRSampler performs a single
utility evaluation to update all indices, so that's len(training_data)
checks for
a single training of the model. But it also only changes the counts
field of the
ValuationResult for about half of the
indices, which is what e.g. MaxUpdates checks.
| PARAMETER | DESCRIPTION |
|---|---|
| `n_checks` | Threshold: the criterion returns Converged once the number of checks exceeds this value. |
| `modify_result` | If `True`, the status of the input ValuationResult is modified in place after the call. TYPE: `bool` |
Source code in src/pydvl/valuation/stopping.py
converged
property
¶
__call__
¶
__call__(result: ValuationResult) -> Status
Calls check(), maybe updating the result.
Source code in src/pydvl/valuation/stopping.py
MaxUpdates
¶
Bases: StoppingCriterion
Terminate if any number of value updates exceeds or equals the given threshold.
Note
If you want to ensure that all values have been updated, you probably want MinUpdates instead.
This checks the `counts` field of a ValuationResult, i.e. the number of times that each index has been updated. For powerset samplers, the maximum of this number coincides with the maximum number of subsets sampled. For permutation samplers, it coincides with the number of permutations sampled.
| PARAMETER | DESCRIPTION |
|---|---|
| `n_updates` | Threshold: if any count of value updates exceeds or equals this value, the criterion returns Converged. TYPE: `int` |
| `modify_result` | If `True`, the status of the input ValuationResult is modified in place after the call. TYPE: `bool` |
Source code in src/pydvl/valuation/stopping.py
converged
property
¶
__call__
¶
__call__(result: ValuationResult) -> Status
Calls check(), maybe updating the result.
Source code in src/pydvl/valuation/stopping.py
NoStopping
¶
NoStopping(sampler: IndexSampler | None = None, modify_result: bool = True)
Bases: StoppingCriterion
Keep running forever or until sampling stops.
If a sampler instance is passed, and it is a finite sampler, its counter will be used to update completion status.
| PARAMETER | DESCRIPTION |
|---|---|
| `sampler` | A sampler instance to use for completion status. TYPE: `IndexSampler \| None` |
| `modify_result` | If `True`, the status of the input ValuationResult is modified in place after the call. TYPE: `bool` |
Source code in src/pydvl/valuation/stopping.py
converged
property
¶
__call__
¶
__call__(result: ValuationResult) -> Status
Calls check(), maybe updating the result.
Source code in src/pydvl/valuation/stopping.py
MinUpdates
¶
Bases: StoppingCriterion
Terminate as soon as all value updates exceed or equal the given threshold.
This checks the `counts` field of a ValuationResult, i.e. the number of times that each index has been updated. For powerset samplers, the minimum of this number is a lower bound for the number of subsets sampled. For permutation samplers, it lower-bounds the number of permutations sampled.
| PARAMETER | DESCRIPTION |
|---|---|
| `n_updates` | Threshold: if all counts of value updates exceed or equal this value, the criterion returns Converged. TYPE: `int` |
| `modify_result` | If `True`, the status of the input ValuationResult is modified in place after the call. TYPE: `bool` |
Source code in src/pydvl/valuation/stopping.py
converged
property
¶
__call__
¶
__call__(result: ValuationResult) -> Status
Calls check(), maybe updating the result.
Source code in src/pydvl/valuation/stopping.py
MaxTime
¶
Bases: StoppingCriterion
Terminate if the computation time exceeds the given number of seconds.
Checks the elapsed time since construction.
| PARAMETER | DESCRIPTION |
|---|---|
| `seconds` | Threshold: The computation is terminated if the elapsed time between object construction and a check exceeds this value. If `None`, no check is performed and the criterion never stops. |
| `modify_result` | If `True`, the status of the input ValuationResult is modified in place after the call. TYPE: `bool` |
Source code in src/pydvl/valuation/stopping.py
converged
property
¶
__call__
¶
__call__(result: ValuationResult) -> Status
Calls check(), maybe updating the result.
Source code in src/pydvl/valuation/stopping.py
HistoryDeviation
¶
HistoryDeviation(
n_steps: int,
rtol: float,
pin_converged: bool = True,
modify_result: bool = True,
)
Bases: StoppingCriterion
A simple check for relative distance to a previous step in the computation.
The method used by Ghorbani and Zou, (2019)1 computes the relative distances between the current values \(v_i^t\) and the values at the previous checkpoint \(v_i^{t-\tau}\). If the sum is below a given threshold, the computation is terminated.
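In symbols, the check is (a reconstruction from the description; \(\epsilon\) is given by the rtol parameter below):

$$\sum_{i=1}^n \frac{\left| v_i^t - v_i^{t-\tau} \right|}{v_i^t} < \epsilon$$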
When the denominator is zero, the summand is set to the value of \(v_i^{t-\tau}\).
This implementation is slightly generalised to allow for a different number of updates to individual indices, as happens with powerset samplers instead of permutations. Every subset of indices that is found to converge can be pinned to that state. Once all indices have converged, the method has converged.
Warning
This criterion is meant for the reproduction of the results in the paper, but we do not recommend using it in practice.
PARAMETER | DESCRIPTION
---|---
n_steps | Compare values after so many steps. A step is one evaluation of the criterion, which happens once per batch. TYPE: int
rtol | Relative tolerance for convergence (\(\epsilon\) in the formula). TYPE: float
pin_converged | If True, once an index has converged, it is pinned to that state. TYPE: bool
modify_result | If True, the status of the input ValuationResult is modified in place after the call. TYPE: bool
Source code in src/pydvl/valuation/stopping.py
converged
property
¶
completion
¶
completion() -> float
Returns a value between 0 and 1 indicating the completion of the computation.
__call__
¶
__call__(result: ValuationResult) -> Status
Calls check(), maybe updating the result.
Source code in src/pydvl/valuation/stopping.py
RankCorrelation
¶
Bases: StoppingCriterion
A check for stability of Spearman correlation between checks.
Convergence is reached when the change in rank correlation between two successive iterations is below a given threshold.
This criterion is used in (Wang et al.)2.
The meaning of successive iterations
Stopping criteria in pyDVL are typically evaluated after each batch of value updates is received. This can imply very different things, depending on the configuration of the samplers. For this reason, RankCorrelation itself keeps track of the number of updates that each index has seen, and only checks for correlation changes when a given fraction of all indices has been updated more than burn_in times, and at least once since the last time the criterion was checked.
PARAMETER | DESCRIPTION
---|---
rtol | Relative tolerance for convergence (\(\epsilon\) in the formula). TYPE: float
burn_in | The minimum number of updates an index must have seen before checking for convergence. This is required because the first correlation checks are usually meaningless. TYPE: int
fraction | The fraction of values that must have been updated between two correlation checks. This is to avoid comparing two results where only one value has been updated, which would have almost perfect rank correlation. TYPE: float
modify_result | If True, the status of the input ValuationResult is modified in place after the call. TYPE: bool
Added in version 0.9.0
Changed in version 0.10.0
The behaviour of the burn_in parameter was changed to look at value updates. The parameter fraction was added.
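A construction sketch (parameter values are illustrative):

```python
from pydvl.valuation.stopping import RankCorrelation

# Converge when the Spearman correlation between successive checks changes
# by less than 1e-3, once each index has seen at least 32 updates and at
# least 90% of indices were updated between checks.
stopping = RankCorrelation(rtol=1e-3, burn_in=32, fraction=0.9)
```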
Source code in src/pydvl/valuation/stopping.py
converged
property
¶
__call__
¶
__call__(result: ValuationResult) -> Status
Calls check(), maybe updating the result.
Source code in src/pydvl/valuation/stopping.py
ModelUtility
¶
ModelUtility(
model: ModelT,
scorer: Scorer,
*,
catch_errors: bool = True,
show_warnings: bool = False,
cache_backend: CacheBackend | None = None,
cached_func_options: CachedFuncConfig | None = None,
clone_before_fit: bool = True,
)
Bases: UtilityBase[SampleT], Generic[SampleT, ModelT]
Convenience wrapper with configurable memoization of the utility.
An instance of ModelUtility holds the pair of model and scoring function which determines the value of data points. This is used for the computation of all game-theoretic values like Shapley values and the Least Core.
ModelUtility expects the model to fulfill at least the BaseModel interface, i.e. to have a fit() method.
When calling the utility, the model will be cloned if it is a scikit-learn model; otherwise a copy is created using copy.deepcopy.
Since evaluating the scoring function requires retraining the model, which can be time-consuming, this class wraps it and caches the results of each execution. Caching is available both locally and across nodes, but must always be explicitly enabled for your project first, because most stochastic methods do not benefit much from it. See the caching documentation and the module documentation.
ATTRIBUTE | DESCRIPTION
---|---
model | The supervised model. TYPE: ModelT
scorer | A scoring function. If None, the score() method of the model will be used. TYPE: Scorer
PARAMETER | DESCRIPTION
---|---
model | Any supervised model. Typical choices can be found in the scikit-learn documentation. TYPE: ModelT
scorer | A scoring object. If None, the score() method of the model will be used. TYPE: Scorer
catch_errors | Set to True to catch errors when fit() fails. This can happen, e.g., when too little training data is passed, in which case the scorer's default value is returned and computation continues. TYPE: bool
show_warnings | Set to False to suppress warnings raised by fit(). TYPE: bool
cache_backend | Optional instance of CacheBackend used to memoize results to avoid duplicate computation. Note however, that for most stochastic methods, cache hits are rare, making the memory expense of caching not worth it (YMMV). TYPE: CacheBackend or None
cached_func_options | Optional configuration object for cached utility evaluation. TYPE: CachedFuncConfig or None
clone_before_fit | If True, the model will be cloned before calling fit(). TYPE: bool
Example
>>> from pydvl.valuation import Dataset, ModelUtility, Sample, SupervisedScorer
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import load_iris
>>> train, test = Dataset.from_sklearn(load_iris(), random_state=16)
>>> u = ModelUtility(LogisticRegression(random_state=16), SupervisedScorer("accuracy", test, 0, (0, 1)))
>>> u = u.with_dataset(train)
>>> u(Sample(None, train.indices))
0.9
With caching enabled:
>>> from pydvl.valuation import Dataset, ModelUtility, Sample, SupervisedScorer
>>> from pydvl.utils.caching.memory import InMemoryCacheBackend
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import load_iris
>>> train, test = Dataset.from_sklearn(load_iris(), random_state=16)
>>> cache_backend = InMemoryCacheBackend()
>>> u = ModelUtility(LogisticRegression(random_state=16), SupervisedScorer("accuracy", test, 0, (0, 1)), cache_backend=cache_backend)
>>> u = u.with_dataset(train)
>>> u(Sample(None, train.indices))
0.9
Source code in src/pydvl/valuation/utility/modelutility.py
training_data
property
¶
training_data: Dataset | None
Retrieves the training data used by this utility.
This property is read-only. In order to set it, use with_dataset().
cache_stats
property
¶
cache_stats: CacheStats | None
Cache statistics are gathered when cache is enabled. See CacheStats for all fields returned.
with_dataset
¶
Returns the utility, or a copy of it, with the given dataset.
PARAMETER | DESCRIPTION
---|---
data | The dataset to use for utility fitting (training data).
copy | Whether to copy the utility object or not. Valuation methods should always make copies to avoid unexpected side effects.
Returns: The utility object.
Source code in src/pydvl/valuation/utility/base.py
__call__
¶
__call__(sample: SampleT | None) -> float
PARAMETER | DESCRIPTION
---|---
sample | Contains a subset of valid indices for the utility's training data. TYPE: SampleT or None
Source code in src/pydvl/valuation/utility/modelutility.py
ClasswiseModelUtility
¶
ClasswiseModelUtility(
model: SupervisedModel,
scorer: ClasswiseSupervisedScorer,
*,
catch_errors: bool = True,
show_warnings: bool = False,
cache_backend: CacheBackend | None = None,
cached_func_options: CachedFuncConfig | None = None,
clone_before_fit: bool = True,
)
Bases: ModelUtility[ClasswiseSample, SupervisedModel]
ModelUtility class that is specific to class-wise Shapley valuation.
It expects a class-wise scorer and a classification task.
PARAMETER | DESCRIPTION
---|---
model | Any supervised model. Typical choices can be found in the scikit-learn documentation. TYPE: SupervisedModel
scorer | A class-wise scoring object. TYPE: ClasswiseSupervisedScorer
catch_errors | Set to True to catch errors when fit() fails, in which case the scorer's default value is returned and computation continues. TYPE: bool
show_warnings | Set to False to suppress warnings raised by fit(). TYPE: bool
cache_backend | Optional instance of CacheBackend used to wrap the _utility method of the Utility instance. By default, this is set to None, which means that utility evaluations will not be cached. TYPE: CacheBackend or None
cached_func_options | Optional configuration object for cached utility evaluation. TYPE: CachedFuncConfig or None
clone_before_fit | If True, the model will be cloned before calling fit(). TYPE: bool
Source code in src/pydvl/valuation/utility/classwise.py
training_data
property
¶
training_data: Dataset | None
Retrieves the training data used by this utility.
This property is read-only. In order to set it, use with_dataset().
cache_stats
property
¶
cache_stats: CacheStats | None
Cache statistics are gathered when cache is enabled. See CacheStats for all fields returned.
with_dataset
¶
Returns the utility, or a copy of it, with the given dataset.
PARAMETER | DESCRIPTION
---|---
data | The dataset to use for utility fitting (training data).
copy | Whether to copy the utility object or not. Valuation methods should always make copies to avoid unexpected side effects.
Returns: The utility object.
Source code in src/pydvl/valuation/utility/base.py
__call__
¶
__call__(sample: SampleT | None) -> float
PARAMETER | DESCRIPTION
---|---
sample | Contains a subset of valid indices for the utility's training data. TYPE: SampleT or None
Source code in src/pydvl/valuation/utility/modelutility.py
KNNClassifierUtility
¶
KNNClassifierUtility(
model: KNeighborsClassifier,
test_data: Dataset,
*,
catch_errors: bool = True,
show_warnings: bool = False,
cache_backend: CacheBackend | None = None,
cached_func_options: CachedFuncConfig | None = None,
clone_before_fit: bool = True,
)
Bases: ModelUtility[Sample, KNeighborsClassifier]
Utility object for KNN Classifiers.
The utility function is the model's predicted probability for the true class.
Uses of this utility
Although this class can be used in conjunction with any semi-value method and sampler, when computing Shapley values, it is recommended to use the dedicated class KNNShapleyValuation, because it implements a more efficient algorithm for computing Shapley values which runs in O(n log n) time for each test point.
PARAMETER | DESCRIPTION
---|---
model | A KNN classifier model. TYPE: KNeighborsClassifier
test_data | The test data to evaluate the model on. TYPE: Dataset
catch_errors | Set to True to catch errors when fit() fails, in which case the scorer's default value is returned and computation continues. TYPE: bool
show_warnings | Set to False to suppress warnings raised by fit(). TYPE: bool
cache_backend | Optional instance of CacheBackend used to wrap the _utility method of the Utility instance. By default, this is set to None, which means that utility evaluations will not be cached. TYPE: CacheBackend or None
cached_func_options | Optional configuration object for cached utility evaluation. TYPE: CachedFuncConfig or None
clone_before_fit | If True, the model will be cloned before calling fit(). TYPE: bool
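A construction sketch, with the import path following the source location shown below and an illustrative number of neighbours:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

from pydvl.valuation import Dataset
from pydvl.valuation.utility.knn import KNNClassifierUtility

train, test = Dataset.from_sklearn(load_iris(), random_state=16)
model = KNeighborsClassifier(n_neighbors=5)
# The utility scores a subset by the model's predicted probability
# for the true class of each test point.
utility = KNNClassifierUtility(model, test).with_dataset(train)
```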
Source code in src/pydvl/valuation/utility/knn.py
training_data
property
¶
training_data: Dataset | None
Retrieves the training data used by this utility.
This property is read-only. In order to set it, use with_dataset().
cache_stats
property
¶
cache_stats: CacheStats | None
Cache statistics are gathered when cache is enabled. See CacheStats for all fields returned.
__call__
¶
__call__(sample: SampleT | None) -> float
PARAMETER | DESCRIPTION
---|---
sample | Contains a subset of valid indices for the utility's training data. TYPE: SampleT or None
Source code in src/pydvl/valuation/utility/modelutility.py
with_dataset
¶
Return the utility, or a copy of it, with the given dataset and the model fitted on it.
PARAMETER | DESCRIPTION
---|---
data | The dataset to use. TYPE: Dataset
copy | Whether to copy the utility object or not. Additionally, if True, the model is also cloned; if False, it is fitted in place. TYPE: bool
Returns: The utility object.
Source code in src/pydvl/valuation/utility/knn.py
UtilityModel
¶
Bases: ABC
Interface for utility models.
A utility model predicts the value of a utility function given a sample. The model is trained on a collection of samples and their respective utility values. These tuples are called Utility Samples.
Utility models:
- are fitted on dictionaries of Sample -> utility value
- predict: Collection[samples] -> NDArray[utility values]
IndicatorUtilityModel
¶
IndicatorUtilityModel(predictor: SupervisedModel, n_data: int)
Bases: UtilityModel
A simple wrapper for arbitrary predictors.
Uses one-hot encoding of the indices as input for the model, as done in Wang et al. (2022)1.
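The encoding itself is straightforward. A minimal sketch of the idea, not pyDVL's internal code; one_hot_subset is a hypothetical helper:

```python
import numpy as np

def one_hot_subset(indices: np.ndarray, n_data: int) -> np.ndarray:
    """Encode a subset of dataset indices as a 0/1 feature vector."""
    x = np.zeros(n_data)
    x[indices] = 1.0
    return x

# The subset {1, 3} in a dataset of 5 points becomes [0, 1, 0, 1, 0]:
print(one_hot_subset(np.array([1, 3]), n_data=5))
```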
Source code in src/pydvl/valuation/utility/learning.py
DataUtilityLearning
¶
DataUtilityLearning(
utility: UtilityBase,
training_budget: int,
model: UtilityModel,
show_warnings: bool = True,
)
Bases: UtilityBase[SampleT]
This object wraps any class derived from UtilityBase and delegates calls to it, up until a given budget (number of iterations). Every tuple of input and output (a so-called utility sample) is stored. Once the budget is exhausted, DataUtilityLearning fits the given model to the utility samples. Subsequent calls will use the learned model to predict the utility instead of delegating.
PARAMETER | DESCRIPTION
---|---
utility | The utility to learn. Typically, this will be a ModelUtility object encapsulating a machine learning model which requires fitting on each evaluation of the utility. TYPE: UtilityBase
training_budget | Number of utility samples to collect before fitting the given model. TYPE: int
model | A supervised regression model used to learn the utility. TYPE: UtilityModel
Example
from pydvl.valuation import Dataset, DataUtilityLearning, ModelUtility, Sample, SupervisedScorer
from pydvl.valuation.utility.learning import IndicatorUtilityModel
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_iris
import numpy as np

train, test = Dataset.from_sklearn(load_iris())
scorer = SupervisedScorer("accuracy", test, 0, (0, 1))
utility = ModelUtility(LinearRegression(), scorer).with_dataset(train)
utility_model = IndicatorUtilityModel(LinearRegression(), len(train))
dul = DataUtilityLearning(utility, 3, utility_model)
# First 3 calls will be computed normally
for i in range(3):
    _ = dul(Sample(0, np.array([], dtype=int)))
# Subsequent calls will be computed using the fitted utility_model
dul(Sample(0, np.array([1, 2, 3])))
Source code in src/pydvl/valuation/utility/learning.py
training_data
property
¶
training_data: Dataset | None
Retrieves the training data used by this utility.
This property is read-only. In order to set it, use with_dataset().
with_dataset
¶
Returns the utility, or a copy of it, with the given dataset.
PARAMETER | DESCRIPTION
---|---
data | The dataset to use for utility fitting (training data).
copy | Whether to copy the utility object or not. Valuation methods should always make copies to avoid unexpected side effects.
Returns: The utility object.
Source code in src/pydvl/valuation/utility/base.py
point_wise_accuracy
¶
Point-wise accuracy, or 0-1 score between two arrays.
Higher is better.
PARAMETER | DESCRIPTION
---|---
y_true | Array of true values (e.g. labels). TYPE: NDArray[T]
y_pred | Array of estimated values (e.g. model predictions). TYPE: NDArray[T]
RETURNS | DESCRIPTION
---|---
NDArray[T] | Array with point-wise 0-1 accuracy between labels and model predictions.
Source code in src/pydvl/valuation/methods/data_oob.py
neg_l2_distance
¶
Point-wise negative \(l_2\) distance between two arrays.
Higher is better.
PARAMETER | DESCRIPTION
---|---
y_true | Array of true values (e.g. labels). TYPE: NDArray[T]
y_pred | Array of estimated values (e.g. model predictions). TYPE: NDArray[T]
RETURNS | DESCRIPTION
---|---
NDArray[T] | Array with point-wise negative \(l_2\) distances between labels and model predictions.
Source code in src/pydvl/valuation/methods/data_oob.py
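A quick sketch of both point-wise scores on toy arrays; the import paths follow the source locations shown above:

```python
import numpy as np

from pydvl.valuation.methods.data_oob import neg_l2_distance, point_wise_accuracy

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])

print(point_wise_accuracy(y_true, y_pred))  # 1 where prediction matches, else 0
print(neg_l2_distance(y_true, y_pred))      # 0 for exact matches, negative otherwise
```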
compute_n_samples
¶
Compute the minimal sample size with epsilon-delta guarantees.
Based on the formula in Theorem 4 of (Jia, R. et al., 2023)2, which gives a lower bound on the number of samples required to obtain an \((\epsilon/\sqrt{n},\ \delta/(N(N-1)))\)-approximation to all pair-wise differences of Shapley values, w.r.t. the \(\ell_2\) norm.
The updated version refines the lower bound of the original paper. Note that the bound is tighter than earlier versions but might still overestimate the number of samples required.
PARAMETER | DESCRIPTION
---|---
epsilon | The error tolerance. TYPE: float
delta | The confidence level. TYPE: float
n_obs | Number of data points. TYPE: int
RETURNS | DESCRIPTION
---|---
int | The sample size.
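A usage sketch with illustrative parameter values; the import path follows the source location shown below:

```python
from pydvl.valuation.methods.gt_shapley import compute_n_samples

# Number of group-testing samples for a 1000-point dataset, with error
# tolerance 0.1 at confidence level 0.05.
n = compute_n_samples(epsilon=0.1, delta=0.05, n_obs=1000)
print(n)
```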
Source code in src/pydvl/valuation/methods/gt_shapley.py
get_unique_labels
¶
Returns unique labels in a categorical dataset.
PARAMETER | DESCRIPTION
---|---
array | The input array to find unique labels from. It should be of a categorical type such as Object, String, Unicode, Unsigned integer, Signed integer, or Boolean. TYPE: NDArray
RETURNS | DESCRIPTION
---|---
NDArray | An array of unique labels.
RAISES | DESCRIPTION
---|---
ValueError | If the input array is not of a categorical type.
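For example, with the import path following the source location shown below:

```python
import numpy as np

from pydvl.valuation.samplers.classwise import get_unique_labels

# The unique labels of a categorical (here string-typed) array.
print(get_unique_labels(np.array(["cat", "dog", "cat", "bird"])))
```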
Source code in src/pydvl/valuation/samplers/classwise.py
compose_score
¶
compose_score(
scorer: SupervisedScorer,
transformation: Callable[[float], float],
name: str,
) -> SupervisedScorer
Composes a scoring function with an arbitrary scalar transformation.
Useful to squash unbounded scores into ranges manageable by data valuation methods.
Example
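A sketch of the intended use, squashing an unbounded score with a sigmoid. The scorer construction follows earlier examples in this document; the import path for compose_score is assumed to be pydvl.valuation.scorers:

```python
import numpy as np
from sklearn.datasets import load_diabetes

from pydvl.valuation import Dataset, SupervisedScorer
from pydvl.valuation.scorers import compose_score  # assumed import path

_, test = Dataset.from_sklearn(load_diabetes())
# An unbounded scorer: negative mean squared error has range (-inf, 0].
mse = SupervisedScorer("neg_mean_squared_error", test, 0, (-np.inf, 0))

def sigmoid(x: float) -> float:
    return float(1 / (1 + np.exp(-x)))

# Squash into a bounded range manageable by data valuation methods.
squashed = compose_score(mse, sigmoid, "squashed neg MSE")
```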
PARAMETER | DESCRIPTION
---|---
scorer | The object to be composed. TYPE: SupervisedScorer
transformation | A scalar transformation. TYPE: Callable[[float], float]
name | A string representation for the composition, for str(). TYPE: str
RETURNS | DESCRIPTION
---|---
SupervisedScorer | The composite SupervisedScorer.