pydvl.valuation.dataset¶
This module contains convenience classes to handle data and groups thereof.

Value computations with supervised models benefit from a unified interface to handle data. This module provides two classes, Dataset and GroupedDataset, to handle data and labels, as well as feature names and other information. Objects of both types can be used to construct scorers (for the valuation set) and to fit (most) valuation methods. The underlying data arrays can always be accessed (read-only) via Dataset.data(), which returns the tuple (x, y).
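A minimal sketch of this interface (the array shapes and contents are illustrative):

```python
import numpy as np
from pydvl.valuation.dataset import Dataset

x = np.random.rand(10, 3)
y = np.random.randint(0, 2, size=10)
dataset = Dataset(x, y)

raw = dataset.data()  # read-only RawData view on the arrays
assert raw.x.shape == (10, 3)
assert raw.y.shape == (10,)
```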
Logical vs data indices¶

Dataset and GroupedDataset use two different types of indices (see the sketch after this list):

* Logical indices: the indices used to access the elements in the (Grouped)Dataset objects when slicing or indexing them.
* Data indices: the indices used to access the data points in the underlying data arrays, i.e. those used in the RawData object returned by Dataset.data().
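For the base Dataset class both kinds of indices coincide; the distinction only matters for grouped data. A minimal sketch:

```python
import numpy as np
from pydvl.valuation.dataset import Dataset

dataset = Dataset(np.arange(12).reshape(6, 2), np.zeros(6))

# For a plain Dataset, both mappings are simple pass-throughs:
assert np.array_equal(dataset.data_indices([0, 1]), [0, 1])
assert np.array_equal(dataset.logical_indices([0, 1]), [0, 1])
```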
NumPy and PyTorch support¶
It is possible to instantiate Dataset and GroupedDataset with either NumPy arrays or PyTorch tensors. The type of the input arrays is preserved throughout all operations. Indices are always returned as NumPy arrays, even if the input arrays are PyTorch tensors.
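A sketch of this behaviour, assuming PyTorch is installed:

```python
import numpy as np
import torch
from pydvl.valuation.dataset import Dataset

x = torch.rand(8, 2)
y = torch.randint(0, 2, (8,))
dataset = Dataset(x, y)

assert isinstance(dataset.data().x, torch.Tensor)     # input type preserved
assert isinstance(dataset[0].data().x, torch.Tensor)  # also after slicing
assert isinstance(dataset.indices, np.ndarray)        # indices are always NumPy
```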
Memmapping data¶
When using NumPy arrays for data (and only in that case), it is possible to use memory mapping to avoid loading the whole dataset into memory, e.g. in subprocesses. This is done by passing mmap=True to the constructor. The data will be stored in a temporary file, which will be deleted automatically when the creating object is garbage collected.

If you pass NumPy arrays to the constructor, a temporary file will be created, the data dumped to it, and the file opened again as a memory map. You must set mmap=True even if the data is already memory-mapped; in this case no additional shape-consistency checks are performed, since these might result in new copies. This means that you might want to call check_X_y() manually before constructing the Dataset.
See Working with large datasets.
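A sketch of the intended usage (array sizes are illustrative):

```python
import numpy as np
from pydvl.valuation.dataset import Dataset

x = np.random.rand(100_000, 10)
y = np.random.randint(0, 2, size=100_000)

# The arrays are dumped to a temporary file and reopened as memory maps.
# The file is deleted when `dataset` is garbage collected.
dataset = Dataset(x, y, mmap=True)
```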
Slicing¶
Slicing a Dataset object, e.g. dataset[0], will return a new Dataset with the data corresponding to that slice. Note however that the contents of the new object, i.e. dataset[0].data().x, may not be the same as dataset.data().x[0], which is the first point in the original data array. This is in particular true for GroupedDatasets, where one "logical" index may correspond to multiple data points.

Slicing with None, i.e. dataset[None], will return a copy of the whole dataset.
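A sketch illustrating these rules (for the base class the contents coincide, up to an extra leading dimension):

```python
import numpy as np
from pydvl.valuation.dataset import Dataset

dataset = Dataset(np.arange(10).reshape(5, 2), np.arange(5))

first = dataset[0]  # a new Dataset, not raw data
assert isinstance(first, Dataset)

everything = dataset[None]  # a copy of the whole dataset
assert len(everything.data().x) == len(dataset.data().x)
```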
Grouped datasets and logical indices¶
As mentioned above, it is also possible to group data points together with GroupedDataset. In order to handle groups correctly, Datasets map "logical" indices to "data" indices and vice versa. The latter correspond to indices in the data arrays themselves, while the former may map to groups of data points.
A call to GroupedDataset.data(indices) will return the data and labels of all samples for the given groups. But grouped_data[0] will return the data and labels of the first group, not the first data point, and will therefore in general differ from grouped_data.data([0]).
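A sketch of this distinction (the group assignment is illustrative):

```python
import numpy as np
from pydvl.valuation.dataset import GroupedDataset

x = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 1, 1, 0, 1])
grouped = GroupedDataset(x, y, data_groups=[0, 0, 0, 1, 1, 1])

# Logical index 0 (the first group) maps to three data points:
assert np.array_equal(grouped.data_indices([0]), [0, 1, 2])
# grouped.data([0]) therefore returns three points, not one:
assert grouped.data([0]).x.shape == (3, 2)
```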
Grouping data can be useful to reduce computation time, e.g. for Shapley-based methods, or to look at the importance of certain feature sets for the model.
Tip
It is important to keep in mind the distinction between logical and data indices for valuation methods that require computation on individual data points, like KNNShapleyValuation or DataOOBValuation. In these cases, the logical indices are used to compute the Shapley values, while the data indices are used internally by the method.
New in version 0.11.0
The Dataset and GroupedDataset classes now work with both NumPy arrays and PyTorch tensors, preserving the input type throughout operations.
Dataset¶
Dataset(
x: ArrayT,
y: ArrayT,
feature_names: Sequence[str] | NDArray[str_] | None = None,
target_names: Sequence[str] | NDArray[str_] | None = None,
data_names: Sequence[str] | NDArray[str_] | None = None,
description: str | None = None,
multi_output: bool = False,
mmap: bool = False,
)
Bases: Generic[ArrayT]
A convenience class to handle datasets.
It holds a dataset, together with info on feature names, target names, and data names. It is used to pass data around to valuation methods.
The underlying data arrays can be accessed via Dataset.data(), which returns the tuple (x, y) as a read-only RawData object. The data can be accessed by indexing the object directly, e.g. dataset[0] will return the data point corresponding to index 0 in dataset. For this base class, this is the same as dataset.data([0]), which is the first point in the data array, but derived classes can behave differently.
| PARAMETER | DESCRIPTION |
|---|---|
| x | Training data (NumPy array or PyTorch tensor). |
| y | Labels for training data (same type as x). |
| feature_names | Names of the features of x data. |
| target_names | Names of the features of y data. |
| data_names | Names assigned to data points. For example, if the dataset is a time series, each entry can be a timestamp which can be referenced directly instead of using a row number. |
| description | A textual description of the dataset. |
| multi_output | Set to True if y holds vector-valued labels (a 2D array), False (the default) for scalar labels. |
| mmap | Whether to use memory-mapped files (NumPy arrays only). If you pass arrays that are already memory-mapped, no shape-consistency checks are performed (see Memmapping data above). |

Parameter types are given in the signature above.
Type preservation

The Dataset class preserves the type of input arrays (NumPy or PyTorch) throughout all operations. When you access data with slicing or the data() method, the arrays maintain their original type.
Changed in version 0.10.0

No longer holds split data, but only x, y. Slicing now returns a new Dataset object, not raw data.
New in version 0.11.0
Added support for PyTorch tensors. Added memory mapping support for numpy arrays.
Source code in src/pydvl/valuation/dataset.py
indices property¶

Indices of positions in the data arrays: contiguous integers from 0 to len(dataset).
names property¶

Names of each individual data point. Used for reporting Shapley values.
__getstate__¶

Prepare the object state for pickling, replacing memory-mapped arrays with their file paths.
Source code in src/pydvl/valuation/dataset.py
__setstate__¶
Restore the object state from pickling.
Source code in src/pydvl/valuation/dataset.py
data¶

Given a set of indices, returns the training data corresponding to those indices, as a read-only tuple-like structure.
This is used mainly by subclasses of UtilityBase to retrieve subsets of the data from indices.
| PARAMETER | DESCRIPTION |
|---|---|
| indices | Optional indices that will be used to select points from the training data. If None, the whole dataset is returned. |

| RETURNS | DESCRIPTION |
|---|---|
| RawData | If indices is not None, the selected x and y arrays as a read-only RawData object; otherwise the whole dataset. |
Source code in src/pydvl/valuation/dataset.py
data_indices¶

Returns a subset of indices.

This is equivalent to using Dataset.indices[logical_indices] but allows subclasses to define special behaviour, e.g. when indices in Dataset do not match the indices in the data.

For Dataset, this is a simple pass-through.

| PARAMETER | DESCRIPTION |
|---|---|
| indices | A set of indices held by this object. |

| RETURNS | DESCRIPTION |
|---|---|
| NDArray[int_] | The indices of the data points in the data array. |
Source code in src/pydvl/valuation/dataset.py
feature¶
Returns a slice for the feature with the given name.
Source code in src/pydvl/valuation/dataset.py
from_arrays classmethod¶
from_arrays(
X: ArrayT,
y: ArrayT,
train_size: float = 0.8,
random_state: int | None = None,
stratify_by_target: bool = False,
**kwargs: Any,
) -> tuple[Dataset, Dataset]
Constructs a Dataset object from X and y arrays, as returned by the make_* functions in scikit-learn generated datasets.
Example
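A possible usage sketch (the generator and split parameters are illustrative):

>>> from sklearn.datasets import make_classification
>>> from pydvl.valuation.dataset import Dataset
>>> X, y = make_classification(n_samples=100, n_features=4, random_state=0)
>>> train, test = Dataset.from_arrays(X, y, train_size=0.8, random_state=0)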
| PARAMETER | DESCRIPTION |
|---|---|
| X | Array of shape (n_samples, n_features). Either a NumPy array or a PyTorch tensor. |
| y | Array of shape (n_samples,). Must be of the same type as X. |
| train_size | Size of the training dataset. Used in train_test_split. |
| random_state | Seed for train / test split. |
| stratify_by_target | If True, data is split in a stratified fashion, using the target variable as labels. Read more in scikit-learn's user guide (stratification). |
| kwargs | Additional keyword arguments to pass to the Dataset constructor. Use this to pass e.g. multi_output. |
| RETURNS | DESCRIPTION |
|---|---|
| tuple[Dataset, Dataset] | Objects with the passed X and y arrays split across training and test sets. |
New in version 0.4.0
Changed in version 0.6.0
Added kwargs to pass to the Dataset constructor.
Changed in version 0.10.0
Returns a tuple of two Dataset objects.
New in version 0.11.0
Added support for PyTorch tensors.
Source code in src/pydvl/valuation/dataset.py
from_sklearn classmethod¶
from_sklearn(
data: Bunch,
train_size: int | float = 0.8,
random_state: int | None = None,
stratify_by_target: bool = False,
**kwargs,
) -> tuple[Dataset, Dataset]
Constructs two Dataset objects from a sklearn.utils.Bunch, as returned by the load_* functions in scikit-learn toy datasets.
Example
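A possible usage sketch, using a toy dataset as an illustration:

>>> from sklearn.datasets import load_iris
>>> from pydvl.valuation.dataset import Dataset
>>> train, test = Dataset.from_sklearn(load_iris(), train_size=0.8, random_state=0)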
| PARAMETER | DESCRIPTION |
|---|---|
| data | scikit-learn Bunch object. The following attributes are supported: data (covariates), target (labels), feature_names (optional), target_names (optional), DESCR (optional description). |
| train_size | Size of the training dataset. Used in train_test_split. The value is automatically set to the complement of the test size. |
| random_state | Seed for train / test split. |
| stratify_by_target | If True, data is split in a stratified fashion, using the target variable as labels. Read more in scikit-learn's user guide (stratification). |
| kwargs | Additional keyword arguments to pass to the Dataset constructor. Use this to pass e.g. multi_output. |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[Dataset, Dataset] | Objects with the sklearn dataset split across training and test sets. |
Changed in version 0.6.0
Added kwargs to pass to the Dataset constructor.
Changed in version 0.10.0
Returns a tuple of two Dataset objects.
Source code in src/pydvl/valuation/dataset.py
logical_indices¶

Returns the indices in this Dataset for the given indices in the data array.

This is equivalent to using Dataset.indices[data_indices] but allows subclasses to define special behaviour, e.g. when indices in Dataset do not match the indices in the data.
| PARAMETER | DESCRIPTION |
|---|---|
| indices | A set of indices in the data array. |

| RETURNS | DESCRIPTION |
|---|---|
| NDArray[int_] | The abstract indices for the given data indices. |
Source code in src/pydvl/valuation/dataset.py
target¶
Returns a slice or index for the target with the given name.
If targets are multidimensional (2D array), returns a tuple (slice(None), target_idx). If targets are 1D, returns just a slice(None).
| PARAMETER | DESCRIPTION |
|---|---|
| name | The name of the target to retrieve. |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[slice, int] or slice | For multi-output targets: a tuple (slice(None), target_idx). For single-output targets: target_idx (usually 0). |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If the target name is not found. |
Source code in src/pydvl/valuation/dataset.py
GroupedDataset¶
GroupedDataset(
x: ArrayT,
y: ArrayT,
data_groups: Sequence[int] | NDArray[int_],
feature_names: Sequence[str] | NDArray[str_] | None = None,
target_names: Sequence[str] | NDArray[str_] | None = None,
data_names: Sequence[str] | NDArray[str_] | None = None,
group_names: Sequence[str] | NDArray[str_] | None = None,
description: str | None = None,
**kwargs: Any,
)
Bases: Dataset[ArrayT]
Class for grouping datasets.
Used for calculating values of subsets of the data considered as logical units. For instance, one can group by value of a categorical feature, by bin into which a continuous feature falls, or by label.
| PARAMETER | DESCRIPTION |
|---|---|
| x | Training data (NumPy array or PyTorch tensor). |
| y | Labels of training data (same type as x). |
| data_groups | Sequence of the same length as x, holding the group index or name for each data point. |
| feature_names | Names of the covariates' features. |
| target_names | Names of the labels or targets y. |
| data_names | Names of the data points. For example, if the dataset is a time series, each entry can be a timestamp. |
| group_names | Names of the groups. If not provided, the numerical group ids from data_groups are used. |
| description | A textual description of the dataset. |
| kwargs | Additional keyword arguments to pass to the Dataset constructor. |
Type Preservation with Groups

GroupedDataset preserves the type of input arrays (NumPy or PyTorch) while maintaining all indices as NumPy arrays. When accessing grouped data with methods like data(), the x and y arrays will maintain their original type, but indices are always returned as NumPy arrays.
Changed in version 0.6.0
Added group_names and forwarding of kwargs.
Changed in version 0.10.0
No longer holds split data, but only x, y and group information. Added methods to retrieve indices for groups and vice versa.
New in version 0.11.0
Added support for PyTorch tensors.
Source code in src/pydvl/valuation/dataset.py
__getstate__¶

Prepare the object state for pickling, replacing memory-mapped arrays with their file paths.
Source code in src/pydvl/valuation/dataset.py
__setstate__¶
Restore the object state from pickling.
Source code in src/pydvl/valuation/dataset.py
data¶
Returns the data and labels of all samples in the given groups.
| PARAMETER | DESCRIPTION |
|---|---|
| indices | Group indices whose elements to return. If None, the data for all groups is returned. |

| RETURNS | DESCRIPTION |
|---|---|
| RawData | Tuple of training data x and labels y for the given groups. |
Source code in src/pydvl/valuation/dataset.py
data_indices¶
data_indices(
indices: int | slice | Sequence[int] | NDArray[int_] | None = None,
) -> NDArray[int_]
Returns the indices of the samples in the given groups.
| PARAMETER | DESCRIPTION |
|---|---|
| indices | Group indices whose elements to return. If None, indices of the samples in all groups are returned. |

| RETURNS | DESCRIPTION |
|---|---|
| NDArray[int_] | Indices of the samples in the given groups. |
Source code in src/pydvl/valuation/dataset.py
feature¶
Returns a slice for the feature with the given name.
Source code in src/pydvl/valuation/dataset.py
from_arrays classmethod¶
from_arrays(
X: ArrayT,
y: ArrayT,
train_size: float = 0.8,
random_state: int | None = None,
stratify_by_target: bool = False,
**kwargs,
) -> tuple[GroupedDataset, GroupedDataset]
from_arrays(
X: ArrayT,
y: ArrayT,
train_size: float = 0.8,
random_state: int | None = None,
stratify_by_target: bool = False,
data_groups: Sequence[int] | None = None,
**kwargs,
) -> tuple[GroupedDataset, GroupedDataset]
from_arrays(
X: ArrayT,
y: ArrayT,
train_size: float = 0.8,
random_state: int | None = None,
stratify_by_target: bool = False,
data_groups: Sequence[int] | None = None,
**kwargs: Any,
) -> tuple[GroupedDataset, GroupedDataset]
Constructs a pair of GroupedDataset objects from X and y numpy arrays, as returned by the make_* functions in scikit-learn generated datasets.
Example
>>> from sklearn.datasets import make_classification
>>> from pydvl.valuation.dataset import GroupedDataset
>>> X, y = make_classification(
... n_samples=100,
... n_features=4,
... n_informative=2,
... n_redundant=0,
... random_state=0,
... shuffle=False
... )
>>> data_groups = (X[:, 0] // 0.5).astype(int)  # integer group ids
>>> train, test = GroupedDataset.from_arrays(X, y, data_groups=data_groups)
| PARAMETER | DESCRIPTION |
|---|---|
| X | Array of shape (n_samples, n_features). |
| y | Array of shape (n_samples,). |
| train_size | Size of the training dataset. Used in train_test_split. |
| random_state | Seed for train / test split. |
| stratify_by_target | If True, data is split in a stratified fashion, using the target variable as labels. Read more in scikit-learn's user guide (stratification). |
| data_groups | An array holding the group index or name for each data point. The length of this array must be equal to the number of data points in the dataset. |
| kwargs | Additional keyword arguments that will be passed to the GroupedDataset constructor. |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[GroupedDataset, GroupedDataset] | Datasets with the passed X and y arrays split across training and test sets. |
New in version 0.4.0
Changed in version 0.6.0
Added kwargs to pass to the GroupedDataset constructor.
Changed in version 0.10.0
Returns a tuple of two GroupedDataset objects.
Source code in src/pydvl/valuation/dataset.py
from_dataset classmethod¶
from_dataset(
data: Dataset,
data_groups: Sequence[int] | NDArray[int_],
group_names: Sequence[str] | NDArray[str_] | None = None,
**kwargs: Any,
) -> GroupedDataset
Creates a GroupedDataset object from a Dataset object and a mapping of data groups.
Example
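A possible usage sketch, grouping an existing dataset by its labels (the grouping choice is illustrative):

>>> import numpy as np
>>> from pydvl.valuation.dataset import Dataset, GroupedDataset
>>> x, y = np.random.rand(100, 4), np.random.randint(0, 2, 100)
>>> train, test = Dataset.from_arrays(x, y, random_state=0)
>>> grouped = GroupedDataset.from_dataset(train, data_groups=train.data().y)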
| PARAMETER | DESCRIPTION |
|---|---|
| data | The original data. |
| data_groups | An array holding the group index or name for each data point. The length of this array must be equal to the number of data points in the dataset. |
| group_names | Names of the groups. If not provided, the numerical group ids from data_groups are used. |
| kwargs | Additional arguments to be passed to the GroupedDataset constructor. |

| RETURNS | DESCRIPTION |
|---|---|
| GroupedDataset | A GroupedDataset with the initial Dataset grouped by data_groups. |
Source code in src/pydvl/valuation/dataset.py
from_sklearn classmethod¶
from_sklearn(
data: Bunch,
train_size: float = 0.8,
random_state: int | None = None,
stratify_by_target: bool = False,
**kwargs,
) -> tuple[GroupedDataset, GroupedDataset]
from_sklearn(
data: Bunch,
train_size: float = 0.8,
random_state: int | None = None,
stratify_by_target: bool = False,
data_groups: Sequence[int] | None = None,
**kwargs,
) -> tuple[GroupedDataset, GroupedDataset]
from_sklearn(
data: Bunch,
train_size: int | float = 0.8,
random_state: int | None = None,
stratify_by_target: bool = False,
data_groups: Sequence[int] | None = None,
**kwargs: dict[str, Any],
) -> tuple[GroupedDataset, GroupedDataset]
Constructs a pair of GroupedDataset objects from a sklearn.utils.Bunch, as returned by the load_* functions in scikit-learn toy datasets, and groups the data.
Example
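A possible usage sketch, grouping by class label (illustrative):

>>> from sklearn.datasets import load_iris
>>> from pydvl.valuation.dataset import GroupedDataset
>>> bunch = load_iris()
>>> train, test = GroupedDataset.from_sklearn(bunch, data_groups=bunch.target.tolist())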
| PARAMETER | DESCRIPTION |
|---|---|
| data | scikit-learn Bunch object. The following attributes are supported: data (covariates), target (labels), feature_names (optional), target_names (optional), DESCR (optional description). |
| train_size | Size of the training dataset. Used in train_test_split. |
| random_state | Seed for train / test split. |
| stratify_by_target | If True, data is split in a stratified fashion, using the target variable as labels. Read more in scikit-learn's user guide (stratification). |
| data_groups | An array holding the group index or name for each data point. The length of this array must be equal to the number of data points in the dataset. |
| kwargs | Additional keyword arguments to pass to the Dataset constructor. |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[GroupedDataset, GroupedDataset] | Datasets with the selected sklearn data split across training and test sets. |
Changed in version 0.10.0
Returns a tuple of two GroupedDataset objects.
Source code in src/pydvl/valuation/dataset.py
logical_indices¶
Returns the group indices for the given data indices.
| PARAMETER | DESCRIPTION |
|---|---|
| indices | Indices of the data points in the data array. If None, all data points are considered. |

| RETURNS | DESCRIPTION |
|---|---|
| NDArray[int_] | Group indices for the given data indices. |
Source code in src/pydvl/valuation/dataset.py
target¶
Returns a slice or index for the target with the given name.
If targets are multidimensional (2D array), returns a tuple (slice(None), target_idx). If targets are 1D, returns just a slice(None).
| PARAMETER | DESCRIPTION |
|---|---|
| name | The name of the target to retrieve. |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[slice, int] or slice | For multi-output targets: a tuple (slice(None), target_idx). For single-output targets: target_idx (usually 0). |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If the target name is not found. |
Source code in src/pydvl/valuation/dataset.py
RawData dataclass¶
Bases: Generic[ArrayT]
A view on a dataset's raw data. This is not a copy.
RawData is a generic container that preserves the type of the arrays it holds. It supports both NumPy arrays and PyTorch tensors, but both x and y must be of the same type.
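A sketch of typical access patterns, assuming the tuple-like behaviour described above:

```python
import numpy as np
from pydvl.valuation.dataset import Dataset

dataset = Dataset(np.random.rand(4, 2), np.zeros(4))
raw = dataset.data()

# Attribute access:
assert raw.x.shape == (4, 2)
# Tuple-like unpacking (RawData is described as a read-only (x, y) tuple):
x, y = raw
assert x.shape == (4, 2) and y.shape == (4,)
```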