pydvl.utils.dataset
This module contains convenience classes to handle data and groups thereof.
Shapley and Least Core value computations require evaluation of a scoring function (the utility). This is typically the performance of the model on a test set (as an approximation to its true expected performance). It is therefore convenient to keep both the training data and the test data together to be passed around to methods in shapley and least_core. This is done with Dataset.
This abstraction layer also supports grouping data points together when one is interested in computing their value as a group; see GroupedDataset.
Objects of both types are used to construct a Utility object.
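To make the role of the utility concrete, here is a minimal sketch in plain NumPy. It is not pyDVL's own code: the toy data, the nearest-centroid model and the function name `utility` are all assumptions made for illustration. It only shows why the training and test data are kept together: every subset of training points is scored against the same fixed test set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two well-separated Gaussian blobs, one per class.
x = np.concatenate([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])

# A fixed train/test split, stored together so every evaluation uses the same test set.
perm = rng.permutation(len(x))
x_train, y_train = x[perm[:80]], y[perm[:80]]
x_test, y_test = x[perm[80:]], y[perm[80:]]


def utility(indices):
    """Hypothetical utility: test accuracy of a nearest-centroid classifier
    fitted only on the training points selected by `indices`."""
    if len(indices) == 0:
        return 0.0  # convention chosen for this sketch
    xs, ys = x_train[indices], y_train[indices]
    classes = np.unique(ys)
    centroids = np.stack([xs[ys == c].mean(axis=0) for c in classes])
    # Distance of every test point to every class centroid.
    dists = np.linalg.norm(x_test[:, None, :] - centroids[None, :, :], axis=2)
    predictions = classes[dists.argmin(axis=1)]
    return float(np.mean(predictions == y_test))


full_score = utility(np.arange(len(x_train)))  # all training points
subset_score = utility(np.arange(10))          # only the first ten
```

Valuation methods call such a function many times with different subsets of training indices, which is why a single object holding both splits is convenient.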
Dataset
Dataset(
x_train: Union[NDArray, DataFrame],
y_train: Union[NDArray, DataFrame],
x_test: Union[NDArray, DataFrame],
y_test: Union[NDArray, DataFrame],
feature_names: Optional[Sequence[str]] = None,
target_names: Optional[Sequence[str]] = None,
data_names: Optional[Sequence[str]] = None,
description: Optional[str] = None,
is_multi_output: bool = False,
)
A convenience class to handle datasets.
It holds a dataset, split into training and test data, together with feature names, target names, data point names and a description.
PARAMETER | DESCRIPTION |
---|---|
x_train |
training data |
y_train |
labels for training data |
x_test |
test data |
y_test |
labels for test data |
feature_names |
names of the features of the input data |
target_names |
names of the features of target data |
data_names |
names assigned to data points. For example, if the dataset is a time series, each entry can be a timestamp which can be referenced directly instead of using a row number. |
description |
A textual description of the dataset. |
is_multi_output |
set to
TYPE:
|
Source code in src/pydvl/utils/dataset.py
indices
property
Index of positions in data.x_train.
Contiguous integers from 0 to len(Dataset).
data_names
property
Names of each individual datapoint.
Used for reporting Shapley values.
get_training_data
Given a set of indices, returns the training data that refer to those indices.
This is used mainly by Utility to retrieve subsets of the data from indices. It is typically not needed in algorithms.
PARAMETER | DESCRIPTION |
---|---|
indices |
Optional indices that will be used to select points from
the training data. If |
RETURNS | DESCRIPTION |
---|---|
Tuple[NDArray, NDArray]
|
If |
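The index-based selection described for get_training_data can be pictured with plain NumPy arrays. The helper `select_training_subset` below is hypothetical and is not pyDVL's implementation; in particular, returning all data when `indices` is `None` is an assumption of this sketch.

```python
import numpy as np

# Ten training points with two features each, and alternating labels.
x_train = np.arange(20, dtype=float).reshape(10, 2)
y_train = np.arange(10) % 2


def select_training_subset(x, y, indices=None):
    """Hypothetical stand-in: return the rows of x and y at the given positions."""
    if indices is None:
        # Assumption of this sketch: no indices means "everything".
        return x, y
    idx = np.asarray(list(indices), dtype=int)
    return x[idx], y[idx]


x_sub, y_sub = select_training_subset(x_train, y_train, [0, 3, 7])
```

Here `x_sub` holds rows 0, 3 and 7 of `x_train`, and `y_sub` the matching labels.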
get_test_data
Returns the entire test set regardless of the passed indices.
The passed indices will not be used because for data valuation we generally want to score the trained model on the entire test data.
Additionally, the way this method is used in the Utility class, the passed indices will be those of the training data and would not work on the test data.
There may be cases where it is desired to use parts of the test data. In those cases, it is recommended to inherit from Dataset and override get_test_data().
For example, the following snippet shows how one could go about mapping the training data indices into test data indices inside get_test_data():
Example
>>> from pydvl.utils import Dataset
>>> import numpy as np
>>> class DatasetWithTestDataIndices(Dataset):
...     def get_test_data(self, indices=None):
...         if indices is None:
...             return self.x_test, self.y_test
...         # Rescale training-data indices into the range of test-data indices
...         mapped_indices = len(self.x_test) / len(self) * np.asarray(indices)
...         mapped_indices = np.unique(mapped_indices.astype(int))
...         return self.x_test[mapped_indices], self.y_test[mapped_indices]
...
>>> X = np.random.rand(100, 10)
>>> y = np.random.randint(0, 2, 100)
>>> dataset = DatasetWithTestDataIndices.from_arrays(X, y)
>>> indices = np.random.choice(dataset.indices, 30, replace=False)
>>> _ = dataset.get_training_data(indices)
>>> _ = dataset.get_test_data(indices)
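The index mapping in the snippet above can be checked by hand with small numbers. The following sketch uses only NumPy and assumes 80 training points and 20 test points (the sizes that a 0.8 training fraction of 100 samples would give); the variable names are illustrative.

```python
import numpy as np

n_train, n_test = 80, 20
train_indices = np.array([0, 3, 4, 8, 40, 79])

# Scale by n_test / n_train = 0.25: [0.0, 0.75, 1.0, 2.0, 10.0, 19.75]
scaled = n_test / n_train * train_indices

# Truncate to integers and drop duplicates: [0, 0, 1, 2, 10, 19] -> [0, 1, 2, 10, 19]
mapped = np.unique(scaled.astype(int))
```

Several training indices can collapse onto the same test index, so the mapped set is in general smaller than the set of indices passed in.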
PARAMETER | DESCRIPTION |
---|---|
indices |
Optional indices into the test data. This argument is unused and is kept only for compatibility with get_training_data(). |
RETURNS | DESCRIPTION |
---|---|
Tuple[NDArray, NDArray]
|
The entire test data. |
from_sklearn
classmethod
from_sklearn(
data: Bunch,
train_size: float = 0.8,
random_state: Optional[int] = None,
stratify_by_target: bool = False,
**kwargs
) -> Dataset
Constructs a Dataset object from a sklearn.utils.Bunch, as returned by the load_* functions in scikit-learn toy datasets.
Example
PARAMETER | DESCRIPTION |
---|---|
data |
scikit-learn Bunch object. The following attributes are supported:
TYPE:
|
train_size |
size of the training dataset. Used in
TYPE:
|
random_state |
seed for train / test split |
stratify_by_target |
If
TYPE:
|
kwargs |
Additional keyword arguments to pass to the
Dataset constructor. Use this to pass e.g.
DEFAULT:
|
RETURNS | DESCRIPTION |
---|---|
Dataset
|
Object with the sklearn dataset |
Changed in version 0.6.0
Added kwargs to pass to the Dataset constructor.
from_arrays
classmethod
from_arrays(
X: NDArray,
y: NDArray,
train_size: float = 0.8,
random_state: Optional[int] = None,
stratify_by_target: bool = False,
**kwargs
) -> Dataset
Constructs a Dataset object from X and y numpy arrays as returned by the make_* functions in sklearn generated datasets.
Example
PARAMETER | DESCRIPTION |
---|---|
X |
numpy array of shape (n_samples, n_features)
TYPE:
|
y |
numpy array of shape (n_samples,)
TYPE:
|
train_size |
size of the training dataset. Used in
TYPE:
|
random_state |
seed for train / test split |
stratify_by_target |
If
TYPE:
|
kwargs |
Additional keyword arguments to pass to the
Dataset constructor. Use this to pass e.g.
DEFAULT:
|
RETURNS | DESCRIPTION |
---|---|
Dataset
|
Object with the passed X and y arrays split across training and test sets. |
New in version 0.4.0
Changed in version 0.6.0
Added kwargs to pass to the Dataset constructor.
GroupedDataset
GroupedDataset(
x_train: NDArray,
y_train: NDArray,
x_test: NDArray,
y_test: NDArray,
data_groups: Sequence,
feature_names: Optional[Sequence[str]] = None,
target_names: Optional[Sequence[str]] = None,
group_names: Optional[Sequence[str]] = None,
description: Optional[str] = None,
**kwargs
)
Bases: Dataset
Used for calculating Shapley values of subsets of the data considered as logical units. For instance, one can group by value of a categorical feature, by bin into which a continuous feature falls, or by label.
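The three grouping strategies mentioned above can each be expressed as an array with one group label per data point. The sketch below builds such arrays with NumPy only; the feature names, values and bin edges are made up for illustration, and how the result is passed on (for instance as `data_groups`) is left to the constructors documented on this page.

```python
import numpy as np

# Six hypothetical data points: one continuous feature, one categorical, one label.
age = np.array([22.0, 35.0, 47.0, 51.0, 68.0, 29.0])
colour = np.array(["red", "blue", "red", "green", "blue", "green"])
y = np.array([0, 1, 1, 0, 1, 0])

# 1. Group by the value of a categorical feature.
#    np.unique sorts the categories (blue=0, green=1, red=2).
categories, groups_by_category = np.unique(colour, return_inverse=True)

# 2. Group by the bin into which a continuous feature falls.
#    Bins: <30, [30, 50), [50, 65), >=65.
bin_edges = np.array([30, 50, 65])
groups_by_bin = np.digitize(age, bin_edges)

# 3. Group by label.
groups_by_label = y.copy()
```

Each resulting array has the same length as the data, which is what the `data_groups` descriptions on this page require.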
PARAMETER | DESCRIPTION |
---|---|
x_train |
training data
TYPE:
|
y_train |
labels of training data
TYPE:
|
x_test |
test data
TYPE:
|
y_test |
labels of test data
TYPE:
|
data_groups |
Iterable of the same length as
TYPE:
|
feature_names |
names of the covariates' features. |
target_names |
names of the labels or targets y |
group_names |
names of the groups. If not provided, the labels
from |
description |
A textual description of the dataset |
kwargs |
Additional keyword arguments to pass to the Dataset constructor.
DEFAULT:
|
Changed in version 0.6.0
Added group_names and forwarding of kwargs.
get_test_data
Returns the entire test set regardless of the passed indices.
The passed indices will not be used because for data valuation we generally want to score the trained model on the entire test data.
Additionally, the way this method is used in the Utility class, the passed indices will be those of the training data and would not work on the test data.
There may be cases where it is desired to use parts of the test data. In those cases, it is recommended to inherit from Dataset and override get_test_data().
For example, the following snippet shows how one could go about mapping the training data indices into test data indices inside get_test_data():
Example
>>> from pydvl.utils import Dataset
>>> import numpy as np
>>> class DatasetWithTestDataIndices(Dataset):
...     def get_test_data(self, indices=None):
...         if indices is None:
...             return self.x_test, self.y_test
...         # Rescale training-data indices into the range of test-data indices
...         mapped_indices = len(self.x_test) / len(self) * np.asarray(indices)
...         mapped_indices = np.unique(mapped_indices.astype(int))
...         return self.x_test[mapped_indices], self.y_test[mapped_indices]
...
>>> X = np.random.rand(100, 10)
>>> y = np.random.randint(0, 2, 100)
>>> dataset = DatasetWithTestDataIndices.from_arrays(X, y)
>>> indices = np.random.choice(dataset.indices, 30, replace=False)
>>> _ = dataset.get_training_data(indices)
>>> _ = dataset.get_test_data(indices)
PARAMETER | DESCRIPTION |
---|---|
indices |
Optional indices into the test data. This argument is unused and is kept only for compatibility with get_training_data(). |
RETURNS | DESCRIPTION |
---|---|
Tuple[NDArray, NDArray]
|
The entire test data. |
get_training_data
Returns the data and labels of all samples in the given groups.
PARAMETER | DESCRIPTION |
---|---|
indices |
group indices whose elements to return. If |
RETURNS | DESCRIPTION |
---|---|
Tuple[NDArray, NDArray]
|
Tuple of training data x and labels y. |
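One way to picture "all samples in the given groups" is a lookup from group index to the positions of its members. The sketch below is a NumPy-only illustration, not pyDVL's implementation; `group_to_points` and `get_group_training_data` are hypothetical names, and numbering groups in sorted order of their labels is an assumption.

```python
import numpy as np

x_train = np.arange(12, dtype=float).reshape(6, 2)
y_train = np.array([0, 1, 1, 0, 1, 0])
data_groups = np.array(["a", "b", "a", "c", "b", "a"])

# Number the groups in sorted order of their labels: a=0, b=1, c=2 (assumption).
group_names = np.unique(data_groups)
group_to_points = {
    i: np.flatnonzero(data_groups == name) for i, name in enumerate(group_names)
}


def get_group_training_data(group_indices):
    """Hypothetical helper: data and labels of every point in the given groups."""
    if len(group_indices) == 0:
        return x_train[:0], y_train[:0]
    point_idx = np.concatenate([group_to_points[g] for g in group_indices])
    return x_train[point_idx], y_train[point_idx]


# Groups 0 ("a") and 2 ("c") cover the points at positions 0, 2, 5 and 3.
x_sub, y_sub = get_group_training_data([0, 2])
```

With this view, a valuation algorithm works with a handful of group indices while the model is still trained on individual data points.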
from_sklearn
classmethod
from_sklearn(
data: Bunch,
train_size: float = 0.8,
random_state: Optional[int] = None,
stratify_by_target: bool = False,
data_groups: Optional[Sequence] = None,
**kwargs
) -> GroupedDataset
Constructs a GroupedDataset object from a sklearn.utils.Bunch, as returned by the load_* functions in scikit-learn toy datasets, and groups it.
Example
PARAMETER | DESCRIPTION |
---|---|
data |
scikit-learn Bunch object. The following attributes are supported:
TYPE:
|
train_size |
size of the training dataset. Used in
TYPE:
|
random_state |
seed for train / test split. |
stratify_by_target |
If
TYPE:
|
data_groups |
an array holding the group index or name for each data point. The length of this array must be equal to the number of data points in the dataset. |
kwargs |
Additional keyword arguments to pass to the Dataset constructor.
DEFAULT:
|
RETURNS | DESCRIPTION |
---|---|
GroupedDataset
|
Dataset with the selected sklearn data |
from_arrays
classmethod
from_arrays(
X: NDArray,
y: NDArray,
train_size: float = 0.8,
random_state: Optional[int] = None,
stratify_by_target: bool = False,
data_groups: Optional[Sequence] = None,
**kwargs
) -> Dataset
Constructs a GroupedDataset object from X and y numpy arrays as returned by the make_* functions in scikit-learn generated datasets.
Example
>>> from sklearn.datasets import make_classification
>>> from pydvl.utils import GroupedDataset
>>> X, y = make_classification(
... n_samples=100,
... n_features=4,
... n_informative=2,
... n_redundant=0,
... random_state=0,
... shuffle=False
... )
>>> data_groups = X[:, 0] // 0.5
>>> dataset = GroupedDataset.from_arrays(X, y, data_groups=data_groups)
PARAMETER | DESCRIPTION |
---|---|
X |
array of shape (n_samples, n_features)
TYPE:
|
y |
array of shape (n_samples,)
TYPE:
|
train_size |
size of the training dataset. Used in
TYPE:
|
random_state |
seed for train / test split. |
stratify_by_target |
If
TYPE:
|
data_groups |
an array holding the group index or name for each data point. The length of this array must be equal to the number of data points in the dataset. |
kwargs |
Additional keyword arguments that will be passed to the Dataset constructor.
DEFAULT:
|
RETURNS | DESCRIPTION |
---|---|
Dataset
|
Dataset with the passed X and y arrays split across training and test sets. |
New in version 0.4.0
Changed in version 0.6.0
Added kwargs to pass to the Dataset constructor.
from_dataset
classmethod
from_dataset(dataset: Dataset, data_groups: Sequence[Any]) -> GroupedDataset
Creates a GroupedDataset object from the data of a Dataset object and a mapping of data groups.
Example
PARAMETER | DESCRIPTION |
---|---|
dataset |
The original data.
TYPE:
|
data_groups |
An array holding the group index or name for each data point. The length of this array must be equal to the number of data points in the dataset. |
RETURNS | DESCRIPTION |
---|---|
GroupedDataset
|
A GroupedDataset with the initial Dataset grouped by data_groups. |