pydvl.valuation.dataset
This module contains convenience classes to handle data and groups thereof.
Model-based value computations require evaluation of a scoring function (the utility). This is typically the performance of the model on a test set (as an approximation to its true expected performance). It is therefore convenient to wrap the training data and the test data in objects that can be passed around to valuation methods. This is done with Dataset.
This abstraction layer also seamlessly groups data points together if one is interested in computing their value as a group; see GroupedDataset.
Objects of both types can be used to construct scorers and to fit valuation methods.
Dataset
Dataset(
x: NDArray,
y: NDArray,
feature_names: Sequence[str] | None = None,
target_names: Sequence[str] | None = None,
data_names: Sequence[str] | None = None,
description: str | None = None,
multi_output: bool = False,
)
A convenience class to handle datasets.
It holds a dataset, together with info on feature names, target names, and data names. It is used to pass data around to valuation methods.
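PARAMETER | DESCRIPTION |
---|---|
x | training data. TYPE: NDArray |
y | labels for training data. TYPE: NDArray |
feature_names | names of the features of x data. TYPE: Sequence[str] \| None. DEFAULT: None |
target_names | names of the features of y data. TYPE: Sequence[str] \| None. DEFAULT: None |
data_names | names assigned to data points. For example, if the dataset is a time series, each entry can be a timestamp which can be referenced directly instead of using a row number. TYPE: Sequence[str] \| None. DEFAULT: None |
description | A textual description of the dataset. TYPE: str \| None. DEFAULT: None |
multi_output | set to False if labels are scalars, or to True if they are vectors of dimension > 1. TYPE: bool. DEFAULT: False |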
Changed in version 0.10.0
No longer holds split data, but only x, y.
Source code in src/pydvl/valuation/dataset.py
indices
property
Index of positions in data.x.
Contiguous integers from 0 to len(Dataset) - 1.
data_names
property
Names of each individual datapoint.
Used for reporting Shapley values.
get_data
Given a set of indices, returns the training data at those indices.
This is used mainly by Utility to retrieve subsets of the data from indices. It is typically not needed in algorithms.
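PARAMETER | DESCRIPTION |
---|---|
indices | Optional indices that will be used to select points from the training data. If None, the entire training data is returned. |
RETURNS | DESCRIPTION |
---|---|
tuple[NDArray, NDArray] | If indices is not None, the selected x and y arrays from the training data; otherwise the entire dataset. |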
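The index-based selection described for get_data can be sketched with plain NumPy. This is not pyDVL's implementation: `select_points` is a hypothetical stand-in, and the convention that `indices=None` returns everything is an assumption made for this sketch.

```python
import numpy as np


def select_points(x, y, indices=None):
    """Hypothetical stand-in for index-based selection (not pyDVL code).

    Returns the rows of x and the entries of y at `indices`. With
    indices=None the full arrays are returned (assumed convention).
    """
    if indices is None:
        return x, y
    idx = np.asarray(indices, dtype=int)
    return x[idx], y[idx]


# Six data points with two features each, plus one label per point.
x = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])

x_sub, y_sub = select_points(x, y, indices=[0, 2, 5])  # a subset of points
x_all, y_all = select_points(x, y)                      # everything
```

A utility would typically call such a selection once per sampled subset, fit the model on the returned arrays, and score it on held-out data.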
from_sklearn
classmethod
from_sklearn(
data: Bunch,
train_size: float = 0.8,
random_state: int | None = None,
stratify_by_target: bool = False,
**kwargs
) -> tuple[Dataset, Dataset]
Constructs two Dataset objects from a sklearn.utils.Bunch, as returned by the load_* functions in scikit-learn toy datasets.
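PARAMETER | DESCRIPTION |
---|---|
data | scikit-learn Bunch object. The following attributes are supported: data (covariates), target (target variables or labels), and optionally feature_names, target_names and DESCR (a description). TYPE: Bunch |
train_size | size of the training dataset. Used in scikit-learn's train_test_split. TYPE: float. DEFAULT: 0.8 |
random_state | seed for train / test split. TYPE: int \| None. DEFAULT: None |
stratify_by_target | If True, data is split in a stratified fashion, using the target variable as labels. TYPE: bool. DEFAULT: False |
kwargs | Additional keyword arguments to pass to the Dataset constructor. Use this to pass e.g. multi_output. |
RETURNS | DESCRIPTION |
---|---|
tuple[Dataset, Dataset] | Training and test Dataset objects built from the sklearn dataset. |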
Changed in version 0.6.0
Added kwargs to pass to the Dataset constructor.
Changed in version 0.10.0
Returns a tuple of two Dataset objects.
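To make the roles of train_size, random_state and stratify_by_target concrete, here is a minimal NumPy-only sketch of a stratified train / test split. It is a hand-rolled stand-in written for illustration, not the splitting routine the library calls, and `stratified_split` is a hypothetical name.

```python
import numpy as np


def stratified_split(y, train_size=0.8, random_state=None):
    """Hypothetical helper: split indices so each class keeps its proportion."""
    rng = np.random.default_rng(random_state)  # seed makes the split repeatable
    train_idx, test_idx = [], []
    for label in np.unique(y):
        members = np.flatnonzero(y == label)   # positions of this class
        rng.shuffle(members)
        n_train = int(round(train_size * len(members)))
        train_idx.extend(members[:n_train])
        test_idx.extend(members[n_train:])
    return (
        np.sort(np.array(train_idx, dtype=int)),
        np.sort(np.array(test_idx, dtype=int)),
    )


# Imbalanced toy data: ten points of class 0 and five of class 1.
X = np.arange(30).reshape(15, 2)
y = np.array([0] * 10 + [1] * 5)

train_idx, test_idx = stratified_split(y, train_size=0.8, random_state=0)
x_train, y_train = X[train_idx], y[train_idx]
x_test, y_test = X[test_idx], y[test_idx]
```

With stratification, both parts keep the 2:1 class ratio of the full data, which matters when a rare class would otherwise be missing from a small test set.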
from_arrays
classmethod
from_arrays(
X: NDArray,
y: NDArray,
train_size: float = 0.8,
random_state: int | None = None,
stratify_by_target: bool = False,
**kwargs
) -> tuple[Dataset, Dataset]
Constructs two Dataset objects (training and test) from X and y numpy arrays, as returned by the make_* functions in scikit-learn generated datasets.
PARAMETER | DESCRIPTION |
---|---|
X | numpy array of shape (n_samples, n_features). TYPE: NDArray |
y | numpy array of shape (n_samples,). TYPE: NDArray |
train_size | size of the training dataset. Used in scikit-learn's train_test_split. TYPE: float. DEFAULT: 0.8 |
random_state | seed for train / test split. TYPE: int \| None. DEFAULT: None |
stratify_by_target | If True, data is split in a stratified fashion, using the y variable as labels. TYPE: bool. DEFAULT: False |
kwargs | Additional keyword arguments to pass to the Dataset constructor. Use this to pass e.g. feature_names. |
RETURNS | DESCRIPTION |
---|---|
tuple[Dataset, Dataset] | Training and test Dataset objects holding the passed X and y arrays split across the two sets. |
New in version 0.4.0
Changed in version 0.6.0
Added kwargs to pass to the Dataset constructor.
Changed in version 0.10.0
Returns a tuple of two Dataset objects.
GroupedDataset
GroupedDataset(
x: NDArray,
y: NDArray,
data_groups: Sequence,
feature_names: Sequence[str] | None = None,
target_names: Sequence[str] | None = None,
group_names: Sequence[str] | None = None,
description: str | None = None,
**kwargs
)
Bases: Dataset
Used for calculating Shapley values of subsets of the data considered as logical units. For instance, one can group by value of a categorical feature, by bin into which a continuous feature falls, or by label.
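The three grouping strategies mentioned above can each be expressed as an array holding one group label per data point. The sketch below builds such arrays with NumPy; the feature names and bin edges are made up for illustration, and passing the result as data_groups is the assumed use.

```python
import numpy as np

# Toy data: one categorical feature, one continuous feature, and labels.
color = np.array(["red", "blue", "red", "green", "blue", "red", "green", "blue"])
age = np.array([23.0, 35.0, 47.0, 51.0, 62.0, 18.0, 39.0, 70.0])
y = np.array([0, 1, 0, 1, 1, 0, 0, 1])

# 1. Group by the value of a categorical feature: the feature is the label.
groups_by_category = color

# 2. Group by the bin into which a continuous feature falls.
#    Bins here are: age < 30, 30 <= age < 50, age >= 50.
bin_edges = np.array([30.0, 50.0])
groups_by_bin = np.digitize(age, bin_edges)

# 3. Group by label: every class becomes one group.
groups_by_label = y
```

Each array has exactly one entry per data point, which is the shape the data_groups parameter is described as expecting.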
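PARAMETER | DESCRIPTION |
---|---|
x | training data. TYPE: NDArray |
y | labels of training data. TYPE: NDArray |
data_groups | Iterable of the same length as x containing a group label for each data point. TYPE: Sequence |
feature_names | names of the covariates' features. TYPE: Sequence[str] \| None. DEFAULT: None |
target_names | names of the labels or targets y. TYPE: Sequence[str] \| None. DEFAULT: None |
group_names | names of the groups. If not provided, the labels from data_groups will be used. TYPE: Sequence[str] \| None. DEFAULT: None |
description | A textual description of the dataset. TYPE: str \| None. DEFAULT: None |
kwargs | Additional keyword arguments to pass to the Dataset constructor. |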
Changed in version 0.6.0
Added group_names and forwarding of kwargs.
Changed in version 0.10.0
No longer holds split data, but only x,y and group information.
get_data
Returns the data and labels of all samples in the given groups.
PARAMETER | DESCRIPTION |
---|---|
indices | group indices whose elements to return. If None, all data from all groups are returned. |
RETURNS | DESCRIPTION |
---|---|
tuple[NDArray, NDArray] | Tuple of training data x and labels y. |
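Selecting by group rather than by data point can be sketched as follows. This is an illustration in plain NumPy, not the library's code: `select_groups` is a hypothetical helper, and numbering groups by sorted label is an assumption of this sketch.

```python
import numpy as np

# Six data points, each assigned to one of three named groups.
x = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
data_groups = np.array(["a", "b", "a", "c", "b", "a"])

# Assumption: groups are numbered by sorted label, so "a" -> 0, "b" -> 1, "c" -> 2.
# group_of_point[i] is the group index of data point i.
group_names, group_of_point = np.unique(data_groups, return_inverse=True)


def select_groups(group_indices):
    """Hypothetical helper: data and labels of all points in the given groups."""
    mask = np.isin(group_of_point, group_indices)
    return x[mask], y[mask]


x_sel, y_sel = select_groups([0, 2])  # every point in groups "a" and "c"
```

Because a boolean mask is used, the selected points come back in their original dataset order rather than ordered by group.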
from_sklearn
classmethod
from_sklearn(
data: Bunch,
train_size: float = 0.8,
random_state: int | None = None,
stratify_by_target: bool = False,
data_groups: Sequence | None = None,
**kwargs
) -> tuple[GroupedDataset, Dataset]
Constructs a GroupedDataset object and an ungrouped Dataset object from a sklearn.utils.Bunch, as returned by the load_* functions in scikit-learn toy datasets, and groups it.
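PARAMETER | DESCRIPTION |
---|---|
data | scikit-learn Bunch object. The following attributes are supported: data (covariates), target (target variables or labels), and optionally feature_names, target_names and DESCR (a description). TYPE: Bunch |
train_size | size of the training dataset. Used in scikit-learn's train_test_split. TYPE: float. DEFAULT: 0.8 |
random_state | seed for train / test split. TYPE: int \| None. DEFAULT: None |
stratify_by_target | If True, data is split in a stratified fashion, using the target variable as labels. TYPE: bool. DEFAULT: False |
data_groups | an array holding the group index or name for each data point. The length of this array must be equal to the number of data points in the dataset. TYPE: Sequence \| None. DEFAULT: None |
kwargs | Additional keyword arguments to pass to the Dataset constructor. |
RETURNS | DESCRIPTION |
---|---|
tuple[GroupedDataset, Dataset] | Datasets with the selected sklearn data. |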
Changed in version 0.10.0
Returns a tuple of two Dataset objects.
from_arrays
classmethod
from_arrays(
X: NDArray,
y: NDArray,
train_size: float = 0.8,
random_state: int | None = None,
stratify_by_target: bool = False,
data_groups: Sequence | None = None,
**kwargs
) -> tuple[GroupedDataset, Dataset]
Constructs a GroupedDataset object and an ungrouped Dataset object from X and y numpy arrays, as returned by the make_* functions in scikit-learn generated datasets.
Example
>>> from sklearn.datasets import make_classification
>>> from pydvl.valuation.dataset import GroupedDataset
>>> X, y = make_classification(
... n_samples=100,
... n_features=4,
... n_informative=2,
... n_redundant=0,
... random_state=0,
... shuffle=False
... )
>>> data_groups = X[:, 0] // 0.5
>>> train, test = GroupedDataset.from_arrays(X, y, data_groups=data_groups)
PARAMETER | DESCRIPTION |
---|---|
X | array of shape (n_samples, n_features). TYPE: NDArray |
y | array of shape (n_samples,). TYPE: NDArray |
train_size | size of the training dataset. Used in scikit-learn's train_test_split. TYPE: float. DEFAULT: 0.8 |
random_state | seed for train / test split. TYPE: int \| None. DEFAULT: None |
stratify_by_target | If True, data is split in a stratified fashion, using the y variable as labels. TYPE: bool. DEFAULT: False |
data_groups | an array holding the group index or name for each data point. The length of this array must be equal to the number of data points in the dataset. TYPE: Sequence \| None. DEFAULT: None |
kwargs | Additional keyword arguments that will be passed to the Dataset constructor. |
RETURNS | DESCRIPTION |
---|---|
tuple[GroupedDataset, Dataset] | Datasets with the passed X and y arrays split across training and test sets. |
New in version 0.4.0
Changed in version 0.6.0
Added kwargs to pass to the Dataset constructor.
Changed in version 0.10.0
Returns a tuple of two Dataset objects.
from_dataset
classmethod
from_dataset(data: Dataset, data_groups: Sequence[Any]) -> GroupedDataset
Creates a GroupedDataset object from a Dataset object and a sequence of group labels, one per data point.
PARAMETER | DESCRIPTION |
---|---|
data | The original data. TYPE: Dataset |
data_groups | An array holding the group index or name for each data point. The length of this array must be equal to the number of data points in the dataset. TYPE: Sequence[Any] |
RETURNS | DESCRIPTION |
---|---|
GroupedDataset | A GroupedDataset with the initial Dataset grouped by data_groups. |