pydvl.utils
¶
squashed_r2
module-attribute
¶
squashed_r2 = compose_score(Scorer('r2'), _sigmoid, (0, 1), 'squashed r2')
A scorer that squashes the R² score into the range [0, 1] using a sigmoid.
squashed_variance
module-attribute
¶
squashed_variance = compose_score(
Scorer("explained_variance"),
_sigmoid,
(0, 1),
"squashed explained variance",
)
A scorer that squashes the explained variance score into the range [0, 1] using a sigmoid.
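For instance, a squashed scorer can be passed to a Utility so that unbounded R² values are mapped into [0, 1]. A minimal sketch, assuming squashed_r2 is importable from pydvl.utils and using an illustrative regression dataset:
>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import LinearRegression
>>> from pydvl.utils import Dataset, Utility, squashed_r2
>>> X, y = make_regression(n_samples=100, n_features=3, random_state=0)
>>> dataset = Dataset.from_arrays(X, y, random_state=0)
>>> u = Utility(LinearRegression(), dataset, scorer=squashed_r2)
>>> 0.0 <= u(dataset.indices) <= 1.0
True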
CacheStats
dataclass
¶
CacheStats(
sets: int = 0,
misses: int = 0,
hits: int = 0,
timeouts: int = 0,
errors: int = 0,
reconnects: int = 0,
)
Class used to store statistics gathered by cached functions.
ATTRIBUTE | DESCRIPTION
---|---
sets | Number of times a value was set in the cache. TYPE: int
misses | Number of times a value was not found in the cache. TYPE: int
hits | Number of times a value was found in the cache. TYPE: int
timeouts | Number of times a timeout occurred. TYPE: int
errors | Number of times an error occurred. TYPE: int
reconnects | Number of times the client reconnected to the server. TYPE: int
CacheBackend
¶
Bases: ABC
Abstract base class for cache backends.
Defines interface for cache access including wrapping callables, getting/setting results, clearing cache, and combining cache keys.
ATTRIBUTE | DESCRIPTION
---|---
stats | Cache statistics tracker. TYPE: CacheStats
Source code in src/pydvl/utils/caching/base.py
wrap
¶
wrap(
func: Callable, *, config: Optional[CachedFuncConfig] = None
) -> CachedFunc
Wraps a function to cache its results.
PARAMETER | DESCRIPTION
---|---
func | The function to wrap. TYPE: Callable
config | Optional caching options for the wrapped function. TYPE: Optional[CachedFuncConfig]

RETURNS | DESCRIPTION
---|---
CachedFunc | The wrapped cached function.
Source code in src/pydvl/utils/caching/base.py
get
abstractmethod
¶
get(key: str) -> Optional[CacheResult]
Abstract method to retrieve a cached result.
Implemented by subclasses.
PARAMETER | DESCRIPTION
---|---
key | The cache key. TYPE: str

RETURNS | DESCRIPTION
---|---
Optional[CacheResult] | The cached result or None if not found.
Source code in src/pydvl/utils/caching/base.py
set
abstractmethod
¶
set(key: str, value: CacheResult) -> None
Abstract method to set a cached result.
Implemented by subclasses.
PARAMETER | DESCRIPTION
---|---
key | The cache key. TYPE: str
value | The result to cache. TYPE: CacheResult
clear
abstractmethod
¶
CachedFunc
¶
CachedFunc(
func: Callable[..., float],
*,
cache_backend: CacheBackend,
config: Optional[CachedFuncConfig] = None,
)
Caches callable function results with a provided cache backend.
Wraps a callable function to cache its results using a provided instance of a subclass of CacheBackend.
This class is heavily inspired by joblib.memory.MemorizedFunc.
This class caches calls to the wrapped callable by generating a hash key based on the wrapped callable's code, the arguments passed to it and the optional hash_prefix.
Warning
This class only works with hashable arguments to the wrapped callable.
PARAMETER | DESCRIPTION
---|---
func | Callable to wrap. TYPE: Callable[..., float]
cache_backend | Instance of CacheBackend that handles setting and getting values. TYPE: CacheBackend
config | Configuration for the wrapped function. TYPE: Optional[CachedFuncConfig]
Source code in src/pydvl/utils/caching/base.py
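A minimal usage sketch, assuming the import paths given by the source locations above (normally one would call CacheBackend.wrap() instead of constructing CachedFunc directly); the example function is illustrative:
>>> from pydvl.utils.caching.base import CachedFunc
>>> from pydvl.utils.caching.memory import InMemoryCacheBackend
>>> def score(x: int) -> float:
...     return float(x**2)
...
>>> cached_score = CachedFunc(score, cache_backend=InMemoryCacheBackend())
>>> cached_score(3)
9.0
>>> cached_score(3)  # a repeated call may be served from the cache
9.0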
DiskCacheBackend
¶
Bases: CacheBackend
Disk cache backend that stores results in files.
Implements the CacheBackend interface for a disk-based cache. Stores cache entries as pickled files on disk, keyed by cache key. This allows sharing evaluations across processes in a single node/computer.
PARAMETER | DESCRIPTION
---|---
cache_dir | Base directory for cache storage.

ATTRIBUTE | DESCRIPTION
---|---
cache_dir | Base directory for cache storage.
Example
Basic usage:
>>> from pydvl.utils.caching.disk import DiskCacheBackend
>>> cache_backend = DiskCacheBackend()
>>> cache_backend.clear()
>>> value = 42
>>> cache_backend.set("key", value)
>>> cache_backend.get("key")
42
Callable wrapping:
>>> from pydvl.utils.caching.disk import DiskCacheBackend
>>> cache_backend = DiskCacheBackend()
>>> cache_backend.clear()
>>> value = 42
>>> def foo(x: int):
... return x + 1
...
>>> wrapped_foo = cache_backend.wrap(foo)
>>> wrapped_foo(value)
43
>>> wrapped_foo.stats.misses
1
>>> wrapped_foo.stats.hits
0
>>> wrapped_foo(value)
43
>>> wrapped_foo.stats.misses
1
>>> wrapped_foo.stats.hits
1
PARAMETER | DESCRIPTION
---|---
cache_dir | Base directory for cache storage. If not provided, this defaults to a newly created temporary directory.
Source code in src/pydvl/utils/caching/disk.py
wrap
¶
wrap(
func: Callable, *, config: Optional[CachedFuncConfig] = None
) -> CachedFunc
Wraps a function to cache its results.
PARAMETER | DESCRIPTION
---|---
func | The function to wrap. TYPE: Callable
config | Optional caching options for the wrapped function. TYPE: Optional[CachedFuncConfig]

RETURNS | DESCRIPTION
---|---
CachedFunc | The wrapped cached function.
Source code in src/pydvl/utils/caching/base.py
get
¶
Get a value from the cache.
PARAMETER | DESCRIPTION
---|---
key | Cache key. TYPE: str

RETURNS | DESCRIPTION
---|---
Optional[Any] | Cached value or None if not found.
Source code in src/pydvl/utils/caching/disk.py
set
¶
clear
¶
InMemoryCacheBackend
¶
Bases: CacheBackend
In-memory cache backend that stores results in a dictionary.
Implements the CacheBackend interface for an in-memory-based cache. Stores cache entries as values in a dictionary, keyed by cache key. This allows sharing evaluations across threads in a single process.
The implementation is not thread-safe.
ATTRIBUTE | DESCRIPTION
---|---
cached_values | Dictionary used to store cached values. TYPE: Dict[str, Any]
Example
Basic usage:
>>> from pydvl.utils.caching.memory import InMemoryCacheBackend
>>> cache_backend = InMemoryCacheBackend()
>>> cache_backend.clear()
>>> value = 42
>>> cache_backend.set("key", value)
>>> cache_backend.get("key")
42
Callable wrapping:
>>> from pydvl.utils.caching.memory import InMemoryCacheBackend
>>> cache_backend = InMemoryCacheBackend()
>>> cache_backend.clear()
>>> value = 42
>>> def foo(x: int):
... return x + 1
...
>>> wrapped_foo = cache_backend.wrap(foo)
>>> wrapped_foo(value)
43
>>> wrapped_foo.stats.misses
1
>>> wrapped_foo.stats.hits
0
>>> wrapped_foo(value)
43
>>> wrapped_foo.stats.misses
1
>>> wrapped_foo.stats.hits
1
Source code in src/pydvl/utils/caching/memory.py
wrap
¶
wrap(
func: Callable, *, config: Optional[CachedFuncConfig] = None
) -> CachedFunc
Wraps a function to cache its results.
PARAMETER | DESCRIPTION
---|---
func | The function to wrap. TYPE: Callable
config | Optional caching options for the wrapped function. TYPE: Optional[CachedFuncConfig]

RETURNS | DESCRIPTION
---|---
CachedFunc | The wrapped cached function.
Source code in src/pydvl/utils/caching/base.py
get
¶
set
¶
clear
¶
MemcachedClientConfig
dataclass
¶
MemcachedClientConfig(
server: Tuple[str, int] = ("localhost", 11211),
connect_timeout: float = 1.0,
timeout: float = 1.0,
no_delay: bool = True,
serde: PickleSerde = PickleSerde(pickle_version=PICKLE_VERSION),
)
Configuration of the memcached client.
PARAMETER | DESCRIPTION
---|---
server | A tuple of (IP or domain name, port). TYPE: Tuple[str, int]
connect_timeout | How many seconds to wait before raising an error on failure to connect. TYPE: float
timeout | Duration in seconds to wait for send or recv calls on the socket connected to memcached. TYPE: float
no_delay | If True, set the TCP_NODELAY flag, which may help with performance in some cases. TYPE: bool
serde | Serializer / deserializer ("serde"). The default PickleSerde should work in most cases. TYPE: PickleSerde
MemcachedCacheBackend
¶
MemcachedCacheBackend(config: MemcachedClientConfig = MemcachedClientConfig())
Bases: CacheBackend
Memcached cache backend for the distributed caching of functions.
Implements the CacheBackend interface for a memcached based cache. This allows sharing evaluations across processes and nodes in a cluster. You can run memcached as a service, locally or remotely, see the caching documentation.
PARAMETER | DESCRIPTION
---|---
config | Memcached client configuration. TYPE: MemcachedClientConfig

ATTRIBUTE | DESCRIPTION
---|---
config | Memcached client configuration.
client | Memcached client instance.
Example
Basic usage:
>>> from pydvl.utils.caching.memcached import MemcachedCacheBackend
>>> cache_backend = MemcachedCacheBackend()
>>> cache_backend.clear()
>>> value = 42
>>> cache_backend.set("key", value)
>>> cache_backend.get("key")
42
Callable wrapping:
>>> from pydvl.utils.caching.memcached import MemcachedCacheBackend
>>> cache_backend = MemcachedCacheBackend()
>>> cache_backend.clear()
>>> value = 42
>>> def foo(x: int):
... return x + 1
...
>>> wrapped_foo = cache_backend.wrap(foo)
>>> wrapped_foo(value)
43
>>> wrapped_foo.stats.misses
1
>>> wrapped_foo.stats.hits
0
>>> wrapped_foo(value)
43
>>> wrapped_foo.stats.misses
1
>>> wrapped_foo.stats.hits
1
PARAMETER | DESCRIPTION
---|---
config | Memcached client configuration. TYPE: MemcachedClientConfig
Source code in src/pydvl/utils/caching/memcached.py
wrap
¶
wrap(
func: Callable, *, config: Optional[CachedFuncConfig] = None
) -> CachedFunc
Wraps a function to cache its results.
PARAMETER | DESCRIPTION
---|---
func | The function to wrap. TYPE: Callable
config | Optional caching options for the wrapped function. TYPE: Optional[CachedFuncConfig]

RETURNS | DESCRIPTION
---|---
CachedFunc | The wrapped cached function.
Source code in src/pydvl/utils/caching/base.py
get
¶
Get value from memcached.
PARAMETER | DESCRIPTION
---|---
key | Cache key. TYPE: str

RETURNS | DESCRIPTION
---|---
Optional[Any] | Cached value or None if not found or client disconnected.
Source code in src/pydvl/utils/caching/memcached.py
set
¶
clear
¶
combine_hashes
¶
__getstate__
¶
__getstate__() -> Dict
Enables pickling after a socket has been opened to the memcached server, by removing the client from the stored data.
CachedFuncConfig
dataclass
¶
CachedFuncConfig(
hash_prefix: Optional[str] = None,
ignore_args: Collection[str] = list(),
time_threshold: float = 0.3,
allow_repeated_evaluations: bool = False,
rtol_stderr: float = 0.1,
min_repetitions: int = 3,
)
Configuration for cached functions and methods, providing memoization of function calls.
Instances of this class are typically used as arguments for the construction of a Utility.
PARAMETER | DESCRIPTION
---|---
hash_prefix | Optional string prefix that will be prepended to the cache key. This can be provided in order to guarantee cache reuse across runs. TYPE: Optional[str]
ignore_args | Do not take these keyword arguments into account when hashing the wrapped function for usage as key. This allows sharing the cache among different jobs for the same experiment run if the callable happens to have "nuisance" parameters like job_id which do not affect the result of the computation. TYPE: Collection[str]
time_threshold | Computations taking less time than this many seconds are not cached. A value of 0 means that it will always cache results. TYPE: float
allow_repeated_evaluations | If True, repeated calls to the function with the same arguments will be allowed and outputs averaged until the running standard deviation of the mean stabilizes below rtol_stderr * mean. TYPE: bool
rtol_stderr | Relative tolerance for repeated evaluations. More precisely, memcached() will stop evaluating the function once the standard deviation of the mean is smaller than rtol_stderr * mean. TYPE: float
min_repetitions | Minimum number of times that a function evaluation on the same arguments is repeated before returning cached values. Useful for stochastic functions only. If the model training is very noisy, set this number to higher values to reduce variance. TYPE: int
ParallelConfig
dataclass
¶
ParallelConfig(
backend: Literal["joblib", "ray"] = "joblib",
address: Optional[Union[str, Tuple[str, int]]] = None,
n_cpus_local: Optional[int] = None,
logging_level: Optional[int] = None,
wait_timeout: float = 1.0,
)
Configuration for parallel computation backend.
PARAMETER | DESCRIPTION
---|---
backend | Type of backend to use. Defaults to 'joblib'. TYPE: Literal['joblib', 'ray']
address | (DEPRECATED) Address of existing remote or local cluster to use. TYPE: Optional[Union[str, Tuple[str, int]]]
n_cpus_local | (DEPRECATED) Number of CPUs to use when creating a local ray cluster. This has no effect when using an existing ray cluster. TYPE: Optional[int]
logging_level | (DEPRECATED) Logging level for the parallel backend's worker. TYPE: Optional[int]
wait_timeout | (DEPRECATED) Timeout in seconds for waiting on futures. TYPE: float
Dataset
¶
Dataset(
x_train: Union[NDArray, DataFrame],
y_train: Union[NDArray, DataFrame],
x_test: Union[NDArray, DataFrame],
y_test: Union[NDArray, DataFrame],
feature_names: Optional[Sequence[str]] = None,
target_names: Optional[Sequence[str]] = None,
data_names: Optional[Sequence[str]] = None,
description: Optional[str] = None,
is_multi_output: bool = False,
)
A convenience class to handle datasets.
It holds a dataset, split into training and test data, together with several labels on feature names, data point names and a description.
PARAMETER | DESCRIPTION
---|---
x_train | training data TYPE: Union[NDArray, DataFrame]
y_train | labels for training data TYPE: Union[NDArray, DataFrame]
x_test | test data TYPE: Union[NDArray, DataFrame]
y_test | labels for test data TYPE: Union[NDArray, DataFrame]
feature_names | names of the features of input data TYPE: Optional[Sequence[str]]
target_names | names of the features of target data TYPE: Optional[Sequence[str]]
data_names | names assigned to data points. For example, if the dataset is a time series, each entry can be a timestamp which can be referenced directly instead of using a row number. TYPE: Optional[Sequence[str]]
description | A textual description of the dataset. TYPE: Optional[str]
is_multi_output | set to True if the labels are vectors of dimension greater than 1. TYPE: bool
Source code in src/pydvl/utils/dataset.py
indices
property
¶
Index of positions in data.x_train.
Contiguous integers from 0 to len(Dataset).
data_names
property
¶
Names of each individual datapoint.
Used for reporting Shapley values.
get_training_data
¶
Given a set of indices, returns the training data that refer to those indices.
This is used mainly by Utility to retrieve subsets of the data from indices. It is typically not needed in algorithms.
PARAMETER | DESCRIPTION
---|---
indices | Optional indices that will be used to select points from the training data. If None, the entire training data will be returned.

RETURNS | DESCRIPTION
---|---
Tuple[NDArray, NDArray] | If indices is not None, the selected x and y arrays from the training data. Otherwise, the entire dataset.
Source code in src/pydvl/utils/dataset.py
get_test_data
¶
Returns the entire test set regardless of the passed indices.
The passed indices will not be used because for data valuation we generally want to score the trained model on the entire test data.
Additionally, given the way this method is used by the Utility class, the passed indices will be those of the training data and would not work on the test data.
There may be cases where it is desired to use parts of the test data. In those cases, it is recommended to inherit from Dataset and override get_test_data().
For example, the following snippet shows how one could go about mapping the training data indices into test data indices inside get_test_data():
Example
>>> from pydvl.utils import Dataset
>>> import numpy as np
>>> class DatasetWithTestDataIndices(Dataset):
... def get_test_data(self, indices=None):
... if indices is None:
... return self.x_test, self.y_test
... fraction = len(list(indices)) / len(self)
... mapped_indices = len(self.x_test) / len(self) * np.asarray(indices)
... mapped_indices = np.unique(mapped_indices.astype(int))
... return self.x_test[mapped_indices], self.y_test[mapped_indices]
...
>>> X = np.random.rand(100, 10)
>>> y = np.random.randint(0, 2, 100)
>>> dataset = DatasetWithTestDataIndices.from_arrays(X, y)
>>> indices = np.random.choice(dataset.indices, 30, replace=False)
>>> _ = dataset.get_training_data(indices)
>>> _ = dataset.get_test_data(indices)
PARAMETER | DESCRIPTION
---|---
indices | Optional indices into the test data. This argument is unused and is only kept for compatibility with get_training_data().

RETURNS | DESCRIPTION
---|---
Tuple[NDArray, NDArray] | The entire test data.
Source code in src/pydvl/utils/dataset.py
from_sklearn
classmethod
¶
from_sklearn(
data: Bunch,
train_size: float = 0.8,
random_state: Optional[int] = None,
stratify_by_target: bool = False,
**kwargs: Any,
) -> Dataset
Constructs a Dataset object from a
sklearn.utils.Bunch, as returned by the load_*
functions in scikit-learn toy datasets.
Example
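A minimal sketch of typical usage (the toy dataset and split parameters are illustrative):
>>> from sklearn.datasets import load_iris
>>> from pydvl.utils import Dataset
>>> dataset = Dataset.from_sklearn(load_iris(), train_size=0.8, random_state=42)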
PARAMETER | DESCRIPTION
---|---
data | scikit-learn Bunch object. The following attributes are supported: data, target, feature_names (optional), target_names (optional), DESCR (optional). TYPE: Bunch
train_size | size of the training dataset. Used in train_test_split. TYPE: float
random_state | seed for train / test split TYPE: Optional[int]
stratify_by_target | If True, data is stratified according to the target variable, following the sklearn convention. TYPE: bool
kwargs | Additional keyword arguments to pass to the Dataset constructor. Use this to pass e.g. is_multi_output. TYPE: Any

RETURNS | DESCRIPTION
---|---
Dataset | Object with the sklearn dataset
Changed in version 0.6.0
Added kwargs to pass to the Dataset constructor.
Source code in src/pydvl/utils/dataset.py
from_arrays
classmethod
¶
from_arrays(
X: NDArray,
y: NDArray,
train_size: float = 0.8,
random_state: Optional[int] = None,
stratify_by_target: bool = False,
**kwargs: Any,
) -> Dataset
Constructs a Dataset object from X and y numpy arrays as
returned by the make_*
functions in sklearn generated datasets.
Example
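A minimal sketch (the make_classification parameters are illustrative):
>>> from sklearn.datasets import make_classification
>>> from pydvl.utils import Dataset
>>> X, y = make_classification(n_samples=100, n_features=4, random_state=0)
>>> dataset = Dataset.from_arrays(X, y, train_size=0.8, random_state=0)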
PARAMETER | DESCRIPTION
---|---
X | numpy array of shape (n_samples, n_features) TYPE: NDArray
y | numpy array of shape (n_samples,) TYPE: NDArray
train_size | size of the training dataset. Used in train_test_split. TYPE: float
random_state | seed for train / test split TYPE: Optional[int]
stratify_by_target | If True, data is stratified according to the target variable, following the sklearn convention. TYPE: bool
kwargs | Additional keyword arguments to pass to the Dataset constructor. Use this to pass e.g. is_multi_output or description. TYPE: Any

RETURNS | DESCRIPTION
---|---
Dataset | Object with the passed X and y arrays split across training and test sets.
New in version 0.4.0
Changed in version 0.6.0
Added kwargs to pass to the Dataset constructor.
Source code in src/pydvl/utils/dataset.py
GroupedDataset
¶
GroupedDataset(
x_train: NDArray,
y_train: NDArray,
x_test: NDArray,
y_test: NDArray,
data_groups: Sequence,
feature_names: Optional[Sequence[str]] = None,
target_names: Optional[Sequence[str]] = None,
group_names: Optional[Sequence[str]] = None,
description: Optional[str] = None,
**kwargs: Any,
)
Bases: Dataset
Used for calculating Shapley values of subsets of the data considered as logical units. For instance, one can group by value of a categorical feature, by bin into which a continuous feature falls, or by label.
PARAMETER | DESCRIPTION
---|---
x_train | training data TYPE: NDArray
y_train | labels of training data TYPE: NDArray
x_test | test data TYPE: NDArray
y_test | labels of test data TYPE: NDArray
data_groups | Iterable of the same length as x_train containing a group label for each training data point. TYPE: Sequence
feature_names | names of the covariates' features. TYPE: Optional[Sequence[str]]
target_names | names of the labels or targets y TYPE: Optional[Sequence[str]]
group_names | names of the groups. If not provided, the labels from data_groups will be used. TYPE: Optional[Sequence[str]]
description | A textual description of the dataset TYPE: Optional[str]
kwargs | Additional keyword arguments to pass to the Dataset constructor. TYPE: Any
Changed in version 0.6.0
Added group_names and forwarding of kwargs.
Source code in src/pydvl/utils/dataset.py
get_test_data
¶
Returns the entire test set regardless of the passed indices.
The passed indices will not be used because for data valuation we generally want to score the trained model on the entire test data.
Additionally, given the way this method is used by the Utility class, the passed indices will be those of the training data and would not work on the test data.
There may be cases where it is desired to use parts of the test data. In those cases, it is recommended to inherit from Dataset and override get_test_data().
For example, the following snippet shows how one could go about mapping the training data indices into test data indices inside get_test_data():
Example
>>> from pydvl.utils import Dataset
>>> import numpy as np
>>> class DatasetWithTestDataIndices(Dataset):
... def get_test_data(self, indices=None):
... if indices is None:
... return self.x_test, self.y_test
... fraction = len(list(indices)) / len(self)
... mapped_indices = len(self.x_test) / len(self) * np.asarray(indices)
... mapped_indices = np.unique(mapped_indices.astype(int))
... return self.x_test[mapped_indices], self.y_test[mapped_indices]
...
>>> X = np.random.rand(100, 10)
>>> y = np.random.randint(0, 2, 100)
>>> dataset = DatasetWithTestDataIndices.from_arrays(X, y)
>>> indices = np.random.choice(dataset.indices, 30, replace=False)
>>> _ = dataset.get_training_data(indices)
>>> _ = dataset.get_test_data(indices)
PARAMETER | DESCRIPTION
---|---
indices | Optional indices into the test data. This argument is unused and is only kept for compatibility with get_training_data().

RETURNS | DESCRIPTION
---|---
Tuple[NDArray, NDArray] | The entire test data.
Source code in src/pydvl/utils/dataset.py
get_training_data
¶
Returns the data and labels of all samples in the given groups.
PARAMETER | DESCRIPTION
---|---
indices | group indices whose elements to return. If None, all data from all groups is returned.

RETURNS | DESCRIPTION
---|---
Tuple[NDArray, NDArray] | Tuple of training data x and labels y.
Source code in src/pydvl/utils/dataset.py
from_sklearn
classmethod
¶
from_sklearn(
data: Bunch,
train_size: float = 0.8,
random_state: Optional[int] = None,
stratify_by_target: bool = False,
data_groups: Optional[Sequence] = None,
**kwargs: Any,
) -> GroupedDataset
Constructs a GroupedDataset object from a
sklearn.utils.Bunch as returned by the load_*
functions in
scikit-learn toy datasets and groups
it.
Example
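A minimal sketch, grouping the iris data by binning its first feature (the grouping rule is illustrative):
>>> from sklearn.datasets import load_iris
>>> from pydvl.utils import GroupedDataset
>>> iris = load_iris()
>>> data_groups = iris.data[:, 0] // 0.5
>>> dataset = GroupedDataset.from_sklearn(iris, data_groups=data_groups)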
PARAMETER | DESCRIPTION
---|---
data | scikit-learn Bunch object. The following attributes are supported: data, target, feature_names (optional), target_names (optional), DESCR (optional). TYPE: Bunch
train_size | size of the training dataset. Used in train_test_split. TYPE: float
random_state | seed for train / test split. TYPE: Optional[int]
stratify_by_target | If True, data is stratified according to the target variable, following the sklearn convention. TYPE: bool
data_groups | an array holding the group index or name for each data point. The length of this array must be equal to the number of data points in the dataset. TYPE: Optional[Sequence]
kwargs | Additional keyword arguments to pass to the Dataset constructor. TYPE: Any

RETURNS | DESCRIPTION
---|---
GroupedDataset | Dataset with the selected sklearn data
Source code in src/pydvl/utils/dataset.py
from_arrays
classmethod
¶
from_arrays(
X: NDArray,
y: NDArray,
train_size: float = 0.8,
random_state: Optional[int] = None,
stratify_by_target: bool = False,
data_groups: Optional[Sequence] = None,
**kwargs: Any,
) -> Dataset
Constructs a GroupedDataset object from X and y numpy arrays
as returned by the make_*
functions in
scikit-learn generated datasets.
Example
>>> from sklearn.datasets import make_classification
>>> from pydvl.utils import GroupedDataset
>>> X, y = make_classification(
... n_samples=100,
... n_features=4,
... n_informative=2,
... n_redundant=0,
... random_state=0,
... shuffle=False
... )
>>> data_groups = X[:, 0] // 0.5
>>> dataset = GroupedDataset.from_arrays(X, y, data_groups=data_groups)
PARAMETER | DESCRIPTION
---|---
X | array of shape (n_samples, n_features) TYPE: NDArray
y | array of shape (n_samples,) TYPE: NDArray
train_size | size of the training dataset. Used in train_test_split. TYPE: float
random_state | seed for train / test split. TYPE: Optional[int]
stratify_by_target | If True, data is stratified according to the target variable, following the sklearn convention. TYPE: bool
data_groups | an array holding the group index or name for each data point. The length of this array must be equal to the number of data points in the dataset. TYPE: Optional[Sequence]
kwargs | Additional keyword arguments that will be passed to the Dataset constructor. TYPE: Any

RETURNS | DESCRIPTION
---|---
Dataset | Dataset with the passed X and y arrays split across training and test sets.
New in version 0.4.0
Changed in version 0.6.0
Added kwargs to pass to the Dataset constructor.
Source code in src/pydvl/utils/dataset.py
from_dataset
classmethod
¶
from_dataset(dataset: Dataset, data_groups: Sequence[Any]) -> GroupedDataset
Creates a GroupedDataset object from a Dataset object and a mapping of data groups.
Example
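A minimal sketch (the random data and the number of groups are illustrative):
>>> import numpy as np
>>> from pydvl.utils import Dataset, GroupedDataset
>>> X = np.random.rand(100, 3)
>>> y = np.random.randint(0, 2, 100)
>>> dataset = Dataset.from_arrays(X, y)
>>> data_groups = np.random.randint(0, 5, len(dataset))
>>> grouped = GroupedDataset.from_dataset(dataset, data_groups)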
PARAMETER | DESCRIPTION
---|---
dataset | The original data. TYPE: Dataset
data_groups | An array holding the group index or name for each data point. The length of this array must be equal to the number of data points in the dataset. TYPE: Sequence[Any]

RETURNS | DESCRIPTION
---|---
GroupedDataset | A GroupedDataset with the initial Dataset grouped by data_groups.
Source code in src/pydvl/utils/dataset.py
Progress
¶
Progress(iterable: Iterable[T], is_done: StoppingCriterion, **kwargs: Any)
Bases: Generic[T]
Displays an optional progress bar for an iterable, using StoppingCriterion.completion for the progress.
PARAMETER | DESCRIPTION
---|---
iterable | The iterable to wrap. TYPE: Iterable[T]
is_done | The stopping criterion. TYPE: StoppingCriterion
kwargs | Additional keyword arguments passed to tqdm. TYPE: Any
Source code in src/pydvl/utils/progress.py
Scorer
¶
Scorer(
scoring: Union[str, ScorerCallable],
default: float = nan,
range: Tuple = (-inf, inf),
name: Optional[str] = None,
)
A scoring callable that takes a model, data, and labels and returns a scalar.
PARAMETER | DESCRIPTION
---|---
scoring | Either a string or callable that can be passed to get_scorer. TYPE: Union[str, ScorerCallable]
default | score to be used when a model cannot be fit, e.g. when too little data is passed, or errors arise. TYPE: float
range | numerical range of the score function. Some Monte Carlo methods can use this to estimate the number of samples required for a certain quality of approximation. If not provided, it can be read from the scoring object if it provides it, for instance if it was constructed with compose_score(). TYPE: Tuple
name | The name of the scorer. If not provided, the name of the function passed will be used. TYPE: Optional[str]
New in version 0.5.0
Source code in src/pydvl/utils/score.py
Status
¶
Bases: Enum
Status of a computation.
Statuses can be combined using bitwise or (|) and bitwise and (&) to get the status of a combined computation. For example, if we have two computations, one that has converged and one that has failed, then the combined status is Status.Converged | Status.Failed == Status.Converged, but Status.Converged & Status.Failed == Status.Failed.
OR¶
The result of bitwise or-ing two valuation statuses with | is given by the following table:
 | P | C | F
---|---|---|---
P | P | C | P
C | C | C | C
F | P | C | F
where P = Pending, C = Converged, F = Failed.
AND¶
The result of bitwise and-ing two valuation statuses with & is given by the following table:
 | P | C | F
---|---|---|---
P | P | P | F
C | P | C | F
F | F | F | F
where P = Pending, C = Converged, F = Failed.
NOT¶
The result of bitwise negation of a Status with ~ is Failed if the status is Converged, or Converged otherwise.
Boolean casting¶
A Status evaluates to True iff it is Converged or Failed.
Warning
These truth values are inconsistent with the usual boolean operations. In particular, the XOR of two instances of Status is not the same as the XOR of their boolean values.
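A short sketch of the combination rules described above, assuming Status is importable from pydvl.utils:
>>> from pydvl.utils import Status
>>> Status.Converged | Status.Failed == Status.Converged
True
>>> Status.Converged & Status.Failed == Status.Failed
True
>>> ~Status.Converged == Status.Failed
True
>>> bool(Status.Pending), bool(Status.Converged)
(False, True)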
BaseModel
¶
SupervisedModel
¶
Bases: Protocol
This is the standard sklearn Protocol with the methods fit(), predict() and score().
fit
¶
predict
¶
BaggingModel
¶
Bases: Protocol
Any model with the attributes n_estimators and max_samples is considered a bagging model.
fit
¶
Utility
¶
Utility(
model: SupervisedModel,
data: Dataset,
scorer: Optional[Union[str, Scorer]] = None,
*,
default_score: float = 0.0,
score_range: Tuple[float, float] = (-inf, inf),
catch_errors: bool = True,
show_warnings: bool = False,
cache_backend: Optional[CacheBackend] = None,
cached_func_options: Optional[CachedFuncConfig] = None,
clone_before_fit: bool = True,
)
Convenience wrapper with configurable memoization of the scoring function.
An instance of Utility holds the triple of model, dataset and scoring function which determines the value of data points. This is used for the computation of all game-theoretic values like Shapley values and the Least Core.
The Utility expects the model to fulfill the SupervisedModel interface, i.e. to have fit(), predict(), and score() methods.
When calling the utility, the model will be cloned if it is a scikit-learn model, otherwise a copy is created using copy.deepcopy.
Since evaluating the scoring function requires retraining the model and that can be time-consuming, this class wraps it and caches the results of each execution. Caching is available both locally and across nodes, but must always be enabled for your project first, see the documentation and the module documentation.
ATTRIBUTE | DESCRIPTION
---|---
model | The supervised model. TYPE: SupervisedModel
data | An object containing the split data. TYPE: Dataset
scorer | A scoring function. If None, the score() method of the model will be used. TYPE: Scorer
PARAMETER | DESCRIPTION
---|---
model | Any supervised model. Typical choices can be found in the scikit-learn documentation. TYPE: SupervisedModel
data | Dataset or GroupedDataset instance. TYPE: Dataset
scorer | A scoring object. If None, the score() method of the model will be used. TYPE: Optional[Union[str, Scorer]]
default_score | As a convenience when no scorer object is passed (where a default value can be provided), this argument also allows setting the default score for models that have not been fit, e.g. when too little data is passed, or errors arise. TYPE: float
score_range | As with default_score, this is a convenience argument for when no scorer argument is provided, to set the numerical range of the score function. Some Monte Carlo methods can use this to estimate the number of samples required for a certain quality of approximation. TYPE: Tuple[float, float]
catch_errors | set to True to catch the errors when fit() fails. This could happen in several steps of the pipeline, e.g. when too little training data is passed, which happens often during Shapley value calculations. When this happens, the default_score is returned as a score and computation continues. TYPE: bool
show_warnings | Set to False to suppress warnings thrown by fit(). TYPE: bool
cache_backend | Optional instance of CacheBackend used to wrap the _utility method of the Utility instance. By default, this is set to None and that means that the utility evaluations will not be cached. TYPE: Optional[CacheBackend]
cached_func_options | Optional configuration object for cached utility evaluation. TYPE: Optional[CachedFuncConfig]
clone_before_fit | If True, the model will be cloned before calling fit(). TYPE: bool
Example
>>> from pydvl.utils import Utility, DataUtilityLearning, Dataset
>>> from sklearn.linear_model import LinearRegression, LogisticRegression
>>> from sklearn.datasets import load_iris
>>> dataset = Dataset.from_sklearn(load_iris(), random_state=16)
>>> u = Utility(LogisticRegression(random_state=16), dataset)
>>> u(dataset.indices)
0.9
With caching enabled:
>>> from pydvl.utils import Utility, DataUtilityLearning, Dataset
>>> from pydvl.utils.caching.memory import InMemoryCacheBackend
>>> from sklearn.linear_model import LinearRegression, LogisticRegression
>>> from sklearn.datasets import load_iris
>>> dataset = Dataset.from_sklearn(load_iris(), random_state=16)
>>> cache_backend = InMemoryCacheBackend()
>>> u = Utility(LogisticRegression(random_state=16), dataset, cache_backend=cache_backend)
>>> u(dataset.indices)
0.9
Source code in src/pydvl/utils/utility.py
cache_stats
property
¶
cache_stats: Optional[CacheStats]
Cache statistics are gathered when cache is enabled. See CacheStats for all fields returned.
DataUtilityLearning
¶
DataUtilityLearning(u: Utility, training_budget: int, model: SupervisedModel)
Implementation of Data Utility Learning (Wang et al., 2022)1.
This object wraps a Utility and delegates
calls to it, up until a given budget (number of iterations). Every tuple
of input and output (a so-called utility sample) is stored. Once the
budget is exhausted, DataUtilityLearning
fits the given model to the
utility samples. Subsequent calls will use the learned model to predict the
utility instead of delegating.
PARAMETER | DESCRIPTION
---|---
u | The Utility to learn. TYPE: Utility
training_budget | Number of utility samples to collect before fitting the given model. TYPE: int
model | A supervised regression model. TYPE: SupervisedModel
Example
>>> from pydvl.utils import Utility, DataUtilityLearning, Dataset
>>> from sklearn.linear_model import LinearRegression, LogisticRegression
>>> from sklearn.datasets import load_iris
>>> dataset = Dataset.from_sklearn(load_iris())
>>> u = Utility(LogisticRegression(), dataset)
>>> wrapped_u = DataUtilityLearning(u, 3, LinearRegression())
... # First 3 calls will be computed normally
>>> for i in range(3):
... _ = wrapped_u((i,))
>>> wrapped_u((1, 2, 3)) # Subsequent calls will be computed using the fit model for DUL
0.0
Source code in src/pydvl/utils/utility.py
maybe_add_argument
¶
Wraps a function to accept the given keyword parameter if it doesn't already.
If fun
already takes a keyword parameter of name new_arg
, then it is
returned as is. Otherwise, a wrapper is returned which merely ignores the
argument.
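A minimal sketch (the function and argument names are illustrative; the import path is assumed from the source location below):
>>> from pydvl.utils.functional import maybe_add_argument
>>> def f(x):
...     return x + 1
...
>>> g = maybe_add_argument(f, "job_id")
>>> g(1, job_id=42)  # job_id is accepted and ignored
2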
PARAMETER | DESCRIPTION
---|---
fun | The function to wrap TYPE: Callable
new_arg | The name of the argument that the new function will accept (and ignore). TYPE: str

RETURNS | DESCRIPTION
---|---
Callable | A new function accepting one more keyword argument.
Changed in version 0.7.0
Ability to work with partials.
Source code in src/pydvl/utils/functional.py
suppress_warnings
¶
suppress_warnings(
fun: Callable[P, R] | None = None,
*,
categories: Sequence[Type[Warning]] = (Warning,),
flag: str = "",
) -> Union[Callable[[Callable[P, R]], Callable[P, R]], Callable[P, R]]
Decorator for class methods to conditionally suppress warnings.
The decorated method will execute with warnings suppressed for the specified
categories. If the instance has the attribute named by flag
, and it evaluates to
True
, then suppression will be deactivated.
Suppress only UserWarning
Configuring behaviour at runtime
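A minimal sketch of both patterns above, assuming a class whose method raises warnings (the class name and the show_warnings attribute are illustrative):
>>> import warnings
>>> from pydvl.utils.functional import suppress_warnings
>>> class Estimator:
...     def __init__(self, show_warnings: bool = False):
...         self.show_warnings = show_warnings  # attribute checked via the flag argument
...     @suppress_warnings(categories=(UserWarning,), flag="show_warnings")
...     def fit(self):
...         warnings.warn("noisy fit", UserWarning)
...         return self
...
>>> _ = Estimator().fit()                    # UserWarning is suppressed
>>> _ = Estimator(show_warnings=True).fit()  # UserWarning is shown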
PARAMETER | DESCRIPTION
---|---
fun | Optional callable to decorate. If provided, the decorator is applied inline. TYPE: Callable[P, R] or None
categories | Sequence of warning categories to suppress. TYPE: Sequence[Type[Warning]]
flag | Name of an instance attribute to check for enabling warnings. If the attribute exists and evaluates to True, suppression is deactivated. TYPE: str

RETURNS | DESCRIPTION
---|---
Union[Callable[[Callable[P, R]], Callable[P, R]], Callable[P, R]] | Either a decorator (if no function is provided) or the decorated callable.
Source code in src/pydvl/utils/functional.py
timed
¶
timed(fun: Callable[P, R]) -> TimedCallable[P, R]
timed(
fun: Callable[P, R] | None = None,
*,
accumulate: bool = False,
logger: Logger | None = None,
) -> Union[
Callable[[Callable[P, R]], TimedCallable[P, R]], TimedCallable[P, R]
]
A decorator that measures the execution time of the wrapped function. Optionally logs the time taken.
Decorator usage
Inline usage
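A minimal sketch of both usages above (the wrapped functions are illustrative):
>>> import time
>>> from pydvl.utils.functional import timed
>>> @timed
... def slow_add(a: float, b: float) -> float:
...     time.sleep(0.01)
...     return a + b
...
>>> slow_add(1.0, 2.0)
3.0
>>> def mul(a: float, b: float) -> float:
...     return a * b
...
>>> timed_mul = timed(mul, accumulate=True)  # inline usage, accumulating time over calls
>>> timed_mul(2.0, 3.0)
6.0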
PARAMETER | DESCRIPTION
---|---
fun | The function to wrap. TYPE: Callable[P, R] or None
accumulate | If True, the execution time is accumulated over all calls instead of only storing that of the last call. TYPE: bool
logger | If provided, the execution time will be logged at the logger's level. TYPE: Logger or None

RETURNS | DESCRIPTION
---|---
Union[Callable[[Callable[P, R]], TimedCallable[P, R]], TimedCallable[P, R]] | A decorator that wraps a function, measuring and optionally logging its execution time. The wrapped function carries an attribute where either the time of the last execution or the accumulated total is stored.
Source code in src/pydvl/utils/functional.py
complement
¶
Returns the complement of the set of indices excluding the given indices.
PARAMETER | DESCRIPTION
---|---
include | The set of indices to consider. TYPE: NDArray[T]
exclude | The indices to exclude from the complement. These must be a subset of include. TYPE: NDArray[T]

RETURNS | DESCRIPTION
---|---
NDArray[T] | The complement of the set of indices excluding the given indices.
Source code in src/pydvl/utils/numeric.py
powerset
¶
powerset(s: NDArray[T]) -> Iterator[Collection[T]]
Returns an iterator for the power set of the argument.
Subsets are generated in sequence by growing size. See random_powerset() for random sampling.
Example
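A minimal sketch showing that subsets are generated by growing size:
>>> import numpy as np
>>> from pydvl.utils.numeric import powerset
>>> [len(subset) for subset in powerset(np.array([1, 2, 3]))]
[0, 1, 1, 1, 2, 2, 2, 3]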
PARAMETER | DESCRIPTION
---|---
s | The set to use TYPE: NDArray[T]

RETURNS | DESCRIPTION
---|---
Iterator[Collection[T]] | An iterator over all subsets of the set of indices s.
Source code in src/pydvl/utils/numeric.py
num_samples_permutation_hoeffding
¶
Lower bound on the number of samples required for MonteCarlo Shapley to obtain an (ε,δ)-approximation.
That is: with probability 1-δ, the estimated value for one data point will be ε-close to the true quantity, if at least this many permutations are sampled.
PARAMETER | DESCRIPTION
---|---
eps | ε > 0 TYPE: float
delta | 0 < δ <= 1 TYPE: float
u_range | Range of the Utility function TYPE: float

RETURNS | DESCRIPTION
---|---
int | Number of permutations required to guarantee ε-correct Shapley values with probability 1-δ.
Source code in src/pydvl/utils/numeric.py
random_subset
¶
Returns one subset at random from s.
PARAMETER | DESCRIPTION
---|---
s | set to sample from TYPE: NDArray[T]
q | Sampling probability for elements. The default 0.5 yields a uniform distribution over the power set of s. TYPE: float
seed | Either an instance of a numpy random number generator or a seed for it. TYPE: Optional[Seed]

RETURNS | DESCRIPTION
---|---
NDArray[T] | The sampled subset.
Source code in src/pydvl/utils/numeric.py
random_powerset
¶
random_powerset(
s: NDArray[T],
n_samples: Optional[int] = None,
q: float = 0.5,
seed: Optional[Seed] = None,
) -> Generator[NDArray[T], None, None]
Samples subsets from the power set of the argument, without pre-generating all subsets and in no order.
See powerset if you wish to deterministically generate all subsets.
To generate subsets, len(s)
Bernoulli draws with probability q
are
drawn. The default value of q = 0.5
provides a uniform distribution over
the power set of s
. Other choices can be used e.g. to implement
owen_sampling_shapley.
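A minimal sketch (the index range and number of samples are illustrative):
>>> import numpy as np
>>> from pydvl.utils.numeric import random_powerset
>>> indices = np.arange(10)
>>> subsets = list(random_powerset(indices, n_samples=5, q=0.5, seed=42))
>>> len(subsets)
5
>>> all(set(s) <= set(indices) for s in subsets)
True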
PARAMETER | DESCRIPTION
---|---
s | set to sample from TYPE: NDArray[T]
n_samples | if set, stop the generator after this many steps. If None, samples are generated indefinitely. TYPE: Optional[int]
q | Sampling probability for elements. The default 0.5 yields a uniform distribution over the power set of s. TYPE: float
seed | Either an instance of a numpy random number generator or a seed for it. TYPE: Optional[Seed]

RETURNS | DESCRIPTION
---|---
Generator[NDArray[T], None, None] | Samples from the power set of s.

RAISES | DESCRIPTION
---|---
ValueError | if the element sampling probability is not in [0,1]
Source code in src/pydvl/utils/numeric.py
random_powerset_label_min
¶
random_powerset_label_min(
s: NDArray[T],
labels: NDArray[int_],
min_elements_per_label: int = 1,
seed: Optional[Seed] = None,
) -> Generator[NDArray[T], None, None]
Draws random subsets from s
, while ensuring that at least
min_elements_per_label
elements per label are included in the draw. It can be used
for classification problems to ensure that a set contains information for all labels
(or not if min_elements_per_label=0
).
PARAMETER | DESCRIPTION
---|---
s | Set to sample from TYPE: NDArray[T]
labels | Labels for the samples TYPE: NDArray[int_]
min_elements_per_label | Minimum number of elements for each label. TYPE: int
seed | Either an instance of a numpy random number generator or a seed for it. TYPE: Optional[Seed]

RETURNS | DESCRIPTION
---|---
Generator[NDArray[T], None, None] | Generated draws from the powerset of s, with at least min_elements_per_label elements for each label.

RAISES | DESCRIPTION
---|---
ValueError | If min_elements_per_label is smaller than 0.
Source code in src/pydvl/utils/numeric.py
random_subset_of_size
¶
Samples a random subset of given size uniformly from the powerset of s.
PARAMETER | DESCRIPTION
---|---
s | Set to sample from TYPE: NDArray[T]
size | Size of the subset to generate TYPE: int
seed | Either an instance of a numpy random number generator or a seed for it. TYPE: Optional[Seed]

RETURNS | DESCRIPTION
---|---
NDArray[T] | The sampled subset.
Raises ValueError: If size > len(s)
Source code in src/pydvl/utils/numeric.py
random_matrix_with_condition_number
¶
random_matrix_with_condition_number(
n: int, condition_number: float, seed: Optional[Seed] = None
) -> NDArray
Constructs a square matrix with a given condition number.
Taken from: https://gist.github.com/bstellato/23322fe5d87bb71da922fbc41d658079#file-random_mat_condition_number-py
Also see: https://math.stackexchange.com/questions/1351616/condition-number-of-ata.
PARAMETER | DESCRIPTION
---|---
n | size of the matrix TYPE: int
condition_number | The desired condition number of the matrix. TYPE: float
seed | Either an instance of a numpy random number generator or a seed for it. TYPE: Optional[Seed]

RETURNS | DESCRIPTION
---|---
NDArray | An (n,n) matrix with the requested condition number.
Source code in src/pydvl/utils/numeric.py
running_moments
¶
running_moments(
previous_avg: float,
previous_variance: float,
count: int,
new_value: float,
unbiased: bool = True,
) -> tuple[float, float]
Calculates running average and variance of a series of numbers.
See Welford's algorithm in wikipedia
Warning
This is not really using Welford's correction for numerical stability for the variance. (FIXME)
Todo
This could be generalised to arbitrary moments. See this paper
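A minimal sketch of the intended streaming use; the check against numpy's mean is illustrative and assumes that count is the number of points seen before the new value:
>>> import numpy as np
>>> from pydvl.utils.numeric import running_moments
>>> values = [1.0, 2.0, 4.0, 8.0]
>>> avg, var = 0.0, 0.0
>>> for n, x in enumerate(values):
...     avg, var = running_moments(avg, var, n, x)
...
>>> np.isclose(avg, np.mean(values))
True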
PARAMETER | DESCRIPTION
---|---
previous_avg | average value at previous step. TYPE: float
previous_variance | variance at previous step. TYPE: float
count | number of points seen so far. TYPE: int
new_value | new value in the series of numbers. TYPE: float
unbiased | whether to use the unbiased (sample) variance estimator. TYPE: bool
Returns: new_average, new_variance, calculated with the new count
Source code in src/pydvl/utils/numeric.py
top_k_value_accuracy
¶
Computes the top-k accuracy for the estimated values by comparing indices of the highest k values.
PARAMETER | DESCRIPTION
---|---
y_true | Exact/true values TYPE: NDArray
y_pred | Predicted/estimated values TYPE: NDArray
k | Number of the highest values taken into account TYPE: int

RETURNS | DESCRIPTION
---|---
float | Accuracy
Source code in src/pydvl/utils/numeric.py
logcomb
¶
Computes the log of the binomial coefficient (n choose k).
PARAMETER | DESCRIPTION
---|---
n | Total number of elements TYPE: int
k | Number of elements to choose TYPE: int
Returns: The log of the binomial coefficient
Source code in src/pydvl/utils/numeric.py
logexp
¶
logsumexp_two
¶
Numerically stable computation of log(exp(log_a) + exp(log_b)).
Uses the standard log-sum-exp trick:
\[ \log(e^{\log a} + e^{\log b}) = m + \log\left(e^{\log a - m} + e^{\log b - m}\right), \]
where \(m = \max(\log a, \log b)\).
PARAMETER | DESCRIPTION
---|---
log_a | Log of the first value TYPE: float
log_b | Log of the second value TYPE: float
Returns: The log of the sum of the exponentials
Source code in src/pydvl/utils/numeric.py
log_running_moments
¶
log_running_moments(
previous_log_sum_pos: float,
previous_log_sum_neg: float,
previous_log_sum2: float,
count: int,
new_log_value: float,
new_sign: int,
unbiased: bool = True,
) -> tuple[float, float, float, float, float]
Update running moments when the new value is provided in log space, allowing for negative values via an explicit sign.
Here the actual value is x = new_sign * exp(new_log_value). Rather than updating the arithmetic sum S = sum(x) and S2 = sum(x^2) directly, we maintain:
L_S+ = log(sum_{i: x_i >= 0} x_i)
L_S- = log(sum_{i: x_i < 0} |x_i|)
L_S2 = log(sum_i x_i^2)
The running mean is then computed as:
mean = exp(L_S+) - exp(L_S-)
and the second moment is:
second_moment = exp(L_S2 - log(count))
so that the variance is:
variance = second_moment - mean^2
For the unbiased (sample) estimator, we scale the variance by count/(count-1) when count > 1 (and define variance = 0 when count == 1).
PARAMETER | DESCRIPTION
---|---
previous_log_sum_pos | running log(sum of positive contributions), or -inf if none. TYPE: float
previous_log_sum_neg | running log(sum of negative contributions in absolute value), or -inf if none. TYPE: float
previous_log_sum2 | running log(sum of squares) so far (or -inf if none). TYPE: float
count | number of points processed so far. TYPE: int
new_log_value | log(abs(x_new)), where x_new is the new value. TYPE: float
new_sign | sign of the new value (should be +1, 0, or -1). TYPE: int
unbiased | if True, compute the unbiased estimator of the variance. TYPE: bool

RETURNS | DESCRIPTION
---|---
new_mean | running mean in the linear domain. TYPE: float
new_variance | running variance in the linear domain. TYPE: float
new_log_sum_pos | updated running log(sum of positive contributions). TYPE: float
new_log_sum_neg | updated running log(sum of negative contributions). TYPE: float
new_log_sum2 | updated running log(sum of squares). TYPE: float
new_count | updated count.
Source code in src/pydvl/utils/numeric.py
repeat_indices
¶
repeat_indices(
indices: Collection[int],
result: ValuationResult,
done: StoppingCriterion,
**kwargs: Any,
) -> Iterator[int]
Helper function to cycle indefinitely over a collection of indices until the stopping criterion is satisfied while displaying progress.
PARAMETER | DESCRIPTION
---|---
indices | Collection of indices that will be cycled until done. TYPE: Collection[int]
result | Object containing the current results. TYPE: ValuationResult
done | Stopping criterion. TYPE: StoppingCriterion
kwargs | Keyword arguments passed to tqdm. TYPE: Any
Source code in src/pydvl/utils/progress.py
log_duration
¶
log_duration(_func=None, *, log_level=DEBUG)
Decorator to log execution time of a function with a configurable logging level. It can be used with or without specifying a log level.
Source code in src/pydvl/utils/progress.py
compose_score
¶
compose_score(
scorer: Scorer,
transformation: Callable[[float], float],
range: Tuple[float, float],
name: str,
) -> Scorer
Composes a scoring function with an arbitrary scalar transformation.
Useful to squash unbounded scores into ranges manageable by data valuation methods.
Example:
>>> import numpy as np
>>> from pydvl.utils.score import Scorer, compose_score
>>> sigmoid = lambda x: 1 / (1 + np.exp(-x))
>>> squashed = compose_score(Scorer("r2"), sigmoid, range=(0, 1), name="squashed r2")
PARAMETER | DESCRIPTION
---|---
scorer | The object to be composed. TYPE: Scorer
transformation | A scalar transformation. TYPE: Callable[[float], float]
range | The range of the transformation. This will be used e.g. by Utility for the range of the composed scorer. TYPE: Tuple[float, float]
name | A string representation for the composition, for str(). TYPE: str

RETURNS | DESCRIPTION
---|---
Scorer | The composite Scorer.
Source code in src/pydvl/utils/score.py
ensure_seed_sequence
¶
ensure_seed_sequence(
seed: Optional[Union[Seed, SeedSequence]] = None,
) -> SeedSequence
If the passed seed is a SeedSequence object then it is returned as is. If it is a Generator, the internal protected seed sequence is extracted from the generator. Otherwise, a new SeedSequence object is created from the passed (optional) seed.
PARAMETER | DESCRIPTION
---|---
seed | Either an int, a Generator object, a SeedSequence object, or None. TYPE: Optional[Union[Seed, SeedSequence]]

RETURNS | DESCRIPTION
---|---
SeedSequence | A SeedSequence object.
New in version 0.7.0