pydvl.utils.utility
This module contains classes to manage and learn utility functions for the computation of values. Please see the documentation on Computing Data Values for more information.
Utility holds information about the model, the data and the scoring function (the latter being what is usually understood as the utility in the general definition of the Shapley value). Its evaluations are cached automatically, also across machines, when a cache is configured and enabled upon construction.
DataUtilityLearning adds support for learning the scoring function to avoid repeated re-training of the model to compute the score.
This module also contains derived Utility classes for toy games that are used for testing and for demonstration purposes.
References
1. Wang, T., Yang, Y. and Jia, R., 2021. Improving cooperative game theory-based data valuation via data utility learning. arXiv preprint arXiv:2107.06336.
Utility
Utility(
    model: SupervisedModel,
    data: Dataset,
    scorer: Optional[Union[str, Scorer]] = None,
    *,
    default_score: float = 0.0,
    score_range: Tuple[float, float] = (-np.inf, np.inf),
    catch_errors: bool = True,
    show_warnings: bool = False,
    cache_backend: Optional[CacheBackend] = None,
    cached_func_options: Optional[CachedFuncConfig] = None,
    clone_before_fit: bool = True
)
Convenience wrapper with configurable memoization of the scoring function.
An instance of Utility holds the triple of model, dataset and scoring function which determines the value of data points. This is used for the computation of all game-theoretic values like Shapley values and the Least Core.
The Utility expects the model to fulfill the SupervisedModel interface, i.e. to have fit(), predict(), and score() methods.
When the utility is called, the model is cloned if it is a scikit-learn model; otherwise a copy is created with copy.deepcopy.
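The interface and the copy-before-fit behaviour described above can be sketched in plain Python. This is only an illustration under stated assumptions, not pyDVL's implementation: `SupervisedModelLike`, `MeanRegressor` and `fit_on_copy` are hypothetical names, and `copy.deepcopy` stands in for the general (non-scikit-learn) case.

```python
import copy
from typing import List, Protocol, Sequence, runtime_checkable


@runtime_checkable
class SupervisedModelLike(Protocol):
    """Hypothetical stand-in for the interface: fit(), predict(), score()."""

    def fit(self, x: Sequence[float], y: Sequence[float]) -> None: ...

    def predict(self, x: Sequence[float]) -> List[float]: ...

    def score(self, x: Sequence[float], y: Sequence[float]) -> float: ...


class MeanRegressor:
    """Toy model (an assumption for illustration): predicts the training mean."""

    def __init__(self) -> None:
        self.mean_ = None  # set by fit()

    def fit(self, x, y):
        self.mean_ = sum(y) / len(y)

    def predict(self, x):
        return [self.mean_ for _ in x]

    def score(self, x, y):
        # Negative mean squared error, so that higher is better.
        preds = self.predict(x)
        return -sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)


def fit_on_copy(model, x, y):
    """Fit a deep copy so that the caller's model is left untouched."""
    fitted = copy.deepcopy(model)
    fitted.fit(x, y)
    return fitted


template = MeanRegressor()
fitted = fit_on_copy(template, [0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
```

Because the copy is fitted rather than the original, `template` stays unfitted and can be reused for every subset that is evaluated.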
Evaluating the scoring function requires retraining the model, which can be time-consuming, so this class wraps the evaluation and caches the result of each execution. Caching is available both locally and across nodes, but it must always be enabled for your project first; see the caching documentation and the caching module documentation.
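The idea of caching one score per subset of indices can be sketched as follows. This is a minimal, local-only illustration: `MemoizedUtility` and `slow_score` are hypothetical names, and the keys and backends that pyDVL actually uses may differ.

```python
from typing import Callable, Dict, FrozenSet, Iterable


class MemoizedUtility:
    """Hypothetical wrapper: caches one score per distinct set of indices."""

    def __init__(self, evaluate: Callable[[FrozenSet[int]], float]) -> None:
        self._evaluate = evaluate  # expensive: would retrain and score a model
        self._cache: Dict[FrozenSet[int], float] = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, indices: Iterable[int]) -> float:
        key = frozenset(indices)  # order and duplicates do not matter
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        value = self._evaluate(key)
        self._cache[key] = value
        return value


def slow_score(indices: FrozenSet[int]) -> float:
    # Stand-in for "fit on the subset, then score": here just the subset size.
    return float(len(indices))


u = MemoizedUtility(slow_score)
first = u([2, 0, 1])
second = u((0, 1, 2))  # same subset, served from the cache
```

Keying on a `frozenset` means that two calls with the same indices in a different order share one cache entry, so the second call above does not trigger a new evaluation.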
ATTRIBUTE | DESCRIPTION |
---|---|
model | The supervised model. TYPE: SupervisedModel |
data | An object containing the split data. TYPE: Dataset |
scorer | A scoring function. If None, the … TYPE: Scorer |
PARAMETER | DESCRIPTION |
---|---|
model | Any supervised model. Typical choices can be found in the [scikit-learn documentation](https://scikit-learn.org/stable/supervised_learning.html). TYPE: SupervisedModel |
data | Dataset or GroupedDataset instance. TYPE: Dataset |
scorer | TYPE: Optional[Union[str, Scorer]] |
default_score | As a convenience when no … TYPE: float |
score_range | As with … TYPE: Tuple[float, float] |
catch_errors | Set to … TYPE: bool |
show_warnings | Set to … TYPE: bool |
cache_backend | Optional instance of CacheBackend used to wrap the _utility method of the Utility instance. Defaults to None, in which case utility evaluations are not cached. TYPE: Optional[CacheBackend] |
cached_func_options | Optional configuration object for cached utility evaluation. TYPE: Optional[CachedFuncConfig] |
clone_before_fit | If … TYPE: bool |
Example
>>> from pydvl.utils import Utility, DataUtilityLearning, Dataset
>>> from sklearn.linear_model import LinearRegression, LogisticRegression
>>> from sklearn.datasets import load_iris
>>> dataset = Dataset.from_sklearn(load_iris(), random_state=16)
>>> u = Utility(LogisticRegression(random_state=16), dataset)
>>> u(dataset.indices)
0.9
With caching enabled:
>>> from pydvl.utils import Utility, DataUtilityLearning, Dataset
>>> from pydvl.utils.caching.memory import InMemoryCacheBackend
>>> from sklearn.linear_model import LinearRegression, LogisticRegression
>>> from sklearn.datasets import load_iris
>>> dataset = Dataset.from_sklearn(load_iris(), random_state=16)
>>> cache_backend = InMemoryCacheBackend()
>>> u = Utility(LogisticRegression(random_state=16), dataset, cache_backend=cache_backend)
>>> u(dataset.indices)
0.9
Source code in src/pydvl/utils/utility.py
cache_stats property
cache_stats: Optional[CacheStats]
Cache statistics are gathered when the cache is enabled. See CacheStats for all fields returned.
DataUtilityLearning
DataUtilityLearning(u: Utility, training_budget: int, model: SupervisedModel)
Implementation of Data Utility Learning (Wang et al., 2021) [1].
This object wraps a Utility and delegates calls to it, up until a given budget (number of iterations). Every tuple of input and output (a so-called utility sample) is stored. Once the budget is exhausted, DataUtilityLearning fits the given model to the utility samples. Subsequent calls will use the learned model to predict the utility instead of delegating.
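The delegate-then-predict behaviour described above can be sketched with the standard library only. This is an illustration under stated assumptions, not pyDVL's code: `BudgetedUtilityLearner`, `MeanModel` and `true_utility` are hypothetical names, subsets are encoded as 0/1 indicator vectors, and the model is fitted lazily on the first call after the budget is used up.

```python
from typing import Callable, List, Sequence, Tuple


class MeanModel:
    """Toy regressor (an assumption for illustration): predicts the mean target."""

    def __init__(self) -> None:
        self.mean_ = 0.0

    def fit(self, x: Sequence[Sequence[int]], y: Sequence[float]) -> None:
        self.mean_ = sum(y) / len(y)

    def predict(self, x: Sequence[Sequence[int]]) -> List[float]:
        return [self.mean_ for _ in x]


class BudgetedUtilityLearner:
    """Hypothetical sketch: delegate for `budget` calls, then predict."""

    def __init__(self, utility: Callable[[Tuple[int, ...]], float],
                 budget: int, model: MeanModel, n_points: int) -> None:
        self.utility = utility
        self.budget = budget
        self.model = model
        self.n_points = n_points
        self.samples: List[Tuple[List[int], float]] = []
        self.fitted = False

    def _encode(self, indices: Sequence[int]) -> List[int]:
        # Represent a subset as a 0/1 indicator vector over all data points.
        chosen = set(indices)
        return [1 if i in chosen else 0 for i in range(self.n_points)]

    def __call__(self, indices: Sequence[int]) -> float:
        x = self._encode(indices)
        if len(self.samples) < self.budget:
            y = self.utility(tuple(indices))  # true (expensive) utility
            self.samples.append((x, y))  # store the utility sample
            return y
        if not self.fitted:
            # Budget exhausted: fit the model once on all stored samples.
            self.model.fit([s[0] for s in self.samples],
                           [s[1] for s in self.samples])
            self.fitted = True
        return self.model.predict([x])[0]


calls = []


def true_utility(indices):
    calls.append(indices)  # record every real evaluation
    return float(len(indices))


learner = BudgetedUtilityLearner(true_utility, budget=3,
                                 model=MeanModel(), n_points=4)
direct = [learner((0,)), learner((0, 1)), learner((0, 1, 2))]
predicted = learner((1, 2, 3))  # uses the fitted model, not true_utility
```

With a real regression model in place of `MeanModel`, the prediction would depend on the encoded subset; the toy model only serves to show that the fourth call no longer reaches the wrapped utility.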
PARAMETER | DESCRIPTION |
---|---|
u | The Utility to learn. TYPE: Utility |
training_budget | Number of utility samples to collect before fitting the given model. TYPE: int |
model | A supervised regression model. TYPE: SupervisedModel |
Example
>>> from pydvl.utils import Utility, DataUtilityLearning, Dataset
>>> from sklearn.linear_model import LinearRegression, LogisticRegression
>>> from sklearn.datasets import load_iris
>>> dataset = Dataset.from_sklearn(load_iris())
>>> u = Utility(LogisticRegression(), dataset)
>>> wrapped_u = DataUtilityLearning(u, 3, LinearRegression())
>>> # The first 3 calls are computed normally
>>> for i in range(3):
...     _ = wrapped_u((i,))
>>> wrapped_u((1, 2, 3))  # Subsequent calls are predicted by the model fitted for DUL
0.0