pydvl.valuation.utility
This module contains classes to manage and learn utility functions for the computation of values.
ModelUtility holds information about the model, the data and the scoring function (the latter being what one usually understands as the utility in the general definition of Shapley value). Model-based evaluation methods define the utility as the score of the model after retraining it on a subset of the data. Please see the documentation on Computing Data Values for more information.
Utility evaluations can be cached, including across machines, when a cache backend is configured and caching is enabled upon construction.
DataUtilityLearning adds support for learning the utility function itself, in order to avoid repeatedly re-training the model to compute the score. Several methods exist to learn the utility function.
ModelUtility
ModelUtility(
model: ModelT,
scorer: Scorer,
*,
catch_errors: bool = True,
show_warnings: bool = False,
cache_backend: CacheBackend | None = None,
cached_func_options: CachedFuncConfig | None = None,
clone_before_fit: bool = True,
)
Bases: UtilityBase[SampleT], Generic[SampleT, ModelT]
Convenience wrapper with configurable memoization of the utility.
An instance of ModelUtility holds the pair of model and scoring function that determines the value of data points. This is used for the computation of all game-theoretic values like Shapley values and the Least Core.
ModelUtility expects the model to fulfill at least the BaseModel interface, i.e. to have a fit() method.
When calling the utility, the model will be cloned if it is a scikit-learn model; otherwise a copy is created using copy.deepcopy.
Since evaluating the scoring function requires retraining the model, which can be time-consuming, this class wraps it and caches the results of each execution. Caching is available both locally and across nodes, but is disabled by default and must be explicitly enabled, since most stochastic methods do not benefit much from it. See the documentation on caching and the pydvl.utils.caching module documentation.
ATTRIBUTE | DESCRIPTION |
---|---|
model | The supervised model. TYPE: ModelT |
scorer | A scoring function. TYPE: Scorer |
PARAMETER | DESCRIPTION |
---|---|
model | Any supervised model. Typical choices can be found in the scikit-learn documentation. TYPE: ModelT |
scorer | A scoring object. TYPE: Scorer |
catch_errors | Set to True to catch errors raised while fitting the model (e.g. when too little training data is passed) instead of propagating them. TYPE: bool |
show_warnings | Set to True to display warnings raised during model fitting. TYPE: bool |
cache_backend | Optional instance of CacheBackend used to memoize results and avoid duplicate computation. Note, however, that for most stochastic methods cache hits are rare, making the memory expense of caching not worth it (YMMV). TYPE: CacheBackend \| None |
cached_func_options | Optional configuration object for cached utility evaluation. TYPE: CachedFuncConfig \| None |
clone_before_fit | If True, the model is cloned before fitting. TYPE: bool |
Example
>>> from pydvl.valuation.utility import ModelUtility, DataUtilityLearning
>>> from pydvl.valuation.dataset import Dataset
>>> from pydvl.valuation import Sample
>>> from sklearn.linear_model import LinearRegression, LogisticRegression
>>> from sklearn.datasets import load_iris
>>> train, test = Dataset.from_sklearn(load_iris(), random_state=16)
>>> u = ModelUtility(LogisticRegression(random_state=16), Scorer("accuracy"))
>>> u(Sample(subset=train.indices))
0.9
With caching enabled:
>>> from pydvl.valuation.utility import ModelUtility, DataUtilityLearning
>>> from pydvl.valuation.dataset import Dataset
>>> from pydvl.valuation import Sample
>>> from pydvl.utils.caching.memory import InMemoryCacheBackend
>>> from sklearn.linear_model import LinearRegression, LogisticRegression
>>> from sklearn.datasets import load_iris
>>> train, test = Dataset.from_sklearn(load_iris(), random_state=16)
>>> cache_backend = InMemoryCacheBackend()
>>> u = ModelUtility(LogisticRegression(random_state=16), Scorer("accuracy"), cache_backend=cache_backend)
>>> u(Sample(subset=train.indices))
0.9
Source code in src/pydvl/valuation/utility/modelutility.py
training_data property
training_data: Dataset | None
Retrieves the training data used by this utility.
This property is read-only. In order to set it, use with_dataset().
cache_stats property
cache_stats: CacheStats | None
Cache statistics are gathered when cache is enabled. See CacheStats for all fields returned.
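For instance, continuing the caching example above, the gathered statistics can be inspected through this property (a small sketch):
stats = u.cache_stats       # None unless a cache backend was configured
if stats is not None:
    print(stats)            # see CacheStats for the individual fields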
with_dataset
Returns the utility, or a copy of it, with the given dataset.

PARAMETER | DESCRIPTION |
---|---|
data | The dataset to use for utility fitting (training data). TYPE: Dataset |
copy | Whether to copy the utility object or not. Valuation methods should always make copies to avoid unexpected side effects. TYPE: bool |

Returns: The utility object.
Source code in src/pydvl/valuation/utility/base.py
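A small usage sketch, continuing the Iris example above (u and train as defined there; the copy parameter is the one documented here):
u = u.with_dataset(train)        # attach the training data; pass copy=True to work on a copy
print(u.training_data)           # the read-only property now returns the Dataset
u(Sample(subset=train.indices))  # evaluate the utility on a subset of the attached data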
__call__
__call__(sample: SampleT | None) -> float
PARAMETER | DESCRIPTION |
---|---|
sample | Contains a subset of valid indices for the training data. TYPE: SampleT \| None |
Source code in src/pydvl/valuation/utility/modelutility.py
ClasswiseModelUtility
ClasswiseModelUtility(
model: SupervisedModel,
scorer: ClasswiseSupervisedScorer,
*,
catch_errors: bool = True,
show_warnings: bool = False,
cache_backend: CacheBackend | None = None,
cached_func_options: CachedFuncConfig | None = None,
clone_before_fit: bool = True,
)
Bases: ModelUtility[ClasswiseSample, SupervisedModel]
ModelUtility class that is specific to classwise Shapley valuation.
It expects a classwise scorer and a classification task.
PARAMETER | DESCRIPTION |
---|---|
model | Any supervised model. Typical choices can be found in the scikit-learn documentation. TYPE: SupervisedModel |
scorer | A class-wise scoring object. TYPE: ClasswiseSupervisedScorer |
catch_errors | Set to True to catch errors raised while fitting the model instead of propagating them. TYPE: bool |
show_warnings | Set to True to display warnings raised during model fitting. TYPE: bool |
cache_backend | Optional instance of CacheBackend used to wrap the _utility method of the Utility instance. By default this is set to None, which means that utility evaluations will not be cached. TYPE: CacheBackend \| None |
cached_func_options | Optional configuration object for cached utility evaluation. TYPE: CachedFuncConfig \| None |
clone_before_fit | If True, the model is cloned before fitting. TYPE: bool |
Source code in src/pydvl/valuation/utility/classwise.py
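A construction sketch; the ClasswiseSupervisedScorer arguments and import paths below are assumptions and should be checked against their documentation:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from pydvl.valuation.dataset import Dataset
from pydvl.valuation.scorers import ClasswiseSupervisedScorer  # assumed import path
from pydvl.valuation.utility import ClasswiseModelUtility
train, test = Dataset.from_sklearn(load_iris(), random_state=16)
scorer = ClasswiseSupervisedScorer("accuracy", test)  # assumed constructor arguments
u = ClasswiseModelUtility(LogisticRegression(), scorer).with_dataset(train)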
training_data property
training_data: Dataset | None
Retrieves the training data used by this utility.
This property is read-only. In order to set it, use with_dataset().
cache_stats property
cache_stats: CacheStats | None
Cache statistics are gathered when cache is enabled. See CacheStats for all fields returned.
with_dataset
Returns the utility, or a copy of it, with the given dataset.

PARAMETER | DESCRIPTION |
---|---|
data | The dataset to use for utility fitting (training data). TYPE: Dataset |
copy | Whether to copy the utility object or not. Valuation methods should always make copies to avoid unexpected side effects. TYPE: bool |

Returns: The utility object.
Source code in src/pydvl/valuation/utility/base.py
__call__
__call__(sample: SampleT | None) -> float
PARAMETER | DESCRIPTION |
---|---|
sample | Contains a subset of valid indices for the training data. TYPE: SampleT \| None |
Source code in src/pydvl/valuation/utility/modelutility.py
KNNClassifierUtility
KNNClassifierUtility(
model: KNeighborsClassifier,
test_data: Dataset,
*,
catch_errors: bool = True,
show_warnings: bool = False,
cache_backend: CacheBackend | None = None,
cached_func_options: CachedFuncConfig | None = None,
clone_before_fit: bool = True,
)
Bases: ModelUtility[Sample, KNeighborsClassifier]
Utility object for KNN Classifiers.
The utility function is the model's predicted probability for the true class.
Uses of this utility
Although this class can be used in conjunction with any semi-value method and sampler, when computing Shapley values, it is recommended to use the dedicated class KNNShapleyValuation, because it implements a more efficient algorithm for computing Shapley values which runs in O(n log n) time for each test point.
PARAMETER | DESCRIPTION |
---|---|
model | A KNN classifier model. TYPE: KNeighborsClassifier |
test_data | The test data to evaluate the model on. TYPE: Dataset |
catch_errors | Set to True to catch errors raised while fitting the model instead of propagating them. TYPE: bool |
show_warnings | Set to True to display warnings raised during model fitting. TYPE: bool |
cache_backend | Optional instance of CacheBackend used to wrap the _utility method of the Utility instance. By default this is set to None, which means that utility evaluations will not be cached. TYPE: CacheBackend \| None |
cached_func_options | Optional configuration object for cached utility evaluation. TYPE: CachedFuncConfig \| None |
clone_before_fit | If True, the model is cloned before fitting. TYPE: bool |
Source code in src/pydvl/valuation/utility/knn.py
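A construction sketch, assuming the class is importable from pydvl.valuation.utility as this page suggests:
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from pydvl.valuation.dataset import Dataset
from pydvl.valuation.utility import KNNClassifierUtility  # assumed re-export
train, test = Dataset.from_sklearn(load_iris(), random_state=16)
u = KNNClassifierUtility(KNeighborsClassifier(n_neighbors=5), test)
u = u.with_dataset(train)  # for this class, this also fits the model on the training data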
training_data property
training_data: Dataset | None
Retrieves the training data used by this utility.
This property is read-only. In order to set it, use with_dataset().
cache_stats property
cache_stats: CacheStats | None
Cache statistics are gathered when cache is enabled. See CacheStats for all fields returned.
__call__
__call__(sample: SampleT | None) -> float
PARAMETER | DESCRIPTION |
---|---|
sample | Contains a subset of valid indices for the training data. TYPE: SampleT \| None |
Source code in src/pydvl/valuation/utility/modelutility.py
with_dataset
Return the utility, or a copy of it, with the given dataset and the model fitted on it.
PARAMETER | DESCRIPTION |
---|---|
data | The dataset to use. TYPE: Dataset |
copy | Whether to copy the utility object or not. TYPE: bool |

Returns: The utility object.
Source code in src/pydvl/valuation/utility/knn.py
UtilityModel
Bases: ABC
Interface for utility models.
A utility model predicts the value of a utility function given a sample. The model is trained on a collection of samples and their respective utility values. These tuples are called Utility Samples.
Utility models:
- are fitted on dictionaries of Sample -> utility value
- predict: Collection[samples] -> NDArray[utility values]
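As a sketch of what an implementation of this interface could look like, based only on the description above (method names and import paths are assumptions; check the source for the exact abstract signatures):
from collections.abc import Collection
import numpy as np
from numpy.typing import NDArray
from pydvl.valuation.types import Sample                   # assumed import path
from pydvl.valuation.utility.learning import UtilityModel  # assumed import path

class MeanUtilityModel(UtilityModel):
    """Toy utility model: predicts the mean of the utilities seen during fitting."""
    def __init__(self) -> None:
        self.mean_ = 0.0
    def fit(self, samples: dict[Sample, float]) -> "MeanUtilityModel":
        # fitted on a dictionary of Sample -> utility value, as described above
        if samples:
            self.mean_ = float(np.mean(list(samples.values())))
        return self
    def predict(self, samples: Collection[Sample]) -> NDArray:
        # Collection[samples] -> NDArray[utility values]
        return np.full(len(samples), self.mean_)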
IndicatorUtilityModel
IndicatorUtilityModel(predictor: SupervisedModel, n_data: int)
Bases: UtilityModel
A simple wrapper for arbitrary predictors.
Uses 1-hot encoding of the indices as input for the model, as done in Wang et al. (2022).
Source code in src/pydvl/valuation/utility/learning.py
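The idea behind the encoding can be illustrated in a few lines (a sketch of the principle, not of the class's internals):
import numpy as np
n_data = 10                   # size of the training set
subset = np.array([1, 4, 7])  # indices appearing in a utility sample
x = np.zeros((1, n_data))
x[0, subset] = 1.0            # 1-hot (indicator) row encoding the subset
# Rows like x, paired with the observed utility values, are what the wrapped
# predictor is fitted on; new subsets are encoded the same way for prediction.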
DataUtilityLearning
DataUtilityLearning(
utility: UtilityBase,
training_budget: int,
model: UtilityModel,
show_warnings: bool = True,
)
Bases: UtilityBase[SampleT]
This object wraps any class derived from
UtilityBase and delegates calls to it,
up until a given budget (number of iterations). Every tuple of input and output (a
so-called utility sample) is stored. Once the budget is exhausted,
DataUtilityLearning
fits the given model to the utility samples. Subsequent
calls will use the learned model to predict the utility instead of delegating.
PARAMETER | DESCRIPTION |
---|---|
utility | The utility to learn. Typically, this will be a ModelUtility object encapsulating a machine learning model which requires fitting on each evaluation of the utility. TYPE: UtilityBase |
training_budget | Number of utility samples to collect before fitting the given model. TYPE: int |
model | A supervised regression model. TYPE: UtilityModel |
Example
import numpy as np
from pydvl.valuation import Dataset, DataUtilityLearning, ModelUtility, Sample, SupervisedScorer
from pydvl.valuation.utility.learning import IndicatorUtilityModel
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.datasets import load_iris
train, test = Dataset.from_sklearn(load_iris())
scorer = SupervisedScorer("accuracy", test, 0, (0, 1))
utility = ModelUtility(LinearRegression(), scorer)
utility_model = IndicatorUtilityModel(LinearRegression(), len(train))
dul = DataUtilityLearning(utility, 3, utility_model)
# The first 3 calls will be computed normally
for i in range(3):
    _ = dul(Sample(0, np.array([])))
# Subsequent calls will use the fitted utility_model to predict the utility
dul(Sample(0, np.array([1, 2, 3])))
Source code in src/pydvl/valuation/utility/learning.py
training_data property
training_data: Dataset | None
Retrieves the training data used by this utility.
This property is read-only. In order to set it, use with_dataset().
with_dataset
Returns the utility, or a copy of it, with the given dataset.

PARAMETER | DESCRIPTION |
---|---|
data | The dataset to use for utility fitting (training data). TYPE: Dataset |
copy | Whether to copy the utility object or not. Valuation methods should always make copies to avoid unexpected side effects. TYPE: bool |

Returns: The utility object.