pydvl.valuation.utility.learning
This module implements Data Utility Learning (Wang et al., 2022)[1].
DUL uses an ML model to learn the utility function. Essentially, it learns to predict the performance of a model when trained on a given set of indices from the dataset. The cost of training this model is quickly amortized by avoiding costly re-evaluations of the original utility.
Usage is through the DataUtilityLearning class, which wraps any utility function and a UtilityModel to learn it. The wrapper collects utility samples until a given budget is reached, and then fits the model. After that, it forwards any queries for utility values to the learned model, which predicts the utility of new samples at a constant, low cost.
See the documentation for more information.
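For orientation, here is a minimal usage sketch. It is not taken verbatim from the library: the import paths, the SupervisedScorer signature and the behaviour of Dataset.from_sklearn and with_dataset are assumptions that may differ slightly across pyDVL versions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

from pydvl.valuation.dataset import Dataset
from pydvl.valuation.scorers import SupervisedScorer
from pydvl.valuation.types import Sample
from pydvl.valuation.utility import ModelUtility
from pydvl.valuation.utility.learning import DataUtilityLearning, IndicatorUtilityModel

train, test = Dataset.from_sklearn(load_iris())

# The expensive utility: retrains a classifier for every subset it is queried with.
scorer = SupervisedScorer("accuracy", test, default=0.0, range=(0, 1))
utility = ModelUtility(LogisticRegression(), scorer)

# The cheap surrogate: a regressor over 1-hot encoded index subsets.
utility_model = IndicatorUtilityModel(LinearRegression(), n_data=len(train))

dul = DataUtilityLearning(utility, training_budget=300, model=utility_model)
dul = dul.with_dataset(train)  # a valuation method would normally attach the data itself

# The first 300 evaluations train the classifier and are stored as utility samples;
# afterwards the fitted surrogate answers queries at negligible cost.
u = dul(Sample(None, np.arange(10)))
```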
Todo
DUL does not yet support parallel training of the utility model; this is a limitation of the current architecture. Additionally, batching of utility evaluations should be added to fully benefit from neural network architectures.
References

1. Wang, T., Yang, Y. and Jia, R., 2021. Improving cooperative game theory-based data valuation via data utility learning. arXiv preprint arXiv:2107.06336.
DataUtilityLearning
DataUtilityLearning(
utility: UtilityBase,
training_budget: int,
model: UtilityModel,
show_warnings: bool = True,
)
Bases: UtilityBase[SampleT]
This object wraps any class derived from UtilityBase and delegates calls to it, up until a given budget (number of iterations). Every tuple of input and output (a so-called utility sample) is stored. Once the budget is exhausted, DataUtilityLearning fits the given model to the utility samples. Subsequent calls use the learned model to predict the utility instead of delegating.
PARAMETER | DESCRIPTION | TYPE
---|---|---
utility | The utility to learn. Typically, this will be a ModelUtility object encapsulating a machine learning model which requires fitting on each evaluation of the utility. | UtilityBase
training_budget | Number of utility samples to collect before fitting the given model. | int
model | A wrapper for a supervised model that can be trained on a collection of utility samples. | UtilityModel
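Schematically, a call to the wrapper behaves as follows. This is an illustration of the behaviour described above, not the actual source; attribute names such as _utility_samples and _is_fitted are hypothetical.

```python
# Illustrative pseudocode for DataUtilityLearning.__call__ (hypothetical attribute names)
def __call__(self, sample):
    if len(self._utility_samples) < self.training_budget:
        u = self.utility(sample)             # expensive: evaluates the wrapped utility
        self._utility_samples[sample] = u    # store the (sample, utility) pair
        return u
    if not self._is_fitted:
        self.model.fit(self._utility_samples)  # fit the utility model once on all samples
        self._is_fitted = True
    return self.model.predict([sample])[0]   # cheap: predicted utility
```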
training_data property
training_data: Dataset | None
Retrieves the training data used by this utility.
This property is read-only. In order to set it, use with_dataset().
__str__
Returns a string representation of the utility. Subclasses should override this method to provide a more informative string.
with_dataset
Returns the utility, or a copy of it, with the given dataset.

PARAMETER | DESCRIPTION
---|---
data | The dataset to use for utility fitting (training data).
copy | Whether to copy the utility object or not. Valuation methods should always make copies to avoid unexpected side effects.

Returns: The utility object.
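For example (a sketch only; model, scorer and train stand in for objects created elsewhere):

```python
utility = ModelUtility(model, scorer)   # no training data attached yet
utility = utility.with_dataset(train)   # returns the utility, or a copy, bound to `train`
```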
IndicatorUtilityModel
IndicatorUtilityModel(predictor: SupervisedModel, n_data: int)
Bases: UtilityModel
A simple wrapper for arbitrary predictors.
Uses 1-hot encoding of the indices as input for the model, as done in Wang et al. (2022)[1]. This encoding can be fed to any regressor; a sketch of the idea follows the parameter table below.
PARAMETER | DESCRIPTION | TYPE
---|---|---
predictor | A supervised model that implements the SupervisedModel interface. | SupervisedModel
n_data | Number of indices in the dataset. This is used to create the input matrix for the model. | int
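The encoding amounts to turning every index subset into a binary row of length n_data. A rough sketch of the idea (illustrative only; it assumes each Sample exposes its index subset as sample.subset):

```python
import numpy as np

def one_hot_rows(samples, n_data):
    """Build the input matrix: one row per sample, 1.0 where an index is present."""
    x = np.zeros((len(samples), n_data))
    for i, sample in enumerate(samples):
        x[i, sample.subset] = 1.0
    return x

# With such a matrix, learning the utility reduces to ordinary regression:
# predictor.fit(one_hot_rows(samples, n_data), utility_values)
# predictor.predict(one_hot_rows(new_samples, n_data))
```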
UtilityModel
Bases: ABC
Interface for utility models.
A utility model predicts the value of a utility function given a sample. The model is trained on a collection of samples and their respective utility values. These tuples are called utility samples.
Utility models:
- are fitted on dictionaries of Sample -> utility value
- predict: Collection[samples] -> NDArray[utility values]
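As a sketch of how the interface can be implemented, the class below regresses the utility on the subset size only, which is a toy choice. The fit/predict signatures and the Sample.subset attribute are assumptions based on the description above, not the library's exact API.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

from pydvl.valuation.utility.learning import UtilityModel


class SubsetSizeUtilityModel(UtilityModel):
    """Toy utility model: predicts utility from the size of the index subset alone."""

    def __init__(self):
        self.regressor = GradientBoostingRegressor()

    def fit(self, samples):
        # samples: dict mapping Sample -> utility value (assumed)
        sizes = np.array([[len(s.subset)] for s in samples], dtype=float)
        values = np.fromiter(samples.values(), dtype=float)
        self.regressor.fit(sizes, values)

    def predict(self, samples):
        # samples: collection of Sample objects; returns an array of predicted utilities
        sizes = np.array([[len(s.subset)] for s in samples], dtype=float)
        return self.regressor.predict(sizes)
```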