pydvl.valuation.utility.learning
This module implements Data Utility Learning (Wang et al., 2022) [1].
Data Utility Learning accelerates data valuation by learning the utility function from a small number of subsets. The process is as follows:
- Collect a small number of so-called utility samples (subsets and their utility values) during the normal course of data valuation.
- Fit a model to the utility samples. The model is trained to predict the utility of new subsets.
- Continue the valuation process, sampling subsets, but instead of evaluating the original utility function, use the learned model to predict it.
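The three steps above can be sketched without pyDVL at all. The following is a minimal illustration, assuming a toy additive utility and a linear model over indicator vectors fitted by plain gradient descent. All names (`true_utility`, `encode`, `fit_linear`, `predict`) are hypothetical and not part of the library.

```python
import random


def true_utility(subset, weights):
    """Stand-in for an expensive utility: a sum of per-point weights."""
    return sum(weights[i] for i in subset)


def encode(subset, n):
    """Indicator vector of a subset of range(n)."""
    members = set(subset)
    return [1.0 if i in members else 0.0 for i in range(n)]


def fit_linear(samples, n, lr=0.2, steps=4000):
    """Fit u(S) ~ w . 1_S + b by batch gradient descent on squared error."""
    w, b = [0.0] * n, 0.0
    xs = [encode(s, n) for s, _ in samples]
    ys = [u for _, u in samples]
    m = len(samples)
    for _ in range(steps):
        errors = [
            sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for x, y in zip(xs, ys)
        ]
        for j in range(n):
            w[j] -= lr * sum(e * x[j] for e, x in zip(errors, xs)) / m
        b -= lr * sum(errors) / m
    return w, b


def predict(model, subset, n):
    """Predict the utility of a subset with the fitted linear model."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, encode(subset, n))) + b


rng = random.Random(0)
n = 5
weights = [0.5, -0.2, 0.1, 0.3, 0.05]

# Step 1: collect utility samples by evaluating the real utility
samples = []
for _ in range(40):
    subset = tuple(i for i in range(n) if rng.random() < 0.5)
    samples.append((subset, true_utility(subset, weights)))

# Step 2: fit a model to the utility samples
model = fit_linear(samples, n)

# Step 3: predict the utility of a new subset instead of evaluating it
new_subset = (0, 2, 4)
estimate = predict(model, new_subset, n)
```

Because the toy utility is exactly additive, a linear model can recover it; real utilities are generally not additive, so the learned model is only an approximation.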
Usage
There are three components (sorry for the confusing naming!):
- The original utility object to learn, typically (but not necessarily) a ModelUtility object which will be expensive to evaluate.
- A UtilityModel which will be trained to predict the utility of subsets.
- The DataUtilityLearning object.
Assuming you have some data valuation algorithm and your Utility object:
- Pick the actual machine learning model to use to learn the utility. In most cases the utility takes continuous values, so this should be a regression model, such as a linear regression or a neural network. Its input will be sets of indices, so the data has to be encoded accordingly. One option is an indicator vector of the set, as done in Wang et al. (2022) [1], which IndicatorUtilityModel provides. This wrapper accepts any machine learning model for the actual fitting. An alternative is to use a deep learning model such as DeepSet, a simple permutation-invariant architecture that learns embeddings for sets of points.
- Wrap your Utility object within a DataUtilityLearning object and give it the object constructed in the previous point.
- Use this DataUtilityLearning object in your data valuation algorithm instead of the original Utility object.
Parallel processing not supported
As of 0.9.0, this method does not support parallel processing. DataUtilityLearning
would have to collect all utility samples in a single process before fitting the
model. Gathering utility samples via custom evaluation strategies and result
updaters might be possible, but some IPC mechanism would be required to send the
fitted utility model to the workers, and this has to be implemented manually and
made to support all backends.
References
[1] Wang, T., Yang, Y. and Jia, R., 2021. Improving cooperative game theory-based data valuation via data utility learning. arXiv preprint arXiv:2107.06336.
UtilityModel
Bases: ABC
Interface for utility models.
A utility model predicts the value of a utility function given a sample. The model is trained on a collection of samples and their respective utility values. These tuples are called Utility Samples.
Utility models:
- are fitted on dictionaries of Sample -> utility value
- predict: Collection[samples] -> NDArray[utility values]
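To make the fit/predict contract concrete, here is a small stand-alone sketch. It does not use pyDVL's classes: `ToyUtilityModel` and `SizeAverageModel` are hypothetical, a `frozenset` of indices stands in for a Sample, and a plain list stands in for the returned array.

```python
from abc import ABC, abstractmethod
from typing import Collection, Dict, FrozenSet, List

Subset = FrozenSet[int]  # hypothetical stand-in for pyDVL's Sample


class ToyUtilityModel(ABC):
    """Mirrors the described contract: fit on a dict, predict on a collection."""

    @abstractmethod
    def fit(self, samples: Dict[Subset, float]) -> "ToyUtilityModel": ...

    @abstractmethod
    def predict(self, subsets: Collection[Subset]) -> List[float]: ...


class SizeAverageModel(ToyUtilityModel):
    """Predicts the mean utility observed for subsets of the same size."""

    def __init__(self) -> None:
        self._by_size: Dict[int, float] = {}
        self._global_mean = 0.0

    def fit(self, samples: Dict[Subset, float]) -> "SizeAverageModel":
        if not samples:
            raise ValueError("need at least one utility sample")
        sums: Dict[int, float] = {}
        counts: Dict[int, int] = {}
        for subset, value in samples.items():
            size = len(subset)
            sums[size] = sums.get(size, 0.0) + value
            counts[size] = counts.get(size, 0) + 1
        self._by_size = {k: sums[k] / counts[k] for k in sums}
        self._global_mean = sum(samples.values()) / len(samples)
        return self

    def predict(self, subsets: Collection[Subset]) -> List[float]:
        # Fall back to the global mean for subset sizes never seen in training
        return [self._by_size.get(len(s), self._global_mean) for s in subsets]


utility_samples = {
    frozenset(): 0.0,
    frozenset({0}): 0.4,
    frozenset({1}): 0.6,
    frozenset({0, 1}): 1.0,
}
toy_model = SizeAverageModel().fit(utility_samples)
predictions = toy_model.predict([frozenset({2}), frozenset({0, 1, 2})])
```

The first prediction uses the average over the two singleton samples; the second falls back to the global mean because no three-element subset was seen during fitting.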
IndicatorUtilityModel
IndicatorUtilityModel(predictor: SupervisedModel, n_data: int)
Bases: UtilityModel
A simple wrapper for arbitrary predictors.
Uses a one-hot (indicator) encoding of the indices as input for the model, as done in Wang et al. (2022) [1].
Source code in src/pydvl/valuation/utility/learning.py
DataUtilityLearning
DataUtilityLearning(
utility: UtilityBase,
training_budget: int,
model: UtilityModel,
show_warnings: bool = True,
)
Bases: UtilityBase[SampleT]
This object wraps any class derived from UtilityBase and delegates calls to it, up until a given budget (number of iterations). Every tuple of input and output (a so-called utility sample) is stored. Once the budget is exhausted, DataUtilityLearning fits the given model to the utility samples. Subsequent calls will use the learned model to predict the utility instead of delegating.
PARAMETER | DESCRIPTION |
---|---|
utility | The utility to learn. Typically, this will be a ModelUtility object encapsulating a machine learning model which requires fitting on each evaluation of the utility. TYPE: UtilityBase |
training_budget | Number of utility samples to collect before fitting the given model. TYPE: int |
model | A supervised regression model. TYPE: UtilityModel |
Example
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from pydvl.valuation import Dataset, DataUtilityLearning, ModelUtility, Sample, SupervisedScorer
from pydvl.valuation.utility.learning import IndicatorUtilityModel
train, test = Dataset.from_sklearn(load_iris())
scorer = SupervisedScorer("accuracy", test, 0, (0, 1))
# The utility scores a classifier, since accuracy is a classification metric
utility = ModelUtility(LogisticRegression(), scorer)
# The utility model is a regressor over indicator-encoded subsets
utility_model = IndicatorUtilityModel(LinearRegression(), len(train))
dul = DataUtilityLearning(utility, 3, utility_model)
# First 3 calls will be computed normally
for i in range(3):
    _ = dul(Sample(0, np.array([])))
# Subsequent calls will be computed using the fitted utility_model
dul(Sample(0, np.array([1, 2, 3])))
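The delegate-until-budget behaviour described for this class can be sketched in a few lines of plain Python. This is an illustration of the pattern only, not pyDVL's implementation: `BudgetedUtility`, `MeanModel` and `expensive_utility` are hypothetical names, and counting every call towards the budget (rather than every distinct subset) is an assumption.

```python
class MeanModel:
    """Trivial utility model: always predicts the mean of the training values."""

    def fit(self, samples):
        self.mean = sum(samples.values()) / len(samples)

    def predict(self, subsets):
        return [self.mean for _ in subsets]


class BudgetedUtility:
    """Hypothetical sketch of the delegate-then-predict pattern."""

    def __init__(self, utility, training_budget, model):
        self.utility = utility
        self.training_budget = training_budget
        self.model = model
        self._samples = {}  # utility samples: subset -> utility value
        self._n_calls = 0   # assumption: every call counts towards the budget
        self._fitted = False

    def __call__(self, subset):
        key = frozenset(subset)
        if self._n_calls < self.training_budget:
            # Within budget: delegate to the real utility and store the sample
            self._n_calls += 1
            value = self.utility(key)
            self._samples[key] = value
            return value
        if not self._fitted:
            # Budget exhausted: fit the model once on the collected samples
            self.model.fit(self._samples)
            self._fitted = True
        return self.model.predict([key])[0]


calls = []


def expensive_utility(subset):
    """Toy utility that records each evaluation; returns the subset size."""
    calls.append(subset)
    return float(len(subset))


wrapped = BudgetedUtility(expensive_utility, training_budget=3, model=MeanModel())
exact = [wrapped(s) for s in ([0], [0, 1], [0, 1, 2])]  # delegated calls
approx = wrapped([3, 4])  # predicted by the fitted model
```

After the three delegated calls, `expensive_utility` is never evaluated again; every later value comes from the fitted model.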
training_data
property
training_data: Dataset | None
Retrieves the training data used by this utility.
This property is read-only. In order to set it, use with_dataset().
with_dataset
Returns the utility, or a copy of it, with the given dataset.
PARAMETER | DESCRIPTION |
---|---|
data | The dataset to use for utility fitting (training data). |
copy | Whether to copy the utility object or not. Valuation methods should always make copies to avoid unexpected side effects. |
RETURNS | DESCRIPTION |
---|---|
| The utility object. |
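The combination of a read-only property and a copying setter can be illustrated with a small stand-alone class. This is a sketch of the pattern under stated assumptions, not pyDVL's code: `ToyUtility` is hypothetical, and the default `copy=True` is an assumption made for the illustration.

```python
import copy as copy_module


class ToyUtility:
    """Hypothetical utility with a read-only training_data property."""

    def __init__(self):
        self._training_data = None

    @property
    def training_data(self):
        # No setter is defined, so assignment raises AttributeError
        return self._training_data

    def with_dataset(self, data, copy=True):
        # With copy=True the original object is left untouched
        target = copy_module.copy(self) if copy else self
        target._training_data = data
        return target


base = ToyUtility()
bound = base.with_dataset([1, 2, 3])            # new object, base unchanged
in_place = base.with_dataset([4], copy=False)   # same object, modified
```

Working on a copy means that two valuation runs sharing the same base utility cannot overwrite each other's training data.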