pydvl.valuation.utility.learning

This module implements Data Utility Learning (Wang et al., 2022) [1].

Data Utility Learning accelerates data valuation by learning the utility function from a small number of subsets. The process is as follows:

  1. Collect a small number of so-called utility samples (subsets and their utility values) during the normal course of data valuation.
  2. Fit a model to the utility samples. The model is trained to predict the utility of new subsets.
  3. Continue the valuation process, sampling subsets, but instead of evaluating the original utility function, use the learned model to predict it.
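
The three steps can be sketched in plain Python, independently of pyDVL. Everything in this sketch is a hypothetical stand-in: `expensive_utility` plays the role of the original utility, `MeanBySizeModel` is a deliberately crude learned model (it predicts the mean observed utility of subsets of the same size), and `BudgetedUtility` is an illustrative wrapper, not the library's class.

```python
from collections import defaultdict
from statistics import mean


class MeanBySizeModel:
    """Toy learned model: mean utility observed for subsets of equal size."""

    def fit(self, samples: dict) -> None:
        by_size = defaultdict(list)
        for subset, value in samples.items():
            by_size[len(subset)].append(value)
        self._means = {size: mean(values) for size, values in by_size.items()}
        self._fallback = mean(samples.values())

    def predict(self, subset: frozenset) -> float:
        # Fall back to the global mean for subset sizes never seen in training.
        return self._means.get(len(subset), self._fallback)


class BudgetedUtility:
    """Evaluates the true utility until `budget` samples exist, then predicts."""

    def __init__(self, utility, model, budget: int):
        self.utility, self.model, self.budget = utility, model, budget
        self.samples = {}
        self.fitted = False

    def __call__(self, subset: frozenset) -> float:
        if len(self.samples) < self.budget:
            # Step 1: collect a utility sample from the expensive function.
            value = self.utility(subset)
            self.samples[subset] = value
            return value
        if not self.fitted:
            # Step 2: fit the model once the budget is exhausted.
            self.model.fit(self.samples)
            self.fitted = True
        # Step 3: predict instead of evaluating the original utility.
        return self.model.predict(subset)


evaluated = []  # records every call to the expensive utility


def expensive_utility(subset: frozenset) -> float:
    """Stand-in for a utility that would retrain a model on `subset`."""
    evaluated.append(subset)
    return len(subset) / 4


learned = BudgetedUtility(expensive_utility, MeanBySizeModel(), budget=3)
training_subsets = [frozenset({0}), frozenset({0, 1}), frozenset({1, 2, 3})]
collected = [learned(s) for s in training_subsets]
predicted = learned(frozenset({2, 3}))  # answered by the model, not the utility
```

After the third call the expensive function is never invoked again, which is where the speed-up comes from. The quality of the resulting values then depends entirely on how well the fitted model approximates the utility.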

Usage

There are three components (sorry for the confusing naming!):

  1. The original utility object to learn, typically (but not necessarily) a ModelUtility object which will be expensive to evaluate.
  2. A UtilityModel which will be trained to predict the utility of subsets.
  3. The DataUtilityLearning object.

Assuming you have some data valuation algorithm and your Utility object:

  1. Pick the machine learning model that will learn the utility. In most cases the utility takes continuous values, so any regression model will do, such as a linear regression or a neural network. Its inputs are sets of indices, which must be encoded in a form the model accepts. One option is the indicator vector of the set, as done in Wang et al. (2022) [1], available through IndicatorUtilityModel, a wrapper that accepts any machine learning model for the actual fitting. An alternative is a deep learning model such as DeepSet, a simple permutation-invariant architecture that learns embeddings for sets of points.
  2. Wrap your Utility object within a DataUtilityLearning object and pass it the object constructed in the previous step.
  3. Use this DataUtilityLearning object in your data valuation algorithm instead of the original Utility object.
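
To make the indicator encoding from step 1 concrete, here is a minimal sketch in plain Python. It is not the code of IndicatorUtilityModel: the helper names (`indicator`, `fit_linear`, `predict`) and the per-point contributions in `point_worth` are hypothetical, and a hand-written batch gradient descent stands in for a real regression library. The toy utility is additive, so a linear model on indicator vectors can recover it.

```python
from itertools import combinations


def indicator(subset, n_data: int) -> list:
    """Encode a set of indices as a 0/1 vector of length n_data."""
    members = set(subset)
    return [1.0 if i in members else 0.0 for i in range(n_data)]


def fit_linear(X, y, lr: float = 1.0, steps: int = 200) -> list:
    """Least-squares fit without intercept, by batch gradient descent."""
    n_features = len(X[0])
    w = [0.0] * n_features
    for _ in range(steps):
        # Residuals are computed once per step, before any weight is updated.
        residuals = [
            sum(wj * xj for wj, xj in zip(w, x)) - target
            for x, target in zip(X, y)
        ]
        for j in range(n_features):
            grad_j = sum(r * x[j] for r, x in zip(residuals, X)) / len(X)
            w[j] -= lr * grad_j
    return w


def predict(w, subset, n_data: int) -> float:
    """Predicted utility of a subset under the fitted linear model."""
    return sum(wj * xj for wj, xj in zip(w, indicator(subset, n_data)))


n_data = 4
point_worth = [0.1, 0.2, 0.3, 0.4]  # hypothetical per-point contributions


def toy_utility(subset) -> float:
    """Additive toy utility: the sum of the contributions of its members."""
    return sum(point_worth[i] for i in subset)


# Utility samples: all subsets of size 1 and 2, with their true utilities.
train_subsets = [s for k in (1, 2) for s in combinations(range(n_data), k)]
X = [indicator(s, n_data) for s in train_subsets]
y = [toy_utility(s) for s in train_subsets]

weights = fit_linear(X, y)
estimate = predict(weights, (0, 2, 3), n_data)  # a subset not seen in training
```

Real utilities are rarely additive, so a linear model is only a first approximation. The encoding itself stays the same whichever regressor is plugged in.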

Parallel processing not supported

As of 0.9.0, this method does not support parallel processing. DataUtilityLearning would have to collect all utility samples in a single process before fitting the model. Gathering utility samples via custom evaluation strategies and result updaters might be possible, but some IPC mechanism would be required to send the fitted utility model to the workers, and this has to be implemented manually and made to support all backends.

References


  1. Wang, T., Yang, Y. and Jia, R., 2021. Improving cooperative game theory-based data valuation via data utility learning. arXiv preprint arXiv:2107.06336. 

UtilityModel

Bases: ABC

Interface for utility models.

A utility model predicts the value of a utility function given a sample. The model is trained on a collection of samples and their respective utility values. These tuples are called Utility Samples.

Utility models:

  • are fitted on dictionaries of Sample -> utility value
  • predict: Collection[samples] -> NDArray[utility values]

IndicatorUtilityModel

IndicatorUtilityModel(predictor: SupervisedModel, n_data: int)

Bases: UtilityModel

A simple wrapper for arbitrary predictors.

Encodes each set of indices as an indicator vector of length n_data (1 for indices in the set, 0 otherwise) and uses it as input for the model, as done in Wang et al. (2022) [1].

Source code in src/pydvl/valuation/utility/learning.py
def __init__(self, predictor: SupervisedModel, n_data: int):
    self.n_data = n_data
    self.predictor = predictor

DataUtilityLearning

DataUtilityLearning(
    utility: UtilityBase,
    training_budget: int,
    model: UtilityModel,
    show_warnings: bool = True,
)

Bases: UtilityBase[SampleT]

This object wraps any class derived from UtilityBase and delegates calls to it, up until a given budget (number of iterations). Every tuple of input and output (a so-called utility sample) is stored. Once the budget is exhausted, DataUtilityLearning fits the given model to the utility samples. Subsequent calls will use the learned model to predict the utility instead of delegating.

PARAMETER DESCRIPTION
utility

The utility to learn. Typically, this will be a ModelUtility object encapsulating a machine learning model which requires fitting on each evaluation of the utility.

TYPE: UtilityBase

training_budget

Number of utility samples to collect before fitting the given model.

TYPE: int

model

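The utility model to fit to the collected utility samples, for example an IndicatorUtilityModel wrapping a supervised regression model.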

TYPE: UtilityModel

Example
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

from pydvl.valuation import (
    Dataset,
    DataUtilityLearning,
    ModelUtility,
    Sample,
    SupervisedScorer,
)
from pydvl.valuation.utility.learning import IndicatorUtilityModel

train, test = Dataset.from_sklearn(load_iris())
scorer = SupervisedScorer("accuracy", test, 0, (0, 1))
# The utility being learned scores a classifier on accuracy
utility = ModelUtility(LogisticRegression(), scorer)
# The utility model is a regressor over indicator vectors of the subsets
utility_model = IndicatorUtilityModel(LinearRegression(), len(train))
dul = DataUtilityLearning(utility, 3, utility_model)
# First 3 calls will be computed normally
for i in range(3):
    _ = dul(Sample(0, np.array([])))
# Subsequent calls will be computed using the fitted utility_model
dul(Sample(0, np.array([1, 2, 3])))
Source code in src/pydvl/valuation/utility/learning.py
def __init__(
    self,
    utility: UtilityBase,
    training_budget: int,
    model: UtilityModel,
    show_warnings: bool = True,
) -> None:
    self.utility = utility
    self.training_budget = training_budget
    self.model = model
    self.n_predictions = 0
    self.show_warnings = show_warnings
    self._is_fitted = False
    self._utility_samples: dict[Sample, float] = {}

training_data property

training_data: Dataset | None

Retrieves the training data used by this utility.

This property is read-only. In order to set it, use with_dataset().

with_dataset

with_dataset(data: Dataset, copy: bool = True) -> Self

Returns the utility, or a copy of it, with the given dataset.

PARAMETER DESCRIPTION
data

The dataset to use for utility fitting (training data).

TYPE: Dataset

copy

Whether to copy the utility object or not. Valuation methods should always make copies to avoid unexpected side effects.

TYPE: bool

RETURNS DESCRIPTION
Self

The utility object.

Source code in src/pydvl/valuation/utility/base.py
def with_dataset(self, data: Dataset, copy: bool = True) -> Self:
    """Returns the utility, or a copy of it, with the given dataset.
    Args:
        data: The dataset to use for utility fitting (training data)
        copy: Whether to copy the utility object or not. Valuation methods should
            always make copies to avoid unexpected side effects.
    Returns:
        The utility object.
    """
    utility = cp.copy(self) if copy else self
    utility._training_data = data
    return utility
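
The effect of the copy flag can be seen in a small standalone sketch. `ToyUtility` is a hypothetical class that reproduces only the method body shown above, and plain lists stand in for Dataset objects.

```python
import copy as cp


class ToyUtility:
    """Hypothetical minimal utility holding a reference to its training data."""

    def __init__(self):
        self._training_data = None

    @property
    def training_data(self):
        return self._training_data

    def with_dataset(self, data, copy: bool = True):
        # Same logic as the method above: shallow-copy unless told otherwise.
        utility = cp.copy(self) if copy else self
        utility._training_data = data
        return utility


base = ToyUtility()

# Default: a shallow copy receives the data, the original is left untouched.
bound = base.with_dataset([1, 2, 3])
base_data_after_copy = base.training_data

# copy=False: the original object is modified in place and returned.
same = base.with_dataset([4, 5], copy=False)
```

Because the copy is shallow, any other attributes are shared by reference between the original and the copy. Only the training data attribute is rebound on the returned object.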