pydvl.valuation.utility.learning

This module implements Data Utility Learning (Wang et al., 2021) [1].

DUL uses an ML model to learn the utility function. Essentially, it learns to predict the performance of a model when trained on a given set of indices from the dataset. The cost of training this model is quickly amortized by avoiding costly re-evaluations of the original utility.

Usage is through the DataUtilityLearning class, which wraps any utility function together with a UtilityModel that learns it. The wrapper collects utility samples until a given budget is reached and then fits the model. From then on, it forwards all utility queries to the learned model, which predicts the utility of new samples at a low, constant cost.

See the documentation for more information.

Todo

DUL does not yet support parallel training of the model. This is a limitation of the current architecture. Additionally, batching of utility evaluations should be added to take full advantage of neural network architectures.

References


  1. Wang, T., Yang, Y. and Jia, R., 2021. Improving cooperative game theory-based data valuation via data utility learning. arXiv preprint arXiv:2107.06336. 

DataUtilityLearning

DataUtilityLearning(
    utility: UtilityBase,
    training_budget: int,
    model: UtilityModel,
    show_warnings: bool = True,
)

Bases: UtilityBase[SampleT]

This object wraps any class derived from UtilityBase and delegates calls to it, up until a given budget (number of iterations). Every tuple of input and output (a so-called utility sample) is stored. Once the budget is exhausted, DataUtilityLearning fits the given model to the utility samples. Subsequent calls will use the learned model to predict the utility instead of delegating.
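The delegate-then-predict behaviour described above can be sketched in plain Python. This is a minimal illustration, not pyDVL's implementation: `BudgetedUtility`, `MeanModel` and `expensive_utility` are hypothetical names, samples are represented as frozensets of indices, the cache lookup for already-seen samples is an assumption of the sketch, and the stand-in model simply predicts the mean of the utilities it was fitted on.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List

Sample = FrozenSet[int]  # assumption: a sample is just a set of data indices


class MeanModel:
    """Hypothetical stand-in model: predicts the mean of the observed utilities."""

    def __init__(self) -> None:
        self.mean = 0.0

    def fit(self, samples: Dict[Sample, float]) -> "MeanModel":
        self.mean = sum(samples.values()) / len(samples)
        return self

    def predict(self, samples: Iterable[Sample]) -> List[float]:
        return [self.mean for _ in samples]


class BudgetedUtility:
    """Delegates to `utility` until `training_budget` samples are stored,
    then fits `model` once and answers later calls with its predictions."""

    def __init__(
        self,
        utility: Callable[[Sample], float],
        training_budget: int,
        model: MeanModel,
    ) -> None:
        self.utility = utility
        self.training_budget = training_budget
        self.model = model
        self.n_predictions = 0
        self._is_fitted = False
        self._utility_samples: Dict[Sample, float] = {}

    def __call__(self, sample: Sample) -> float:
        # Assumption of this sketch: samples seen during collection are cached.
        if sample in self._utility_samples:
            return self._utility_samples[sample]
        # Budget not yet exhausted: evaluate the real utility and store the pair.
        if len(self._utility_samples) < self.training_budget:
            value = self.utility(sample)
            self._utility_samples[sample] = value
            return value
        # Budget exhausted: fit once, then predict instead of delegating.
        if not self._is_fitted:
            self.model.fit(self._utility_samples)
            self._is_fitted = True
        self.n_predictions += 1
        return self.model.predict([sample])[0]


calls: List[Sample] = []


def expensive_utility(sample: Sample) -> float:
    calls.append(sample)  # record how often the real utility is evaluated
    return float(len(sample))


u = BudgetedUtility(expensive_utility, training_budget=2, model=MeanModel())
u(frozenset({0}))              # delegated to the real utility
u(frozenset({0, 1, 2}))        # delegated; the budget is now exhausted
predicted = u(frozenset({1}))  # answered by the fitted model
```

After the budget of two samples is used up, the real utility is never called again; only the cheap model is queried.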

PARAMETER DESCRIPTION
utility

The utility to learn. Typically, this will be a ModelUtility object encapsulating a machine learning model which requires fitting on each evaluation of the utility.

TYPE: UtilityBase

training_budget

Number of utility samples to collect before fitting the given model.

TYPE: int

model

A wrapper for a supervised model that can be trained on a collection of utility samples.

TYPE: UtilityModel

Source code in src/pydvl/valuation/utility/learning.py
def __init__(
    self,
    utility: UtilityBase,
    training_budget: int,
    model: UtilityModel,
    show_warnings: bool = True,
) -> None:
    self.utility = utility
    self.training_budget = training_budget
    self.model = model
    self.n_predictions = 0
    self.show_warnings = show_warnings
    self._is_fitted = False
    self._utility_samples: dict[Sample, float] = {}

training_data property

training_data: Dataset | None

Retrieves the training data used by this utility.

This property is read-only. In order to set it, use with_dataset().

__str__

__str__()

Returns a string representation of the utility. Subclasses should override this method to provide a more informative string.

Source code in src/pydvl/valuation/utility/base.py
def __str__(self):
    """Returns a string representation of the utility.
    Subclasses should override this method to provide a more informative string
    """
    return f"{self.__class__.__name__}"

with_dataset

with_dataset(data: Dataset, copy: bool = True) -> Self

Returns the utility, or a copy of it, with the given dataset.

PARAMETER DESCRIPTION
data

The dataset to use for utility fitting (training data).

TYPE: Dataset

copy

Whether to copy the utility object or not. Valuation methods should always make copies to avoid unexpected side effects.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
Self

The utility object.

Source code in src/pydvl/valuation/utility/base.py
def with_dataset(self, data: Dataset, copy: bool = True) -> Self:
    """Returns the utility, or a copy of it, with the given dataset.
    Args:
        data: The dataset to use for utility fitting (training data)
        copy: Whether to copy the utility object or not. Valuation methods should
            always make copies to avoid unexpected side effects.
    Returns:
        The utility object.
    """
    utility = cp.copy(self) if copy else self
    utility._training_data = data
    return utility
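The effect of the `copy` flag can be seen in a small standalone sketch. `ToyUtility` is a hypothetical class that only mirrors the method body shown above, and plain lists stand in for `Dataset` objects. Note that `copy.copy` makes a shallow copy, so mutable attributes other than the dataset remain shared between the original and the copy.

```python
import copy as cp


class ToyUtility:
    """Hypothetical stand-in that mirrors the with_dataset() logic above."""

    def __init__(self) -> None:
        self._training_data = None

    @property
    def training_data(self):
        return self._training_data

    def with_dataset(self, data, copy: bool = True) -> "ToyUtility":
        utility = cp.copy(self) if copy else self  # shallow copy by default
        utility._training_data = data
        return utility


base = ToyUtility()

# Default: a new object receives the dataset, the original is left untouched.
copied = base.with_dataset([1, 2, 3])
untouched = base.training_data  # still None at this point

# copy=False: the original object itself is modified and returned.
same = base.with_dataset([4, 5], copy=False)
```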

IndicatorUtilityModel

IndicatorUtilityModel(predictor: SupervisedModel, n_data: int)

Bases: UtilityModel

A simple wrapper for arbitrary predictors.

Uses one-hot encoding of the indices as input for the model, as done in Wang et al. (2021) [1].

This encoding can be fed to any regressor. See the documentation for details.
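As an illustration of this kind of encoding, each subset of indices can be turned into a 0/1 vector of length `n_data`, and the stacked vectors form the input matrix for a regressor. The helper `indicator_encoding` below is a hypothetical name for this sketch and is not part of pyDVL; the library's own encoding may differ in details such as the array type.

```python
from typing import Iterable, List


def indicator_encoding(indices: Iterable[int], n_data: int) -> List[int]:
    """Returns a 0/1 vector of length n_data with ones at the given indices."""
    row = [0] * n_data
    for i in indices:
        if not 0 <= i < n_data:
            raise IndexError(f"index {i} out of range for n_data={n_data}")
        row[i] = 1
    return row


# One row per utility sample; the empty subset maps to the all-zeros row.
subsets = [{0, 2}, {1}, set()]
X = [indicator_encoding(s, n_data=4) for s in subsets]
```

Because every row has the same fixed length regardless of subset size, any regressor that accepts a numeric feature matrix can be trained on `X` and the corresponding utility values.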

PARAMETER DESCRIPTION
predictor

A supervised model that implements the fit and predict methods. This model will be trained on the encoded utility samples gathered by the DataUtilityLearning object.

TYPE: SupervisedModel

n_data

Number of indices in the dataset. This is used to create the input matrix for the model.

TYPE: int

Source code in src/pydvl/valuation/utility/learning.py
def __init__(self, predictor: SupervisedModel, n_data: int):
    self.n_data = n_data
    self.predictor = predictor

UtilityModel

Bases: ABC

Interface for utility models.

A utility model predicts the value of a utility function given a sample. The model is trained on a collection of samples and their respective utility values. These tuples are called utility samples.

Utility models:

  • are fitted on dictionaries of Sample -> utility value
  • predict: Collection[samples] -> NDArray[utility values]
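The two-method contract listed above could look roughly like the following sketch. `SketchUtilityModel` and `SizeAverageModel` are hypothetical names, samples are simplified to frozensets of indices, and `predict` returns a plain list rather than an NDArray to keep the example free of dependencies.

```python
from abc import ABC, abstractmethod
from typing import Collection, Dict, FrozenSet, List

Sample = FrozenSet[int]  # assumption: a sample is just a set of data indices


class SketchUtilityModel(ABC):
    """Hypothetical interface with the fit/predict contract described above."""

    @abstractmethod
    def fit(self, samples: Dict[Sample, float]) -> "SketchUtilityModel":
        """Fits the model on a mapping of sample -> utility value."""

    @abstractmethod
    def predict(self, samples: Collection[Sample]) -> List[float]:
        """Predicts one utility value per sample."""


class SizeAverageModel(SketchUtilityModel):
    """Toy model: predicts the mean utility observed for subsets of equal size,
    falling back to the overall mean for sizes it has not seen."""

    def __init__(self) -> None:
        self._by_size: Dict[int, float] = {}
        self._fallback = 0.0

    def fit(self, samples: Dict[Sample, float]) -> "SizeAverageModel":
        sums: Dict[int, float] = {}
        counts: Dict[int, int] = {}
        for sample, value in samples.items():
            size = len(sample)
            sums[size] = sums.get(size, 0.0) + value
            counts[size] = counts.get(size, 0) + 1
        self._by_size = {size: sums[size] / counts[size] for size in sums}
        self._fallback = sum(samples.values()) / len(samples)
        return self

    def predict(self, samples: Collection[Sample]) -> List[float]:
        return [self._by_size.get(len(s), self._fallback) for s in samples]


utility_samples = {
    frozenset({0}): 1.0,
    frozenset({1}): 3.0,
    frozenset({0, 1}): 5.0,
}
model = SizeAverageModel().fit(utility_samples)
predictions = model.predict([frozenset({2}), frozenset({1, 2}), frozenset()])
```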