pydvl.valuation.utility.learning ¶

This module implements Data Utility Learning (Wang et al., 2021).¹

DUL uses an ML model to learn the utility function. Essentially, it learns to predict the performance of a model when trained on a given set of indices from the dataset. The cost of training this model is quickly amortized by avoiding costly re-evaluations of the original utility.

Usage is through the DataUtilityLearning class, which wraps any utility function together with a UtilityModel that learns it. The wrapper collects utility samples until a given budget is reached and then fits the model. After that, it forwards all utility queries to the learned model, which predicts the utility of new samples at low, constant cost.

See the documentation for more information.

Todo

DUL does not support parallel training of the model yet. This is a limitation of the current architecture. Additionally, batching of utility evaluations should be added to really profit from neural network architectures.

References¶

Wang, T., Yang, Y. and Jia, R., 2021. Improving cooperative game theory-based data valuation via data utility learning. arXiv preprint arXiv:2107.06336.

DataUtilityLearning ¶

DataUtilityLearning(
    utility: UtilityBase,
    training_budget: int,
    model: UtilityModel,
    show_warnings: bool = True,
)

Bases: UtilityBase[SampleT]

This object wraps any class derived from UtilityBase and delegates calls to it, up until a given budget (number of iterations). Every tuple of input and output (a so-called utility sample) is stored. Once the budget is exhausted, DataUtilityLearning fits the given model to the utility samples. Subsequent calls will use the learned model to predict the utility instead of delegating.

PARAMETER	DESCRIPTION
`utility`	The utility to learn. Typically, this will be a ModelUtility object encapsulating a machine learning model which requires fitting on each evaluation of the utility. TYPE: `UtilityBase`
`training_budget`	Number of utility samples to collect before fitting the given model. TYPE: `int`
`model`	A wrapper for a supervised model that can be trained on a collection of utility samples. TYPE: `UtilityModel`
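The collect-then-predict behaviour described above can be sketched in plain Python. This is an illustration only, not the library's implementation: `BudgetedUtility`, `MeanPerSizeModel` and the use of `frozenset` as a sample are hypothetical stand-ins, and the real class may handle repeated or empty samples differently.

```python
from collections import defaultdict
from typing import Callable, Dict, FrozenSet

Subset = FrozenSet[int]  # assumption: a sample is just a set of indices


class MeanPerSizeModel:
    """Hypothetical stand-in for a utility model: predicts the mean
    utility observed for training subsets of the same size."""

    def fit(self, samples: Dict[Subset, float]) -> None:
        by_size = defaultdict(list)
        for subset, value in samples.items():
            by_size[len(subset)].append(value)
        self._mean_by_size = {k: sum(v) / len(v) for k, v in by_size.items()}
        self._global_mean = sum(samples.values()) / len(samples)

    def predict(self, subset: Subset) -> float:
        # Fall back to the global mean for sizes never seen in training.
        return self._mean_by_size.get(len(subset), self._global_mean)


class BudgetedUtility:
    """Hypothetical wrapper: delegates to the true utility until the budget
    is used up, then fits the model once and predicts from then on."""

    def __init__(
        self,
        utility: Callable[[Subset], float],
        training_budget: int,
        model: MeanPerSizeModel,
    ) -> None:
        if training_budget < 1:
            raise ValueError("training_budget must be at least 1")
        self.utility = utility
        self.training_budget = training_budget
        self.model = model
        self.n_predictions = 0
        self._is_fitted = False
        self._utility_samples: Dict[Subset, float] = {}

    def __call__(self, subset: Subset) -> float:
        if len(self._utility_samples) < self.training_budget:
            value = self.utility(subset)  # expensive call
            self._utility_samples[subset] = value
            return value
        if not self._is_fitted:
            self.model.fit(self._utility_samples)
            self._is_fitted = True
        self.n_predictions += 1
        return self.model.predict(subset)  # cheap call
```

With a budget of 3, the first three distinct subsets trigger the wrapped utility; every later query is answered by the fitted model without touching the original utility.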

Source code in src/pydvl/valuation/utility/learning.py

def __init__(
    self,
    utility: UtilityBase,
    training_budget: int,
    model: UtilityModel,
    show_warnings: bool = True,
) -> None:
    self.utility = utility
    self.training_budget = training_budget
    self.model = model
    self.n_predictions = 0
    self.show_warnings = show_warnings
    self._is_fitted = False
    self._utility_samples: dict[Sample, float] = {}

training_data `property` ¶

training_data: Dataset | None

Retrieves the training data used by this utility.

This property is read-only. In order to set it, use with_dataset().

__str__ ¶

__str__()

Returns a string representation of the utility. Subclasses should override this method to provide a more informative string.

Source code in src/pydvl/valuation/utility/base.py

def __str__(self):
    """Returns a string representation of the utility.
    Subclasses should override this method to provide a more informative string
    """
    return f"{self.__class__.__name__}"

with_dataset ¶

with_dataset(data: Dataset, copy: bool = True) -> Self

Returns the utility, or a copy of it, with the given dataset.

PARAMETER	DESCRIPTION
`data`	The dataset to use for utility fitting (training data). TYPE: `Dataset`
`copy`	Whether to copy the utility object or not. Valuation methods should always make copies to avoid unexpected side effects. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`Self`	The utility object.

Source code in src/pydvl/valuation/utility/base.py

def with_dataset(self, data: Dataset, copy: bool = True) -> Self:
    """Returns the utility, or a copy of it, with the given dataset.
    Args:
        data: The dataset to use for utility fitting (training data)
        copy: Whether to copy the utility object or not. Valuation methods should
            always make copies to avoid unexpected side effects.
    Returns:
        The utility object.
    """
    utility = cp.copy(self) if copy else self
    utility._training_data = data
    return utility
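The effect of the `copy` flag can be illustrated with a minimal sketch. `ToyUtility` is a hypothetical class that only reproduces the `with_dataset` logic shown above, and a plain list stands in for a `Dataset`.

```python
import copy as cp


class ToyUtility:
    """Hypothetical minimal utility holding only a training dataset."""

    def __init__(self) -> None:
        self._training_data = None

    @property
    def training_data(self):
        return self._training_data

    def with_dataset(self, data, copy: bool = True):
        # Same logic as the listing above: shallow-copy unless told not to.
        utility = cp.copy(self) if copy else self
        utility._training_data = data
        return utility


base = ToyUtility()
bound = base.with_dataset([1, 2, 3])            # new object, base untouched
in_place = base.with_dataset([4, 5], copy=False)  # mutates and returns base
```

Because `copy.copy` is shallow, the copy and the original still share any mutable attributes other than the one that is rebound here.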

IndicatorUtilityModel ¶

IndicatorUtilityModel(
    predictor: SupervisedModel[NDArray, NDArray], n_data: int
)

Bases: UtilityModel[NDArray]

A simple wrapper for arbitrary predictors.

Uses a one-hot encoding of the indices as input for the model, as done in Wang et al. (2021)¹.

This encoding can be fed to any regressor. See the documentation for details.

PARAMETER	DESCRIPTION
`predictor`	A supervised model that implements the `fit` and `predict` methods. This model will be trained on the encoded utility samples gathered by the DataUtilityLearning object. TYPE: `SupervisedModel[NDArray, NDArray]`
`n_data`	Number of indices in the dataset. This is used to create the input matrix for the model. TYPE: `int`
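As a rough sketch of the encoding, each set of indices can be turned into a 0/1 vector of length `n_data`, and the collected utility samples into an input matrix with one row per sample. The helper names `encode_subset` and `build_training_set` are hypothetical, and plain lists are used here where the library works with arrays.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple


def encode_subset(indices: Iterable[int], n_data: int) -> List[float]:
    """Return a vector with 1.0 at every position listed in `indices`."""
    row = [0.0] * n_data
    for i in indices:
        if not 0 <= i < n_data:
            raise IndexError(f"index {i} is out of range for n_data={n_data}")
        row[i] = 1.0
    return row


def build_training_set(
    samples: Dict[FrozenSet[int], float], n_data: int
) -> Tuple[List[List[float]], List[float]]:
    """Turn {subset: utility} into (X, y) suitable for any regressor."""
    X = [encode_subset(subset, n_data) for subset in samples]
    y = list(samples.values())
    return X, y


X, y = build_training_set({frozenset({0, 2}): 0.8, frozenset({1}): 0.3}, n_data=4)
```

The resulting `X` and `y` have the shape expected by the `fit` method of a typical regressor.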

Source code in src/pydvl/valuation/utility/learning.py

def __init__(self, predictor: SupervisedModel[NDArray, NDArray], n_data: int):
    self.n_data = n_data
    self.predictor = predictor

UtilityModel ¶

Bases: ABC, Generic[ArrayRetT]

Interface for utility models.

A utility model predicts the value of a utility function given a sample. The model is trained on a collection of samples and their respective utility values. These tuples are called Utility Samples.

Utility models:

- are fitted on dictionaries of Sample -> utility value
- predict: Collection[samples] -> Array[utility values]
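The two-method contract listed above might look as follows. This is a sketch under assumptions: `ToyUtilityModel` and `GlobalMeanModel` are hypothetical names, samples are represented as `frozenset`s of indices, predictions are returned as a plain list, and returning `self` from `fit` is a convenience chosen here rather than something the source confirms.

```python
from abc import ABC, abstractmethod
from typing import Collection, Dict, FrozenSet, List

Subset = FrozenSet[int]  # assumption: a sample is a set of indices


class ToyUtilityModel(ABC):
    """Hypothetical interface: fit on {sample: utility}, predict on samples."""

    @abstractmethod
    def fit(self, samples: Dict[Subset, float]) -> "ToyUtilityModel":
        ...

    @abstractmethod
    def predict(self, samples: Collection[Subset]) -> List[float]:
        ...


class GlobalMeanModel(ToyUtilityModel):
    """Simplest possible implementation: always predicts the mean utility."""

    def fit(self, samples: Dict[Subset, float]) -> "GlobalMeanModel":
        if not samples:
            raise ValueError("cannot fit on an empty set of utility samples")
        self._mean = sum(samples.values()) / len(samples)
        return self

    def predict(self, samples: Collection[Subset]) -> List[float]:
        # One prediction per queried sample, in iteration order.
        return [self._mean for _ in samples]


model = GlobalMeanModel().fit({frozenset({0}): 1.0, frozenset({0, 1}): 3.0})
predictions = model.predict([frozenset({2}), frozenset()])
```

A baseline like this is useful mainly as a sanity check against which a learned model can be compared.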