Data Utility Learning

Example

See the notebook on Data Utility Learning for a complete example.

Data Utility Learning (DUL) (Wang et al., 2022)¹ uses an ML model \(\hat{u}\) to learn the utility function \(u:2^N \to \mathbb{R}\) during the fitting phase of any valuation method. This utility model is trained on tuples \((S, u(S))\) collected during a warm-up period, after which it replaces \(u\) in the valuation method. The cost of training \(\hat{u}\) is quickly amortized by avoiding costly re-evaluations of the original utility.

Process

In other words, DUL accelerates data valuation by learning the utility function from a small number of subsets. The process is as follows:

1. Collect a given budget of so-called utility samples (subsets and their utility values) during the normal course of data valuation.
2. Fit a model \(\hat{u}\) to the utility samples. The model is trained to predict the utility of new subsets.
3. Continue the valuation process, sampling subsets, but instead of evaluating the original utility function, use the learned model to predict it.
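
The three steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not pyDVL's implementation: `LearnedUtility`, `true_utility` and the additive toy utility are hypothetical names introduced here, and the learned model is simply a linear function of the subset's indicator vector, fitted with stochastic gradient descent.

```python
import random

N = 6  # number of training points in this toy problem
WEIGHTS = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]  # hypothetical per-point contributions
calls = {"n": 0}  # counts evaluations of the "expensive" utility


def true_utility(subset):
    """Stand-in for an expensive utility u(S); here it is simply additive."""
    calls["n"] += 1
    return sum(WEIGHTS[i] for i in subset)


class LearnedUtility:
    """Collects `budget` utility samples, fits a model, then predicts."""

    def __init__(self, utility, n, budget, epochs=300, lr=0.1):
        self.utility, self.n, self.budget = utility, n, budget
        self.epochs, self.lr = epochs, lr
        self.samples = {}   # subset -> observed utility
        self.params = None  # [bias, w_0, ..., w_{n-1}] once fitted

    def _encode(self, subset):
        # Constant 1 for the bias, followed by the indicator vector of the subset
        return [1.0] + [1.0 if i in subset else 0.0 for i in range(self.n)]

    def _predict(self, subset):
        return sum(p * x for p, x in zip(self.params, self._encode(subset)))

    def _fit(self):
        # Least-squares fit by repeated passes of per-sample gradient steps
        self.params = [0.0] * (self.n + 1)
        for _ in range(self.epochs):
            for subset, value in self.samples.items():
                x = self._encode(subset)
                err = value - self._predict(subset)
                self.params = [p + self.lr * err * xi for p, xi in zip(self.params, x)]

    def __call__(self, subset):
        subset = frozenset(subset)
        if subset in self.samples:           # already evaluated: reuse the value
            return self.samples[subset]
        if len(self.samples) < self.budget:  # step 1: collect utility samples
            value = self.utility(subset)
            self.samples[subset] = value
            if len(self.samples) == self.budget:
                self._fit()                  # step 2: fit the utility model
            return value
        return self._predict(subset)         # step 3: predict instead of evaluating


rng = random.Random(0)
learned = LearnedUtility(true_utility, N, budget=40)
for _ in range(500):  # stands in for the sampling loop of a valuation method
    learned([i for i in range(N) if rng.random() < 0.5])

print("expensive evaluations:", calls["n"])
```

Once the budget is exhausted, the call counter stops growing: every further subset is either looked up among the stored samples or predicted by the fitted model. The toy utility is exactly linear in the indicator vector, so the fit is essentially perfect here; real utilities are rarely that well behaved.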

Usage

There are three components (sorry for the confusing naming!):

1. The original utility object to learn, typically (but not necessarily) a ModelUtility object which will be expensive to evaluate. Any subclass of UtilityBase should work. Let's call it utility.
2. A UtilityModel which will be trained to predict the utility of subsets.
3. The DataUtilityLearning object.

Assuming you have some data valuation algorithm and your utility object:

1. Pick the actual machine learning model to use to learn the utility. In most cases the utility takes continuous values, so this should be a regression model, such as a linear regression or a neural network. Its inputs are sets of indices, so these have to be encoded accordingly. One option is the indicator vector of the set, as done in (Wang et al., 2022)¹, which is provided by IndicatorUtilityModel. This wrapper accepts any machine learning model for the actual fitting. An alternative is to use a permutation-invariant model, such as DeepSet (Zaheer et al., 2017)², which is a simple architecture to learn embeddings for sets of points (see below).
2. Wrap both your utility object and the utility model just constructed within a DataUtilityLearning.
3. Use this last object in your data valuation algorithm instead of the original utility.

Indicator encoding

The authors of DUL propose to use an indicator function to encode the sets of indices: a vector of length len(training_data) with a 1 at index \(i\) if sample \(x_i\) is in the set and 0 otherwise. This encoding can then be fed to any regression model.
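
As a minimal sketch of this encoding in plain Python (the helper `indicator_encoding` is a hypothetical name used for illustration, not part of pyDVL):

```python
def indicator_encoding(subset, n):
    """Return a length-n 0/1 vector with a 1 at every index contained in subset."""
    members = set(subset)
    if any(i < 0 or i >= n for i in members):
        raise ValueError("indices must lie in [0, n)")
    return [1 if i in members else 0 for i in range(n)]


n = 6  # plays the role of len(training_data)
print(indicator_encoding([4, 0, 2], n))  # [1, 0, 1, 0, 1, 0]
# The order in which the indices are listed does not matter:
print(indicator_encoding([2, 4, 0], n) == indicator_encoding([0, 2, 4], n))  # True
```

Each subset maps to exactly one corner of the \(n\)-dimensional hypercube, so a regression model fitted on these vectors only ever sees inputs with coordinates in \(\{0, 1\}\).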

While this can work under some circumstances, note that one is effectively learning a regression function on the corners of an \(n\)-dimensional hypercube, a problem well known to be difficult. For this reason, we offer a (naive) implementation of a permutation-invariant model called Deep Sets which can serve as guidance for a more complex architecture.

DUL with indicator encoding

In this example we use a linear regression model to learn the utility function, with inputs encoded as an indicator vector.

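from pydvl.valuation import (
    DataUtilityLearning,
    Dataset,
    MaxUpdates,
    ModelUtility,
    PermutationSampler,
    ShapleyValuation,
    SupervisedScorer,
)
from pydvl.valuation.utility.learning import IndicatorUtilityModel
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

train, test = Dataset.from_sklearn(load_iris())
scorer = SupervisedScorer("accuracy", test, 0, (0, 1))
# The utility to learn: a classifier scored by its accuracy on the test set
utility = ModelUtility(LogisticRegression(), scorer)
# The utility model: a linear regression on indicator vectors of length len(train)
utility_model = IndicatorUtilityModel(LinearRegression(), len(train))
# Collect 300 utility samples before switching to the learned model
dul = DataUtilityLearning(utility, 300, utility_model)
valuation = ShapleyValuation(
    utility=dul,
    sampler=PermutationSampler(),
    stopping=MaxUpdates(6000),
)
# Note: DUL does not support parallel training yet
valuation.fit(train)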

Deep Sets

Given a set \(S = \{x_1, x_2, \ldots, x_n\}\), Deep Sets (Zaheer et al., 2017)² learn a representation of the set that is invariant to the order of its elements. The model consists of two networks. The first, \(\phi\), computes a learned embedding \(\phi(x_i)\) for each data point \(x_i\), and the embeddings are summed into a representation of the set:

\[ \Phi(S) = \sum_{x_i \in S} \phi(x_i). \]

The second network, \(\rho\), predicts the output \(y\) from the aggregated representation:

\[ y = \rho(\Phi(S)). \]
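
The two equations can be traced in a few lines of plain Python. This is only a sketch of the forward pass with untrained random weights, not the DeepSetUtilityModel shipped with the library. The helpers `make_linear`, `linear`, `phi` and `deep_set` are hypothetical, and a single dense layer stands in for each of the two networks.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Random weights and zero biases for a dense layer (untrained, for illustration)."""
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(layer, x):
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def phi(layer, x):
    """Element embedding phi(x_i): one dense layer followed by tanh."""
    return [math.tanh(v) for v in linear(layer, x)]


def deep_set(phi_layer, rho_layer, points):
    """Compute y = rho(Phi(S)) with Phi(S) the sum of phi(x_i) over the points."""
    pooled = [0.0] * len(phi_layer[0])  # Phi(S), one entry per embedding dimension
    for x in points:
        pooled = [p + e for p, e in zip(pooled, phi(phi_layer, x))]
    return linear(rho_layer, pooled)[0]  # rho: a single linear output unit


rng = random.Random(0)
phi_layer = make_linear(in_dim=2, out_dim=4, rng=rng)
rho_layer = make_linear(in_dim=4, out_dim=1, rng=rng)

points = [[0.1, 0.2], [0.5, -0.3], [-0.7, 0.9]]
shuffled = [points[2], points[0], points[1]]
y = deep_set(phi_layer, rho_layer, points)
y_shuffled = deep_set(phi_layer, rho_layer, shuffled)
# Same set, different order: outputs agree up to floating-point rounding
print(math.isclose(y, y_shuffled, rel_tol=1e-9, abs_tol=1e-12))
```

Because the embeddings are pooled with a sum, reordering the elements can only change the result through floating-point rounding, which is why the comparison uses a tolerance instead of exact equality.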

DUL with DeepSets

This example requires PyTorch to be installed. Here we use a DeepSet model to learn the utility function.

from pydvl.valuation import (
    DataUtilityLearning,
    Dataset,
    MaxUpdates,
    ModelUtility,
    PermutationSampler,
    ShapleyValuation,
    SupervisedScorer,
)
from pydvl.valuation.utility.deepset import DeepSetUtilityModel
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

train, test = Dataset.from_sklearn(load_iris())
scorer = SupervisedScorer("accuracy", test, 0, (0, 1))
# The utility to learn: a classifier scored by its accuracy on the test set
utility = ModelUtility(LogisticRegression(), scorer)
# The utility model: a permutation-invariant DeepSet network
utility_model = DeepSetUtilityModel(
    input_dim=len(train),
    phi_hidden_dim=10,
    phi_output_dim=20,
    rho_hidden_dim=10,
)
# Collect 3000 utility samples before switching to the learned model
dul = DataUtilityLearning(utility, 3000, utility_model)

valuation = ShapleyValuation(
    utility=dul,
    sampler=PermutationSampler(),
    stopping=MaxUpdates(10000),
)
# Note: DUL does not support parallel training yet
valuation.fit(train)

Other architectures

As mentioned above, what makes Deep Sets suitable for DUL is permutation invariance, a required property of any estimator of a function defined over sets, such as the utility. Any other architecture with this property should work as well. Alternatively, one can use other encodings of the sets, as long as they are injective and invariant under permutations (or defined with respect to a fixed ordering, like the indicator encoding above).
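
To make the two requirements concrete, here is a small sketch of one such alternative encoding: the sorted indices of the subset, padded to a fixed length. The helper `padded_index_encoding` is hypothetical and not part of the library, and whether a regression model learns well from this representation is a separate question that the example does not address.

```python
from itertools import combinations


def padded_index_encoding(subset, n, pad=-1):
    """Sorted indices of the subset, right-padded with `pad` to length n."""
    indices = sorted(set(subset))
    return indices + [pad] * (n - len(indices))


n = 4
# Invariance: the order in which the indices are given does not change the encoding
print(padded_index_encoding([3, 1], n) == padded_index_encoding([1, 3], n))  # True

# Injectivity: all 2**n subsets receive distinct encodings
subsets = [c for k in range(n + 1) for c in combinations(range(n), k)]
codes = {tuple(padded_index_encoding(s, n)) for s in subsets}
print(len(codes) == 2 ** n)  # True
```

Sorting removes any dependence on the input order, and the padding value lies outside the index range, so two different subsets can never produce the same vector.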

¹ Wang, T., Yang, Y., Jia, R., 2022. Improving Cooperative Game Theory-based Data Valuation via Data Utility Learning. Presented at the International Conference on Learning Representations (ICLR 2022), Workshop on Socially Responsible Machine Learning. arXiv. https://doi.org/10.48550/arXiv.2107.06336
² Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R.R., Smola, A.J., 2017. Deep Sets, in: Advances in Neural Information Processing Systems. Curran Associates, Inc.