Data Utility Learning ¶
This notebook introduces Data Utility Learning (DUL), a method for approximate data valuation which learns a model of the utility function.
The idea is to estimate the performance of the learning algorithm of interest on unseen data combinations (i.e. subsets of the dataset). The method was originally described in Wang, Tianhao, Yu Yang, and Ruoxi Jia, Improving Cooperative Game Theory-Based Data Valuation via Data Utility Learning, arXiv, 2022.
Warning: Work on Data Utility Learning is preliminary. It remains to be seen when or whether it can be put effectively into application; for this, further testing and benchmarking are required.
Setting and notation ¶
DUL can be applied to any semi-value-based data valuation method: Banzhaf values, Beta-Shapley, etc. In this notebook we focus on the Shapley value. Assume we have some machine learning model, say a classifier \(M\), trained on a dataset \(D=\{x_1, ..., x_n\}\). For notational convenience we identify points \(x_i\) with their indices \(i \in N=\{1, ..., n\}\). We also have a separate validation set \(D_{val}\), and define the utility of a subset \(S \subset N\) as the performance (e.g. accuracy) of \(M\) when trained on \(S\) and evaluated on \(D_{val}\).
Then, the Shapley value \(v_u(i)\) of a point \(i \in N\) is defined as:

\[
v_u(i) = \frac{1}{n} \sum_{S \subseteq N \setminus \{i\}} \binom{n-1}{|S|}^{-1} \left[ u(S \cup \{i\}) - u(S) \right]. \tag{1}\label{eq:shapley-def}
\]
Because each evaluation of \(u(S)\) requires a potentially very costly training of \(M\) on \(S\), the idea of DUL is to learn a surrogate model \(\hat{u}\) for the utility. The main assumptions are that it is much faster to fit and use \(\hat{u}\) than it is to compute \(u\), and that for most \(i\), \(v_{\hat{u}}(i) \approx v_u(i)\) in some sense.
In order to fit \(\hat{u}\), we start by sampling so-called utility samples to form a training set \(\mathcal{S}_\mathrm{train}\). Each utility sample is a tuple consisting of a subset of indices \(S_j\) and its true utility \(u(S_j)\):

\[
\mathcal{S}_\mathrm{train} = \{(S_j, u(S_j)) : j = 1, ..., m_\mathrm{train}\},
\]

where \(m_\mathrm{train}\) denotes the training budget for the learned utility function.
The data utility model \(\hat{u}\) takes as input sets \(S\), which we must encode somehow. In the DUL paper, the authors use 1-hot encoding for the data points, i.e. they use an indicator function for \(S\), which is a boolean vector \(\phi(S)\) in which a \(1\) at index \(k\) means that the \(k\)-th sample of the dataset is present in the subset:

\[
\phi(S) \in \{0, 1\}^{n}, \qquad \phi(S)_k = 1 \iff k \in S.
\]
We train \(\hat{u}\) on the transformed utility samples \(\phi(\mathcal{S}_\mathrm{train}) := \{(\phi(S_j), u(S_j)) : j = 1, ..., m_\mathrm{train}\},\) and then use it to predict the utility of any \(S \notin \mathcal{S}_\mathrm{train}\) instead of computing it.
Alternative encodings ¶
Training e.g. a neural network \(\hat{u}\) on a set of corners of an \(n\)-dimensional unit cube is generally not the best approach, so one can think of alternative encodings. One such possibility is DeepSets, introduced by Zaheer et al. (2017), which learns an encoding for the sets as part of the training procedure. This is done by making the network permutation-invariant, which is the main property of interest when learning a function defined over sets.
pyDVL offers a very simple PyTorch implementation of DeepSets through DeepSetUtilityModel. You can implement your own by subclassing UtilityModel.
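To make the permutation-invariance idea concrete, here is a minimal, self-contained sketch of a DeepSets-style regressor in plain PyTorch. It only illustrates the architecture; it is not pyDVL's DeepSetUtilityModel:

```python
# Minimal DeepSets-style set regressor (illustrative, not pyDVL's implementation).
import torch
from torch import nn


class DeepSetRegressor(nn.Module):
    def __init__(self, dim_input: int, dim_hidden: int = 64):
        super().__init__()
        # phi embeds each element of the set independently
        self.phi = nn.Sequential(
            nn.Linear(dim_input, dim_hidden), nn.ReLU(), nn.Linear(dim_hidden, dim_hidden)
        )
        # rho maps the pooled embedding to a scalar utility estimate
        self.rho = nn.Sequential(
            nn.Linear(dim_hidden, dim_hidden), nn.ReLU(), nn.Linear(dim_hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, set_size, dim_input)
        z = self.phi(x).sum(dim=1)  # sum-pooling makes the output permutation-invariant
        return self.rho(z).squeeze(-1)


# Permutation-invariance check on random data
net = DeepSetRegressor(dim_input=4)
s = torch.randn(1, 5, 4)
assert torch.allclose(net(s), net(s[:, torch.randperm(5)]), atol=1e-6)
```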
Setup ¶
We begin by importing the main libraries and setting some defaults.
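The import cell itself is not reproduced here. Below is a sketch of what it plausibly contains, inferred from the classes and helpers used throughout this notebook; the module layout assumes a recent pyDVL with the pydvl.valuation API, and the values of n_jobs etc. are illustrative:

```python
# Sketch of the setup cell, inferred from the code below. Module paths assume
# a recent pyDVL (the `pydvl.valuation` API); adjust to your installed version.
from collections import defaultdict
from itertools import product

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

from pydvl.valuation import (
    Dataset,
    DeterministicUniformSampler,
    ModelUtility,
    NoStopping,
    ShapleyValuation,
    SupervisedScorer,
)

n_jobs = 4  # illustrative value
computation_times = defaultdict(list)  # filled in as we fit valuations below
```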
Data utility learning can be applied to any valuation method that uses a Utility, in particular any SemivalueValuation . For this example we will use Shapley values with the subclass ShapleyValuation .
Dataset ¶
Closely following the paper, we take train_size=15 samples (10%) from the Iris dataset and compute their Data Shapley values, using all the remaining samples as test set for computing the utility, which in this case is accuracy.
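The loading cell is not shown here. One plausible reconstruction, using only scikit-learn and the Dataset constructor that appears later in this notebook (split parameters such as random_state are illustrative):

```python
# Sketch: take 15 training samples (10%) from Iris, keep the rest for testing.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

bunch = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    bunch.data, bunch.target, train_size=15, stratify=bunch.target, random_state=16
)
train = Dataset(
    x_train, y_train, feature_names=bunch.feature_names, target_names=bunch.target_names
)
test = Dataset(
    x_test, y_test, feature_names=bunch.feature_names, target_names=bunch.target_names
)
```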
We verify that, as in the paper, if we fit a Support-Vector Classifier to the training data, we obtain a high accuracy:
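A sketch of that check follows; the choice of SVC is an assumption based on the paper's use of an SVM:

```python
# Fit the classifier on the 15 training points and score it on the rest.
from sklearn.svm import SVC

model = SVC()  # illustrative choice; any sklearn classifier works
model.fit(train.data().x, train.data().y)
print(f"Accuracy: {model.score(test.data().x, test.data().y):.2f}")
```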
Computing exact Shapley values ¶
We start by defining the utility using the main model of interest and computing the exact Data Shapley values by definition \(\ref{eq:shapley-def}\) .
We require a SupervisedScorer (accuracy) and a ModelUtility (not to be confused with DUL's model for the utility, which is UtilityModel!).
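The cell is reconstructed below from the identical construction that reappears later in this notebook:

```python
# Accuracy on the held-out data, with 0 as the default score when fitting fails
scorer = SupervisedScorer("accuracy", test_data=test, default=0, range=(0, 1))
utility = ModelUtility(model=model, scorer=scorer, show_warnings=False)
```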
Then we must pick the sampling scheme. In order to compute exact values we must use either DeterministicUniformSampler, which yields all possible subsets of the data, or DeterministicPermutationSampler, which yields all possible permutations of the data; the latter is prohibitively expensive.
Finally, we must pick a stopping criterion. Since the sampler is finite, we use NoStopping to run until completion and we pass it the sampler to keep track of progress.
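Again reconstructing the cell from the identical construction used later for the corrupted dataset:

```python
sampler = DeterministicUniformSampler(batch_size=32)
stopping = NoStopping(sampler)  # runs until the sampler is exhausted, tracking progress
valuation = ShapleyValuation(
    utility=utility, sampler=sampler, is_done=stopping, progress=True
)
```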
With everything set up, we fit the valuation in parallel and obtain the exact Data Shapley values:
from joblib import parallel_config

from pydvl.utils.functional import timed

timed_fit = timed(valuation.fit)
with parallel_config(n_jobs=n_jobs):
    timed_fit(train)
computation_times["exact"] = timed_fit.execution_time
result = valuation.values()
df = result.to_dataframe(column="exact")["exact"]  # We only need the values
Learning the utility with a simple neural network ¶
We now estimate the Data Shapley values with DataUtilityLearning. This class learns a model of the Utility by wrapping it and delegating calls to it, up until a given budget. Every call yields a utility sample which is saved under the hood for training of the given utility model. Once the budget is exhausted, DataUtilityLearning fits the model to the utility samples and all subsequent calls use the learned model instead of the wrapped utility. Because each evaluation of the original utility is assumed to take a long time, this results in a speedup.
Below are the parameters for the model used to learn the utility. We follow the paper and use a fully connected neural network whose inputs are indicator vectors of the sets. For a set \(S = \{i_1, ..., i_m\}\), the encoding is a binary vector \(x\) such that \(x_i = 1\) if \(i \in S\) and \(0\) otherwise. The process of encoding the data and fitting the neural network is encapsulated inside an IndicatorUtilityModel. Other choices inherit from the abstract base class UtilityModel; we will see an example later.
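The parameter cell is not shown; the values below are placeholders to make the later cells runnable, not the notebook's actual settings:

```python
# Hypothetical hyperparameters for the MLPRegressor used as utility model.
mlp_params = dict(
    hidden_layer_sizes=(32, 32),
    learning_rate_init=0.001,
    max_iter=400,
    early_stopping=True,
)
```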
For the training we use an increasing training_budget of utility samples, spaced on a log scale from 100 to 4000. We repeat each training procedure 10 times in order to compute confidence intervals.
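A sketch of the corresponding cell; the number of budgets is an assumption, since only the 100 to 4000 log-scale range is stated above:

```python
# Log-spaced training budgets from 100 to 4000 (8 points is illustrative)
training_budgets = np.unique(
    np.logspace(np.log10(100), np.log10(4000), num=8).astype(int)
)
n_runs = 10  # repetitions per budget, for confidence intervals
```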
Each experiment is encapsulated in the function run_once, which takes a run identifier and a DUL budget, then performs the fitting and returns all information as a tuple.
from joblib import Parallel, delayed
from sklearn.neural_network import MLPRegressor

from pydvl.utils.functional import suppress_warnings, timed
from pydvl.valuation.result import ValuationResult
from pydvl.valuation.samplers import PermutationSampler
from pydvl.valuation.samplers.truncation import RelativeTruncation
from pydvl.valuation.stopping import MaxUpdates
from pydvl.valuation.utility import DataUtilityLearning
from pydvl.valuation.utility.learning import IndicatorUtilityModel


@suppress_warnings(categories=(RuntimeWarning,))
def run_once(run: int, budget: int) -> tuple[int, int, ValuationResult, float]:
    # DUL will kick in after `budget` calls to the utility
    utility_model = IndicatorUtilityModel(MLPRegressor(**mlp_params), n_data=len(train))
    dul_utility = DataUtilityLearning(
        utility=utility,
        training_budget=budget,
        model=utility_model,
        show_warnings=False,
    )
    truncation = RelativeTruncation(rtol=0.001)
    sampler = PermutationSampler(truncation=truncation)
    stopping = MaxUpdates(300)
    valuation = ShapleyValuation(dul_utility, sampler, is_done=stopping, progress=False)
    # Note that DUL does not support parallel fitting (yet?)
    timed_fit = timed(valuation.fit)
    timed_fit(train)
    return run, budget, valuation.values(), timed_fit.execution_time


with parallel_config(n_jobs=n_jobs):
    worker = delayed(run_once)
    with Parallel(return_as="generator_unordered") as parallel:
        jobs = parallel(
            worker(run, budget)
            for run, budget in product(range(n_runs), training_budgets)
        )
        pbar = tqdm(jobs, total=n_runs * len(training_budgets))
        for run, budget, result, time in pbar:
            computation_times[budget].append(time)
            dul_df = result.to_dataframe(column=f"{budget}_{run}")[f"{budget}_{run}"]
            df = pd.concat([df, dul_df], axis=1)
Notice how we have changed the sampling and stopping: we now use a PermutationSampler with a RelativeTruncation policy, together with MaxUpdates as stopping criterion. The former truncates the processing of each permutation once the relative change in utility falls below a threshold, and the latter stops the valuation after a given number of updates (to ensure that the valuation does not run indefinitely). One could additionally combine this with HistoryDeviation, which stops when the relative deviation of the values remains below a threshold for a given number of steps; it is the criterion used in the paper introducing Truncated Monte Carlo Shapley.
Next we compute the \(l_2\) error for the different training budgets across all runs and plot mean and standard deviation. We obtain results analogous to Figure 1 of the paper, verifying that the method indeed works for estimating the Data Shapley values (at least in this context).
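A sketch of the error computation, relying on the column naming scheme used when building df above:

```python
# l2 distance between estimated and exact values, one entry per (budget, run)
l2_errors = pd.DataFrame(
    {
        budget: [
            np.linalg.norm(df[f"{budget}_{run}"] - df["exact"])
            for run in range(n_runs)
        ]
        for budget in training_budgets
    }
)
mean_err, std_err = l2_errors.mean(), l2_errors.std()  # per-budget statistics
```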
In the plot we also display the mean and standard deviation of the computation time taken for each training budget.
Let us next look at how well the ranking of values resulting from using the surrogate \(\hat{u}\) matches the ranking by the exact values. For this we fix \(k=3\) and consider the \(k\) samples with the highest value according to \(\hat{u}\) and \(u\):
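One way to compute this overlap from the dataframe built above (averaging values over runs per budget is our choice here):

```python
# Compare the k best-ranked samples under the exact and learned utilities
k = 3
exact_top = set(df["exact"].nlargest(k).index)
for budget in training_budgets:
    run_cols = [f"{budget}_{run}" for run in range(n_runs)]
    estimated_top = set(df[run_cols].mean(axis=1).nlargest(k).index)
    print(f"budget={budget}: overlap with exact top-{k}: {len(exact_top & estimated_top)}")
```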
Finally, for each sample, we look at the distance of the estimates to the exact value across runs. Boxes are centered at the 50th percentile, with whiskers at the 25th and 75th. We plot relative distances, as a percentage.
Evaluation on anomalous data ¶
One interesting way to assess the Data Utility Learning approach is to corrupt some data and monitor how the value changes. To do this, we will take the sample with the highest score and change its label.
# Take the sample with the highest exact value and flip its label
highest_value_index = df.index[df["exact"].argmax()]
y_corrupted = train.data().y.copy()
y_corrupted[highest_value_index] = (y_corrupted[highest_value_index] + 1) % 3  # Iris has 3 classes

corrupted_dataset = Dataset(
    train.data().x.copy(),
    y_corrupted,
    feature_names=train.feature_names,
    target_names=train.target_names,
)
We retrain the model on the new dataset and verify that the accuracy decreases:
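A sketch of the retraining cell, assuming the model and test set defined earlier:

```python
# Refit on the corrupted labels and compare accuracy with the clean fit
model.fit(corrupted_dataset.data().x, corrupted_dataset.data().y)
print(f"Accuracy: {model.score(test.data().x, test.data().y):.2f}")
```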
Now we recompute the values of all samples using the exact method:
sampler = DeterministicUniformSampler(batch_size=32)
stopping = NoStopping(sampler)
valuation = ShapleyValuation(
    utility=utility, sampler=sampler, is_done=stopping, progress=True
)
with parallel_config(n_jobs=n_jobs):
    valuation.fit(corrupted_dataset)
result = valuation.values()
df_corrupted = result.to_dataframe(column="exact")["exact"]
And finally we use DUL with the best training budget previously obtained, and plot the resulting scores.
scorer = SupervisedScorer("accuracy", test_data=test, default=0, range=(0, 1))
utility = ModelUtility(model=model, scorer=scorer, show_warnings=False)
utility_model = IndicatorUtilityModel(MLPRegressor(**mlp_params), n_data=len(train))
dul_utility = DataUtilityLearning(
    utility=utility,
    training_budget=best_budget,
    model=utility_model,
    show_warnings=False,
)
truncation = RelativeTruncation(rtol=0.001)
sampler = PermutationSampler(truncation=truncation)
stopping = MaxUpdates(300)
valuation = ShapleyValuation(dul_utility, sampler, is_done=stopping, progress=False)

valuation.fit(corrupted_dataset)
result = valuation.values()
dul_df = result.to_dataframe(column="estimated")["estimated"]
df_corrupted = pd.concat([df_corrupted, dul_df], axis=1)
We can see in the figure that both methods assign the lowest value to the sample with the corrupted label.