Data Utility Learning¶
This notebook introduces Data Utility Learning, a method of approximating Data Shapley values by learning to estimate the utility function.
The idea is to employ a model to learn the performance of the learning algorithm of interest on unseen data combinations (i.e. subsets of the dataset). The method was originally described in Wang, Tianhao, Yu Yang, and Ruoxi Jia. Improving Cooperative Game Theory-Based Data Valuation via Data Utility Learning. arXiv, 2022.
Warning: Work on Data Utility Learning is preliminary. It remains to be seen when or whether it can be put effectively into application. For this further testing and benchmarking are required.
Recall the definition of Shapley value \(v_u(i)\) for data point \(i\):
where \(N\) is the set of all indices in the training set and \(u\) is the utility.
In Data Utility Learning, to avoid the exponential cost of computing this sum, one learns a surrogate model for \(u\). We start by sampling so-called utility samples to form a training set \(S_\mathrm{train}\) for our utility model. Each utility sample is a tuple consisting of a subset of indices \(S_j\) in the dataset and its utility \(u(S_j)\):
where \(m_\mathrm{train}\) denotes the training budget for the learned utility function.
The subsets are then transformed into boolean vectors \(\phi\) in which a \(1\) at index \(k\) means that the \(k\)-th sample of the dataset is present in the subset:
We fit a regression model \(\tilde{u}\), called data utility model, on the transformed utility samples \(\phi (\mathcal{S}_\mathrm{train}) := \{(\phi(S_j), u(S_j): j = 1 , ..., m_\mathrm{train}\}\) and use it to predict instead of computing the utility for any \(S_j \notin \mathcal{S}_\mathrm{train}\). We abuse notation and identify \(\tilde{u}\) with the composition \(\tilde{u} \circ \phi : N \rightarrow \mathbb{R}\).
The main assumption is that it is much faster to fit and use \(\tilde{u}\) than it is to compute \(u\) and that for most \(i\), \(v_\tilde{u}(i) \approx v_u(i)\) in some sense.
Setup¶
We begin by importing the main libraries and setting some defaults.
As is the case with all other Shapley methods, the main entry point is the function compute_shapley_values(), which provides a facade to all algorithms in this family. We use it with the usual classes Dataset and Utility. In addition, we must import the core class for learning a utility, DataUtilityLearning.
Dataset¶
Following the paper, we take 15 samples (10%) from the Iris dataset and compute their Data Shapley values by using all the remaining samples as test set for computing the utility, which in this case is accuracy.
We verify that, as in the paper, if we fit a Support-Vector Classifier to the training data, we obtain an accuracy of around 92%:
Data Shapley¶
We start by defining the utility using the model and computing the exact Data Shapley values by definition \(\ref{eq:shapley-def}\).
start_time = time.monotonic()
result = compute_shapley_values(
u=utility,
mode=ShapleyMode.CombinatorialExact,
n_jobs=-1,
progress=False, # Does not display correctly in a notebook
)
computation_time = time.monotonic() - start_time
computation_times["exact"] = computation_time
df = result.to_dataframe(column="exact").drop(columns=["exact_stderr"])
We now estimate the Data Shapley values using the DataUtilityLearning wrapper. This class wraps a Utility and delegates calls to it, up until a given budget. Every call yields a utility sample which is saved under the hood for training of the given utility model. Once the budget is exhausted, DataUtilityLearning
fits the model to the utility samples and all subsequent calls use the learned model to predict the wrapped utility instead of delegating to it.
For the utility model we follow the paper and use a fully connected neural network. To train it we use a total of training_budget
utility samples. We repeat this multiple times for each training budget.
mlp_kwargs = dict(
hidden_layer_sizes=(20, 10),
activation="relu",
solver="adam",
learning_rate_init=0.001,
batch_size=32,
max_iter=800,
)
print(
f"Doing {n_runs} runs for each of {len(training_budget_values)} different training budgets."
)
pbar = tqdm(
product(range(n_runs), training_budget_values),
total=n_runs * len(training_budget_values),
)
for idx, budget in pbar:
pbar.set_postfix_str(f"Run {idx} for training budget: {budget}")
dul_utility = DataUtilityLearning(
u=utility, training_budget=budget, model=MLPRegressor(**mlp_kwargs)
)
start_time = time.monotonic()
# DUL will kick in after training_budget calls to utility
result = compute_shapley_values(
u=dul_utility,
mode=ShapleyMode.PermutationMontecarlo,
done=MaxUpdates(300),
n_jobs=-1,
)
computation_time = time.monotonic() - start_time
if budget in computation_times:
computation_times[budget].append(computation_time)
else:
computation_times[budget] = [computation_time]
dul_df = result.to_dataframe(column=f"{budget}_{idx}").drop(
columns=[f"{budget}_{idx}_stderr"]
)
df = pd.concat([df, dul_df], axis=1)
computation_times_df = pd.DataFrame(computation_times)
Next we compute the \(l_1\) error for the different training budgets across all runs and plot mean and standard deviation. We obtain results analogous to Figure 1 of the paper, verifying that the method indeed works for estimating the Data Shapley values (at least in this context).
In the plot we also display the mean and standard deviation of the computation time taken for each training budget.
errors = np.zeros((len(training_budget_values), n_runs), dtype=float)
accuracies = np.zeros((len(training_budget_values), n_runs), dtype=float)
top_k = 3
for i, budget in enumerate(training_budget_values):
for j in range(n_runs):
y_true = df["exact"].values
y_estimated = df[f"{budget}_{j}"].values
errors[i, j] = np.linalg.norm(y_true - y_estimated, ord=2)
accuracies[i, j] = top_k_value_accuracy(y_true, y_estimated, k=top_k)
error_from_mean = np.linalg.norm(df["exact"].values - df["exact"].values.mean(), ord=2)
Let us next look at how well the ranking of values resulting from using the surrogate \(\tilde{u}\) matches the ranking by the exact values. For this we fix \(k=3\) and consider the \(k\) samples with the highest value according to \(\tilde{u}\) and \(u\):
Finally, for each sample, we look at the distance of the estimates to the exact value across runs. Boxes are centered at the 50th percentile with wiskers at the 25th and 75th. We plot relative distances, as a percentage. We observe a general tendency to underestimate the value:
Evaluation on anomalous data¶
One interesting way to assess the Data Utility Learning approach is to corrupt some data and monitor how the value changes. To do this, we will take the sample with the highest score and change its label.
highest_value_index = df.index[df["exact"].argmax()]
y_train_corrupted = dataset.y_train.copy()
y_train_corrupted[highest_value_index] = (
y_train_corrupted[highest_value_index] + 1
) % 3
corrupted_dataset = Dataset(
x_train=dataset.x_train,
y_train=y_train_corrupted,
x_test=dataset.x_test,
y_test=dataset.y_test,
)
We retrain the model on the new dataset and verify that the accuracy decreases:
Finally, we recompute the values of all samples using the exact method and the best training budget previously obtained and then plot the resulting scores.
best_training_budget = training_budget_values[errors.mean(axis=1).argmin()]
utility = Utility(
model=LinearSVC(),
data=corrupted_dataset,
)
result = compute_shapley_values(
u=utility,
mode=ShapleyMode.CombinatorialExact,
n_jobs=-1,
progress=False,
)
df_corrupted = result.to_dataframe(column="exact").drop(columns=["exact_stderr"])
dul_utility = DataUtilityLearning(
u=utility, training_budget=best_training_budget, model=MLPRegressor(**mlp_kwargs)
)
result = compute_shapley_values(
u=dul_utility,
mode=ShapleyMode.PermutationMontecarlo,
done=MaxUpdates(300),
n_jobs=-1,
)
dul_df = result.to_dataframe(column="estimated").drop(columns=["estimated_stderr"])
df_corrupted = pd.concat([df_corrupted, dul_df], axis=1)
We can see in the figure that both methods assign the lowest value to the sample with the corrupted label.
Created: 2023-10-14