Data Utility Learning¶

This notebook introduces Data Utility Learning, a method of approximating Data Shapley values by learning to estimate the utility function.

The idea is to employ a model to learn the performance of the learning algorithm of interest on unseen data combinations (i.e. subsets of the dataset). The method was originally described in Wang, Tianhao, Yu Yang, and Ruoxi Jia. Improving Cooperative Game Theory-Based Data Valuation via Data Utility Learning. arXiv, 2022.

Warning: Work on Data Utility Learning is preliminary. It remains to be seen when or whether it can be put effectively into application. For this further testing and benchmarking are required.

Recall the definition of Shapley value \(v_u(i)\) for data point \(i\):

\[\begin{equation} v_u(i) = \frac{1}{n} \sum_{S \subseteq N \setminus \{i\}} \binom{n-1}{|S|}^{-1} [u(S \cup \{i\}) − u(S)] , \tag{1} \label{eq:shapley-def} \end{equation}\]

where \(N\) is the set of all indices in the training set and \(u\) is the utility.

In Data Utility Learning, to avoid the exponential cost of computing this sum, one learns a surrogate model for \(u\). We start by sampling so-called utility samples to form a training set \(S_\mathrm{train}\) for our utility model. Each utility sample is a tuple consisting of a subset of indices \(S_j\) in the dataset and its utility \(u(S_j)\):

\[\mathcal{S}_\mathrm{train} = \{(S_j, u(S_j): j = 1 , ..., m_\mathrm{train}\}\]

where \(m_\mathrm{train}\) denotes the training budget for the learned utility function.

The subsets are then transformed into boolean vectors \(\phi\) in which a \(1\) at index \(k\) means that the \(k\)-th sample of the dataset is present in the subset:

\[S_j \mapsto \phi_j \in \{ 0, 1 \}^{N}\]

We fit a regression model \(\tilde{u}\), called data utility model, on the transformed utility samples \(\phi (\mathcal{S}_\mathrm{train}) := \{(\phi(S_j), u(S_j): j = 1 , ..., m_\mathrm{train}\}\) and use it to predict instead of computing the utility for any \(S_j \notin \mathcal{S}_\mathrm{train}\). We abuse notation and identify \(\tilde{u}\) with the composition \(\tilde{u} \circ \phi : N \rightarrow \mathbb{R}\).

The main assumption is that it is much faster to fit and use \(\tilde{u}\) than it is to compute \(u\) and that for most \(i\), \(v_\tilde{u}(i) \approx v_u(i)\) in some sense.

Setup¶

We begin by importing the main libraries and setting some defaults.

If you are reading this in the documentation, some boilerplate (including most plotting code) has been omitted for convenience.

As is the case with all other Shapley methods, the main entry point is the function compute_shapley_values(), which provides a facade to all algorithms in this family. We use it with the usual classes Dataset and Utility. In addition, we must import the core class for learning a utility, DataUtilityLearning.

%autoreload
from pydvl.utils import DataUtilityLearning, top_k_value_accuracy
from pydvl.reporting.plots import shaded_mean_std
from pydvl.value import *

Dataset¶

Following the paper, we take 15 samples (10%) from the Iris dataset and compute their Data Shapley values by using all the remaining samples as test set for computing the utility, which in this case is accuracy.

dataset = Dataset.from_sklearn(
    load_iris(),
    train_size=train_size,
    random_state=random_state,
    stratify_by_target=True,
)

We verify that, as in the paper, if we fit a Support-Vector Classifier to the training data, we obtain an accuracy of around 92%:

model = LinearSVC()
model.fit(dataset.x_train, dataset.y_train)
print(f"Mean accuracy: {100 * model.score(dataset.x_test, dataset.y_test):0.2f}%")

Mean accuracy: 92.59%

Data Shapley¶

We start by defining the utility using the model and computing the exact Data Shapley values by definition \(\ref{eq:shapley-def}\).

computation_times = {}

utility = Utility(model=model, data=dataset)

start_time = time.monotonic()

result = compute_shapley_values(
    u=utility,
    mode=ShapleyMode.CombinatorialExact,
    n_jobs=-1,
    progress=False,
)

computation_time = time.monotonic() - start_time
computation_times["exact"] = computation_time

df = result.to_dataframe(column="exact").drop(columns=["exact_stderr"])

We now estimate the Data Shapley values using the DataUtilityLearning wrapper. This class wraps a Utility and delegates calls to it, up until a given budget. Every call yields a utility sample which is saved under the hood for training of the given utility model. Once the budget is exhausted, DataUtilityLearning fits the model to the utility samples and all subsequent calls use the learned model to predict the wrapped utility instead of delegating to it.

For the utility model we follow the paper and use a fully connected neural network. To train it we use a total of training_budget utility samples. We repeat this multiple times for each training budget.

Note how we use a MonteCarlo approximation instead of `combinatorial_exact` as before. This is because the exact computation samples subsets in a particular order, from the lowest size to the largest. Because the training budget for the model to learn the utility is around 1/4th of the total number of subsets, this would mean that we would never see utility samples for the larger sizes and the model would be biased (try it!)

mlp_kwargs = dict(
    hidden_layer_sizes=(20, 10),
    activation="relu",
    solver="adam",
    learning_rate_init=0.001,
    batch_size=batch_size,
    max_iter=800,
)

print(
    f"Doing {n_runs} runs for each of {len(training_budget_values)} different training budgets."
)

pbar = tqdm(
    product(range(n_runs), training_budget_values),
    total=n_runs * len(training_budget_values),
)
for idx, budget in pbar:
    pbar.set_postfix_str(f"Run {idx} for training budget: {budget}")
    dul_utility = DataUtilityLearning(
        u=utility, training_budget=budget, model=MLPRegressor(**mlp_kwargs)
    )

    start_time = time.monotonic()

    # DUL will kick in after training_budget calls to utility
    result = compute_shapley_values(
        u=dul_utility,
        mode=ShapleyMode.PermutationMontecarlo,
        done=MaxUpdates(300),
        n_jobs=-1,
    )

    computation_time = time.monotonic() - start_time
    if budget in computation_times:
        computation_times[budget].append(computation_time)
    else:
        computation_times[budget] = [computation_time]

    dul_df = result.to_dataframe(column=f"{budget}_{idx}").drop(
        columns=[f"{budget}_{idx}_stderr"]
    )
    df = pd.concat([df, dul_df], axis=1)

computation_times_df = pd.DataFrame(computation_times)

Doing 10 runs for each of 10 different training budgets.

  0%|          | 0/100 [00:00<?, ?it/s]

Next we compute the \(l_1\) error for the different training budgets across all runs and plot mean and standard deviation. We obtain results analogous to Figure 1 of the paper, verifying that the method indeed works for estimating the Data Shapley values (at least in this context).

In the plot we also display the mean and standard deviation of the computation time taken for each training budget.

errors = np.zeros((len(training_budget_values), n_runs), dtype=float)
accuracies = np.zeros((len(training_budget_values), n_runs), dtype=float)

top_k = 3

for i, budget in enumerate(training_budget_values):
    for j in range(n_runs):
        y_true = df["exact"].values
        y_estimated = df[f"{budget}_{j}"].values
        errors[i, j] = np.linalg.norm(y_true - y_estimated, ord=2)
        accuracies[i, j] = top_k_value_accuracy(y_true, y_estimated, k=top_k)

error_from_mean = np.linalg.norm(df["exact"].values - df["exact"].values.mean(), ord=2)

No description has been provided for this image

Let us next look at how well the ranking of values resulting from using the surrogate \(\tilde{u}\) matches the ranking by the exact values. For this we fix \(k=3\) and consider the \(k\) samples with the highest value according to \(\tilde{u}\) and \(u\):

Finally, for each sample, we look at the distance of the estimates to the exact value across runs. Boxes are centered at the 50th percentile with wiskers at the 25th and 75th. We plot relative distances, as a percentage. We observe a general tendency to underestimate the value:

Evaluation on anomalous data¶

One interesting way to assess the Data Utility Learning approach is to corrupt some data and monitor how the value changes. To do this, we will take the sample with the highest score and change its label.

highest_value_index = df.index[df["exact"].argmax()]
y_train_corrupted = dataset.y_train.copy()
y_train_corrupted[highest_value_index] = (
    y_train_corrupted[highest_value_index] + 1
) % 3

corrupted_dataset = Dataset(
    x_train=dataset.x_train,
    y_train=y_train_corrupted,
    x_test=dataset.x_test,
    y_test=dataset.y_test,
)

We retrain the model on the new dataset and verify that the accuracy decreases:

model = LinearSVC()
model.fit(dataset.x_train, y_train_corrupted)
print(f"Mean accuracy: {100 * model.score(dataset.x_test, dataset.y_test):0.2f}%")

Mean accuracy: 82.96%

Finally, we recompute the values of all samples using the exact method and the best training budget previously obtained and then plot the resulting scores.

best_training_budget = training_budget_values[errors.mean(axis=1).argmin()]

utility = Utility(
    model=LinearSVC(),
    data=corrupted_dataset,
)

result = compute_shapley_values(
    u=utility,
    mode=ShapleyMode.CombinatorialExact,
    n_jobs=-1,
    progress=False,
)
df_corrupted = result.to_dataframe(column="exact").drop(columns=["exact_stderr"])

dul_utility = DataUtilityLearning(
    u=utility, training_budget=best_training_budget, model=MLPRegressor(**mlp_kwargs)
)

result = compute_shapley_values(
    u=dul_utility,
    mode=ShapleyMode.PermutationMontecarlo,
    done=MaxUpdates(300),
    n_jobs=-1,
)
dul_df = result.to_dataframe(column="estimated").drop(columns=["estimated_stderr"])
df_corrupted = pd.concat([df_corrupted, dul_df], axis=1)

We can see in the figure that both methods assign the lowest value to the sample with the corrupted label.

As mentioned above, despite the previous results, this work is preliminary and the usefulness of Data Utility Learning remains to be tested in practice.