
Data Utility Learning ¶

This notebook introduces Data Utility Learning (DUL), a method for approximate data valuation which learns a model of the utility function.

The idea is to estimate the performance of the learning algorithm of interest on unseen data combinations (i.e. subsets of the dataset). The method was originally described in Wang, Tianhao, Yu Yang, and Ruoxi Jia. Improving Cooperative Game Theory-Based Data Valuation via Data Utility Learning. arXiv, 2022.

Warning: Work on Data Utility Learning is preliminary. It remains to be seen when or whether it can be put effectively into application; further testing and benchmarking are required.

Setting and notation ¶

DUL can be applied to any semi-value-based data valuation method, e.g. Banzhaf values, Beta-Shapley, etc. In this notebook we focus on the Shapley value. Assume we have some machine learning model, say a classifier \(M\), trained on a dataset \(D=\{x_1, ..., x_n\}\). For notational convenience we identify points \(x_i\) with their indices \(i \in N=\{1, ..., n\}\). We also have a separate valuation set \(D_{val}\), and define the utility of a subset \(S \subset N\) as the performance (e.g. accuracy) of \(M\) when trained on \(S\) and evaluated on \(D_{val}\).
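Concretely, evaluating \(u\) amounts to retraining \(M\) on the subset and scoring it on the valuation set. A minimal sketch, assuming numpy arrays x_train, y_train and x_val, y_val are available (this is essentially what the ModelUtility class used below encapsulates):

from sklearn.base import clone


def u(S, model, x_train, y_train, x_val, y_val) -> float:
    """Accuracy of a fresh copy of the model, trained only on the subset S."""
    idx = sorted(S)
    fitted = clone(model).fit(x_train[idx], y_train[idx])
    return fitted.score(x_val, y_val)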

The Shapley value \(v_u(i)\) of a point \(i \in N\) is then defined as:

\[\begin{equation} v_u(i) = \frac{1}{n} \sum_{S \subseteq N \setminus \{i\}} \binom{n-1}{|S|}^{-1} [u(S \cup \{i\}) - u(S)], \tag{1} \label{eq:shapley-def} \end{equation}\]
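For intuition, here is a brute-force implementation of definition \(\ref{eq:shapley-def}\) for a toy, cheap-to-evaluate utility (illustrative code, independent of pyDVL). Note the exponential number of utility evaluations, each of which would involve a model training in practice:

from itertools import combinations
from math import comb


def exact_shapley(u, n: int) -> list[float]:
    """Evaluate eq. (1) by enumerating all subsets not containing i."""
    values = []
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):  # |S| ranges over 0, ..., n-1
            for S in combinations(rest, size):
                marginal = u(set(S) | {i}) - u(set(S))
                total += marginal / comb(n - 1, size)
        values.append(total / n)
    return values


# Toy utility: fraction of the "informative" points present in S
print(exact_shapley(lambda S: len(S & {0, 2}) / 2, n=4))
# [0.5, 0.0, 0.5, 0.0]: the two informative points share all the value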

Because each evaluation of \(u(S)\) requires a potentially very costly training of \(M\) on \(S\), the idea of DUL is to learn a surrogate model \(\hat{u}\) for the utility. The main assumptions are that \(\hat{u}\) is much faster to fit and evaluate than \(u\), and that for most \(i\), \(v_{\hat{u}}(i) \approx v_u(i)\) in some sense.

In order to fit \(\hat{u}\), we start by sampling so-called utility samples to form a training set \(\mathcal{S}_\mathrm{train}\). Each utility sample is a tuple consisting of a subset of indices \(S_j\) and its true utility \(u(S_j)\):

\[\mathcal{S}_\mathrm{train} = \{(S_j, u(S_j)) : S_j \subset N,\ j = 1, \ldots, m_\mathrm{train}\},\]

where \(m_\mathrm{train}\) denotes the training budget for the learned utility function.

The data utility model \(\hat{u}\) takes sets \(S\) as input, which we must encode somehow. In the DUL paper, the authors use a one-hot encoding of the data points, i.e. an indicator function for \(S\): a boolean vector \(\phi\) in which a \(1\) at index \(k\) means that the \(k\)-th sample of the dataset is present in the subset:

\[S_j \mapsto \phi_j \in \{ 0, 1 \}^{n}, \text{ where } \phi_j^k = 1 \text{ iff } x_k \in S_j.\]

We train \(\hat{u}\) on the transformed utility samples \(\phi(\mathcal{S}_\mathrm{train}) := \{(\phi(S_j), u(S_j)) : j = 1, \ldots, m_\mathrm{train}\},\) and then use it to predict the utility of any \(S \notin \mathcal{S}_\mathrm{train}\) instead of computing it.
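A minimal sketch of this encoding and of assembling a transformed training set (the subset sampler and the utility here are toy stand-ins for pyDVL's machinery):

import numpy as np

n = 10
rng = np.random.default_rng(16)


def indicator(S: set) -> np.ndarray:
    """phi(S): binary vector with a 1 at position k iff point k is in S."""
    phi = np.zeros(n)
    phi[list(S)] = 1.0
    return phi


# Toy stand-ins: random subsets and a placeholder utility u(S) = |S| / n
subsets = [
    set(rng.choice(n, size=rng.integers(1, n), replace=False).tolist())
    for _ in range(32)
]
X = np.stack([indicator(S) for S in subsets])  # shape (m_train, n)
y = np.array([len(S) / n for S in subsets])  # true utilities would go here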

Alternative encodings ¶

Training e.g. a neural network \(\hat{u}\) on a set of corners of the \(n\)-dimensional unit cube is generally not the best approach, so one can think of alternative encodings. One such possibility is DeepSets, introduced by Zaheer et al. (2017), which learn an encoding for the sets as part of the training procedure. This is done by making the network permutation-invariant, the key property of interest when learning a function defined over sets.

pyDVL offers a very simple PyTorch implementation of DeepSets through DeepSetUtilityModel. You can implement your own by subclassing UtilityModel.
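For illustration, the core of the DeepSets construction takes only a few lines of PyTorch (a schematic sketch, not DeepSetUtilityModel itself): every element is embedded by a network \(\phi\), the embeddings are sum-pooled, and a second network \(\rho\) maps the pooled representation to a utility estimate. The sum is what makes the model permutation-invariant.

import torch
from torch import nn


class TinyDeepSet(nn.Module):
    """Schematic DeepSets regressor: rho(sum_i phi(x_i))."""

    def __init__(self, d_in: int, d_hidden: int = 32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.rho = nn.Linear(d_hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, set_size, d_in); summing over the set dimension makes
        # the output independent of the ordering of the elements
        return self.rho(self.phi(x).sum(dim=1)).squeeze(-1)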

Setup ¶

We begin by importing the main libraries and setting some defaults.

If you are reading this in the documentation, some boilerplate (including most plotting code) has been omitted for convenience.

Data utility learning can be applied to any valuation method that uses a Utility, in particular any SemivalueValuation . For this example we will use Shapley values with the subclass ShapleyValuation .

Dataset ¶

Closely following the paper, we take train_size=15 samples (10%) from the Iris dataset and compute their Data Shapley values, using all remaining samples as the test set for computing the utility, which in this case is accuracy.

from sklearn.datasets import load_iris

from pydvl.valuation.dataset import Dataset

# train_size and random_state are defined in the omitted boilerplate
train, test = Dataset.from_sklearn(
    load_iris(),
    train_size=train_size,
    random_state=random_state,
    stratify_by_target=True,
)

We verify that, as in the paper, if we fit a linear Support Vector Classifier to the training data, we obtain a high accuracy:

from sklearn.svm import LinearSVC

model = LinearSVC()
model.fit(*train.data());
Mean accuracy: 94.07%

Computing exact Shapley values ¶

We start by defining the utility using the main model of interest and computing the exact Data Shapley values by definition \(\ref{eq:shapley-def}\) .

We require a SupervisedScorer (accuracy), and a ModelUtility (not to be confused with DUL's model for the utility, which is UtilityModel!).

from pydvl.valuation import ModelUtility, SupervisedScorer

scorer = SupervisedScorer("accuracy", test_data=test, default=0, range=(0, 1))
utility = ModelUtility(model=model, scorer=scorer, show_warnings=False)

Then we must pick the sampling scheme. In order to compute exact values we must use either DeterministicUniformSampler, which yields all possible subsets of the data, or DeterministicPermutationSampler, which yields all possible permutations of the data, but the latter is prohibitively expensive.

Finally, we must pick a stopping criterion. Since the sampler is finite, we use NoStopping to run until completion and we pass it the sampler to keep track of progress.

from pydvl.valuation import (
    DeterministicUniformSampler,
    NoStopping,
    ShapleyValuation,
)

sampler = DeterministicUniformSampler(batch_size=32)
stopping = NoStopping(sampler)

valuation = ShapleyValuation(
    utility=utility, sampler=sampler, is_done=stopping, progress=True
)

With everything set up, we fit the valuation in parallel and obtain the exact Data Shapley values:

from joblib import parallel_config

from pydvl.utils.functional import timed

timed_fit = timed(valuation.fit)
with parallel_config(n_jobs=n_jobs):
    timed_fit(train)
computation_times["exact"] = timed_fit.execution_time

result = valuation.values()
df = result.to_dataframe(column="exact")["exact"]  # We only need the values

Learning the utility with a simple neural network ¶

We now estimate the Data Shapley values with DataUtilityLearning. This class learns a model of the Utility by wrapping it and delegating calls to it, up to a given budget. Every call yields a utility sample, which is stored under the hood for training the given utility model. Once the budget is exhausted, DataUtilityLearning fits the model to the collected utility samples, and all subsequent calls are answered by the learned model instead of the wrapped utility. Because each evaluation of the original utility is assumed to take a long time, this results in a speedup.
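Schematically, the wrapping works as in the following sketch (simplified; not pyDVL's actual implementation):

class BudgetedSurrogateUtility:  # schematic only
    def __init__(self, utility, model, budget: int):
        self.utility, self.model, self.budget = utility, model, budget
        self.samples = []  # collected (subset, true utility) pairs

    def __call__(self, S):
        if len(self.samples) < self.budget:
            u_S = self.utility(S)  # expensive: trains the wrapped model
            self.samples.append((S, u_S))
            if len(self.samples) == self.budget:
                self.model.fit(self.samples)  # one-off fit of the surrogate
            return u_S
        return self.model.predict(S)  # cheap surrogate prediction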

Note how we use a Monte Carlo approximation instead of sampling all subsets as above. This is not only for speed: DeterministicUniformSampler yields subsets in a fixed order, from the smallest size to the largest. Because the training budget for the utility model is around a quarter of the total number of subsets, we would never see utility samples for the larger set sizes and the model would be biased (try it!).

Below are the parameters for the model used to learn the utility. We follow the paper and use a fully connected neural network whose inputs are indicator vectors of the sets. For a set \(S = \{i_1, ..., i_m\}\), the encoding is a binary vector \(x\) such that \(x_i = 1\) if \(i \in S\) and \(0\) otherwise. The process of encoding the data and fitting the neural network is encapsulated in an IndicatorUtilityModel. Other choices inherit from the abstract base class UtilityModel; we will see an example later.

mlp_params = dict(
    hidden_layer_sizes=(20, 10),
    activation="relu",
    solver="adam",
    learning_rate_init=0.001,
    batch_size=batch_size,
    max_iter=800,
    shuffle=False,
    random_state=random_state,
)

For the training we use an increasing training budget of utility samples, spaced on a log scale from 100 to 4000. We repeat each training procedure 10 times in order to compute confidence intervals. Each experiment is encapsulated in the function run_once, which takes a run identifier and a DUL budget, performs the fitting, and returns all information as a tuple.

from joblib import Parallel, delayed
from sklearn.neural_network import MLPRegressor

from pydvl.utils.functional import suppress_warnings, timed
from pydvl.valuation.result import ValuationResult
from pydvl.valuation.samplers import PermutationSampler
from pydvl.valuation.samplers.truncation import RelativeTruncation
from pydvl.valuation.stopping import MaxUpdates
from pydvl.valuation.utility import DataUtilityLearning
from pydvl.valuation.utility.learning import IndicatorUtilityModel


@suppress_warnings(categories=(RuntimeWarning,))
def run_once(run: int, budget: int) -> tuple[int, int, ValuationResult, float]:
    # DUL will kick in after `budget` calls to utility
    utility_model = IndicatorUtilityModel(MLPRegressor(**mlp_params), n_data=len(train))
    dul_utility = DataUtilityLearning(
        utility=utility,
        training_budget=budget,
        model=utility_model,
        show_warnings=False,
    )

    truncation = RelativeTruncation(rtol=0.001)
    sampler = PermutationSampler(truncation=truncation)
    stopping = MaxUpdates(300)
    valuation = ShapleyValuation(dul_utility, sampler, is_done=stopping, progress=False)

    # Note that DUL does not support parallel fitting (yet?)
    timed_fit = timed(valuation.fit)
    timed_fit(train)

    return run, budget, valuation.values(), timed_fit.execution_time
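
Note that training_budgets and n_runs come from the omitted boilerplate. A possible reconstruction (the exact number of budgets is an assumption):

import numpy as np

# Assumed setup: log-spaced budgets between 100 and 4000, ten runs per budget
training_budgets = np.geomspace(100, 4000, num=10).astype(int)
n_runs = 10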
from itertools import product

import pandas as pd
from tqdm.auto import tqdm

with parallel_config(n_jobs=n_jobs):
    worker = delayed(run_once)

    with Parallel(return_as="generator_unordered") as parallel:
        jobs = parallel(
            worker(run, budget)
            for run, budget in product(range(n_runs), training_budgets)
        )

        pbar = tqdm(jobs, total=n_runs * len(training_budgets))
        for run, budget, result, time in pbar:
            computation_times[budget].append(time)
            dul_df = result.to_dataframe(column=f"{budget}_{run}")[f"{budget}_{run}"]
            df = pd.concat([df, dul_df], axis=1)

Notice how we have changed the stopping criterion to MaxUpdates, which stops the valuation after a given number of updates, to ensure that it does not run indefinitely. One could also combine it with HistoryDeviation, which stops the valuation when the relative deviation of the values is below a certain threshold for a given number of steps; this is the criterion used in the paper introducing Truncated Monte Carlo Shapley.

Next we compute the \(l_2\) error for the different training budgets across all runs and plot mean and standard deviation. We obtain results analogous to Figure 1 of the paper, verifying that the method indeed works for estimating the Data Shapley values (at least in this context).
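The plotting code is omitted, but the error computation is essentially the following sketch, using the estimate columns f"{budget}_{run}" constructed above:

import numpy as np

l2_errors = {
    budget: [
        np.linalg.norm(df[f"{budget}_{run}"] - df["exact"])
        for run in range(n_runs)
    ]
    for budget in training_budgets
}
mean_error = {b: np.mean(e) for b, e in l2_errors.items()}
std_error = {b: np.std(e) for b, e in l2_errors.items()}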

In the plot we also display the mean and standard deviation of the computation time taken for each training budget.

# Pad the per-budget timing lists to a common length for plotting
# ("exact" holds a single float; the budget entries are lists, one time per run)
max_len = max(len(v) for v in computation_times.values() if isinstance(v, list))
for k, v in computation_times.items():
    if isinstance(v, list) and len(v) < max_len:
        computation_times[k] = v + [v[-1]] * (max_len - len(v))
[Figure: mean and standard deviation of the \(l_2\) error and of the computation time, for each training budget]

Let us next look at how well the ranking of values resulting from using the surrogate \(\hat{u}\) matches the ranking by the exact values. For this we fix \(k=3\) and consider the \(k\) samples with the highest value according to \(\hat{u}\) and \(u\):

[Figure: the \(k=3\) highest-valued samples according to the estimates and to the exact values]
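Such a comparison can be computed along the following lines (same column-naming assumption as before):

k = 3
top_exact = set(df["exact"].nlargest(k).index)

# Average fraction of the exact top-k recovered by each training budget
overlap = {
    budget: np.mean([
        len(set(df[f"{budget}_{run}"].nlargest(k).index) & top_exact) / k
        for run in range(n_runs)
    ])
    for budget in training_budgets
}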

Finally, for each sample, we look at the distance of the estimates to the exact value across runs. Boxes are centered at the 50th percentile, with whiskers at the 25th and 75th. We plot relative distances, as a percentage.

[Figure: box plots of the relative distance of the estimates to the exact value, per sample]
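The relative distances underlying the box plots can be obtained roughly as follows (a sketch; it assumes no exact value is zero):

# Percent relative distance of one estimate column to the exact values
rel_dist = 100 * (df[f"{budget}_{run}"] - df["exact"]).abs() / df["exact"].abs()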

Evaluation on anomalous data ¶

One interesting way to assess the Data Utility Learning approach is to corrupt some data and monitor how the value changes. To do this, we will take the sample with the highest score and change its label.

highest_value_index = df.index[df["exact"].argmax()]
y_corrupted = train.data().y.copy()
# Flip the label to the next class (Iris has three classes)
y_corrupted[highest_value_index] = (y_corrupted[highest_value_index] + 1) % 3

corrupted_dataset = Dataset(
    train.data().x.copy(),
    y_corrupted,
    feature_names=train.feature_names,
    target_names=train.target_names,
)
The corrupted sample has index 0

We retrain the model on the new dataset and verify that the accuracy decreases:

model = LinearSVC()
model.fit(*corrupted_dataset.data());
Mean accuracy over test set, with corrupted training data: 71.85%

Now we recompute the values of all samples using the exact method:

sampler = DeterministicUniformSampler(batch_size=32)
stopping = NoStopping(sampler)
valuation = ShapleyValuation(
    utility=utility, sampler=sampler, is_done=stopping, progress=True
)

with parallel_config(n_jobs=n_jobs):
    valuation.fit(corrupted_dataset)
result = valuation.values()

df_corrupted = result.to_dataframe(column="exact")["exact"]

And finally we use DUL with the best training budget previously obtained, and plot the resulting scores.

Best training budget was: 2361

scorer = SupervisedScorer("accuracy", test_data=test, default=0, range=(0, 1))
utility = ModelUtility(model=model, scorer=scorer, show_warnings=False)
utility_model = IndicatorUtilityModel(MLPRegressor(**mlp_params), n_data=len(train))
dul_utility = DataUtilityLearning(
    utility=utility,
    training_budget=best_budget,
    model=utility_model,
    show_warnings=False,
)

truncation = RelativeTruncation(rtol=0.001)
sampler = PermutationSampler(truncation=truncation)
stopping = MaxUpdates(300)
valuation = ShapleyValuation(dul_utility, sampler, is_done=stopping, progress=False)
valuation.fit(corrupted_dataset)
result = valuation.values()

dul_df = result.to_dataframe(column="estimated")["estimated"]
df_corrupted = pd.concat([df_corrupted, dul_df], axis=1)

We can see in the figure that both methods assign the lowest value to the sample with the corrupted label.

[Figure: exact and DUL-estimated values on the corrupted dataset]
As mentioned at the beginning, despite these results, work on Data Utility Learning is preliminary and its usefulness remains to be tested in practice.