Least Core for Data Valuation¶
This notebook introduces Least Core methods for the computation of data values using pyDVL.
Shapley values define a fair way of distributing the worth of the whole training set when every data point is part of it. But they do not consider the question of stability of subsets: Could some data points obtain a higher payoff if they formed smaller subsets? It is argued that this might be relevant if data providers are paid based on data value, since Shapley values can incentivise them not to contribute their data to the "grand coalition", but instead try to form smaller ones. Whether this is of actual practical relevance is debatable, but in any case, the least core is an alternative tool available for any task of Data Valuation
The Core is another approach to compute data values originating in cooperative game theory that attempts to answer those questions. It is the set of feasible payoffs that cannot be improved upon by a coalition of the participants.
Its use for Data Valuation was first described in the paper If You Like Shapley Then You’ll Love the Core by Tom Yan and Ariel D. Procaccia.
The Least Core value \(v\) of the \(i\)-th sample in dataset \(D\) wrt. utility \(u\) is computed by solving the following Linear Program:
To illustrate this method we will use a synthetic dataset. We will first use a subset of 10 data point to compute the exact values and use them to assess the Monte Carlo approximation. Afterwards, we will conduct the data removal experiments as described by Ghorbani and Zou in their paper Data Shapley: Equitable Valuation of Data for Machine Learning: We compute the data valuation given different computation budgets and incrementally remove a percentage of the best, respectively worst, data points and observe how that affects the utility.
Setup¶
We begin by importing the main libraries and setting some defaults.
We will be using the following functions and classes from pyDVL.
Dataset¶
We generate a synthetic dataset using the make_classification
function from scikit-learn.
We sample 200 data points from a 50-dimensional Gaussian distribution with 25 informative features and 25 non-informative features (generated as random linear combinations of the informative features).
The 200 samples are uniformly distributed across 3 classes with a small percentage of noise added to the labels to make the task a bit more difficult.
Estimating Least Core Values¶
In this first section we will use a smaller subset of the dataset containing 10 samples in order to be able to compute exact values in a reasonable amount of time. Afterwards, we will use the Monte Carlo method with a limited budget (maximum number of subsets) to approximate these values.
budget_array = np.linspace(200, 2 ** len(small_dataset), num=10, dtype=int)
all_estimated_values_df = []
all_errors = {budget: [] for budget in budget_array}
for budget in tqdm(budget_array):
dfs = []
errors = []
column_name = f"estimated_value_{budget}"
for i in range(20):
values = compute_least_core_values(
u=utility,
mode=LeastCoreMode.MonteCarlo,
n_iterations=budget,
n_jobs=n_jobs,
)
df = (
values.to_dataframe(column=column_name)
.drop(columns=[f"{column_name}_stderr"])
.T
)
df = df[sorted(df.columns)]
error = mean_squared_error(
exact_values_df.loc["exact_value"].values, df.values.ravel()
)
all_errors[budget].append(error)
df["budget"] = budget
dfs.append(df)
estimated_values_df = pd.concat(dfs)
all_estimated_values_df.append(estimated_values_df)
values_df = pd.concat(all_estimated_values_df)
errors_df = pd.DataFrame(all_errors)
We can see that the approximation error decreases, on average, as the we increase the budget.
Still, the decrease may not always necessarily happen when we increase the number of iterations because of the fact that we sample the subsets with replacement in the Monte Carlo method i.e there may be repeated subsets.
Data Removal¶
We now move on to the data removal experiments using the full dataset.
In these experiments, we first rank the data points from most valuable to least valuable using the values estimated by the Monte Carlo Least Core method. Then, we gradually remove from 5 to 40 percent, by increments of 5 percentage points, of the most valuable/least valuable ones, train the model on this subset and compute its accuracy.
Remove Best¶
We start by removing the best data points and seeing how the model's accuracy evolves.
all_scores = []
for i in trange(5):
for method_name in method_names:
if method_name == "Random":
values = ValuationResult.from_random(size=len(utility.data))
else:
values = compute_least_core_values(
u=utility,
mode=LeastCoreMode.MonteCarlo,
n_iterations=n_iterations,
n_jobs=n_jobs,
)
scores = compute_removal_score(
u=utility,
values=values,
percentages=removal_percentages,
remove_best=True,
)
scores["method_name"] = method_name
all_scores.append(scores)
scores_df = pd.DataFrame(all_scores)
We can clearly see that removing the most valuable data points, as given by the Least Core method, leads to, on average, a decrease in the model's performance and that the method outperforms random removal of data points.
Remove Worst¶
We then proceed to removing the worst data points and seeing how the model's accuracy evolves.
all_scores = []
for i in trange(5):
for method_name in method_names:
if method_name == "Random":
values = ValuationResult.from_random(size=len(utility.data))
else:
values = compute_least_core_values(
u=utility,
mode=LeastCoreMode.MonteCarlo,
n_iterations=n_iterations,
n_jobs=n_jobs,
)
scores = compute_removal_score(
u=utility,
values=values,
percentages=removal_percentages,
)
scores["method_name"] = method_name
all_scores.append(scores)
scores_df = pd.DataFrame(all_scores)
We can clearly see that removing the least valuable data points, as given by the Least Core method, leads to, on average, an increase in the model's performance and that the method outperforms the random removal of data points.
Created: 2023-12-21