Least Core for Data Valuation

This notebook introduces Least Core methods for the computation of data values using pyDVL.

Shapley values define a fair way of distributing the worth of the whole training set when every data point is part of it. But they do not consider the stability of subsets: could some data points obtain a higher payoff if they formed smaller subsets? It is argued that this might be relevant if data providers are paid based on data value, since Shapley values can incentivise them not to contribute their data to the "grand coalition", but instead to try to form smaller ones. Whether this is of actual practical relevance is debatable, but in any case the least core is an alternative tool available for any Data Valuation task.

The Core is another approach to computing data values. It originates in cooperative game theory and attempts to answer those questions. It is the set of feasible payoffs that cannot be improved upon by any coalition of the participants.
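
For reference, the standard game-theoretic definition can be written compactly. Using the notation of this notebook (dataset \(D\), utility \(u\), and payoff \(v_u(x_i)\) for sample \(x_i\)), the core is the set

\[ \mathcal{C}(u) = \left\{ v_u \ : \ \sum_{x_i \in D} v_u(x_i) = u(D), \quad \sum_{x_i \in S} v_u(x_i) \geq u(S) \ \ \forall S \subseteq D \right\} \]

The core can be empty, which is why the least core relaxes the coalition constraints by a slack term.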

Its use for Data Valuation was first described in the paper If You Like Shapley Then You’ll Love the Core by Tom Yan and Ariel D. Procaccia.

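The Least Core value \(v_u(x_i)\) of the \(i\)-th sample \(x_i\) in dataset \(D\) with respect to utility \(u\) is computed by solving the following linear program, where \(e\) is the smallest slack for which all coalition constraints can be satisfied: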

\[ \begin{array}{lll} \text{minimize} & \displaystyle{e} & \\ \text{subject to} & \displaystyle\sum_{x_i\in D} v_u(x_i) = u(D) & \\ & \displaystyle\sum_{x_i\in S} v_u(x_i) + e \geq u(S) &, \forall S \subset D, S \neq \emptyset \\ \end{array} \]
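
To make the linear program concrete, here is a minimal sketch that solves it by brute-force enumeration for a toy 3-player majority game, using `scipy.optimize.linprog`. It is an illustration under stated assumptions, not pyDVL's implementation: the helper `least_core` and the utility `majority` are hypothetical names introduced only for this example.

```python
from itertools import combinations

import numpy as np
from scipy.optimize import linprog


def least_core(n, utility):
    """Solve the least core LP for an n-player game by enumerating all subsets.

    The LP variables are (v_0, ..., v_{n-1}, e) and the objective minimizes e.
    `utility` maps a frozenset of player indices to a float.
    """
    players = range(n)
    c = np.zeros(n + 1)
    c[-1] = 1.0  # objective: minimize e

    # Efficiency constraint: sum_i v_i = u(D)
    a_eq = np.ones((1, n + 1))
    a_eq[0, -1] = 0.0
    b_eq = [utility(frozenset(players))]

    # Coalition constraints, rewritten in linprog's "<=" form:
    #   -sum_{i in S} v_i - e <= -u(S) for every non-empty proper subset S
    a_ub, b_ub = [], []
    for size in range(1, n):
        for subset in combinations(players, size):
            row = np.zeros(n + 1)
            row[list(subset)] = -1.0
            row[-1] = -1.0
            a_ub.append(row)
            b_ub.append(-utility(frozenset(subset)))

    result = linprog(
        c,
        A_ub=np.array(a_ub),
        b_ub=b_ub,
        A_eq=a_eq,
        b_eq=b_eq,
        bounds=[(None, None)] * (n + 1),  # values and e may be negative
    )
    return result.x[:n], result.x[-1]


def majority(subset):
    """Toy utility (an assumption for this example): any 2 of 3 players win."""
    return 1.0 if len(subset) >= 2 else 0.0


values, e = least_core(3, majority)
print(values, e)
```

In this game any two players already achieve the full utility, so the core is empty. By symmetry, the solution assigns 1/3 to each player and needs a slack of \(e = 1/3\). The enumeration has \(2^n - 2\) coalition constraints, which is why the exact method is only practical for very small datasets.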

To illustrate this method we will use a synthetic dataset. We will first use a subset of 10 data points to compute the exact values and use them to assess the Monte Carlo approximation. Afterwards, we will conduct the data removal experiments described by Ghorbani and Zou in their paper Data Shapley: Equitable Valuation of Data for Machine Learning. We compute the data values under different computation budgets, incrementally remove a percentage of the best or worst data points, and observe how that affects the utility.

Setup

We begin by importing the main libraries and setting some defaults.

If you are reading this in the documentation, some boilerplate (including most plotting code) has been omitted for convenience.

We will be using the following functions and classes from pyDVL.

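%autoreload
from pydvl.utils import (
    Dataset,
    Utility,
)
from pydvl.value import compute_least_core_values, LeastCoreMode, ValuationResult
from pydvl.reporting.plots import shaded_mean_std
from pydvl.reporting.scores import compute_removal_score

# Third-party imports needed by the code cells that follow
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from tqdm.auto import tqdm, trange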

Dataset

We generate a synthetic dataset using the make_classification function from scikit-learn.

We sample 200 data points with 50 features, of which 25 are informative and 25 are non-informative. With the default settings of make_classification, 2 of the non-informative features are redundant (random linear combinations of the informative features) and the remaining 23 are random noise.

The 200 samples are uniformly distributed across 3 classes with a small percentage of noise added to the labels to make the task a bit more difficult.

X, y = make_classification(
    n_samples=dataset_size,
    n_features=50,
    n_informative=25,
    n_classes=3,
    random_state=random_state,
)
full_dataset = Dataset.from_arrays(
    X, y, stratify_by_target=True, random_state=random_state
)
small_dataset = Dataset.from_arrays(
    X,
    y,
    stratify_by_target=True,
    train_size=train_size,
    random_state=random_state,
)
model = LogisticRegression(max_iter=500, solver="liblinear")
model.fit(full_dataset.x_train, full_dataset.y_train)
print(
    f"Training accuracy: {100 * model.score(full_dataset.x_train, full_dataset.y_train):0.2f}%"
)
print(
    f"Testing accuracy: {100 * model.score(full_dataset.x_test, full_dataset.y_test):0.2f}%"
)
Training accuracy: 86.25%
Testing accuracy: 70.00%

model.fit(small_dataset.x_train, small_dataset.y_train)
print(
    f"Training accuracy: {100 * model.score(small_dataset.x_train, small_dataset.y_train):0.2f}%"
)
print(
    f"Testing accuracy: {100 * model.score(small_dataset.x_test, small_dataset.y_test):0.2f}%"
)
Training accuracy: 100.00%
Testing accuracy: 47.89%

Estimating Least Core Values

In this first section we will use a smaller subset of the dataset containing 10 samples in order to be able to compute exact values in a reasonable amount of time. Afterwards, we will use the Monte Carlo method with a limited budget (maximum number of subsets) to approximate these values.

utility = Utility(model=model, data=small_dataset)
exact_values = compute_least_core_values(
    u=utility,
    mode=LeastCoreMode.Exact,
    progress=True,
)
exact_values_df = exact_values.to_dataframe(column="exact_value").T
exact_values_df = exact_values_df[sorted(exact_values_df.columns)]
budget_array = np.linspace(200, 2 ** len(small_dataset), num=10, dtype=int)

all_estimated_values_df = []
all_errors = {budget: [] for budget in budget_array}

for budget in tqdm(budget_array):
    dfs = []
    column_name = f"estimated_value_{budget}"
    # Repeat the Monte Carlo estimation 20 times per budget to gauge its variability
    for _ in range(20):
        values = compute_least_core_values(
            u=utility,
            mode=LeastCoreMode.MonteCarlo,
            n_iterations=budget,
            n_jobs=n_jobs,
        )
        df = (
            values.to_dataframe(column=column_name)
            .drop(columns=[f"{column_name}_stderr", f"{column_name}_updates"])
            .T
        )
        df = df[sorted(df.columns)]
        error = mean_squared_error(
            exact_values_df.loc["exact_value"].values, np.nan_to_num(df.values.ravel())
        )
        all_errors[budget].append(error)
        df["budget"] = budget
        dfs.append(df)
    estimated_values_df = pd.concat(dfs)
    all_estimated_values_df.append(estimated_values_df)

values_df = pd.concat(all_estimated_values_df)
errors_df = pd.DataFrame(all_errors)

We can see that the approximation error decreases, on average, as we increase the budget.

Still, the error is not guaranteed to decrease every time we increase the number of iterations, because the Monte Carlo method samples subsets with replacement, i.e., some subsets may be repeated.
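
The effect of sampling with replacement can be illustrated with a small, self-contained sketch. It is not pyDVL's sampler: the helper `sample_subsets` is a hypothetical function that draws uniformly random subsets of 10 indices, after which we count how many distinct subsets were obtained.

```python
import random


def sample_subsets(n, budget, seed=0):
    """Draw `budget` subsets of range(n) uniformly at random, with replacement."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(budget):
        # Including each index independently with probability 1/2
        # is equivalent to picking one of the 2^n subsets uniformly.
        subset = frozenset(i for i in range(n) if rng.random() < 0.5)
        subsets.append(subset)
    return subsets


samples = sample_subsets(n=10, budget=2**10)
n_unique = len(set(samples))
print(f"{len(samples)} samples, {n_unique} distinct subsets")
```

When the budget equals the total number of subsets, only about 63% of the draws are expected to be distinct (a fraction of roughly \(1 - 1/e\)), so a larger budget does not translate one-to-one into new constraints.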

Data Removal

We now move on to the data removal experiments using the full dataset.

In these experiments, we first rank the data points from most valuable to least valuable using the values estimated by the Monte Carlo Least Core method. Then, we gradually remove from 0 to 40 percent of the most valuable (or least valuable) points, in increments of 5 percentage points, retrain the model on the remaining data and compute its accuracy.
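
The procedure can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation of pyDVL's compute_removal_score, whose details (e.g. rounding) may differ: `removal_scores` is a hypothetical helper, and `score_fn` stands in for retraining the model on the kept indices and returning its accuracy.

```python
def removal_scores(values, score_fn, percentages, remove_best=False):
    """Score the data that remains after removing a fraction of ranked points.

    values: one value per data point.
    score_fn: maps a list of kept indices to a score (stand-in for retraining).
    percentages: fractions of the data to remove, e.g. [0.0, 0.05, 0.10].
    """
    # Rank indices from least to most valuable; reverse to remove the best first
    order = sorted(range(len(values)), key=lambda i: values[i])
    if remove_best:
        order = order[::-1]

    scores = {}
    for p in percentages:
        n_removed = int(p * len(order))
        kept = order[n_removed:]
        scores[p] = score_fn(kept)
    return scores


# Toy data: the "score" of a subset is simply the sum of its values
values = [3.0, 1.0, 2.0, 0.0]


def toy_score(kept):
    return sum(values[i] for i in kept)


best_removed = removal_scores(values, toy_score, [0.0, 0.5], remove_best=True)
worst_removed = removal_scores(values, toy_score, [0.0, 0.5], remove_best=False)
print(best_removed)
print(worst_removed)
```

In this toy setting, removing the two best points drops the score from 6.0 to 1.0, while removing the two worst only drops it to 5.0. With a real model, removing the worst points can even increase the score, which is what the experiments below examine.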

utility = Utility(model=model, data=full_dataset)
method_names = ["Random", "Least Core"]
removal_percentages = np.arange(0, 0.41, 0.05)

Remove Best

We start by removing the best data points and seeing how the model's accuracy evolves.

all_scores = []

for i in trange(5):
    for method_name in method_names:
        if method_name == "Random":
            values = ValuationResult.from_random(size=len(utility.data))
        else:
            values = compute_least_core_values(
                u=utility,
                mode=LeastCoreMode.MonteCarlo,
                n_iterations=n_iterations,
                n_jobs=n_jobs,
            )
        scores = compute_removal_score(
            u=utility,
            values=values,
            percentages=removal_percentages,
            remove_best=True,
        )
        scores["method_name"] = method_name
        all_scores.append(scores)

scores_df = pd.DataFrame(all_scores)

We can see that removing the most valuable data points, as ranked by the Least Core method, decreases the model's performance on average, and that it does so more than removing data points at random.

Remove Worst

We then proceed to removing the worst data points and seeing how the model's accuracy evolves.

all_scores = []

for i in trange(5):
    for method_name in method_names:
        if method_name == "Random":
            values = ValuationResult.from_random(size=len(utility.data))
        else:
            values = compute_least_core_values(
                u=utility,
                mode=LeastCoreMode.MonteCarlo,
                n_iterations=n_iterations,
                n_jobs=n_jobs,
            )
        scores = compute_removal_score(
            u=utility,
            values=values,
            percentages=removal_percentages,
            remove_best=False,  # remove the least valuable points first
        )
        scores["method_name"] = method_name
        all_scores.append(scores)

scores_df = pd.DataFrame(all_scores)

We can see that removing the least valuable data points, as ranked by the Least Core method, increases the model's performance on average, and that the method outperforms the random removal of data points.