
Least Core for Data Valuation

This notebook introduces Least Core methods for the computation of data values using pyDVL.

Shapley values define a fair way of distributing the worth of the whole training set when every data point is part of it. But they do not consider the question of stability of subsets: could some data points obtain a higher payoff if they formed smaller coalitions? This might be relevant if data providers are paid based on data value, since Shapley values can incentivise them not to contribute their data to the "grand coalition" but instead to form smaller ones. Whether this is of actual practical relevance is debatable, but in any case the Least Core is an alternative tool available for any Data Valuation task.

The Core is another solution concept from cooperative game theory that can be used to compute data values and that addresses this question of stability. It is the set of feasible payoff allocations that cannot be improved upon by any coalition of participants.
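
Formally, given a utility \(u\) and a dataset \(D\), the core is the set of allocations \(v_u\) that distribute the full utility of the grand coalition while leaving no coalition with an incentive to defect:

\[ \sum_{x_i\in D} v_u(x_i) = u(D), \qquad \sum_{x_i\in S} v_u(x_i) \geq u(S), \quad \forall S \subset D. \]

The core can be empty, so one relaxes the coalition constraints with the smallest possible subsidy \(e\). This relaxation is the Least Core.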

Its use for Data Valuation was first described in the paper If You Like Shapley Then You’ll Love the Core by Tom Yan and Ariel D. Procaccia.

The Least Core values \(v_u(x_i)\) of the samples \(x_i\) in a dataset \(D\), with respect to a utility \(u\), are computed by solving the following linear program:

\[ \begin{array}{lll} \text{minimize} & \displaystyle{e} & \\ \text{subject to} & \displaystyle\sum_{x_i\in D} v_u(x_i) = u(D) & \\ & \displaystyle\sum_{x_i\in S} v_u(x_i) + e \geq u(S) &, \forall S \subset D, S \neq \emptyset \\ \end{array} \]
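
To make the program concrete, here is a minimal, standalone sketch that solves it for a toy three-point "game" with made-up utility values, using scipy.optimize.linprog. It only illustrates the constraints above and is not how pyDVL solves the problem internally.

import itertools

import numpy as np
from scipy.optimize import linprog

# Hypothetical utilities u(S) for every subset S of a three-point dataset.
u = {
    (): 0.0, (0,): 0.2, (1,): 0.3, (2,): 0.1,
    (0, 1): 0.6, (0, 2): 0.4, (1, 2): 0.5, (0, 1, 2): 0.9,
}
n = 3

# Variables are [v_0, v_1, v_2, e]; the objective is to minimize e.
c = np.array([0.0] * n + [1.0])

# Coalition constraints sum_{i in S} v_i + e >= u(S), rewritten for linprog as
# -sum_{i in S} v_i - e <= -u(S), for every non-empty proper subset S.
A_ub, b_ub = [], []
for size in range(1, n):
    for S in itertools.combinations(range(n), size):
        row = np.zeros(n + 1)
        row[list(S)] = -1.0
        row[-1] = -1.0
        A_ub.append(row)
        b_ub.append(-u[S])

# Efficiency constraint: v_0 + v_1 + v_2 = u(D).
A_eq = [[1.0] * n + [0.0]]
b_eq = [u[(0, 1, 2)]]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(None, None)] * (n + 1))
print("Least Core values:", res.x[:n], "subsidy e:", res.x[n])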

To illustrate this method we will use a synthetic dataset. We will first use a small subset of the data to compute the exact values and use them to assess the Monte Carlo approximation. Afterwards, we will conduct the data removal experiments described by Ghorbani and Zou in their paper Data Shapley: Equitable Valuation of Data for Machine Learning: we compute the data values under different computation budgets, incrementally remove a percentage of the best (respectively worst) data points, and observe how this affects the utility.

Setup

We begin by importing the main libraries and setting some defaults.

If you are reading this in the documentation, some boilerplate (including most plotting code) has been omitted for convenience.

We will be using the following functions and classes from pyDVL.

from pydvl.reporting.plots import shaded_mean_std
from pydvl.reporting.scores import compute_removal_score
from pydvl.valuation import (
    Dataset,
    ExactLeastCoreValuation,
    ModelUtility,
    MonteCarloLeastCoreValuation,
    SupervisedScorer,
)

Dataset

We generate a synthetic dataset using the make_classification function from scikit-learn.

We sample 200 data points from a 50-dimensional Gaussian distribution with 25 informative and 25 non-informative features.

The 200 samples are uniformly distributed across 3 classes with a small percentage of noise added to the labels to make the task a bit more difficult.

# dataset_size and seed are defined in the setup boilerplate omitted above
X, y = make_classification(
    n_samples=dataset_size,
    n_features=50,
    n_informative=25,
    n_classes=3,
    random_state=seed,
)

train, test = Dataset.from_arrays(X, y, stratify_by_target=True, random_state=seed)
model = LogisticRegression(max_iter=500, solver="liblinear")
model.fit(train.data().x, train.data().y);
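
As a quick sanity check, which is not part of the original notebook, we can look at the baseline accuracy of the fitted model on the held-out test set, using the same Dataset accessors as above.

# Baseline accuracy of the full model on the held-out test set.
baseline_accuracy = model.score(test.data().x, test.data().y)
print(f"Baseline test accuracy: {baseline_accuracy:.3f}")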

Estimating Least Core Values

In this first section we use a smaller subset of the dataset containing 12 samples in order to be able to compute exact values reasonably fast. Recall that, in order to assemble the problem, we must compute the utility \(u(S)\) for every subset \(S \subset D\), leaving us with a linear program with \(2^{|D|}\) constraints to solve; for 12 samples that is \(2^{12} = 4096\) subsets. After doing this, we use the Monte Carlo method with a limited budget (maximum number of constraints) to approximate the Least Core values on the same reduced dataset, and we repeat this several times to assess the stability of the approximation.

train_mini = train[:train_mini_size]
scorer = SupervisedScorer("accuracy", test_data=test, default=0, range=(0, 1))
utility = ModelUtility(model=model, scorer=scorer)
valuation = ExactLeastCoreValuation(utility=utility)
from joblib import parallel_config

with parallel_config(n_jobs=n_jobs):
    valuation.fit(train_mini)
exact_result = valuation.values()
exact_df = exact_result.to_dataframe(column="exact").T
exact_df = exact_df[sorted(exact_df.columns)]
def relative_error(x, estimate, norm):
    return np.linalg.norm(x - estimate, ord=norm) / np.linalg.norm(x, ord=norm)


# Constraint budgets: 8 values spaced geometrically between 2^8 and 2^len(train_mini)
budget_array = np.logspace(8, len(train_mini), base=2, num=8, endpoint=False, dtype=int)

all_estimated_values_df = []
all_errors = {budget: [] for budget in budget_array}

with tqdm(total=len(budget_array) * n_runs) as pbar:
    for budget in budget_array:
        pbar.set_description(f"Computing values with {budget} constraints")
        dfs = []
        errors = []
        column_name = f"estimated_{budget}"
        valuation = MonteCarloLeastCoreValuation(
            utility=utility, n_samples=budget, progress=False
        )
        for i in range(n_runs):
            with parallel_config(n_jobs=n_jobs):
                valuation.fit(train_mini)
            df = (
                valuation.values()
                .to_dataframe(column=column_name)
                .drop(columns=[f"{column_name}_variances", f"{column_name}_counts"])
                .T
            )
            df = df[sorted(df.columns)]
            error = relative_error(
                exact_df.loc["exact"].values, np.nan_to_num(df.values.ravel()), norm=1
            )
            all_errors[budget].append(error)
            df["budget"] = budget
            dfs.append(df)
            pbar.update(1)
        estimated_values_df = pd.concat(dfs)
        all_estimated_values_df.append(estimated_values_df)

values_df = pd.concat(all_estimated_values_df)
errors_df = pd.DataFrame(all_errors)
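
The plotting cell is omitted here. The following is only a minimal matplotlib sketch of the kind of figure it produces, namely the mean relative error per budget with a one-standard-deviation band; it is not the notebook's original plotting code.

import matplotlib.pyplot as plt

mean_err = errors_df.mean(axis=0)
std_err = errors_df.std(axis=0)

fig, ax = plt.subplots()
ax.plot(errors_df.columns, mean_err, marker="o")
ax.fill_between(errors_df.columns, mean_err - std_err, mean_err + std_err, alpha=0.3)
ax.set_xscale("log", base=2)
ax.set_xlabel("Number of constraints (budget)")
ax.set_ylabel("Relative error w.r.t. exact values")
plt.show()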

We can see that the approximation error decreases as the number of constraints increases, but that there are diminishing returns for increasing the budget beyond a certain point.

Data Removal

In the final two experiments, we rank the training set according to the value estimates obtained with Monte Carlo Least Core. Then, we incrementally remove up to 40% of the most / least valuable training points, train the model on this subset and compute its accuracy on the previously held-out test set.
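
Conceptually, and leaving aside pyDVL's compute_removal_score, a single removal curve amounts to the following sketch. The names (values, X_train, y_train, X_test, y_test) are hypothetical placeholders and not objects from this notebook.

import numpy as np
from sklearn.base import clone


def removal_curve(model, values, X_train, y_train, X_test, y_test, percentages):
    """Remove the highest-valued fraction of points, retrain, and score."""
    order = np.argsort(values)  # ascending: the most valuable points come last
    scores = {}
    for p in percentages:
        n_removed = int(p * len(values))
        keep = order[: len(values) - n_removed]
        m = clone(model).fit(X_train[keep], y_train[keep])
        scores[p] = m.score(X_test, y_test)
    return scores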

Remove Best

We start by removing the best data points and seeing how the model's accuracy evolves. We repeat the whole process (valuation and removal) several times to assess the stability of the results.

from pydvl.valuation.methods.random import RandomValuation

removal_percentages = np.arange(0, 0.41, 0.05)
methods = [
    RandomValuation(random_state=seed),
    MonteCarloLeastCoreValuation(
        utility=utility, n_samples=n_iterations, progress=False, seed=seed
    ),
]
all_scores = []
for i in trange(n_runs, position=0, desc=f"Removing best points, {n_runs} times"):
    for method in methods:
        with parallel_config(n_jobs=n_jobs):
            method.fit(train)
        result = method.values()

        scores = compute_removal_score(
            utility,
            result,
            train,
            removal_percentages,
            remove_best=True,
            progress=False,
        )

        scores["method_name"] = method.__class__.__name__
        all_scores.append(scores)

scores_df = pd.DataFrame(all_scores)
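
Again the plotting cell is omitted. A minimal matplotlib sketch of the resulting figure, showing the mean accuracy per removal percentage with one curve per method, might look as follows; it is not the notebook's original plotting code.

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for name, group in scores_df.groupby("method_name"):
    means = group.drop(columns=["method_name"]).mean(axis=0)
    ax.plot(means.index, means.values, marker="o", label=name)
ax.set_xlabel("Fraction of best points removed")
ax.set_ylabel("Test accuracy")
ax.legend()
plt.show()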

We can clearly see that removing the most valuable data points, as given by the Least Core method, leads to, on average, a decrease in the model's performance and that the method outperforms random removal of data points.

Remove Worst

We then proceed to removing the worst data points and seeing how the model's accuracy evolves.

all_scores = []
for i in trange(n_runs, position=0, desc=f"Removing worst points, {n_runs} times"):
    for method in methods:
        with parallel_config(n_jobs=n_jobs):
            method.fit(train)
        result = method.values()

        scores = compute_removal_score(
            utility,
            result,
            train,
            removal_percentages,
            remove_best=False,
            progress=False,
        )

        scores["method_name"] = method.__class__.__name__
        all_scores.append(scores)

scores_df = pd.DataFrame(all_scores)

We can clearly see that removing the least valuable data points, as given by the Least Core method, leads to, on average, an increase in the model's performance and that the method outperforms the random removal of data points.