
Shapley for data valuation ¶

This notebook introduces Shapley methods for the computation of data value using pyDVL.

To illustrate the practical advantages of data valuation, we will predict the popularity of songs in the dataset Top Hits Spotify from 2000-2019, and highlight how data valuation can help investigate and boost the performance of a model. In doing so, we will describe the basic usage patterns of pyDVL.

Recall that data value is a function of three things:

  1. The dataset.
  2. The model.
  3. The performance metric or scoring function.

Below we will describe how to instantiate each one of these objects and how to use them for data valuation. Please also see the documentation on data valuation.
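
Conceptually, these three ingredients define a utility function: the score that the chosen model, trained on a subset of the data, achieves on a held-out set. Data values are then derived from this function. As a purely illustrative sketch (this is not pyDVL code; the function name and signature are made up for the example):

from sklearn.base import clone

def utility(model, scorer, X_train, y_train, X_val, y_val, subset):
    # Illustrative only: train a fresh copy of the model on a subset of the
    # training data (integer indices into numpy arrays) and return its score
    # on the held-out validation set, as computed by `scorer`.
    fitted = clone(model).fit(X_train[subset], y_train[subset])
    return scorer(fitted, X_val, y_val)

pyDVL's Utility class, introduced below, encapsulates exactly this kind of computation.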

Setup ¶

We begin by importing the main libraries and setting some defaults.

If you are reading this in the documentation, some boilerplate (including most plotting code) has been omitted for convenience.

We will be using the following functions from pyDVL. The main entry point is the function compute_shapley_values(), which provides a facade to all Shapley methods. In order to use it we need the classes Dataset, Utility and Scorer.

%autoreload
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

from pydvl.reporting.plots import plot_shapley
from pydvl.utils.dataset import GroupedDataset
from support.shapley import load_spotify_dataset
from pydvl.value import *

random_state = 24  # seed for the train/val/test split below (any fixed value works)

Loading and grouping the dataset ¶

pyDVL provides a support function for this notebook, load_spotify_dataset(), which downloads data on songs published after 2014, splits off 30% of the data for testing, and then 30% of the remaining data for validation. The return value is a triple of training, validation and test data as lists of the form [X_input, Y_label].

training_data, val_data, test_data = load_spotify_dataset(
    val_size=0.3, test_size=0.3, target_column="popularity", random_state=random_state
)
training_data[0].head()
|      | artist         | song                      | duration_ms | explicit | year | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo   | genre |
|------|----------------|---------------------------|-------------|----------|------|--------------|--------|-----|----------|------|-------------|--------------|------------------|----------|---------|---------|-------|
| 1561 | Fetty Wap      | 679 (feat. Remy Boyz)     | 196693      | True     | 2015 | 0.618        | 0.717  | 7   | -5.738   | 1    | 0.3180      | 0.00256      | 0.000000         | 0.6250   | 0.603   | 190.050 | 8     |
| 1410 | Meghan Trainor | All About That Bass       | 187920      | True     | 2015 | 0.807        | 0.887  | 9   | -3.726   | 1    | 0.0503      | 0.05730      | 0.000003         | 0.1240   | 0.961   | 134.052 | 14    |
| 1772 | Katy Perry     | Chained To The Rhythm     | 237733      | False    | 2017 | 0.562        | 0.800  | 0   | -5.404   | 1    | 0.1120      | 0.08140      | 0.000000         | 0.1990   | 0.471   | 95.029  | 14    |
| 1670 | Sigala         | Sweet Lovin' - Radio Edit | 202149      | False    | 2015 | 0.683        | 0.910  | 10  | -1.231   | 1    | 0.0515      | 0.05530      | 0.000005         | 0.3360   | 0.674   | 124.977 | 15    |
| 1780 | Liam Payne     | Strip That Down           | 204502      | False    | 2017 | 0.869        | 0.485  | 6   | -5.595   | 1    | 0.0545      | 0.24600      | 0.000000         | 0.0765   | 0.527   | 106.028 | 14    |

The dataset has many high-level features, some quite intuitive ('duration_ms' or 'tempo'), while others are a bit more cryptic ('valence'?). For information on each feature, please consult the dataset's website.

In our analysis, we will use all the columns, except for 'artist' and 'song', to predict the 'popularity' of each song. We will nonetheless keep the information on song and artist in a separate object for future reference.

song_name = training_data[0]["song"]
artist = training_data[0]["artist"]
training_data[0] = training_data[0].drop(["song", "artist"], axis=1)
test_data[0] = test_data[0].drop(["song", "artist"], axis=1)
val_data[0] = val_data[0].drop(["song", "artist"], axis=1)

Input and label data are then used to instantiate a Dataset object:

dataset = Dataset(*training_data, *val_data)

The calculation of exact Shapley values is computationally very expensive (exponentially so!) because it requires training the model on every possible subset of the training set. For this reason, pyDVL implements techniques to speed up the calculation, such as Monte Carlo approximations, surrogate models and caching of intermediate results, as well as grouping of data to calculate group Shapley values instead of values for single data points.
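
For reference, the exact Shapley value of a training point (or group) \(i\) is its average marginal contribution to the utility \(u\) over all subsets \(S\) of the training set \(D\) that do not contain it:

\[
v_u(i) = \frac{1}{n} \sum_{S \subseteq D \setminus \{i\}} \binom{n-1}{|S|}^{-1} \left[ u(S \cup \{i\}) - u(S) \right],
\]

where \(n\) is the number of points (or groups) in \(D\). The sum ranges over all \(2^{n-1}\) subsets not containing \(i\), each requiring a model to be trained and scored, which is what makes the exact computation intractable beyond toy datasets.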

In our case, we will group songs by artist and calculate the Shapley value for the artists. Given the pandas Series for 'artist', grouping the dataset by artist is done as follows:

grouped_dataset = GroupedDataset.from_dataset(dataset=dataset, data_groups=artist)

Creating the utility and computing values ¶

Now we can calculate the contribution of each group to the model performance.

As a model, we use scikit-learn's GradientBoostingRegressor, but pyDVL can work with any model from sklearn, xgboost or lightgbm. More precisely, any model that implements the protocol pydvl.utils.types.SupervisedModel, which is just the standard sklearn interface of fit(), predict() and score(), can be used to construct the utility.
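
For example, a hand-rolled model is accepted as long as it exposes these three methods. The following toy regressor is a purely illustrative sketch (not part of the notebook, and not a model you would actually want to value data with):

import numpy as np

class MeanRegressor:
    # Toy model implementing the fit/predict/score protocol: it always
    # predicts the mean of the training targets.
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

    def score(self, X, y):
        # Negative mean absolute error, so that higher is better, mirroring
        # the scorer used below.
        return -float(np.mean(np.abs(self.predict(X) - np.asarray(y))))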

The third and final component is the scoring function. It can be anything like accuracy or \(R^2\), and is set with a string from the standard sklearn scoring methods. Please refer to that documentation for information on how to define your own scoring function.
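
Such a string names one of scikit-learn's built-in scorers. As a quick illustration (a sketch, not part of the notebook), one can retrieve and call the underlying scorer directly:

from sklearn.metrics import get_scorer

# The scorer is a callable taking (fitted_estimator, X, y) and returning a
# float where higher is better; "neg_mean_absolute_error" is the negated MAE.
sk_scorer = get_scorer("neg_mean_absolute_error")
# score = sk_scorer(fitted_model, X_val, y_val)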

We group dataset, model and scoring function into an instance of Utility.

utility = Utility(
    model=GradientBoostingRegressor(n_estimators=3),
    data=grouped_dataset,
    scorer=Scorer("neg_mean_absolute_error", default=0.0),
)
values = compute_shapley_values(
    utility,
    mode=ShapleyMode.TruncatedMontecarlo,
    # Stop once the standard error is below 0.2 for at least 90% of the values,
    # or once the number of updates per index exceeds 1000
    done=AbsoluteStandardError(threshold=0.2, fraction=0.9) | MaxUpdates(1000),
    truncation=RelativeTruncation(utility, rtol=0.01),
    n_jobs=-1,
)
values.sort(key="value")
df = values.to_dataframe(column="data_value", use_names=True)
Cancellation of futures is not supported by the joblib backend

The function compute_shapley_values() serves as a common access point to all Shapley methods. For most of them, we must choose a StoppingCriterion with the argument done=. In this case we choose to stop when the standard error of the estimated values is below 0.2 for at least 90% of the groups, or when the number of updates of any index exceeds 1000. The mode argument specifies the Shapley method to use. In this case, we use the Truncated Monte Carlo approximation, which is the fastest of the Monte Carlo methods, owing both to using the permutation definition of Shapley values and to the ability to truncate the iteration over a given permutation. With the parameter truncation= and the policy RelativeTruncation, we configure the truncation to happen once the contribution of the remaining elements falls below 1% of the total utility.
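
To make the method more concrete, here is a rough, self-contained sketch of a single truncated permutation pass (this is not pyDVL's implementation). Here u stands for the utility: a callable returning the validation score of the model trained on a given subset of the groups. Averaging the marginal contributions collected over many such passes yields the Monte Carlo estimates of the values.

import numpy as np

def one_permutation_pass(u, indices, rtol=0.01, seed=None):
    # Sketch of one pass of truncated permutation sampling.
    rng = np.random.default_rng(seed)
    total_utility = u(frozenset(indices))
    marginals = {}
    subset = set()
    previous = u(frozenset())  # empty set: pyDVL returns the scorer's default here
    for i in rng.permutation(indices):
        subset.add(i)
        current = u(frozenset(subset))
        marginals[i] = current - previous  # marginal contribution of i
        previous = current
        # Truncation: once the utility is within rtol of the total utility,
        # the remaining elements contribute little and the pass stops early.
        if abs(total_utility - current) <= rtol * abs(total_utility):
            break
    return marginals  # indices not reached receive no update in this pass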

Let's take a look at the returned dataframe:

df.head()
|               | data_value | data_value_stderr |
|---------------|------------|-------------------|
| Years & Years | -1.150663  | 0.195376          |
| Reik          | -1.123071  | 0.126558          |
| Astrid S      | -0.945702  | 0.331619          |
| Liam Payne    | -0.886687  | 0.112654          |
| DB Boulevard  | -0.847957  | 0.057503          |

The first thing to notice is that we sorted the results in ascending order of Shapley value. The index holds the labels for each data group: in this case, artist names. The column data_value is just that: the Shapley data value, and data_value_stderr is its estimated standard error (since we are using a Monte Carlo approximation).

Let us plot the results. In the next cell we take the 30 artists with the lowest values and plot them with 95% Normal confidence intervals. Keep in mind that Monte Carlo Shapley is typically very noisy, and it can take many steps to arrive at a clean estimate.
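
The notebook uses the plot_shapley helper imported above; since the plotting boilerplate is omitted here, the following is a rough matplotlib equivalent (a sketch, not the original plotting code):

import matplotlib.pyplot as plt
import numpy as np

lowest = df.iloc[:30]  # df is sorted by value in ascending order
ci95 = 1.96 * lowest["data_value_stderr"]  # ~95% Normal confidence intervals
x = np.arange(len(lowest))
fig, ax = plt.subplots(figsize=(8, 5))
ax.errorbar(x, lowest["data_value"], yerr=ci95, fmt="o", capsize=3)
ax.axhline(0, color="grey", linewidth=0.8)
ax.set_xticks(x)
ax.set_xticklabels(lowest.index, rotation=90)
ax.set_ylabel("Shapley value")
fig.tight_layout()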

[Figure: Shapley values of the 30 lowest-valued artists, with 95% confidence intervals.]

We can immediately see that many artists (groups of samples) have very low, even negative value, which means that they tend to decrease the total score of the model when present in the training set! What happens if we remove them?

In the next cell we create a new training set excluding the 20% of artists with the lowest values:

low_dvl_artists = df.iloc[: int(0.2 * len(df))].index.to_list()
artist_filter = ~artist.isin(low_dvl_artists)
X_train_good_dvl = training_data[0][artist_filter]
y_train_good_dvl = training_data[1][artist_filter]

Now we will use this "cleaned" dataset to retrain the same model and compare its mean absolute error to that of the model trained on the full dataset. Note that the error is now computed on the test set, whereas the Shapley values were computed using the validation set.

model_good_data = GradientBoostingRegressor(n_estimators=3).fit(
    X_train_good_dvl, y_train_good_dvl
)
error_good_data = mean_absolute_error(
    model_good_data.predict(test_data[0]), test_data[1]
)

model_all_data = GradientBoostingRegressor(n_estimators=3).fit(
    training_data[0], training_data[1]
)
error_all_data = mean_absolute_error(model_all_data.predict(test_data[0]), test_data[1])

print(f"Improvement: {100*(error_all_data - error_good_data)/error_all_data:02f}%")
Improvement: 15.314214%

The score has improved by about 15%! This is quite a significant result, as it shows a consistent process to improve the performance of a model by excluding data points from its training set.

One must however proceed with caution instead of simply throwing away data. For one, the `mean_absolute_error` measured on the test set is only an estimate of the generalization error on unseen data, so the improvement we see might not be as large upon deployment. It would be advisable to cross-validate this whole process to obtain more conservative estimates. It is also advisable to manually inspect the artists with low value and try to understand why the model behaves the way it does. Finally, remember that **the value depends on the model chosen**! Artists that are detrimental to the Gradient Boosting Regressor might be informative for a different model (although it is likely that the worst ones share some characteristic making them "bad" for other regressors).

Evaluation on anomalous data ¶

One interesting test is to corrupt some data and monitor how its value changes. To do this, we will take one of the artists with the highest value and set the popularity of all their songs to 0.

[Figure: the artists with the highest Shapley values.]

Let us take all the songs by Billie Eilish, set their popularity to 0 and re-calculate the Shapley values.

y_train_anomalous = training_data[1].copy(deep=True)
y_train_anomalous[artist == "Billie Eilish"] = 0
anomalous_dataset = Dataset(
    x_train=training_data[0],
    y_train=y_train_anomalous,
    x_test=val_data[0],
    y_test=val_data[1],
)
grouped_anomalous_dataset = GroupedDataset.from_dataset(anomalous_dataset, artist)
anomalous_utility = Utility(
    model=GradientBoostingRegressor(n_estimators=3),
    data=grouped_anomalous_dataset,
    scorer=Scorer("neg_mean_absolute_error", default=0.0),
)
values = compute_shapley_values(
    anomalous_utility,
    mode=ShapleyMode.TruncatedMontecarlo,
    done=AbsoluteStandardError(threshold=0.2, fraction=0.9) | MaxUpdates(1000),
    n_jobs=-1,
)
values.sort(key="value")
df = values.to_dataframe(column="data_value", use_names=True)
Cancellation of futures is not supported by the joblib backend

Let us now consider the low-value artists (at least for predictive purposes; no claims are made about their artistic value!) and plot the results:

[Figure: Shapley values of the lowest-valued artists after the corruption, with 95% confidence intervals.]

Billie Eilish, our anomalous data group, has moved from being a top contributor to having a negative impact on the performance of the model, as expected!

What is going on? A popularity of 0 for Billie Eilish's songs is inconsistent with listening patterns for other artists. In artificially setting this, we degrade the predictive power of the model.

Dropping low-value groups or samples can often increase model performance, but inspecting them can also help identify bogus data sources or flawed acquisition methods.