Skip to content

Shapley for data valuation

This notebook introduces Shapley methods for the computation of data value using pyDVL.

In order to illustrate the practical advantages, we will predict the popularity of songs in the dataset Top Hits Spotify from 2000-2019, and highlight how data valuation can help investigate and boost the performance of the models. In doing so, we will describe the basic usage patterns of pyDVL.

Recall that data value is a function of three things:

  1. The dataset.
  2. The model.
  3. The performance metric or scoring function.

Below we will describe how to instantiate each one of these objects and how to use them for data valuation. Please also see the documentation on data valuation.

Setup

We begin by importing the main libraries and setting some defaults.

If you are reading this in the documentation, some boilerplate (including most plotting code) has been omitted for convenience.

We will be using the following functions from pyDVL. The main entry point is the function compute_shapley_values(), which provides a facade to all Shapley methods. In order to use it we need the classes Dataset, Utility and Scorer.

%autoreload
from pydvl.reporting.plots import plot_shapley
from pydvl.utils.dataset import GroupedDataset
from support.shapley import load_spotify_dataset
from pydvl.value import *

Loading and grouping the dataset

pyDVL provides a support function for this notebook, load_spotify_dataset(), which downloads data on songs published after 2014, and splits 30% of data for testing, and 30% of the remaining data for validation. The return value is a triple of training, validation and test data as lists of the form [X_input, Y_label].

training_data, val_data, test_data = load_spotify_dataset(
    val_size=0.3, test_size=0.3, target_column="popularity", random_state=random_state
)
training_data[0].head()
artist song duration_ms explicit year danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo genre
1561 Fetty Wap 679 (feat. Remy Boyz) 196693 True 2015 0.618 0.717 7 -5.738 1 0.3180 0.00256 0.000000 0.6250 0.603 190.050 8
1410 Meghan Trainor All About That Bass 187920 True 2015 0.807 0.887 9 -3.726 1 0.0503 0.05730 0.000003 0.1240 0.961 134.052 14
1772 Katy Perry Chained To The Rhythm 237733 False 2017 0.562 0.800 0 -5.404 1 0.1120 0.08140 0.000000 0.1990 0.471 95.029 14
1670 Sigala Sweet Lovin' - Radio Edit 202149 False 2015 0.683 0.910 10 -1.231 1 0.0515 0.05530 0.000005 0.3360 0.674 124.977 15
1780 Liam Payne Strip That Down 204502 False 2017 0.869 0.485 6 -5.595 1 0.0545 0.24600 0.000000 0.0765 0.527 106.028 14

The dataset has many high-level features, some quite intuitive ('duration_ms' or 'tempo'), while others are a bit more cryptic ('valence'?). For information on each feature, please consult the dataset's website.

In our analysis, we will use all the columns, except for 'artist' and 'song', to predict the 'popularity' of each song. We will nonetheless keep the information on song and artist in a separate object for future reference.

song_name = training_data[0]["song"]
artist = training_data[0]["artist"]
training_data[0] = training_data[0].drop(["song", "artist"], axis=1)
test_data[0] = test_data[0].drop(["song", "artist"], axis=1)
val_data[0] = val_data[0].drop(["song", "artist"], axis=1)

Input and label data are then used to instantiate a Dataset object:

dataset = Dataset(*training_data, *val_data)

The calculation of exact Shapley values is computationally very expensive (exponentially so!) because it requires training the model on every possible subset of the training set. For this reason, PyDVL implements techniques to speed up the calculation, such as Monte Carlo approximations, surrogate models or caching of intermediate results and grouping of data to calculate group Shapley values instead of single data points.

In our case, we will group songs by artist and calculate the Shapley value for the artists. Given the pandas Series for 'artist', to group the dataset by it, one does the following:

grouped_dataset = GroupedDataset.from_dataset(dataset=dataset, data_groups=artist)

Creating the utility and computing values

Now we can calculate the contribution of each group to the model performance.

As a model, we use scikit-learn's GradientBoostingRegressor, but pyDVL can work with any model from sklearn, xgboost or lightgbm. More precisely, any model that implements the protocol pydvl.utils.types.SupervisedModel, which is just the standard sklearn interface of fit(),predict() and score() can be used to construct the utility.

The third and final component is the scoring function. It can be anything like accuracy or \(R^2\), and is set with a string from the standard sklearn scoring methods. Please refer to that documentation on information on how to define your own scoring function.

We group dataset, model and scoring function into an instance of Utility.

utility = Utility(
    model=GradientBoostingRegressor(n_estimators=3),
    data=grouped_dataset,
    scorer=Scorer("neg_mean_absolute_error", default=0.0),
)
values = compute_shapley_values(
    utility,
    mode=ShapleyMode.TruncatedMontecarlo,
    # Stop if the standard error is below 1% of the range of the values (which is ~2),
    # or if the number of updates exceeds 1000
    done=AbsoluteStandardError(threshold=0.2, fraction=0.9) | MaxUpdates(1000),
    truncation=RelativeTruncation(utility, rtol=0.01),
    n_jobs=-1,
)
values.sort(key="value")
df = values.to_dataframe(column="data_value", use_names=True)

The function compute_shapley_values() serves as a common access point to all Shapley methods. For most of them, we must choose a StoppingCriterion with the argument done=. In this case we choose to stop when the ratio of standard error to value is below 0.2 for at least 90% of the training points, or if the number of updates of any index exceeds 1000. The mode argument specifies the Shapley method to use. In this case, we use the Truncated Monte Carlo approximation, which is the fastest of the Monte Carlo methods, owing both to using the permutation definition of Shapley values and the ability to truncate the iteration over a given permutation. We configure this to happen when the contribution of the remaining elements is below 1% of the total utility with the parameter truncation= and the policy RelativeTruncation.

Let's take a look at the returned dataframe:

df.head()
data_value data_value_stderr
Kendrick Lamar -1.279149 0.091670
BLACKPINK -1.277363 0.177476
Adele -1.241698 0.183732
5 Seconds of Summer -1.228002 0.103377
Flume -1.197065 0.102345

The first thing to notice is that we sorted the results in ascending order of Shapley value. The index holds the labels for each data group: in this case, artist names. The column data_value is just that: the Shapley Data value, and data_value_stderr is its estimated standard error because we are using a Monte Carlo approximation.

Let us plot the results. In the next cell we will take the 30 artists with the lowest score and plot their values with 95% Normal confidence intervals. Keep in mind that Monte Carlo Shapley is typically very noisy, and it can take many steps to arrive at a clean estimate.

No description has been provided for this image

We can immediately see that many artists (groups of samples) have very low, even negative value, which means that they tend to decrease the total score of the model when present in the training set! What happens if we remove them?

In the next cell we create a new training set excluding the artists with the lowest scores:

low_dvl_artists = df.iloc[:30].index.to_list()
artist_filter = ~artist.isin(low_dvl_artists)
X_train_good_dvl = training_data[0][artist_filter]
y_train_good_dvl = training_data[1][artist_filter]

Now we will use this "cleaned" dataset to retrain the same model and compare its mean absolute error to the one trained on the full dataset. Notice that the score now is calculated using the test set, while in the calculation of the Shapley values we were using the validation set.

model_good_data = GradientBoostingRegressor(n_estimators=3).fit(
    X_train_good_dvl, y_train_good_dvl
)
error_good_data = mean_absolute_error(
    model_good_data.predict(test_data[0]), test_data[1]
)

model_all_data = GradientBoostingRegressor(n_estimators=3).fit(
    training_data[0], training_data[1]
)
error_all_data = mean_absolute_error(model_all_data.predict(test_data[0]), test_data[1])

print(f"Improvement: {100*(error_all_data - error_good_data)/error_all_data:02f}%")
Improvement: 13.940685%

The score has improved by almost 14%! This is quite an important result, as it shows a consistent process to improve the performance of a model by excluding data points from its training set.

One must however proceed with caution instead of simply throwing away data. For one, `mean_absolute_error` is an estimate of generalization error on unseen data, so the improvement we see on the test set might not be as large upon deployment. It would be advisable to cross-validate this whole process to obtain more conservative estimates. It is also advisable to manually inspect the artists with low value and to try to understand the reason why the model behaves like it does. Finally, remember that **the value depends on the model chosen**! Artists that are detrimental to the Gradient Boosting Regressor might be informative for a different model (although it is likely that the worst ones share some characteristic making them "bad" for other regressors).

Evaluation on anomalous data

One interesting test is to corrupt some data and to monitor how their value changes. To do this, we will take one of the artists with the highest value and set the popularity of all their songs to 0.

No description has been provided for this image

Let us take all the songs by Rihanna, set their score to 0 and re-calculate the Shapley values.

y_train_anomalous = training_data[1].copy(deep=True)
y_train_anomalous[artist == "Rihanna"] = 0
anomalous_dataset = Dataset(
    x_train=training_data[0],
    y_train=y_train_anomalous,
    x_test=val_data[0],
    y_test=val_data[1],
)
grouped_anomalous_dataset = GroupedDataset.from_dataset(anomalous_dataset, artist)
anomalous_utility = Utility(
    model=GradientBoostingRegressor(n_estimators=3),
    data=grouped_anomalous_dataset,
    scorer=Scorer("neg_mean_absolute_error", default=0.0),
)
values = compute_shapley_values(
    anomalous_utility,
    mode=ShapleyMode.TruncatedMontecarlo,
    done=AbsoluteStandardError(threshold=0.2, fraction=0.9) | MaxUpdates(1000),
    n_jobs=-1,
)
values.sort(key="value")
df = values.to_dataframe(column="data_value", use_names=True)

Let us now consider the low-value artists (at least for predictive purposes, no claims are made about their artistic value!) and plot the results

No description has been provided for this image

And Rihanna (our anomalous data group) has moved from top contributor to having negative impact on the performance of the model, as expected!

What is going on? A popularity of 0 for Rihanna's songs is inconsistent with listening patterns for other artists. In artificially setting this, we degrade the predictive power of the model.

By dropping low-value groups or samples, one can often increase model performance, but by inspecting them, it is possible to identify bogus data sources or acquisition methods.


Last update: 2023-08-24
Created: 2021-04-02