Shapley for data valuation ¶
This notebook introduces Shapley methods for the computation of data value using pyDVL.
To illustrate the practical advantages of data valuation, we will predict the popularity of songs in the dataset Top Hits Spotify from 2000-2019, and highlight how data valuation can help investigate and improve the performance of models. In doing so, we will describe the basic usage patterns of pyDVL.
Recall that data value is a function of three things:
- The dataset.
- The model.
- The performance metric or scoring function.
Below we will describe how to instantiate each one of these objects and how to use them for data valuation. Please also see the documentation on data valuation.
Setup ¶
We begin by importing the main libraries and setting some defaults.
%matplotlib inline
import os
import random
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics._scorer import neg_mean_absolute_error_scorer
from support.shapley import load_spotify_dataset
plt.ioff() # Prevent jupyter from automatically plotting
plt.rcParams["figure.figsize"] = (20, 6)
plt.rcParams["font.size"] = 12
plt.rcParams["xtick.labelsize"] = 12
plt.rcParams["ytick.labelsize"] = 10
plt.rcParams["axes.facecolor"] = (1, 1, 1, 0)
plt.rcParams["figure.facecolor"] = (1, 1, 1, 0)
is_CI = os.environ.get("CI")
random_state = 24
random.seed(random_state)
n_jobs = 4
if is_CI:
n_jobs = 1
We will use the following classes and functions from pyDVL. The main entry point is the class ShapleyValuation, which provides the implementation of the Shapley method. In order to use it we need to instantiate three Datasets (training, validation and test), a PermutationSampler (with a RelativeTruncation policy to stop computation early), a SupervisedScorer to evaluate the model on the held-out validation set, and a ModelUtility to hold the model and the scoring function.
from pydvl.reporting.plots import plot_shapley
from pydvl.valuation.dataset import Dataset, GroupedDataset
from pydvl.valuation.methods.shapley import ShapleyValuation
from pydvl.valuation.samplers import PermutationSampler, RelativeTruncation
from pydvl.valuation.scorers import SupervisedScorer
from pydvl.valuation.stopping import HistoryDeviation, MaxUpdates
from pydvl.valuation.utility import ModelUtility
Loading and grouping the dataset ¶
pyDVL provides a support function for this notebook, load_spotify_dataset(), which downloads data on songs published after 2014, splits off 30% of the data for testing, and 30% of the remainder for validation. The return value is a triple of training, validation and test data as lists of the form [X_input, Y_label].
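The cell below shows how the data might be loaded. The keyword arguments (val_size, test_size, min_year) are assumptions about the helper's signature, chosen to match the splits described above.
# Download and split the data. NOTE: the keyword names below are assumptions
# about load_spotify_dataset()'s signature.
train_data, val_data, test_data = load_spotify_dataset(
    val_size=0.3, test_size=0.3, min_year=2014, random_state=random_state
)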
The dataset has many high-level features, some quite intuitive (duration_ms or tempo), while others are a bit more cryptic (valence?). For detailed information on each feature, please consult the dataset's website.
In our analysis, we will use every column except artist and song to predict the popularity of each song.
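Later cells refer to a variable artists holding the artist of each training song. A plausible preparation step is sketched below; the exact column handling is an assumption, the only requirement being that artists aligns one-to-one with the rows of train_data[0].
# Keep the artist of each training song for grouping later, then drop the
# columns we don't use as features. This is a sketch, not necessarily how
# the original notebook prepares the data.
artists = train_data[0]["artist"]
for data in (train_data, val_data, test_data):
    data[0] = data[0].drop(columns=["artist", "song"])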
Input and label data are then used to instantiate Dataset objects:
train_dataset = Dataset(
*train_data, feature_names=train_data[0].columns, target_names=["popularity"]
)
test_dataset = Dataset(
*test_data, feature_names=train_data[0].columns, target_names=["popularity"]
)
val_dataset = Dataset(
*val_data, feature_names=train_data[0].columns, target_names=["popularity"]
)
The calculation of exact Shapley values is computationally very expensive (exponentially so!) because it requires training the model on every possible subset of the training set. For this reason, pyDVL implements techniques to speed up the calculation, such as Monte Carlo approximations, surrogate models, caching of intermediate results, and grouping of data in order to calculate Shapley values of groups of data points instead of single data points.
In our case, we will group songs by artist and calculate the Shapley values for the artists. The class GroupedDataset takes an array mapping indices to group identifiers. These identifiers are 0-indexed and must be integers, so we can't use the artist names directly. Instead, we map the artist names to integers and use those as group identifiers. Note that we also cannot use the artist ids, which are non-contiguous. We build the necessary mappings below.
artist_to_gid = {
artist: i for i, artist in enumerate(artists.unique())
} # 1:1 artist name -> group id
gid_to_artist = [
artist for artist, i in sorted(artist_to_gid.items(), key=lambda x: x[1])
] # 1:1 group id -> artist name
song_to_gid = [
artist_to_gid[x] for x in artists
] # n:1 song loc[] in the data -> group id
grouped_train_dataset = GroupedDataset.from_dataset(
train_dataset, data_groups=song_to_gid, group_names=gid_to_artist
)
The songs are now grouped by artist, and values will be computed per-group. This is a common scenario in data valuation. On the one hand the data points are naturally grouped, and it is more informative to know the value of the group than the value of each data point. On the other, it is computationally much cheaper to calculate the value of a few groups than the value of each data point.
Creating the utility and computing values ¶
Now we can calculate the contribution of each group to the model performance.
As a model, we use scikit-learn's GradientBoostingRegressor, but pyDVL can work with any model from sklearn, xgboost or lightgbm. More precisely, any model that implements the protocol SupervisedModel (which is just the standard scikit-learn interface of fit(), predict() and score()) can be used to construct the utility.
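For instance, the model used throughout this notebook can be constructed as follows. A small number of estimators keeps the many retrainings during valuation cheap; the hyperparameters match the retraining cells further below.
# A deliberately small model: utility evaluation retrains it many times.
model = GradientBoostingRegressor(n_estimators=3, random_state=random_state)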
The third and final component is the scoring function. It can be anything like accuracy or \(R^2\), and is set, in the simplest way, by passing a string from the standard sklearn scoring methods to the SupervisedScorer class. Please refer to that class's documentation for information on how to define your own scoring function.
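Concretely, a sketch mirroring the scorer construction used in the anomalous-data section below:
# Score with negative mean absolute error on the validation set. `default`
# is the value returned when the model cannot be fit or scored.
scorer = SupervisedScorer(
    "neg_mean_absolute_error", test_data=val_dataset, default=0.0
)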
We collect the validation dataset, the model and the scoring function into an instance of ModelUtility.
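A minimal sketch, assuming the model and scorer variables from the previous cells:
# Bundle model and scorer; the utility maps subsets of training data to scores.
utility = ModelUtility(model=model, scorer=scorer)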
Now we configure the valuation method. Shapley values were popularized for data valuation in machine learning with Truncated Monte Carlo Shapley (TMCS), a Monte Carlo approximation of the Shapley value that uses the permutation-based definition of Shapley values and truncates the iteration over a given permutation once the marginal utility drops below a certain threshold. For more information on the method, see Ghorbani and Zou (2019) or pyDVL's documentation.
Like every semi-value method, ShapleyValuation requires a sampler and a stopping criterion. For the former we use a PermutationSampler, which samples permutations of indices and computes marginal contributions incrementally. By using RelativeTruncation, the processing of a permutation stops once the utility of a subset is close to the total utility. Finally, the stopping condition for the whole algorithm is given as in the TMCS paper: we stop once the total change over the last 100 steps falls below a threshold, with a minimum and a maximum number of updates as safeguards.
from joblib import parallel_config
from pydvl.valuation import MinUpdates
valuation = ShapleyValuation(
utility=utility,
sampler=PermutationSampler(
truncation=RelativeTruncation(rtol=0.01), seed=random_state
),
is_done=MinUpdates(200)
& (HistoryDeviation(n_steps=100, rtol=0.01) | MaxUpdates(1000)),
progress=True,
)
with parallel_config(n_jobs=n_jobs):
valuation.fit(grouped_train_dataset)
result = valuation.values()
result.sort(key="value")
df = result.to_dataframe(column="data_value", use_names=True)
Let's take a look at the returned dataframe:
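df.head()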
The first thing to notice is that we sorted the results in ascending order of Shapley value. The index holds the labels for each data group: in this case, artist names. The column data_value is just that: the Shapley data value. data_value_variance is the sample variance of the Monte Carlo estimate, and data_value_count is the number of updates to the estimate.
Let us plot the results. In the next cell we take the 30 artists with the lowest values and plot them with 95% Normal confidence intervals. Keep in mind that Monte Carlo Shapley is typically very noisy, and it can take many steps to arrive at a clean estimate.
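A sketch of such a plotting cell, using the plot_shapley helper imported above; the level, title and axis-label keywords are assumptions about its signature.
# df is sorted in ascending order, so the first 30 rows have the lowest values.
ax = plot_shapley(
    df.head(30),
    level=0.05,  # 95% Normal confidence intervals
    title="Shapley values of the 30 lowest-valued artists",
    xlabel="Artist",
    ylabel="Shapley value",
)
plt.show()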
We can immediately see that many artists (groups of samples) have very low, even negative value, which means that they tend to decrease the total score of the model when present in the training set! What happens if we remove them?
In the next cell we create a new training set excluding the artists in the bottom 10% of values:
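One possible construction, as a sketch: it assumes the artists series defined earlier and relies on df being sorted in ascending order of value.
# Artists in the bottom 10% of Shapley values...
low_value_artists = set(df.index[: int(0.1 * len(df))])
# ...and a mask selecting the songs by all remaining artists.
mask = ~artists.isin(low_value_artists).to_numpy()
clean_dataset = Dataset(
    train_data[0][mask],
    train_data[1][mask],
    feature_names=train_data[0].columns,
    target_names=["popularity"],
)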
Now we will use this "cleaned" dataset to retrain the same model and compare its mean absolute error to that of the model trained on the full dataset. Notice that the score is now calculated using the test set, whereas the Shapley values were computed against the validation set.
model_clean_data = GradientBoostingRegressor(
n_estimators=3, random_state=random_state
).fit(*clean_dataset.data())
error_good_data = neg_mean_absolute_error_scorer(model_clean_data, *test_dataset.data())
model_all_data = GradientBoostingRegressor(
    n_estimators=3, random_state=random_state  # same seed as above, for a fair comparison
).fit(*train_dataset.data())
error_all_data = neg_mean_absolute_error_scorer(model_all_data, *test_dataset.data())
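We can now compare the two scores. Remember that sklearn's neg_mean_absolute_error is negated, so higher is better.
print(f"Score on all data:     {error_all_data:.3f}")
print(f"Score on cleaned data: {error_good_data:.3f}")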
The score has improved by a noticeable amount! This is quite an important result, as it shows a consistent process to improve the performance of a model by excluding data points from its training set.
Evaluation on anomalous data ¶
One interesting test to validate the idea that Shapley values help locate bogus data is to corrupt some of it and monitor how its value changes. To do this, we will take one of the artists with the highest value and set the popularity of all their songs to 0.
Let us take all the songs by Billie Eilish, set their score to 0 and re-calculate the Shapley values.
y_train_anomalous = train_data[1].copy(deep=True)
y_train_anomalous[artists == "Billie Eilish"] = 0
anomalous_train_dataset = Dataset(
    train_data[0],
    y_train_anomalous,
    feature_names=train_data[0].columns,
    target_names=["popularity"],
)
grouped_anomalous_dataset = GroupedDataset.from_dataset(
anomalous_train_dataset, data_groups=song_to_gid, group_names=gid_to_artist
)
anomalous_utility = ModelUtility(
model=GradientBoostingRegressor(n_estimators=3, random_state=random_state),
scorer=SupervisedScorer(
"neg_mean_absolute_error", test_data=val_dataset, default=0.0
),
)
valuation = ShapleyValuation(
utility=anomalous_utility,
sampler=PermutationSampler(
truncation=RelativeTruncation(rtol=0.01), seed=random_state
),
is_done=HistoryDeviation(n_steps=100, rtol=1e-3) | MaxUpdates(1000),
progress=True,
)
with parallel_config(n_jobs=n_jobs):
valuation.fit(grouped_anomalous_dataset)
result = valuation.values()
result.sort(key="value")
df = result.to_dataframe(column="data_value", use_names=True)
Let us now consider the low-value artists (at least for predictive purposes; no claims are made about their artistic value!) and plot the results.
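As before, this is a sketch with assumed plot_shapley keyword arguments.
ax = plot_shapley(
    df.head(30),
    level=0.05,
    title="Shapley values of the 30 lowest-valued artists (corrupted data)",
    xlabel="Artist",
    ylabel="Shapley value",
)
plt.show()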
And Billie Eilish, our anomalous data group, has moved from top contributor to having a negative impact on the performance of the model, as expected!
What is going on? A popularity of 0 for Billie Eilish's songs is inconsistent with listening patterns for other artists. In artificially setting this, we degrade the predictive power of the model.
By dropping low-value groups or samples one can often increase model performance, but by inspecting them it is also possible to identify bogus data sources or faulty acquisition methods.