Shapley for data valuation ¶
This notebook introduces Shapley methods for the computation of data value using pyDVL.
To illustrate the practical advantages, we will predict the popularity of songs in the dataset Top Hits Spotify from 2000-2019, and highlight how data valuation can help investigate and boost the performance of the models. In doing so, we will describe the basic usage patterns of pyDVL.
Recall that data value is a function of three things:
- The dataset.
- The model.
- The performance metric or scoring function.
Below we will describe how to instantiate each one of these objects and how to use them for data valuation. Please also see the documentation on data valuation.
Setup ¶
We begin by importing the main libraries and setting some defaults.
We will be using the following functions from pyDVL. The main entry point is the function compute_shapley_values(), which provides a facade to all Shapley methods. In order to use it we need the classes Dataset, Utility and Scorer.
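A minimal import cell might look as follows. The sklearn imports are standard; the pyDVL module paths are an assumption and may differ between versions of the library.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

# pyDVL imports (module paths assumed; adjust to your installed version)
from pydvl.utils import Dataset, GroupedDataset, Scorer, Utility
from pydvl.value import (
    AbsoluteStandardError,
    MaxUpdates,
    RelativeTruncation,
    ShapleyMode,
    compute_shapley_values,
)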
Loading and grouping the dataset ¶
pyDVL provides a support function for this notebook, load_spotify_dataset(), which downloads data on songs published after 2014, splits off 30% of the data for testing, and then 30% of the remaining data for validation. The return value is a triple of training, validation and test data as lists of the form [X_input, Y_label].
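As a rough sketch (the location of the support module and its keyword arguments are assumptions, not part of pyDVL's public API):
# Hypothetical import path for the notebook's support code
from support.shapley import load_spotify_dataset

# Each returned split is a list of the form [X_input, Y_label]
training_data, val_data, test_data = load_spotify_dataset(val_size=0.3, test_size=0.3)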
The dataset has many high-level features, some quite intuitive ('duration_ms' or 'tempo'), while others are a bit more cryptic ('valence'?). For information on each feature, please consult the dataset's website.
In our analysis, we will use all the columns, except for 'artist' and 'song', to predict the 'popularity' of each song. We will nonetheless keep the information on song and artist in a separate object for future reference.
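A sketch of this step, assuming the splits above hold pandas DataFrames with the Kaggle column names:
# Keep song and artist aside for later reference (grouping and plots),
# then drop them from the inputs of every split
artist = training_data[0]["artist"]
song_name = training_data[0]["song"]
for split in (training_data, val_data, test_data):
    split[0] = split[0].drop(columns=["artist", "song"])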
Input and label data are then used to instantiate a Dataset object:
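A minimal sketch, mirroring the constructor used again in the anomalous-data section below; note that the validation split plays the role of the Dataset's test set, since the Shapley values are computed against validation performance:
dataset = Dataset(
    x_train=training_data[0],
    y_train=training_data[1],
    x_test=val_data[0],
    y_test=val_data[1],
)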
The calculation of exact Shapley values is computationally very expensive (exponentially so!) because it requires training the model on every possible subset of the training set. For this reason, pyDVL implements techniques to speed up the calculation, such as Monte Carlo approximations, surrogate models or caching of intermediate results, as well as grouping of data to calculate group Shapley values instead of values for individual points.
In our case, we will group songs by artist and calculate the Shapley value for the artists. Given the pandas Series for 'artist', to group the dataset by it, one does the following:
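The following sketch uses the same constructor that appears again in the anomalous-data section below:
grouped_dataset = GroupedDataset.from_dataset(dataset, artist)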
Creating the utility and computing values ¶
Now we can calculate the contribution of each group to the model performance.
As a model, we use scikit-learn's GradientBoostingRegressor, but pyDVL can work with any model from sklearn, xgboost or lightgbm. More precisely, any model that implements the protocol pydvl.utils.types.SupervisedModel, which is just the standard sklearn interface of fit(), predict() and score(), can be used to construct the utility.
The third and final component is the scoring function. It can be anything like accuracy or negative mean absolute error; in the simplest case it is constructed by passing the name of a standard sklearn scorer to the Scorer class, as we do below with "neg_mean_absolute_error".
We group the dataset, model and scoring function into an instance of Utility.
utility = Utility(
model=GradientBoostingRegressor(n_estimators=3),
data=grouped_dataset,
scorer=Scorer("neg_mean_absolute_error", default=0.0),
)
values = compute_shapley_values(
utility,
mode=ShapleyMode.TruncatedMontecarlo,
# Stop when the standard error of the estimates is below 0.2 for at least
# 90% of the values, or when the number of updates of any index exceeds 1000
done=AbsoluteStandardError(threshold=0.2, fraction=0.9) | MaxUpdates(1000),
truncation=RelativeTruncation(utility, rtol=0.01),
n_jobs=-1,
)
values.sort(key="value")
df = values.to_dataframe(column="data_value", use_names=True)
The function compute_shapley_values() serves as a common access point to all Shapley methods. For most of them, we must choose a StoppingCriterion with the argument done=. In this case we choose to stop when the standard error of the estimated values is below 0.2 for at least 90% of the training points, or when the number of updates of any index exceeds 1000. The mode argument specifies the Shapley method to use. Here we use the Truncated Monte Carlo approximation, which is the fastest of the Monte Carlo methods, owing both to using the permutation definition of Shapley values and to the ability to truncate the iteration over a given permutation. With the parameter truncation= and the policy RelativeTruncation, we configure truncation to happen when the contribution of the remaining elements is below 1% of the total utility.
Let's take a look at the returned dataframe:
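For example, the first few rows can be inspected with:
df.head()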
The first thing to notice is that we sorted the results in ascending order of Shapley value. The index holds the labels for each data group: in this case, artist names. The column data_value is just that: the Shapley data value, and data_value_stderr is its estimated standard error, since we are using a Monte Carlo approximation.
Let us plot the results. In the next cell we will take the 30 artists with the lowest values and plot them with 95% Normal confidence intervals. Keep in mind that Monte Carlo Shapley is typically very noisy, and it can take many steps to arrive at a clean estimate.
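A matplotlib sketch of such a plot (the styling in the original notebook may differ):
import matplotlib.pyplot as plt

low_value = df.iloc[:30]  # df is sorted in ascending order of value
plt.figure(figsize=(8, 4))
plt.errorbar(
    x=low_value.index,
    y=low_value["data_value"],
    yerr=1.96 * low_value["data_value_stderr"],  # 95% Normal confidence interval
    fmt="o",
    capsize=4,
)
plt.xticks(rotation=90)
plt.ylabel("Shapley value")
plt.tight_layout()
plt.show()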
We can immediately see that many artists (groups of samples) have very low, even negative value, which means that they tend to decrease the total score of the model when present in the training set! What happens if we remove them?
In the next cell we create a new training set that excludes the artists with the lowest values:
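A minimal sketch, assuming we drop every artist with a negative estimated value (the exact cutoff may differ; the variable names match the retraining cell below):
# df is sorted by value, so the low-value artists come first
low_value_artists = df[df["data_value"] < 0].index
mask = ~artist.isin(low_value_artists)
X_train_good_dvl = training_data[0][mask]
y_train_good_dvl = training_data[1][mask]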
Now we use this "cleaned" dataset to retrain the same model and compare its mean absolute error to that of the model trained on the full dataset. Notice that the score is now computed on the test set, whereas the Shapley values were computed against the validation set.
# Retrain on the filtered ("good") data and evaluate on the held-out test set
model_good_data = GradientBoostingRegressor(n_estimators=3).fit(
    X_train_good_dvl, y_train_good_dvl
)
error_good_data = mean_absolute_error(
    test_data[1], model_good_data.predict(test_data[0])
)

# Retrain on the full training set for comparison
model_all_data = GradientBoostingRegressor(n_estimators=3).fit(
    training_data[0], training_data[1]
)
error_all_data = mean_absolute_error(test_data[1], model_all_data.predict(test_data[0]))

print(f"Improvement: {100 * (error_all_data - error_good_data) / error_all_data:.2f}%")
The mean absolute error has dropped by almost 14%! This is quite an important result, as it shows a consistent procedure for improving the performance of a model by excluding data points from its training set.
Evaluation on anomalous data ¶
One interesting test is to corrupt some data and to monitor how their value changes. To do this, we will take one of the artists with the highest value and set the popularity of all their songs to 0.
Let us take all the songs by Billie Eilish, set their score to 0 and re-calculate the Shapley values.
# Copy the training labels and zero out the popularity of every Billie Eilish song
y_train_anomalous = training_data[1].copy(deep=True)
y_train_anomalous[artist == "Billie Eilish"] = 0

# As before, the validation split plays the role of the Dataset's test set
anomalous_dataset = Dataset(
    x_train=training_data[0],
    y_train=y_train_anomalous,
    x_test=val_data[0],
    y_test=val_data[1],
)
grouped_anomalous_dataset = GroupedDataset.from_dataset(anomalous_dataset, artist)
anomalous_utility = Utility(
model=GradientBoostingRegressor(n_estimators=3),
data=grouped_anomalous_dataset,
scorer=Scorer("neg_mean_absolute_error", default=0.0),
)
values = compute_shapley_values(
anomalous_utility,
mode=ShapleyMode.TruncatedMontecarlo,
done=AbsoluteStandardError(threshold=0.2, fraction=0.9) | MaxUpdates(1000),
n_jobs=-1,
)
values.sort(key="value")
df = values.to_dataframe(column="data_value", use_names=True)
Let us now consider the low-value artists (at least for predictive purposes; no claims are made about their artistic value!) and plot the results:
And Billie Eilish (our anomalous data group) has moved from being a top contributor to having a negative impact on the performance of the model, as expected!
What is going on? A popularity of 0 for Billie Eilish's songs is inconsistent with the listening patterns of other artists. By artificially setting it, we degrade the predictive power of the model.
Dropping low-value groups or samples can often improve model performance, but inspecting them is just as valuable: it makes it possible to identify bogus data sources or flawed acquisition methods.