
Bagging for data valuation

This notebook introduces the Data-OOB method as implemented in pyDVL. The method is based on the paper by Kwon and Zou, "Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value", ICML 2023.

The main objective of the paper is to overcome the computational bottleneck of Shapley-based data valuation methods, which require fitting a large number of models to estimate marginal contributions accurately. Data-OOB instead computes data values from the out-of-bag (OOB) estimates of a bagging model.

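The values can be interpreted as a partition of the OOB estimate, which was originally introduced to estimate the prediction error. This OOB estimate is given by:

\[ \sum_{i=1}^n\frac{\sum_{b=1}^{B}\mathbb{1}(w_{bi}=0)T(y_i, \hat{f}_b(x_i))}{\sum_{b=1}^{B} \mathbb{1} (w_{bi}=0)} \]

where \(w_{bi}\) is the number of times the \(i\)-th point is selected in the \(b\)-th bootstrap sample, \(\hat{f}_b\) is the weak learner trained on that sample, \(T\) is the score function (for example, accuracy), and \(B\) is the number of bootstrap samples. The summand for a given \(i\) is the data value of the \(i\)-th point.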
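To make the formula above concrete, here is a minimal sketch in plain Python that computes the per-point terms of the sum. It is not pyDVL's implementation: the nearest-centroid weak learner, the function names, the full-size bootstrap, and the toy one-dimensional dataset are all assumptions made for illustration, and the score \(T\) is taken to be 0/1 accuracy.

```python
import random


def nearest_centroid_fit(xs, ys):
    """Toy weak learner (an assumption of this sketch): one mean per class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def nearest_centroid_predict(centroids, x):
    """Predict the class whose mean is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def data_oob_values(xs, ys, n_estimators=200, seed=0):
    """Average 0/1 score of each point over the bags that do not contain it."""
    rng = random.Random(seed)
    n = len(xs)
    score_sum = [0.0] * n  # numerator: sum of T over bags where w_bi = 0
    oob_count = [0] * n    # denominator: number of bags where w_bi = 0
    for _ in range(n_estimators):
        bag = rng.choices(range(n), k=n)  # bootstrap sample with replacement
        in_bag = set(bag)
        model = nearest_centroid_fit([xs[j] for j in bag], [ys[j] for j in bag])
        for i in range(n):
            if i not in in_bag:  # w_bi = 0: point i is out of bag
                correct = nearest_centroid_predict(model, xs[i]) == ys[i]
                score_sum[i] += 1.0 if correct else 0.0
                oob_count[i] += 1
    # Points that were never out of bag get no estimate (None).
    return [s / c if c else None for s, c in zip(score_sum, oob_count)]


# Two well-separated 1-D clusters; the last point is deliberately mislabeled.
xs = [0.0, 1.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0, 14.0, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
values = data_oob_values(xs, ys)
```

In this toy setup the mislabeled point is almost never predicted correctly when it is out of bag, so it receives a much lower value than the correctly labeled points. This is the behaviour that makes low values a useful signal for problematic data.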

Setup

We begin by importing the main libraries and setting some defaults.

If you are reading this in the documentation, some boilerplate (including most plotting code) has been omitted for convenience.
%autoreload
from pydvl.utils import Dataset, Scorer, Seed, Utility, ensure_seed_sequence
from pydvl.value import ValuationResult, compute_data_oob

We will work with the adult classification dataset from the UCI repository. The objective is to predict whether a person earns more than 50k a year based on a set of features such as age, education, occupation, etc.

With a helper function we download the data and obtain the following pandas dataframe, where the categorical features have been removed:

Found cached file: adult_data.pkl.

data_adult.head()
   age  fnlwgt  education-num  capital-gain  capital-loss  hours-per-week  income
0   39   77516             13          2174             0              40   <=50K
1   50   83311             13             0             0              13   <=50K
2   38  215646              9             0             0              40   <=50K
3   53  234721              7             0             0              40   <=50K
4   28  338409             13             0             0              40   <=50K

Computing the OOB values

The main idea of Data-OOB is to take an existing classification or regression model and compute a per-sample out-of-bag performance estimate via bagging.

For this example, we use a simple KNN classifier with \(k=5\) neighbours and compute the Data-OOB values for two choices of the number of estimators in the bagging ensemble. To do so, we construct a Utility object, using the Scorer class to specify the evaluation metric. Note how we pass a random seed to Dataset.from_arrays to ensure that we always get the same split when running this notebook multiple times. This will be particularly important when running the standard point removal experiments later.

We then use the compute_data_oob function to compute the Data-OOB values.

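from sklearn.neighbors import KNeighborsClassifier

# random_state is assumed to be defined in the setup boilerplate omitted above.
# Dataset.from_arrays splits the data into training and test sets.
data = Dataset.from_arrays(
    X=data_adult.drop(columns=["income"]).values,
    y=data_adult.loc[:, "income"].cat.codes.values,
    random_state=random_state,
)

model = KNeighborsClassifier(n_neighbors=5)

# The utility bundles model, data and scoring metric.
utility = Utility(model, data, Scorer("accuracy", default=0.0))

# Compute Data-OOB values once per choice of the number of bagged estimators.
n_estimators = [100, 500]
oob_values = [
    compute_data_oob(utility, n_est=n_est, max_samples=0.95, seed=random_state)
    for n_est in n_estimators
]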

The two results are stored in a list of ValuationResult objects. Their distributions are shown below. The left-hand side depicts the values in increasing order of rank, together with a 99% t-confidence interval. The right-hand side shows the histogram of values.

Observe how adding estimators reduces the variance of the values, but doesn't change their distribution much.

[Figure: Data-OOB values by rank with 99% t-confidence intervals (left) and histogram of values (right), for 100 and 500 estimators.]

Point removal experiments

The standard procedure for evaluating data valuation schemes is the point removal experiment. The objective is to measure how performance evolves as the best or worst points are removed from the training set. This can be done with the function compute_removal_score, which takes precomputed values and computes the performance of the model as points are removed.
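The logic of a point removal experiment can be sketched in plain Python as follows. This is an illustration only and not the code of compute_removal_score: the nearest-centroid model, the function and parameter names (removal_scores, percentages, remove_best), the toy data, and the hand-written values are all hypothetical.

```python
def fit_centroids(points):
    """Toy model (an assumption of this sketch): one mean per class, 1-D inputs."""
    by_label = {}
    for x, y in points:
        by_label.setdefault(y, []).append(x)
    return {label: sum(v) / len(v) for label, v in by_label.items()}


def accuracy(centroids, points):
    """Fraction of points whose nearest class mean matches their label."""
    hits = sum(
        min(centroids, key=lambda c: abs(centroids[c] - x)) == y for x, y in points
    )
    return hits / len(points)


def removal_scores(train, test, values, percentages, remove_best=False):
    """Retrain after dropping a fraction of the points, ranked by value."""
    # Lowest values first by default; highest first when remove_best is True.
    order = sorted(range(len(train)), key=lambda i: values[i], reverse=remove_best)
    scores = {}
    for pct in percentages:
        n_removed = int(pct * len(train))
        kept = [train[i] for i in order[n_removed:]]
        scores[pct] = accuracy(fit_centroids(kept), test)
    return scores


# Toy training set: the last two points lie in the class-1 region but carry label 0.
train = [(0, 0), (1, 0), (2, 0), (8, 1), (9, 1), (10, 1), (12, 0), (14, 0)]
test = [(1.5, 0), (2.5, 0), (7.0, 1), (8.5, 1)]
# Hypothetical precomputed values: low for the two mislabeled points.
values = [0.99, 0.97, 0.85, 0.80, 0.90, 0.95, 0.05, 0.10]

worst_first = removal_scores(train, test, values, [0.0, 0.25])
best_first = removal_scores(train, test, values, [0.25], remove_best=True)
```

With these toy numbers, removing the two lowest-valued points raises the test accuracy, while removing the two highest-valued points lowers it. A useful valuation method should produce this kind of separation between the two curves.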

To test the true performance of Data-OOB, we repeat the whole procedure multiple times: splitting the dataset into training and valuation sets, computing the values, and running the point removal experiment. For full reproducibility, it is important to pass the random state consistently to every step.
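One way to keep repeated runs reproducible is to derive one seed per repetition from a single master seed and to use it for every random step of that repetition. The sketch below uses only Python's random module as a stand-in for the seeding utilities of the notebook; the helper names child_seeds and train_test_indices are hypothetical.

```python
import random


def child_seeds(master_seed, n_repetitions):
    """Derive one seed per repetition from a single master seed."""
    rng = random.Random(master_seed)
    return [rng.getrandbits(32) for _ in range(n_repetitions)]


def train_test_indices(n_points, train_fraction, seed):
    """Shuffle indices with a dedicated generator and split them in two."""
    rng = random.Random(seed)
    indices = list(range(n_points))
    rng.shuffle(indices)
    cut = round(train_fraction * n_points)
    return indices[:cut], indices[cut:]


# Each repetition gets its own seed, so reruns reproduce every split exactly.
seeds = child_seeds(master_seed=42, n_repetitions=3)
splits = [train_test_indices(10, 0.7, seed) for seed in seeds]
```

Running the snippet again with the same master seed yields the same seeds and therefore the same splits, while each repetition still uses its own split.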

[Figure: results of the point removal experiments.]