Data valuation

Info

If you want to jump right into it, skip ahead to Computing data values. If you want a quick list of applications, see Applications of data valuation. For a list of all algorithms implemented in pyDVL, see Methods.

Data valuation is the task of assigning a number to each element of a training set which reflects its contribution to the final performance of some model trained on it. Some methods attempt to be model-agnostic, but in most cases the model is an integral part of the method. In these cases, this number is not an intrinsic property of the element of interest, but typically a function of three factors:

  1. The dataset \(D\), or more generally, the distribution it was sampled from: in some cases one only cares about values wrt. a given data set; in others, value would ideally be the (expected) contribution of a data point to any random set \(D\) sampled from the same distribution. pyDVL implements methods of the first kind.

  2. The algorithm \(\mathcal{A}\) mapping the data \(D\) to some estimator \(f\) in a model class \(\mathcal{F}\). E.g. MSE minimization to find the parameters of a linear model.

  3. The performance metric of interest \(u\) for the problem. When value depends on a model, it must be measured in some way that uses the model, e.g. via the \(R^2\) score or the negative MSE. This metric is computed over a held-out valuation set.

pyDVL collects algorithms for the computation of data values in this sense, mostly those derived from cooperative game theory. The methods can be found in the package pydvl.value, with support from the modules pydvl.utils.dataset and pydvl.utils.utility, as detailed below.

Warning

Be sure to read the section on the difficulties using data values.

There are three main families of methods for data valuation: game-theoretic, influence-based and intrinsic. As of v0.8.1, pyDVL supports the first two. Here we focus on game-theoretic concepts and refer to the main documentation on the influence function for the second.

Game theoretical methods

The main contenders in game-theoretic approaches are Shapley values (Ghorbani and Zou, 2019)1, (Kwon et al., 2021)2, (Schoch et al., 2022)3, their generalization to so-called semi-values by (Kwon and Zou, 2022)4 and (Wang and Jia, 2022), and the Core (Yan and Procaccia, 2021)5. All of these are implemented in pyDVL. For a full list, see Methods.

In these methods, data points are considered players in a cooperative game whose outcome is the performance of the model when trained on subsets (coalitions) of the data, measured on a held-out valuation set. This outcome, or utility, must typically be computed for every subset of the training set, so that an exact computation is \(\mathcal{O} (2^n)\) in the number of samples \(n\), with each iteration requiring a full re-fitting of the model using a coalition as training set. Consequently, most methods involve Monte Carlo approximations, and sometimes approximate utilities which are faster to compute, e.g. proxy models (Wang et al., 2022)6 or constant-cost approximations like Neural Tangent Kernels (Wu et al., 2022)7.
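To make the combinatorial cost concrete, here is a small, self-contained sketch (plain Python, not using pyDVL) that computes exact Shapley values for a toy three-player game by enumerating every coalition. The toy utility function is purely illustrative:

from itertools import combinations
from math import comb

# Toy additive utility: players 0 and 1 contribute 1.0 each, player 2 contributes 2.0.
def utility(coalition) -> float:
    return sum(2.0 if i == 2 else 1.0 for i in coalition)

n = 3
shapley = {}
for i in range(n):
    others = [j for j in range(n) if j != i]
    value = 0.0
    # Average the marginal contribution of i over all 2^(n-1) coalitions not
    # containing i, with the Shapley weights 1 / (n * C(n-1, k)) for size k.
    for k in range(n):
        for S in combinations(others, k):
            value += (utility(S + (i,)) - utility(S)) / (n * comb(n - 1, k))
    shapley[i] = value

print(shapley)  # approximately {0: 1.0, 1: 1.0, 2: 2.0}

Already with three players this enumerates \(2^{n-1} = 4\) coalitions per player, and in a real setting every utility call means re-fitting a model, which is why Monte Carlo sampling is the norm.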

The reasoning behind using game theory is that, in order to be useful, an assignment of value, dubbed valuation function, is usually required to fulfil certain requirements of consistency and "fairness". For instance, in some applications value should not depend on the order in which data are considered, or it should be equal for samples that contribute equally to any subset of the data (of equal size). When considering aggregated value for (sub-)sets of data there are additional desiderata, like having a value function that does not increase with repeated samples. Game-theoretic methods are all rooted in axioms that by construction ensure different desiderata, but despite their practical usefulness, none of them is either necessary or sufficient for all applications. For instance, Shapley value (SV) methods try to distribute all value equitably among all samples, and thus fail to identify duplicated samples as redundant, e.g. by assigning them zero value.

Computing data values

Using pyDVL to compute data values is a simple process that can be broken down into three steps:

  1. Creating a Dataset object from your data.
  2. Creating a Utility which ties your model to the dataset and a scoring function.
  3. Computing values with a method of your choice, e.g. via compute_shapley_values.
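Putting the three steps together, a minimal end-to-end sketch might look as follows. The exact parameters of compute_shapley_values used here (a mode string and a stopping criterion such as MaxUpdates), as well as the import paths, are assumptions that may differ between pyDVL versions; each step is detailed in the subsections below.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Utility
from pydvl.value import compute_shapley_values
from pydvl.value.stopping import MaxUpdates

# 1. Wrap the data (from_sklearn performs the train/test split).
dataset = Dataset.from_sklearn(load_breast_cancer())
# 2. Tie model, data and (default) scorer together.
utility = Utility(LogisticRegression(max_iter=500), dataset)
# 3. Run a Monte Carlo Shapley method until the stopping criterion is met
#    (mode and done are illustrative; consult the Shapley API for options).
values = compute_shapley_values(
    utility, mode="truncated_montecarlo", done=MaxUpdates(100)
)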

Creating a Dataset

The first item in the tuple \((D, \mathcal{A}, u)\) characterising data value is the dataset. The class Dataset is a simple convenience wrapper for the train and test splits that is used throughout pyDVL. The test set will be used to evaluate a scoring function for the model.

It can be used as follows:

import numpy as np
from pydvl.utils import Dataset
from sklearn.model_selection import train_test_split
X, y = np.arange(100).reshape((50, 2)), np.arange(50)
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.5, random_state=16
)
dataset = Dataset(X_train, X_test, y_train, y_test)

It is also possible to construct Datasets from sklearn toy datasets for illustrative purposes using from_sklearn.
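For example, a minimal sketch (from_sklearn performs the train/test split internally):

from sklearn.datasets import load_diabetes
from pydvl.utils import Dataset

# Wrap an sklearn data bunch directly as a Dataset.
dataset = Dataset.from_sklearn(load_diabetes())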

Grouping data

Whether because data valuation methods are computationally very expensive, or because we are interested in the groups themselves, it is often useful or necessary to group samples and valuate them together. GroupedDataset provides an alternative to Dataset with the same interface which allows this.

You can see an example in action in the Spotify notebook, but here's a simple example grouping a pre-existing Dataset. First we construct an array mapping each index in the dataset to a group, then use from_dataset:

import numpy as np
from pydvl.utils import GroupedDataset

# Randomly assign each sample to one of num_groups groups
# (num_groups is set here only to make the snippet runnable):
num_groups = 5
data_groups = np.random.randint(0, num_groups, len(dataset))
grouped_dataset = GroupedDataset.from_dataset(dataset, data_groups)

Creating a Utility

In pyDVL we have slightly overloaded the name "utility" and use it to refer to an object that keeps track of all three items in \((D, \mathcal{A}, u)\). This will be an instance of Utility which, as mentioned, is a convenient wrapper for the dataset, model and scoring function used for valuation methods.

Here's a minimal example:

from sklearn.datasets import load_iris
from sklearn.svm import SVC

from pydvl.utils import Dataset, Utility

dataset = Dataset.from_sklearn(load_iris())
model = SVC()
utility = Utility(model, dataset)

The object utility is a callable that data valuation methods will execute with different subsets of training data. Each call will retrain the model on a subset and evaluate it on the test data using a scoring function. By default, Utility will use model.score(), but it is possible to use any scoring function (greater values must be better). In particular, the constructor accepts the same types of argument as the scoring parameter of sklearn.model_selection.cross_validate: a string, a scorer callable or None for the default.

utility = Utility(model, dataset, "explained_variance")
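Since the utility is a callable on subsets of training indices, it can also be evaluated by hand, which is what the valuation methods do internally. A minimal sketch, using index-based calls as in the Data Utility Learning example further below:

# Retrain the model on the coalition of training indices {0, 1, 2} and
# return its score on the test data held by the dataset.
score = utility((0, 1, 2))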

Utility will wrap the fit() method of the model to cache its results. This greatly reduces computation times of Monte Carlo methods. Because of how caching is implemented, it is important not to reuse Utility objects for different datasets. You can read more about setting up the cache in the installation guide, and in the documentation of the caching module.

Using custom scorers

The scoring argument of Utility can be used to specify a custom Scorer object. This is a simple wrapper for a callable that takes a model and test data, and returns a score.

More importantly, the object provides information about the range of the score, which is used by some methods to estimate the number of samples necessary, and about what default value to use when the model fails to train.

Note

The most important property of a Scorer is its default value. Because many models will fail to fit on small subsets of the data, it is important to provide a sensible default value for the score.

It is possible to skip the construction of the Scorer when constructing the Utility object. The two following calls are equivalent:

import numpy as np

from pydvl.utils import Utility, Scorer

utility = Utility(
   model, dataset, "explained_variance", score_range=(-np.inf, 1), default_score=0.0
)
utility = Utility(
   model, dataset, Scorer("explained_variance", range=(-np.inf, 1), default=0.0)
)
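Besides a string, Scorer can wrap a custom callable taking a model and test data, as described above. The following is a minimal sketch assuming the sklearn scorer convention (model, X, y); the function hard_accuracy is purely illustrative:

import numpy as np
from pydvl.utils import Scorer, Utility

def hard_accuracy(model, X, y) -> float:
    # Fraction of exactly correct predictions, bounded in [0, 1].
    return float(np.mean(model.predict(X) == y))

scorer = Scorer(hard_accuracy, range=(0, 1), default=0.0)
utility = Utility(model, dataset, scorer)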

Learning the utility

Because each evaluation of the utility entails a full retrain of the model with a new subset of the training set, it is natural to try to learn this mapping from subsets to scores. This is the idea behind Data Utility Learning (DUL) (Wang et al., 2022)6 and in pyDVL it's as simple as wrapping the Utility inside DataUtilityLearning:

from pydvl.utils import Utility, DataUtilityLearning, Dataset
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.datasets import load_iris

dataset = Dataset.from_sklearn(load_iris())
u = Utility(LogisticRegression(), dataset)
training_budget = 3
wrapped_u = DataUtilityLearning(u, training_budget, LinearRegression())

# First 3 calls will be computed normally
for i in range(training_budget):
   _ = wrapped_u((i,))
# Subsequent calls will be estimated with the surrogate model fitted by DUL
wrapped_u((1, 2, 3))

As you can see, all that is required is a model to learn the utility itself; the fitting and use of the learned model happen behind the scenes.

There is a longer example with an investigation of the results achieved by DUL in a dedicated notebook.

Leave-One-Out values

LOO is the simplest approach to valuation. It assigns to each sample its marginal utility as value:

\[v_u(i) = u(D) - u(D_{-i}).\]

For notational simplicity, we consider the valuation function as defined over the indices of the dataset \(D\), and \(i \in D\) is the index of the sample, \(D_{-i}\) is the training set without the sample \(x_i\), and \(u\) is the utility function. See the section on notation for more.

For the purposes of data valuation, this is rarely useful beyond serving as a baseline for benchmarking, although in some benchmarks it can occasionally perform surprisingly well. One particular weakness is that it does not necessarily correlate with an intrinsic value of a sample: since it is a marginal utility, it is affected by diminishing returns. Often, the training set is large enough for a single sample not to have any significant effect on training performance, despite any qualities it may possess. Whether this is indicative of low value or not depends on one's goals and definitions, but other methods are typically preferable.

from pydvl.value.loo import compute_loo

values = compute_loo(utility, n_jobs=-1)

The return value of all valuation functions is an object of type ValuationResult. This can be iterated over, indexed with integers, slices and Iterables, as well as converted to a pandas.DataFrame.
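For instance, to inspect the results as a DataFrame (a minimal sketch; the conversion method is assumed here to be to_dataframe, check the ValuationResult API of your version):

# One row per data point (or group), with the computed value for each.
df = values.to_dataframe()
print(df.head())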

Problems of data values

There are a number of factors that affect how useful values can be for your project. In particular, regression can be especially tricky, but the particular nature of every (non-trivial) ML problem can have an effect:

  • Variance of the utility: Classical applications of game-theoretic value concepts operate with deterministic utilities, as do many of the bounds in the literature. But in ML we use an evaluation of the model on a validation set as a proxy for the true risk. Even if the utility is bounded, its variance will affect the final values, and even more so any Monte Carlo estimates. Several works have tried to cope with this variance. (Wang and Jia, 2022) show that, by relaxing one of the Shapley axioms and considering the general class of semi-values, of which Shapley is an instance, a choice of constant weights is the best one can do in a utility-agnostic setting. This method, dubbed Data Banzhaf, is available in pyDVL as compute_banzhaf_semivalues (see the sketch after this list).

    Averaging repeated utility evaluations

    One workaround in pyDVL is to configure the caching system to allow multiple evaluations of the utility for every index set. A moving average is computed and returned once the standard error is small, see CachedFuncConfig. Note however that in practice, the likelihood of cache hits is low, so one would have to force recomputation manually somehow.

  • Unbounded utility: Choosing a scorer for a classifier is simple: accuracy or some F-score provides a bounded number with a clear interpretation. However, in regression problems most scores, like \(R^2\), are not bounded because regressors can be arbitrarily bad. This leads to great variability in the utility for low sample sizes, and hence to unreliable Monte Carlo approximations to the values. Nevertheless, in practice it is only the ranking of samples that matters, and this tends to be accurate (wrt. the true ranking) despite inaccurate values.

    Squashing scores

    pyDVL offers a dedicated function composition for scorer functions which can be used to squash a score. The following is defined in module score:

    import numpy as np
    from pydvl.utils import compose_score
    
    def sigmoid(x: float) -> float:
      return float(1 / (1 + np.exp(-x)))
    
    squashed_r2 = compose_score("r2", sigmoid, "squashed r2")
    
    squashed_variance = compose_score(
      "explained_variance", sigmoid, "squashed explained variance"
    )
    
    These squashed scores can prove useful in regression problems, but they can also introduce issues in the low-value regime.

  • Data set size: Computing exact Shapley values is NP-hard, and Monte Carlo approximations can converge slowly. Massive datasets are thus impractical, at least with game-theoretical methods. A workaround is to group samples and investigate their value together. You can do this using GroupedDataset. There is a fully worked-out example here. Some algorithms also provide different sampling strategies to reduce the variance, but due to a no-free-lunch-type theorem, no single strategy can be optimal for all utilities. Finally, model specific methods like kNN-Shapley (Jia et al., 2019)8, or altogether different and typically faster approaches like Data-OOB (Kwon and Zou, 2023)9 can also be used.

  • Model size: Since every evaluation of the utility entails retraining the whole model on a subset of the data, large models require great amounts of computation. Moreover, they will effortlessly interpolate small- to medium-sized datasets, leading to great variance in the evaluation of performance on the dedicated validation set. One mitigation for this problem is cross-validation, but this would incur a massive computational cost. As of v0.8.1 there are no facilities in pyDVL for cross-validating the utility (note that this would require cross-validating the whole value computation).
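As referenced in the first point above, here is a minimal sketch of computing Data Banzhaf values. The import path and keyword arguments (a stopping criterion passed as done, plus n_jobs) are assumptions to be checked against the semi-values API of your pyDVL version:

from pydvl.value import compute_banzhaf_semivalues
from pydvl.value.stopping import MaxUpdates

# Data Banzhaf semi-values with a simple budget-based stopping criterion.
values = compute_banzhaf_semivalues(utility, done=MaxUpdates(1000), n_jobs=-1)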

Notation and nomenclature

Todo

Organize this section better and use its content consistently throughout the documentation.

The following notation is used throughout the documentation:

Let \(D = \{x_1, \ldots, x_n\}\) be a training set of \(n\) samples.

The utility function \(u:\mathcal{D} \rightarrow \mathbb{R}\) maps subsets of \(D\) to real numbers. In pyDVL, we typically call this mapping a score for consistency with sklearn, and reserve the term utility for the triple of dataset \(D\), model \(f\) and score \(u\), since they are used together to compute the value.

The value \(v\) of the \(i\)-th sample in dataset \(D\) wrt. utility \(u\) is denoted as \(v_u(x_i)\) or simply \(v(i)\).

For any \(S \subseteq D\), we denote by \(S_{-i}\) the set \(S\) with \(x_i\) removed, and by \(S_{+i}\) the set \(S\) with \(x_i\) added.

The marginal utility of adding sample \(x_i\) to a subset \(S\) is denoted as \(\delta(i) := u(S_{+i}) - u(S)\).

The set \(D_{-i}^{(k)}\) contains all subsets of \(D\) of size \(k\) that do not include sample \(x_i\).
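As an example of how this notation composes, the combinatorial form of the Shapley value discussed above reads:

\[v_u(i) = \frac{1}{n} \sum_{k=0}^{n-1} \binom{n-1}{k}^{-1} \sum_{S \in D_{-i}^{(k)}} \left[u(S_{+i}) - u(S)\right].\]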


  1. Ghorbani, A., Zou, J., 2019. Data Shapley: Equitable Valuation of Data for Machine Learning, in: Proceedings of the 36th International Conference on Machine Learning, PMLR. Presented at the International Conference on Machine Learning (ICML 2019), PMLR, pp. 2242–2251. 

  2. Kwon, Y., Rivas, M.A., Zou, J., 2021. Efficient Computation and Analysis of Distributional Shapley Values, in: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics. Presented at the International Conference on Artificial Intelligence and Statistics, PMLR, pp. 793–801. 

  3. Schoch, S., Xu, H., Ji, Y., 2022. CS-Shapley: Class-wise Shapley Values for Data Valuation in Classification, in: Proc. Of the Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS). Presented at the Advances in Neural Information Processing Systems (NeurIPS 2022). 

  4. Kwon, Y., Zou, J., 2022. Beta Shapley: A Unified and Noise-reduced Data Valuation Framework for Machine Learning, in: Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) 2022. Presented at AISTATS 2022, PMLR.

  5. Yan, T., Procaccia, A.D., 2021. If You Like Shapley Then You’ll Love the Core, in: Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021. Presented at the AAAI Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence, pp. 5751–5759. https://doi.org/10.1609/aaai.v35i6.16721 

  6. Wang, T., Yang, Y., Jia, R., 2022. Improving Cooperative Game Theory-based Data Valuation via Data Utility Learning. Presented at the International Conference on Learning Representations (ICLR 2022). Workshop on Socially Responsible Machine Learning, arXiv. https://doi.org/10.48550/arXiv.2107.06336 

  7. Wu, Z., Shu, Y., Low, B.K.H., 2022. DAVINZ: Data Valuation using Deep Neural Networks at Initialization, in: Proceedings of the 39th International Conference on Machine Learning. Presented at the International Conference on Machine Learning, PMLR, pp. 24150–24176. 

  8. Jia, R., Dao, D., Wang, B., Hubis, F.A., Gurel, N.M., Li, B., Zhang, C., Spanos, C., Song, D., 2019. Efficient task-specific data valuation for nearest neighbor algorithms. Proc. VLDB Endow. 12, 1610–1623. https://doi.org/10.14778/3342263.3342637 

  9. Kwon, Y., Zou, J., 2023. Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value, in: Proceedings of the 40th International Conference on Machine Learning. Presented at the International Conference on Machine Learning, PMLR, pp. 18135–18152.