Data valuation

Info

If you want to jump right into it, skip ahead to Computing data values. If you want a quick list of applications, see Applications of data valuation. For a list of all algorithms implemented in pyDVL, see Methods.

Data valuation is the task of assigning a number to each element of a training set which reflects its contribution to the final performance of some model trained on it. Some methods attempt to be model-agnostic, but in most cases the model is an integral part of the method. In these cases, this number is not an intrinsic property of the element of interest, but typically a function of three factors:

  1. The dataset \(D\), or more generally, the distribution it was sampled from: in some cases one only cares about values wrt. a given data set; in others, value would ideally be the (expected) contribution of a data point to any random set \(D\) sampled from the same distribution. pyDVL implements methods of the first kind.

  2. The algorithm \(\mathcal{A}\) mapping the data \(D\) to some estimator \(f\) in a model class \(\mathcal{F}\). E.g. MSE minimization to find the parameters of a linear model.

  3. The performance metric of interest \(u\) for the problem. When value depends on a model, it must be measured in some way that uses the model, e.g. via the \(R^2\) score or the negative MSE. This metric is computed over a held-out valuation set.

pyDVL collects algorithms for the computation of data values in this sense, mostly those derived from cooperative game theory. The methods can be found in the package pydvl.value, with support from the modules pydvl.utils.dataset and pydvl.utils.utility, as detailed below.

Warning

Be sure to read the section on the difficulties using data values.

There are three main families of methods for data valuation: game-theoretic, influence-based and intrinsic. As of v0.8.1, pyDVL supports the first two. Here we focus on game-theoretic concepts and refer to the main documentation on the influence function for the second.

Game theoretical methods

The main contenders in game-theoretic approaches are Shapley values (Ghorbani and Zou, 2019)1, (Kwon et al., 2021)2, (Schoch et al., 2022)3, their generalization to so-called semi-values by (Kwon and Zou, 2022)4 and (Wang and Jia, 2022), and the Core (Yan and Procaccia, 2021)5. All of these are implemented in pyDVL. For a full list, see Methods.

In these methods, data points are considered players in a cooperative game whose outcome is the performance of the model when trained on subsets (coalitions) of the data, measured on a held-out valuation set. This outcome, or utility, must typically be computed for every subset of the training set, so that an exact computation is \(\mathcal{O} (2^n)\) in the number of samples \(n\), with each iteration requiring a full re-fitting of the model using a coalition as training set. Consequently, most methods involve Monte Carlo approximations, and sometimes approximate utilities which are faster to compute, e.g. proxy models (Wang et al., 2022)6 or constant-cost approximations like Neural Tangent Kernels (Wu et al., 2022)7.
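To make the combinatorial cost concrete, here is a small, self-contained sketch (plain Python, not using pyDVL) that computes exact Shapley values for a toy three-player game by enumerating every coalition. The toy utility function is purely illustrative:

from itertools import combinations
from math import comb

# Toy additive utility: players 0 and 1 contribute 1.0 each, player 2 contributes 2.0.
def utility(coalition) -> float:
    return sum(2.0 if i == 2 else 1.0 for i in coalition)

n = 3
shapley = {}
for i in range(n):
    others = [j for j in range(n) if j != i]
    value = 0.0
    # Average the marginal contribution of i over all 2^(n-1) coalitions not
    # containing i, with the Shapley weights 1 / (n * C(n-1, k)) for size k.
    for k in range(n):
        for S in combinations(others, k):
            value += (utility(S + (i,)) - utility(S)) / (n * comb(n - 1, k))
    shapley[i] = value

print(shapley)  # approximately {0: 1.0, 1: 1.0, 2: 2.0}

Already with three players this enumerates \(2^{n-1} = 4\) coalitions per player, and in a real setting every utility call means re-fitting a model, which is why Monte Carlo sampling is the norm.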

The reasoning behind using game theory is that, in order to be useful, an assignment of value, dubbed valuation function, is usually required to fulfil certain requirements of consistency and "fairness". For instance, in some applications value should not depend on the order in which data are considered, or it should be equal for samples that contribute equally to any subset of the data (of equal size). When considering aggregated value for (sub-)sets of data there are additional desiderata, like having a value function that does not increase with repeated samples. Game-theoretic methods are all rooted in axioms that by construction ensure different desiderata, but despite their practical usefulness, none of them is either necessary or sufficient for all applications. For instance, Shapley value (SV) methods try to distribute all value equitably among all samples, and thus fail to identify duplicated samples as redundant, e.g. by assigning them zero value.

Computing data values

Using pyDVL to compute data values is a simple process that can be broken down into three steps:

  1. Creating a Dataset object from your data.
  2. Creating a Utility which ties your model to the dataset and a scoring function.
  3. Computing values with a method of your choice, e.g. via compute_shapley_values.
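Putting the three steps together, a minimal end-to-end sketch might look as follows. The exact parameters of compute_shapley_values used here (a mode string and a stopping criterion such as MaxUpdates), as well as the import paths, are assumptions that may differ between pyDVL versions; each step is detailed in the subsections below.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Utility
from pydvl.value import compute_shapley_values
from pydvl.value.stopping import MaxUpdates

# 1. Wrap the data (from_sklearn performs the train/test split).
dataset = Dataset.from_sklearn(load_breast_cancer())
# 2. Tie model, data and (default) scorer together.
utility = Utility(LogisticRegression(max_iter=500), dataset)
# 3. Run a Monte Carlo Shapley method until the stopping criterion is met
#    (mode and done are illustrative; consult the Shapley API for options).
values = compute_shapley_values(
    utility, mode="truncated_montecarlo", done=MaxUpdates(100)
)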

Creating a Dataset

The first item in the tuple \((D, \mathcal{A}, u)\) characterising data value is the dataset. The class Dataset is a simple convenience wrapper for the train and test splits that is used throughout pyDVL. The test set will be used to evaluate a scoring function for the model.

It can be used as follows:

import numpy as np
from pydvl.utils import Dataset
from sklearn.model_selection import train_test_split
X, y = np.arange(100).reshape((50, 2)), np.arange(50)
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.5, random_state=16
)
dataset = Dataset(X_train, X_test, y_train, y_test)

It is also possible to construct Datasets from sklearn toy datasets for illustrative purposes using from_sklearn.
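For example, a minimal sketch (from_sklearn performs the train/test split internally):

from sklearn.datasets import load_diabetes
from pydvl.utils import Dataset

# Wrap an sklearn data bunch directly as a Dataset.
dataset = Dataset.from_sklearn(load_diabetes())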

Grouping data

Whether because data valuation methods are computationally very expensive, or because we are interested in the groups themselves, it is often useful or necessary to group samples and valuate them together. GroupedDataset provides an alternative to Dataset with the same interface which allows this.

You can see an example in action in the Spotify notebook, but here's a simple example grouping a pre-existing Dataset. First we construct an array mapping each index in the dataset to a group, then use from_dataset:

import numpy as np
from pydvl.utils import GroupedDataset

# Randomly assign each sample to one of num_groups groups
# (num_groups is set here only to make the snippet runnable):
num_groups = 5
data_groups = np.random.randint(0, num_groups, len(dataset))
grouped_dataset = GroupedDataset.from_dataset(dataset, data_groups)

Creating a Utility

In pyDVL we have slightly overloaded the name "utility" and use it to refer to an object that keeps track of all three items in \((D, \mathcal{A}, u)\). This will be an instance of Utility which, as mentioned, is a convenient wrapper for the dataset, model and scoring function used for valuation methods.

Here's a minimal example:

from sklearn.datasets import load_iris
from sklearn.svm import SVC

from pydvl.utils import Dataset, Utility

dataset = Dataset.from_sklearn(load_iris())
model = SVC()
utility = Utility(model, dataset)

The object utility is a callable that data valuation methods will execute with different subsets of training data. Each call will retrain the model on a subset and evaluate it on the test data using a scoring function. By default, Utility will use model.score(), but it is possible to use any scoring function (greater values must be better). In particular, the constructor accepts the same types of argument as the scoring parameter of sklearn.model_selection.cross_validate: a string, a scorer callable or None for the default.

utility = Utility(model, dataset, "explained_variance")
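Since the utility is a callable on subsets of training indices, it can also be evaluated by hand, which is what the valuation methods do internally. A minimal sketch, using index-based calls as in the Data Utility Learning example further below:

# Retrain the model on the coalition of training indices {0, 1, 2} and
# return its score on the test data held by the dataset.
score = utility((0, 1, 2))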

Utility will wrap the fit() method of the model to cache its results. This greatly reduces computation times of Monte Carlo methods. Because of how caching is implemented, it is important not to reuse Utility objects for different datasets. You can read more about setting up the cache in the installation guide, and in the documentation of the caching module.

Using custom scorers

The scoring argument of Utility can be used to specify a custom Scorer object. This is a simple wrapper for a callable that takes a model and test data, and returns a score.

More importantly, the object provides information about the range of the score, which is used by some methods to estimate the number of samples necessary, and about what default value to use when the model fails to train.

Note

The most important property of a Scorer is its default value. Because many models will fail to fit on small subsets of the data, it is important to provide a sensible default value for the score.

It is possible to skip the construction of the Scorer when constructing the Utility object. The two following calls are equivalent:

import numpy as np

from pydvl.utils import Utility, Scorer

utility = Utility(
   model, dataset, "explained_variance", score_range=(-np.inf, 1), default_score=0.0
)
utility = Utility(
   model, dataset, Scorer("explained_variance", range=(-np.inf, 1), default=0.0)
)
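Besides a string, Scorer can wrap a custom callable taking a model and test data, as described above. The following is a minimal sketch assuming the sklearn scorer convention (model, X, y); the function hard_accuracy is purely illustrative:

import numpy as np
from pydvl.utils import Scorer, Utility

def hard_accuracy(model, X, y) -> float:
    # Fraction of exactly correct predictions, bounded in [0, 1].
    return float(np.mean(model.predict(X) == y))

scorer = Scorer(hard_accuracy, range=(0, 1), default=0.0)
utility = Utility(model, dataset, scorer)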

Learning the utility

Because each evaluation of the utility entails a full retrain of the model with a new subset of the training set, it is natural to try to learn this mapping from subsets to scores. This is the idea behind Data Utility Learning (DUL) (Wang et al., 2022)6 and in pyDVL it's as simple as wrapping the Utility inside DataUtilityLearning:

from pydvl.utils import Utility, DataUtilityLearning, Dataset
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.datasets import load_iris

dataset = Dataset.from_sklearn(load_iris())
u = Utility(LogisticRegression(), dataset)
training_budget = 3
wrapped_u = DataUtilityLearning(u, training_budget, LinearRegression())

# First 3 calls will be computed normally
for i in range(training_budget):
   _ = wrapped_u((i,))
# Subsequent calls will be estimated with the surrogate model fitted by DUL
wrapped_u((1, 2, 3))

As you can see, all that is required is a model to learn the utility itself; the fitting and use of the learned model happen behind the scenes.

There is a longer example with an investigation of the results achieved by DUL in a dedicated notebook.

Leave-One-Out values

LOO is the simplest approach to valuation. It assigns to each sample its marginal utility as value:

\[v_u(i) = u(D) - u(D_{-i}).\]

For notational simplicity, we consider the valuation function as defined over the indices of the dataset \(D\), and \(i \in D\) is the index of the sample, \(D_{-i}\) is the training set without the sample \(x_i\), and \(u\) is the utility function. See the section on notation for more.

For the purposes of data valuation, this is rarely useful beyond serving as a baseline for benchmarking, although in some benchmarks it can occasionally perform surprisingly well. One particular weakness is that it does not necessarily correlate with an intrinsic value of a sample: since it is a marginal utility, it is affected by diminishing returns. Often, the training set is large enough for a single sample not to have any significant effect on training performance, despite any qualities it may possess. Whether this is indicative of low value or not depends on one's goals and definitions, but other methods are typically preferable.

from pydvl.value.loo import compute_loo

values = compute_loo(utility, n_jobs=-1)

The return value of all valuation functions is an object of type ValuationResult. This can be iterated over, indexed with integers, slices and Iterables, as well as converted to a pandas.DataFrame.
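For instance, to inspect the results as a DataFrame (a minimal sketch; the conversion method is assumed here to be to_dataframe, check the ValuationResult API of your version):

# One row per data point (or group), with the computed value for each.
df = values.to_dataframe()
print(df.head())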

Problems of data values

There are a number of factors that affect how useful values can be for your project. In particular, regression can be especially tricky, but the particular nature of every (non-trivial) ML problem can have an effect:

  • Variance of the utility: Classical applications of game-theoretic value concepts operate with deterministic utilities, as do many of the bounds in the literature. But in ML we use an evaluation of the model on a validation set as a proxy for the true risk. Even if the utility is bounded, its variance will affect the final values, and even more so any Monte Carlo estimates. Several works have tried to cope with this variance. (Wang and Jia, 2022) show that, by relaxing one of the Shapley axioms and considering the general class of semi-values, of which Shapley is an instance, a choice of constant weights is the best one can do in a utility-agnostic setting. This method, dubbed Data Banzhaf, is available in pyDVL as compute_banzhaf_semivalues (see the sketch after this list).

    Averaging repeated utility evaluations

    One workaround in pyDVL is to configure the caching system to allow multiple evaluations of the utility for every index set. A moving average is computed and returned once the standard error is small, see CachedFuncConfig. Note however that in practice, the likelihood of cache hits is low, so one would have to force recomputation manually somehow.

  • Unbounded utility: Choosing a scorer for a classifier is simple: accuracy or some F-score provides a bounded number with a clear interpretation. However, in regression problems most scores, like \(R^2\), are not bounded because regressors can be arbitrarily bad. This leads to great variability in the utility for low sample sizes, and hence to unreliable Monte Carlo approximations to the values. Nevertheless, in practice it is only the ranking of samples that matters, and this tends to be accurate (wrt. the true ranking) despite inaccurate values.

    Squashing scores

    pyDVL offers a dedicated function composition for scorer functions which can be used to squash a score. The following is defined in module score:

    import numpy as np
    from pydvl.utils import compose_score
    
    def sigmoid(x: float) -> float:
      return float(1 / (1 + np.exp(-x)))
    
    squashed_r2 = compose_score("r2", sigmoid, "squashed r2")
    
    squashed_variance = compose_score(
      "explained_variance", sigmoid, "squashed explained variance"
    )
    
    These squashed scores can prove useful in regression problems, but they can also introduce issues in the low-value regime.

  • Data set size: Computing exact Shapley values is NP-hard, and Monte Carlo approximations can converge slowly. Massive datasets are thus impractical, at least with game-theoretical methods. A workaround is to group samples and investigate their value together. You can do this using GroupedDataset. There is a fully worked-out example here. Some algorithms also provide different sampling strategies to reduce the variance, but due to a no-free-lunch-type theorem, no single strategy can be optimal for all utilities. Finally, model specific methods like kNN-Shapley (Jia et al., 2019)8, or altogether different and typically faster approaches like Data-OOB (Kwon and Zou, 2023)9 can also be used.

  • Model size: Since every evaluation of the utility entails retraining the whole model on a subset of the data, large models require great amounts of computation. Moreover, they will effortlessly interpolate small- to medium-sized datasets, leading to great variance in the evaluation of performance on the dedicated validation set. One mitigation for this problem is cross-validation, but this would incur a massive computational cost. As of v0.8.1 there are no facilities in pyDVL for cross-validating the utility (note that this would require cross-validating the whole value computation).
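As referenced in the first point above, here is a minimal sketch of computing Data Banzhaf values. The import path and keyword arguments (a stopping criterion passed as done, plus n_jobs) are assumptions to be checked against the semi-values API of your pyDVL version:

from pydvl.value import compute_banzhaf_semivalues
from pydvl.value.stopping import MaxUpdates

# Data Banzhaf semi-values with a simple budget-based stopping criterion.
values = compute_banzhaf_semivalues(utility, done=MaxUpdates(1000), n_jobs=-1)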

Notation and nomenclature

Todo

Organize this section better and use its content consistently throughout the documentation.

The following notation is used throughout the documentation:

Let \(D = \{x_1, \ldots, x_n\}\) be a training set of \(n\) samples.

The utility function \(u:\mathcal{D} \rightarrow \mathbb{R}\) maps subsets of \(D\) to real numbers. In pyDVL, we typically call this mapping a score for consistency with sklearn, and reserve the term utility for the triple of dataset \(D\), model \(f\) and score \(u\), since they are used together to compute the value.

The value \(v\) of the \(i\)-th sample in dataset \(D\) wrt. utility \(u\) is denoted as \(v_u(x_i)\) or simply \(v(i)\).

For any \(S \subseteq D\), we denote by \(S_{-i}\) the set \(S\) with \(x_i\) removed, and by \(S_{+i}\) the set \(S\) with \(x_i\) added.

The marginal utility of adding sample \(x_i\) to a subset \(S\) is denoted as \(\delta(i) := u(S_{+i}) - u(S)\).

The set \(D_{-i}^{(k)}\) contains all subsets of \(D\) of size \(k\) that do not include sample \(x_i\).
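As an example of how this notation composes, the combinatorial form of the Shapley value discussed above reads:

\[v_u(i) = \frac{1}{n} \sum_{k=0}^{n-1} \binom{n-1}{k}^{-1} \sum_{S \in D_{-i}^{(k)}} \left[u(S_{+i}) - u(S)\right].\]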


  1. Ghorbani, A., Zou, J., 2019. Data Shapley: Equitable Valuation of Data for Machine Learning, in: Proceedings of the 36th International Conference on Machine Learning, PMLR. Presented at the International Conference on Machine Learning (ICML 2019), PMLR, pp. 2242–2251. 

  2. Kwon, Y., Rivas, M.A., Zou, J., 2021. Efficient Computation and Analysis of Distributional Shapley Values, in: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics. Presented at the International Conference on Artificial Intelligence and Statistics, PMLR, pp. 793–801. 

  3. Schoch, S., Xu, H., Ji, Y., 2022. CS-Shapley: Class-wise Shapley Values for Data Valuation in Classification, in: Proc. Of the Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS). Presented at the Advances in Neural Information Processing Systems (NeurIPS 2022). 

  4. Kwon, Y., Zou, J., 2022. Beta Shapley: A Unified and Noise-reduced Data Valuation Framework for Machine Learning, in: Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) 2022. Presented at AISTATS 2022, PMLR.

  5. Yan, T., Procaccia, A.D., 2021. If You Like Shapley Then You’ll Love the Core, in: Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021. Presented at the AAAI Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence, pp. 5751–5759. https://doi.org/10.1609/aaai.v35i6.16721 

  6. Wang, T., Yang, Y., Jia, R., 2022. Improving Cooperative Game Theory-based Data Valuation via Data Utility Learning. Presented at the International Conference on Learning Representations (ICLR 2022). Workshop on Socially Responsible Machine Learning, arXiv. https://doi.org/10.48550/arXiv.2107.06336 

  7. Wu, Z., Shu, Y., Low, B.K.H., 2022. DAVINZ: Data Valuation using Deep Neural Networks at Initialization, in: Proceedings of the 39th International Conference on Machine Learning. Presented at the International Conference on Machine Learning, PMLR, pp. 24150–24176. 

  8. Jia, R., Dao, D., Wang, B., Hubis, F.A., Gurel, N.M., Li, B., Zhang, C., Spanos, C., Song, D., 2019. Efficient task-specific data valuation for nearest neighbor algorithms. Proc. VLDB Endow. 12, 1610–1623. https://doi.org/10.14778/3342263.3342637 

  9. Kwon, Y., Zou, J., 2023. Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value, in: Proceedings of the 40th International Conference on Machine Learning. Presented at the International Conference on Machine Learning, PMLR, pp. 18135–18152.