Glossary¶

This glossary is meant to provide only brief explanations of each term, helping to clarify the concepts and techniques used in the library. For more detailed information, please refer to the relevant literature or resources.

Warning

This glossary is still a work in progress. Pull requests are welcome!

Data valuation terms¶

Beta-Shapley¶

Beta-Shapley is a semi-value method that defines weights using a beta distribution, thus effectively implementing an importance sampling scheme that puts weight on marginal utilities for sets of certain sizes, as a function of the parameters of the beta distribution. Introduced in (Kwon and Zou, 2022)¹.

Class-wise Shapley¶

Class-wise Shapley is a Shapley valuation method which introduces a utility function that balances in-class, and out-of-class accuracy, with the goal of favoring points that improve the model's performance on the class they belong to. It is estimated to be particularly useful in imbalanced datasets, but more research is needed to confirm this. Introduced by (Schoch et al., 2022)².

Data-Banzhaf¶

Data-Banzhaf is a semi-value method that uses the Banzhaf value to determine the contribution of each data point to the model's performance. Its constant weights are a somwewhat effective importance sampling scheme that reduces the variance of the marginal utility estimates, especially for small subsets. It is most efficient when used in conjunction with the MSR sampler. Introduced by (Wang and Jia, 2023)³.

Data-OOB¶

Data-OOB is a method for valuing data points for a bagged model using its out-of-bag performance estimate. It overcomes the computational bottleneck of Shapley-based methods by evaluating each weak learner in an ensemble over samples it hasn't seen during training, and averaging the performance across all weak learners. Introduced in (Kwon and Zou, 2023)⁴.

Data Utility Learning¶

Data Utility Learning is a method that uses an ML model to learn the utility function. Essentially, it learns to predict the performance of a model when trained on a given set of indices from the dataset. The cost of training this model is quickly amortized by avoiding costly re-evaluations of the original utility. Introduced by (Wang et al., 2022)⁵.

Delta-Shapley¶

Delta-Shapley is an approximation to Shapley value which uses a stratified sampling distribution that picks set sizes based on stability bounds for the machine learning model for which values are estimated. An additional clipping constraint saves computation by skipping subset sizes (justified because of diminishing returns for model performance). This introduces a difference to Shapley value by a multiplicative factor that should not affect ranking. Introduced in (Watson et al., 2023)⁶.

Game-theoretic Methods¶

Game-theoretic methods for data valuation are a class of techniques that leverage concepts from cooperative game theory to assign values to data points based on their contributions to the overall performance of an ML model. Salient examples are Shapley value, Data Banzhaf, and Least Core.

Group Testing¶

Group Testing is a strategy for identifying characteristics within groups of items efficiently, by testing groups rather than individuals to quickly narrow down the search for items with specific properties. The idea was introduced into data valuation by (Jia et al., 2019)⁷, transforming the problem of computing Shapley values into one of constraint satisfiability, with constraints given by samples of the utility, with a carefully designed sampling distribution.

KNN-Shapley¶

KNN-Shapley is a Shapley value method tailored to KNN models. By exploiting the locality of KNN, it can reduce computation drastically wrt. the standard Shapley value with Monte Carlo approximation. Introduced in (Jia et al., 2019)⁷.

Importance Sampling¶

Importance Sampling is a technique used to estimate properties of a particular distribution while only having samples generated from a different distribution. This is achieved by re-weighting the samples according to their "sampled importance", i.e. effectively dividing by the probability of sampling them.

Much of the work in model-based data valuation consists of finding a good importance sampling distribution, and the weights to use for the sampled subsets. The most common choices are uniform sampling, Beta distributions, and Banzhaf indices.

Least Core¶

The Least Core is a solution concept in cooperative game theory, referring to the smallest set of payoffs to players that cannot be improved upon by any coalition, ensuring stability in the allocation of value. In data valuation, it implies solving a linear and a quadratic system whose constraints are determined by the evaluations of the utility function on every subset of the training data. Introduced as data valuation method by (Yan and Procaccia, 2021)⁸.

Leave-One-Out¶

LOO in the context of data valuation refers to the process of evaluating the impact of removing individual data points on the model's performance. The value of a training point is defined as the marginal change in the model's performance when that point is removed from the training set.

Marginal utility¶

In data valuation for ML, marginal utility refers to the change in performance of an ML model when a single data point is added to or removed from the training set. In our documentation it is often denoted \(\Delta_i(S) := U(S_{+i}) - U(S),\) where \(S\) is a subset of the training set, \(i\) is the index of the data point to be added, and \(U\) is the utility function.

Introduction to data valuation

Marginal-based methods¶

Marginal-based methods are a class of data valuation techniques that define value in terms of weighted averages of the marginal utility.

Introduction to data valuation

Maximum Sample Reuse¶

MSR is a sampling method for data valuation that updates the value of every data point in one sample. This method can achieve much faster convergence for Data Banzhaf since the sampling distribution "coincides" with the Banzhaf coefficients. In principle, it can be used with any semi-value by setting the sampler to MSRSampler, but it's most effective when used with the Banzhaf semi-value via MSRBanzhafValuation. Introduced by (Wang and Jia, 2023)³

Monte Carlo Least Core¶

MCLC is a variation of the Least Core that uses a reduced amount of constraints, sampled randomly from the powerset of the training data. Introduced by (Yan and Procaccia, 2021)⁸.

Monte Carlo Shapley¶

MCS estimates the Shapley Value using a Monte Carlo approximation to the sum over subsets of the training set. This reduces computation to polynomial time at the cost of accuracy, but this loss is typically irrelevant for downstream applications in ML. Introduced into data valuation by (Ghorbani and Zou, 2019)⁹.

Point-removal experiment¶

A point-removal experiment is a benchmarking task in data valuation where the quality of a valuation method is measured through the impact of incrementally removing data points on the model's performance. The points are removed in order of their value, and the performance is evaluated on a fixed validation set.

Benchmarking valuation methods.

Rank correlation¶

Rank correlation is a simple way of measuring the stability of estimates for the values of a training set. It is computed as the Spearman correlation between the values of two different runs of the same method, after changing some hyperparameter like random seed, number of updates during a single run, etc.

Benchmarking valuation methods.

Shapley Value¶

Shapley Value is a concept from cooperative game theory that allocates payouts to players based on their contribution to the total payoff. In data valuation, players are data points. The method assigns a value to each data point based on a weighted average of its marginal contributions to the model's performance when trained on each subset of the training set. This requires \(\mathcal{O}(2^{n-1})\) re-trainings of the model, which is infeasible for even trivial data set sizes, so one resorts to approximations like TMCS. Introduced into data valuation by (Ghorbani and Zou, 2019)⁹.

Stratified sampling¶

In pyDVL, a stratified sampler is one that first chooses a subset size \(k\) following some distribution \(\mathcal{L}_k\) over \(\{0,...,n-1\},\) then samples a subset of that size uniformly at random from the powerset of \({N_{-i}}:\)

Sample \(k \sim \mathcal{L}_k,\)
Sample \(S \sim \mathcal{U}(2^{N_{-i}}).\)

If we denote by \(\mathcal{L}\) the law for this two stage procedure, then one has that the Shapley value is the expectation over this distribution:

\[v_\text{sh}(i) = \mathbb{E}_{S \sim \mathcal{L}}[\Delta_i(S)].\]

One can try to reduce variance or obtain different semi-values by choosing \(\mathcal{L}_k\) differently, or combining it with any semivalue. See the links below.

Truncated Monte Carlo Shapley¶

TMCS is an efficient approach to estimating the Shapley Value using a truncated version of the Monte Carlo method with permutation sampling. Introduced by (Ghorbani and Zou, 2019)⁹.

Utility function¶

The utility function in ML data valuation is a measure of the performance of a model trained on a subset of the training data. This can be a metric like accuracy, F1 score, or any other performance measure. It is typically measured wrt. to a fixed valuation set (sometimes we use the terms test set or validation set interchangeably when no confusion is possible, but it should be a different, held-out set.

In our documentation the utility is denoted \(U\), and is assumed to be a function defined over sets (hence invariant wrt. permutation of data indices): \(U:2^{N} \to \mathbb{R}\), where \(N\) is the index set of the training data, which we identify with the data themselves.

Introduction to data valuation

Valuation set¶

The valuation set is a held-out subset of data used to evaluate the utility of a model trained on the training set. Sometimes, when there is no confusion, we use the terms test set or validation set interchangeably, but it should be a different, held-out set.

Note that computing a score (loss) over a fixed set is typically a poor approximation to the true score of the model, i.e. to its expected score on unseen data. This problem might be alleviated with some form of cross-validation, but we haven't explored this possibility in pyDVL.

Variance-Reduced Data-Shapley¶

A stratified sampling-based approach to estimate Shapley values that uses a simple deterministic heuristic for sample sizes, which in particular does not depend on run-time variance estimates. A good default choice is based on the harmonic function. Introduced in (Wu et al., 2023)¹⁰.

Implementation

Weighted Accuracy Drop¶

WAD is a metric to evaluate the impact of sequentially removing data points on the performance of a machine learning model, weighted by their rank, i.e. by the time at which they were removed. Introduced by (Schoch et al., 2022)².

Influence function terms¶

Arnoldi Method¶

The Arnoldi method approximately computes eigenvalue, eigenvector pairs of a symmetric matrix. For influence functions, it is used to approximate the iHVP. Introduced by (Schioppa et al., 2022)¹¹ in the context of influence functions.

Block Conjugate Gradient¶

A blocked version of CG, which solves several linear systems simultaneously. For Influence Functions, it is used to approximate the iHVP.

Conjugate Gradient¶

CG is an algorithm for solving linear systems with a symmetric and positive-definite coefficient matrix. For Influence Functions, it is used to approximate the iHVP.

Eigenvalue-corrected Kronecker-Factored Approximate Curvature¶

EKFAC builds on K-FAC by correcting for the approximation errors in the eigenvalues of the blocks of the Kronecker-factored approximate curvature matrix. This correction aims to refine the accuracy of natural gradient approximations, thus potentially offering better training efficiency and stability in neural networks.

Influence Function¶

The Influence Function measures the impact of a single data point on a statistical estimator. In machine learning, it's used to understand how much a particular data point affects the model's prediction. Introduced into data valuation by (Koh and Liang, 2017)¹².

Documentation

Inverse Hessian-vector product¶

iHVP is the operation of calculating the product of the inverse Hessian matrix of a function and a vector, without explicitly constructing nor inverting the full Hessian matrix first. This is essential for influence function computation.

Kronecker-Factored Approximate Curvature¶

K-FAC is an optimization technique that approximates the Fisher Information matrix's inverse efficiently. It uses the Kronecker product to factor the matrix, significantly speeding up the computation of natural gradient updates and potentially improving training efficiency.

Linear-time Stochastic Second-order Algorithm¶

LiSSA is an efficient algorithm for approximating the inverse Hessian-vector product, enabling faster computations in large-scale machine learning problems, particularly for second-order optimization. For Influence Functions, it is used to approximate the iHVP. Introduced by (Agarwal et al., 2017)¹³.

Nyström Low-Rank Approximation¶

The Nyström approximation computes a low-rank approximation to a symmetric positive-definite matrix via random projections. For influence functions, it is used to approximate the iHVP. Introduced as sketch and solve algorithm in (Hataya and Yamada, 2023)¹⁴, and as preconditioner for PCG in (Frangella et al., 2023)¹⁵.

Preconditioned Block Conjugate Gradient¶

A blocked version of PCG, which solves several linear systems simultaneously. For Influence Functions, it is used to approximate the iHVP.

Preconditioned Conjugate Gradient¶

A preconditioned version of CG for improved convergence, depending on the characteristics of the matrix and the preconditioner. For Influence Functions, it is used to approximate the iHVP.

Other terms¶

Coefficient of Variation¶

CV is a statistical measure of the dispersion of data points in a data series around the mean, expressed as a percentage. It's used to compare the degree of variation from one data series to another, even if the means are drastically different.

Constraint Satisfaction Problem¶

A CSP involves finding values for variables within specified constraints or conditions, commonly used in scheduling, planning, and design problems where solutions must satisfy a set of restrictions.

Out-of-Bag¶

OOB refers to data samples in an ensemble learning context (like random forests) that are not selected for training a specific model within the ensemble. These OOB samples are used as a validation set to estimate the model's accuracy, providing a convenient internal cross-validation mechanism.

Machine Learning Reproducibility Challenge¶

The MLRC is an initiative that encourages the verification and replication of machine learning research findings, promoting transparency and reliability in the field. Papers are published in Transactions on Machine Learning Research (TMLR).

Point removal task¶

A task in data valuation where the quality of a valuation method is measured through the impact of incrementally removing data points on the model's performance, where the points are removed in order of their value. See

Kwon, Y., Zou, J., 2022. Beta Shapley: A Unified and [Noise-reduced Data Valuation Framework]{.nocase} for Machine Learning, in: Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) 2022,. Presented at the AISTATS 2022, PMLR, Valencia, Spain. ↩
Schoch, S., Xu, H., Ji, Y., 2022. CS-Shapley: [Class-wise Shapley Values]{.nocase} for Data Valuation in Classification, in: Proc. Of the Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS). Presented at the Advances in Neural Information Processing Systems (NeurIPS 2022), New Orleans, Louisiana, USA. ↩↩
Wang, J.T., Jia, R., 2023. Data Banzhaf: A Robust Data Valuation Framework for Machine Learning, in: Proceedings of The 26th International Conference on Artificial Intelligence and Statistics. Presented at the International Conference on Artificial Intelligence and Statistics, PMLR, pp. 6388--6421. ↩↩
Kwon, Y., Zou, J., 2023. Data-OOB: [Out-of-bag Estimate]{.nocase} as a Simple and Efficient Data Value, in: Proceedings of the 40th International Conference on Machine Learning. Presented at the International Conference on Machine Learning, PMLR, pp. 18135--18152. ↩
Wang, T., Yang, Y., Jia, R., 2022. Improving [Cooperative Game Theory-based Data Valuation]{.nocase} via Data Utility Learning. Presented at the International Conference on Learning Representations (ICLR 2022). Workshop on Socially Responsible Machine Learning, arXiv. https://doi.org/10.48550/arXiv.2107.06336 ↩
Watson, L., Kujawa, Z., Andreeva, R., Yang, H.-T., Elahi, T., Sarkar, R., 2023. Accelerated Shapley Value Approximation for Data Evaluation [WWW Document]. https://doi.org/10.48550/arXiv.2311.05346 ↩
Jia, R., Dao, D., Wang, B., Hubis, F.A., Gurel, N.M., Li, B., Zhang, C., Spanos, C., Song, D., 2019. Efficient task-specific data valuation for nearest neighbor algorithms. Proc. VLDB Endow. 12, 1610--1623. https://doi.org/10.14778/3342263.3342637 ↩↩
Yan, T., Procaccia, A.D., 2021. If You Like Shapley Then You'll Love the Core, in: Proceedings of the 35th AAAI Conference on Artificial Intelligence. Presented at the AAAI Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence, Virtual conference, pp. 5751--5759. https://doi.org/10.1609/aaai.v35i6.16721 ↩↩
Ghorbani, A., Zou, J., 2019. Data Shapley: Equitable Valuation of Data for Machine Learning, in: Proceedings of the 36th International Conference on Machine Learning, PMLR. Presented at the International Conference on Machine Learning (ICML 2019), PMLR, pp. 2242--2251. ↩↩↩
Wu, M., Jia, R., Lin, C., Huang, W., Chang, X., 2023. Variance reduced Shapley value estimation for trustworthy data valuation. Computers\ & Operations Research 159, 106305. https://doi.org/10.1016/j.cor.2023.106305 ↩
Schioppa, A., Zablotskaia, P., Vilar, D., Sokolov, A., 2022. Scaling Up Influence Functions. Proc. AAAI Conf. Artif. Intell. 36, 8179--8186. https://doi.org/10.1609/aaai.v36i8.20791 ↩
Koh, P.W., Liang, P., 2017. Understanding [Black-box Predictions]{.nocase} via Influence Functions, in: Proceedings of the 34th International Conference on Machine Learning. Presented at the International Conference on Machine Learning, PMLR, pp. 1885--1894. ↩
Agarwal, N., Bullins, B., Hazan, E., 2017. Second-Order Stochastic Optimization for Machine Learning in Linear Time. JMLR 18, 1--40. ↩
Hataya, R., Yamada, M., 2023. Nyström Method for Accurate and Scalable Implicit Differentiation, in: Proceedings of The 26th International Conference on Artificial Intelligence and Statistics. Presented at the International Conference on Artificial Intelligence and Statistics, PMLR, pp. 4643--4654. ↩
Frangella, Z., Tropp, J.A., Udell, M., 2023. Randomized Nyström Preconditioning. SIAM J. Matrix Anal. Appl. 44, 718--752. https://doi.org/10.1137/21M1466244 ↩