Semivalues
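This module provides the core functionality for the computation of generic semi-values. A semi-value is any valuation function of the form:
\[ v_\text{semi}(i) = \sum_{k=0}^{n-1} w(k) \sum_{S \subseteq D_{-i},\, |S| = k} [U(S \cup \{i\}) - U(S)], \]
where \(U\) is the utility, \(n = |D|\), and the coefficients \(w(k)\) satisfy the property:
\[ \sum_{k=0}^{n-1} \binom{n-1}{k} w(k) = 1. \]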
Note
For implementation consistency, we slightly depart from the common definition of semi-values, which includes a factor \(1/n\) in the sum over subsets. Instead, we subsume this factor into the coefficient \(w(k)\).
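As a minimal sketch of this definition, with the factor \(1/n\) already folded into \(w(k)\), the following plain-Python snippet enumerates every subset of \(D_{-i}\) for a toy utility. The names `semivalue`, `shapley_w`, `banzhaf_w` and `toy_utility` are illustrative assumptions and not part of this module's API; \(k\) is taken to be the subset size, ranging from 0 to \(n-1\).

```python
from itertools import combinations
from math import comb


def shapley_w(n: int, k: int) -> float:
    # Shapley coefficient with the 1/n factor folded in.
    return 1.0 / (n * comb(n - 1, k))


def banzhaf_w(n: int, k: int) -> float:
    # Banzhaf coefficient: the same weight for every subset of D without i.
    return 1.0 / 2 ** (n - 1)


def toy_utility(subset) -> float:
    # Toy additive utility (an assumption for illustration only).
    return float(sum(subset))


def semivalue(i, data, utility, w) -> float:
    """Exact semi-value of index i by enumerating all subsets of D without i."""
    n = len(data)
    rest = [j for j in data if j != i]
    total = 0.0
    for k in range(n):  # subset sizes 0..n-1
        for subset in combinations(rest, k):
            s = frozenset(subset)
            total += w(n, k) * (utility(s | {i}) - utility(s))
    return total


data = [1, 2, 3, 4]
shapley = {i: semivalue(i, data, toy_utility, shapley_w) for i in data}
banzhaf = {i: semivalue(i, data, toy_utility, banzhaf_w) for i in data}
```

Because the toy utility is additive, every marginal contribution of index `i` equals `i`, so both coefficients return `i` as its value. Exhaustive enumeration is only feasible for very small sets, which is why the module relies on samplers.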
Main components
The computation of a semi-value requires two components:
- A subset sampler that generates subsets of the set \(D\) of interest.
- A coefficient \(w(k)\) that assigns a weight to each subset size \(k\).
Samplers can be found in sampler, and fall into two categories: powerset samplers and permutation samplers. Powerset samplers generate subsets of \(D_{-i}\), while permutation samplers generate permutations of \(D\). The former conform to the above definition of semi-values, while the latter reformulate it as:
\[ v(i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)} \tilde{w}(|\sigma_{:i}|) [U(\sigma_{:i} \cup \{i\}) - U(\sigma_{:i})], \]
where \(\sigma_{:i}\) denotes the set of indices in permutation \(\sigma\) before the position where \(i\) appears (see Data valuation for details), and
\[ \tilde{w}(k) = n \binom{n-1}{k} w(k) \]
is the weight correction due to the reformulation.
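The equivalence of the two formulations can be checked numerically on a small set. The sketch below assumes the corrected weight is \(\tilde{w}(k) = n \binom{n-1}{k} w(k)\), with \(k\) the number of indices preceding \(i\) in the permutation. All function names and the utility are hypothetical.

```python
from itertools import combinations, permutations
from math import comb, factorial


def w_shapley(n: int, k: int) -> float:
    return 1.0 / (n * comb(n - 1, k))


def w_banzhaf(n: int, k: int) -> float:
    return 1.0 / 2 ** (n - 1)


def utility(subset) -> float:
    # Toy non-additive utility (assumption): squared size plus sum of indices.
    return float(len(subset) ** 2 + sum(subset))


def powerset_value(i, data, w) -> float:
    """Semi-value as a weighted sum over all subsets of D without i."""
    n = len(data)
    rest = [j for j in data if j != i]
    return sum(
        w(n, k) * (utility(frozenset(s) | {i}) - utility(frozenset(s)))
        for k in range(n)
        for s in combinations(rest, k)
    )


def permutation_value(i, data, w) -> float:
    """Same quantity as an average over all permutations of D."""
    n = len(data)
    total = 0.0
    for sigma in permutations(data):
        prefix = frozenset(sigma[: sigma.index(i)])  # indices before i
        k = len(prefix)
        w_tilde = n * comb(n - 1, k) * w(n, k)  # weight correction
        total += w_tilde * (utility(prefix | {i}) - utility(prefix))
    return total / factorial(n)


data = [0, 1, 2, 3]
gaps = [
    abs(powerset_value(i, data, w) - permutation_value(i, data, w))
    for w in (w_shapley, w_banzhaf)
    for i in data
]
```

The correction works because exactly \(k!\,(n-1-k)!\) permutations place a given subset of size \(k\) before \(i\), and \(k!\,(n-1-k)!\, n \binom{n-1}{k} / n! = 1\).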
Warning
Both PermutationSampler and DeterministicPermutationSampler require caching to be enabled; otherwise the number of utility evaluations doubles with respect to a 'direct' implementation of permutation Monte Carlo.
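To see where the doubling comes from, consider walking along one permutation and evaluating each marginal contribution as two independent utility calls. The snippet below is a sketch with hypothetical names: it uses `functools.lru_cache` as a stand-in for whatever caching the library provides and counts utility evaluations with and without it.

```python
from functools import lru_cache


class CountingUtility:
    """Toy utility that records how many times it is evaluated."""

    def __init__(self):
        self.calls = 0

    def __call__(self, subset: frozenset) -> float:
        self.calls += 1
        return float(sum(subset))


def marginals_along(sigma, u):
    """Marginal contribution of each index, given the indices preceding it."""
    prefix = frozenset()
    out = {}
    for i in sigma:
        out[i] = u(prefix | {i}) - u(prefix)  # two utility calls per index
        prefix = prefix | {i}
    return out


sigma = (2, 0, 3, 1)

plain = CountingUtility()
marginals_along(sigma, plain)
uncached_calls = plain.calls  # 2 * n = 8

counted = CountingUtility()
cached = lru_cache(maxsize=None)(counted)
marginals_along(sigma, cached)
cached_calls = counted.calls  # n + 1 = 5
```

With a cache, the value of each prefix is reused as the baseline of the next step, so a permutation of length \(n\) costs \(n + 1\) evaluations instead of \(2n\).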
Computing semi-values
Samplers and coefficients can be arbitrarily mixed by means of the main entry point of this module, compute_generic_semivalues. There are several pre-defined coefficients, including the Shapley value of (Ghorbani and Zou, 2019) [1], the Banzhaf index of (Wang and Jia, 2022) [3], and the Beta coefficient of (Kwon and Zou, 2022) [2]. For each of these methods, there is a convenience wrapper function. Respectively, these are: compute_shapley_semivalues, compute_banzhaf_semivalues, and compute_beta_shapley_semivalues.
Parallelization and batching
In order to ensure reproducibility and fine-grained control of parallelization, samples are generated in the main process and then distributed to worker processes for evaluation. For small sample sizes, this can lead to significant overhead. To avoid this, we temporarily provide an additional argument batch_size to all methods, which can improve performance with small models by up to an order of magnitude. Note that this argument will be removed before version 1.0 in favour of a more general solution.
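The idea can be sketched in a few lines. This is not the library's implementation: the function names are hypothetical, and a `ThreadPoolExecutor` stands in for the worker processes to keep the example self-contained. Samples come from a single seeded generator in the main thread and are grouped into batches before being dispatched.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def take_batches(samples, batch_size):
    """Group an iterator of samples into lists of at most batch_size items."""
    it = iter(samples)
    while batch := list(islice(it, batch_size)):
        yield batch


def utility(subset) -> float:
    # Toy utility (assumption for illustration).
    return float(sum(subset))


def evaluate_batch(batch):
    # One job evaluates several marginals, amortizing the dispatch overhead.
    return [(i, utility(s | {i}) - utility(s)) for i, s in batch]


def sample_stream(indices, n_samples, seed):
    # Drawn from one seeded generator in the main thread, so the stream does
    # not depend on the number of workers or on the batch size.
    rng = random.Random(seed)
    for _ in range(n_samples):
        i = rng.choice(indices)
        s = frozenset(j for j in indices if j != i and rng.random() < 0.5)
        yield i, s


def run(indices, n_samples, batch_size, n_jobs, seed=42):
    batches = take_batches(sample_stream(indices, n_samples, seed), batch_size)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        results = pool.map(evaluate_batch, batches)  # preserves input order
        return [m for batch in results for m in batch]
```

In this sketch, calling `run` with `batch_size=1, n_jobs=1` and with `batch_size=8, n_jobs=4` yields the same list of marginals, since sampling is decoupled from evaluation.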
References
1. Ghorbani, A., Zou, J., 2019. Data Shapley: Equitable Valuation of Data for Machine Learning. In: Proceedings of the 36th International Conference on Machine Learning, PMLR, pp. 2242–2251.
2. Kwon, Y., Zou, J., 2022. Beta Shapley: A Unified and Noise-reduced Data Valuation Framework for Machine Learning. In: Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) 2022, Vol. 151. PMLR, Valencia, Spain.
3. Wang, J.T., Jia, R., 2022. Data Banzhaf: A Robust Data Valuation Framework for Machine Learning. ArXiv preprint arXiv:2205.15466.
SVCoefficient
Bases: Protocol
The protocol that coefficients for the computation of semi-values must fulfill.
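The exact signature of the protocol is not shown on this page. As an illustration only, the sketch below assumes that a coefficient is any callable mapping the dataset size `n` and a subset size `k` to a float; `CoefficientLike` and the helper functions are hypothetical names.

```python
from math import comb
from typing import Protocol


class CoefficientLike(Protocol):
    """Hypothetical stand-in for a semi-value coefficient protocol."""

    def __call__(self, n: int, k: int) -> float:
        """Weight for subsets of size k drawn from a dataset of size n."""
        ...


def shapley_like(n: int, k: int) -> float:
    # A plain function satisfies the protocol structurally.
    return 1.0 / (n * comb(n - 1, k))


def make_banzhaf_like() -> CoefficientLike:
    # A closure works as well: protocols only check the call signature.
    def w(n: int, k: int) -> float:
        return 1.0 / 2 ** (n - 1)

    return w


def total_weight(w: CoefficientLike, n: int) -> float:
    """Sum of the weights over all subsets of a set with n - 1 elements."""
    return sum(comb(n - 1, k) * w(n, k) for k in range(n))
```

Because the protocol is structural, no inheritance is required: any function or object with a matching `__call__` can be passed where a coefficient is expected.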
SemiValueMode
compute_generic_semivalues(sampler, u, coefficient, done, *, batch_size=1, skip_converged=False, n_jobs=1, config=ParallelConfig(), progress=False)
Computes semi-values for a given utility function and subset sampler.
| PARAMETER | DESCRIPTION |
|---|---|
| sampler | The subset sampler to use for utility computations. |
| u | Utility object with model, data, and scoring function. |
| coefficient | The semi-value coefficient. |
| done | Stopping criterion. |
| batch_size | Number of marginal evaluations per single parallel job. |
| skip_converged | Whether to skip marginal evaluations for indices that have already converged. CAUTION: This is only entirely safe if the stopping criterion is MaxUpdates. For any other stopping criterion, the convergence status of indices may change during the computation, or they may be marked as having converged even though in fact the estimated values are far from the true values (e.g. for AbsoluteStandardError, you will probably have to carefully adjust the threshold). |
| n_jobs | Number of parallel jobs to use. |
| config | Object configuring parallel computation, with cluster address, number of cpus, etc. |
| progress | Whether to display a progress bar. |

| RETURNS | DESCRIPTION |
|---|---|
| ValuationResult | Object with the results. |
Deprecation notice
Parameter batch_size is for experimental use and will be removed in future versions.
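The interplay between sampler, coefficient, stopping criterion and skip_converged can be illustrated with a simplified Monte Carlo loop. This is a sketch under stated assumptions, not the function's actual implementation: subsets are drawn uniformly, the stopping rule is a plain per-index cap on updates, and every name is hypothetical.

```python
import random


def banzhaf_w(n: int, k: int) -> float:
    return 1.0 / 2 ** (n - 1)


def toy_utility(subset) -> float:
    # Toy additive utility (assumption for illustration).
    return float(sum(subset))


def estimate_semivalues(indices, utility, w, max_updates,
                        skip_converged=False, seed=0):
    rng = random.Random(seed)
    n = len(indices)
    sums = {i: 0.0 for i in indices}
    counts = {i: 0 for i in indices}

    def converged(i) -> bool:
        # Simplified per-index criterion: a fixed number of updates.
        return counts[i] >= max_updates

    while not all(converged(i) for i in indices):
        # With skip_converged, indices that are already done are not sampled.
        active = [i for i in indices if not (skip_converged and converged(i))]
        i = rng.choice(active)
        s = frozenset(j for j in indices if j != i and rng.random() < 0.5)
        # Each subset of D without i has probability 2^-(n-1) under uniform
        # sampling, so the marginal is reweighted by 2^(n-1) * w(n, |S|).
        weight = 2 ** (n - 1) * w(n, len(s))
        sums[i] += weight * (utility(s | {i}) - utility(s))
        counts[i] += 1

    return {i: sums[i] / counts[i] for i in indices}, counts


values, counts = estimate_semivalues(
    [1, 2, 3, 4], toy_utility, banzhaf_w, max_updates=20, skip_converged=True
)
```

With an update cap, skipping is harmless because an index that has reached the cap never becomes unconverged again. A criterion based on the running standard error offers no such guarantee, which is the reason for the caution in the parameter description.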
Source code in src/pydvl/value/semivalues.py
compute_shapley_semivalues(u, *, done=MaxUpdates(100), sampler_t=PermutationSampler, batch_size=1, n_jobs=1, config=ParallelConfig(), progress=False, seed=None)
Computes Shapley values for a given utility function.
This is a convenience wrapper for compute_generic_semivalues with the Shapley coefficient. Use compute_shapley_values for a more flexible interface and additional methods, including TMCS.
| PARAMETER | DESCRIPTION |
|---|---|
| u | Utility object with model, data, and scoring function. |
| done | Stopping criterion. |
| sampler_t | The sampler type to use. See the sampler module for a list. |
| batch_size | Number of marginal evaluations per single parallel job. |
| n_jobs | Number of parallel jobs to use. |
| config | Object configuring parallel computation, with cluster address, number of cpus, etc. |
| seed | Either an instance of a numpy random number generator or a seed for it. |
| progress | Whether to display a progress bar. |

| RETURNS | DESCRIPTION |
|---|---|
| ValuationResult | Object with the results. |
Deprecation notice
Parameter batch_size is for experimental use and will be removed in future versions.
Source code in src/pydvl/value/semivalues.py
compute_banzhaf_semivalues(u, *, done=MaxUpdates(100), sampler_t=PermutationSampler, batch_size=1, n_jobs=1, config=ParallelConfig(), progress=False, seed=None)
Computes Banzhaf values for a given utility function.
This is a convenience wrapper for compute_generic_semivalues with the Banzhaf coefficient.
| PARAMETER | DESCRIPTION |
|---|---|
| u | Utility object with model, data, and scoring function. |
| done | Stopping criterion. |
| sampler_t | The sampler type to use. See the sampler module for a list. |
| batch_size | Number of marginal evaluations per single parallel job. |
| n_jobs | Number of parallel jobs to use. |
| seed | Either an instance of a numpy random number generator or a seed for it. |
| config | Object configuring parallel computation, with cluster address, number of cpus, etc. |
| progress | Whether to display a progress bar. |

| RETURNS | DESCRIPTION |
|---|---|
| ValuationResult | Object with the results. |
Deprecation notice
Parameter batch_size is for experimental use and will be removed in future versions.
Source code in src/pydvl/value/semivalues.py
compute_beta_shapley_semivalues(u, *, alpha=1, beta=1, done=MaxUpdates(100), sampler_t=PermutationSampler, batch_size=1, n_jobs=1, config=ParallelConfig(), progress=False, seed=None)
Computes Beta Shapley values for a given utility function.
This is a convenience wrapper for compute_generic_semivalues with the Beta Shapley coefficient.
| PARAMETER | DESCRIPTION |
|---|---|
| u | Utility object with model, data, and scoring function. |
| alpha | Alpha parameter of the Beta distribution. |
| beta | Beta parameter of the Beta distribution. |
| done | Stopping criterion. |
| sampler_t | The sampler type to use. See the sampler module for a list. |
| batch_size | Number of marginal evaluations per (parallelized) task. |
| n_jobs | Number of parallel jobs to use. |
| seed | Either an instance of a numpy random number generator or a seed for it. |
| config | Object configuring parallel computation, with cluster address, number of cpus, etc. |
| progress | Whether to display a progress bar. |

| RETURNS | DESCRIPTION |
|---|---|
| ValuationResult | Object with the results. |
Deprecation notice
Parameter batch_size is for experimental use and will be removed in future versions.
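For intuition about the roles of alpha and beta, the following sketch computes a Beta Shapley weight from the definition in (Kwon and Zou, 2022) with the factor \(1/n\) folded in. It is not taken from this module: the function names, the indexing of `k` as the subset size (0 to \(n-1\)) and the normalization are assumptions.

```python
from math import comb, exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_w(alpha: float, beta: float):
    """Return a coefficient w(n, k) for a Beta(alpha, beta) weighting."""
    log_const = log_beta(alpha, beta)

    def w(n: int, k: int) -> float:
        # k is the subset size, 0 <= k <= n - 1.
        return exp(log_beta(k + beta, n - 1 - k + alpha) - log_const)

    return w


n = 6
uniform = beta_w(1, 1)  # reduces to the Shapley coefficient
skewed = beta_w(4, 1)   # puts more weight on small subsets

shapley_gap = max(
    abs(uniform(n, k) - 1.0 / (n * comb(n - 1, k))) for k in range(n)
)
skewed_total = sum(comb(n - 1, k) * skewed(n, k) for k in range(n))
```

With `alpha = beta = 1` the weights coincide with the Shapley coefficient, which matches the defaults of the wrapper. Under this parametrization, larger `alpha` shifts weight towards marginal contributions on small subsets.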
Source code in src/pydvl/value/semivalues.py
compute_semivalues(u, *, done=MaxUpdates(100), mode=SemiValueMode.Shapley, sampler_t=PermutationSampler, batch_size=1, n_jobs=1, seed=None, **kwargs)
Convenience entry point for most common semi-value computations.
Deprecation warning
This method is deprecated and will be replaced in 0.8.0 by the more general implementation of compute_generic_semivalues. Use compute_shapley_semivalues, compute_banzhaf_semivalues, or compute_beta_shapley_semivalues instead.
The modes supported with this interface are the following. For greater flexibility use compute_generic_semivalues directly.
- SemiValueMode.Shapley: Shapley values.
- SemiValueMode.BetaShapley: Implements the Beta Shapley semi-value as introduced in (Kwon and Zou, 2022) [2]. Pass additional keyword arguments alpha and beta to set the parameters of the Beta distribution (both default to 1).
- SemiValueMode.Banzhaf: Implements the Banzhaf semi-value as introduced in (Wang and Jia, 2022) [3].
See Data valuation for an overview of valuation.
| PARAMETER | DESCRIPTION |
|---|---|
| u | Utility object with model, data, and scoring function. |
| done | Stopping criterion. |
| mode | The semi-value mode to use. See SemiValueMode for a list. |
| sampler_t | The sampler type to use. See sampler for a list. |
| batch_size | Number of marginal evaluations per (parallelized) task. |
| n_jobs | Number of parallel jobs to use. |
| seed | Either an instance of a numpy random number generator or a seed for it. |
| kwargs | Additional keyword arguments passed to compute_generic_semivalues. |

| RETURNS | DESCRIPTION |
|---|---|
| ValuationResult | Object with the results. |
Deprecation notice
Parameter batch_size is for experimental use and will be removed in future versions.
Source code in src/pydvl/value/semivalues.py
Created: 2023-12-21