pydvl.utils.numeric
¶
This module contains routines for numerical computations used across the library.
complement
¶
Returns the complement of the set of indices excluding the given indices.
PARAMETER | DESCRIPTION |
---|---|
include
|
The set of indices to consider.
TYPE:
|
exclude
|
The indices to exclude from the complement. These must be a subset
of |
RETURNS | DESCRIPTION |
---|---|
NDArray[T]
|
The complement of the set of indices excluding the given indices. |
Source code in src/pydvl/utils/numeric.py
powerset
¶
powerset(s: NDArray[T]) -> Iterator[Collection[T]]
Returns an iterator for the power set of the argument.
Subsets are generated in sequence by growing size. See random_powerset() for random sampling.
Example
PARAMETER | DESCRIPTION |
---|---|
s
|
The set to use
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Iterator[Collection[T]]
|
An iterator over all subsets of the set of indices |
Source code in src/pydvl/utils/numeric.py
num_samples_permutation_hoeffding
¶
Lower bound on the number of samples required for MonteCarlo Shapley to obtain an (ε,δ)-approximation.
That is: with probability 1-δ, the estimated value for one data point will be ε-close to the true quantity, if at least this many permutations are sampled.
PARAMETER | DESCRIPTION |
---|---|
eps
|
ε > 0
TYPE:
|
delta
|
0 < δ <= 1
TYPE:
|
u_range
|
Range of the Utility function
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
int
|
Number of permutations required to guarantee ε-correct Shapley values with probability 1-δ |
Source code in src/pydvl/utils/numeric.py
random_subset
¶
Returns one subset at random from s
.
PARAMETER | DESCRIPTION |
---|---|
s
|
set to sample from
TYPE:
|
q
|
Sampling probability for elements. The default 0.5 yields a uniform distribution over the power set of s.
TYPE:
|
seed
|
Either an instance of a numpy random number generator or a seed for it.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
NDArray[T]
|
The subset |
Source code in src/pydvl/utils/numeric.py
random_powerset
¶
random_powerset(
s: NDArray[T],
n_samples: Optional[int] = None,
q: float = 0.5,
seed: Optional[Seed] = None,
) -> Generator[NDArray[T], None, None]
Samples subsets from the power set of the argument, without pre-generating all subsets and in no order.
See powerset if you wish to deterministically generate all subsets.
To generate subsets, len(s)
Bernoulli draws with probability q
are
drawn. The default value of q = 0.5
provides a uniform distribution over
the power set of s
. Other choices can be used e.g. to implement
owen_sampling_shapley.
PARAMETER | DESCRIPTION |
---|---|
s
|
set to sample from
TYPE:
|
n_samples
|
if set, stop the generator after this many steps.
Defaults to |
q
|
Sampling probability for elements. The default 0.5 yields a uniform distribution over the power set of s.
TYPE:
|
seed
|
Either an instance of a numpy random number generator or a seed for it.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
None
|
Samples from the power set of |
RAISES | DESCRIPTION |
---|---|
ValueError
|
if the element sampling probability is not in [0,1] |
Source code in src/pydvl/utils/numeric.py
random_powerset_label_min
¶
random_powerset_label_min(
s: NDArray[T],
labels: NDArray[int_],
min_elements_per_label: int = 1,
seed: Optional[Seed] = None,
) -> Generator[NDArray[T], None, None]
Draws random subsets from s
, while ensuring that at least
min_elements_per_label
elements per label are included in the draw. It can be used
for classification problems to ensure that a set contains information for all labels
(or not if min_elements_per_label=0
).
PARAMETER | DESCRIPTION |
---|---|
s
|
Set to sample from
TYPE:
|
labels
|
Labels for the samples |
min_elements_per_label
|
Minimum number of elements for each label.
TYPE:
|
seed
|
Either an instance of a numpy random number generator or a seed for it.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
None
|
Generated draw from the powerset of s with |
None
|
label. |
RAISES | DESCRIPTION |
---|---|
ValueError
|
If |
Source code in src/pydvl/utils/numeric.py
random_subset_of_size
¶
Samples a random subset of given size uniformly from the powerset
of s
.
PARAMETER | DESCRIPTION |
---|---|
s
|
Set to sample from
TYPE:
|
size
|
Size of the subset to generate
TYPE:
|
seed
|
Either an instance of a numpy random number generator or a seed for it.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
NDArray[T]
|
The subset |
Raises ValueError: If size > len(s)
Source code in src/pydvl/utils/numeric.py
random_matrix_with_condition_number
¶
random_matrix_with_condition_number(
n: int, condition_number: float, seed: Optional[Seed] = None
) -> NDArray
Constructs a square matrix with a given condition number.
Taken from: https://gist.github.com/bstellato/23322fe5d87bb71da922fbc41d658079#file-random_mat_condition_number-py
Also see: https://math.stackexchange.com/questions/1351616/condition-number-of-ata.
PARAMETER | DESCRIPTION |
---|---|
n
|
size of the matrix
TYPE:
|
condition_number
|
duh
TYPE:
|
seed
|
Either an instance of a numpy random number generator or a seed for it.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
NDArray
|
An (n,n) matrix with the requested condition number. |
Source code in src/pydvl/utils/numeric.py
running_moments
¶
running_moments(
previous_avg: float,
previous_variance: float,
count: int,
new_value: float,
unbiased: bool = True,
) -> tuple[float, float]
Calculates running average and variance of a series of numbers.
See Welford's algorithm in wikipedia
Warning
This is not really using Welford's correction for numerical stability for the variance. (FIXME)
Todo
This could be generalised to arbitrary moments. See this paper
PARAMETER | DESCRIPTION |
---|---|
previous_avg
|
average value at previous step.
TYPE:
|
previous_variance
|
variance at previous step.
TYPE:
|
count
|
number of points seen so far,
TYPE:
|
new_value
|
new value in the series of numbers.
TYPE:
|
unbiased
|
whether to use the unbiased variance estimator (same as
TYPE:
|
Returns: new_average, new_variance, calculated with the new count
Source code in src/pydvl/utils/numeric.py
top_k_value_accuracy
¶
Computes the top-k accuracy for the estimated values by comparing indices of the highest k values.
PARAMETER | DESCRIPTION |
---|---|
y_true
|
Exact/true value |
y_pred
|
Predicted/estimated value |
k
|
Number of the highest values taken into account
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
float
|
Accuracy |
Source code in src/pydvl/utils/numeric.py
logcomb
¶
Computes the log of the binomial coefficient (n choose k).
PARAMETER | DESCRIPTION |
---|---|
n
|
Total number of elements
TYPE:
|
k
|
Number of elements to choose
TYPE:
|
Returns: The log of the binomial coefficient
Source code in src/pydvl/utils/numeric.py
logexp
¶
logsumexp_two
¶
Numerically stable computation of log(exp(log_a) + exp(log_b)).
Uses standard log sum exp trick:
where \(m = \max(\log a, \log b)\).
PARAMETER | DESCRIPTION |
---|---|
log_a
|
Log of the first value
TYPE:
|
log_b
|
Log of the second value
TYPE:
|
Returns: The log of the sum of the exponentials
Source code in src/pydvl/utils/numeric.py
log_running_moments
¶
log_running_moments(
previous_log_sum_pos: float,
previous_log_sum_neg: float,
previous_log_sum2: float,
count: int,
new_log_value: float,
new_sign: int,
unbiased: bool = True,
) -> tuple[float, float, float, float, float]
Update running moments when the new value is provided in log space, allowing for negative values via an explicit sign.
Here the actual value is x = new_sign * exp(new_log_value). Rather than updating the arithmetic sum S = sum(x) and S2 = sum(x^2) directly, we maintain:
L_S+ = log(sum_{i: x_i >= 0} x_i) L_S- = log(sum_{i: x_i < 0} |x_i|) L_S2 = log(sum_i x_i^2)
The running mean is then computed as:
mean = exp(L_S+) - exp(L_S-)
and the second moment is:
second_moment = exp(L_S2 - log(count))
so that the variance is:
variance = second_moment - mean^2
For the unbiased (sample) estimator, we scale the variance by count/(count-1) when count > 1 (and define variance = 0 when count == 1).
PARAMETER | DESCRIPTION |
---|---|
previous_log_sum_pos
|
running log(sum of positive contributions), or -inf if none.
TYPE:
|
previous_log_sum_neg
|
running log(sum of negative contributions in absolute value), or -inf if none.
TYPE:
|
previous_log_sum2
|
running log(sum of squares) so far (or -inf if none).
TYPE:
|
count
|
number of points processed so far.
TYPE:
|
new_log_value
|
log(|x_new|), where x_new is the new value.
TYPE:
|
new_sign
|
sign of the new value (should be +1, 0, or -1).
TYPE:
|
unbiased
|
if True, compute the unbiased estimator of the variance.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
new_mean
|
running mean in the linear domain.
TYPE:
|
new_variance
|
running variance in the linear domain.
TYPE:
|
new_log_sum_pos
|
updated running log(sum of positive contributions).
TYPE:
|
new_log_sum_neg
|
updated running log(sum of negative contributions).
TYPE:
|
new_log_sum2
|
updated running log(sum of squares).
TYPE:
|
new_count
|
updated count. |
Source code in src/pydvl/utils/numeric.py
421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 |
|