Skip to content

pydvl.utils.numeric

This module contains routines for numerical computations used across the library.

powerset

powerset(s: NDArray[T]) -> Iterator[Collection[T]]

Returns an iterator for the power set of the argument.

Subsets are generated in sequence by growing size. See random_powerset() for random sampling.

Example
>>> import numpy as np
>>> from pydvl.utils.numeric import powerset
>>> list(powerset(np.array((1,2))))
[(), (1,), (2,), (1, 2)]
PARAMETER DESCRIPTION
s

The set to use

TYPE: NDArray[T]

RETURNS DESCRIPTION
Iterator[Collection[T]]

An iterator over all subsets of the set of indices s.

Source code in src/pydvl/utils/numeric.py
def powerset(s: NDArray[T]) -> Iterator[Collection[T]]:
    """Returns an iterator for the power set of the argument.

     Subsets are generated in sequence by growing size. See
     [random_powerset()][pydvl.utils.numeric.random_powerset] for random
     sampling.

    ??? Example
        ``` pycon
        >>> import numpy as np
        >>> from pydvl.utils.numeric import powerset
        >>> list(powerset(np.array((1,2))))
        [(), (1,), (2,), (1, 2)]
        ```

    Args:
         s: The set to use

    Returns:
        An iterator over all subsets of the set of indices `s`.
    """
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

num_samples_permutation_hoeffding

num_samples_permutation_hoeffding(
    eps: float, delta: float, u_range: float
) -> int

Lower bound on the number of samples required for MonteCarlo Shapley to obtain an (ε,δ)-approximation.

That is: with probability 1-δ, the estimated value for one data point will be ε-close to the true quantity, if at least this many permutations are sampled.

PARAMETER DESCRIPTION
eps

ε > 0

TYPE: float

delta

0 < δ <= 1

TYPE: float

u_range

Range of the Utility function

TYPE: float

RETURNS DESCRIPTION
int

Number of permutations required to guarantee ε-correct Shapley values with probability 1-δ

Source code in src/pydvl/utils/numeric.py
def num_samples_permutation_hoeffding(eps: float, delta: float, u_range: float) -> int:
    """Lower bound on the number of samples required for MonteCarlo Shapley to
    obtain an (ε,δ)-approximation.

    That is: with probability 1-δ, the estimated value for one data point will
    be ε-close to the true quantity, if at least this many permutations are
    sampled.

    Args:
        eps: ε > 0
        delta: 0 < δ <= 1
        u_range: Range of the [Utility][pydvl.utils.utility.Utility] function

    Returns:
        Number of _permutations_ required to guarantee ε-correct Shapley
            values with probability 1-δ
    """
    return int(np.ceil(np.log(2 / delta) * 2 * u_range**2 / eps**2))

random_subset

random_subset(
    s: NDArray[T], q: float = 0.5, seed: Optional[Seed] = None
) -> NDArray[T]

Returns one subset at random from s.

PARAMETER DESCRIPTION
s

set to sample from

TYPE: NDArray[T]

q

Sampling probability for elements. The default 0.5 yields a uniform distribution over the power set of s.

TYPE: float DEFAULT: 0.5

seed

Either an instance of a numpy random number generator or a seed for it.

TYPE: Optional[Seed] DEFAULT: None

RETURNS DESCRIPTION
NDArray[T]

The subset

Source code in src/pydvl/utils/numeric.py
def random_subset(
    s: NDArray[T], q: float = 0.5, seed: Optional[Seed] = None
) -> NDArray[T]:
    """Returns one subset at random from ``s``.

    Args:
        s: set to sample from
        q: Sampling probability for elements. The default 0.5 yields a
            uniform distribution over the power set of s.
        seed: Either an instance of a numpy random number generator or a seed
            for it.

    Returns:
        The subset
    """
    rng = np.random.default_rng(seed)
    selection = rng.uniform(size=len(s)) > q
    return s[selection]

random_powerset

random_powerset(
    s: NDArray[T],
    n_samples: Optional[int] = None,
    q: float = 0.5,
    seed: Optional[Seed] = None,
) -> Generator[NDArray[T], None, None]

Samples subsets from the power set of the argument, without pre-generating all subsets and in no order.

See powerset if you wish to deterministically generate all subsets.

To generate subsets, len(s) Bernoulli draws with probability q are drawn. The default value of q = 0.5 provides a uniform distribution over the power set of s. Other choices can be used e.g. to implement owen_sampling_shapley.

PARAMETER DESCRIPTION
s

set to sample from

TYPE: NDArray[T]

n_samples

if set, stop the generator after this many steps. Defaults to np.iinfo(np.int32).max

TYPE: Optional[int] DEFAULT: None

q

Sampling probability for elements. The default 0.5 yields a uniform distribution over the power set of s.

TYPE: float DEFAULT: 0.5

seed

Either an instance of a numpy random number generator or a seed for it.

TYPE: Optional[Seed] DEFAULT: None

RETURNS DESCRIPTION
Generator[NDArray[T], None, None]

Samples from the power set of s.

RAISES DESCRIPTION
ValueError

if the element sampling probability is not in [0,1]

Source code in src/pydvl/utils/numeric.py
def random_powerset(
    s: NDArray[T],
    n_samples: Optional[int] = None,
    q: float = 0.5,
    seed: Optional[Seed] = None,
) -> Generator[NDArray[T], None, None]:
    """Samples subsets from the power set of the argument, without
    pre-generating all subsets and in no order.

    See [powerset][pydvl.utils.numeric.powerset] if you wish to deterministically generate all subsets.

    To generate subsets, `len(s)` Bernoulli draws with probability `q` are
    drawn. The default value of `q = 0.5` provides a uniform distribution over
    the power set of `s`. Other choices can be used e.g. to implement
    [owen_sampling_shapley][pydvl.value.shapley.owen.owen_sampling_shapley].

    Args:
        s: set to sample from
        n_samples: if set, stop the generator after this many steps.
            Defaults to `np.iinfo(np.int32).max`
        q: Sampling probability for elements. The default 0.5 yields a
            uniform distribution over the power set of s.
        seed: Either an instance of a numpy random number generator or a seed for it.

    Returns:
        Samples from the power set of `s`.

    Raises:
        ValueError: if the element sampling probability is not in [0,1]

    """
    if q < 0 or q > 1:
        raise ValueError("Element sampling probability must be in [0,1]")

    rng = np.random.default_rng(seed)
    total = 1
    if n_samples is None:
        n_samples = np.iinfo(np.int32).max
    while total <= n_samples:
        yield random_subset(s, q, seed=rng)
        total += 1

random_powerset_label_min

random_powerset_label_min(
    s: NDArray[T],
    labels: NDArray[int_],
    min_elements_per_label: int = 1,
    seed: Optional[Seed] = None,
) -> Generator[NDArray[T], None, None]

Draws random subsets from s, while ensuring that at least min_elements_per_label elements per label are included in the draw. It can be used for classification problems to ensure that a set contains information for all labels (or not if min_elements_per_label=0).

PARAMETER DESCRIPTION
s

Set to sample from

TYPE: NDArray[T]

labels

Labels for the samples

TYPE: NDArray[int_]

min_elements_per_label

Minimum number of elements for each label.

TYPE: int DEFAULT: 1

seed

Either an instance of a numpy random number generator or a seed for it.

TYPE: Optional[Seed] DEFAULT: None

RETURNS DESCRIPTION
Generator[NDArray[T], None, None]

Generated draw from the powerset of s with min_elements_per_label for each

Generator[NDArray[T], None, None]

label.

RAISES DESCRIPTION
ValueError

If s and labels are of different length or min_elements_per_label is smaller than 0.

Source code in src/pydvl/utils/numeric.py
def random_powerset_label_min(
    s: NDArray[T],
    labels: NDArray[np.int_],
    min_elements_per_label: int = 1,
    seed: Optional[Seed] = None,
) -> Generator[NDArray[T], None, None]:
    """Draws random subsets from `s`, while ensuring that at least
    `min_elements_per_label` elements per label are included in the draw. It can be used
    for classification problems to ensure that a set contains information for all labels
    (or not if `min_elements_per_label=0`).

    Args:
        s: Set to sample from
        labels: Labels for the samples
        min_elements_per_label: Minimum number of elements for each label.
        seed: Either an instance of a numpy random number generator or a seed for it.

    Returns:
        Generated draw from the powerset of s with `min_elements_per_label` for each
        label.

    Raises:
        ValueError: If `s` and `labels` are of different length or
            `min_elements_per_label` is smaller than 0.
    """
    if len(labels) != len(s):
        raise ValueError("Set and labels have to be of same size.")

    if min_elements_per_label < 0:
        raise ValueError(
            f"Parameter min_elements={min_elements_per_label} needs to be bigger or "
            f"equal to 0."
        )

    rng = np.random.default_rng(seed)
    unique_labels = np.unique(labels)

    while True:
        subsets: List[NDArray[T]] = []
        for label in unique_labels:
            label_indices = np.asarray(np.where(labels == label)[0])
            subset_size = int(
                rng.integers(
                    min(min_elements_per_label, len(label_indices)),
                    len(label_indices) + 1,
                )
            )
            if subset_size > 0:
                subsets.append(
                    random_subset_of_size(s[label_indices], subset_size, seed=rng)
                )

        if len(subsets) > 0:
            subset = np.concatenate(tuple(subsets))
            rng.shuffle(subset)
            yield subset
        else:
            yield np.array([], dtype=s.dtype)

random_subset_of_size

random_subset_of_size(
    s: NDArray[T], size: int, seed: Optional[Seed] = None
) -> NDArray[T]

Samples a random subset of given size uniformly from the powerset of s.

PARAMETER DESCRIPTION
s

Set to sample from

TYPE: NDArray[T]

size

Size of the subset to generate

TYPE: int

seed

Either an instance of a numpy random number generator or a seed for it.

TYPE: Optional[Seed] DEFAULT: None

RETURNS DESCRIPTION
NDArray[T]

The subset

Raises ValueError: If size > len(s)

Source code in src/pydvl/utils/numeric.py
def random_subset_of_size(
    s: NDArray[T], size: int, seed: Optional[Seed] = None
) -> NDArray[T]:
    """Samples a random subset of given size uniformly from the powerset
    of `s`.

    Args:
        s: Set to sample from
        size: Size of the subset to generate
        seed: Either an instance of a numpy random number generator or a seed for it.

    Returns:
        The subset

    Raises
        ValueError: If size > len(s)
    """
    if size > len(s):
        raise ValueError("Cannot sample subset larger than set")
    rng = np.random.default_rng(seed)
    return rng.choice(s, size=size, replace=False)

random_matrix_with_condition_number

random_matrix_with_condition_number(
    n: int, condition_number: float, seed: Optional[Seed] = None
) -> NDArray

Constructs a square matrix with a given condition number.

Taken from: https://gist.github.com/bstellato/23322fe5d87bb71da922fbc41d658079#file-random_mat_condition_number-py

Also see: https://math.stackexchange.com/questions/1351616/condition-number-of-ata.

PARAMETER DESCRIPTION
n

size of the matrix

TYPE: int

condition_number

duh

TYPE: float

seed

Either an instance of a numpy random number generator or a seed for it.

TYPE: Optional[Seed] DEFAULT: None

RETURNS DESCRIPTION
NDArray

An (n,n) matrix with the requested condition number.

Source code in src/pydvl/utils/numeric.py
def random_matrix_with_condition_number(
    n: int, condition_number: float, seed: Optional[Seed] = None
) -> NDArray:
    """Constructs a square matrix with a given condition number.

    Taken from:
    [https://gist.github.com/bstellato/23322fe5d87bb71da922fbc41d658079#file-random_mat_condition_number-py](
    https://gist.github.com/bstellato/23322fe5d87bb71da922fbc41d658079#file-random_mat_condition_number-py)

    Also see:
    [https://math.stackexchange.com/questions/1351616/condition-number-of-ata](
    https://math.stackexchange.com/questions/1351616/condition-number-of-ata).

    Args:
        n: size of the matrix
        condition_number: duh
        seed: Either an instance of a numpy random number generator or a seed for it.

    Returns:
        An (n,n) matrix with the requested condition number.
    """
    if n < 2:
        raise ValueError("Matrix size must be at least 2")

    if condition_number <= 1:
        raise ValueError("Condition number must be greater than 1")

    rng = np.random.default_rng(seed)
    log_condition_number = np.log(condition_number)
    exp_vec = np.arange(
        -log_condition_number / 4.0,
        log_condition_number * (n + 1) / (4 * (n - 1)),
        log_condition_number / (2.0 * (n - 1)),
    )
    exp_vec = exp_vec[:n]
    s: np.ndarray = np.exp(exp_vec)
    S = np.diag(s)
    U, _ = np.linalg.qr((rng.uniform(size=(n, n)) - 5.0) * 200)
    V, _ = np.linalg.qr((rng.uniform(size=(n, n)) - 5.0) * 200)
    P: np.ndarray = U.dot(S).dot(V.T)
    P = P.dot(P.T)
    return P

running_moments

running_moments(
    previous_avg: float | NDArray[float_],
    previous_variance: float | NDArray[float_],
    count: int,
    new_value: float | NDArray[float_],
) -> Tuple[float | NDArray[float_], float | NDArray[float_]]

Uses Welford's algorithm to calculate the running average and variance of a set of numbers.

See Welford's algorithm in wikipedia

Warning

This is not really using Welford's correction for numerical stability for the variance. (FIXME)

Todo

This could be generalised to arbitrary moments. See this paper

PARAMETER DESCRIPTION
previous_avg

average value at previous step

TYPE: float | NDArray[float_]

previous_variance

variance at previous step

TYPE: float | NDArray[float_]

count

number of points seen so far

TYPE: int

new_value

new value in the series of numbers

TYPE: float | NDArray[float_]

RETURNS DESCRIPTION
Tuple[float | NDArray[float_], float | NDArray[float_]]

new_average, new_variance, calculated with the new count

Source code in src/pydvl/utils/numeric.py
def running_moments(
    previous_avg: float | NDArray[np.float_],
    previous_variance: float | NDArray[np.float_],
    count: int,
    new_value: float | NDArray[np.float_],
) -> Tuple[float | NDArray[np.float_], float | NDArray[np.float_]]:
    """Uses Welford's algorithm to calculate the running average and variance of
     a set of numbers.

    See [Welford's algorithm in wikipedia](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm)

    !!! Warning
        This is not really using Welford's correction for numerical stability
        for the variance. (FIXME)

    !!! Todo
        This could be generalised to arbitrary moments. See [this paper](https://www.osti.gov/biblio/1028931)

    Args:
        previous_avg: average value at previous step
        previous_variance: variance at previous step
        count: number of points seen so far
        new_value: new value in the series of numbers

    Returns:
        new_average, new_variance, calculated with the new count
    """
    # broadcasted operations seem not to be supported by mypy, so we ignore the type
    new_average = (new_value + count * previous_avg) / (count + 1)  # type: ignore
    new_variance = previous_variance + (
        (new_value - previous_avg) * (new_value - new_average) - previous_variance
    ) / (count + 1)
    return new_average, new_variance

top_k_value_accuracy

top_k_value_accuracy(
    y_true: NDArray[float_], y_pred: NDArray[float_], k: int = 3
) -> float

Computes the top-k accuracy for the estimated values by comparing indices of the highest k values.

PARAMETER DESCRIPTION
y_true

Exact/true value

TYPE: NDArray[float_]

y_pred

Predicted/estimated value

TYPE: NDArray[float_]

k

Number of the highest values taken into account

TYPE: int DEFAULT: 3

RETURNS DESCRIPTION
float

Accuracy

Source code in src/pydvl/utils/numeric.py
def top_k_value_accuracy(
    y_true: NDArray[np.float_], y_pred: NDArray[np.float_], k: int = 3
) -> float:
    """Computes the top-k accuracy for the estimated values by comparing indices
    of the highest k values.

    Args:
        y_true: Exact/true value
        y_pred: Predicted/estimated value
        k: Number of the highest values taken into account

    Returns:
        Accuracy
    """
    top_k_exact_values = np.argsort(y_true)[-k:]
    top_k_pred_values = np.argsort(y_pred)[-k:]
    top_k_accuracy = len(np.intersect1d(top_k_exact_values, top_k_pred_values)) / k
    return top_k_accuracy