
pydvl.valuation.methods.data_oob

This module implements the method described in (Kwon and Zou, 2023) [1]. It fits a bagging classifier or regressor to the data, using a given model as the base estimator. A data point's value is the average loss of the estimators that were not fit on it.

Let \(w_{bj}\in \mathbb{Z}\) be the number of times the j-th datum \((x_j, y_j)\) is selected in the b-th bootstrap dataset, for \(b = 1, \dots, B\). The value of the i-th datum is then defined as

\[ \psi((x_i,y_i),\Theta_B):=\frac{\sum_{b=1}^{B}\mathbb{1}(w_{bi}=0)T(y_i, \hat{f}_b(x_i))}{\sum_{b=1}^{B} \mathbb{1}(w_{bi}=0)}, \]

where \(T: Y \times Y \rightarrow \mathbb{R}\) is a score function that represents the goodness of a weak learner \(\hat{f}_b\) at the i-th datum \((x_i, y_i)\).
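To make the formula concrete, here is a minimal NumPy sketch of the out-of-bag average. It is not the library's implementation: the function name `oob_values` and the arrays `counts` (holding \(w_{bi}\)) and `scores` (holding \(T(y_i, \hat{f}_b(x_i))\)) are hypothetical and assumed to be precomputed.

```python
import numpy as np


def oob_values(counts: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Average score of each datum over the estimators that did not see it.

    counts[b, i] is w_{bi}, the number of times datum i was drawn for
    bootstrap b; scores[b, i] is T(y_i, f_b(x_i)). Both have shape (B, n).
    """
    oob = counts == 0                                 # indicator 1(w_{bi} = 0)
    n_oob = oob.sum(axis=0)                           # denominator of the formula
    total = np.where(oob, scores, 0.0).sum(axis=0)    # numerator of the formula
    # A datum that is never out-of-bag yields 0 / 0, i.e. NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        return total / n_oob


# Toy example with B = 3 bootstraps and n = 3 data points.
counts = np.array([[0, 2, 1], [1, 0, 2], [0, 1, 2]])
scores = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [0.0, 1.0, 1.0]])
psi = oob_values(counts, scores)
```

In this toy input the third datum is selected in every bootstrap, so its denominator is zero and its value is NaN.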

Warning

This implementation is a placeholder and does not match exactly the method described in the paper.

References


  1. Kwon, Yongchan, and James Zou. Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value. In Proceedings of the 40th International Conference on Machine Learning, 18135–52. PMLR, 2023. 

DataOOBValuation

DataOOBValuation(
    model: SupervisedModel,
    n_estimators: int,
    max_samples: float = 0.8,
    loss: LossFunction | None = None,
    seed: Seed | None = None,
)

Bases: Valuation

Computes Data Out-Of-Bag values.

Tip

n_estimators and max_samples must be tuned jointly so that every sample is out-of-bag at least once; otherwise the value of a sample that is never out-of-bag may be NaN.
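As a rough aid for this tuning, the sketch below estimates how many samples would never be out-of-bag. It assumes, and this is an assumption rather than something stated on this page, that each estimator is trained on round(max_samples * n_samples) indices drawn uniformly with replacement. The helper `expected_never_oob` is hypothetical and not part of pyDVL.

```python
def expected_never_oob(
    n_samples: int, n_estimators: int, max_samples: float = 0.8
) -> float:
    """Expected number of samples that are in-bag for every estimator.

    Assumes each estimator draws round(max_samples * n_samples) indices
    uniformly with replacement (an assumption of this sketch).
    """
    n_draws = max(1, round(max_samples * n_samples))
    # Probability that a fixed sample appears in one bootstrap dataset.
    p_in_bag = 1.0 - (1.0 - 1.0 / n_samples) ** n_draws
    # It is never out-of-bag only if it is in-bag for all estimators.
    return n_samples * p_in_bag**n_estimators


few = expected_never_oob(n_samples=1000, n_estimators=5)
many = expected_never_oob(n_samples=1000, n_estimators=100)
```

Under these assumptions, increasing n_estimators or lowering max_samples shrinks the expected count, which is why the two parameters should be chosen together.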

PARAMETER DESCRIPTION
data

dataset

model

TYPE: SupervisedModel

n_estimators

Number of estimators used in the bagging procedure.

TYPE: int

max_samples

The fraction of samples to draw to train each base estimator.

TYPE: float DEFAULT: 0.8

loss

A function taking the data labels and the corresponding model predictions, in that order (y_true, y_pred), and returning an array of point-wise errors.

TYPE: LossFunction | None DEFAULT: None

seed

Either an instance of a numpy random number generator or a seed for it.

TYPE: Seed | None DEFAULT: None

RETURNS DESCRIPTION

Object with the data values.

This is an extended pyDVL implementation of the Data-OOB valuation method that bags whatever model is passed to it, whereas the paper only considers bagging models as input.
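For illustration, a custom function matching the (y_true, y_pred) signature described for the loss parameter might look like the sketch below. The name `neg_l1_distance` is hypothetical and not part of pyDVL. Like the two functions documented further down this page, it returns larger values for better predictions, which is an assumption about the intended convention.

```python
import numpy as np
from numpy.typing import NDArray


def neg_l1_distance(y_true: NDArray, y_pred: NDArray) -> NDArray:
    """Point-wise negative absolute error (hypothetical example function)."""
    # Cast to float so that integer inputs do not truncate the result.
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return -np.abs(diff)


example = neg_l1_distance(np.array([1, 2, 3]), np.array([1.0, 4.0, 2.5]))
```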

Source code in src/pydvl/valuation/methods/data_oob.py
def __init__(
    self,
    model: SupervisedModel,
    n_estimators: int,
    max_samples: float = 0.8,
    loss: LossFunction | None = None,
    seed: Seed | None = None,
):
    super().__init__()
    self.model = model
    self.n_estimators = n_estimators
    self.max_samples = max_samples
    self.loss = loss
    self.rng = np.random.default_rng(seed)

values

values(sort: bool = False) -> ValuationResult

Returns a copy of the valuation result.

The valuation must have been run with fit() before calling this method.

PARAMETER DESCRIPTION
sort

Whether to sort the valuation result before returning it.

TYPE: bool DEFAULT: False

Returns: The result of the valuation.

Source code in src/pydvl/valuation/base.py
def values(self, sort: bool = False) -> ValuationResult:
    """Returns a copy of the valuation result.

    The valuation must have been run with `fit()` before calling this method.

    Args:
        sort: Whether to sort the valuation result before returning it.
    Returns:
        The result of the valuation.
    """
    if not self.is_fitted:
        raise NotFittedException(type(self))
    assert self.result is not None

    from copy import copy

    r = copy(self.result)
    if sort:
        r.sort()
    return r

point_wise_accuracy

point_wise_accuracy(y_true: NDArray[T], y_pred: NDArray[T]) -> NDArray[T]

Point-wise 0-1 score between two arrays: 1 where the prediction equals the true value, 0 otherwise

PARAMETER DESCRIPTION
y_true

Array of true values (e.g. labels)

TYPE: NDArray[T]

y_pred

Array of estimated values (e.g. model predictions)

TYPE: NDArray[T]

RETURNS DESCRIPTION
NDArray[T]

Array with point-wise 0-1 scores between labels and model predictions (1 for a match, 0 for a mismatch)

Source code in src/pydvl/valuation/methods/data_oob.py
def point_wise_accuracy(y_true: NDArray[T], y_pred: NDArray[T]) -> NDArray[T]:
    r"""Point-wise 0-1 loss between two arrays

    Args:
        y_true: Array of true values (e.g. labels)
        y_pred: Array of estimated values (e.g. model predictions)

    Returns:
        Array with point-wise 0-1 losses between labels and model predictions
    """
    return np.array(y_pred == y_true, dtype=y_pred.dtype)
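A short usage sketch may help to see what the function returns. The function body is copied from the listing above so that the block is self-contained; the input arrays are made up for illustration.

```python
import numpy as np


def point_wise_accuracy(y_true, y_pred):
    # Body copied from the source listing above.
    return np.array(y_pred == y_true, dtype=y_pred.dtype)


y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])  # the third prediction is wrong

acc = point_wise_accuracy(y_true, y_pred)
mean_accuracy = acc.mean()  # fraction of correct predictions
```

Each entry is 1 where the prediction matches the label and 0 where it does not, cast to the dtype of y_pred, so higher values indicate better predictions.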

neg_l2_distance

neg_l2_distance(y_true: NDArray[T], y_pred: NDArray[T]) -> NDArray[T]

Point-wise negative \(l_2\) distance between two arrays

PARAMETER DESCRIPTION
y_true

Array of true values (e.g. labels)

TYPE: NDArray[T]

y_pred

Array of estimated values (e.g. model predictions)

TYPE: NDArray[T]

RETURNS DESCRIPTION
NDArray[T]

Array with point-wise negative \(l_2\) distances between labels and model predictions

Source code in src/pydvl/valuation/methods/data_oob.py
def neg_l2_distance(y_true: NDArray[T], y_pred: NDArray[T]) -> NDArray[T]:
    r"""Point-wise negative $l_2$ distance between two arrays

    Args:
        y_true: Array of true values (e.g. labels)
        y_pred: Array of estimated values (e.g. model predictions)

    Returns:
        Array with point-wise negative $l_2$ distances between labels and model
        predictions
    """
    return -np.square(np.array(y_pred - y_true), dtype=y_pred.dtype)
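The following usage sketch shows the function on made-up regression targets. The body is copied from the listing above. Note that, as written, it squares the element-wise difference, so each entry is the negative squared difference between prediction and target.

```python
import numpy as np


def neg_l2_distance(y_true, y_pred):
    # Body copied from the source listing above.
    return -np.square(np.array(y_pred - y_true), dtype=y_pred.dtype)


y_true = np.array([1.0, 2.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0])

neg_sq = neg_l2_distance(y_true, y_pred)
```

All entries are non-positive, and values closer to zero indicate predictions closer to the target.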