
pydvl.valuation.methods.data_oob

This module implements the method described in Kwon and Zou (2023) [1].

Data-OOB value is tailored to bagging models. It defines the value of a data point as the average score obtained on that point by the estimators which were not fit on it (its out-of-bag estimators).

As such it is not a semi-value, and it is not based on marginal contributions.
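The definition above can be illustrated with a small NumPy-only sketch. It does not use pyDVL or scikit-learn: the nearest-centroid learner, the toy data, and all names (`fit_centroids`, `predict`, `flipped`) are assumptions made up for this illustration. Each estimator is fit on a bootstrap sample, and every point it did not see receives a 0-1 score, which is then averaged over estimators.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two well-separated 1-D clusters, with one deliberately flipped label.
n = 40
x = np.concatenate([rng.normal(-3.0, 0.5, n // 2), rng.normal(3.0, 0.5, n // 2)])
y = np.concatenate([np.zeros(n // 2, dtype=int), np.ones(n // 2, dtype=int)])
flipped = 0
y[flipped] = 1  # this point lies in cluster 0 but is labeled 1


def fit_centroids(x_train, y_train):
    """Hypothetical weak learner: one centroid per class."""
    return np.array([x_train[y_train == c].mean() for c in (0, 1)])


def predict(centroids, x_new):
    """Predict the class whose centroid is closest."""
    return np.abs(x_new[:, None] - centroids[None, :]).argmin(axis=1)


n_estimators = 50
score_sum = np.zeros(n)
oob_count = np.zeros(n, dtype=int)

for _ in range(n_estimators):
    sampled = rng.integers(0, n, size=n)       # bootstrap sample, with replacement
    oob = np.setdiff1d(np.arange(n), sampled)  # points this estimator never saw
    centroids = fit_centroids(x[sampled], y[sampled])
    score_sum[oob] += predict(centroids, x[oob]) == y[oob]  # point-wise 0-1 score
    oob_count[oob] += 1

# Data-OOB value: average out-of-bag score per point (0 if a point was never OOB).
values = score_sum / np.maximum(oob_count, 1)
```

In this toy setting the mislabeled point ends up with a much lower value than the correctly labeled ones, which is the kind of signal the method is meant to provide.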

Info

For details on the method, and a discussion of how and whether to apply it to other models by bagging them a posteriori, see the main documentation.

References


  1. Kwon, Yongchan, and James Zou. Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value. In Proceedings of the 40th International Conference on Machine Learning, 18135–52. PMLR, 2023. 

DataOOBValuation

DataOOBValuation(model: BaggingModel, score: PointwiseScore | None = None)

Bases: Valuation

Computes Data Out-Of-Bag values.

This class implements the method described in Kwon and Zou (2023) [1].

PARAMETER DESCRIPTION
model

A fitted bagging model. Bagging models in sklearn include BaggingClassifier, BaggingRegressor, IsolationForest, RandomForest*, ExtraTrees*, or any model which defines an attribute estimators_ and uses bootstrapped subsamples to compute predictions.

TYPE: BaggingModel

score

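A callable for point-wise comparison of true values with the predictions. It is called with the keyword arguments y_true and y_pred and returns one score per sample, where higher is better. If None, uses point-wise accuracy for classifiers and negative \(l_2\) distance for regressors.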

TYPE: PointwiseScore | None DEFAULT: None
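As a sketch of what a custom score could look like, the function below returns the negative absolute error per sample. The name `neg_abs_error` is hypothetical and not part of the library; the only assumption taken from the source listing is that the callable is invoked as `score(y_true=..., y_pred=...)` and returns an array with one entry per sample.

```python
import numpy as np
from numpy.typing import NDArray


def neg_abs_error(y_true: NDArray, y_pred: NDArray) -> NDArray:
    """Hypothetical point-wise score: negative absolute error, higher is better."""
    return -np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float))


# Called with keyword arguments, as the valuation's fit() does with its score.
scores = neg_abs_error(
    y_true=np.array([1.0, 2.0, 3.0]),
    y_pred=np.array([1.5, 2.0, 1.0]),
)
```

Such a callable would presumably be passed as the `score` argument of the constructor in place of the default.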

Source code in src/pydvl/valuation/methods/data_oob.py
def __init__(
    self,
    model: BaggingModel,
    score: PointwiseScore | None = None,
):
    super().__init__()
    self.model = model
    self.score = score
    self.algorithm_name = f"Data-OOB-{str(self.model)}"

result property

The current valuation result (not a copy).

fit

fit(data: Dataset, continue_from: ValuationResult | None = None) -> Self

Compute the Data-OOB values.

This requires the bagging model passed upon construction to be fitted.

PARAMETER DESCRIPTION
data

Data for which to compute values

TYPE: Dataset

continue_from

A previously computed valuation result to continue from.

TYPE: ValuationResult | None DEFAULT: None

RETURNS DESCRIPTION
Self

The fitted object.

Source code in src/pydvl/valuation/methods/data_oob.py
def fit(self, data: Dataset, continue_from: ValuationResult | None = None) -> Self:
    """Compute the Data-OOB values.

    This requires the bagging model passed upon construction to be fitted.

    Args:
        data: Data for which to compute values
        continue_from: A previously computed valuation result to continue from.

    Returns:
        The fitted object.
    """

    self._result = self._init_or_check_result(data, continue_from)

    check_is_fitted(
        self.model,
        msg="The bagging model has to be fitted before calling the valuation method.",
    )

    # This should always be present after fitting
    try:
        estimators = self.model.estimators_  # type: ignore
    except AttributeError:
        raise ValueError(
            "The model has to be an sklearn-compatible bagging model, including "
            "BaggingClassifier, BaggingRegressor, IsolationForest, RandomForest*, "
            "and ExtraTrees*"
        )

    if self.score is None:
        self.score = (
            point_wise_accuracy if is_classifier(self.model) else neg_l2_distance
        )

    if hasattr(self.model, "estimators_samples_"):  # Bagging(Classifier|Regressor)
        unsampled_indices = [
            np.setxor1d(data.indices, np.unique(sampled))
            for sampled in self.model.estimators_samples_
        ]
    else:  # RandomForest*, ExtraTrees*, IsolationForest
        n_samples_bootstrap = _get_n_samples_bootstrap(
            len(data), self.model.max_samples
        )
        unsampled_indices = [
            _generate_unsampled_indices(
                est.random_state, len(data.indices), n_samples_bootstrap
            )
            for est in estimators
        ]

    for est, oob_indices in zip(estimators, unsampled_indices):
        subset = data[oob_indices].data()
        score_array = self.score(y_true=subset.y, y_pred=est.predict(subset.x))
        self._result += ValuationResult(
            algorithm=str(self),
            indices=oob_indices,
            names=data[oob_indices].names,
            values=score_array,
            counts=np.ones_like(score_array, dtype=data.indices.dtype),
        )
    return self
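For forest-type models, the listing above recovers the out-of-bag indices by replaying each estimator's bootstrap draw from its seed. The following is a rough NumPy sketch of that idea only. It is not scikit-learn's private helper, and the function name `unsampled_indices` and its arguments are assumptions for illustration.

```python
import numpy as np


def unsampled_indices(seed: int, n_samples: int, n_samples_bootstrap: int) -> np.ndarray:
    """Re-draw a bootstrap sample from its seed and return the indices it missed."""
    rng = np.random.RandomState(seed)
    sampled = rng.randint(0, n_samples, n_samples_bootstrap)
    # Count how often each index was drawn; out-of-bag means a count of zero.
    counts = np.bincount(sampled, minlength=n_samples)
    return np.arange(n_samples)[counts == 0]


# The same seed always reproduces the same out-of-bag set.
oob_a = unsampled_indices(seed=42, n_samples=10, n_samples_bootstrap=10)
oob_b = unsampled_indices(seed=42, n_samples=10, n_samples_bootstrap=10)
```

Because the draw is fully determined by the seed, the sampled indices do not need to be stored with the model, which is why only the per-estimator random state is required in that branch.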

values

values(sort: bool = False) -> ValuationResult

Returns a copy of the valuation result.

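The valuation must have been run with fit() before calling this method; otherwise a NotFittedException is raised. The method is marked as deprecated in 0.10.0 and scheduled for removal in 0.11.0.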

PARAMETER DESCRIPTION
sort

Whether to sort the valuation result by value before returning it.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
ValuationResult

The result of the valuation.

Source code in src/pydvl/valuation/base.py
@deprecated(
    target=None,
    deprecated_in="0.10.0",
    remove_in="0.11.0",
)
def values(self, sort: bool = False) -> ValuationResult:
    """Returns a copy of the valuation result.

    The valuation must have been run with `fit()` before calling this method.

    Args:
        sort: Whether to sort the valuation result by value before returning it.
    Returns:
        The result of the valuation.
    """
    if not self.is_fitted:
        raise NotFittedException(type(self))
    assert self._result is not None

    r = self._result.copy()
    if sort:
        r.sort(inplace=True)
    return r

neg_l2_distance

neg_l2_distance(y_true: NDArray[T], y_pred: NDArray[T]) -> NDArray[T]

Point-wise negative \(l_2\) distance between two arrays, computed as the negative squared difference of each pair of entries.

Higher is better.

PARAMETER DESCRIPTION
y_true

Array of true values (e.g. labels)

TYPE: NDArray[T]

y_pred

Array of estimated values (e.g. model predictions)

TYPE: NDArray[T]

RETURNS DESCRIPTION
NDArray[T]

Array with point-wise negative \(l_2\) distances between labels and model predictions

Source code in src/pydvl/valuation/methods/data_oob.py
def neg_l2_distance(y_true: NDArray[T], y_pred: NDArray[T]) -> NDArray[T]:
    r"""Point-wise negative $l_2$ distance between two arrays.

    Higher is better.

    Args:
        y_true: Array of true values (e.g. labels)
        y_pred: Array of estimated values (e.g. model predictions)

    Returns:
        Array with point-wise negative $l_2$ distances between labels and model
        predictions
    """
    return -np.square(np.array(y_pred - y_true), dtype=y_pred.dtype)
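A short worked example may help to see what this function returns. The one-line body is copied from the listing above so that the snippet runs on its own; the input arrays are made up for illustration.

```python
import numpy as np


def neg_l2_distance(y_true, y_pred):
    # Same one-liner as in the listing above, repeated to keep the snippet self-contained.
    return -np.square(np.array(y_pred - y_true), dtype=y_pred.dtype)


y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 1.0])

# Differences are 0.5, 0.0 and -2.0, so the scores are -0.25, 0 and -4.
scores = neg_l2_distance(y_true, y_pred)
```

Note that the result is the negative squared difference per entry, so larger errors are penalized quadratically and a perfect prediction scores 0, the maximum.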

point_wise_accuracy

point_wise_accuracy(y_true: NDArray[T], y_pred: NDArray[T]) -> NDArray[T]

Point-wise accuracy, or 0-1 score between two arrays.

Higher is better.

PARAMETER DESCRIPTION
y_true

Array of true values (e.g. labels)

TYPE: NDArray[T]

y_pred

Array of estimated values (e.g. model predictions)

TYPE: NDArray[T]

RETURNS DESCRIPTION
NDArray[T]

Array with point-wise 0-1 accuracy between labels and model predictions

Source code in src/pydvl/valuation/methods/data_oob.py
def point_wise_accuracy(y_true: NDArray[T], y_pred: NDArray[T]) -> NDArray[T]:
    """Point-wise accuracy, or 0-1 score between two arrays.

    Higher is better.

    Args:
        y_true: Array of true values (e.g. labels)
        y_pred: Array of estimated values (e.g. model predictions)

    Returns:
        Array with point-wise 0-1 accuracy between labels and model predictions
    """
    return np.array(y_pred == y_true, dtype=y_pred.dtype)
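The same kind of worked example for the 0-1 score, again with the one-line body copied from the listing above and with made-up integer labels:

```python
import numpy as np


def point_wise_accuracy(y_true, y_pred):
    # Same one-liner as in the listing above, repeated to keep the snippet self-contained.
    return np.array(y_pred == y_true, dtype=y_pred.dtype)


y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])

# 1 where the prediction matches the label, 0 otherwise.
hits = point_wise_accuracy(y_true, y_pred)
```

Averaging `hits` gives the usual accuracy over these four samples, while Data-OOB instead averages each sample's score across the estimators for which it was out of bag.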