pydvl.valuation.methods.data_oob

This module implements the method described in (Kwon and Zou, 2023)1.

A data point's Data-OOB value is defined for bagging models. It is the average score of the weak learners that were not fit on that point.

Let \(w_{bi} \in \mathbb{Z}_{\geq 0}\) be the number of times the i-th datum \((x_i, y_i)\) is selected in the b-th bootstrap dataset. The Data-OOB value is computed as follows:

\[ \psi((x_i,y_i),\Theta_B):=\frac{\sum_{b=1}^{B}\mathbb{1}(w_{bi}=0)T(y_i, \hat{f}_b(x_i))}{\sum_{b=1}^{B} \mathbb{1}(w_{bi}=0)}, \]

where \(T: Y \times Y \rightarrow \mathbb{R}\) is a score function that represents the goodness of a weak learner \(\hat{f}_b\) at the i-th datum \((x_i, y_i)\).
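As an illustration (a minimal NumPy sketch, not part of pyDVL's API), the values can be computed from a matrix of bootstrap counts \(w_{bi}\) and a matrix of point-wise scores \(T(y_i, \hat{f}_b(x_i))\):

import numpy as np

def data_oob_values(w: np.ndarray, scores: np.ndarray) -> np.ndarray:
    # w[b, i]: number of times datum i appears in the b-th bootstrap sample (shape B x n)
    # scores[b, i]: point-wise score T(y_i, f_b(x_i)) of estimator b at datum i
    oob_mask = w == 0                      # indicator 1(w_{bi} = 0)
    n_oob = oob_mask.sum(axis=0)           # denominator: how often each datum was out-of-bag
    totals = np.where(oob_mask, scores, 0.0).sum(axis=0)  # numerator
    # Points that were never out-of-bag have an undefined value; mark them with NaN
    return np.divide(totals, n_oob, out=np.full(w.shape[1], np.nan), where=n_oob > 0)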

References


  1. Kwon, Yongchan, and James Zou. Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value. In Proceedings of the 40th International Conference on Machine Learning, 18135–52. PMLR, 2023. 

DataOOBValuation

DataOOBValuation(model: BaggingModel, score: PointwiseScore | None = None)

Bases: Valuation

Computes Data Out-Of-Bag values.

This class implements the method described in (Kwon and Zou, 2023)1.

PARAMETER DESCRIPTION
model

A fitted bagging model. Bagging models in sklearn include BaggingClassifier, BaggingRegressor, IsolationForest, RandomForest, ExtraTrees, or any model which defines an attribute estimators_ and uses bootstrapped subsamples to compute predictions.

TYPE: BaggingModel

score

A callable for point-wise comparison of true values with the predictions. If None, uses point-wise accuracy for classifiers and negative \(l_2\) distance for regressors.

TYPE: PointwiseScore | None DEFAULT: None

Source code in src/pydvl/valuation/methods/data_oob.py
def __init__(
    self,
    model: BaggingModel,
    score: PointwiseScore | None = None,
):
    super().__init__()
    self.model = model
    self.score = score
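
A minimal usage sketch, assuming a Dataset.from_arrays constructor and that Dataset.data() returns an (x, y) pair as used in fit() below (both are assumptions about the surrounding pydvl.valuation API):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

from pydvl.valuation.dataset import Dataset  # import path assumed
from pydvl.valuation.methods.data_oob import DataOOBValuation

X, y = make_classification(n_samples=200, random_state=0)
train, _ = Dataset.from_arrays(X, y, random_state=0)  # from_arrays helper assumed

# The bagging model must be fitted before the valuation is run
model = BaggingClassifier(n_estimators=50, random_state=0)
model.fit(*train.data())  # data() assumed to return an (x, y) pair

valuation = DataOOBValuation(model)   # default score: point-wise accuracy for classifiers
valuation.fit(train)
result = valuation.values(sort=True)  # a sorted copy of the ValuationResult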

values

values(sort: bool = False) -> ValuationResult

Returns a copy of the valuation result.

The valuation must have been run with fit() before calling this method.

PARAMETER DESCRIPTION
sort

Whether to sort the valuation result by value before returning it.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
ValuationResult

The result of the valuation.

Source code in src/pydvl/valuation/base.py
def values(self, sort: bool = False) -> ValuationResult:
    """Returns a copy of the valuation result.

    The valuation must have been run with `fit()` before calling this method.

    Args:
        sort: Whether to sort the valuation result by value before returning it.
    Returns:
        The result of the valuation.
    """
    if not self.is_fitted:
        raise NotFittedException(type(self))
    assert self.result is not None

    from copy import copy

    r = copy(self.result)
    if sort:
        r.sort()
    return r

fit

fit(data: Dataset) -> Self

Compute the Data-OOB values.

This requires the bagging model passed upon construction to be fitted.

PARAMETER DESCRIPTION
data

Data for which to compute values

TYPE: Dataset

RETURNS DESCRIPTION
Self

The fitted object.

Source code in src/pydvl/valuation/methods/data_oob.py
def fit(self, data: Dataset) -> Self:
    """Compute the Data-OOB values.

    This requires the bagging model passed upon construction to be fitted.

    Args:
        data: Data for which to compute values

    Returns:
        The fitted object.
    """
    # TODO: automate str representation for all Valuations
    algorithm_name = f"Data-OOB-{str(self.model)}"
    self.result = ValuationResult.empty(
        algorithm=algorithm_name,
        indices=data.indices,
        data_names=data.names,
    )

    check_is_fitted(
        self.model,
        msg="The bagging model has to be fitted before calling the valuation method.",
    )

    # This should always be present after fitting
    try:
        estimators = self.model.estimators_  # type: ignore
    except AttributeError:
        raise ValueError(
            "The model has to be an sklearn-compatible bagging model, including "
            "BaggingClassifier, BaggingRegressor, IsolationForest, RandomForest*, "
            "and ExtraTrees*"
        )

    if self.score is None:
        self.score = (
            point_wise_accuracy if is_classifier(self.model) else neg_l2_distance
        )

    if hasattr(self.model, "estimators_samples_"):  # Bagging(Classifier|Regressor)
        unsampled_indices = [
            np.setxor1d(data.indices, np.unique(sampled))
            for sampled in self.model.estimators_samples_
        ]
    else:  # RandomForest*, ExtraTrees*, IsolationForest
        n_samples_bootstrap = _get_n_samples_bootstrap(
            len(data), self.model.max_samples
        )
        unsampled_indices = [
            _generate_unsampled_indices(
                est.random_state, len(data.indices), n_samples_bootstrap
            )
            for est in estimators
        ]

    for est, oob_indices in zip(estimators, unsampled_indices):
        subset = data[oob_indices].data()
        score_array = self.score(y_true=subset.y, y_pred=est.predict(subset.x))
        self.result += ValuationResult(
            algorithm=algorithm_name,
            indices=oob_indices,
            names=data[oob_indices].names,
            values=score_array,
            counts=np.ones_like(score_array, dtype=data.indices.dtype),
        )
    return self
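
When the model exposes estimators_samples_ (e.g. BaggingClassifier and BaggingRegressor), the out-of-bag indices are taken directly from it, as in the first branch above. A stand-alone sketch of that step:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=100, random_state=0)
model = BaggingClassifier(n_estimators=10, random_state=0).fit(X, y)

indices = np.arange(len(X))
# For each weak learner, the OOB points are those never drawn into its bootstrap sample
unsampled_indices = [
    np.setxor1d(indices, np.unique(sampled))
    for sampled in model.estimators_samples_
]

When the model does not expose estimators_samples_, the out-of-bag indices are reconstructed from each estimator's random_state via sklearn's private helpers _get_n_samples_bootstrap and _generate_unsampled_indices, as in the second branch.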

point_wise_accuracy

point_wise_accuracy(y_true: NDArray[T], y_pred: NDArray[T]) -> NDArray[T]

Point-wise accuracy, or 0-1 score between two arrays.

Higher is better.

PARAMETER DESCRIPTION
y_true

Array of true values (e.g. labels)

TYPE: NDArray[T]

y_pred

Array of estimated values (e.g. model predictions)

TYPE: NDArray[T]

RETURNS DESCRIPTION
NDArray[T]

Array with point-wise 0-1 accuracy between labels and model predictions

Source code in src/pydvl/valuation/methods/data_oob.py
def point_wise_accuracy(y_true: NDArray[T], y_pred: NDArray[T]) -> NDArray[T]:
    """Point-wise accuracy, or 0-1 score between two arrays.

    Higher is better.

    Args:
        y_true: Array of true values (e.g. labels)
        y_pred: Array of estimated values (e.g. model predictions)

    Returns:
        Array with point-wise 0-1 accuracy between labels and model predictions
    """
    return np.array(y_pred == y_true, dtype=y_pred.dtype)
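
For example:

import numpy as np

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
point_wise_accuracy(y_true, y_pred)  # array([1, 1, 0, 1])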

neg_l2_distance

neg_l2_distance(y_true: NDArray[T], y_pred: NDArray[T]) -> NDArray[T]

Point-wise negative \(l_2\) distance between two arrays.

Higher is better.

PARAMETER DESCRIPTION
y_true

Array of true values (e.g. labels)

TYPE: NDArray[T]

y_pred

Array of estimated values (e.g. model predictions)

TYPE: NDArray[T]

RETURNS DESCRIPTION
NDArray[T]

Array with point-wise negative \(l_2\) distances between labels and model predictions

Source code in src/pydvl/valuation/methods/data_oob.py
def neg_l2_distance(y_true: NDArray[T], y_pred: NDArray[T]) -> NDArray[T]:
    r"""Point-wise negative $l_2$ distance between two arrays.

    Higher is better.

    Args:
        y_true: Array of true values (e.g. labels)
        y_pred: Array of estimated values (e.g. model predictions)

    Returns:
        Array with point-wise negative $l_2$ distances between labels and model
        predictions
    """
    return -np.square(np.array(y_pred - y_true), dtype=y_pred.dtype)
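
For example:

import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 1.0])
neg_l2_distance(y_true, y_pred)  # array([-0.25, -0., -4.])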