pydvl.valuation.methods.data_oob
This module implements the method described in (Kwon and Zou, 2023) [1].
Data-OOB values are defined for bagging models. The value of a data point is the average score, evaluated at that point, of the estimators which were not fit on it, i.e. those for which the point is out-of-bag.
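Let \(w_{bj} \in \mathbb{Z}\) be the number of times the j-th datum \((x_j, y_j)\) is selected in the b-th bootstrap dataset, for \(b = 1, \dots, B\). The Data-OOB value of the i-th datum is computed as follows:
\[
\psi(x_i, y_i) := \frac{\sum_{b=1}^{B} \mathbb{1}(w_{bi} = 0)\, T(y_i, \hat{f}_b(x_i))}{\sum_{b=1}^{B} \mathbb{1}(w_{bi} = 0)},
\]
where \(B\) is the number of weak learners and \(T: Y \times Y \rightarrow \mathbb{R}\) is a score function that represents the goodness of a weak learner \(\hat{f}_b\) at the i-th datum \((x_i, y_i)\).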
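The definition can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the function `data_oob_sketch`, the helper `zero_one` and the toy inputs are hypothetical names introduced here, and returning NaN for a point that is never out-of-bag is an assumption of this sketch.

```python
from typing import Callable, List, Sequence


def data_oob_sketch(
    counts: Sequence[Sequence[int]],
    predictions: Sequence[Sequence[float]],
    targets: Sequence[float],
    score: Callable[[float, float], float],
) -> List[float]:
    """Average the score of each point over the estimators that never saw it.

    counts[b][i] plays the role of w_bi and predictions[b][i] that of f_b(x_i).
    """
    values = []
    for i, y_i in enumerate(targets):
        # Indices of estimators for which point i is out-of-bag (w_bi == 0).
        oob = [b for b, row in enumerate(counts) if row[i] == 0]
        if not oob:
            # Assumption of this sketch: the value is undefined (NaN) when
            # the point appears in every bootstrap sample.
            values.append(float("nan"))
            continue
        total = sum(score(y_i, predictions[b][i]) for b in oob)
        values.append(total / len(oob))
    return values


def zero_one(y_true: float, y_pred: float) -> float:
    """Hypothetical score T: 1 if the prediction matches the label, else 0."""
    return 1.0 if y_true == y_pred else 0.0


# Three bootstrap samples (rows) over three data points (columns).
counts = [[0, 2, 1], [1, 0, 2], [0, 1, 2]]
predictions = [[1, 0, 1], [0, 1, 1], [0, 1, 0]]
targets = [1, 0, 1]
values = data_oob_sketch(counts, predictions, targets, zero_one)
```

In this toy input the first point is out-of-bag for two estimators, one of which predicts it correctly, the second point is out-of-bag for a single estimator that mispredicts it, and the third point is never out-of-bag.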
References
[1] Kwon, Yongchan, and James Zou. Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value. In Proceedings of the 40th International Conference on Machine Learning, 18135–52. PMLR, 2023.
DataOOBValuation
DataOOBValuation(model: BaggingModel, score: PointwiseScore | None = None)
Bases: Valuation
Computes Data Out-Of-Bag values.
This class implements the method described in (Kwon and Zou, 2023) [1].
PARAMETER | DESCRIPTION |
---|---|
model | A fitted bagging model. Bagging models in sklearn include BaggingClassifier, BaggingRegressor, IsolationForest, RandomForest, ExtraTrees, or any model which defines an attribute … TYPE: BaggingModel |
score | A callable for point-wise comparison of true values with the predictions. If … TYPE: PointwiseScore \| None, DEFAULT: None |
Source code in src/pydvl/valuation/methods/data_oob.py
values
values(sort: bool = False) -> ValuationResult
Returns a copy of the valuation result.
The valuation must have been run with fit() before calling this method.
PARAMETER | DESCRIPTION |
---|---|
sort | Whether to sort the valuation result by value before returning it. TYPE: bool, DEFAULT: False |
RETURNS | DESCRIPTION |
---|---|
ValuationResult | The result of the valuation. |
Source code in src/pydvl/valuation/base.py
fit
fit(data: Dataset) -> Self
Compute the Data-OOB values.
This requires the bagging model passed upon construction to be fitted.
PARAMETER | DESCRIPTION |
---|---|
data | Data for which to compute values. TYPE: Dataset |
RETURNS | DESCRIPTION |
---|---|
Self | The fitted object. |
Source code in src/pydvl/valuation/methods/data_oob.py
point_wise_accuracy
Point-wise accuracy, or 0-1 score between two arrays.
Higher is better.
PARAMETER | DESCRIPTION |
---|---|
y_true | Array of true values (e.g. labels) |
y_pred | Array of estimated values (e.g. model predictions) |
RETURNS | DESCRIPTION |
---|---|
NDArray[T] | Array with point-wise 0-1 accuracy between labels and model predictions |
Source code in src/pydvl/valuation/methods/data_oob.py
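As an illustration of what a point-wise 0-1 score computes, here is a minimal sketch on plain Python lists. The name `pointwise_accuracy_sketch` is hypothetical and the code is not the library's implementation, which operates on arrays.

```python
from typing import List, Sequence


def pointwise_accuracy_sketch(y_true: Sequence, y_pred: Sequence) -> List[float]:
    """Return 1.0 where the prediction equals the label and 0.0 elsewhere."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return [1.0 if t == p else 0.0 for t, p in zip(y_true, y_pred)]


# Only the third prediction differs from its label.
acc_scores = pointwise_accuracy_sketch([0, 1, 1, 2], [0, 1, 0, 2])
```

Unlike an aggregate accuracy, the result keeps one entry per data point, which is what allows the score to be averaged separately for each point over its out-of-bag estimators.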
neg_l2_distance
Point-wise negative \(l_2\) distance between two arrays.
Higher is better.
PARAMETER | DESCRIPTION |
---|---|
y_true | Array of true values (e.g. labels) |
y_pred | Array of estimated values (e.g. model predictions) |
RETURNS | DESCRIPTION |
---|---|
NDArray[T] | Array with point-wise negative \(l_2\) distances between labels and model predictions |
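A point-wise negative distance for scalar targets might be sketched as follows. This is an assumption-laden illustration, not the library's code: the name `neg_l2_distance_sketch` is hypothetical, and the distance between two scalars is taken to be their absolute difference. Whether the library squares the difference is not stated in this section.

```python
from typing import List, Sequence


def neg_l2_distance_sketch(
    y_true: Sequence[float], y_pred: Sequence[float]
) -> List[float]:
    """Negated absolute difference per point, so that higher is better."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Assumption: for scalars, the l2 distance reduces to |y_true - y_pred|.
    return [-abs(t - p) for t, p in zip(y_true, y_pred)]


dist_scores = neg_l2_distance_sketch([1.0, 2.0, 3.0], [1.5, 1.0, 3.25])
```

Negating the distance keeps the convention used throughout this module that larger scores indicate better predictions.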