pydvl.valuation.methods.data_oob
This module implements the Data-OOB method described in (Kwon and Zou, 2023). It fits a bagging classifier or regressor to the data, using a given model as the base estimator. A data point's value is the average loss of the estimators that were not fit on it.
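Let \(w_{bj} \in \mathbb{Z}_{\geq 0}\) be the number of times the j-th datum \((x_j, y_j)\) is selected in the b-th bootstrap dataset, for \(b = 1, \dots, B\). The value of the i-th datum is the average score over the estimators for which it is out-of-bag:
\[
\psi(x_i, y_i) := \frac{\sum_{b=1}^{B} \mathbb{1}(w_{bi} = 0)\, T(y_i, \hat{f}_b(x_i))}{\sum_{b=1}^{B} \mathbb{1}(w_{bi} = 0)},
\]
where \(T: Y \times Y \rightarrow \mathbb{R}\) is a score function that represents the goodness of a weak learner \(\hat{f}_b\) at the i-th datum \((x_i, y_i)\).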
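The averaging in this definition can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the library's code: `oob_values`, `weights`, and `scores` are hypothetical names, with `weights[b][i]` standing for \(w_{bi}\) and `scores[b][i]` for \(T(y_i, \hat{f}_b(x_i))\), both assumed to be precomputed.

```python
import math


def oob_values(weights, scores):
    """Average each datum's score over the estimators for which it is out-of-bag.

    weights[b][i]: number of times datum i was drawn for estimator b.
    scores[b][i]:  score of estimator b at datum i (assumed precomputed).
    """
    n_estimators = len(weights)
    n_samples = len(weights[0])
    values = []
    for i in range(n_samples):
        # Keep only the estimators that never saw datum i during training.
        oob_scores = [
            scores[b][i] for b in range(n_estimators) if weights[b][i] == 0
        ]
        # A datum that is never out-of-bag has no defined value.
        values.append(sum(oob_scores) / len(oob_scores) if oob_scores else math.nan)
    return values


# Three estimators (rows), three data points (columns).
weights = [
    [2, 0, 1],
    [0, 1, 2],
    [1, 0, 2],
]
scores = [
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
]
values = oob_values(weights, scores)
```

In this toy input the third datum is in-bag for every estimator, so its value is undefined and the sketch returns NaN for it.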
Warning
This implementation is a placeholder and does not match exactly the method described in the paper.
References
Kwon, Yongchan, and James Zou. Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value. In Proceedings of the 40th International Conference on Machine Learning, 18135–52. PMLR, 2023.
DataOOBValuation
DataOOBValuation(
    model: SupervisedModel,
    n_estimators: int,
    max_samples: float = 0.8,
    loss: LossFunction | None = None,
    seed: Seed | None = None,
)
Bases: Valuation
Computes Data Out-Of-Bag values.
Tip
n_estimators and max_samples must be tuned jointly to ensure that every sample is out-of-bag at least once; otherwise the result could include a NaN value for that datum.
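To get a feel for how the two parameters interact, the following rough sketch estimates the chance that a sample is never out-of-bag. It assumes that each estimator draws `int(max_samples * n_samples)` indices uniformly with replacement, independently of the other estimators; the actual sampling scheme of the implementation may differ. The function names are hypothetical.

```python
def prob_never_oob(n_samples, n_estimators, max_samples):
    """Probability that one fixed sample is in-bag for every estimator.

    Assumes uniform sampling with replacement of int(max_samples * n_samples)
    indices per estimator, independently across estimators.
    """
    draws = int(max_samples * n_samples)
    # Probability that the sample is picked at least once in a single bag.
    p_in_bag = 1.0 - (1.0 - 1.0 / n_samples) ** draws
    # Independence across estimators: in-bag every time.
    return p_in_bag ** n_estimators


def expected_never_oob(n_samples, n_estimators, max_samples):
    """Expected number of samples that would end up with a NaN value."""
    return n_samples * prob_never_oob(n_samples, n_estimators, max_samples)


for n_estimators in (10, 50, 100):
    print(n_estimators, expected_never_oob(100, n_estimators, 0.8))
```

Under these assumptions, raising `n_estimators` or lowering `max_samples` both shrink the expected number of undefined values.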
PARAMETER | DESCRIPTION |
---|---|
data | dataset |
model | Model used as the base estimator of the bagging procedure. TYPE: SupervisedModel |
n_estimators | Number of estimators used in the bagging procedure. TYPE: int |
max_samples | The fraction of samples to draw to train each base estimator. TYPE: float DEFAULT: 0.8 |
loss | A function taking the data labels and the corresponding model predictions, in the order (y_true, y_pred), and returning an array of point-wise errors. TYPE: LossFunction \| None DEFAULT: None |
seed | Either an instance of a numpy random number generator or a seed for it. TYPE: Seed \| None DEFAULT: None |
RETURNS | DESCRIPTION |
---|---|
Object with the data values. |
Note: this is an extended pyDVL implementation of the Data-OOB valuation method, which bags whatever model is passed to it. The paper only considers bagging models as input.
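The idea of bagging an arbitrary model and scoring each datum only on the estimators that did not see it can be sketched end to end in plain Python. Everything here is hypothetical and for illustration only: `MajorityClass` is a toy model with `fit` and `predict`, `zero_one_score` is a stand-in score function following the `(y_true, y_pred)` convention, and `data_oob` is not the library's API. The sketch assumes sampling with replacement and uses lists instead of numpy arrays to stay dependency-free.

```python
import math
import random


class MajorityClass:
    """Hypothetical toy model: always predicts the most frequent training label."""

    def fit(self, x, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, x):
        return [self.label_ for _ in x]


def zero_one_score(y_true, y_pred):
    """Point-wise score: 1.0 where the prediction matches the label, else 0.0."""
    return [1.0 if t == p else 0.0 for t, p in zip(y_true, y_pred)]


def data_oob(model_factory, x, y, n_estimators, max_samples=0.8,
             score=zero_one_score, seed=None):
    rng = random.Random(seed)
    n = len(x)
    n_draws = max(1, int(max_samples * n))
    totals = [0.0] * n  # sum of out-of-bag scores per datum
    counts = [0] * n    # number of estimators for which the datum was out-of-bag
    for _ in range(n_estimators):
        # Bootstrap sample: indices drawn with replacement (an assumption).
        in_bag = [rng.randrange(n) for _ in range(n_draws)]
        model = model_factory().fit([x[i] for i in in_bag], [y[i] for i in in_bag])
        seen = set(in_bag)
        oob = [i for i in range(n) if i not in seen]
        if not oob:
            continue
        preds = model.predict([x[i] for i in oob])
        for i, s in zip(oob, score([y[i] for i in oob], preds)):
            totals[i] += s
            counts[i] += 1
    # Data that were never out-of-bag get NaN.
    return [t / c if c else math.nan for t, c in zip(totals, counts)]


x = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2
values = data_oob(MajorityClass, x, y, n_estimators=50, max_samples=0.8, seed=0)
```

With a majority-class model one would expect the points carrying the frequent label to receive higher values than the rare ones, since most bootstrap samples predict the frequent label.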
Source code in src/pydvl/valuation/methods/data_oob.py
values
values(sort: bool = False) -> ValuationResult
Returns a copy of the valuation result.
The valuation must have been run with fit() before calling this method.
PARAMETER | DESCRIPTION |
---|---|
sort | Whether to sort the valuation result before returning it. TYPE: bool DEFAULT: False |
Returns: The result of the valuation.
Source code in src/pydvl/valuation/base.py
point_wise_accuracy
Point-wise 0-1 loss between two arrays
PARAMETER | DESCRIPTION |
---|---|
y_true | Array of true values (e.g. labels) |
y_pred | Array of estimated values (e.g. model predictions) |
RETURNS | DESCRIPTION |
---|---|
NDArray[T] | Array with point-wise 0-1 losses between labels and model predictions |
Source code in src/pydvl/valuation/methods/data_oob.py
neg_l2_distance
Point-wise negative \(l_2\) distance between two arrays
PARAMETER | DESCRIPTION |
---|---|
y_true | Array of true values (e.g. labels) |
y_pred | Array of estimated values (e.g. model predictions) |
RETURNS | DESCRIPTION |
---|---|
NDArray[T] | Array with point-wise negative \(l_2\) distances between labels and model predictions |
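The negative sign makes larger values correspond to better predictions, in line with the score function \(T\) measuring goodness. As a rough illustration of that convention, and not the library's code, the sketch below assumes scalar targets, for which the \(l_2\) distance reduces to an absolute difference. The name `neg_abs_distance` is hypothetical, and the sketch uses lists rather than numpy arrays.

```python
def neg_abs_distance(y_true, y_pred):
    """Point-wise negative distance for scalar targets: closer predictions score higher."""
    return [-abs(t - p) for t, p in zip(y_true, y_pred)]


# A perfect prediction scores 0; the score decreases as the error grows.
scores = neg_abs_distance([1.0, 2.0, 3.0], [1.0, 2.5, 1.0])
```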