pydvl.reporting.point_removal
This module implements the standard point removal experiment in data valuation.
It is a method to evaluate the usefulness and stability of a valuation method. The idea is to remove a percentage of the data points from the training set based on their valuation, and then retrain the model and evaluate it on a test set. This is done for a range of removal percentages, and the performance is measured as a function of the percentage of data removed. By repeating this process multiple times, we can get an estimate of the stability of the valuation method.
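To make the procedure concrete, here is a minimal, self-contained sketch of one removal curve using plain scikit-learn and NumPy. The removal_curve helper is hypothetical and not part of pyDVL; it only illustrates the remove-retrain-score loop for a single ranking of the training points.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def removal_curve(model, values, x_train, y_train, x_test, y_test, percentages):
    """Hypothetical helper: drop the lowest-valued points first, retrain the
    model on the remaining data, and score it on the test set."""
    order = np.argsort(values)  # ascending: lowest-valued points removed first
    scores = []
    for pct in percentages:
        n_removed = int(pct * len(x_train))
        keep = order[n_removed:]
        fitted = clone(model).fit(x_train[keep], y_train[keep])
        scores.append(fitted.score(x_test, y_test))
    return np.array(scores)


X, y = make_classification(n_samples=500, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, random_state=0)
# Random stand-in for the values produced by an actual valuation method.
values = np.random.default_rng(0).normal(size=len(x_tr))
print(removal_curve(LogisticRegression(max_iter=1000), values,
                    x_tr, y_tr, x_te, y_te, np.linspace(0.0, 0.5, 6)))
```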
The experiment can be run in parallel with the run_removal_experiment function. In order to call it, we need to define three types of factories, sketched in the example after this list:
- A factory that returns a train-test split of the data given a random state
- A factory that returns a utility that evaluates a model on a given test set. This is used for the performance evaluation. The model need not be the same as the one used for the valuation.
- A factory returning a valuation method. The training set is passed to the factory, in case the valuation needs to train something. E.g. for Data-OOB we need the bagging model to be fitted before the valuation is computed.
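A minimal sketch of the three factories follows. It assumes the pyDVL classes Dataset, ModelUtility, SupervisedScorer and DataOOBValuation with the constructor signatures shown, and it follows the argument orders described above (random state for the data factory, test set plus random state for the utility factory, training set plus random state for the valuation factory). Check the DataSplitFactory, UtilityFactory and ValuationFactory type aliases and the pyDVL API reference for the exact interfaces in your version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# The class names and constructor signatures below are assumptions about the
# pyDVL API; consult the documentation of Dataset, ModelUtility,
# SupervisedScorer and DataOOBValuation for the exact interfaces.
from pydvl.valuation import Dataset, ModelUtility, SupervisedScorer, DataOOBValuation


def data_factory(random_state: int):
    """Returns a (train, test) pair of Datasets for the given random state."""
    X, y = make_classification(n_samples=1000, random_state=random_state)
    x_tr, x_te, y_tr, y_te = train_test_split(X, y, random_state=random_state)
    return Dataset(x_tr, y_tr), Dataset(x_te, y_te)


def utility_factory(test_data, random_state: int):
    """Returns the utility used to score retrained models on the test set."""
    scorer = SupervisedScorer("accuracy", test_data, default=0.0)
    return ModelUtility(LogisticRegression(max_iter=1000), scorer)


def data_oob_factory(train_data, random_state: int):
    """Returns a Data-OOB valuation whose bagging model is fitted on the
    training set, as required before the values can be computed."""
    # Assumption: the raw arrays are reachable from the Dataset, e.g. via
    # train_data.data(); the exact accessor depends on the pyDVL version.
    raw = train_data.data()
    bag = BaggingClassifier(n_estimators=50, random_state=random_state)
    bag.fit(raw.x, raw.y)
    return DataOOBValuation(bag)
```

With these factories in place, a call to run_removal_experiment only needs the removal percentages and the number of runs; see the examples further below.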
removal_job
removal_job(
data_factory: DataSplitFactory,
valuation_factory: ValuationFactory,
utility_factory: UtilityFactory,
removal_percentages: NDArray,
random_state: int,
) -> tuple[dict, dict]
A job that computes the scores for a single run of the removal experiment.
PARAMETER | DESCRIPTION
---|---
data_factory | A callable that returns a tuple of Datasets (train, test) to use in the experiment. TYPE: DataSplitFactory
valuation_factory | A callable that returns a Valuation object given a train dataset and a random state. Computing values with this object is the goal of the experiment. TYPE: ValuationFactory
utility_factory | A callable that returns a ModelUtility object given a test dataset and a random state. It is used to evaluate the performance of the valuation method by removing data points from the training set, retraining the model and scoring it on the test set. TYPE: UtilityFactory
removal_percentages | A sequence of percentages of data to remove from the training set. TYPE: NDArray
random_state | The random state to use in the experiment. TYPE: int
Returns: A tuple of dictionaries with the scores for the low and high value removals.
Source code in src/pydvl/reporting/point_removal.py
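As a usage sketch, a single run with the hypothetical factories from the module description above could look as follows; the keyword names match the signature shown here.

```python
import numpy as np
from pydvl.reporting.point_removal import removal_job

# data_factory, data_oob_factory and utility_factory are the hypothetical
# factories sketched in the module description above.
low_scores, high_scores = removal_job(
    data_factory=data_factory,
    valuation_factory=data_oob_factory,
    utility_factory=utility_factory,
    removal_percentages=np.linspace(0.0, 0.5, 11),
    random_state=42,
)
# The two dicts hold the test scores for the low- and high-value removals of
# this single run; their exact keys depend on the implementation.
```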
run_removal_experiment
run_removal_experiment(
data_factory: DataSplitFactory,
valuation_factories: list[ValuationFactory],
utility_factory: UtilityFactory,
removal_percentages: NDArray,
n_runs: int = 1,
n_jobs: int = 1,
random_state: int | None = None,
) -> tuple[DataFrame, DataFrame]
Run the sample removal experiment.
Given the factories, the removal percentages, and the number of runs, this function does the following in each run:
- Sample a random state.
- For each valuation method, compute the values and iteratively compute the scores after retraining on subsets of the data. This step is parallelized, and each job requires 3 factories:
    - A factory that returns a train-test split of the data given a random state.
    - A factory returning a valuation method. The training set is passed to the factory, in case the valuation needs to train something, e.g. for Data-OOB the bagging model must be fitted before the valuation is computed.
    - A factory that returns a utility that evaluates some model on a given test set. This is used for the performance evaluation. The model need not be the same as the one used for the valuation.
- Finally, it returns the scores in two DataFrames, one for the low value removals and one for the high value removals.
PARAMETER | DESCRIPTION
---|---
data_factory | A callable that returns a tuple of Datasets (train, test) given a random state. TYPE: DataSplitFactory
valuation_factories | A list of callables that return Valuation objects given a model, train data and a random state. The training data is typically not needed for construction, but bagging models may require it. TYPE: list[ValuationFactory]
utility_factory | A callable that returns a ModelUtility object given a test dataset and a random state. It is used to evaluate the performance of the valuation method by removing data points from the training set, retraining the model and scoring it on the test set. TYPE: UtilityFactory
removal_percentages | The percentages of data to remove from the training set, as a sequence of floats between 0 and 1. TYPE: NDArray
n_runs | The number of repetitions of the experiment. TYPE: int, DEFAULT: 1
n_jobs | The number of parallel jobs to use. TYPE: int, DEFAULT: 1
random_state | The initial random state. TYPE: int or None, DEFAULT: None
Returns: A tuple of DataFrames with the scores for the low and high value removals
Source code in src/pydvl/reporting/point_removal.py
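A usage sketch with the hypothetical factories from the module description above; the removal percentages, number of runs and parallel jobs are illustrative values.

```python
import numpy as np
from pydvl.reporting.point_removal import run_removal_experiment

# Reuses the hypothetical data_factory, data_oob_factory and utility_factory
# sketched in the module description above.
low_df, high_df = run_removal_experiment(
    data_factory=data_factory,
    valuation_factories=[data_oob_factory],
    utility_factory=utility_factory,
    removal_percentages=np.linspace(0.0, 0.5, 11),
    n_runs=10,
    n_jobs=4,
    random_state=42,
)
# low_df and high_df collect the scores over all runs for the low- and
# high-value removals; their exact layout depends on the implementation.
print(low_df.head())
```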