pydvl.valuation.utility.modelutility
This module implements a utility function for supervised models.

ModelUtility holds a model and a scorer. Each call to the utility fits the model on a subset of the training data and evaluates the scorer on the test data. It is used by all the valuation methods in pydvl.valuation.

This class is geared towards scikit-learn models, but can be used with any object that implements the BaseModel protocol, i.e. that has a fit() method.
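For illustration, here is a minimal sketch of a custom model satisfying the protocol. The class is hypothetical; predict() is only needed by scorers, not by BaseModel itself:

```python
import numpy as np

class MeanPredictor:
    """Hypothetical model: always predicts the mean of the training targets."""

    def fit(self, x: np.ndarray, y: np.ndarray) -> "MeanPredictor":
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, x: np.ndarray) -> np.ndarray:
        return np.full(len(x), self.mean_)
```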
Errors are hidden by default
During semi-value computations, the utility can be evaluated on subsets that break the fitting process. For instance, a classifier might require at least two classes to fit, but the utility is sometimes evaluated on subsets with only one class, which raises an error with most classifiers. To avoid this, catch_errors=True is set by default upon instantiation: errors during fitting are caught and the scorer's default value is returned instead. A warning is shown to signal that something went wrong, but this suppression can lead to unexpected results, so it is important to be aware of the setting and to set it to False when testing, or when you are sure that the utility will not be evaluated on problematic subsets. An example is given after "Directly calling the utility" below.
Examples
Standard usage
The utility takes a model and a scorer and is passed to the valuation method. Here's the basic usage:
```python
import numpy as np
from joblib import parallel_config
from pydvl.valuation import (
    Dataset, MinUpdates, ModelUtility, SupervisedScorer, TMCShapleyValuation
)

train, test = Dataset.from_arrays(X, y, ...)
model = SomeModel()  # Implementing the basic scikit-learn interface
scorer = SupervisedScorer("r2", test, default=0.0, range=(-np.inf, 1.0))
utility = ModelUtility(model, scorer, catch_errors=True, show_warnings=True)
valuation = TMCShapleyValuation(utility, is_done=MinUpdates(1000))
with parallel_config(n_jobs=-1):
    valuation.fit(train)
```
Directly calling the utility
The following code instantiates a utility object and calls it directly. The underlying logistic regression model will be trained on the indices passed as an argument, and evaluated on the test data.

```python
from pydvl.valuation.dataset import Dataset
from pydvl.valuation.scorers import SupervisedScorer
from pydvl.valuation.types import Sample
from pydvl.valuation.utility import ModelUtility
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

train, test = Dataset.from_sklearn(load_iris(), random_state=16)
scorer = SupervisedScorer("accuracy", test, default=0.0, range=(0.0, 1.0))
u = ModelUtility(LogisticRegression(random_state=16), scorer, catch_errors=True)
u(Sample(None, subset=train.indices))  # Fits on all training indices, returns test accuracy
```
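Continuing this example, the effect of catch_errors described above can be illustrated with a subset containing a single class. This is only a sketch: it assumes that Dataset.data() exposes the raw arrays as x and y.

```python
# A subset with a single class breaks LogisticRegression.fit()
single_class = train.indices[train.data().y == 0]

u(Sample(None, subset=single_class))  # Warns, returns the scorer's default (0.0)

u_strict = ModelUtility(LogisticRegression(random_state=16), scorer, catch_errors=False)
# u_strict(Sample(None, subset=single_class))  # Would raise instead
```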
Enabling the cache
In this example an in-memory cache is used. Note that caching is only useful under certain conditions and does not really speed up typical Monte Carlo approximations. See [the introduction](#getting-started-cache) and the module documentation for more.
```python
(...)  # Imports as above
from pydvl.utils.caching import InMemoryCacheBackend

cache_backend = InMemoryCacheBackend()  # See other backends in the caching module
u = ModelUtility(
    model=LogisticRegression(random_state=16),
    scorer=SupervisedScorer("accuracy", test, default=0.0, range=(0.0, 1.0)),
    cache_backend=cache_backend,
    catch_errors=True,
)
u(Sample(None, subset=train.indices))
u(Sample(None, subset=train.indices))  # The second call does not retrain the model
```
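When a backend is set, hit and miss counts are gathered and can be inspected through the cache_stats property documented below:

```python
print(u.cache_stats)  # Cache statistics; only gathered when the cache is enabled
```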
Data type of the underlying data arrays

In principle, few to no assumptions are made about the data type. As long as it is contained in a Dataset object, it should work. If your data needs special handling before being fed to the model from the Dataset, you can override the sample_to_data() method. Be careful not to rely on the data being static when doing so; if you need to transform it before fitting, override with_dataset() instead.
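As a sketch, an override could look like this (the flattening transformation is purely illustrative):

```python
from pydvl.valuation.utility import ModelUtility

class FlatteningUtility(ModelUtility):
    """Illustrative utility that flattens inputs before fitting."""

    def sample_to_data(self, sample):
        x, y = super().sample_to_data(sample)
        # Hypothetical reshaping: flatten each sample, e.g. for image data
        return x.reshape(len(x), -1), y
```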
Caveats with parallel computation

When running in parallel, the utility is copied to each worker, which implies copying the dataset as well; this can obviously be very expensive. To alleviate the problem, one can memmap the data to disk. Alas, automatic memmapping by joblib does not work for nested structures like Dataset objects, nor for pytorch tensors. For now, it should be possible to use memmap manually, though this hasn't been tested (a sketch is given below).

If you are working on a cluster, the data will be copied to each worker. In this case, subclassing Dataset and Utility will be necessary to minimize copying, and the solution will depend on your storage solution. Feel free to open an issue if you need help with this.
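For reference, an untested sketch of manual memmapping with numpy, with X and y as in the first example, and assuming that Dataset accepts raw arrays directly:

```python
import numpy as np
from pydvl.valuation import Dataset

# Persist the arrays once, then reopen them as read-only memmaps so that
# workers share pages through the OS instead of receiving full copies.
np.save("x_train.npy", X)
np.save("y_train.npy", y)
x_mm = np.load("x_train.npy", mmap_mode="r")
y_mm = np.load("y_train.npy", mmap_mode="r")
train = Dataset(x_mm, y_mm)
```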
ModelUtility
```python
ModelUtility(
    model: ModelT,
    scorer: Scorer,
    *,
    catch_errors: bool = True,
    show_warnings: bool = True,
    cache_backend: CacheBackend | None = None,
    cached_func_options: CachedFuncConfig | None = None,
    clone_before_fit: bool = True,
)
```
Bases: UtilityBase[SampleT], Generic[SampleT, ModelT]
Convenience wrapper with configurable memoization of the utility.

An instance of ModelUtility holds the pair of model and scoring function which determines the value of data points. This is used for the computation of all game-theoretic values like Shapley values and the Least Core.

ModelUtility expects the model to fulfill at least the BaseModel interface, i.e. to have a fit() method.

When calling the utility, the model will be cloned if it is a scikit-learn model; otherwise a copy is created using copy.deepcopy.

Since evaluating the scoring function requires retraining the model, which can be time-consuming, this class wraps it and caches the results of each execution. Caching is available both locally and across nodes, but must always be enabled for your project first, because most stochastic methods do not benefit much from it. See the introduction to caching and the module documentation for more.
ATTRIBUTE | DESCRIPTION
---|---
model | The supervised model. TYPE: ModelT
scorer | A scoring function. If None, the score() method of the model will be used. TYPE: Scorer
PARAMETER | DESCRIPTION
---|---
model | Any supervised model. Typical choices can be found in the scikit-learn documentation. TYPE: ModelT
scorer | A scoring object. If None, the score() method of the model will be used. TYPE: Scorer
catch_errors | Set to True to catch errors when fit() fails, e.g. on subsets that break the fitting process; the scorer's default value is then returned as the score and computation continues. TYPE: bool, DEFAULT: True
show_warnings | Set to True to display a warning whenever an error is caught and suppressed. TYPE: bool, DEFAULT: True
cache_backend | Optional instance of CacheBackend used to memoize results to avoid duplicate computation. Note however, that for most stochastic methods, cache hits are rare, making the memory expense of caching not worth it (YMMV). TYPE: CacheBackend or None, DEFAULT: None
cached_func_options | Optional configuration object for cached utility evaluation. TYPE: CachedFuncConfig or None, DEFAULT: None
clone_before_fit | If True, the model will be cloned before fitting. TYPE: bool, DEFAULT: True
cache_stats property

```python
cache_stats: CacheStats | None
```

Cache statistics are gathered when the cache is enabled. See CacheStats for all fields returned.
training_data property

```python
training_data: Dataset | None
```

Retrieves the training data used by this utility.

This property is read-only. In order to set it, use with_dataset().
__call__

```python
__call__(sample: SampleT | None) -> float
```

PARAMETER | DESCRIPTION
---|---
sample | Contains a subset of valid indices for the training data. TYPE: SampleT or None
__str__

Returns a string representation of the utility. Subclasses should override this method to provide a more informative string.
sample_to_data

```python
sample_to_data(sample: SampleT) -> tuple
```

Returns the raw data corresponding to a sample.

Subclasses can override this e.g. to do reshaping of tensors. Be careful not to rely on self.training_data not changing between calls to this method. For manipulations to it, use the with_dataset() method.

PARAMETER | DESCRIPTION
---|---
sample | Contains a subset of valid indices for the training data. TYPE: SampleT

Returns: Tuple of the training data and labels corresponding to the sample indices.
with_dataset

Returns the utility, or a copy of it, with the given dataset.

PARAMETER | DESCRIPTION
---|---
data | The dataset to use for utility fitting (training data).
copy | Whether to copy the utility object or not. Valuation methods should always make copies to avoid unexpected side effects.

Returns: The utility object.
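A short usage sketch, assuming the signature with_dataset(data: Dataset, copy: bool = True), with names from the examples above:

```python
u = ModelUtility(LogisticRegression(random_state=16), scorer)
assert u.training_data is None      # No dataset attached yet
u_bound = u.with_dataset(train)     # A copy of the utility, bound to `train`
assert u_bound.training_data is not None
assert u.training_data is None      # The original is left untouched
```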