pydvl.valuation.utility.modelutility

This module implements a utility function for supervised models.

ModelUtility holds a model and a scorer. Each call to the utility will fit the model on a subset of the training data and evaluate the scorer on the test data. It is used by all the valuation methods in pydvl.valuation.

This class is geared towards scikit-learn models, but can be used with any object that implements the BaseModel protocol, i.e. that has a fit() method.

Errors are hidden by default

During semi-value computations, the utility can be evaluated on subsets that break the fitting process. For instance, a classifier might require at least two classes to fit, but the utility is sometimes evaluated on subsets containing only one. Most classifiers raise an error in that case. To avoid aborting the computation, catch_errors defaults to True, so the error is caught and the scorer's default value is returned instead. A warning signals that something went wrong, but this suppression can still lead to unexpected results. Be aware of this setting and set it to False when testing, or when you are sure that the utility will not be evaluated on problematic subsets.
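The behaviour described above can be illustrated with a simplified stand-in that does not use pyDVL itself. The names `fit_and_score` and `safe_utility` are hypothetical; the sketch only mirrors the documented semantics of catch_errors and the scorer's default value, not the library's implementation.

```python
import warnings


def fit_and_score(labels):
    """Toy stand-in for fitting a classifier: fails with fewer than two classes."""
    if len(set(labels)) < 2:
        raise ValueError("need at least two classes to fit")
    return 1.0  # Pretend the fitted model scores perfectly


def safe_utility(labels, default=0.0, catch_errors=True):
    """Return the score, or `default` (with a warning) if fitting fails."""
    try:
        return fit_and_score(labels)
    except Exception as e:
        if not catch_errors:
            raise
        warnings.warn(f"fit failed, returning default score: {e}", RuntimeWarning)
        return default


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ok = safe_utility([0, 1, 1])      # Two classes: fit succeeds
    hidden = safe_utility([1, 1, 1])  # One class: the error is swallowed
```

With catch_errors=False the same one-class subset raises the ValueError, which is usually what you want while testing.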

Examples

Standard usage

The utility takes a model and a scorer and is passed to the valuation method. Here's the basic usage:

import numpy as np
from joblib import parallel_config
from pydvl.valuation import (
    Dataset, MinUpdates, ModelUtility, SupervisedScorer, TMCShapleyValuation
)

train, test = Dataset.from_arrays(X, y, ...)
model = SomeModel()  # Implementing the basic scikit-learn interface
scorer = SupervisedScorer("r2", test, default=0.0, range=(-np.inf, 1.0))
utility = ModelUtility(model, scorer, catch_errors=True, show_warnings=True)
valuation = TMCShapleyValuation(utility, is_done=MinUpdates(1000))
with parallel_config(n_jobs=-1):
    valuation.fit(train)

Directly calling the utility

The following code instantiates a utility object and calls it directly. The underlying logistic regression model will be trained on the indices passed as argument, and evaluated on the test data.

from pydvl.valuation.utility import ModelUtility
from pydvl.valuation.dataset import Dataset
from pydvl.valuation.scorers import SupervisedScorer
from pydvl.valuation.types import Sample
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

train, test = Dataset.from_sklearn(load_iris(), random_state=16)
scorer = SupervisedScorer("accuracy", test, default=0.0, range=(0.0, 1.0))
u = ModelUtility(LogisticRegression(random_state=16), scorer, catch_errors=True)
# Attach the training data first: sample_to_data() raises if none is set
u.with_dataset(train, copy=False)
u(Sample(None, subset=train.indices))

Enabling the cache

In this example an in-memory cache is used. Note that caching is only useful under certain conditions and does not really speed up typical Monte Carlo approximations. See the introduction and the module documentation for more.

(...)  # Imports as above
from pydvl.utils.caching import InMemoryCacheBackend

cache_backend = InMemoryCacheBackend()  # See other backends in the caching module
u = ModelUtility(
    model=LogisticRegression(random_state=16),
    scorer=SupervisedScorer("accuracy", test, default=0.0, range=(0.0, 1.0)),
    cache_backend=cache_backend,
    catch_errors=True,
)
u.with_dataset(train, copy=False)  # Attach the training data before calling
u(Sample(None, subset=train.indices))
u(Sample(None, subset=train.indices))  # The second call does not retrain the model
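
To see why a repeated call is cheap, and why caching rarely helps Monte Carlo methods, consider the following simplified stand-in. It is plain Python and not pyDVL's cache implementation: `CachedUtility` and `expensive_fit_and_score` are hypothetical names, and results are simply memoized on the set of indices.

```python
import random


class CachedUtility:
    """Memoizes a utility on the set of sample indices (illustrative only)."""

    def __init__(self, func):
        self.func = func
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, indices):
        key = frozenset(indices)
        if key in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache[key] = self.func(key)
        return self.cache[key]


def expensive_fit_and_score(indices):
    # Placeholder for retraining a model on `indices` and scoring it.
    return len(indices) / 100


u = CachedUtility(expensive_fit_and_score)
u(range(100))
u(range(100))  # Same subset: served from the cache
repeat_hits = u.hits

# Random subsets of 100 points, as a Monte Carlo sampler might draw them,
# practically never repeat, so the cache only consumes memory.
rng = random.Random(16)
mc = CachedUtility(expensive_fit_and_score)
for _ in range(1000):
    mc(i for i in range(100) if rng.random() < 0.5)
```

In this toy setting the repeated call is a cache hit, while the 1000 random subsets produce 1000 misses and as many stored entries.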

Data type of the underlying data arrays

In principle, very few to no assumptions are made about the data type. As long as it is contained in a Dataset object, it should work. If your data needs special handling before being fed to the model from the Dataset, you can override the sample_to_data() method. Be sure not to rely on the data being static for this. If you need to transform it before fitting, then override with_dataset().
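
As an illustration of overriding sample_to_data(), the sketch below uses a small stand-in base class that mirrors the documented with_dataset() and sample_to_data() methods, so that it runs without pyDVL. `ToyDataset`, `ToyUtility` and `FlattenedUtility` are hypothetical, and the stand-in takes a plain list of indices rather than a Sample; in real code one would subclass ModelUtility instead.

```python
import copy as cp


class ToyDataset:
    """Stand-in for Dataset: holds features and labels, indexable by position."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def data(self, indices):
        return [self.x[i] for i in indices], [self.y[i] for i in indices]


class ToyUtility:
    """Stand-in mirroring the documented with_dataset() / sample_to_data()."""

    _training_data = None

    def with_dataset(self, data, copy=True):
        utility = cp.copy(self) if copy else self
        utility._training_data = data
        return utility

    def sample_to_data(self, subset):
        if self._training_data is None:
            raise ValueError("No training data provided")
        return self._training_data.data(subset)


class FlattenedUtility(ToyUtility):
    """Flattens each 2x2 'image' into a flat feature vector before fitting."""

    def sample_to_data(self, subset):
        # Read the data on every call instead of caching it on the instance,
        # since the dataset may be replaced through with_dataset().
        x, y = super().sample_to_data(subset)
        return [[v for row in img for v in row] for img in x], y


images = [[[0, 1], [2, 3]], [[4, 5], [6, 7]], [[8, 9], [10, 11]]]
u = FlattenedUtility().with_dataset(ToyDataset(images, [0, 1, 0]))
x_flat, y_sub = u.sample_to_data([0, 2])
```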

Caveats with parallel computation

When running in parallel, the utility is copied to each worker. This implies copying the dataset as well, which can be very expensive. To alleviate the problem, one can memmap the data to disk. Alas, automatic memmapping by joblib does not work for nested structures like Dataset objects, nor for PyTorch tensors. For now, it should be possible to use memmap manually, but this hasn't been tested.
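
A minimal sketch of manual memory mapping with NumPy, assuming the features fit in a plain array. It only shows the NumPy side: whether a Dataset built from such arrays stays memory-mapped across workers is, as noted above, untested. The file name and array contents are arbitrary.

```python
import os
import tempfile

import numpy as np

# Write the training features to disk once...
x = np.arange(12, dtype=np.float64).reshape(4, 3)
path = os.path.join(tempfile.mkdtemp(), "x_train.npy")
np.save(path, x)

# ...then open them lazily and read-only. Processes that open the same file
# can share pages through the OS cache instead of receiving pickled copies.
x_mm = np.load(path, mmap_mode="r")
subset = x_mm[[0, 2]]  # Indexing a subset of rows materializes only those rows
```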

If you are working on a cluster, the data will be copied to each worker. In this case, subclassing of Dataset and Utility will be necessary to minimize copying, and the solution will depend on your storage solution. Feel free to open an issue if you need help with this.

ModelUtility

ModelUtility(
    model: ModelT,
    scorer: Scorer,
    *,
    catch_errors: bool = True,
    show_warnings: bool = True,
    cache_backend: CacheBackend | None = None,
    cached_func_options: CachedFuncConfig | None = None,
    clone_before_fit: bool = True,
)

Bases: UtilityBase[SampleT], Generic[SampleT, ModelT]

Convenience wrapper with configurable memoization of the utility.

An instance of ModelUtility holds the pair of model and scoring function that determines the value of data points. It is used to compute all game-theoretic values, such as Shapley values and the Least Core.

ModelUtility expects the model to fulfill at least the BaseModel interface, i.e. to have a fit() method.

When calling the utility, the model will be cloned if it is a scikit-learn model; otherwise a copy is created using copy.deepcopy.
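
The effect of cloning before fitting can be sketched in plain Python. `MeanModel` and `evaluate` are hypothetical; the sketch uses copy.deepcopy, the fallback mentioned above for models that are not scikit-learn estimators, and is not pyDVL's own code.

```python
import copy


class MeanModel:
    """Toy model with a minimal fit() interface: remembers the training mean."""

    def __init__(self):
        self.mean_ = None

    def fit(self, x, y):
        self.mean_ = sum(y) / len(y)
        return self


def evaluate(model, x, y, clone_before_fit=True):
    """Fit on (x, y), optionally on a copy so that `model` itself stays unfitted."""
    candidate = copy.deepcopy(model) if clone_before_fit else model
    return candidate.fit(x, y).mean_


template = MeanModel()
first = evaluate(template, [[0], [1]], [1.0, 3.0])
untouched = template.mean_  # Still None: only the copy was fitted
evaluate(template, [[0], [1]], [1.0, 3.0], clone_before_fit=False)
mutated = template.mean_    # Now 2.0: the template itself was fitted
```

Fitting a fresh copy for every subset keeps state from one evaluation from leaking into the next, at the cost of one copy per call.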

Since evaluating the scoring function requires retraining the model and that can be time-consuming, this class wraps it and caches the results of each execution. Caching is available both locally and across nodes, but must always be enabled for your project first, because most stochastic methods do not benefit much from it. See the documentation and the module documentation.

ATTRIBUTE DESCRIPTION
model

The supervised model.

TYPE: ModelT

scorer

A scoring function. If None, the score() method of the model will be used. See score for ways to create and compose scorers, in particular how to set default values and ranges.

TYPE: Scorer

PARAMETER DESCRIPTION
model

Any supervised model. Typical choices can be found in the scikit-learn documentation.

TYPE: ModelT

scorer

A scoring object. If None, the score() method of the model will be used. See scorers for ways to create and compose scorers, in particular how to set default values and ranges. For convenience, a string can be passed, which will be used to construct a SupervisedScorer.

TYPE: Scorer

catch_errors

Set to True to catch the errors when fit() fails. This could happen in several steps of the pipeline, e.g. when too little training data is passed, which happens often during Shapley value calculations. When this happens, the scorer's default value is returned as a score and computation continues.

TYPE: bool DEFAULT: True

show_warnings

Set to False to suppress warnings thrown by fit().

TYPE: bool DEFAULT: True

cache_backend

Optional instance of CacheBackend used to memoize results to avoid duplicate computation. Note however, that for most stochastic methods, cache hits are rare, making the memory expense of caching not worth it (YMMV).

TYPE: CacheBackend | None DEFAULT: None

cached_func_options

Optional configuration object for cached utility evaluation.

TYPE: CachedFuncConfig | None DEFAULT: None

clone_before_fit

If True, the model will be cloned before calling fit().

TYPE: bool DEFAULT: True

Source code in src/pydvl/valuation/utility/modelutility.py
def __init__(
    self,
    model: ModelT,
    scorer: Scorer,
    *,
    catch_errors: bool = True,
    show_warnings: bool = True,
    cache_backend: CacheBackend | None = None,
    cached_func_options: CachedFuncConfig | None = None,
    clone_before_fit: bool = True,
):
    self.clone_before_fit = clone_before_fit
    self.model = self._maybe_clone_model(model, clone_before_fit)
    self.scorer = scorer
    self.catch_errors = catch_errors
    self.show_warnings = show_warnings
    self.cache = cache_backend
    if cached_func_options is None:
        cached_func_options = CachedFuncConfig()
    # TODO: Find a better way to do this.
    if cached_func_options.hash_prefix is None:
        # FIX: This does not handle reusing the same across runs.
        cached_func_options.hash_prefix = str(hash((model, scorer)))
    self.cached_func_options = cached_func_options
    self._initialize_utility_wrapper()

cache_stats property

cache_stats: CacheStats | None

Cache statistics are gathered when cache is enabled. See CacheStats for all fields returned.

training_data property

training_data: Dataset | None

Retrieves the training data used by this utility.

This property is read-only. In order to set it, use with_dataset().

__call__

__call__(sample: SampleT | None) -> float
PARAMETER DESCRIPTION
sample

Contains a subset of valid indices for the x_train attribute of Dataset.

TYPE: SampleT | None

Source code in src/pydvl/valuation/utility/modelutility.py
def __call__(self, sample: SampleT | None) -> float:
    """
    Args:
        sample: contains a subset of valid indices for the
            `x_train` attribute of [Dataset][pydvl.utils.dataset.Dataset].
    """
    if sample is None or len(sample.subset) == 0:
        return self.scorer.default

    return cast(float, self._utility_wrapper(sample))

__str__

__str__()

Returns a string representation of the utility. Subclasses should override this method to provide a more informative string.

Source code in src/pydvl/valuation/utility/base.py
def __str__(self):
    """Returns a string representation of the utility.
    Subclasses should override this method to provide a more informative string
    """
    return f"{self.__class__.__name__}"

sample_to_data

sample_to_data(sample: SampleT) -> tuple

Returns the raw data corresponding to a sample.

Subclasses can override this e.g. to do reshaping of tensors. Be careful not to rely on self.training_data not changing between calls to this method. For manipulations to it, use the with_dataset() method.

PARAMETER DESCRIPTION
sample

Contains a subset of valid indices for the x_train attribute of Dataset.

TYPE: SampleT

Returns: Tuple of the training data and labels corresponding to the sample indices.

Source code in src/pydvl/valuation/utility/modelutility.py
def sample_to_data(self, sample: SampleT) -> tuple:
    """Returns the raw data corresponding to a sample.

    Subclasses can override this e.g. to do reshaping of tensors. Be careful not to
    rely on `self.training_data` not changing between calls to this method. For
    manipulations to it, use the `with_dataset()` method.

    Args:
        sample: contains a subset of valid indices for the
            `x_train` attribute of [Dataset][pydvl.utils.dataset.Dataset].
    Returns:
        Tuple of the training data and labels corresponding to the sample indices.
    """
    if self.training_data is None:
        raise ValueError("No training data provided")

    x_train, y_train = self.training_data.data(sample.subset)
    return x_train, y_train

with_dataset

with_dataset(data: Dataset, copy: bool = True) -> Self

Returns the utility, or a copy of it, with the given dataset.

PARAMETER DESCRIPTION
data

The dataset to use for utility fitting (training data).

TYPE: Dataset

copy

Whether to copy the utility object or not. Valuation methods should always make copies to avoid unexpected side effects.

TYPE: bool DEFAULT: True

Returns: The utility object.

Source code in src/pydvl/valuation/utility/base.py
def with_dataset(self, data: Dataset, copy: bool = True) -> Self:
    """Returns the utility, or a copy of it, with the given dataset.
    Args:
        data: The dataset to use for utility fitting (training data)
        copy: Whether to copy the utility object or not. Valuation methods should
            always make copies to avoid unexpected side effects.
    Returns:
        The utility object.
    """
    utility = cp.copy(self) if copy else self
    utility._training_data = data
    return utility
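
The copy semantics of the listing above can be checked with a stand-in that reproduces the same three lines. `ToyUtility` is hypothetical and runs without pyDVL; the sketch assumes the default behaviour of copy.copy, i.e. a shallow copy with no custom __copy__.

```python
import copy as cp


class ToyUtility:
    """Stand-in reproducing the with_dataset() logic of the base class."""

    _training_data = None

    def __init__(self, model):
        self.model = model

    def with_dataset(self, data, copy=True):
        utility = cp.copy(self) if copy else self
        utility._training_data = data
        return utility


base = ToyUtility(model={"weights": []})

copied = base.with_dataset("train split A")  # Default: copy=True
base_data_after_copy = base._training_data   # None: base was left untouched

same = base.with_dataset("train split B", copy=False)  # Mutates and returns base
```

Because the copy is shallow, `copied.model` and `base.model` refer to the same object in this stand-in, so in-place changes to the model would be visible through both.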