pydvl.valuation.utility.modelutility
This module implements a utility function for supervised models.

ModelUtility holds a model and a scorer. Each call to the utility fits the model on a subset of the training data and evaluates the scorer on the test data. It is used by all the valuation methods in pydvl.valuation.

This class is geared towards scikit-learn models, but can be used with any object that implements the BaseModel protocol, i.e. that has a fit() method.
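For illustration, here is a minimal sketch of a custom model satisfying the protocol. The class is hypothetical; predict() is only needed by scorers, not by BaseModel itself:

```python
import numpy as np

class MeanPredictor:
    """Hypothetical model: always predicts the mean of the training targets."""

    def fit(self, x: np.ndarray, y: np.ndarray) -> "MeanPredictor":
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, x: np.ndarray) -> np.ndarray:
        return np.full(len(x), self.mean_)
```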
Errors are hidden by default
During semi-value computations, the utility can be evaluated on subsets that break the fitting process. For instance, a classifier might require at least two classes to fit, but the utility is sometimes evaluated on subsets with only one class, which raises an error with most classifiers. To avoid this, catch_errors=True is set by default upon instantiation: errors during fitting are caught and the scorer's default value is returned instead. A warning is shown to signal that something went wrong, but this suppression can lead to unexpected results, so it is important to be aware of the setting and to set it to False when testing, or when you are sure that the utility will not be evaluated on problematic subsets. An example is given after "Directly calling the utility" below.
Examples
Standard usage
The utility takes a model and a scorer and is passed to the valuation method. Here's the basic usage:
```python
import numpy as np
from joblib import parallel_config
from pydvl.valuation import (
    Dataset, MinUpdates, ModelUtility, SupervisedScorer, TMCShapleyValuation
)

train, test = Dataset.from_arrays(X, y, ...)
model = SomeModel()  # Implementing the basic scikit-learn interface
scorer = SupervisedScorer("r2", test, default=0.0, range=(-np.inf, 1.0))
utility = ModelUtility(model, scorer, catch_errors=True, show_warnings=True)
valuation = TMCShapleyValuation(utility, is_done=MinUpdates(1000))
with parallel_config(n_jobs=-1):
    valuation.fit(train)
```
Directly calling the utility
The following code instantiates a utility object and calls it directly. The underlying logistic regression model will be trained on the indices passed as an argument, and evaluated on the test data.

```python
from pydvl.valuation.dataset import Dataset
from pydvl.valuation.scorers import SupervisedScorer
from pydvl.valuation.types import Sample
from pydvl.valuation.utility import ModelUtility
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

train, test = Dataset.from_sklearn(load_iris(), random_state=16)
scorer = SupervisedScorer("accuracy", test, default=0.0, range=(0.0, 1.0))
u = ModelUtility(LogisticRegression(random_state=16), scorer, catch_errors=True)
u(Sample(None, subset=train.indices))  # Fits on all training indices, returns test accuracy
```
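Continuing this example, the effect of catch_errors described above can be illustrated with a subset containing a single class. This is only a sketch: it assumes that Dataset.data() exposes the raw arrays as x and y.

```python
# A subset with a single class breaks LogisticRegression.fit()
single_class = train.indices[train.data().y == 0]

u(Sample(None, subset=single_class))  # Warns, returns the scorer's default (0.0)

u_strict = ModelUtility(LogisticRegression(random_state=16), scorer, catch_errors=False)
# u_strict(Sample(None, subset=single_class))  # Would raise instead
```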
Enabling the cache
In this example an in-memory cache is used. Note that caching is only useful under certain conditions and does not really speed up typical Monte Carlo approximations. See [the introduction](#getting-started-cache) and the module documentation for more.
```python
(...)  # Imports as above
from pydvl.utils.caching import InMemoryCacheBackend

cache_backend = InMemoryCacheBackend()  # See other backends in the caching module
u = ModelUtility(
    model=LogisticRegression(random_state=16),
    scorer=SupervisedScorer("accuracy", test, default=0.0, range=(0.0, 1.0)),
    cache_backend=cache_backend,
    catch_errors=True,
)
u(Sample(None, subset=train.indices))
u(Sample(None, subset=train.indices))  # The second call does not retrain the model
```
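When a backend is set, hit and miss counts are gathered and can be inspected through the cache_stats property documented below:

```python
print(u.cache_stats)  # Cache statistics; only gathered when the cache is enabled
```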
Data type of the underlying data arrays

In principle, few to no assumptions are made about the data type. As long as it is contained in a Dataset object, it should work. If your data needs special handling before being fed to the model from the Dataset, you can override the sample_to_data() method. Be careful not to rely on the data being static when doing so; if you need to transform it before fitting, override with_dataset() instead.
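As a sketch, an override could look like this (the flattening transformation is purely illustrative):

```python
from pydvl.valuation.utility import ModelUtility

class FlatteningUtility(ModelUtility):
    """Illustrative utility that flattens inputs before fitting."""

    def sample_to_data(self, sample):
        x, y = super().sample_to_data(sample)
        # Hypothetical reshaping: flatten each sample, e.g. for image data
        return x.reshape(len(x), -1), y
```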
Caveats with parallel computation

When running in parallel, the utility is copied to each worker, which implies copying the dataset as well; this can obviously be very expensive. To alleviate the problem, one can memmap the data to disk. Alas, automatic memmapping by joblib does not work for nested structures like Dataset objects, nor for pytorch tensors. For now, it should be possible to use memmap manually, though this hasn't been tested (a sketch is given below).

If you are working on a cluster, the data will be copied to each worker. In this case, subclassing Dataset and Utility will be necessary to minimize copying, and the solution will depend on your storage solution. Feel free to open an issue if you need help with this.
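For reference, an untested sketch of manual memmapping with numpy, with X and y as in the first example, and assuming that Dataset accepts raw arrays directly:

```python
import numpy as np
from pydvl.valuation import Dataset

# Persist the arrays once, then reopen them as read-only memmaps so that
# workers share pages through the OS instead of receiving full copies.
np.save("x_train.npy", X)
np.save("y_train.npy", y)
x_mm = np.load("x_train.npy", mmap_mode="r")
y_mm = np.load("y_train.npy", mmap_mode="r")
train = Dataset(x_mm, y_mm)
```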
ModelUtility
```python
ModelUtility(
    model: ModelT,
    scorer: Scorer,
    *,
    catch_errors: bool = True,
    show_warnings: bool = True,
    cache_backend: CacheBackend | None = None,
    cached_func_options: CachedFuncConfig | None = None,
    clone_before_fit: bool = True,
)
```
Bases: UtilityBase[SampleT], Generic[SampleT, ModelT]
Convenience wrapper with configurable memoization of the utility.

An instance of ModelUtility holds the pair of model and scoring function which determines the value of data points. This is used for the computation of all game-theoretic values like Shapley values and the Least Core.

ModelUtility expects the model to fulfill at least the BaseModel interface, i.e. to have a fit() method.

When calling the utility, the model will be cloned if it is a scikit-learn model; otherwise a copy is created using copy.deepcopy.

Since evaluating the scoring function requires retraining the model, which can be time-consuming, this class wraps it and caches the results of each execution. Caching is available both locally and across nodes, but must always be enabled for your project first, because most stochastic methods do not benefit much from it. See the introduction to caching and the module documentation for more.
ATTRIBUTE | DESCRIPTION
---|---
model | The supervised model. TYPE: ModelT
scorer | A scoring function. If None, the score() method of the model will be used. TYPE: Scorer
PARAMETER | DESCRIPTION
---|---
model | Any supervised model. Typical choices can be found in the scikit-learn documentation. TYPE: ModelT
scorer | A scoring object. If None, the score() method of the model will be used. TYPE: Scorer
catch_errors | Set to True to catch errors when fit() fails, e.g. on subsets that break the fitting process; the scorer's default value is then returned as the score and computation continues. TYPE: bool, DEFAULT: True
show_warnings | Set to True to display a warning whenever an error is caught and suppressed. TYPE: bool, DEFAULT: True
cache_backend | Optional instance of CacheBackend used to memoize results to avoid duplicate computation. Note however, that for most stochastic methods, cache hits are rare, making the memory expense of caching not worth it (YMMV). TYPE: CacheBackend or None, DEFAULT: None
cached_func_options | Optional configuration object for cached utility evaluation. TYPE: CachedFuncConfig or None, DEFAULT: None
clone_before_fit | If True, the model will be cloned before fitting. TYPE: bool, DEFAULT: True
cache_stats property

```python
cache_stats: CacheStats | None
```

Cache statistics are gathered when the cache is enabled. See CacheStats for all fields returned.
training_data property

```python
training_data: Dataset | None
```

Retrieves the training data used by this utility.

This property is read-only. In order to set it, use with_dataset().
__call__

```python
__call__(sample: SampleT | None) -> float
```

PARAMETER | DESCRIPTION
---|---
sample | Contains a subset of valid indices for the training data. TYPE: SampleT or None
__str__

Returns a string representation of the utility. Subclasses should override this method to provide a more informative string.
sample_to_data

```python
sample_to_data(sample: SampleT) -> tuple
```

Returns the raw data corresponding to a sample.

Subclasses can override this e.g. to do reshaping of tensors. Be careful not to rely on self.training_data not changing between calls to this method. For manipulations to it, use the with_dataset() method.

PARAMETER | DESCRIPTION
---|---
sample | Contains a subset of valid indices for the training data. TYPE: SampleT

Returns: Tuple of the training data and labels corresponding to the sample indices.
with_dataset

Returns the utility, or a copy of it, with the given dataset.

PARAMETER | DESCRIPTION
---|---
data | The dataset to use for utility fitting (training data).
copy | Whether to copy the utility object or not. Valuation methods should always make copies to avoid unexpected side effects.

Returns: The utility object.
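A short usage sketch, assuming the signature with_dataset(data: Dataset, copy: bool = True), with names from the examples above:

```python
u = ModelUtility(LogisticRegression(random_state=16), scorer)
assert u.training_data is None      # No dataset attached yet
u_bound = u.with_dataset(train)     # A copy of the utility, bound to `train`
assert u_bound.training_data is not None
assert u.training_data is None      # The original is left untouched
```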