
pydvl.valuation.utility.modelutility

ModelUtility

ModelUtility(
    model: ModelT,
    scorer: Scorer,
    *,
    catch_errors: bool = True,
    show_warnings: bool = False,
    cache_backend: CacheBackend | None = None,
    cached_func_options: CachedFuncConfig | None = None,
    clone_before_fit: bool = True,
)

Bases: UtilityBase[SampleT], Generic[SampleT, ModelT]

Convenience wrapper with configurable memoization of the utility.

An instance of ModelUtility holds the pair of model and scoring function that determines the value of data points. It is used in the computation of all game-theoretic values, such as Shapley values and the Least Core.

ModelUtility expects the model to fulfill at least the BaseModel interface, i.e. to have a fit() method.

When calling the utility, the model is cloned if it is a scikit-learn model; otherwise a copy is created using copy.deepcopy.
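The cloning rule above can be pictured with a short sketch. This is not pyDVL's implementation: `clone_model` and `ToyModel` are hypothetical names, and the scikit-learn import is treated as optional so that the snippet also runs without it.

```python
import copy


def clone_model(model):
    """Return an unfitted copy for scikit-learn estimators, a deep copy otherwise."""
    try:
        # Optional dependency: only used if scikit-learn is installed.
        from sklearn.base import BaseEstimator, clone
    except ImportError:
        BaseEstimator, clone = None, None

    if BaseEstimator is not None and isinstance(model, BaseEstimator):
        return clone(model)  # copies hyperparameters, drops fitted state
    return copy.deepcopy(model)  # generic fallback for any other object


class ToyModel:
    """Hypothetical non-scikit-learn model with a fit() method."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.coef_ = None

    def fit(self, x, y):
        self.coef_ = [self.alpha] * len(x)
        return self


original = ToyModel(alpha=0.5)
duplicate = clone_model(original)
duplicate.fit([1, 2, 3], [0, 1, 0])  # fitting the copy leaves the original untouched
```

Working on a copy means that repeated utility evaluations do not leak fitted state from one subset into the next.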

Since evaluating the scoring function requires retraining the model, which can be time-consuming, this class wraps it and caches the results of each execution. Caching is available both locally and across nodes, but it is not active by default and must be enabled explicitly for your project, because most stochastic methods do not benefit much from it. See the caching section of the documentation and the caching module documentation for details.
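As a rough illustration of what memoizing a utility means, the sketch below caches results by the set of indices passed in. It uses only the standard library; `MemoizedUtility` and `expensive_utility` are hypothetical names and do not reflect how pyDVL's cache backends are implemented.

```python
from typing import Callable, Dict, FrozenSet, Iterable


class MemoizedUtility:
    """Hypothetical wrapper that caches a utility by the set of indices it is given."""

    def __init__(self, utility: Callable[[FrozenSet[int]], float]):
        self._utility = utility
        self._cache: Dict[FrozenSet[int], float] = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, subset: Iterable[int]) -> float:
        key = frozenset(subset)  # the order of the indices does not matter
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        value = self._utility(key)  # stands in for "retrain and score"
        self._cache[key] = value
        return value


def expensive_utility(subset: FrozenSet[int]) -> float:
    # Placeholder for fitting a model on `subset` and scoring it.
    return len(subset) / 10.0


u = MemoizedUtility(expensive_utility)
u([0, 1, 2])  # miss: computed and stored
u([2, 1, 0])  # hit: same subset, different order
u([0, 1, 3])  # miss: a subset not seen before
```

When subsets are drawn at random from a large index set, exact repeats like the second call are uncommon, which is the reason given above for leaving caching off unless it is known to help.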

ATTRIBUTE DESCRIPTION
model

The supervised model.

TYPE: ModelT

scorer

A scoring function. If None, the score() method of the model will be used. See score for ways to create and compose scorers, in particular how to set default values and ranges.

TYPE: Scorer

PARAMETER DESCRIPTION
model

Any supervised model. Typical choices can be found in the scikit-learn documentation.

TYPE: ModelT

scorer

A scoring object. If None, the score() method of the model will be used. See scorers for ways to create and compose scorers, in particular how to set default values and ranges. For convenience, a string can be passed, which will be used to construct a SupervisedScorer.

TYPE: Scorer

catch_errors

Set to True to catch errors raised when fit() fails. This can happen at several steps of the pipeline, e.g. when too little training data is passed, which is common during Shapley value computations. In that case, the scorer's default value is returned as the score and computation continues.

TYPE: bool DEFAULT: True

show_warnings

Set to False to suppress warnings thrown by fit().

TYPE: bool DEFAULT: False

cache_backend

Optional instance of CacheBackend used to memoize results and avoid duplicate computation. Note, however, that for most stochastic methods cache hits are rare, so the memory cost of caching is usually not worth it (YMMV).

TYPE: CacheBackend | None DEFAULT: None

cached_func_options

Optional configuration object for cached utility evaluation.

TYPE: CachedFuncConfig | None DEFAULT: None

clone_before_fit

If True, the model will be cloned before calling fit().

TYPE: bool DEFAULT: True
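The behaviour described for catch_errors can be sketched as follows. This is a simplified, hypothetical stand-in (`safe_score` and `failing_fit` are not part of pyDVL) that only shows the control flow: a failed fit yields the default score when errors are caught, and propagates otherwise.

```python
def safe_score(fit_and_score, default, catch_errors=True):
    """Run `fit_and_score()`; on failure return `default` if errors are caught."""
    try:
        return fit_and_score()
    except Exception:
        if not catch_errors:
            raise  # let the caller see the original error
        return default


def failing_fit():
    # Stand-in for a model that cannot be fitted on too little data.
    raise ValueError("need samples of at least 2 classes")


# With catch_errors=True the failure is replaced by the default score.
score = safe_score(failing_fit, default=0.0)

# With catch_errors=False the error propagates to the caller.
try:
    safe_score(failing_fit, default=0.0, catch_errors=False)
    propagated = False
except ValueError:
    propagated = True
```

Returning a default keeps long valuation runs alive when a few very small subsets cannot be fitted, at the price of treating those subsets as having the default utility.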

Example
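>>> from pydvl.valuation.dataset import Dataset
>>> from pydvl.valuation.scorers import SupervisedScorer
>>> from pydvl.valuation.types import Sample
>>> from pydvl.valuation.utility import ModelUtility
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import load_iris
>>> train, test = Dataset.from_sklearn(load_iris(), random_state=16)
>>> # Score on the test split; the default is returned for empty samples and failed fits.
>>> scorer = SupervisedScorer("accuracy", test, default=0.0, range=(0.0, 1.0))
>>> # The utility needs training data before it can be called.
>>> u = ModelUtility(LogisticRegression(random_state=16), scorer).with_dataset(train)
>>> u(Sample(None, subset=train.indices))
0.9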

With caching enabled:

>>> from pydvl.utils.caching.memory import InMemoryCacheBackend
>>> from pydvl.valuation.dataset import Dataset
>>> from pydvl.valuation.scorers import SupervisedScorer
>>> from pydvl.valuation.types import Sample
>>> from pydvl.valuation.utility import ModelUtility
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import load_iris
>>> train, test = Dataset.from_sklearn(load_iris(), random_state=16)
>>> scorer = SupervisedScorer("accuracy", test, default=0.0, range=(0.0, 1.0))
>>> cache_backend = InMemoryCacheBackend()
>>> u = ModelUtility(
...     LogisticRegression(random_state=16), scorer, cache_backend=cache_backend
... ).with_dataset(train)
>>> u(Sample(None, subset=train.indices))
0.9
Source code in src/pydvl/valuation/utility/modelutility.py
def __init__(
    self,
    model: ModelT,
    scorer: Scorer,
    *,
    catch_errors: bool = True,
    show_warnings: bool = False,
    cache_backend: CacheBackend | None = None,
    cached_func_options: CachedFuncConfig | None = None,
    clone_before_fit: bool = True,
):
    self.clone_before_fit = clone_before_fit
    self.model = self._maybe_clone_model(model, clone_before_fit)
    self.scorer = scorer
    self.catch_errors = catch_errors
    self.show_warnings = show_warnings
    self.cache = cache_backend
    if cached_func_options is None:
        cached_func_options = CachedFuncConfig()
    # TODO: Find a better way to do this.
    if cached_func_options.hash_prefix is None:
        # FIX: This does not handle reusing the same across runs.
        cached_func_options.hash_prefix = str(hash((model, scorer)))
    self.cached_func_options = cached_func_options
    self._initialize_utility_wrapper()
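The constructor above falls back to `str(hash((model, scorer)))` when no hash_prefix is set. For objects that do not define their own `__hash__`, Python derives the hash from object identity, so the value generally differs between interpreter runs. The sketch below contrasts that with a digest computed from explicit strings. `stable_prefix` is a hypothetical helper, not part of pyDVL; the assumption is that such a string could be assigned to the `hash_prefix` attribute that the constructor reads.

```python
import hashlib


def stable_prefix(*parts: str) -> str:
    """Hypothetical helper: a digest that is identical in every interpreter run."""
    digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
    return digest[:16]


class Model:
    """Toy object without __hash__/__eq__, so hash() is identity-based."""


a, b = Model(), Model()
# Two distinct live objects with identical configuration hash differently.
identity_hashes_differ = hash(a) != hash(b)

# The same descriptive strings always map to the same prefix.
prefix_1 = stable_prefix("LogisticRegression", "accuracy")
prefix_2 = stable_prefix("LogisticRegression", "accuracy")
```

Which strings describe a model and scorer well enough to share cached results safely is a design decision that this sketch leaves open.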

training_data property

training_data: Dataset | None

Retrieves the training data used by this utility.

This property is read-only. In order to set it, use with_dataset().

cache_stats property

cache_stats: CacheStats | None

Cache statistics are gathered when cache is enabled. See CacheStats for all fields returned.

with_dataset

with_dataset(data: Dataset, copy: bool = True) -> Self

Returns the utility, or a copy of it, with the given dataset.

PARAMETER DESCRIPTION
data

The dataset to use for utility fitting (training data).

TYPE: Dataset

copy

Whether to copy the utility object or not. Valuation methods should always make copies to avoid unexpected side effects.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
Self

The utility object.

Source code in src/pydvl/valuation/utility/base.py
def with_dataset(self, data: Dataset, copy: bool = True) -> Self:
    """Returns the utility, or a copy of it, with the given dataset.
    Args:
        data: The dataset to use for utility fitting (training data)
        copy: Whether to copy the utility object or not. Valuation methods should
            always make copies to avoid unexpected side effects.
    Returns:
        The utility object.
    """
    utility = cp.copy(self) if copy else self
    utility._training_data = data
    return utility
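Because the listing above uses a shallow copy, the returned utility is a new object whose attributes still point at the same underlying objects as the original. The toy class below mirrors that logic with the standard library only; `ToyUtility` and the `copy_utility` parameter name are hypothetical.

```python
import copy


class ToyUtility:
    """Hypothetical stand-in that mirrors the with_dataset() logic shown above."""

    def __init__(self, model):
        self.model = model
        self._training_data = None

    def with_dataset(self, data, copy_utility=True):
        utility = copy.copy(self) if copy_utility else self  # shallow copy
        utility._training_data = data
        return utility


base = ToyUtility(model={"weights": [0.0]})

# Default: a new utility object is returned and `base` keeps its own data slot.
bound = base.with_dataset([1, 2, 3])
base_data_after_copy = base._training_data

# Without copying, `base` itself is modified and returned.
in_place = base.with_dataset([4, 5], copy_utility=False)
```

In this toy version, `bound.model` and `base.model` are the same object, which is what a shallow copy implies for any attribute that is not reassigned.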

__call__

__call__(sample: SampleT | None) -> float
PARAMETER DESCRIPTION
sample

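Contains a subset of valid indices for the x_train attribute of Dataset. If the sample is None or its subset is empty, the scorer's default value is returned without fitting the model.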

TYPE: SampleT | None

Source code in src/pydvl/valuation/utility/modelutility.py
def __call__(self, sample: SampleT | None) -> float:
    """
    Args:
        sample: contains a subset of valid indices for the
            `x_train` attribute of [Dataset][pydvl.utils.dataset.Dataset].
    """
    if sample is None or len(sample.subset) == 0:
        return self.scorer.default

    return cast(float, self._utility_wrapper(sample))
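The early return above matters for marginal contributions of the form u(S ∪ {i}) − u(S), where S may be empty. The following sketch reproduces the short-circuit with hypothetical stand-ins (`ToySample`, `ToyUtility`) and a placeholder score, to show that the empty set never triggers an evaluation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToySample:
    """Hypothetical sample holding only a tuple of indices."""

    subset: Tuple[int, ...]


class ToyUtility:
    """Hypothetical utility that counts how often the model would be fitted."""

    def __init__(self, default: float):
        self.default = default
        self.evaluations = 0

    def __call__(self, sample: Optional[ToySample]) -> float:
        if sample is None or len(sample.subset) == 0:
            return self.default  # no fit, no scoring
        self.evaluations += 1
        return float(len(sample.subset))  # placeholder for a real score


u = ToyUtility(default=0.0)
empty_value = u(ToySample(subset=()))
none_value = u(None)

# Marginal contribution of index 7 to the empty set: u({7}) - u({}).
marginal = u(ToySample(subset=(7,))) - u(ToySample(subset=()))
```

Only the single-element sample counts as an evaluation here; the calls with `None` and with an empty subset return the default directly.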