pydvl.valuation.utility

This module contains classes to manage and learn utility functions for the computation of values.

ModelUtility holds information about the model, the data and the scoring function (the latter being what is usually meant by "utility" in the general definition of the Shapley value). Model-based valuation methods define the utility as the score of a model that has been retrained on a subset of the data. Please see the documentation on Computing Data Values for more information.
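To make the "retrain on a subset, then score" definition concrete, here is a minimal, library-independent sketch. `MajorityClassifier` and the free function `utility` are hypothetical stand-ins invented for this illustration; they are not part of pyDVL, and the real ModelUtility adds cloning, error handling and caching on top of this idea.

```python
from typing import Sequence


class MajorityClassifier:
    """Hypothetical toy model: always predicts the most frequent training label."""

    def fit(self, x: Sequence, y: Sequence[int]) -> "MajorityClassifier":
        self.label_ = max(set(y), key=list(y).count)
        return self

    def score(self, x: Sequence, y: Sequence[int]) -> float:
        # Accuracy of the constant prediction on the given data
        return sum(label == self.label_ for label in y) / len(y)


def utility(subset, x_train, y_train, x_test, y_test, default=0.0) -> float:
    """Retrain a fresh model on `subset` of the training data and score it."""
    if len(subset) == 0:
        return default  # nothing to train on: fall back to a default score
    model = MajorityClassifier()
    model.fit([x_train[i] for i in subset], [y_train[i] for i in subset])
    return model.score(x_test, y_test)


x_train, y_train = [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]
x_test, y_test = [[0.5], [2.5], [2.7]], [0, 1, 1]

u_empty = utility([], x_train, y_train, x_test, y_test)
u_zeros = utility([0, 1], x_train, y_train, x_test, y_test)  # model predicts 0
u_ones = utility([2, 3], x_train, y_train, x_test, y_test)   # model predicts 1
```

Different subsets yield different scores, and those differences are what valuation methods aggregate into a value for each data point.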

Utility evaluations can be cached automatically, including across machines, when a cache backend is configured and passed upon construction.

DataUtilityLearning adds support for learning the scoring function to avoid repeated re-training of the model to compute the score. Several methods exist to learn the utility function.

ModelUtility

ModelUtility(
    model: ModelT,
    scorer: Scorer,
    *,
    catch_errors: bool = True,
    show_warnings: bool = False,
    cache_backend: CacheBackend | None = None,
    cached_func_options: CachedFuncConfig | None = None,
    clone_before_fit: bool = True,
)

Bases: UtilityBase[SampleT], Generic[SampleT, ModelT]

Convenience wrapper with configurable memoization of the utility.

An instance of ModelUtility holds the model and the scoring function that together determine the value of data points. It is used in the computation of all game-theoretic values, like Shapley values and the Least Core.

ModelUtility expects the model to fulfill at least the BaseModel interface, i.e. to have a fit() method.

When calling the utility, the model will be cloned if it is a scikit-learn model; otherwise a copy is created using copy.deepcopy.

Since evaluating the scoring function requires retraining the model, which can be time-consuming, this class wraps the evaluation and can cache the result of each execution. Caching is available both locally and across nodes, but it is off by default and must be explicitly enabled for your project, because most stochastic methods do not benefit much from it. See the caching documentation and the caching module documentation.
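The memoization idea can be sketched without any pyDVL machinery. `MemoizedUtility` and `expensive_utility` below are hypothetical names for this illustration only; the library's own cache backends and hashing scheme are not shown here and may work differently.

```python
from typing import Callable, Iterable


class MemoizedUtility:
    """Hypothetical in-memory cache around an expensive utility function."""

    def __init__(self, utility: Callable[[frozenset], float]):
        self.utility = utility
        self.cache: dict = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, subset: Iterable[int]) -> float:
        key = frozenset(subset)  # the order of the indices does not matter
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.cache[key] = self.utility(key)
        return value


def expensive_utility(subset: frozenset) -> float:
    # Stand-in for "retrain the model on `subset` and score it"
    return len(subset) / 10


u = MemoizedUtility(expensive_utility)
u([1, 2, 3])
u([3, 2, 1])  # same subset in a different order: served from the cache
u([4, 5])     # new subset: computed and stored
```

A hit only occurs when exactly the same subset is requested again. Samplers that draw random subsets of a large dataset rarely repeat one, which is why the text above warns that stochastic methods benefit little from caching.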

ATTRIBUTE DESCRIPTION
model

The supervised model.

TYPE: ModelT

scorer

A scoring function. If None, the score() method of the model will be used. See score for ways to create and compose scorers, in particular how to set default values and ranges.

TYPE: Scorer

PARAMETER DESCRIPTION
model

Any supervised model. Typical choices can be found in the scikit-learn documentation.

TYPE: ModelT

scorer

A scoring object. If None, the score() method of the model will be used. See scorers for ways to create and compose scorers, in particular how to set default values and ranges. For convenience, a string can be passed, which will be used to construct a SupervisedScorer.

TYPE: Scorer

catch_errors

Set to True to catch the errors raised when fit() fails. Failures can occur at several steps of the pipeline, e.g. when too little training data is passed, which happens often during Shapley value calculations. When this happens, the scorer's default value is returned as the score and computation continues.

TYPE: bool DEFAULT: True

show_warnings

Set to False to suppress warnings thrown by fit().

TYPE: bool DEFAULT: False

cache_backend

Optional instance of CacheBackend used to memoize results to avoid duplicate computation. Note however, that for most stochastic methods, cache hits are rare, making the memory expense of caching not worth it (YMMV).

TYPE: CacheBackend | None DEFAULT: None

cached_func_options

Optional configuration object for cached utility evaluation.

TYPE: CachedFuncConfig | None DEFAULT: None

clone_before_fit

If True, the model will be cloned before calling fit().

TYPE: bool DEFAULT: True
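The behaviour described for catch_errors and show_warnings can be pictured with the following sketch. `NeedsTwoPoints` and `safe_utility` are hypothetical helpers written for this illustration; they only mimic the documented contract (return a default score when fit() fails) and are not the library's implementation.

```python
import warnings


class NeedsTwoPoints:
    """Hypothetical toy model whose fit() fails on very small training sets."""

    def fit(self, x, y):
        if len(x) < 2:
            raise ValueError("need at least two training points")
        self.mean_ = sum(y) / len(y)
        return self


def safe_utility(model, x, y, default=0.0, catch_errors=True, show_warnings=False):
    """Fit the model; on failure return `default` instead of raising."""
    try:
        model.fit(x, y)
    except Exception as e:
        if not catch_errors:
            raise
        if show_warnings:
            warnings.warn(f"fit() failed: {e}", RuntimeWarning)
        return default
    return model.mean_  # stand-in for a real score


ok = safe_utility(NeedsTwoPoints(), [[0], [1]], [1.0, 3.0])
failed = safe_utility(NeedsTwoPoints(), [[0]], [1.0])  # too little data
```

With catch_errors disabled, the same failing call would propagate the ValueError and abort the whole valuation run, which is rarely what one wants when many tiny subsets are sampled.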

Example
>>> from pydvl.valuation import Sample, SupervisedScorer
>>> from pydvl.valuation.utility import ModelUtility
>>> from pydvl.valuation.dataset import Dataset
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import load_iris
>>> train, test = Dataset.from_sklearn(load_iris(), random_state=16)
>>> scorer = SupervisedScorer("accuracy", test, 0, (0, 1))
>>> u = ModelUtility(LogisticRegression(random_state=16), scorer)
>>> u = u.with_dataset(train)  # set the training data before evaluating
>>> u(Sample(idx=None, subset=train.indices))
0.9

With caching enabled:

>>> from pydvl.valuation import Sample, SupervisedScorer
>>> from pydvl.valuation.utility import ModelUtility
>>> from pydvl.valuation.dataset import Dataset
>>> from pydvl.utils.caching.memory import InMemoryCacheBackend
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import load_iris
>>> train, test = Dataset.from_sklearn(load_iris(), random_state=16)
>>> scorer = SupervisedScorer("accuracy", test, 0, (0, 1))
>>> cache_backend = InMemoryCacheBackend()
>>> u = ModelUtility(LogisticRegression(random_state=16), scorer, cache_backend=cache_backend)
>>> u = u.with_dataset(train)  # set the training data before evaluating
>>> u(Sample(idx=None, subset=train.indices))
0.9
Source code in src/pydvl/valuation/utility/modelutility.py
def __init__(
    self,
    model: ModelT,
    scorer: Scorer,
    *,
    catch_errors: bool = True,
    show_warnings: bool = False,
    cache_backend: CacheBackend | None = None,
    cached_func_options: CachedFuncConfig | None = None,
    clone_before_fit: bool = True,
):
    self.clone_before_fit = clone_before_fit
    self.model = self._maybe_clone_model(model, clone_before_fit)
    self.scorer = scorer
    self.catch_errors = catch_errors
    self.show_warnings = show_warnings
    self.cache = cache_backend
    if cached_func_options is None:
        cached_func_options = CachedFuncConfig()
    # TODO: Find a better way to do this.
    if cached_func_options.hash_prefix is None:
        # FIX: This does not handle reusing the same across runs.
        cached_func_options.hash_prefix = str(hash((model, scorer)))
    self.cached_func_options = cached_func_options
    self._initialize_utility_wrapper()

training_data property

training_data: Dataset | None

Retrieves the training data used by this utility.

This property is read-only. In order to set it, use with_dataset().

cache_stats property

cache_stats: CacheStats | None

Cache statistics are gathered when cache is enabled. See CacheStats for all fields returned.

with_dataset

with_dataset(data: Dataset, copy: bool = True) -> Self

Returns the utility, or a copy of it, with the given dataset.

PARAMETER DESCRIPTION
data

The dataset to use for utility fitting (training data).

TYPE: Dataset

copy

Whether to copy the utility object or not. Valuation methods should always make copies to avoid unexpected side effects.

TYPE: bool DEFAULT: True

Returns: The utility object.

Source code in src/pydvl/valuation/utility/base.py
def with_dataset(self, data: Dataset, copy: bool = True) -> Self:
    """Returns the utility, or a copy of it, with the given dataset.
    Args:
        data: The dataset to use for utility fitting (training data)
        copy: Whether to copy the utility object or not. Valuation methods should
            always make copies to avoid unexpected side effects.
    Returns:
        The utility object.
    """
    utility = cp.copy(self) if copy else self
    utility._training_data = data
    return utility
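The copy semantics of with_dataset() can be reproduced in a few lines. `TinyUtility` is a hypothetical stand-in, and its flag is named `copy_utility` rather than `copy` only to avoid shadowing the standard `copy` module inside the sketch.

```python
import copy


class TinyUtility:
    """Hypothetical stand-in for a utility with a read-only training_data."""

    def __init__(self, model):
        self.model = model
        self._training_data = None

    @property
    def training_data(self):
        return self._training_data

    def with_dataset(self, data, copy_utility=True):
        # Shallow copy: a new utility object that shares its attributes
        utility = copy.copy(self) if copy_utility else self
        utility._training_data = data
        return utility


base = TinyUtility(model={"weights": [0.0]})
bound = base.with_dataset([1, 2, 3])                  # new object, base untouched
same = base.with_dataset([4, 5], copy_utility=False)  # mutates base in place
```

Because `copy.copy` is shallow, `bound` and `base` still share the same model object. Only the reference to the training data differs between them.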

__call__

__call__(sample: SampleT | None) -> float
PARAMETER DESCRIPTION
sample

A sample containing a subset of valid indices for the x_train attribute of Dataset.

TYPE: SampleT | None

Source code in src/pydvl/valuation/utility/modelutility.py
def __call__(self, sample: SampleT | None) -> float:
    """
    Args:
        sample: contains a subset of valid indices for the
            `x_train` attribute of [Dataset][pydvl.utils.dataset.Dataset].
    """
    if sample is None or len(sample.subset) == 0:
        return self.scorer.default

    return cast(float, self._utility_wrapper(sample))

ClasswiseModelUtility

ClasswiseModelUtility(
    model: SupervisedModel,
    scorer: ClasswiseSupervisedScorer,
    *,
    catch_errors: bool = True,
    show_warnings: bool = False,
    cache_backend: CacheBackend | None = None,
    cached_func_options: CachedFuncConfig | None = None,
    clone_before_fit: bool = True,
)

Bases: ModelUtility[ClasswiseSample, SupervisedModel]

ModelUtility class that is specific to class-wise Shapley valuation.

It expects a classwise scorer and a classification task.

PARAMETER DESCRIPTION
model

Any supervised model. Typical choices can be found in the scikit-learn documentation.

TYPE: SupervisedModel

scorer

A class-wise scoring object.

TYPE: ClasswiseSupervisedScorer

catch_errors

Set to True to catch the errors raised when fit() fails. Failures can occur at several steps of the pipeline, e.g. when too little training data is passed, which happens often during Shapley value calculations. When this happens, the scorer's default value is returned as the score and computation continues.

TYPE: bool DEFAULT: True

show_warnings

Set to False to suppress warnings thrown by fit().

TYPE: bool DEFAULT: False

cache_backend

Optional instance of CacheBackend used to wrap the _utility method of the Utility instance. By default, this is set to None and that means that the utility evaluations will not be cached.

TYPE: CacheBackend | None DEFAULT: None

cached_func_options

Optional configuration object for cached utility evaluation.

TYPE: CachedFuncConfig | None DEFAULT: None

clone_before_fit

If True, the model will be cloned before calling fit().

TYPE: bool DEFAULT: True

Source code in src/pydvl/valuation/utility/classwise.py
def __init__(
    self,
    model: SupervisedModel,
    scorer: ClasswiseSupervisedScorer,
    *,
    catch_errors: bool = True,
    show_warnings: bool = False,
    cache_backend: CacheBackend | None = None,
    cached_func_options: CachedFuncConfig | None = None,
    clone_before_fit: bool = True,
):
    super().__init__(
        model,
        scorer,
        catch_errors=catch_errors,
        show_warnings=show_warnings,
        cache_backend=cache_backend,
        cached_func_options=cached_func_options,
        clone_before_fit=clone_before_fit,
    )
    if not isinstance(self.scorer, ClasswiseSupervisedScorer):
        raise ValueError("Scorer must be an instance of ClasswiseSupervisedScorer")
    self.scorer: ClasswiseSupervisedScorer

training_data property

training_data: Dataset | None

Retrieves the training data used by this utility.

This property is read-only. In order to set it, use with_dataset().

cache_stats property

cache_stats: CacheStats | None

Cache statistics are gathered when cache is enabled. See CacheStats for all fields returned.

with_dataset

with_dataset(data: Dataset, copy: bool = True) -> Self

Returns the utility, or a copy of it, with the given dataset.

PARAMETER DESCRIPTION
data

The dataset to use for utility fitting (training data).

TYPE: Dataset

copy

Whether to copy the utility object or not. Valuation methods should always make copies to avoid unexpected side effects.

TYPE: bool DEFAULT: True

Returns: The utility object.

Source code in src/pydvl/valuation/utility/base.py
def with_dataset(self, data: Dataset, copy: bool = True) -> Self:
    """Returns the utility, or a copy of it, with the given dataset.
    Args:
        data: The dataset to use for utility fitting (training data)
        copy: Whether to copy the utility object or not. Valuation methods should
            always make copies to avoid unexpected side effects.
    Returns:
        The utility object.
    """
    utility = cp.copy(self) if copy else self
    utility._training_data = data
    return utility

__call__

__call__(sample: SampleT | None) -> float
PARAMETER DESCRIPTION
sample

A sample containing a subset of valid indices for the x_train attribute of Dataset.

TYPE: SampleT | None

Source code in src/pydvl/valuation/utility/modelutility.py
def __call__(self, sample: SampleT | None) -> float:
    """
    Args:
        sample: contains a subset of valid indices for the
            `x_train` attribute of [Dataset][pydvl.utils.dataset.Dataset].
    """
    if sample is None or len(sample.subset) == 0:
        return self.scorer.default

    return cast(float, self._utility_wrapper(sample))

KNNClassifierUtility

KNNClassifierUtility(
    model: KNeighborsClassifier,
    test_data: Dataset,
    *,
    catch_errors: bool = True,
    show_warnings: bool = False,
    cache_backend: CacheBackend | None = None,
    cached_func_options: CachedFuncConfig | None = None,
    clone_before_fit: bool = True,
)

Bases: ModelUtility[Sample, KNeighborsClassifier]

Utility object for KNN Classifiers.

The utility function is the model's predicted probability for the true class.
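As a rough illustration of "predicted probability for the true class", the sketch below fits a scikit-learn KNeighborsClassifier on a subset and averages, over the test points, the probability assigned to each point's true label. The function `knn_utility`, the averaging over test points and the zero probability for classes absent from the subset are assumptions made for this example, not a description of the library's internals.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def knn_utility(model, x_train, y_train, x_test, y_test, subset):
    """Mean predicted probability of the true class over the test points."""
    if len(subset) == 0:
        return 0.0
    model.fit(x_train[subset], y_train[subset])
    proba = model.predict_proba(x_test)  # one column per class seen in the subset
    column = {label: j for j, label in enumerate(model.classes_)}
    # Classes that do not appear in the subset get probability 0
    true_class_proba = [
        p[column[label]] if label in column else 0.0
        for p, label in zip(proba, y_test)
    ]
    return float(np.mean(true_class_proba))


x_train = np.array([[0.0], [0.1], [1.0], [1.1]])
y_train = np.array([0, 0, 1, 1])
x_test = np.array([[0.05], [1.05]])
y_test = np.array([0, 1])

model = KNeighborsClassifier(n_neighbors=1)
full = knn_utility(model, x_train, y_train, x_test, y_test, [0, 1, 2, 3])
half = knn_utility(model, x_train, y_train, x_test, y_test, [0, 1])
```

With all four training points both test points are classified correctly, while the subset containing only class 0 can explain just one of them, so its utility is lower.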

Uses of this utility

Although this class can be used in conjunction with any semi-value method and sampler, when computing Shapley values, it is recommended to use the dedicated class KNNShapleyValuation, because it implements a more efficient algorithm for computing Shapley values which runs in O(n log n) time for each test point.

PARAMETER DESCRIPTION
model

A KNN classifier model.

TYPE: KNeighborsClassifier

test_data

The test data to evaluate the model on.

TYPE: Dataset

catch_errors

Set to True to catch the errors raised when fit() fails. Failures can occur at several steps of the pipeline, e.g. when too little training data is passed, which happens often during Shapley value calculations. When this happens, the scorer's default value is returned as the score and computation continues.

TYPE: bool DEFAULT: True

show_warnings

Set to False to suppress warnings thrown by fit().

TYPE: bool DEFAULT: False

cache_backend

Optional instance of CacheBackend used to wrap the _utility method of the Utility instance. By default, this is set to None and that means that the utility evaluations will not be cached.

TYPE: CacheBackend | None DEFAULT: None

cached_func_options

Optional configuration object for cached utility evaluation.

TYPE: CachedFuncConfig | None DEFAULT: None

clone_before_fit

If True, the model will be cloned before calling fit() in utility evaluations.

TYPE: bool DEFAULT: True

Source code in src/pydvl/valuation/utility/knn.py
def __init__(
    self,
    model: KNeighborsClassifier,
    test_data: Dataset,
    *,
    catch_errors: bool = True,
    show_warnings: bool = False,
    cache_backend: CacheBackend | None = None,
    cached_func_options: CachedFuncConfig | None = None,
    clone_before_fit: bool = True,
):
    self.test_data = test_data
    self.sorted_neighbors: NDArray[np.int_] | None = None
    dummy_scorer = _DummyScorer()

    super().__init__(
        model=model,
        scorer=dummy_scorer,  # not applicable
        catch_errors=catch_errors,
        show_warnings=show_warnings,
        cache_backend=cache_backend,
        cached_func_options=cached_func_options,
        clone_before_fit=clone_before_fit,
    )

training_data property

training_data: Dataset | None

Retrieves the training data used by this utility.

This property is read-only. In order to set it, use with_dataset().

cache_stats property

cache_stats: CacheStats | None

Cache statistics are gathered when cache is enabled. See CacheStats for all fields returned.

__call__

__call__(sample: SampleT | None) -> float
PARAMETER DESCRIPTION
sample

A sample containing a subset of valid indices for the x_train attribute of Dataset.

TYPE: SampleT | None

Source code in src/pydvl/valuation/utility/modelutility.py
def __call__(self, sample: SampleT | None) -> float:
    """
    Args:
        sample: contains a subset of valid indices for the
            `x_train` attribute of [Dataset][pydvl.utils.dataset.Dataset].
    """
    if sample is None or len(sample.subset) == 0:
        return self.scorer.default

    return cast(float, self._utility_wrapper(sample))

with_dataset

with_dataset(data: Dataset, copy: bool = True) -> Self

Return the utility, or a copy of it, with the given dataset and the model fitted on it.

PARAMETER DESCRIPTION
data

The dataset to use.

TYPE: Dataset

copy

Whether to copy the utility object or not. Additionally, if True then the model is also cloned. If False, the model is only cloned if clone_before_fit is True.

TYPE: bool DEFAULT: True

Returns: The utility object.

Source code in src/pydvl/valuation/utility/knn.py
def with_dataset(self, data: Dataset, copy: bool = True) -> Self:
    """Return the utility, or a copy of it, with the given dataset and the model
    fitted on it.

    Args:
        data: The dataset to use.
        copy: Whether to copy the utility object or not. Additionally, if `True`
            then the model is also cloned. If `False`, the model is only cloned if
            `clone_before_fit` is `True`.
    Returns:
        The utility object.
    """
    utility: Self = super().with_dataset(data, copy)
    if copy or self.clone_before_fit:
        utility.model = self._maybe_clone_model(self.model, do_clone=True)
    utility.model.fit(*data.data())
    return utility

UtilityModel

Bases: ABC

Interface for utility models.

A utility model predicts the value of a utility function given a sample. The model is trained on a collection of samples and their respective utility values. These tuples are called Utility Samples.

Utility models:

  • are fitted on dictionaries of Sample -> utility value
  • predict: Collection[samples] -> NDArray[utility values]

IndicatorUtilityModel

IndicatorUtilityModel(predictor: SupervisedModel, n_data: int)

Bases: UtilityModel

A simple wrapper for arbitrary predictors.

Uses a one-hot encoding of the indices as input for the model, as done in Wang et al. (2022).
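The indicator encoding can be sketched as follows: each subset becomes a 0/1 vector of length n_data, and a regressor is fitted on these vectors. The helper `indicator_encoding` and the toy utility samples are hypothetical, and a plain least-squares fit via NumPy is swapped in here for the wrapped predictor.

```python
import numpy as np


def indicator_encoding(subsets, n_data):
    """One row per subset, with a 1 in column i iff index i is in the subset."""
    x = np.zeros((len(subsets), n_data))
    for row, subset in enumerate(subsets):
        x[row, list(subset)] = 1.0
    return x


# Hypothetical utility samples: subset of indices -> observed utility
utility_samples = {(0,): 0.2, (1,): 0.3, (0, 1): 0.5, (2,): 0.1, (0, 2): 0.3}
x = indicator_encoding(list(utility_samples), n_data=3)
y = np.array(list(utility_samples.values()))

# Least-squares fit of a linear utility model without intercept
weights, *_ = np.linalg.lstsq(x, y, rcond=None)

# Predict the utility of a subset that was never evaluated
prediction = float((indicator_encoding([(1, 2)], n_data=3) @ weights)[0])
```

The toy utilities here are additive in the data points, so a linear model recovers them exactly. Real utilities are generally not additive, and the quality of the prediction then depends on the chosen predictor.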

Source code in src/pydvl/valuation/utility/learning.py
def __init__(self, predictor: SupervisedModel, n_data: int):
    self.n_data = n_data
    self.predictor = predictor

DataUtilityLearning

DataUtilityLearning(
    utility: UtilityBase,
    training_budget: int,
    model: UtilityModel,
    show_warnings: bool = True,
)

Bases: UtilityBase[SampleT]

This object wraps any class derived from UtilityBase and delegates calls to it, up until a given budget (number of iterations). Every tuple of input and output (a so-called utility sample) is stored. Once the budget is exhausted, DataUtilityLearning fits the given model to the utility samples. Subsequent calls will use the learned model to predict the utility instead of delegating.
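The delegate-then-predict behaviour described above can be sketched in plain Python. `BudgetedUtility`, `MeanPerPointModel` and `true_utility` are hypothetical names for this illustration; the actual class works with Sample objects and a UtilityModel, and its bookkeeping may differ.

```python
class BudgetedUtility:
    """Hypothetical sketch: delegate until the budget is spent, then predict."""

    def __init__(self, utility, training_budget, model):
        self.utility = utility
        self.training_budget = training_budget
        self.model = model
        self.samples = {}  # subset -> true utility
        self.is_fitted = False

    def __call__(self, subset):
        key = frozenset(subset)
        if len(self.samples) < self.training_budget:
            value = self.samples[key] = self.utility(key)  # delegate and store
            return value
        if not self.is_fitted:
            self.model.fit(self.samples)  # fit once, on the stored samples
            self.is_fitted = True
        return self.model.predict(key)


class MeanPerPointModel:
    """Toy utility model: predicts (average utility per data point) * size."""

    def fit(self, samples):
        self.rate_ = sum(samples.values()) / sum(len(s) for s in samples)

    def predict(self, subset):
        return self.rate_ * len(subset)


calls = []


def true_utility(subset):
    calls.append(subset)  # track how often the expensive utility runs
    return 0.1 * len(subset)


dul = BudgetedUtility(true_utility, training_budget=2, model=MeanPerPointModel())
first = dul([0])
second = dul([0, 1])
third = dul([0, 1, 2])  # budget exhausted: predicted, not computed
```

Only the first two calls reach the expensive utility. Note that storing samples in a dictionary means repeated evaluations of the same subset count once towards the budget.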

PARAMETER DESCRIPTION
utility

The utility to learn. Typically, this will be a ModelUtility object encapsulating a machine learning model which requires fitting on each evaluation of the utility.

TYPE: UtilityBase

training_budget

Number of utility samples to collect before fitting the given model.

TYPE: int

model

The utility model to fit on the collected utility samples, e.g. an IndicatorUtilityModel wrapping a supervised regression model.

TYPE: UtilityModel

Example
import numpy as np
from pydvl.valuation import Dataset, DataUtilityLearning, ModelUtility, \
    Sample, SupervisedScorer
from pydvl.valuation.utility.learning import IndicatorUtilityModel
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.datasets import load_iris

train, test = Dataset.from_sklearn(load_iris())
scorer = SupervisedScorer("accuracy", test, 0, (0, 1))
# Accuracy is a classification metric, so the valued model is a classifier
utility = ModelUtility(LogisticRegression(), scorer).with_dataset(train)
# The utility model itself is a regressor over indicator-encoded subsets
utility_model = IndicatorUtilityModel(LinearRegression(), len(train))
dul = DataUtilityLearning(utility, 3, utility_model)
# The first 3 distinct samples are evaluated with the wrapped utility
for i in range(3):
    _ = dul(Sample(0, np.array([i + 1, i + 2])))
# Subsequent calls are predicted by the fitted utility_model
dul(Sample(0, np.array([1, 2, 3])))
Source code in src/pydvl/valuation/utility/learning.py
def __init__(
    self,
    utility: UtilityBase,
    training_budget: int,
    model: UtilityModel,
    show_warnings: bool = True,
) -> None:
    self.utility = utility
    self.training_budget = training_budget
    self.model = model
    self.n_predictions = 0
    self.show_warnings = show_warnings
    self._is_fitted = False
    self._utility_samples: dict[Sample, float] = {}

training_data property

training_data: Dataset | None

Retrieves the training data used by this utility.

This property is read-only. In order to set it, use with_dataset().

with_dataset

with_dataset(data: Dataset, copy: bool = True) -> Self

Returns the utility, or a copy of it, with the given dataset.

PARAMETER DESCRIPTION
data

The dataset to use for utility fitting (training data).

TYPE: Dataset

copy

Whether to copy the utility object or not. Valuation methods should always make copies to avoid unexpected side effects.

TYPE: bool DEFAULT: True

Returns: The utility object.

Source code in src/pydvl/valuation/utility/base.py
def with_dataset(self, data: Dataset, copy: bool = True) -> Self:
    """Returns the utility, or a copy of it, with the given dataset.
    Args:
        data: The dataset to use for utility fitting (training data)
        copy: Whether to copy the utility object or not. Valuation methods should
            always make copies to avoid unexpected side effects.
    Returns:
        The utility object.
    """
    utility = cp.copy(self) if copy else self
    utility._training_data = data
    return utility