Utility

This module contains classes to manage and learn utility functions for the computation of values. Please see the documentation on Computing Data Values for more information.

Utility holds a model, a dataset and a scoring function (the scoring function is what is usually called the utility in the general definition of the Shapley value). Its results are cached automatically, including across machines, when the cache is configured and caching is enabled upon construction.

DataUtilityLearning adds support for learning the scoring function to avoid repeated re-training of the model to compute the score.

This module also contains derived Utility classes for toy games that are used for testing and for demonstration purposes.

References

Wang, T., Yang, Y. and Jia, R., 2021. Improving cooperative game theory-based data valuation via data utility learning. arXiv preprint arXiv:2107.06336.

`Utility(model, data, scorer=None, *, default_score=0.0, score_range=(-np.inf, np.inf), catch_errors=True, show_warnings=False, enable_cache=False, cache_options=None, clone_before_fit=True)`

Convenience wrapper with configurable memoization of the scoring function.

An instance of Utility holds the triple of model, dataset and scoring function which determines the value of data points. This is used for the computation of all game-theoretic values like Shapley values and the Least Core.

The Utility expects the model to fulfill the SupervisedModel interface, i.e. to have fit(), predict(), and score() methods.

When calling the utility, the model is cloned if it is a scikit-learn model; otherwise a copy is created using copy.deepcopy.

Evaluating the scoring function requires retraining the model, which can be time-consuming, so this class wraps it and caches the result of each execution. Caching is available both locally and across nodes, but it must always be enabled for your project first; see Setting up the cache.

ATTRIBUTE	DESCRIPTION
`model`	The supervised model. TYPE: `SupervisedModel`
`data`	An object containing the split data. TYPE: `Dataset`
`scorer`	A scoring function. If None, the `score()` method of the model will be used. See score for ways to create and compose scorers, in particular how to set default values and ranges. TYPE: `Scorer`

PARAMETER	DESCRIPTION
`model`	Any supervised model. Typical choices can be found in the [scikit-learn documentation](https://scikit-learn.org/stable/supervised_learning.html). TYPE: `SupervisedModel`
`data`	Dataset or GroupedDataset instance. TYPE: `Dataset`
`scorer`	A scoring object. If None, the `score()` method of the model will be used. See score for ways to create and compose scorers, in particular how to set default values and ranges. For convenience, a string can be passed, which will be used to construct a Scorer. TYPE: `Optional[Union[str, Scorer]]` DEFAULT: `None`
`default_score`	A convenience argument for when no `scorer` object is passed (a `scorer` can carry its own default value). It sets the default score for models that have not been fit, e.g. when too little data is passed or errors arise. TYPE: `float` DEFAULT: `0.0`
`score_range`	As with `default_score`, this is a convenience argument for when no `scorer` argument is provided, to set the numerical range of the score function. Some Monte Carlo methods can use this to estimate the number of samples required for a certain quality of approximation. TYPE: `Tuple[float, float]` DEFAULT: `(-inf, inf)`
`catch_errors`	Set to `True` to catch errors raised when `fit()` fails. This can happen at several steps of the pipeline, e.g. when too little training data is passed, which happens often during Shapley value calculations. When it happens, the `default_score` is returned as the score and computation continues. TYPE: `bool` DEFAULT: `True`
`show_warnings`	Set to `False` to suppress warnings thrown by `fit()`. TYPE: `bool` DEFAULT: `False`
`enable_cache`	If `True`, use memcached for memoization. TYPE: `bool` DEFAULT: `False`
`cache_options`	Optional configuration object for memcached. TYPE: `Optional[MemcachedConfig]` DEFAULT: `None`
`clone_before_fit`	If `True`, the model will be cloned before calling `fit()`. TYPE: `bool` DEFAULT: `True`
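
The interplay of `catch_errors` and `default_score` described above can be illustrated with a minimal sketch. It is not the library's implementation: `safe_score`, `failing_fit` and the handling of NaN scores are hypothetical names and assumptions made for illustration only.

```python
import math
import warnings


def safe_score(fit_and_score, default_score=0.0, catch_errors=True):
    """Hypothetical helper: run a fit-and-score callable with a fallback value."""
    try:
        score = fit_and_score()
    except Exception as e:
        if not catch_errors:
            raise
        # Report the failure, then continue with the default score.
        warnings.warn(f"fit failed: {e}", RuntimeWarning)
        return default_score
    # Assumption of this sketch: a NaN score is treated like a failed fit.
    if math.isnan(score):
        return default_score
    return score


def failing_fit():
    # Stand-in for a model that cannot be fit on a tiny subset.
    raise ValueError("need at least two samples")


with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # comparable in spirit to show_warnings=False
    fallback = safe_score(failing_fit, default_score=0.0)

normal = safe_score(lambda: 0.9)
```

With `catch_errors=False` the same failing call would propagate the `ValueError` instead of returning the default.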

Example

>>> from pydvl.utils import Utility, Dataset
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import load_iris
>>> dataset = Dataset.from_sklearn(load_iris(), random_state=16)
>>> u = Utility(LogisticRegression(random_state=16), dataset)
>>> u(dataset.indices)
0.9

Source code in src/pydvl/utils/utility.py

def __init__(
    self,
    model: SupervisedModel,
    data: Dataset,
    scorer: Optional[Union[str, Scorer]] = None,
    *,
    default_score: float = 0.0,
    score_range: Tuple[float, float] = (-np.inf, np.inf),
    catch_errors: bool = True,
    show_warnings: bool = False,
    enable_cache: bool = False,
    cache_options: Optional[MemcachedConfig] = None,
    clone_before_fit: bool = True,
):
    self.model = self._clone_model(model)
    self.data = data
    if isinstance(scorer, str):
        scorer = Scorer(scorer, default=default_score, range=score_range)
    self.scorer = check_scoring(self.model, scorer)
    self.default_score = scorer.default if scorer is not None else default_score
    # TODO: auto-fill from known scorers ?
    self.score_range = scorer.range if scorer is not None else np.array(score_range)
    self.catch_errors = catch_errors
    self.show_warnings = show_warnings
    self.enable_cache = enable_cache
    self.cache_options: MemcachedConfig = cache_options or MemcachedConfig()
    self.clone_before_fit = clone_before_fit
    self._signature = serialize((hash(self.model), hash(data), hash(scorer)))
    self._initialize_utility_wrapper()

`signature` `property`

Signature used for caching model results.

`cache_stats: Optional[CacheStats]` `property`

Cache statistics are gathered when cache is enabled. See CacheStats for all fields returned.

`__call__(indices)`

PARAMETER	DESCRIPTION
`indices`	A subset of valid indices for the `x_train` attribute of Dataset. TYPE: `Iterable[int]`

Source code in src/pydvl/utils/utility.py

def __call__(self, indices: Iterable[int]) -> float:
    """
    Args:
        indices: a subset of valid indices for the
            `x_train` attribute of [Dataset][pydvl.utils.dataset.Dataset].
    """
    utility: float = self._utility_wrapper(frozenset(indices))
    return utility

`DataUtilityLearning(u, training_budget, model)`

Implementation of Data Utility Learning (Wang et al., 2021)¹.

This object wraps a Utility and delegates calls to it, up until a given budget (number of iterations). Every tuple of input and output (a so-called utility sample) is stored. Once the budget is exhausted, DataUtilityLearning fits the given model to the utility samples. Subsequent calls will use the learned model to predict the utility instead of delegating.

PARAMETER	DESCRIPTION
`u`	The Utility to learn. TYPE: `Utility`
`training_budget`	Number of utility samples to collect before fitting the given model. TYPE: `int`
`model`	A supervised regression model. TYPE: `SupervisedModel`

Example

>>> from pydvl.utils import Utility, DataUtilityLearning, Dataset
>>> from sklearn.linear_model import LinearRegression, LogisticRegression
>>> from sklearn.datasets import load_iris
>>> dataset = Dataset.from_sklearn(load_iris())
>>> u = Utility(LogisticRegression(), dataset)
>>> wrapped_u = DataUtilityLearning(u, 3, LinearRegression())
... # First 3 calls will be computed normally
>>> for i in range(3):
...     _ = wrapped_u((i,))
>>> wrapped_u((1, 2, 3)) # Subsequent calls will be computed using the fit model for DUL
0.0
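
The delegate-then-learn behaviour described for DataUtilityLearning can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's code: `BudgetedUtility` and `MeanRegressor` are hypothetical names, subsets are encoded as 0/1 membership vectors, and `training_budget` is assumed to be at least 1.

```python
class MeanRegressor:
    """Hypothetical stand-in model: predicts the mean of the training targets."""

    def fit(self, x, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, x):
        return [self.mean_ for _ in x]


class BudgetedUtility:
    """Delegates to `u` for `training_budget` calls, then predicts with `model`."""

    def __init__(self, u, n_items, training_budget, model):
        self.u = u
        self.n_items = n_items
        self.training_budget = training_budget
        self.model = model
        self.calls = 0
        self.is_fit = False
        self.samples = {}  # frozenset of indices -> (membership vector, utility)

    def _encode(self, key):
        # 0/1 vector marking which items belong to the subset.
        return [1.0 if i in key else 0.0 for i in range(self.n_items)]

    def __call__(self, indices):
        key = frozenset(indices)
        if self.calls < self.training_budget:
            # Budget not exhausted: compute the true utility and store the sample.
            self.calls += 1
            value = self.u(key)
            self.samples[key] = (self._encode(key), value)
            return value
        if not self.is_fit:
            # Budget exhausted: fit the model once on all stored samples.
            xs, ys = zip(*self.samples.values())
            self.model.fit(list(xs), list(ys))
            self.is_fit = True
        return self.model.predict([self._encode(key)])[0]


true_u = lambda s: float(len(s))  # toy utility: size of the subset
wrapped = BudgetedUtility(true_u, n_items=4, training_budget=3, model=MeanRegressor())
observed = [wrapped((0,)), wrapped((0, 1)), wrapped((0, 1, 2))]
predicted = wrapped((3,))  # answered by the fitted model, not by true_u
```

Here the first three calls return the true utilities 1.0, 2.0 and 3.0, and the fourth returns their mean, because the stand-in model ignores its input.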

Source code in src/pydvl/utils/utility.py

def __init__(
    self, u: Utility, training_budget: int, model: SupervisedModel
) -> None:
    self.utility = u
    self.training_budget = training_budget
    self.model = model
    self._current_iteration = 0
    self._is_model_fit = False
    self._utility_samples: Dict[FrozenSet, Tuple[NDArray[np.bool_], float]] = {}

`data: Dataset` `property`

Returns the wrapped utility's Dataset.

`MinerGameUtility(n_miners, **kwargs)`

Bases: Utility

Toy game utility that is used for testing and demonstration purposes.

Consider a group of n miners, who have discovered large bars of gold.

If two miners can carry one piece of gold, then the payoff of a coalition \(S\) is:

\[{ v(S) = \left\{\begin{array}{lll} \mid S \mid / 2 & \text{, if} & \mid S \mid \text{ is even} \\ ( \mid S \mid - 1)/2 & \text{, if} & \mid S \mid \text{ is odd} \end{array}\right. }\]

If there are more than two miners and there is an even number of miners, then the core consists of the single payoff where each miner gets 1/2.

If there is an odd number of miners, then the core is empty.

Taken from Wikipedia
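
As a worked illustration of the payoff above, both cases reduce to integer division of the coalition size by two. The sketch below also checks whether a given allocation satisfies the core constraints by brute force; `miner_payoff` and `in_core` are hypothetical helpers, and exact fractions are used to avoid floating-point comparisons.

```python
from fractions import Fraction
from itertools import combinations


def miner_payoff(coalition) -> int:
    """v(S): two miners carry one bar, so the payoff is floor(|S| / 2)."""
    return len(set(coalition)) // 2


def in_core(allocation) -> bool:
    """Check efficiency and that no coalition receives less than its payoff."""
    n = len(allocation)
    players = range(n)
    if sum(allocation) != miner_payoff(players):
        return False
    return all(
        sum(allocation[i] for i in s) >= miner_payoff(s)
        for size in range(1, n + 1)
        for s in combinations(players, size)
    )


even_split_ok = in_core([Fraction(1, 2)] * 4)  # four miners, 1/2 each
odd_split_ok = in_core([Fraction(2, 5)] * 5)   # five miners, equal split of v(N) = 2
```

With four miners the equal split passes every constraint. With five miners the equal split fails, since any pair receives 4/5 but could secure 1 on its own. That is consistent with the empty core for odd numbers of miners, although a single failing allocation is not by itself a proof.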

PARAMETER	DESCRIPTION
`n_miners`	Number of miners that participate in the game. TYPE: `int`

Source code in src/pydvl/utils/utility.py

def __init__(self, n_miners: int, **kwargs):
    if n_miners <= 2:
        raise ValueError(f"n_miners, {n_miners} should be > 2")
    self.n_miners = n_miners

    x = np.arange(n_miners)[..., np.newaxis]
    # The y values don't matter here
    y = np.zeros_like(x)

    self.data = Dataset(x_train=x, y_train=y, x_test=x, y_test=y)

`GlovesGameUtility(left, right, **kwargs)`

Bases: Utility

Toy game utility that is used for testing and demonstration purposes.

In this game, some players have a left glove and others a right glove. Single gloves have a worth of zero while pairs have a worth of 1.

The payoff of a coalition \(S\) is:

\[{ v(S) = \min( \mid S \cap L \mid, \mid S \cap R \mid ) }\]

Where \(L\), respectively \(R\), is the set of players with left gloves, respectively right gloves.
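
The payoff can be computed directly from its definition. The sketch below assumes, purely for illustration, that players `0 .. left - 1` hold left gloves and players `left .. left + right - 1` hold right gloves; `gloves_payoff` is a hypothetical helper, not a library function.

```python
def gloves_payoff(coalition, left: int, right: int) -> int:
    """v(S) = min(|S ∩ L|, |S ∩ R|) under the index convention assumed above."""
    s = set(coalition)
    n_left = sum(1 for i in s if 0 <= i < left)
    n_right = sum(1 for i in s if left <= i < left + right)
    # Each complete pair is worth 1; unmatched gloves are worth nothing.
    return min(n_left, n_right)


only_left = gloves_payoff({0, 1}, left=2, right=2)      # no right glove, no pair
one_pair = gloves_payoff({0, 1, 3}, left=2, right=2)    # two left, one right
two_pairs = gloves_payoff({0, 1, 2, 3}, left=2, right=2)
```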

PARAMETER	DESCRIPTION
`left`	Number of players with a left glove. TYPE: `int`
`right`	Number of players with a right glove. TYPE: `int`

Source code in src/pydvl/utils/utility.py

def __init__(self, left: int, right: int, **kwargs):
    self.left = left
    self.right = right

    x = np.empty(left + right)[..., np.newaxis]
    # The y values don't matter here
    y = np.zeros_like(x)

    self.data = Dataset(x_train=x, y_train=y, x_test=x, y_test=y)

Last update: 2023-10-14
Created: 2023-10-14
