
pydvl.utils

squashed_r2 module-attribute

squashed_r2 = compose_score(Scorer('r2'), _sigmoid, (0, 1), 'squashed r2')

A scorer that squashes the R² score into the range [0, 1] using a sigmoid.

squashed_variance module-attribute

squashed_variance = compose_score(
    Scorer("explained_variance"),
    _sigmoid,
    (0, 1),
    "squashed explained variance",
)

A scorer that squashes the explained variance score into the range [0, 1] using a sigmoid.
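For illustration, either squashed scorer can be passed to a Utility like any other Scorer. A minimal sketch, assuming scikit-learn is installed and using only names exported by pydvl.utils:

>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import LinearRegression
>>> from pydvl.utils import Dataset, Utility, squashed_r2
>>> X, y = make_regression(n_samples=50, n_features=3, random_state=0)
>>> dataset = Dataset.from_arrays(X, y, random_state=0)
>>> utility = Utility(LinearRegression(), dataset, scorer=squashed_r2)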

CacheStats dataclass

CacheStats(
    sets: int = 0,
    misses: int = 0,
    hits: int = 0,
    timeouts: int = 0,
    errors: int = 0,
    reconnects: int = 0,
)

Class used to store statistics gathered by cached functions.

ATTRIBUTE DESCRIPTION
sets

Number of times a value was set in the cache.

TYPE: int

misses

Number of times a value was not found in the cache.

TYPE: int

hits

Number of times a value was found in the cache.

TYPE: int

timeouts

Number of times a timeout occurred.

TYPE: int

errors

Number of times an error occurred.

TYPE: int

reconnects

Number of times the client reconnected to the server.

TYPE: int
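The counters are updated by the backends' get and set methods, so they can be inspected directly on a backend or through a wrapped function's stats property. A small sketch using the in-memory backend documented below:

>>> from pydvl.utils.caching.memory import InMemoryCacheBackend
>>> backend = InMemoryCacheBackend()
>>> backend.set("key", 42)
>>> _ = backend.get("key")
>>> _ = backend.get("missing")
>>> (backend.stats.sets, backend.stats.hits, backend.stats.misses)
(1, 1, 1)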

CacheBackend

CacheBackend()

Bases: ABC

Abstract base class for cache backends.

Defines interface for cache access including wrapping callables, getting/setting results, clearing cache, and combining cache keys.

ATTRIBUTE DESCRIPTION
stats

Cache statistics tracker.

Source code in src/pydvl/utils/caching/base.py
def __init__(self) -> None:
    self.stats = CacheStats()

wrap

wrap(
    func: Callable, *, config: Optional[CachedFuncConfig] = None
) -> CachedFunc

Wraps a function to cache its results.

PARAMETER DESCRIPTION
func

The function to wrap.

TYPE: Callable

config

Optional caching options for the wrapped function.

TYPE: Optional[CachedFuncConfig] DEFAULT: None

RETURNS DESCRIPTION
CachedFunc

The wrapped cached function.

Source code in src/pydvl/utils/caching/base.py
def wrap(
    self,
    func: Callable,
    *,
    config: Optional[CachedFuncConfig] = None,
) -> "CachedFunc":
    """Wraps a function to cache its results.

    Args:
        func: The function to wrap.
        config: Optional caching options for the wrapped function.

    Returns:
        The wrapped cached function.
    """
    return CachedFunc(
        func,
        cache_backend=self,
        config=config,
    )

get abstractmethod

get(key: str) -> Optional[CacheResult]

Abstract method to retrieve a cached result.

Implemented by subclasses.

PARAMETER DESCRIPTION
key

The cache key.

TYPE: str

RETURNS DESCRIPTION
Optional[CacheResult]

The cached result or None if not found.

Source code in src/pydvl/utils/caching/base.py
@abstractmethod
def get(self, key: str) -> Optional[CacheResult]:
    """Abstract method to retrieve a cached result.

    Implemented by subclasses.

    Args:
        key: The cache key.

    Returns:
        The cached result or None if not found.
    """
    pass

set abstractmethod

set(key: str, value: CacheResult) -> None

Abstract method to set a cached result.

Implemented by subclasses.

PARAMETER DESCRIPTION
key

The cache key.

TYPE: str

value

The result to cache.

TYPE: CacheResult

Source code in src/pydvl/utils/caching/base.py
@abstractmethod
def set(self, key: str, value: CacheResult) -> None:
    """Abstract method to set a cached result.

    Implemented by subclasses.

    Args:
        key: The cache key.
        value: The result to cache.
    """
    pass

clear abstractmethod

clear() -> None

Abstract method to clear the entire cache.

Source code in src/pydvl/utils/caching/base.py
@abstractmethod
def clear(self) -> None:
    """Abstract method to clear the entire cache."""
    pass

combine_hashes abstractmethod

combine_hashes(*args: str) -> str

Abstract method to combine cache keys.

Source code in src/pydvl/utils/caching/base.py
@abstractmethod
def combine_hashes(self, *args: str) -> str:
    """Abstract method to combine cache keys."""
    pass
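As a rough sketch of how a custom backend might implement this interface (the class name DictCacheBackend is hypothetical and not part of pyDVL; the bundled InMemoryCacheBackend below is the real in-memory implementation):

from typing import Any, Dict, Optional

from pydvl.utils.caching.base import CacheBackend


class DictCacheBackend(CacheBackend):
    """Toy dictionary-backed cache, for illustration only."""

    def __init__(self) -> None:
        super().__init__()  # initializes self.stats
        self._store: Dict[str, Any] = {}

    def get(self, key: str) -> Optional[Any]:
        value = self._store.get(key)
        if value is None:
            self.stats.misses += 1
        else:
            self.stats.hits += 1
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = value
        self.stats.sets += 1

    def clear(self) -> None:
        self._store.clear()

    def combine_hashes(self, *args: str) -> str:
        return ":".join(args)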

CachedFunc

CachedFunc(
    func: Callable[..., float],
    *,
    cache_backend: CacheBackend,
    config: Optional[CachedFuncConfig] = None,
)

Caches callable function results with a provided cache backend.

Wraps a callable to cache its results using a provided instance of a subclass of CacheBackend.

This class is heavily inspired by joblib.memory.MemorizedFunc.

It caches calls to the wrapped callable by generating a hash key based on the wrapped callable's code, the arguments passed to it, and the optional hash_prefix.

Warning

This class only works with hashable arguments to the wrapped callable.

PARAMETER DESCRIPTION
func

Callable to wrap.

TYPE: Callable[..., float]

cache_backend

Instance of CacheBackend that handles setting and getting values.

TYPE: CacheBackend

config

Configuration for wrapped function.

TYPE: Optional[CachedFuncConfig] DEFAULT: None

Source code in src/pydvl/utils/caching/base.py
def __init__(
    self,
    func: Callable[..., float],
    *,
    cache_backend: CacheBackend,
    config: Optional[CachedFuncConfig] = None,
) -> None:
    self.func = func
    self.cache_backend = cache_backend
    if config is None:
        config = CachedFuncConfig()
    self.config = config

    self.__doc__ = f"A wrapper around {func.__name__}() with caching enabled.\n" + (
        CachedFunc.__doc__ or ""
    )
    self.__name__ = f"cached_{func.__name__}"
    path = list(reversed(func.__qualname__.split(".")))
    patched = [f"cached_{path[0]}"] + path[1:]
    self.__qualname__ = ".".join(reversed(patched))

stats property

stats: CacheStats

Cache backend statistics.

__call__

__call__(*args, **kwargs) -> float

Call the wrapped cached function.

Executes the wrapped function, caching and returning the result.

Source code in src/pydvl/utils/caching/base.py
def __call__(self, *args, **kwargs) -> float:
    """Call the wrapped cached function.

    Executes the wrapped function, caching and returning the result.
    """
    return self._cached_call(args, kwargs)
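A small usage sketch (the function square is made up for illustration): CachedFunc is normally created through a backend's wrap() method, but it can also be constructed directly:

>>> from pydvl.utils.caching.base import CachedFunc
>>> from pydvl.utils.caching.memory import InMemoryCacheBackend
>>> def square(x: int) -> float:
...     return float(x * x)
...
>>> cached_square = CachedFunc(square, cache_backend=InMemoryCacheBackend())
>>> cached_square(3)
9.0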

DiskCacheBackend

DiskCacheBackend(cache_dir: Optional[Union[PathLike, str]] = None)

Bases: CacheBackend

Disk cache backend that stores results in files.

Implements the CacheBackend interface for a disk-based cache. Stores cache entries as pickled files on disk, keyed by cache key. This allows sharing evaluations across processes in a single node/computer.

PARAMETER DESCRIPTION
cache_dir

Base directory for cache storage.

TYPE: Optional[Union[PathLike, str]] DEFAULT: None

ATTRIBUTE DESCRIPTION
cache_dir

Base directory for cache storage.

Example

Basic usage:

>>> from pydvl.utils.caching.disk import DiskCacheBackend
>>> cache_backend = DiskCacheBackend()
>>> cache_backend.clear()
>>> value = 42
>>> cache_backend.set("key", value)
>>> cache_backend.get("key")
42

Callable wrapping:

>>> from pydvl.utils.caching.disk import DiskCacheBackend
>>> cache_backend = DiskCacheBackend()
>>> cache_backend.clear()
>>> value = 42
>>> def foo(x: int):
...     return x + 1
...
>>> wrapped_foo = cache_backend.wrap(foo)
>>> wrapped_foo(value)
43
>>> wrapped_foo.stats.misses
1
>>> wrapped_foo.stats.hits
0
>>> wrapped_foo(value)
43
>>> wrapped_foo.stats.misses
1
>>> wrapped_foo.stats.hits
1

PARAMETER DESCRIPTION
cache_dir

Base directory for cache storage. If not provided, this defaults to a newly created temporary directory.

TYPE: Optional[Union[PathLike, str]] DEFAULT: None

Source code in src/pydvl/utils/caching/disk.py
def __init__(
    self,
    cache_dir: Optional[Union[os.PathLike, str]] = None,
) -> None:
    """Initialize the disk cache backend.

    Args:
        cache_dir: Base directory for cache storage.
            If not provided, this defaults to a newly created
            temporary directory.
    """
    super().__init__()
    if cache_dir is None:
        cache_dir = tempfile.mkdtemp(prefix="pydvl")
    self.cache_dir = Path(cache_dir)
    self.cache_dir.mkdir(exist_ok=True, parents=True)

wrap

wrap(
    func: Callable, *, config: Optional[CachedFuncConfig] = None
) -> CachedFunc

Wraps a function to cache its results.

PARAMETER DESCRIPTION
func

The function to wrap.

TYPE: Callable

config

Optional caching options for the wrapped function.

TYPE: Optional[CachedFuncConfig] DEFAULT: None

RETURNS DESCRIPTION
CachedFunc

The wrapped cached function.

Source code in src/pydvl/utils/caching/base.py
def wrap(
    self,
    func: Callable,
    *,
    config: Optional[CachedFuncConfig] = None,
) -> "CachedFunc":
    """Wraps a function to cache its results.

    Args:
        func: The function to wrap.
        config: Optional caching options for the wrapped function.

    Returns:
        The wrapped cached function.
    """
    return CachedFunc(
        func,
        cache_backend=self,
        config=config,
    )

get

get(key: str) -> Optional[Any]

Get a value from the cache.

PARAMETER DESCRIPTION
key

Cache key.

TYPE: str

RETURNS DESCRIPTION
Optional[Any]

Cached value or None if not found.

Source code in src/pydvl/utils/caching/disk.py
def get(self, key: str) -> Optional[Any]:
    """Get a value from the cache.

    Args:
        key: Cache key.

    Returns:
        Cached value or None if not found.
    """
    cache_file = self.cache_dir / key
    if not cache_file.exists():
        self.stats.misses += 1
        return None
    self.stats.hits += 1
    with cache_file.open("rb") as f:
        return cloudpickle.load(f)

set

set(key: str, value: Any) -> None

Set a value in the cache.

PARAMETER DESCRIPTION
key

Cache key.

TYPE: str

value

Value to cache.

TYPE: Any

Source code in src/pydvl/utils/caching/disk.py
def set(self, key: str, value: Any) -> None:
    """Set a value in the cache.

    Args:
        key: Cache key.
        value: Value to cache.
    """
    cache_file = self.cache_dir / key
    self.stats.sets += 1
    with cache_file.open("wb") as f:
        cloudpickle.dump(value, f, protocol=PICKLE_VERSION)

clear

clear() -> None

Deletes cache directory and recreates it.

Source code in src/pydvl/utils/caching/disk.py
def clear(self) -> None:
    """Deletes cache directory and recreates it."""
    shutil.rmtree(self.cache_dir)
    self.cache_dir.mkdir(exist_ok=True, parents=True)

combine_hashes

combine_hashes(*args: str) -> str

Join cache key components.

Source code in src/pydvl/utils/caching/disk.py
def combine_hashes(self, *args: str) -> str:
    """Join cache key components."""
    return os.pathsep.join(args)

InMemoryCacheBackend

InMemoryCacheBackend()

Bases: CacheBackend

In-memory cache backend that stores results in a dictionary.

Implements the CacheBackend interface for an in-memory-based cache. Stores cache entries as values in a dictionary, keyed by cache key. This allows sharing evaluations across threads in a single process.

The implementation is not thread-safe.

ATTRIBUTE DESCRIPTION
cached_values

Dictionary used to store cached values.

TYPE: Dict[str, Any]

Example

Basic usage:

>>> from pydvl.utils.caching.memory import InMemoryCacheBackend
>>> cache_backend = InMemoryCacheBackend()
>>> cache_backend.clear()
>>> value = 42
>>> cache_backend.set("key", value)
>>> cache_backend.get("key")
42

Callable wrapping:

>>> from pydvl.utils.caching.memory import InMemoryCacheBackend
>>> cache_backend = InMemoryCacheBackend()
>>> cache_backend.clear()
>>> value = 42
>>> def foo(x: int):
...     return x + 1
...
>>> wrapped_foo = cache_backend.wrap(foo)
>>> wrapped_foo(value)
43
>>> wrapped_foo.stats.misses
1
>>> wrapped_foo.stats.hits
0
>>> wrapped_foo(value)
43
>>> wrapped_foo.stats.misses
1
>>> wrapped_foo.stats.hits
1

Source code in src/pydvl/utils/caching/memory.py
def __init__(self) -> None:
    """Initialize the in-memory cache backend."""
    super().__init__()
    self.cached_values: Dict[str, Any] = {}

wrap

wrap(
    func: Callable, *, config: Optional[CachedFuncConfig] = None
) -> CachedFunc

Wraps a function to cache its results.

PARAMETER DESCRIPTION
func

The function to wrap.

TYPE: Callable

config

Optional caching options for the wrapped function.

TYPE: Optional[CachedFuncConfig] DEFAULT: None

RETURNS DESCRIPTION
CachedFunc

The wrapped cached function.

Source code in src/pydvl/utils/caching/base.py
def wrap(
    self,
    func: Callable,
    *,
    config: Optional[CachedFuncConfig] = None,
) -> "CachedFunc":
    """Wraps a function to cache its results.

    Args:
        func: The function to wrap.
        config: Optional caching options for the wrapped function.

    Returns:
        The wrapped cached function.
    """
    return CachedFunc(
        func,
        cache_backend=self,
        config=config,
    )

get

get(key: str) -> Optional[Any]

Get a value from the cache.

PARAMETER DESCRIPTION
key

Cache key.

TYPE: str

RETURNS DESCRIPTION
Optional[Any]

Cached value or None if not found.

Source code in src/pydvl/utils/caching/memory.py
def get(self, key: str) -> Optional[Any]:
    """Get a value from the cache.

    Args:
        key: Cache key.

    Returns:
        Cached value or None if not found.
    """
    value = self.cached_values.get(key, None)
    if value is not None:
        self.stats.hits += 1
    else:
        self.stats.misses += 1
    return value

set

set(key: str, value: Any) -> None

Set a value in the cache.

PARAMETER DESCRIPTION
key

Cache key.

TYPE: str

value

Value to cache.

TYPE: Any

Source code in src/pydvl/utils/caching/memory.py
def set(self, key: str, value: Any) -> None:
    """Set a value in the cache.

    Args:
        key: Cache key.
        value: Value to cache.
    """
    self.cached_values[key] = value
    self.stats.sets += 1

clear

clear() -> None

Deletes cache dictionary and recreates it.

Source code in src/pydvl/utils/caching/memory.py
def clear(self) -> None:
    """Deletes cache dictionary and recreates it."""
    del self.cached_values
    self.cached_values = {}

combine_hashes

combine_hashes(*args: str) -> str

Join cache key components.

Source code in src/pydvl/utils/caching/memory.py
def combine_hashes(self, *args: str) -> str:
    """Join cache key components."""
    return os.pathsep.join(args)

MemcachedClientConfig dataclass

MemcachedClientConfig(
    server: Tuple[str, int] = ("localhost", 11211),
    connect_timeout: float = 1.0,
    timeout: float = 1.0,
    no_delay: bool = True,
    serde: PickleSerde = PickleSerde(pickle_version=PICKLE_VERSION),
)

Configuration of the memcached client.

PARAMETER DESCRIPTION
server

A tuple (host, port), where host is an IP address or domain name.

TYPE: Tuple[str, int] DEFAULT: ('localhost', 11211)

connect_timeout

How many seconds to wait before raising ConnectionRefusedError on failure to connect.

TYPE: float DEFAULT: 1.0

timeout

Duration in seconds to wait for send or recv calls on the socket connected to memcached.

TYPE: float DEFAULT: 1.0

no_delay

If True, set the TCP_NODELAY flag, which may help with performance in some cases.

TYPE: bool DEFAULT: True

serde

Serializer / Deserializer ("serde"). The default PickleSerde should work in most cases. See pymemcache.client.base.Client for details.

TYPE: PickleSerde DEFAULT: PickleSerde(pickle_version=PICKLE_VERSION)
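A configuration sketch with non-default settings; note that actually connecting with MemcachedCacheBackend (below) requires a running memcached server at the given address:

>>> from pydvl.utils.caching.memcached import MemcachedClientConfig
>>> config = MemcachedClientConfig(server=("localhost", 11211), timeout=2.0)
>>> config.no_delay
True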

MemcachedCacheBackend

MemcachedCacheBackend(config: MemcachedClientConfig = MemcachedClientConfig())

Bases: CacheBackend

Memcached cache backend for the distributed caching of functions.

Implements the CacheBackend interface for a memcached-based cache. This allows sharing evaluations across processes and nodes in a cluster. You can run memcached as a service, locally or remotely; see the caching documentation.

PARAMETER DESCRIPTION
config

Memcached client configuration.

TYPE: MemcachedClientConfig DEFAULT: MemcachedClientConfig()

ATTRIBUTE DESCRIPTION
config

Memcached client configuration.

client

Memcached client instance.

Example

Basic usage:

>>> from pydvl.utils.caching.memcached import MemcachedCacheBackend
>>> cache_backend = MemcachedCacheBackend()
>>> cache_backend.clear()
>>> value = 42
>>> cache_backend.set("key", value)
>>> cache_backend.get("key")
42

Callable wrapping:

>>> from pydvl.utils.caching.memcached import MemcachedCacheBackend
>>> cache_backend = MemcachedCacheBackend()
>>> cache_backend.clear()
>>> value = 42
>>> def foo(x: int):
...     return x + 1
...
>>> wrapped_foo = cache_backend.wrap(foo)
>>> wrapped_foo(value)
43
>>> wrapped_foo.stats.misses
1
>>> wrapped_foo.stats.hits
0
>>> wrapped_foo(value)
43
>>> wrapped_foo.stats.misses
1
>>> wrapped_foo.stats.hits
1

PARAMETER DESCRIPTION
config

Memcached client configuration.

TYPE: MemcachedClientConfig DEFAULT: MemcachedClientConfig()

Source code in src/pydvl/utils/caching/memcached.py
def __init__(self, config: MemcachedClientConfig = MemcachedClientConfig()) -> None:
    """Initialize memcached cache backend.

    Args:
        config: Memcached client configuration.
    """

    super().__init__()
    self.config = config
    self.client = self._connect(self.config)

wrap

wrap(
    func: Callable, *, config: Optional[CachedFuncConfig] = None
) -> CachedFunc

Wraps a function to cache its results.

PARAMETER DESCRIPTION
func

The function to wrap.

TYPE: Callable

config

Optional caching options for the wrapped function.

TYPE: Optional[CachedFuncConfig] DEFAULT: None

RETURNS DESCRIPTION
CachedFunc

The wrapped cached function.

Source code in src/pydvl/utils/caching/base.py
def wrap(
    self,
    func: Callable,
    *,
    config: Optional[CachedFuncConfig] = None,
) -> "CachedFunc":
    """Wraps a function to cache its results.

    Args:
        func: The function to wrap.
        config: Optional caching options for the wrapped function.

    Returns:
        The wrapped cached function.
    """
    return CachedFunc(
        func,
        cache_backend=self,
        config=config,
    )

get

get(key: str) -> Optional[Any]

Get value from memcached.

PARAMETER DESCRIPTION
key

Cache key.

TYPE: str

RETURNS DESCRIPTION
Optional[Any]

Cached value or None if not found or client disconnected.

Source code in src/pydvl/utils/caching/memcached.py
def get(self, key: str) -> Optional[Any]:
    """Get value from memcached.

    Args:
        key: Cache key.

    Returns:
        Cached value or None if not found or client disconnected.
    """
    result = None
    try:
        result = self.client.get(key)
    except socket.timeout as e:
        self.stats.timeouts += 1
        warnings.warn(f"{type(self).__name__}: {str(e)}", RuntimeWarning)
    except OSError as e:
        self.stats.errors += 1
        warnings.warn(f"{type(self).__name__}: {str(e)}", RuntimeWarning)
    except AttributeError as e:
        # FIXME: this depends on _recv() failing on invalid sockets
        # See pymemcache.base.py,
        self.stats.reconnects += 1
        warnings.warn(f"{type(self).__name__}: {str(e)}", RuntimeWarning)
        self.client = self._connect(self.config)
    if result is None:
        self.stats.misses += 1
    else:
        self.stats.hits += 1
    return result

set

set(key: str, value: Any) -> None

Set value in memcached.

PARAMETER DESCRIPTION
key

Cache key.

TYPE: str

value

Value to cache.

TYPE: Any

Source code in src/pydvl/utils/caching/memcached.py
def set(self, key: str, value: Any) -> None:
    """Set value in memcached.

    Args:
        key: Cache key.
        value: Value to cache.
    """
    self.client.set(key, value, noreply=True)
    self.stats.sets += 1

clear

clear() -> None

Flush all values from memcached.

Source code in src/pydvl/utils/caching/memcached.py
def clear(self) -> None:
    """Flush all values from memcached."""
    self.client.flush_all(noreply=True)

combine_hashes

combine_hashes(*args: str) -> str

Join cache key components for Memcached.

Source code in src/pydvl/utils/caching/memcached.py
def combine_hashes(self, *args: str) -> str:
    """Join cache key components for Memcached."""
    return ":".join(args)

__getstate__

__getstate__() -> Dict

Enables pickling after a socket has been opened to the memcached server, by removing the client from the stored data.

Source code in src/pydvl/utils/caching/memcached.py
def __getstate__(self) -> Dict:
    """Enables pickling after a socket has been opened to the
    memcached server, by removing the client from the stored
    data."""
    odict = self.__dict__.copy()
    del odict["client"]
    return odict

__setstate__

__setstate__(d: Dict)

Restores a client connection after loading from a pickle.

Source code in src/pydvl/utils/caching/memcached.py
def __setstate__(self, d: Dict):
    """Restores a client connection after loading from a pickle."""
    self.config = d["config"]
    self.stats = d["stats"]
    self.client = self._connect(self.config)

CachedFuncConfig dataclass

CachedFuncConfig(
    hash_prefix: Optional[str] = None,
    ignore_args: Collection[str] = list(),
    time_threshold: float = 0.3,
    allow_repeated_evaluations: bool = False,
    rtol_stderr: float = 0.1,
    min_repetitions: int = 3,
)

Configuration for cached functions and methods, providing memoization of function calls.

Instances of this class are typically used as arguments for the construction of a Utility.

PARAMETER DESCRIPTION
hash_prefix

Optional string prefix that will be prepended to the cache key. This can be provided in order to guarantee cache reuse across runs.

TYPE: Optional[str] DEFAULT: None

ignore_args

Keyword arguments to ignore when hashing the wrapped function to compute the cache key. This allows sharing the cache among different jobs for the same experiment run if the callable happens to have "nuisance" parameters like job_id which do not affect the result of the computation.

TYPE: Collection[str] DEFAULT: list()

time_threshold

Computations taking less time than this many seconds are not cached. A value of 0 means that results are always cached.

TYPE: float DEFAULT: 0.3

allow_repeated_evaluations

If True, repeated calls to a function with the same arguments will be allowed and outputs averaged until the running standard deviation of the mean stabilizes below rtol_stderr * mean.

TYPE: bool DEFAULT: False

rtol_stderr

Relative tolerance for repeated evaluations. More precisely, the caching layer will stop evaluating the function once the standard deviation of the mean is smaller than rtol_stderr * mean.

TYPE: float DEFAULT: 0.1

min_repetitions

Minimum number of times that a function evaluation on the same arguments is repeated before returning cached values. Useful only for stochastic functions. If model training is very noisy, set this to a higher value to reduce variance.

TYPE: int DEFAULT: 3
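A usage sketch, assuming CachedFuncConfig and InMemoryCacheBackend can be imported from pydvl.utils.caching, and using a made-up function train_and_score with a nuisance job_id parameter:

>>> from pydvl.utils.caching import CachedFuncConfig, InMemoryCacheBackend
>>> config = CachedFuncConfig(time_threshold=0.0, ignore_args=["job_id"])
>>> backend = InMemoryCacheBackend()
>>> def train_and_score(seed: int, job_id: int = 0) -> float:
...     return 0.9  # stand-in for an expensive, cacheable computation
...
>>> cached = backend.wrap(train_and_score, config=config)
>>> cached(42, job_id=1)
0.9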

ParallelConfig dataclass

ParallelConfig(
    backend: Literal["joblib", "ray"] = "joblib",
    address: Optional[Union[str, Tuple[str, int]]] = None,
    n_cpus_local: Optional[int] = None,
    logging_level: Optional[int] = None,
    wait_timeout: float = 1.0,
)

Configuration for parallel computation backend.

PARAMETER DESCRIPTION
backend

Type of backend to use. Defaults to 'joblib'

TYPE: Literal['joblib', 'ray'] DEFAULT: 'joblib'

address

(DEPRECATED) Address of existing remote or local cluster to use.

TYPE: Optional[Union[str, Tuple[str, int]]] DEFAULT: None

n_cpus_local

(DEPRECATED) Number of CPUs to use when creating a local ray cluster. This has no effect when using an existing ray cluster.

TYPE: Optional[int] DEFAULT: None

logging_level

(DEPRECATED) Logging level for the parallel backend's worker.

TYPE: Optional[int] DEFAULT: None

wait_timeout

(DEPRECATED) Timeout in seconds for waiting on futures.

TYPE: float DEFAULT: 1.0
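A minimal sketch; only the backend field is typically needed, since the remaining parameters are deprecated. This assumes ParallelConfig is importable from pydvl.utils:

>>> from pydvl.utils import ParallelConfig
>>> parallel_config = ParallelConfig(backend="joblib")
>>> parallel_config.backend
'joblib'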

Dataset

Dataset(
    x_train: Union[NDArray, DataFrame],
    y_train: Union[NDArray, DataFrame],
    x_test: Union[NDArray, DataFrame],
    y_test: Union[NDArray, DataFrame],
    feature_names: Optional[Sequence[str]] = None,
    target_names: Optional[Sequence[str]] = None,
    data_names: Optional[Sequence[str]] = None,
    description: Optional[str] = None,
    is_multi_output: bool = False,
)

A convenience class to handle datasets.

It holds a dataset split into training and test data, together with metadata: feature names, data point names, and a textual description.

PARAMETER DESCRIPTION
x_train

training data

TYPE: Union[NDArray, DataFrame]

y_train

labels for training data

TYPE: Union[NDArray, DataFrame]

x_test

test data

TYPE: Union[NDArray, DataFrame]

y_test

labels for test data

TYPE: Union[NDArray, DataFrame]

feature_names

name of the features of input data

TYPE: Optional[Sequence[str]] DEFAULT: None

target_names

names of the features of target data

TYPE: Optional[Sequence[str]] DEFAULT: None

data_names

names assigned to data points. For example, if the dataset is a time series, each entry can be a timestamp which can be referenced directly instead of using a row number.

TYPE: Optional[Sequence[str]] DEFAULT: None

description

A textual description of the dataset.

TYPE: Optional[str] DEFAULT: None

is_multi_output

set to False if labels are scalars, or to True if they are vectors of dimension > 1.

TYPE: bool DEFAULT: False

Source code in src/pydvl/utils/dataset.py
def __init__(
    self,
    x_train: Union[NDArray, pd.DataFrame],
    y_train: Union[NDArray, pd.DataFrame],
    x_test: Union[NDArray, pd.DataFrame],
    y_test: Union[NDArray, pd.DataFrame],
    feature_names: Optional[Sequence[str]] = None,
    target_names: Optional[Sequence[str]] = None,
    data_names: Optional[Sequence[str]] = None,
    description: Optional[str] = None,
    # FIXME: use same parameter name as in check_X_y()
    is_multi_output: bool = False,
):
    """Constructs a Dataset from data and labels.

    Args:
        x_train: training data
        y_train: labels for training data
        x_test: test data
        y_test: labels for test data
        feature_names: name of the features of input data
        target_names: names of the features of target data
        data_names: names assigned to data points.
            For example, if the dataset is a time series, each entry can be a
            timestamp which can be referenced directly instead of using a row
            number.
        description: A textual description of the dataset.
        is_multi_output: set to `False` if labels are scalars, or to
            `True` if they are vectors of dimension > 1.
    """
    self.x_train, self.y_train = check_X_y(
        x_train, y_train, multi_output=is_multi_output
    )
    self.x_test, self.y_test = check_X_y(
        x_test, y_test, multi_output=is_multi_output
    )

    if x_train.shape[-1] != x_test.shape[-1]:
        raise ValueError(
            f"Mismatching number of features: "
            f"{x_train.shape[-1]} and {x_test.shape[-1]}"
        )
    if x_train.shape[0] != y_train.shape[0]:
        raise ValueError(
            f"Mismatching number of samples: "
            f"{x_train.shape[0]} and {y_train.shape[0]}"
        )
    if x_test.shape[0] != y_test.shape[0]:
        raise ValueError(
            f"Mismatching number of samples: "
            f"{x_test.shape[0]} and {y_test.shape[0]}"
        )

    def make_names(s: str, a: np.ndarray) -> List[str]:
        n = a.shape[1] if len(a.shape) > 1 else 1
        return [f"{s}{i:0{1 + int(np.log10(n))}d}" for i in range(1, n + 1)]

    self.feature_names = feature_names
    self.target_names = target_names

    if self.feature_names is None:
        if isinstance(x_train, pd.DataFrame):
            self.feature_names = x_train.columns.tolist()
        else:
            self.feature_names = make_names("x", x_train)

    if self.target_names is None:
        if isinstance(y_train, pd.DataFrame):
            self.target_names = y_train.columns.tolist()
        else:
            self.target_names = make_names("y", y_train)

    if len(self.x_train.shape) > 1:
        if (
            len(self.feature_names) != self.x_train.shape[-1]
            or len(self.feature_names) != self.x_test.shape[-1]
        ):
            raise ValueError("Mismatching number of features and names")
    if len(self.y_train.shape) > 1:
        if (
            len(self.target_names) != self.y_train.shape[-1]
            or len(self.target_names) != self.y_test.shape[-1]
        ):
            raise ValueError("Mismatching number of targets and names")

    self.description = description or "No description"
    self._indices: NDArray[np.int_] = np.arange(len(self.x_train), dtype=np.int_)
    self._data_names: NDArray[np.object_] = (
        np.array(data_names, dtype=object)
        if data_names is not None
        else self._indices.astype(object)
    )
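A direct-construction sketch with toy arrays (in practice the from_sklearn and from_arrays constructors below are usually more convenient):

>>> import numpy as np
>>> from pydvl.utils import Dataset
>>> x_train = np.arange(16, dtype=float).reshape(8, 2)
>>> y_train = np.zeros(8)
>>> x_test = np.arange(8, dtype=float).reshape(4, 2)
>>> y_test = np.zeros(4)
>>> dataset = Dataset(x_train, y_train, x_test, y_test, description="toy data")
>>> len(dataset.indices)
8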

indices property

indices: NDArray[int_]

Index of positions in data.x_train.

Contiguous integers from 0 to len(Dataset).

data_names property

data_names: NDArray[object_]

Names of each individual datapoint.

Used for reporting Shapley values.

dim property

dim: int

Returns the number of dimensions of a sample.

get_training_data

get_training_data(
    indices: Optional[Iterable[int]] = None,
) -> Tuple[NDArray, NDArray]

Given a set of indices, returns the training data corresponding to those indices.

This is used mainly by Utility to retrieve subsets of the data from indices. It is typically not needed in algorithms.

PARAMETER DESCRIPTION
indices

Optional indices that will be used to select points from the training data. If None, the entire training data will be returned.

TYPE: Optional[Iterable[int]] DEFAULT: None

RETURNS DESCRIPTION
Tuple[NDArray, NDArray]

If indices is not None, the selected x and y arrays from the training data. Otherwise, the entire dataset.

Source code in src/pydvl/utils/dataset.py
def get_training_data(
    self, indices: Optional[Iterable[int]] = None
) -> Tuple[NDArray, NDArray]:
    """Given a set of indices, returns the training data that refer to those
    indices.

    This is used mainly by [Utility][pydvl.utils.utility.Utility] to retrieve
    subsets of the data from indices. It is typically **not needed in
    algorithms**.

    Args:
        indices: Optional indices that will be used to select points from
            the training data. If `None`, the entire training data will be
            returned.

    Returns:
        If `indices` is not `None`, the selected x and y arrays from the
            training data. Otherwise, the entire dataset.
    """
    if indices is None:
        return self.x_train, self.y_train
    x = self.x_train[indices]
    y = self.y_train[indices]
    return x, y
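A short sketch showing how a subset of the training data is selected via indices:

>>> import numpy as np
>>> from pydvl.utils import Dataset
>>> X = np.arange(20, dtype=float).reshape(10, 2)
>>> y = np.arange(10, dtype=float)
>>> dataset = Dataset.from_arrays(X, y, train_size=0.5, random_state=0)
>>> x_sub, y_sub = dataset.get_training_data(dataset.indices[:2])
>>> x_sub.shape
(2, 2)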

get_test_data

get_test_data(
    indices: Optional[Iterable[int]] = None,
) -> Tuple[NDArray, NDArray]

Returns the entire test set regardless of the passed indices.

The passed indices will not be used because for data valuation we generally want to score the trained model on the entire test data.

Additionally, given the way this method is used in the Utility class, the passed indices would be those of the training data and would not apply to the test data.

There may be cases where it is desired to use parts of the test data. In those cases, it is recommended to inherit from Dataset and override get_test_data().

For example, the following snippet shows how one could go about mapping the training data indices into test data indices inside get_test_data():

Example
>>> from pydvl.utils import Dataset
>>> import numpy as np
>>> class DatasetWithTestDataIndices(Dataset):
...    def get_test_data(self, indices=None):
...        if indices is None:
...            return self.x_test, self.y_test
...        fraction = len(list(indices)) / len(self)
...        mapped_indices = len(self.x_test) / len(self) * np.asarray(indices)
...        mapped_indices = np.unique(mapped_indices.astype(int))
...        return self.x_test[mapped_indices], self.y_test[mapped_indices]
...
>>> X = np.random.rand(100, 10)
>>> y = np.random.randint(0, 2, 100)
>>> dataset = DatasetWithTestDataIndices.from_arrays(X, y)
>>> indices = np.random.choice(dataset.indices, 30, replace=False)
>>> _ = dataset.get_training_data(indices)
>>> _ = dataset.get_test_data(indices)
PARAMETER DESCRIPTION
indices

Optional indices into the test data. This argument is unused and is only kept for compatibility with get_training_data().

TYPE: Optional[Iterable[int]] DEFAULT: None

RETURNS DESCRIPTION
Tuple[NDArray, NDArray]

The entire test data.

Source code in src/pydvl/utils/dataset.py
def get_test_data(
    self, indices: Optional[Iterable[int]] = None
) -> Tuple[NDArray, NDArray]:
    """Returns the entire test set regardless of the passed indices.

    The passed indices will not be used because for data valuation
    we generally want to score the trained model on the entire test data.

    Additionally, the way this method is used in the
    [Utility][pydvl.utils.utility.Utility] class, the passed indices will
    be those of the training data and would not work on the test data.

    There may be cases where it is desired to use parts of the test data.
    In those cases, it is recommended to inherit from
    [Dataset][pydvl.utils.dataset.Dataset] and override
    [get_test_data()][pydvl.utils.dataset.Dataset.get_test_data].

    For example, the following snippet shows how one could go about
    mapping the training data indices into test data indices
    inside [get_test_data()][pydvl.utils.dataset.Dataset.get_test_data]:

    ??? Example
        ```pycon
        >>> from pydvl.utils import Dataset
        >>> import numpy as np
        >>> class DatasetWithTestDataIndices(Dataset):
        ...    def get_test_data(self, indices=None):
        ...        if indices is None:
        ...            return self.x_test, self.y_test
        ...        fraction = len(list(indices)) / len(self)
        ...        mapped_indices = len(self.x_test) / len(self) * np.asarray(indices)
        ...        mapped_indices = np.unique(mapped_indices.astype(int))
        ...        return self.x_test[mapped_indices], self.y_test[mapped_indices]
        ...
        >>> X = np.random.rand(100, 10)
        >>> y = np.random.randint(0, 2, 100)
        >>> dataset = DatasetWithTestDataIndices.from_arrays(X, y)
        >>> indices = np.random.choice(dataset.indices, 30, replace=False)
        >>> _ = dataset.get_training_data(indices)
        >>> _ = dataset.get_test_data(indices)
        ```

    Args:
        indices: Optional indices into the test data. This argument is
            unused and is only kept for compatibility with
            [get_training_data()][pydvl.utils.dataset.Dataset.get_training_data].

    Returns:
        The entire test data.
    """
    return self.x_test, self.y_test

from_sklearn classmethod

from_sklearn(
    data: Bunch,
    train_size: float = 0.8,
    random_state: Optional[int] = None,
    stratify_by_target: bool = False,
    **kwargs: Any,
) -> Dataset

Constructs a Dataset object from a sklearn.utils.Bunch, as returned by the load_* functions in scikit-learn toy datasets.

Example
>>> from pydvl.utils import Dataset
>>> from sklearn.datasets import load_boston
>>> dataset = Dataset.from_sklearn(load_boston())
PARAMETER DESCRIPTION
data

scikit-learn Bunch object. The following attributes are supported:

  • data: covariates.
  • target: target variables (labels).
  • feature_names (optional): the feature names.
  • target_names (optional): the target names.
  • DESCR (optional): a description.

TYPE: Bunch

train_size

size of the training dataset. Used in train_test_split

TYPE: float DEFAULT: 0.8

random_state

seed for train / test split

TYPE: Optional[int] DEFAULT: None

stratify_by_target

If True, data is split in a stratified fashion, using the target variable as labels. Read more in scikit-learn's user guide.

TYPE: bool DEFAULT: False

kwargs

Additional keyword arguments to pass to the Dataset constructor. Use this to pass e.g. is_multi_output.

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
Dataset

Object with the sklearn dataset

Changed in version 0.6.0

Added kwargs to pass to the Dataset constructor.

Source code in src/pydvl/utils/dataset.py
@classmethod
def from_sklearn(
    cls,
    data: Bunch,
    train_size: float = 0.8,
    random_state: Optional[int] = None,
    stratify_by_target: bool = False,
    **kwargs: Any,
) -> "Dataset":
    """Constructs a [Dataset][pydvl.utils.Dataset] object from a
    [sklearn.utils.Bunch][], as returned by the `load_*`
    functions in [scikit-learn toy datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html).

    ??? Example
        ```pycon
        >>> from pydvl.utils import Dataset
        >>> from sklearn.datasets import load_boston
        >>> dataset = Dataset.from_sklearn(load_boston())
        ```

    Args:
        data: scikit-learn Bunch object. The following attributes are supported:

            - `data`: covariates.
            - `target`: target variables (labels).
            - `feature_names` (**optional**): the feature names.
            - `target_names` (**optional**): the target names.
            - `DESCR` (**optional**): a description.
        train_size: size of the training dataset. Used in `train_test_split`
        random_state: seed for train / test split
        stratify_by_target: If `True`, data is split in a stratified
            fashion, using the target variable as labels. Read more in
            [scikit-learn's user guide](https://scikit-learn.org/stable/modules/cross_validation.html#stratification).
        kwargs: Additional keyword arguments to pass to the
            [Dataset][pydvl.utils.Dataset] constructor. Use this to pass e.g. `is_multi_output`.

    Returns:
        Object with the sklearn dataset

    !!! tip "Changed in version 0.6.0"
        Added kwargs to pass to the [Dataset][pydvl.utils.Dataset] constructor.
    """
    x_train, x_test, y_train, y_test = train_test_split(
        data.data,
        data.target,
        train_size=train_size,
        random_state=random_state,
        stratify=data.target if stratify_by_target else None,
    )
    return cls(
        x_train,
        y_train,
        x_test,
        y_test,
        feature_names=data.get("feature_names"),
        target_names=data.get("target_names"),
        description=data.get("DESCR"),
        **kwargs,
    )

from_arrays classmethod

from_arrays(
    X: NDArray,
    y: NDArray,
    train_size: float = 0.8,
    random_state: Optional[int] = None,
    stratify_by_target: bool = False,
    **kwargs: Any,
) -> Dataset

Constructs a Dataset object from X and y numpy arrays as returned by the make_* functions in sklearn generated datasets.

Example
>>> from pydvl.utils import Dataset
>>> from sklearn.datasets import make_regression
>>> X, y = make_regression()
>>> dataset = Dataset.from_arrays(X, y)
PARAMETER DESCRIPTION
X

numpy array of shape (n_samples, n_features)

TYPE: NDArray

y

numpy array of shape (n_samples,)

TYPE: NDArray

train_size

size of the training dataset. Used in train_test_split

TYPE: float DEFAULT: 0.8

random_state

seed for train / test split

TYPE: Optional[int] DEFAULT: None

stratify_by_target

If True, data is split in a stratified fashion, using the y variable as labels. Read more in sklearn's user guide.

TYPE: bool DEFAULT: False

kwargs

Additional keyword arguments to pass to the Dataset constructor. Use this to pass e.g. feature_names or target_names.

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
Dataset

Object with the passed X and y arrays split across training and test sets.

New in version 0.4.0

Changed in version 0.6.0

Added kwargs to pass to the Dataset constructor.

Source code in src/pydvl/utils/dataset.py
@classmethod
def from_arrays(
    cls,
    X: NDArray,
    y: NDArray,
    train_size: float = 0.8,
    random_state: Optional[int] = None,
    stratify_by_target: bool = False,
    **kwargs: Any,
) -> "Dataset":
    """Constructs a [Dataset][pydvl.utils.Dataset] object from X and y numpy arrays  as
    returned by the `make_*` functions in [sklearn generated datasets](https://scikit-learn.org/stable/datasets/sample_generators.html).

    ??? Example
        ```pycon
        >>> from pydvl.utils import Dataset
        >>> from sklearn.datasets import make_regression
        >>> X, y = make_regression()
        >>> dataset = Dataset.from_arrays(X, y)
        ```

    Args:
        X: numpy array of shape (n_samples, n_features)
        y: numpy array of shape (n_samples,)
        train_size: size of the training dataset. Used in `train_test_split`
        random_state: seed for train / test split
        stratify_by_target: If `True`, data is split in a stratified fashion,
            using the y variable as labels. Read more in [sklearn's user
            guide](https://scikit-learn.org/stable/modules/cross_validation.html#stratification).
        kwargs: Additional keyword arguments to pass to the
            [Dataset][pydvl.utils.Dataset] constructor. Use this to pass e.g. `feature_names`
            or `target_names`.

    Returns:
        Object with the passed X and y arrays split across training and test sets.

    !!! tip "New in version 0.4.0"

    !!! tip "Changed in version 0.6.0"
        Added kwargs to pass to the [Dataset][pydvl.utils.Dataset] constructor.
    """
    x_train, x_test, y_train, y_test = train_test_split(
        X,
        y,
        train_size=train_size,
        random_state=random_state,
        stratify=y if stratify_by_target else None,
    )
    return cls(x_train, y_train, x_test, y_test, **kwargs)

GroupedDataset

GroupedDataset(
    x_train: NDArray,
    y_train: NDArray,
    x_test: NDArray,
    y_test: NDArray,
    data_groups: Sequence,
    feature_names: Optional[Sequence[str]] = None,
    target_names: Optional[Sequence[str]] = None,
    group_names: Optional[Sequence[str]] = None,
    description: Optional[str] = None,
    **kwargs: Any,
)

Bases: Dataset

Used for calculating Shapley values of subsets of the data considered as logical units. For instance, one can group by value of a categorical feature, by bin into which a continuous feature falls, or by label.

PARAMETER DESCRIPTION
x_train

training data

TYPE: NDArray

y_train

labels of training data

TYPE: NDArray

x_test

test data

TYPE: NDArray

y_test

labels of test data

TYPE: NDArray

data_groups

Iterable of the same length as x_train containing a group label for each training data point. The label can be of any type, e.g. str or int. Data points with the same label will then be grouped by this object and considered as one for effects of valuation.

TYPE: Sequence

feature_names

names of the covariates' features.

TYPE: Optional[Sequence[str]] DEFAULT: None

target_names

names of the labels or targets y

TYPE: Optional[Sequence[str]] DEFAULT: None

group_names

names of the groups. If not provided, the labels from data_groups will be used.

TYPE: Optional[Sequence[str]] DEFAULT: None

description

A textual description of the dataset

TYPE: Optional[str] DEFAULT: None

kwargs

Additional keyword arguments to pass to the Dataset constructor.

TYPE: Any DEFAULT: {}

Changed in version 0.6.0

Added group_names and forwarding of kwargs

Source code in src/pydvl/utils/dataset.py
def __init__(
    self,
    x_train: NDArray,
    y_train: NDArray,
    x_test: NDArray,
    y_test: NDArray,
    data_groups: Sequence,
    feature_names: Optional[Sequence[str]] = None,
    target_names: Optional[Sequence[str]] = None,
    group_names: Optional[Sequence[str]] = None,
    description: Optional[str] = None,
    **kwargs: Any,
):
    """Class for grouping datasets.

    Used for calculating Shapley values of subsets of the data considered
    as logical units. For instance, one can group by value of a categorical
    feature, by bin into which a continuous feature falls, or by label.

    Args:
        x_train: training data
        y_train: labels of training data
        x_test: test data
        y_test: labels of test data
        data_groups: Iterable of the same length as `x_train` containing
            a group label for each training data point. The label can be of any
            type, e.g. `str` or `int`. Data points with the same label will
            then be grouped by this object and considered as one for effects of
            valuation.
        feature_names: names of the covariates' features.
        target_names: names of the labels or targets y
        group_names: names of the groups. If not provided, the labels
            from `data_groups` will be used.
        description: A textual description of the dataset
        kwargs: Additional keyword arguments to pass to the
            [Dataset][pydvl.utils.Dataset] constructor.

    !!! tip "Changed in version 0.6.0"
    Added `group_names` and forwarding of `kwargs`
    """
    super().__init__(
        x_train=x_train,
        y_train=y_train,
        x_test=x_test,
        y_test=y_test,
        feature_names=feature_names,
        target_names=target_names,
        description=description,
        **kwargs,
    )

    if len(data_groups) != len(x_train):
        raise ValueError(
            f"data_groups and x_train must have the same length."
            f"Instead got {len(data_groups)=} and {len(x_train)=}"
        )

    self.groups: OrderedDict[Any, List[int]] = OrderedDict(
        {k: [] for k in set(data_groups)}
    )
    for idx, group in enumerate(data_groups):
        self.groups[group].append(idx)
    self.group_items = list(self.groups.items())
    self._indices = np.arange(len(self.groups.keys()))
    self._data_names = (
        np.array(group_names, dtype=object)
        if group_names is not None
        else np.array(list(self.groups.keys()), dtype=object)
    )
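A direct-construction sketch with toy arrays and string group labels (each label defines one logical unit for valuation):

>>> import numpy as np
>>> from pydvl.utils import GroupedDataset
>>> x_train = np.arange(12, dtype=float).reshape(6, 2)
>>> y_train = np.array([0, 0, 1, 1, 0, 1])
>>> x_test = np.arange(8, dtype=float).reshape(4, 2)
>>> y_test = np.array([0, 1, 0, 1])
>>> grouped = GroupedDataset(x_train, y_train, x_test, y_test, data_groups=["a", "a", "b", "b", "c", "c"])
>>> len(grouped.indices)
3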

dim property

dim: int

Returns the number of dimensions of a sample.

indices property

indices

Indices of the groups.

data_names property

data_names

Names of the groups.

get_test_data

get_test_data(
    indices: Optional[Iterable[int]] = None,
) -> Tuple[NDArray, NDArray]

Returns the entire test set regardless of the passed indices.

The passed indices will not be used because for data valuation we generally want to score the trained model on the entire test data.

Additionally, given the way this method is used in the Utility class, the passed indices would be those of the training data and would not apply to the test data.

There may be cases where it is desired to use parts of the test data. In those cases, it is recommended to inherit from Dataset and override get_test_data().

For example, the following snippet shows how one could go about mapping the training data indices into test data indices inside get_test_data():

Example
>>> from pydvl.utils import Dataset
>>> import numpy as np
>>> class DatasetWithTestDataIndices(Dataset):
...    def get_test_data(self, indices=None):
...        if indices is None:
...            return self.x_test, self.y_test
...        fraction = len(list(indices)) / len(self)
...        mapped_indices = len(self.x_test) / len(self) * np.asarray(indices)
...        mapped_indices = np.unique(mapped_indices.astype(int))
...        return self.x_test[mapped_indices], self.y_test[mapped_indices]
...
>>> X = np.random.rand(100, 10)
>>> y = np.random.randint(0, 2, 100)
>>> dataset = DatasetWithTestDataIndices.from_arrays(X, y)
>>> indices = np.random.choice(dataset.indices, 30, replace=False)
>>> _ = dataset.get_training_data(indices)
>>> _ = dataset.get_test_data(indices)
PARAMETER DESCRIPTION
indices

Optional indices into the test data. This argument is unused and is only kept for compatibility with get_training_data().

TYPE: Optional[Iterable[int]] DEFAULT: None

RETURNS DESCRIPTION
Tuple[NDArray, NDArray]

The entire test data.

Source code in src/pydvl/utils/dataset.py
def get_test_data(
    self, indices: Optional[Iterable[int]] = None
) -> Tuple[NDArray, NDArray]:
    """Returns the entire test set regardless of the passed indices.

    The passed indices will not be used because for data valuation
    we generally want to score the trained model on the entire test data.

    Additionally, the way this method is used in the
    [Utility][pydvl.utils.utility.Utility] class, the passed indices will
    be those of the training data and would not work on the test data.

    There may be cases where it is desired to use parts of the test data.
    In those cases, it is recommended to inherit from
    [Dataset][pydvl.utils.dataset.Dataset] and override
    [get_test_data()][pydvl.utils.dataset.Dataset.get_test_data].

    For example, the following snippet shows how one could go about
    mapping the training data indices into test data indices
    inside [get_test_data()][pydvl.utils.dataset.Dataset.get_test_data]:

    ??? Example
        ```pycon
        >>> from pydvl.utils import Dataset
        >>> import numpy as np
        >>> class DatasetWithTestDataIndices(Dataset):
        ...    def get_test_data(self, indices=None):
        ...        if indices is None:
        ...            return self.x_test, self.y_test
        ...        fraction = len(list(indices)) / len(self)
        ...        mapped_indices = len(self.x_test) / len(self) * np.asarray(indices)
        ...        mapped_indices = np.unique(mapped_indices.astype(int))
        ...        return self.x_test[mapped_indices], self.y_test[mapped_indices]
        ...
        >>> X = np.random.rand(100, 10)
        >>> y = np.random.randint(0, 2, 100)
        >>> dataset = DatasetWithTestDataIndices.from_arrays(X, y)
        >>> indices = np.random.choice(dataset.indices, 30, replace=False)
        >>> _ = dataset.get_training_data(indices)
        >>> _ = dataset.get_test_data(indices)
        ```

    Args:
        indices: Optional indices into the test data. This argument is
            unused and is only kept for compatibility with
            [get_training_data()][pydvl.utils.dataset.Dataset.get_training_data].

    Returns:
        The entire test data.
    """
    return self.x_test, self.y_test

get_training_data

get_training_data(
    indices: Optional[Iterable[int]] = None,
) -> Tuple[NDArray, NDArray]

Returns the data and labels of all samples in the given groups.

PARAMETER DESCRIPTION
indices

group indices whose elements to return. If None, all data from all groups are returned.

TYPE: Optional[Iterable[int]] DEFAULT: None

RETURNS DESCRIPTION
Tuple[NDArray, NDArray]

Tuple of training data x and labels y.

Source code in src/pydvl/utils/dataset.py
def get_training_data(
    self, indices: Optional[Iterable[int]] = None
) -> Tuple[NDArray, NDArray]:
    """Returns the data and labels of all samples in the given groups.

    Args:
        indices: group indices whose elements to return. If `None`,
            all data from all groups are returned.

    Returns:
        Tuple of training data x and labels y.
    """
    if indices is None:
        indices = self.indices
    data_indices = [
        idx for group_id in indices for idx in self.group_items[group_id][1]
    ]
    return super().get_training_data(data_indices)
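A sketch showing that group indices select all underlying training samples belonging to those groups:

>>> import numpy as np
>>> from pydvl.utils import Dataset, GroupedDataset
>>> dataset = Dataset.from_arrays(
...     X=np.arange(8, dtype=float).reshape(4, 2),
...     y=np.array([0.0, 1.0, 0.0, 1.0]),
...     train_size=0.5,
...     random_state=0,
... )
>>> grouped = GroupedDataset.from_dataset(dataset, data_groups=[0, 1])
>>> x_g, y_g = grouped.get_training_data([0])
>>> len(x_g)
1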

from_sklearn classmethod

from_sklearn(
    data: Bunch,
    train_size: float = 0.8,
    random_state: Optional[int] = None,
    stratify_by_target: bool = False,
    data_groups: Optional[Sequence] = None,
    **kwargs: Any,
) -> GroupedDataset

Constructs a GroupedDataset object from a sklearn.utils.Bunch as returned by the load_* functions in scikit-learn toy datasets and groups it.

Example
>>> from sklearn.datasets import load_iris
>>> from pydvl.utils import GroupedDataset
>>> iris = load_iris()
>>> data_groups = iris.data[:, 0] // 0.5
>>> dataset = GroupedDataset.from_sklearn(iris, data_groups=data_groups)
PARAMETER DESCRIPTION
data

scikit-learn Bunch object. The following attributes are supported:

  • data: covariates.
  • target: target variables (labels).
  • feature_names (optional): the feature names.
  • target_names (optional): the target names.
  • DESCR (optional): a description.

TYPE: Bunch

train_size

size of the training dataset. Used in train_test_split.

TYPE: float DEFAULT: 0.8

random_state

seed for train / test split.

TYPE: Optional[int] DEFAULT: None

stratify_by_target

If True, data is split in a stratified fashion, using the target variable as labels. Read more in sklearn's user guide.

TYPE: bool DEFAULT: False

data_groups

an array holding the group index or name for each data point. The length of this array must be equal to the number of data points in the dataset.

TYPE: Optional[Sequence] DEFAULT: None

kwargs

Additional keyword arguments to pass to the Dataset constructor.

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
GroupedDataset

Dataset with the selected sklearn data

Source code in src/pydvl/utils/dataset.py
@classmethod
def from_sklearn(
    cls,
    data: Bunch,
    train_size: float = 0.8,
    random_state: Optional[int] = None,
    stratify_by_target: bool = False,
    data_groups: Optional[Sequence] = None,
    **kwargs: Any,
) -> "GroupedDataset":
    """Constructs a [GroupedDataset][pydvl.utils.GroupedDataset] object from a
    [sklearn.utils.Bunch][sklearn.utils.Bunch] as returned by the `load_*` functions in
    [scikit-learn toy datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html) and groups
    it.

    ??? Example
        ```pycon
        >>> from sklearn.datasets import load_iris
        >>> from pydvl.utils import GroupedDataset
        >>> iris = load_iris()
        >>> data_groups = iris.data[:, 0] // 0.5
        >>> dataset = GroupedDataset.from_sklearn(iris, data_groups=data_groups)
        ```

    Args:
        data: scikit-learn Bunch object. The following attributes are supported:

            - `data`: covariates.
            - `target`: target variables (labels).
            - `feature_names` (**optional**): the feature names.
            - `target_names` (**optional**): the target names.
            - `DESCR` (**optional**): a description.
        train_size: size of the training dataset. Used in `train_test_split`.
        random_state: seed for train / test split.
        stratify_by_target: If `True`, data is split in a stratified
            fashion, using the target variable as labels. Read more in
            [sklearn's user guide](https://scikit-learn.org/stable/modules/cross_validation.html#stratification).
        data_groups: an array holding the group index or name for each
            data point. The length of this array must be equal to the number of
            data points in the dataset.
        kwargs: Additional keyword arguments to pass to the
            [Dataset][pydvl.utils.Dataset] constructor.

    Returns:
        Dataset with the selected sklearn data
    """
    if data_groups is None:
        raise ValueError(
            "data_groups must be provided when constructing a GroupedDataset"
        )

    x_train, x_test, y_train, y_test, data_groups_train, _ = train_test_split(
        data.data,
        data.target,
        data_groups,
        train_size=train_size,
        random_state=random_state,
        stratify=data.target if stratify_by_target else None,
    )

    dataset = Dataset(
        x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test, **kwargs
    )
    return cls.from_dataset(dataset, data_groups_train)  # type: ignore

from_arrays classmethod

from_arrays(
    X: NDArray,
    y: NDArray,
    train_size: float = 0.8,
    random_state: Optional[int] = None,
    stratify_by_target: bool = False,
    data_groups: Optional[Sequence] = None,
    **kwargs: Any,
) -> Dataset

Constructs a GroupedDataset object from X and y numpy arrays as returned by the make_* functions in scikit-learn generated datasets.

Example
>>> from sklearn.datasets import make_classification
>>> from pydvl.utils import GroupedDataset
>>> X, y = make_classification(
...     n_samples=100,
...     n_features=4,
...     n_informative=2,
...     n_redundant=0,
...     random_state=0,
...     shuffle=False
... )
>>> data_groups = X[:, 0] // 0.5
>>> dataset = GroupedDataset.from_arrays(X, y, data_groups=data_groups)
PARAMETER DESCRIPTION
X

array of shape (n_samples, n_features)

TYPE: NDArray

y

array of shape (n_samples,)

TYPE: NDArray

train_size

size of the training dataset. Used in train_test_split.

TYPE: float DEFAULT: 0.8

random_state

seed for train / test split.

TYPE: Optional[int] DEFAULT: None

stratify_by_target

If True, data is split in a stratified fashion, using the y variable as labels. Read more in sklearn's user guide.

TYPE: bool DEFAULT: False

data_groups

an array holding the group index or name for each data point. The length of this array must be equal to the number of data points in the dataset.

TYPE: Optional[Sequence] DEFAULT: None

kwargs

Additional keyword arguments that will be passed to the Dataset constructor.

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
Dataset

Dataset with the passed X and y arrays split across training and test sets.

New in version 0.4.0

Changed in version 0.6.0

Added kwargs to pass to the Dataset constructor.

Source code in src/pydvl/utils/dataset.py
@classmethod
def from_arrays(
    cls,
    X: NDArray,
    y: NDArray,
    train_size: float = 0.8,
    random_state: Optional[int] = None,
    stratify_by_target: bool = False,
    data_groups: Optional[Sequence] = None,
    **kwargs: Any,
) -> "Dataset":
    """Constructs a [GroupedDataset][pydvl.utils.GroupedDataset] object from X and y numpy arrays
    as returned by the `make_*` functions in
    [scikit-learn generated datasets](https://scikit-learn.org/stable/datasets/sample_generators.html).

    ??? Example
        ```pycon
        >>> from sklearn.datasets import make_classification
        >>> from pydvl.utils import GroupedDataset
        >>> X, y = make_classification(
        ...     n_samples=100,
        ...     n_features=4,
        ...     n_informative=2,
        ...     n_redundant=0,
        ...     random_state=0,
        ...     shuffle=False
        ... )
        >>> data_groups = X[:, 0] // 0.5
        >>> dataset = GroupedDataset.from_arrays(X, y, data_groups=data_groups)
        ```

    Args:
        X: array of shape (n_samples, n_features)
        y: array of shape (n_samples,)
        train_size: size of the training dataset. Used in `train_test_split`.
        random_state: seed for train / test split.
        stratify_by_target: If `True`, data is split in a stratified
            fashion, using the y variable as labels. Read more in
            [sklearn's user guide](https://scikit-learn.org/stable/modules/cross_validation.html#stratification).
        data_groups: an array holding the group index or name for each data
            point. The length of this array must be equal to the number of
            data points in the dataset.
        kwargs: Additional keyword arguments that will be passed to the
            [Dataset][pydvl.utils.Dataset] constructor.

    Returns:
        Dataset with the passed X and y arrays split across training and
            test sets.

    !!! tip "New in version 0.4.0"

    !!! tip "Changed in version 0.6.0"
        Added kwargs to pass to the [Dataset][pydvl.utils.Dataset] constructor.
    """
    if data_groups is None:
        raise ValueError(
            "data_groups must be provided when constructing a GroupedDataset"
        )
    x_train, x_test, y_train, y_test, data_groups_train, _ = train_test_split(
        X,
        y,
        data_groups,
        train_size=train_size,
        random_state=random_state,
        stratify=y if stratify_by_target else None,
    )
    dataset = Dataset(
        x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test, **kwargs
    )
    return cls.from_dataset(dataset, data_groups_train)

from_dataset classmethod

from_dataset(dataset: Dataset, data_groups: Sequence[Any]) -> GroupedDataset

Creates a GroupedDataset object from the data in a Dataset object and a mapping of data groups.

Example
>>> import numpy as np
>>> from pydvl.utils import Dataset, GroupedDataset
>>> dataset = Dataset.from_arrays(
...     X=np.asarray([[1, 2], [3, 4], [5, 6], [7, 8]]),
...     y=np.asarray([0, 1, 0, 1]),
... )
>>> dataset = GroupedDataset.from_dataset(dataset, data_groups=[0, 0, 1, 1])
PARAMETER DESCRIPTION
dataset

The original data.

TYPE: Dataset

data_groups

An array holding the group index or name for each data point. The length of this array must be equal to the number of data points in the dataset.

TYPE: Sequence[Any]

RETURNS DESCRIPTION
GroupedDataset

A GroupedDataset with the initial Dataset grouped by data_groups.

Source code in src/pydvl/utils/dataset.py
@classmethod
def from_dataset(
    cls, dataset: Dataset, data_groups: Sequence[Any]
) -> "GroupedDataset":
    """Creates a [GroupedDataset][pydvl.utils.GroupedDataset] object from the data a
    [Dataset][pydvl.utils.Dataset] object and a mapping of data groups.

    ??? Example
        ```pycon
        >>> import numpy as np
        >>> from pydvl.utils import Dataset, GroupedDataset
        >>> dataset = Dataset.from_arrays(
        ...     X=np.asarray([[1, 2], [3, 4], [5, 6], [7, 8]]),
        ...     y=np.asarray([0, 1, 0, 1]),
        ... )
        >>> dataset = GroupedDataset.from_dataset(dataset, data_groups=[0, 0, 1, 1])
        ```

    Args:
        dataset: The original data.
        data_groups: An array holding the group index or name for each data
            point. The length of this array must be equal to the number of
            data points in the dataset.

    Returns:
        A [GroupedDataset][pydvl.utils.GroupedDataset] with the initial
            [Dataset][pydvl.utils.Dataset] grouped by data_groups.
    """
    return cls(
        x_train=dataset.x_train,
        y_train=dataset.y_train,
        x_test=dataset.x_test,
        y_test=dataset.y_test,
        data_groups=data_groups,
        feature_names=dataset.feature_names,
        target_names=dataset.target_names,
        description=dataset.description,
    )

Progress

Progress(iterable: Iterable[T], is_done: StoppingCriterion, **kwargs: Any)

Bases: Generic[T]

Displays an optional progress bar for an iterable, using StoppingCriterion.completion for the progress.

PARAMETER DESCRIPTION
iterable

The iterable to wrap.

TYPE: Iterable[T]

is_done

The stopping criterion.

TYPE: StoppingCriterion

kwargs

Additional keyword arguments passed to tqdm:

- total: The total number of items in the iterable (Default: 100)
- unit: The unit of the progress bar (Default: %)
- desc: Description of the progress bar (Default: str(is_done))
- bar_format: Format of the progress bar (Default is a percentage bar)
- plus anything else that tqdm accepts

TYPE: Any DEFAULT: {}

Source code in src/pydvl/utils/progress.py
def __init__(
    self,
    iterable: Iterable[T],
    is_done: StoppingCriterion,
    **kwargs: Any,
) -> None:
    self.iterable = iterable
    self.is_done = is_done
    self.total = kwargs.pop("total", 100)
    desc = kwargs.pop("desc", str(is_done))
    unit = kwargs.pop("unit", "%")
    bar_format = kwargs.pop(
        "bar_format",
        "{desc}: {percentage:0.2f}%|{bar}| [{elapsed}<{remaining}, {rate_fmt}{postfix}]",
    )
    self.pbar = tqdm(
        total=self.total,
        desc=desc,
        unit=unit,
        bar_format=bar_format,
        **kwargs,
    )
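
A minimal usage sketch, not taken from the library docs: MaxUpdates and its import path are assumptions, and any StoppingCriterion whose completion() advances as work is done would serve equally well.

from pydvl.value.stopping import MaxUpdates  # assumed location of a concrete StoppingCriterion
from pydvl.utils.progress import Progress

done = MaxUpdates(100)
samples = range(1_000)  # any iterable of work items
for sample in Progress(samples, is_done=done, desc="Processing"):
    ...  # do work and update whatever `done` checks; the bar follows done.completion()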

ScorerCallable

Bases: Protocol

Signature for a scorer

Scorer

Scorer(
    scoring: Union[str, ScorerCallable],
    default: float = nan,
    range: Tuple = (-inf, inf),
    name: Optional[str] = None,
)

A scoring callable that takes a model, data, and labels and returns a scalar.

PARAMETER DESCRIPTION
scoring

Either a string or callable that can be passed to get_scorer.

TYPE: Union[str, ScorerCallable]

default

score to be used when a model cannot be fit, e.g. when too little data is passed, or errors arise.

TYPE: float DEFAULT: nan

range

numerical range of the score function. Some Monte Carlo methods can use this to estimate the number of samples required for a certain quality of approximation. If not provided, it can be read from the scoring object if it provides it, for instance if it was constructed with compose_score().

TYPE: Tuple DEFAULT: (-inf, inf)

name

The name of the scorer. If not provided, the name of the function passed will be used.

TYPE: Optional[str] DEFAULT: None

New in version 0.5.0

Source code in src/pydvl/utils/score.py
def __init__(
    self,
    scoring: Union[str, ScorerCallable],
    default: float = np.nan,
    range: Tuple = (-np.inf, np.inf),
    name: Optional[str] = None,
):
    if name is None and isinstance(scoring, str):
        name = scoring
    self._scorer = get_scorer(scoring)
    self.default = default
    # TODO: auto-fill from known scorers ?
    self.range = np.array(range)
    self._name = getattr(self._scorer, "__name__", name or "scorer")
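
As a quick illustration (a sketch, not from the library docs), a Scorer built from a scikit-learn scorer name can be called directly on a fitted model and test data:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from pydvl.utils.score import Scorer

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(x_train, y_train)
accuracy = Scorer("accuracy", default=0.0, range=(0, 1))
print(accuracy(model, x_test, y_test))  # a scalar in [0, 1]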

Status

Bases: Enum

Status of a computation.

Statuses can be combined using bitwise or (|) and bitwise and (&) to get the status of a combined computation. For example, if we have two computations, one that has converged and one that has failed, then the combined status is Status.Converged | Status.Failed == Status.Converged, but Status.Converged & Status.Failed == Status.Failed.

OR

The result of bitwise or-ing two valuation statuses with | is given by the following table:

  | P | C | F
--+---+---+---
P | P | C | P
C | C | C | C
F | P | C | F

where P = Pending, C = Converged, F = Failed.

AND

The result of bitwise and-ing two valuation statuses with & is given by the following table:

  | P | C | F
--+---+---+---
P | P | P | F
C | P | C | F
F | F | F | F

where P = Pending, C = Converged, F = Failed.

NOT

The result of bitwise negation of a Status with ~ is Failed if the status is Converged, or Converged otherwise:

~P == C, ~C == F, ~F == C

Boolean casting

A Status evaluates to True iff it's Converged or Failed:

bool(Status.Pending) == False
bool(Status.Converged) == True
bool(Status.Failed) == True

Warning

These truth values are inconsistent with the usual boolean operations. In particular the XOR of two instances of Status is not the same as the XOR of their boolean values.
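
The tables above translate directly into code. A minimal sketch, assuming Status is importable from pydvl.utils:

from pydvl.utils import Status

assert Status.Converged | Status.Failed == Status.Converged  # OR: Converged wins over Failed
assert Status.Converged & Status.Failed == Status.Failed     # AND: Failed dominates
assert ~Status.Converged == Status.Failed                    # NOT
assert bool(Status.Pending) is False                         # only Converged and Failed are truthy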

BaseModel

Bases: Protocol

This is the minimal model protocol with the method fit().

fit

fit(x: NDArray, y: NDArray | None)

Fit the model to the data

PARAMETER DESCRIPTION
x

Independent variables

TYPE: NDArray

y

Dependent variable

TYPE: NDArray | None

Source code in src/pydvl/utils/types.py
def fit(self, x: NDArray, y: NDArray | None):
    """Fit the model to the data

    Args:
        x: Independent variables
        y: Dependent variable
    """
    pass

SupervisedModel

Bases: Protocol

This is the standard sklearn Protocol with the methods fit(), predict() and score().

fit

fit(x: NDArray, y: NDArray | None)

Fit the model to the data

PARAMETER DESCRIPTION
x

Independent variables

TYPE: NDArray

y

Dependent variable

TYPE: NDArray | None

Source code in src/pydvl/utils/types.py
def fit(self, x: NDArray, y: NDArray | None):
    """Fit the model to the data

    Args:
        x: Independent variables
        y: Dependent variable
    """
    pass

predict

predict(x: NDArray) -> NDArray

Compute predictions for the input

PARAMETER DESCRIPTION
x

Independent variables for which to compute predictions

TYPE: NDArray

RETURNS DESCRIPTION
NDArray

Predictions for the input

Source code in src/pydvl/utils/types.py
def predict(self, x: NDArray) -> NDArray:
    """Compute predictions for the input

    Args:
        x: Independent variables for which to compute predictions

    Returns:
        Predictions for the input
    """
    pass

score

score(x: NDArray, y: NDArray | None) -> float

Compute the score of the model given test data

PARAMETER DESCRIPTION
x

Independent variables

TYPE: NDArray

y

Dependent variable

TYPE: NDArray | None

RETURNS DESCRIPTION
float

The score of the model on (x, y)

Source code in src/pydvl/utils/types.py
def score(self, x: NDArray, y: NDArray | None) -> float:
    """Compute the score of the model given test data

    Args:
        x: Independent variables
        y: Dependent variable

    Returns:
        The score of the model on `(x, y)`
    """
    pass

BaggingModel

Bases: Protocol

Any model with the attributes n_estimators and max_samples is considered a bagging model.

fit

fit(x: NDArray, y: NDArray | None)

Fit the model to the data

PARAMETER DESCRIPTION
x

Independent variables

TYPE: NDArray

y

Dependent variable

TYPE: NDArray | None

Source code in src/pydvl/utils/types.py
def fit(self, x: NDArray, y: NDArray | None):
    """Fit the model to the data

    Args:
        x: Independent variables
        y: Dependent variable
    """
    pass

predict

predict(x: NDArray) -> NDArray

Compute predictions for the input

PARAMETER DESCRIPTION
x

Independent variables for which to compute predictions

TYPE: NDArray

RETURNS DESCRIPTION
NDArray

Predictions for the input

Source code in src/pydvl/utils/types.py
def predict(self, x: NDArray) -> NDArray:
    """Compute predictions for the input

    Args:
        x: Independent variables for which to compute predictions

    Returns:
        Predictions for the input
    """
    pass

Utility

Utility(
    model: SupervisedModel,
    data: Dataset,
    scorer: Optional[Union[str, Scorer]] = None,
    *,
    default_score: float = 0.0,
    score_range: Tuple[float, float] = (-inf, inf),
    catch_errors: bool = True,
    show_warnings: bool = False,
    cache_backend: Optional[CacheBackend] = None,
    cached_func_options: Optional[CachedFuncConfig] = None,
    clone_before_fit: bool = True,
)

Convenience wrapper with configurable memoization of the scoring function.

An instance of Utility holds the triple of model, dataset and scoring function which determines the value of data points. This is used for the computation of all game-theoretic values like Shapley values and the Least Core.

The Utility expects the model to fulfill the SupervisedModel interface, i.e. to have fit(), predict(), and score() methods.

When calling the utility, the model will be cloned if it is a scikit-learn model; otherwise a copy is created using copy.deepcopy.

Since evaluating the scoring function requires retraining the model, which can be time-consuming, this class wraps it and caches the results of each execution. Caching is available both locally and across nodes, but must always be enabled for your project first; see the getting-started documentation and the caching module documentation.

ATTRIBUTE DESCRIPTION
model

The supervised model.

TYPE: SupervisedModel

data

An object containing the split data.

TYPE: Dataset

scorer

A scoring function. If None, the score() method of the model will be used. See score for ways to create and compose scorers, in particular how to set default values and ranges.

TYPE: Scorer

PARAMETER DESCRIPTION
model

Any supervised model. Typical choices can be found in the scikit-learn documentation.

TYPE: SupervisedModel

data

Dataset or GroupedDataset instance.

TYPE: Dataset

scorer

A scoring object. If None, the score() method of the model will be used. See score for ways to create and compose scorers, in particular how to set default values and ranges. For convenience, a string can be passed, which will be used to construct a Scorer.

TYPE: Optional[Union[str, Scorer]] DEFAULT: None

default_score

As a convenience, when no scorer object is passed (a Scorer can carry its own default), this argument allows setting the default score for models that could not be fit, e.g. when too little data is passed or errors arise.

TYPE: float DEFAULT: 0.0

score_range

As with default_score, this is a convenience argument for when no scorer argument is provided, to set the numerical range of the score function. Some Monte Carlo methods can use this to estimate the number of samples required for a certain quality of approximation.

TYPE: Tuple[float, float] DEFAULT: (-inf, inf)

catch_errors

Set to True to catch errors when fit() fails. This can happen at several steps of the pipeline, e.g. when too little training data is passed, which happens often during Shapley value calculations. When this happens, the default_score is returned as the score and computation continues.

TYPE: bool DEFAULT: True

show_warnings

Set to False to suppress warnings thrown by fit().

TYPE: bool DEFAULT: False

cache_backend

Optional instance of CacheBackend used to wrap the _utility method of the Utility instance. By default this is None, which means that utility evaluations will not be cached.

TYPE: Optional[CacheBackend] DEFAULT: None

cached_func_options

Optional configuration object for cached utility evaluation.

TYPE: Optional[CachedFuncConfig] DEFAULT: None

clone_before_fit

If True, the model will be cloned before calling fit().

TYPE: bool DEFAULT: True

Example
>>> from pydvl.utils import Utility, DataUtilityLearning, Dataset
>>> from sklearn.linear_model import LinearRegression, LogisticRegression
>>> from sklearn.datasets import load_iris
>>> dataset = Dataset.from_sklearn(load_iris(), random_state=16)
>>> u = Utility(LogisticRegression(random_state=16), dataset)
>>> u(dataset.indices)
0.9

With caching enabled:

>>> from pydvl.utils import Utility, DataUtilityLearning, Dataset
>>> from pydvl.utils.caching.memory import InMemoryCacheBackend
>>> from sklearn.linear_model import LinearRegression, LogisticRegression
>>> from sklearn.datasets import load_iris
>>> dataset = Dataset.from_sklearn(load_iris(), random_state=16)
>>> cache_backend = InMemoryCacheBackend()
>>> u = Utility(LogisticRegression(random_state=16), dataset, cache_backend=cache_backend)
>>> u(dataset.indices)
0.9
Source code in src/pydvl/utils/utility.py
def __init__(
    self,
    model: SupervisedModel,
    data: Dataset,
    scorer: Optional[Union[str, Scorer]] = None,
    *,
    default_score: float = 0.0,
    score_range: Tuple[float, float] = (-np.inf, np.inf),
    catch_errors: bool = True,
    show_warnings: bool = False,
    cache_backend: Optional[CacheBackend] = None,
    cached_func_options: Optional[CachedFuncConfig] = None,
    clone_before_fit: bool = True,
):
    self.model = self._clone_model(model)
    self.data = data
    if isinstance(scorer, str):
        scorer = Scorer(scorer, default=default_score, range=score_range)
    self.scorer = check_scoring(self.model, scorer)
    self.default_score = scorer.default if scorer is not None else default_score
    # TODO: auto-fill from known scorers ?
    self.score_range = scorer.range if scorer is not None else np.array(score_range)
    self.clone_before_fit = clone_before_fit
    self.catch_errors = catch_errors
    self.show_warnings = show_warnings
    self.cache = cache_backend
    if cached_func_options is None:
        cached_func_options = CachedFuncConfig()
    # TODO: Find a better way to do this.
    if cached_func_options.hash_prefix is None:
        # FIX: This does not handle reusing the same across runs.
        cached_func_options.hash_prefix = str(hash((model, data, scorer)))
    self.cached_func_options = cached_func_options
    self._initialize_utility_wrapper()

cache_stats property

cache_stats: Optional[CacheStats]

Cache statistics are gathered when cache is enabled. See CacheStats for all fields returned.

__call__

__call__(indices: Iterable[int]) -> float
PARAMETER DESCRIPTION
indices

a subset of valid indices for the x_train attribute of Dataset.

TYPE: Iterable[int]

Source code in src/pydvl/utils/utility.py
def __call__(self, indices: Iterable[int]) -> float:
    """
    Args:
        indices: a subset of valid indices for the
            `x_train` attribute of [Dataset][pydvl.utils.dataset.Dataset].
    """
    utility: float = self._utility_wrapper(frozenset(indices))
    return utility

DataUtilityLearning

DataUtilityLearning(u: Utility, training_budget: int, model: SupervisedModel)

Implementation of Data Utility Learning (Wang et al., 2022)1.

This object wraps a Utility and delegates calls to it, up until a given budget (number of iterations). Every tuple of input and output (a so-called utility sample) is stored. Once the budget is exhausted, DataUtilityLearning fits the given model to the utility samples. Subsequent calls will use the learned model to predict the utility instead of delegating.

PARAMETER DESCRIPTION
u

The Utility to learn.

TYPE: Utility

training_budget

Number of utility samples to collect before fitting the given model.

TYPE: int

model

A supervised regression model

TYPE: SupervisedModel

Example
>>> from pydvl.utils import Utility, DataUtilityLearning, Dataset
>>> from sklearn.linear_model import LinearRegression, LogisticRegression
>>> from sklearn.datasets import load_iris
>>> dataset = Dataset.from_sklearn(load_iris())
>>> u = Utility(LogisticRegression(), dataset)
>>> wrapped_u = DataUtilityLearning(u, 3, LinearRegression())
... # First 3 calls will be computed normally
>>> for i in range(3):
...     _ = wrapped_u((i,))
>>> wrapped_u((1, 2, 3)) # Subsequent calls will be computed using the fit model for DUL
0.0
Source code in src/pydvl/utils/utility.py
def __init__(
    self, u: Utility, training_budget: int, model: SupervisedModel
) -> None:
    self.utility = u
    self.training_budget = training_budget
    self.model = model
    self._current_iteration = 0
    self._is_model_fit = False
    self._utility_samples: Dict[FrozenSet, Tuple[NDArray[np.bool_], float]] = {}

data property

data: Dataset

Returns the wrapped utility's Dataset.

maybe_add_argument

maybe_add_argument(fun: Callable, new_arg: str) -> Callable

Wraps a function to accept the given keyword parameter if it doesn't already.

If fun already takes a keyword parameter of name new_arg, then it is returned as is. Otherwise, a wrapper is returned which merely ignores the argument.

PARAMETER DESCRIPTION
fun

The function to wrap

TYPE: Callable

new_arg

The name of the argument that the new function will accept (and ignore).

TYPE: str

RETURNS DESCRIPTION
Callable

A new function accepting one more keyword argument.

Changed in version 0.7.0

Ability to work with partials.

Source code in src/pydvl/utils/functional.py
def maybe_add_argument(fun: Callable, new_arg: str) -> Callable:
    """Wraps a function to accept the given keyword parameter if it doesn't
    already.

    If `fun` already takes a keyword parameter of name `new_arg`, then it is
    returned as is. Otherwise, a wrapper is returned which merely ignores the
    argument.

    Args:
        fun: The function to wrap
        new_arg: The name of the argument that the new function will accept
            (and ignore).

    Returns:
        A new function accepting one more keyword argument.

    !!! tip "Changed in version 0.7.0"
        Ability to work with partials.
    """
    if new_arg in free_arguments(fun):
        return fun

    return functools.partial(_accept_additional_argument, fun=fun, arg=new_arg)
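
A minimal illustration of the behaviour described above (a sketch; the function and argument names are made up):

from pydvl.utils.functional import maybe_add_argument

def loss(x: float) -> float:
    return x ** 2

wrapped = maybe_add_argument(loss, "seed")
print(wrapped(3.0, seed=42))  # 9.0: the extra keyword is accepted and ignored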

suppress_warnings

suppress_warnings(fun: Callable[P, R]) -> Callable[P, R]
suppress_warnings(
    fun: None = None,
    *,
    categories: Sequence[Type[Warning]] = (Warning,),
    flag: str = "",
) -> Callable[[Callable[P, R]], Callable[P, R]]
suppress_warnings(
    fun: Callable[P, R],
    *,
    categories: Sequence[Type[Warning]] = (Warning,),
    flag: str = "",
) -> Callable[P, R]
suppress_warnings(
    fun: Callable[P, R] | None = None,
    *,
    categories: Sequence[Type[Warning]] = (Warning,),
    flag: str = "",
) -> Union[Callable[[Callable[P, R]], Callable[P, R]], Callable[P, R]]

Decorator for class methods to conditionally suppress warnings.

The decorated method will execute with warnings suppressed for the specified categories. If the instance has the attribute named by flag, and it evaluates to True, then suppression will be deactivated.

Suppress all warnings
class A:
    @suppress_warnings
    def method(self, ...):
        ...
Suppress only UserWarning
class A:
    @suppress_warnings(categories=(UserWarning,))
    def method(self, ...):
        ...
Configuring behaviour at runtime
class A:
    def __init__(self, warn_enabled: bool):
        self.warn_enabled = warn_enabled

    @suppress_warnings(flag="warn_enabled")
    def method(self, ...):
        ...
PARAMETER DESCRIPTION
fun

Optional callable to decorate. If provided, the decorator is applied inline.

TYPE: Callable[P, R] | None DEFAULT: None

categories

Sequence of warning categories to suppress.

TYPE: Sequence[Type[Warning]] DEFAULT: (Warning,)

flag

Name of an instance attribute to check for enabling warnings. If the attribute exists and evaluates to True, warnings will not be suppressed.

TYPE: str DEFAULT: ''

RETURNS DESCRIPTION
Union[Callable[[Callable[P, R]], Callable[P, R]], Callable[P, R]]

Either a decorator (if no function is provided) or the decorated callable.

Source code in src/pydvl/utils/functional.py
def suppress_warnings(
    fun: Callable[P, R] | None = None,
    *,
    categories: Sequence[Type[Warning]] = (Warning,),
    flag: str = "",
) -> Union[Callable[[Callable[P, R]], Callable[P, R]], Callable[P, R]]:
    """Decorator for class methods to conditionally suppress warnings.

    The decorated method will execute with warnings suppressed for the specified
    categories. If the instance has the attribute named by `flag`, and it evaluates to
    `True`, then suppression will be deactivated.

    ??? Example "Suppress all warnings"
        ```python
        class A:
            @suppress_warnings
            def method(self, ...):
                ...
        ```
    ??? Example "Suppress only `UserWarning`"
        ```python
        class A:
            @suppress_warnings(categories=(UserWarning,))
            def method(self, ...):
                ...
        ```
    ??? Example "Configuring behaviour at runtime"
        ```python
        class A:
            def __init__(self, warn_enabled: bool):
                self.warn_enabled = warn_enabled

            @suppress_warnings(flag="warn_enabled")
            def method(self, ...):
                ...
        ```

    Args:
        fun: Optional callable to decorate. If provided, the decorator is applied inline.
        categories: Sequence of warning categories to suppress.
        flag: Name of an instance attribute to check for enabling warnings. If the
              attribute exists and evaluates to `True`, warnings will **not** be
              suppressed.

    Returns:
        Either a decorator (if no function is provided) or the decorated callable.
    """

    def decorator(fn: Callable[P, R]) -> Callable[P, R]:
        # Use a simple heuristic: if the first parameter is "self", assume it's a method.
        sig = inspect.signature(fn)
        params = list(sig.parameters)
        if not params or params[0] != "self":
            if flag:
                raise ValueError("Cannot use suppress_warnings flag with non-methods")

            @functools.wraps(fn)
            def wrapper(*args: Any, **kwargs: Any) -> R:
                with warnings.catch_warnings():
                    for category in categories:
                        warnings.simplefilter("ignore", category=category)
                    return fn(*args, **kwargs)

            return cast(Callable[P, R], wrapper)
        else:

            @functools.wraps(fn)
            def wrapper(self, *args: Any, **kwargs: Any) -> R:
                if flag and not hasattr(self, flag):
                    raise AttributeError(
                        f"Instance has no attribute '{flag}' for suppress_warnings"
                    )
                if flag and getattr(self, flag, False):
                    return fn(self, *args, **kwargs)
                with warnings.catch_warnings():
                    for category in categories:
                        warnings.simplefilter("ignore", category=category)
                    return fn(self, *args, **kwargs)

            return cast(Callable[P, R], wrapper)

    if fun is None:
        return decorator
    return decorator(fun)

timed

timed(fun: Callable[P, R]) -> TimedCallable[P, R]
timed(
    fun: None = None, *, accumulate: bool = False, logger: Logger | None = None
) -> Callable[[Callable[P, R]], TimedCallable[P, R]]
timed(
    fun: Callable[P, R],
    *,
    accumulate: bool = False,
    logger: Logger | None = None,
) -> TimedCallable[P, R]
timed(
    fun: Callable[P, R] | None = None,
    *,
    accumulate: bool = False,
    logger: Logger | None = None,
) -> Union[
    Callable[[Callable[P, R]], TimedCallable[P, R]], TimedCallable[P, R]
]

A decorator that measures the execution time of the wrapped function. Optionally logs the time taken.

Decorator usage
@timed
def fun(...):
    ...

@timed(accumulate=True, logger=getLogger(__name__))
def fun(...):
    ...
Inline usage
timed_fun = timed(fun)
timed_fun(...)
print(timed_fun.execution_time)

accum_time = timed(fun, accumulate=True)
accum_time(...)
accum_time(...)
print(accum_time.execution_time)
PARAMETER DESCRIPTION
fun

TYPE: Callable[P, R] | None DEFAULT: None

accumulate

If True, the total execution time will be accumulated across all calls.

TYPE: bool DEFAULT: False

logger

If provided, the execution time will be logged at the logger's level.

TYPE: Logger | None DEFAULT: None

RETURNS DESCRIPTION
Union[Callable[[Callable[P, R]], TimedCallable[P, R]], TimedCallable[P, R]]

A decorator that wraps a function, measuring and optionally logging its execution time. The function will have an attribute execution_time where either the time of the last execution or the accumulated total is stored.

Source code in src/pydvl/utils/functional.py
def timed(
    fun: Callable[P, R] | None = None,
    *,
    accumulate: bool = False,
    logger: Logger | None = None,
) -> Union[Callable[[Callable[P, R]], TimedCallable[P, R]], TimedCallable[P, R]]:
    """A decorator that measures the execution time of the wrapped function.
    Optionally logs the time taken.

    ??? Example "Decorator usage"
        ```python
        @timed
        def fun(...):
            ...

        @timed(accumulate=True, logger=getLogger(__name__))
        def fun(...):
            ...
        ```

    ??? Example "Inline usage"
        ```python
        timed_fun = timed(fun)
        timed_fun(...)
        print(timed_fun.execution_time)

        accum_time = timed(fun, accumulate=True)
        accum_time(...)
        accum_time(...)
        print(accum_time.execution_time)
        ```

    Args:
        fun:
        accumulate: If `True`, the total execution time will be accumulated across all
            calls.
        logger: If provided, the execution time will be logged at the logger's level.

    Returns:
        A decorator that wraps a function, measuring and optionally logging its
        execution time. The function will have an attribute `execution_time` where
        either the time of the last execution or the accumulated total is stored.
    """

    if fun is None:

        def decorator(func: Callable[P, R]) -> TimedCallable[P, R]:
            return timed(func, accumulate=accumulate, logger=logger)

        return decorator

    assert fun is not None

    @functools.wraps(fun)
    def wrapper(*args, **kwargs) -> R:
        start = time.perf_counter()
        try:
            assert fun is not None
            result = fun(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            if accumulate:
                cast(TimedCallable, wrapper).execution_time += elapsed
            else:
                cast(TimedCallable, wrapper).execution_time = elapsed
            if logger is not None:
                assert fun is not None
                logger.log(
                    logger.level,
                    f"{fun.__module__}.{fun.__qualname__} took {elapsed:.5f} seconds",
                )
        return result

    cast(TimedCallable, wrapper).execution_time = 0.0

    return cast(TimedCallable[P, R], wrapper)

complement

complement(
    include: NDArray[T], exclude: NDArray[T] | Sequence[T | None]
) -> NDArray[T]

Returns the complement of a set of indices, i.e. the elements of include that are not in exclude.

PARAMETER DESCRIPTION
include

The set of indices to consider.

TYPE: NDArray[T]

exclude

The indices to exclude from the complement. These must be a subset of include. If an index is None it is ignored.

TYPE: NDArray[T] | Sequence[T | None]

RETURNS DESCRIPTION
NDArray[T]

The complement of the set of indices excluding the given indices.

Source code in src/pydvl/utils/numeric.py
def complement(
    include: NDArray[T], exclude: NDArray[T] | Sequence[T | None]
) -> NDArray[T]:
    """Returns the complement of the set of indices excluding the given
    indices.

    Args:
        include: The set of indices to consider.
        exclude: The indices to exclude from the complement. These must be a subset
            of `include`. If an index is `None` it is ignored.

    Returns:
        The complement of the set of indices excluding the given indices.
    """
    _exclude = np.array([i for i in exclude if i is not None], dtype=include.dtype)
    return np.setdiff1d(include, _exclude).astype(np.int_)
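
A short example of the behaviour (sketch):

import numpy as np
from pydvl.utils.numeric import complement

idx = np.arange(5)
print(complement(idx, [1, 3, None]))  # [0 2 4]; None entries are ignored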

powerset

powerset(s: NDArray[T]) -> Iterator[Collection[T]]

Returns an iterator for the power set of the argument.

Subsets are generated in sequence by growing size. See random_powerset() for random sampling.

Example
>>> import numpy as np
>>> from pydvl.utils.numeric import powerset
>>> list(powerset(np.array((1,2))))
[(), (1,), (2,), (1, 2)]
PARAMETER DESCRIPTION
s

The set to use

TYPE: NDArray[T]

RETURNS DESCRIPTION
Iterator[Collection[T]]

An iterator over all subsets of the set of indices s.

Source code in src/pydvl/utils/numeric.py
def powerset(s: NDArray[T]) -> Iterator[Collection[T]]:
    """Returns an iterator for the power set of the argument.

     Subsets are generated in sequence by growing size. See
     [random_powerset()][pydvl.utils.numeric.random_powerset] for random
     sampling.

    ??? Example
        ``` pycon
        >>> import numpy as np
        >>> from pydvl.utils.numeric import powerset
        >>> list(powerset(np.array((1,2))))
        [(), (1,), (2,), (1, 2)]
        ```

    Args:
         s: The set to use

    Returns:
        An iterator over all subsets of the set of indices `s`.
    """
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

num_samples_permutation_hoeffding

num_samples_permutation_hoeffding(
    eps: float, delta: float, u_range: float
) -> int

Lower bound on the number of samples required for Monte Carlo Shapley to obtain an (ε,δ)-approximation.

That is: with probability 1-δ, the estimated value for one data point will be ε-close to the true quantity, if at least this many permutations are sampled.

PARAMETER DESCRIPTION
eps

ε > 0

TYPE: float

delta

0 < δ <= 1

TYPE: float

u_range

Range of the Utility function

TYPE: float

RETURNS DESCRIPTION
int

Number of permutations required to guarantee ε-correct Shapley values with probability 1-δ

Source code in src/pydvl/utils/numeric.py
def num_samples_permutation_hoeffding(eps: float, delta: float, u_range: float) -> int:
    """Lower bound on the number of samples required for MonteCarlo Shapley to
    obtain an (ε,δ)-approximation.

    That is: with probability 1-δ, the estimated value for one data point will
    be ε-close to the true quantity, if at least this many permutations are
    sampled.

    Args:
        eps: ε > 0
        delta: 0 < δ <= 1
        u_range: Range of the [Utility][pydvl.utils.utility.Utility] function

    Returns:
        Number of _permutations_ required to guarantee ε-correct Shapley
            values with probability 1-δ
    """
    return int(np.ceil(np.log(2 / delta) * 2 * u_range**2 / eps**2))
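
For instance, with ε = 0.1, δ = 0.05 and a utility range of 1, the bound evaluates to ⌈2 · ln(2/0.05) / 0.1²⌉ = 738 permutations. A minimal sketch of that computation:

from pydvl.utils.numeric import num_samples_permutation_hoeffding

n = num_samples_permutation_hoeffding(eps=0.1, delta=0.05, u_range=1.0)
print(n)  # 738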

random_subset

random_subset(
    s: NDArray[T], q: float = 0.5, seed: Optional[Seed] = None
) -> NDArray[T]

Returns one subset at random from s.

PARAMETER DESCRIPTION
s

set to sample from

TYPE: NDArray[T]

q

Sampling probability for elements. The default 0.5 yields a uniform distribution over the power set of s.

TYPE: float DEFAULT: 0.5

seed

Either an instance of a numpy random number generator or a seed for it.

TYPE: Optional[Seed] DEFAULT: None

RETURNS DESCRIPTION
NDArray[T]

The subset

Source code in src/pydvl/utils/numeric.py
def random_subset(
    s: NDArray[T], q: float = 0.5, seed: Optional[Seed] = None
) -> NDArray[T]:
    """Returns one subset at random from ``s``.

    Args:
        s: set to sample from
        q: Sampling probability for elements. The default 0.5 yields a
            uniform distribution over the power set of s.
        seed: Either an instance of a numpy random number generator or a seed
            for it.

    Returns:
        The subset
    """
    rng = np.random.default_rng(seed)
    selection = rng.uniform(size=len(s)) < q
    return s[selection]

random_powerset

random_powerset(
    s: NDArray[T],
    n_samples: Optional[int] = None,
    q: float = 0.5,
    seed: Optional[Seed] = None,
) -> Generator[NDArray[T], None, None]

Samples subsets from the power set of the argument, without pre-generating all subsets and in no order.

See powerset if you wish to deterministically generate all subsets.

To generate subsets, len(s) Bernoulli draws with probability q are drawn. The default value of q = 0.5 provides a uniform distribution over the power set of s. Other choices can be used e.g. to implement owen_sampling_shapley.

PARAMETER DESCRIPTION
s

set to sample from

TYPE: NDArray[T]

n_samples

if set, stop the generator after this many steps. Defaults to np.iinfo(np.int32).max

TYPE: Optional[int] DEFAULT: None

q

Sampling probability for elements. The default 0.5 yields a uniform distribution over the power set of s.

TYPE: float DEFAULT: 0.5

seed

Either an instance of a numpy random number generator or a seed for it.

TYPE: Optional[Seed] DEFAULT: None

RETURNS DESCRIPTION
None

Samples from the power set of s.

RAISES DESCRIPTION
ValueError

if the element sampling probability is not in [0,1]

Source code in src/pydvl/utils/numeric.py
def random_powerset(
    s: NDArray[T],
    n_samples: Optional[int] = None,
    q: float = 0.5,
    seed: Optional[Seed] = None,
) -> Generator[NDArray[T], None, None]:
    """Samples subsets from the power set of the argument, without
    pre-generating all subsets and in no order.

    See [powerset][pydvl.utils.numeric.powerset] if you wish to deterministically generate all subsets.

    To generate subsets, `len(s)` Bernoulli draws with probability `q` are
    drawn. The default value of `q = 0.5` provides a uniform distribution over
    the power set of `s`. Other choices can be used e.g. to implement
    [owen_sampling_shapley][pydvl.value.shapley.owen.owen_sampling_shapley].

    Args:
        s: set to sample from
        n_samples: if set, stop the generator after this many steps.
            Defaults to `np.iinfo(np.int32).max`
        q: Sampling probability for elements. The default 0.5 yields a
            uniform distribution over the power set of s.
        seed: Either an instance of a numpy random number generator or a seed for it.

    Returns:
        Samples from the power set of `s`.

    Raises:
        ValueError: if the element sampling probability is not in [0,1]

    """
    if q < 0 or q > 1:
        raise ValueError("Element sampling probability must be in [0,1]")

    rng = np.random.default_rng(seed)
    total = 1
    if n_samples is None:
        n_samples = np.iinfo(np.int32).max
    while total <= n_samples:
        yield random_subset(s, q, seed=rng)
        total += 1
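
A brief usage sketch (not from the library docs), drawing three random subsets of a ten-element set:

import numpy as np
from pydvl.utils.numeric import random_powerset

for subset in random_powerset(np.arange(10), n_samples=3, q=0.5, seed=42):
    print(subset)  # each draw includes every element independently with probability q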

random_powerset_label_min

random_powerset_label_min(
    s: NDArray[T],
    labels: NDArray[int_],
    min_elements_per_label: int = 1,
    seed: Optional[Seed] = None,
) -> Generator[NDArray[T], None, None]

Draws random subsets from s, while ensuring that at least min_elements_per_label elements per label are included in the draw. It can be used for classification problems to ensure that a set contains information for all labels (or not if min_elements_per_label=0).

PARAMETER DESCRIPTION
s

Set to sample from

TYPE: NDArray[T]

labels

Labels for the samples

TYPE: NDArray[int_]

min_elements_per_label

Minimum number of elements for each label.

TYPE: int DEFAULT: 1

seed

Either an instance of a numpy random number generator or a seed for it.

TYPE: Optional[Seed] DEFAULT: None

RETURNS DESCRIPTION
None

Generated draw from the powerset of s with min_elements_per_label for each label.

RAISES DESCRIPTION
ValueError

If s and labels are of different length or min_elements_per_label is smaller than 0.

Source code in src/pydvl/utils/numeric.py
def random_powerset_label_min(
    s: NDArray[T],
    labels: NDArray[np.int_],
    min_elements_per_label: int = 1,
    seed: Optional[Seed] = None,
) -> Generator[NDArray[T], None, None]:
    """Draws random subsets from `s`, while ensuring that at least
    `min_elements_per_label` elements per label are included in the draw. It can be used
    for classification problems to ensure that a set contains information for all labels
    (or not if `min_elements_per_label=0`).

    Args:
        s: Set to sample from
        labels: Labels for the samples
        min_elements_per_label: Minimum number of elements for each label.
        seed: Either an instance of a numpy random number generator or a seed for it.

    Returns:
        Generated draw from the powerset of s with `min_elements_per_label` for each
        label.

    Raises:
        ValueError: If `s` and `labels` are of different length or
            `min_elements_per_label` is smaller than 0.
    """
    if len(labels) != len(s):
        raise ValueError("Set and labels have to be of same size.")

    if min_elements_per_label < 0:
        raise ValueError(
            f"Parameter min_elements={min_elements_per_label} needs to be bigger or "
            f"equal to 0."
        )

    rng = np.random.default_rng(seed)
    unique_labels = np.unique(labels)

    while True:
        subsets: list[NDArray[T]] = []
        for label in unique_labels:
            label_indices = np.asarray(np.where(labels == label)[0])
            subset_size = int(
                rng.integers(
                    min(min_elements_per_label, len(label_indices)),
                    len(label_indices) + 1,
                )
            )
            if subset_size > 0:
                subsets.append(
                    random_subset_of_size(s[label_indices], subset_size, seed=rng)
                )

        if len(subsets) > 0:
            subset = np.concatenate(tuple(subsets))
            rng.shuffle(subset)
            yield subset
        else:
            yield np.array([], dtype=s.dtype)
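
A minimal sketch of a draw that always contains at least one element per class:

import numpy as np
from pydvl.utils.numeric import random_powerset_label_min

s = np.arange(6)
labels = np.array([0, 0, 0, 1, 1, 1])
subsets = random_powerset_label_min(s, labels, min_elements_per_label=1, seed=0)
print(next(subsets))  # contains at least one index with label 0 and one with label 1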

random_subset_of_size

random_subset_of_size(
    s: NDArray[T], size: int, seed: Optional[Seed] = None
) -> NDArray[T]

Samples a random subset of given size uniformly from the powerset of s.

PARAMETER DESCRIPTION
s

Set to sample from

TYPE: NDArray[T]

size

Size of the subset to generate

TYPE: int

seed

Either an instance of a numpy random number generator or a seed for it.

TYPE: Optional[Seed] DEFAULT: None

RETURNS DESCRIPTION
NDArray[T]

The subset

RAISES DESCRIPTION
ValueError

If size > len(s)

Source code in src/pydvl/utils/numeric.py
def random_subset_of_size(
    s: NDArray[T], size: int, seed: Optional[Seed] = None
) -> NDArray[T]:
    """Samples a random subset of given size uniformly from the powerset
    of `s`.

    Args:
        s: Set to sample from
        size: Size of the subset to generate
        seed: Either an instance of a numpy random number generator or a seed for it.

    Returns:
        The subset

    Raises:
        ValueError: If size > len(s)
    """
    if size > len(s):
        raise ValueError("Cannot sample subset larger than set")
    rng = np.random.default_rng(seed)
    return rng.choice(s, size=size, replace=False)

random_matrix_with_condition_number

random_matrix_with_condition_number(
    n: int, condition_number: float, seed: Optional[Seed] = None
) -> NDArray

Constructs a square matrix with a given condition number.

Taken from: https://gist.github.com/bstellato/23322fe5d87bb71da922fbc41d658079#file-random_mat_condition_number-py

Also see: https://math.stackexchange.com/questions/1351616/condition-number-of-ata.

PARAMETER DESCRIPTION
n

size of the matrix

TYPE: int

condition_number

The desired condition number of the returned matrix.

TYPE: float

seed

Either an instance of a numpy random number generator or a seed for it.

TYPE: Optional[Seed] DEFAULT: None

RETURNS DESCRIPTION
NDArray

An (n,n) matrix with the requested condition number.

Source code in src/pydvl/utils/numeric.py
def random_matrix_with_condition_number(
    n: int, condition_number: float, seed: Optional[Seed] = None
) -> NDArray:
    """Constructs a square matrix with a given condition number.

    Taken from:
    [https://gist.github.com/bstellato/23322fe5d87bb71da922fbc41d658079#file-random_mat_condition_number-py](
    https://gist.github.com/bstellato/23322fe5d87bb71da922fbc41d658079#file-random_mat_condition_number-py)

    Also see:
    [https://math.stackexchange.com/questions/1351616/condition-number-of-ata](
    https://math.stackexchange.com/questions/1351616/condition-number-of-ata).

    Args:
        n: size of the matrix
        condition_number: The desired condition number of the returned matrix.
        seed: Either an instance of a numpy random number generator or a seed for it.

    Returns:
        An (n,n) matrix with the requested condition number.
    """
    if n < 2:
        raise ValueError("Matrix size must be at least 2")

    if condition_number <= 1:
        raise ValueError("Condition number must be greater than 1")

    rng = np.random.default_rng(seed)
    log_condition_number = np.log(condition_number)
    exp_vec = np.arange(
        -log_condition_number / 4.0,
        log_condition_number * (n + 1) / (4 * (n - 1)),
        log_condition_number / (2.0 * (n - 1)),
    )
    exp_vec = exp_vec[:n]
    s: np.ndarray = np.exp(exp_vec)
    S = np.diag(s)
    U, _ = np.linalg.qr((rng.uniform(size=(n, n)) - 5.0) * 200)
    V, _ = np.linalg.qr((rng.uniform(size=(n, n)) - 5.0) * 200)
    P: np.ndarray = U.dot(S).dot(V.T)
    P = P.dot(P.T)
    return P
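
A quick sanity check of the construction (a sketch; the measured value fluctuates slightly with floating point error):

import numpy as np
from pydvl.utils.numeric import random_matrix_with_condition_number

A = random_matrix_with_condition_number(5, condition_number=100.0, seed=0)
print(np.linalg.cond(A))  # approximately 100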

running_moments

running_moments(
    previous_avg: float,
    previous_variance: float,
    count: int,
    new_value: float,
    unbiased: bool = True,
) -> tuple[float, float]

Calculates running average and variance of a series of numbers.

See Welford's algorithm in wikipedia

Warning

This is not really using Welford's correction for numerical stability for the variance. (FIXME)

Todo

This could be generalised to arbitrary moments. See this paper

PARAMETER DESCRIPTION
previous_avg

average value at previous step.

TYPE: float

previous_variance

variance at previous step.

TYPE: float

count

number of points seen so far.

TYPE: int

new_value

new value in the series of numbers.

TYPE: float

unbiased

whether to use the unbiased variance estimator (same as np.var with ddof=1).

TYPE: bool DEFAULT: True

Returns: new_average, new_variance, calculated with the new count

Source code in src/pydvl/utils/numeric.py
def running_moments(
    previous_avg: float,
    previous_variance: float,
    count: int,
    new_value: float,
    unbiased: bool = True,
) -> tuple[float, float]:
    """Calculates running average and variance of a series of numbers.

    See [Welford's algorithm in
    wikipedia](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm)

    !!! Warning
        This is not really using Welford's correction for numerical stability
        for the variance. (FIXME)

    !!! Todo
        This could be generalised to arbitrary moments. See [this
        paper](https://www.osti.gov/biblio/1028931)

    Args:
        previous_avg: average value at previous step.
        previous_variance: variance at previous step.
        count: number of points seen so far.
        new_value: new value in the series of numbers.
        unbiased: whether to use the unbiased variance estimator (same as `np.var` with
            `ddof=1`).
    Returns:
        new_average, new_variance, calculated with the new count
    """
    delta = new_value - previous_avg
    new_average = previous_avg + delta / (count + 1)

    if unbiased:
        if count > 0:
            new_variance = (
                previous_variance + delta**2 / (count + 1) - previous_variance / count
            )
        else:
            new_variance = 0.0
    else:
        new_variance = previous_variance + (
            delta * (new_value - new_average) - previous_variance
        ) / (count + 1)

    return new_average, new_variance
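
A short consistency check against numpy (a sketch), using the population (biased) estimator and starting the recursion from count = 0:

import numpy as np
from pydvl.utils.numeric import running_moments

values = [1.0, 2.0, 4.0, 8.0]
avg, var = 0.0, 0.0
for n, v in enumerate(values):
    avg, var = running_moments(avg, var, n, v, unbiased=False)
print(np.isclose(avg, np.mean(values)), np.isclose(var, np.var(values)))  # True True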

top_k_value_accuracy

top_k_value_accuracy(
    y_true: NDArray[float64], y_pred: NDArray[float64], k: int = 3
) -> float

Computes the top-k accuracy for the estimated values by comparing indices of the highest k values.

PARAMETER DESCRIPTION
y_true

Exact/true value

TYPE: NDArray[float64]

y_pred

Predicted/estimated value

TYPE: NDArray[float64]

k

Number of the highest values taken into account

TYPE: int DEFAULT: 3

RETURNS DESCRIPTION
float

Accuracy

Source code in src/pydvl/utils/numeric.py
def top_k_value_accuracy(
    y_true: NDArray[np.float64], y_pred: NDArray[np.float64], k: int = 3
) -> float:
    """Computes the top-k accuracy for the estimated values by comparing indices
    of the highest k values.

    Args:
        y_true: Exact/true value
        y_pred: Predicted/estimated value
        k: Number of the highest values taken into account

    Returns:
        Accuracy
    """
    top_k_exact_values = np.argsort(y_true)[-k:]
    top_k_pred_values = np.argsort(y_pred)[-k:]
    top_k_accuracy = len(np.intersect1d(top_k_exact_values, top_k_pred_values)) / k
    return top_k_accuracy
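
For example (a sketch), two rankings that agree on which two entries are largest yield an accuracy of 1:

import numpy as np
from pydvl.utils.numeric import top_k_value_accuracy

y_true = np.array([0.1, 0.5, 0.2, 0.9])
y_pred = np.array([0.0, 0.6, 0.3, 0.8])
print(top_k_value_accuracy(y_true, y_pred, k=2))  # 1.0: both place indices 1 and 3 on top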

logcomb

logcomb(n: int, k: int) -> float

Computes the log of the binomial coefficient (n choose k).

\[ \begin{array}{rcl} \log\binom{n}{k} & = & \log(n!) - \log(k!) - \log((n-k)!) \\ & = & \log\Gamma(n+1) - \log\Gamma(k+1) - \log\Gamma(n-k+1). \end{array} \]
PARAMETER DESCRIPTION
n

Total number of elements

TYPE: int

k

Number of elements to choose

TYPE: int

Returns: The log of the binomial coefficient

Source code in src/pydvl/utils/numeric.py
def logcomb(n: int, k: int) -> float:
    r"""Computes the log of the binomial coefficient (n choose k).

    $$
    \begin{array}{rcl}
        \log\binom{n}{k} & = & \log(n!) - \log(k!) - \log((n-k)!) \\
                         & = & \log\Gamma(n+1) - \log\Gamma(k+1) - \log\Gamma(n-k+1).
    \end{array}
    $$

    Args:
        n: Total number of elements
        k: Number of elements to choose
    Returns:
        The log of the binomial coefficient
        """
    if k < 0 or k > n or n < 0:
        raise ValueError(f"Invalid arguments: n={n}, k={k}")
    return float(gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1))
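
A quick check against math.comb (sketch):

from math import comb, log
from pydvl.utils.numeric import logcomb

print(logcomb(10, 3))    # ≈ 4.787
print(log(comb(10, 3)))  # log(120) ≈ 4.787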

logexp

logexp(x: float, a: float) -> float

Computes log(x^a).

PARAMETER DESCRIPTION
x

Base

TYPE: float

a

Exponent

TYPE: float

Returns: a * log(x)

Source code in src/pydvl/utils/numeric.py
def logexp(x: float, a: float) -> float:
    """Computes log(x^a).

    Args:
        x: Base
        a: Exponent
    Returns:
        a * log(x)
    """
    return float(a * np.log(x))

logsumexp_two

logsumexp_two(log_a: float, log_b: float) -> float

Numerically stable computation of log(exp(log_a) + exp(log_b)).

Uses standard log sum exp trick:

\[ \log(\exp(\log a) + \exp(\log b)) = m + \log(\exp(\log a - m) + \exp(\log b - m)), \]

where \(m = \max(\log a, \log b)\).

PARAMETER DESCRIPTION
log_a

Log of the first value

TYPE: float

log_b

Log of the second value

TYPE: float

Returns: The log of the sum of the exponentials

Source code in src/pydvl/utils/numeric.py
def logsumexp_two(log_a: float, log_b: float) -> float:
    r"""Numerically stable computation of log(exp(log_a) + exp(log_b)).

    Uses standard log sum exp trick:

    $$
    \log(\exp(\log a) + \exp(\log b)) = m + \log(\exp(\log a - m) + \exp(\log b - m)),
    $$

    where $m = \max(\log a, \log b)$.

    Args:
        log_a: Log of the first value
        log_b: Log of the second value
    Returns:
        The log of the sum of the exponentials
    """
    if log_a == -np.inf:
        return log_b
    if log_b == -np.inf:
        return log_a
    m = max(log_a, log_b)
    return float(m + np.log(np.exp(log_a - m) + np.exp(log_b - m)))
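
For instance (sketch), log 2 and log 3 combine to log 5:

import numpy as np
from pydvl.utils.numeric import logsumexp_two

print(np.isclose(logsumexp_two(np.log(2.0), np.log(3.0)), np.log(5.0)))  # True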

log_running_moments

log_running_moments(
    previous_log_sum_pos: float,
    previous_log_sum_neg: float,
    previous_log_sum2: float,
    count: int,
    new_log_value: float,
    new_sign: int,
    unbiased: bool = True,
) -> tuple[float, float, float, float, float]

Update running moments when the new value is provided in log space, allowing for negative values via an explicit sign.

Here the actual value is x = new_sign * exp(new_log_value). Rather than updating the arithmetic sum S = sum(x) and S2 = sum(x^2) directly, we maintain:

L_S+ = log(sum_{i: x_i >= 0} x_i)
L_S- = log(sum_{i: x_i < 0} |x_i|)
L_S2 = log(sum_i x_i^2)

The running mean is then computed as:

 mean = exp(L_S+) - exp(L_S-)

and the second moment is:

 second_moment = exp(L_S2 - log(count))

so that the variance is:

 variance = second_moment - mean^2

For the unbiased (sample) estimator, we scale the variance by count/(count-1) when count > 1 (and define variance = 0 when count == 1).

PARAMETER DESCRIPTION
previous_log_sum_pos

running log(sum of positive contributions), or -inf if none.

TYPE: float

previous_log_sum_neg

running log(sum of negative contributions in absolute value), or -inf if none.

TYPE: float

previous_log_sum2

running log(sum of squares) so far (or -inf if none).

TYPE: float

count

number of points processed so far.

TYPE: int

new_log_value

log(|x_new|), where x_new is the new value.

TYPE: float

new_sign

sign of the new value (should be +1, 0, or -1).

TYPE: int

unbiased

if True, compute the unbiased estimator of the variance.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
new_mean

running mean in the linear domain.

TYPE: float

new_variance

running variance in the linear domain.

TYPE: float

new_log_sum_pos

updated running log(sum of positive contributions).

TYPE: float

new_log_sum_neg

updated running log(sum of negative contributions).

TYPE: float

new_log_sum2

updated running log(sum of squares).

TYPE: float


Source code in src/pydvl/utils/numeric.py
def log_running_moments(
    previous_log_sum_pos: float,
    previous_log_sum_neg: float,
    previous_log_sum2: float,
    count: int,
    new_log_value: float,
    new_sign: int,
    unbiased: bool = True,
) -> tuple[float, float, float, float, float]:
    """
    Update running moments when the new value is provided in log space,
    allowing for negative values via an explicit sign.

    Here the actual value is x = new_sign * exp(new_log_value). Rather than
    updating the arithmetic sum S = sum(x) and S2 = sum(x^2) directly, we maintain:

       L_S+ = log(sum_{i: x_i >= 0} x_i)
       L_S- = log(sum_{i: x_i < 0} |x_i|)
       L_S2 = log(sum_i x_i^2)

    The running mean is then computed as:

         mean = exp(L_S+) - exp(L_S-)

    and the second moment is:

         second_moment = exp(L_S2 - log(count))

    so that the variance is:

         variance = second_moment - mean^2

    For the unbiased (sample) estimator, we scale the variance by count/(count-1)
    when count > 1 (and define variance = 0 when count == 1).

    Args:
        previous_log_sum_pos: running log(sum of positive contributions), or -inf if none.
        previous_log_sum_neg: running log(sum of negative contributions in absolute
            value), or -inf if none.
        previous_log_sum2: running log(sum of squares) so far (or -inf if none).
        count: number of points processed so far.
        new_log_value: log(|x_new|), where x_new is the new value.
        new_sign: sign of the new value (should be +1, 0, or -1).
        unbiased: if True, compute the unbiased estimator of the variance.

    Returns:
        new_mean: running mean in the linear domain.
        new_variance: running variance in the linear domain.
        new_log_sum_pos: updated running log(sum of positive contributions).
        new_log_sum_neg: updated running log(sum of negative contributions).
        new_log_sum2: updated running log(sum of squares).
    """

    if count == 0:
        if new_sign >= 0:
            new_log_sum_pos = new_log_value
            new_log_sum_neg = -np.inf  # No negative contribution yet.
        else:
            new_log_sum_pos = -np.inf
            new_log_sum_neg = new_log_value
        new_log_sum2 = 2 * new_log_value
    else:
        if new_sign >= 0:
            new_log_sum_pos = logsumexp_two(previous_log_sum_pos, new_log_value)
            new_log_sum_neg = previous_log_sum_neg
        else:
            new_log_sum_neg = logsumexp_two(previous_log_sum_neg, new_log_value)
            new_log_sum_pos = previous_log_sum_pos
        new_log_sum2 = logsumexp_two(previous_log_sum2, 2 * new_log_value)
    new_count = count + 1

    # Compute 1st and 2nd moments in the linear domain.
    pos_sum = np.exp(new_log_sum_pos) if new_log_sum_pos != -np.inf else 0.0
    neg_sum = np.exp(new_log_sum_neg) if new_log_sum_neg != -np.inf else 0.0
    new_mean = (pos_sum - neg_sum) / new_count

    second_moment = np.exp(new_log_sum2 - np.log(new_count))

    # Compute variance using either the population or unbiased estimator.
    if unbiased:
        if new_count > 1:
            new_variance = new_count / (new_count - 1) * (second_moment - new_mean**2)
        else:
            new_variance = 0.0
    else:
        new_variance = second_moment - new_mean**2

    return new_mean, new_variance, new_log_sum_pos, new_log_sum_neg, new_log_sum2
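
A minimal sketch of the update loop, checked against numpy with the population estimator; the starting state uses -inf for the log sums and count = 0:

import numpy as np
from pydvl.utils.numeric import log_running_moments

values = [0.5, -2.0, 3.0]
log_pos, log_neg, log_sq, n = -np.inf, -np.inf, -np.inf, 0
for x in values:
    mean, var, log_pos, log_neg, log_sq = log_running_moments(
        log_pos, log_neg, log_sq, n, np.log(abs(x)), int(np.sign(x)), unbiased=False
    )
    n += 1
print(np.isclose(mean, np.mean(values)), np.isclose(var, np.var(values)))  # True True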

repeat_indices

repeat_indices(
    indices: Collection[int],
    result: ValuationResult,
    done: StoppingCriterion,
    **kwargs: Any,
) -> Iterator[int]

Helper function to cycle indefinitely over a collection of indices until the stopping criterion is satisfied, displaying progress.

Deprecated since 0.10.0: used only by the old value module; it will be removed in 0.12.0.

PARAMETER DESCRIPTION
indices

Collection of indices that will be cycled until done.

TYPE: Collection[int]

result

Object containing the current results.

TYPE: ValuationResult

done

Stopping criterion.

TYPE: StoppingCriterion

kwargs

Keyword arguments passed to tqdm.

TYPE: Any DEFAULT: {}

Source code in src/pydvl/utils/progress.py
@deprecated(
    target=True,
    deprecated_in="0.10.0",
    remove_in="0.12.0",
    template_mgs="%(source_name)s used only by the old value module. "
    "It will be removed in %(remove_in)s.",
)
def repeat_indices(
    indices: Collection[int],
    result: ValuationResult,
    done: StoppingCriterion,
    **kwargs: Any,
) -> Iterator[int]:
    """Helper function to cycle indefinitely over a collection of indices
    until the stopping criterion is satisfied while displaying progress.

    Args:
        indices: Collection of indices that will be cycled until done.
        result: Object containing the current results.
        done: Stopping criterion.
        kwargs: Keyword arguments passed to tqdm.
    """
    with tqdm(total=100, unit="%", **kwargs) as pbar:
        it = takewhile(lambda _: not done(result), cycle(indices))
        for i in it:
            yield i
            pbar.update(100 * done.completion() - pbar.n)
            pbar.refresh()
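
Since the helper is deprecated, the cycle-until-done pattern it implements can be reproduced with itertools alone. A minimal sketch with a plain dictionary and a predicate standing in for ValuationResult and StoppingCriterion (both stand-ins are hypothetical):

```python
from itertools import cycle, takewhile

# Hypothetical stand-ins for ValuationResult and StoppingCriterion:
# stop once ten updates have been recorded.
state = {"updates": 0}
done = lambda s: s["updates"] >= 10

indices = [0, 1, 2]
for i in takewhile(lambda _: not done(state), cycle(indices)):
    state["updates"] += 1  # a real caller would update the valuation result here

print(state["updates"])  # 10
```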

log_duration

log_duration(_func=None, *, log_level=DEBUG)

Decorator to log execution time of a function with a configurable logging level. It can be used with or without specifying a log level.

Source code in src/pydvl/utils/progress.py
def log_duration(_func=None, *, log_level=logging.DEBUG):
    """
    Decorator to log execution time of a function with a configurable logging level.
    It can be used with or without specifying a log level.
    """

    def decorator_log_duration(func):
        @wraps(func)
        def wrapper_log_duration(*args, **kwargs):
            func_name = func.__qualname__
            logger.log(log_level, f"Function '{func_name}' is starting.")
            start_time = time()
            result = func(*args, **kwargs)
            duration = time() - start_time
            logger.log(
                log_level,
                f"Function '{func_name}' completed. Duration: {duration:.2f} sec",
            )
            return result

        return wrapper_log_duration

    if _func is None:
        # If log_duration was called without arguments, return decorator
        return decorator_log_duration
    else:
        # If log_duration was called with a function, apply decorator directly
        return decorator_log_duration(_func)
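
A short usage sketch, assuming log_duration is importable from pydvl.utils.progress as the source path above suggests. Both decoration forms are shown:

```python
import logging
from pydvl.utils.progress import log_duration  # assumed import path

logging.basicConfig(level=logging.INFO)

@log_duration  # bare form: messages go to DEBUG and stay hidden here
def train():
    ...

@log_duration(log_level=logging.INFO)  # parametrized form: messages are visible
def evaluate():
    ...

evaluate()  # logs that 'evaluate' is starting and its duration at INFO level
```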

compose_score

compose_score(
    scorer: Scorer,
    transformation: Callable[[float], float],
    range: Tuple[float, float],
    name: str,
) -> Scorer

Composes a scoring function with an arbitrary scalar transformation.

Useful to squash unbounded scores into ranges manageable by data valuation methods.

Example:

sigmoid = lambda x: 1/(1+np.exp(-x))
compose_score(Scorer("r2"), sigmoid, range=(0,1), name="squashed r2")
PARAMETER DESCRIPTION
scorer

The object to be composed.

TYPE: Scorer

transformation

A scalar transformation.

TYPE: Callable[[float], float]

range

The range of the transformation. This will be used e.g. by Utility for the range of the composed Scorer.

TYPE: Tuple[float, float]

name

A string representation for the composition, for str().

TYPE: str

RETURNS DESCRIPTION
Scorer

The composite Scorer.

Source code in src/pydvl/utils/score.py
def compose_score(
    scorer: Scorer,
    transformation: Callable[[float], float],
    range: Tuple[float, float],
    name: str,
) -> Scorer:
    """Composes a scoring function with an arbitrary scalar transformation.

    Useful to squash unbounded scores into ranges manageable by data valuation
    methods.

    Example:

    ```python
    sigmoid = lambda x: 1/(1+np.exp(-x))
    compose_score(Scorer("r2"), sigmoid, range=(0,1), name="squashed r2")
    ```

    Args:
        scorer: The object to be composed.
        transformation: A scalar transformation
        range: The range of the transformation. This will be used e.g. by
            [Utility][pydvl.utils.utility.Utility] for the range of the composed.
        name: A string representation for the composition, for `str()`.

    Returns:
        The composite [Scorer][pydvl.utils.score.Scorer].
    """

    class CompositeScorer(Scorer):
        def __call__(self, model: SupervisedModel, X: NDArray, y: NDArray) -> float:
            score = self._scorer(model=model, X=X, y=y)
            return transformation(score)

    return CompositeScorer(scorer, range=range, name=name)
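
Any monotone squashing function can be used in place of the sigmoid above. A hypothetical sketch mapping negative mean squared error, which lies in (-inf, 0], into (0, 1] with an exponential (imports assumed from pydvl.utils.score per the source path above):

```python
import numpy as np
from pydvl.utils.score import Scorer, compose_score  # assumed import path

# neg_mean_squared_error is in (-inf, 0]; exp maps it monotonically into (0, 1].
squashed_nmse = compose_score(
    Scorer("neg_mean_squared_error"),
    lambda s: float(np.exp(s)),
    range=(0, 1),
    name="squashed neg MSE",
)
```

The composed object is itself a Scorer, so it can be passed wherever a Scorer is expected.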

ensure_seed_sequence

ensure_seed_sequence(
    seed: Optional[Union[Seed, SeedSequence]] = None,
) -> SeedSequence

If the passed seed is a SeedSequence object, it is returned as is. If it is a Generator, the generator's internal seed sequence is extracted. Otherwise, a new SeedSequence object is created from the passed (optional) seed.

PARAMETER DESCRIPTION
seed

Either an int, a Generator object, a SeedSequence object or None.

TYPE: Optional[Union[Seed, SeedSequence]] DEFAULT: None

RETURNS DESCRIPTION
SeedSequence

A SeedSequence object.

New in version 0.7.0

Source code in src/pydvl/utils/types.py
def ensure_seed_sequence(
    seed: Optional[Union[Seed, SeedSequence]] = None,
) -> SeedSequence:
    """
    If the passed seed is a SeedSequence object then it is returned as is. If it is
    a Generator the internal protected seed sequence from the generator gets extracted.
    Otherwise, a new SeedSequence object is created from the passed (optional) seed.

    Args:
        seed: Either an int, a Generator object a SeedSequence object or None.

    Returns:
        A SeedSequence object.

    !!! tip "New in version 0.7.0"
    """
    if isinstance(seed, SeedSequence):
        return seed
    elif isinstance(seed, Generator):
        return cast(SeedSequence, seed.bit_generator.seed_seq)  # type: ignore
    else:
        return SeedSequence(seed)
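
A short usage sketch, assuming ensure_seed_sequence is importable from pydvl.utils.types as the source path above suggests:

```python
from numpy.random import SeedSequence, default_rng
from pydvl.utils.types import ensure_seed_sequence  # assumed import path

ss = ensure_seed_sequence(42)                    # int -> SeedSequence(42)
same = ensure_seed_sequence(SeedSequence(7))     # SeedSequence -> returned as is
from_gen = ensure_seed_sequence(default_rng(7))  # Generator -> its internal seed sequence

# Typical use: spawn independent child sequences, e.g. one per worker.
rngs = [default_rng(child) for child in ss.spawn(4)]
```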