pydvl.utils.caching¶
This module provides caching of functions.
PyDVL can cache (memoize) the computation of the utility function and thus speed up some data valuation computations.
Warning
Function evaluations are cached with a key based on the function's signature and code. This can lead to undesired cache hits, see Cache reuse.
Remember not to reuse utility objects for different datasets.
Configuration¶
Caching is disabled by default but can be enabled easily, see Setting up the cache. When enabled, it will be added to any callable used to construct a Utility (done with the wrap method of CacheBackend). Depending on the nature of the utility you might want to enable the computation of a running average of function values, see Usage with stochastic functions. You can see all configuration options under CachedFuncConfig.
Supported Backends¶
pyDVL supports three different caching backends:
- InMemoryCacheBackend: an in-memory cache backend that uses a dictionary to store and retrieve cached values. This is used to share cached values between threads in a single process.
- DiskCacheBackend: a disk-based cache backend that uses pickled values written to and read from disk. This is used to share cached values between processes on a single machine.
- MemcachedCacheBackend: a Memcached-based cache backend that uses pickled values written to and read from a Memcached server. This is used to share cached values between processes across multiple machines.
Info
This specific backend requires optional dependencies not installed by default. See Extra dependencies for more information.
Usage with stochastic functions¶
In addition to standard memoization, the wrapped functions can compute running average and standard error of repeated evaluations for the same input. This can be useful for stochastic functions with high variance (e.g. model training for small sample sizes), but drastically reduces the speed benefits of memoization.
This behaviour can be activated with the option allow_repeated_evaluations.
Cache reuse¶
When working directly with CachedFunc, it is essential to only cache pure functions. If they have any kind of state, either internal or external (e.g. a closure over some data that may change), then the cache will fail to notice this and the same value will be returned.
When a function is wrapped with CachedFunc for memoization, its signature (input and output names) and code are used as a key for the cache.
If you are running experiments with the same Utility
but different datasets, this will lead to evaluations of the utility on new data
returning old values, because utilities only use sample indices as arguments (so
there is no way to tell the difference between index 1 of dataset A and index 1 of
dataset B from the point of view of the cache). One solution is to empty the
cache between runs by calling the clear
method of the cache backend instance,
but the preferred one is to use a different Utility object for each dataset.
Unexpected cache misses¶
Because all arguments to a function are used as part of the key for the cache,
sometimes one must exclude some of them. For example, if a function is going to
run across multiple processes and some reporting arguments are added (like a
job_id for logging purposes), these will become part of the signature and make the
functions distinct in the eyes of the cache. This can be avoided with the
ignore_args option in the configuration.
CacheStats dataclass¶
CacheStats(
sets: int = 0,
misses: int = 0,
hits: int = 0,
timeouts: int = 0,
errors: int = 0,
reconnects: int = 0,
)
Class used to store statistics gathered by cached functions.
ATTRIBUTE | DESCRIPTION
---|---
sets | Number of times a value was set in the cache. TYPE: int
misses | Number of times a value was not found in the cache. TYPE: int
hits | Number of times a value was found in the cache. TYPE: int
timeouts | Number of times a timeout occurred. TYPE: int
errors | Number of times an error occurred. TYPE: int
reconnects | Number of times the client reconnected to the server. TYPE: int
CacheBackend¶
Bases: ABC
Abstract base class for cache backends.
Defines interface for cache access including wrapping callables, getting/setting results, clearing cache, and combining cache keys.
ATTRIBUTE | DESCRIPTION
---|---
stats | Cache statistics tracker. TYPE: CacheStats
Source code in src/pydvl/utils/caching/base.py
wrap¶
wrap(
func: Callable, *, config: Optional[CachedFuncConfig] = None
) -> CachedFunc
Wraps a function to cache its results.
PARAMETER | DESCRIPTION
---|---
func | The function to wrap. TYPE: Callable
config | Optional caching options for the wrapped function. TYPE: Optional[CachedFuncConfig]

RETURNS | DESCRIPTION
---|---
CachedFunc | The wrapped cached function.
Source code in src/pydvl/utils/caching/base.py
get abstractmethod¶
get(key: str) -> Optional[CacheResult]
Abstract method to retrieve a cached result.
Implemented by subclasses.
PARAMETER | DESCRIPTION
---|---
key | The cache key. TYPE: str

RETURNS | DESCRIPTION
---|---
Optional[CacheResult] | The cached result or None if not found.
Source code in src/pydvl/utils/caching/base.py
set abstractmethod¶
set(key: str, value: CacheResult) -> None
Abstract method to set a cached result.
Implemented by subclasses.
PARAMETER | DESCRIPTION
---|---
key | The cache key. TYPE: str
value | The result to cache. TYPE: CacheResult
clear abstractmethod¶
CachedFunc¶
CachedFunc(
func: Callable[..., float],
*,
cache_backend: CacheBackend,
config: Optional[CachedFuncConfig] = None,
)
Caches callable function results with a provided cache backend.
Wraps a callable function to cache its results using an instance of a subclass of CacheBackend.
This class is heavily inspired by joblib.memory.MemorizedFunc.
This class caches calls to the wrapped callable by generating a hash key based on the wrapped callable's code, the arguments passed to it and the optional hash_prefix.
Warning
This class only works with hashable arguments to the wrapped callable.
PARAMETER | DESCRIPTION
---|---
func | Callable to wrap. TYPE: Callable[..., float]
cache_backend | Instance of CacheBackend that handles setting and getting values. TYPE: CacheBackend
config | Configuration for the wrapped function. TYPE: Optional[CachedFuncConfig]
Source code in src/pydvl/utils/caching/base.py
CachedFuncConfig dataclass¶
CachedFuncConfig(
hash_prefix: Optional[str] = None,
ignore_args: Collection[str] = list(),
time_threshold: float = 0.3,
allow_repeated_evaluations: bool = False,
rtol_stderr: float = 0.1,
min_repetitions: int = 3,
)
Configuration for cached functions and methods, providing memoization of function calls.
Instances of this class are typically used as arguments for the construction of a Utility.
PARAMETER | DESCRIPTION
---|---
hash_prefix | Optional string prefix that is prepended to the cache key. This can be provided in order to guarantee cache reuse across runs. TYPE: Optional[str]
ignore_args | Do not take these keyword arguments into account when hashing the wrapped function for usage as key. This allows sharing the cache among different jobs for the same experiment run if the callable happens to have "nuisance" parameters like job_id. TYPE: Collection[str]
time_threshold | Computations taking less time than this many seconds are not cached. A value of 0 means that results are always cached. TYPE: float
allow_repeated_evaluations | If True, repeated evaluations of the function with the same arguments are allowed, and a running average of the returned values is computed (see Usage with stochastic functions). TYPE: bool
rtol_stderr | Relative tolerance for repeated evaluations. More precisely, the wrapped function will stop being re-evaluated once the standard deviation of the mean is smaller than rtol_stderr times the running mean. TYPE: float
min_repetitions | Minimum number of times that a function evaluation on the same arguments is repeated before returning cached values. Useful for stochastic functions only. If the model training is very noisy, set this number to higher values to reduce variance. TYPE: int
DiskCacheBackend¶
Bases: CacheBackend
Disk cache backend that stores results in files.
Implements the CacheBackend interface for a disk-based cache. Stores cache entries as pickled files on disk, keyed by cache key. This allows sharing evaluations across processes in a single node/computer.
PARAMETER | DESCRIPTION
---|---
cache_dir | Base directory for cache storage.

ATTRIBUTE | DESCRIPTION
---|---
cache_dir | Base directory for cache storage.
Example
Basic usage:
>>> from pydvl.utils.caching.disk import DiskCacheBackend
>>> cache_backend = DiskCacheBackend()
>>> cache_backend.clear()
>>> value = 42
>>> cache_backend.set("key", value)
>>> cache_backend.get("key")
42
Callable wrapping:
>>> from pydvl.utils.caching.disk import DiskCacheBackend
>>> cache_backend = DiskCacheBackend()
>>> cache_backend.clear()
>>> value = 42
>>> def foo(x: int):
... return x + 1
...
>>> wrapped_foo = cache_backend.wrap(foo)
>>> wrapped_foo(value)
43
>>> wrapped_foo.stats.misses
1
>>> wrapped_foo.stats.hits
0
>>> wrapped_foo(value)
43
>>> wrapped_foo.stats.misses
1
>>> wrapped_foo.stats.hits
1
PARAMETER | DESCRIPTION
---|---
cache_dir | Base directory for cache storage. If not provided, this defaults to a newly created temporary directory.
Source code in src/pydvl/utils/caching/disk.py
wrap¶
wrap(
func: Callable, *, config: Optional[CachedFuncConfig] = None
) -> CachedFunc
Wraps a function to cache its results.
PARAMETER | DESCRIPTION
---|---
func | The function to wrap. TYPE: Callable
config | Optional caching options for the wrapped function. TYPE: Optional[CachedFuncConfig]

RETURNS | DESCRIPTION
---|---
CachedFunc | The wrapped cached function.
Source code in src/pydvl/utils/caching/base.py
get¶
Get a value from the cache.
PARAMETER | DESCRIPTION
---|---
key | Cache key. TYPE: str

RETURNS | DESCRIPTION
---|---
Optional[Any] | Cached value or None if not found.
Source code in src/pydvl/utils/caching/disk.py
set¶
clear¶
InMemoryCacheBackend¶
Bases: CacheBackend
In-memory cache backend that stores results in a dictionary.
Implements the CacheBackend interface for an in-memory-based cache. Stores cache entries as values in a dictionary, keyed by cache key. This allows sharing evaluations across threads in a single process.
The implementation is not thread-safe.
ATTRIBUTE | DESCRIPTION
---|---
cached_values | Dictionary used to store cached values.
Example
Basic usage:
>>> from pydvl.utils.caching.memory import InMemoryCacheBackend
>>> cache_backend = InMemoryCacheBackend()
>>> cache_backend.clear()
>>> value = 42
>>> cache_backend.set("key", value)
>>> cache_backend.get("key")
42
Callable wrapping:
>>> from pydvl.utils.caching.memory import InMemoryCacheBackend
>>> cache_backend = InMemoryCacheBackend()
>>> cache_backend.clear()
>>> value = 42
>>> def foo(x: int):
... return x + 1
...
>>> wrapped_foo = cache_backend.wrap(foo)
>>> wrapped_foo(value)
43
>>> wrapped_foo.stats.misses
1
>>> wrapped_foo.stats.hits
0
>>> wrapped_foo(value)
43
>>> wrapped_foo.stats.misses
1
>>> wrapped_foo.stats.hits
1
Source code in src/pydvl/utils/caching/memory.py
wrap¶
wrap(
func: Callable, *, config: Optional[CachedFuncConfig] = None
) -> CachedFunc
Wraps a function to cache its results.
PARAMETER | DESCRIPTION
---|---
func | The function to wrap. TYPE: Callable
config | Optional caching options for the wrapped function. TYPE: Optional[CachedFuncConfig]

RETURNS | DESCRIPTION
---|---
CachedFunc | The wrapped cached function.
Source code in src/pydvl/utils/caching/base.py
get¶
set¶
clear¶
MemcachedClientConfig dataclass¶
MemcachedClientConfig(
server: Tuple[str, int] = ("localhost", 11211),
connect_timeout: float = 1.0,
timeout: float = 1.0,
no_delay: bool = True,
serde: PickleSerde = PickleSerde(pickle_version=PICKLE_VERSION),
)
Configuration of the memcached client.
PARAMETER | DESCRIPTION
---|---
server | A tuple of (IP or domain name, port). TYPE: Tuple[str, int]
connect_timeout | How many seconds to wait before raising an exception on failure to connect. TYPE: float
timeout | Duration in seconds to wait for send or recv calls on the socket connected to memcached. TYPE: float
no_delay | If True, set the TCP_NODELAY flag on the socket. TYPE: bool
serde | Serializer / deserializer ("serde") used for the values. The default is a PickleSerde using PICKLE_VERSION. TYPE: PickleSerde
MemcachedCacheBackend¶
MemcachedCacheBackend(config: MemcachedClientConfig = MemcachedClientConfig())
Bases: CacheBackend
Memcached cache backend for the distributed caching of functions.
Implements the CacheBackend interface for a memcached based cache. This allows sharing evaluations across processes and nodes in a cluster. You can run memcached as a service, locally or remotely, see the caching documentation.
PARAMETER | DESCRIPTION
---|---
config | Memcached client configuration. TYPE: MemcachedClientConfig

ATTRIBUTE | DESCRIPTION
---|---
config | Memcached client configuration.
client | Memcached client instance.
Example
Basic usage:
>>> from pydvl.utils.caching.memcached import MemcachedCacheBackend
>>> cache_backend = MemcachedCacheBackend()
>>> cache_backend.clear()
>>> value = 42
>>> cache_backend.set("key", value)
>>> cache_backend.get("key")
42
Callable wrapping:
>>> from pydvl.utils.caching.memcached import MemcachedCacheBackend
>>> cache_backend = MemcachedCacheBackend()
>>> cache_backend.clear()
>>> value = 42
>>> def foo(x: int):
... return x + 1
...
>>> wrapped_foo = cache_backend.wrap(foo)
>>> wrapped_foo(value)
43
>>> wrapped_foo.stats.misses
1
>>> wrapped_foo.stats.hits
0
>>> wrapped_foo(value)
43
>>> wrapped_foo.stats.misses
1
>>> wrapped_foo.stats.hits
1
PARAMETER | DESCRIPTION
---|---
config | Memcached client configuration. TYPE: MemcachedClientConfig
Source code in src/pydvl/utils/caching/memcached.py
wrap¶
wrap(
func: Callable, *, config: Optional[CachedFuncConfig] = None
) -> CachedFunc
Wraps a function to cache its results.
PARAMETER | DESCRIPTION
---|---
func | The function to wrap. TYPE: Callable
config | Optional caching options for the wrapped function. TYPE: Optional[CachedFuncConfig]

RETURNS | DESCRIPTION
---|---
CachedFunc | The wrapped cached function.
Source code in src/pydvl/utils/caching/base.py
get¶
Get value from memcached.
PARAMETER | DESCRIPTION
---|---
key | Cache key. TYPE: str

RETURNS | DESCRIPTION
---|---
Optional[Any] | Cached value or None if not found or client disconnected.
Source code in src/pydvl/utils/caching/memcached.py
set¶
clear¶
combine_hashes¶
__getstate__¶
__getstate__() -> Dict
Enables pickling after a socket has been opened to the memcached server, by removing the client from the stored data.