
Advanced usage

Besides the dos and don'ts of data valuation itself, which are the subject of the examples and the documentation of each method, there are two main things to keep in mind when using pyDVL: parallelization and caching.

Parallelization

pyDVL uses parallelization to scale and speed up computations. It does so using Dask or Joblib (with any of its backends). The former is used in the influence package, whereas the latter is used in the valuation package.

Data valuation

For data valuation, pyDVL uses joblib for transparent parallelization of most methods using any of the backends available to joblib.

If you want to use ray or dask as backends, please follow the instructions in joblib's documentation. In most cases it is just a matter of registering the backend with joblib.register_parallel_backend and then wrapping the code that you want to parallelize, usually the call to the fit method of the valuation object, in the joblib.parallel_config context manager.

Basic fitting in parallel
from joblib import parallel_config, register_parallel_backend
from ray.util.joblib import register_ray
from sklearn.svm import SVC

from pydvl.valuation import *

register_ray()  # registers the "ray" backend with joblib

train, test = Dataset.from_arrays(...)
model = SVC()
scorer = SupervisedScorer("accuracy", test, default=0.0, range=(0, 1))
utility = ModelUtility(model, scorer)
sampler = PermutationSampler(truncation=NoTruncation())
stopping = MinUpdates(7000) | MaxTime(3600)  # at most 7000 updates or one hour
shapley = ShapleyValuation(utility, sampler, stopping, progress=True)

with parallel_config(backend="ray", n_jobs=128):
    shapley.fit(train)

results = shapley.result

Note that you will have to install additional dependencies (see Extras) and provide a running cluster, or run ray in local mode. For instance, for ray, follow the instructions in Ray's documentation to set up a remote cluster. If you use a local cluster instead, there is nothing to set up.
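For quick experimentation, ray can be started in local mode before registering the backend. The following is a minimal sketch: ray.init() without arguments starts a local instance, and the address in the comment is only a placeholder for an actual cluster.

Running ray in local mode
import ray

# Start ray in local mode on this machine. To attach to an existing remote
# cluster instead, pass its address, e.g.
# ray.init(address="ray://<head_node_host>:10001")  (placeholder host)
ray.init()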

Info

As of v0.10.0, pyDVL does not allow requesting resources per task sent to the cluster, so you will need to make sure that each worker has enough resources to handle the tasks it receives. A data valuation task using game-theoretic methods will typically send a copy of the whole model and dataset to each worker, even if re-training only happens on a subset of the data. Some backends, like "loky", use memory mapping to avoid copying the dataset to each worker, but in general you should make sure that each worker has enough memory to handle the whole dataset.
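If memory is the constraint, a simple mitigation is to cap the number of workers so that all copies of the model and dataset fit in RAM. A minimal sketch with joblib's default "loky" backend, reusing the shapley object from the example above:

Capping the number of workers
from joblib import parallel_config

# With the default "loky" backend, choose n_jobs so that n_jobs copies of
# the model and dataset fit into available memory.
with parallel_config(backend="loky", n_jobs=8):
    shapley.fit(train)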

Influence functions

Refer to Scaling influence computation for details on parallelizing influence function computations.

Caching

pyDVL can cache (memoize) the computation of the utility function, speeding up some computations for data valuation. Caching is disabled by default because single runs of a method rarely benefit much from it. When enabled, the cache takes into account the data indices passed as arguments and the utility function wrapped in the Utility object. This means that care must be taken when reusing the same utility function with different data; see the documentation of the caching package for more information.

In general, caching won't play a major role in the computation of Shapley values, because the probability of sampling the same subset twice, and hence needing the same utility computation, is very low. However, it can be very useful when comparing methods that use the same utility function, or when running multiple experiments with the same data.
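As an illustration, here is a minimal sketch of two methods sharing one cached utility, reusing the model, scorer and train objects from the example above. The cache_backend parameter of ModelUtility and the UniformSampler class are assumptions; check the API documentation for the exact signatures. Also note that the in-memory backend (described below) only shares values within a single process, so with process-based parallelization you would need one of the other backends.

Sharing a cached utility between methods
from pydvl.utils.caching import InMemoryCacheBackend
from pydvl.valuation import *

cache = InMemoryCacheBackend()
utility = ModelUtility(model, scorer, cache_backend=cache)

# Any subset whose utility was computed during the first fit is served
# from the cache during the second one.
permutation_shapley = ShapleyValuation(
    utility, PermutationSampler(truncation=NoTruncation()), MinUpdates(1000)
)
uniform_shapley = ShapleyValuation(utility, UniformSampler(), MinUpdates(1000))
permutation_shapley.fit(train)
uniform_shapley.fit(train)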

pyDVL supports three different caching backends:

  • InMemoryCacheBackend: an in-memory cache backend that uses a dictionary to store and retrieve cached values. This is used to share cached values between threads in a single process. This backend is provided for completeness, since parallelization is almost never done using threads.
  • DiskCacheBackend: a disk-based cache backend that uses pickled values written to and read from disk. This is used to share cached values between processes on a single machine. Warning: the disk cache is a stub implementation which pickles each utility evaluation and is extremely inefficient. If it proves useful, a more efficient version might be implemented in the future.
  • MemcachedCacheBackend: a Memcached-based cache backend that uses pickled values written to and read from a Memcached server. This is used to share cached values between processes across one or multiple machines.
Memcached extras

The Memcached backend requires optional dependencies. See Extras for more information.

Using the caches is as simple as passing the backend to the utility constructor. Please refer to the documentation and examples of each backend class for more details.
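For example, to cache utility evaluations on disk (a minimal sketch; that the default DiskCacheBackend constructor needs no arguments is an assumption, see the class documentation):

Passing a cache backend to the utility
from pydvl.utils.caching import DiskCacheBackend
from pydvl.valuation import ModelUtility

cache = DiskCacheBackend()  # assumption: the default configuration is enough here
utility = ModelUtility(model, scorer, cache_backend=cache)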

Using the cache

Continue reading about the cache in the documentation for the caching package.

Setting up the Memcached cache

Memcached is an in-memory key-value store accessible over the network.

You can either install it as a package or run it inside a docker container (the simplest option). For installation instructions, refer to the Getting started section of memcached's wiki. You can then run it with:

memcached -u user

To run memcached inside a container in daemon mode instead, use:

docker container run -d --rm -p 11211:11211 memcached:latest
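Once the server is running, point the Memcached backend at it and pass it to the utility as before. This is a hedged sketch: MemcachedClientConfig and its server field are assumptions based on the caching package; check its documentation for the exact names and import paths.

Connecting to the Memcached server
from pydvl.utils.caching import MemcachedCacheBackend, MemcachedClientConfig
from pydvl.valuation import ModelUtility

# Assumed configuration class; 11211 is the port mapped in the docker
# command above.
config = MemcachedClientConfig(server=("localhost", 11211))
cache = MemcachedCacheBackend(config)
utility = ModelUtility(model, scorer, cache_backend=cache)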