Getting started
If you want to jump straight in, install pyDVL and then check out the examples. You will probably want to install with support for influence function computation.
We have introductions to the ideas behind Data valuation and Influence functions, as well as a short overview of common applications.
Installing pyDVL
To install the latest release from PyPI, use:
    pip install pyDVL
See Extras for optional dependencies, in particular if you are interested in influence functions. You can also install the latest development version from TestPyPI:
    pip install pyDVL --index-url https://test.pypi.org/simple/
To check the installation, you can use:
    python -c "import pydvl; print(pydvl.__version__)"
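As a sketch, the installed version can also be queried from Python through the package metadata, which works even when importing the package itself fails. The helper name `installed_version` is hypothetical, and the distribution name "pyDVL" is assumed to match the name used on PyPI.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str = "pyDVL") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


if __name__ == "__main__":
    version = installed_version()
    print(f"pyDVL {version}" if version else "pyDVL is not installed")
```

Returning None instead of raising makes the helper convenient in setup scripts that decide whether an installation step is still needed.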
Dependencies
pyDVL requires Python >= 3.8, along with numpy, scikit-learn, scipy and cvxpy for the core methods, and joblib for local parallelization. Additionally, the influence functions module requires PyTorch (see Extras below).
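The requirements listed above can be checked in the current environment with a short script. This is a minimal sketch: the function name `check_core_requirements` is hypothetical, and it only tests whether each module can be found, not whether its version is compatible.

```python
import sys
from importlib.util import find_spec

# Import names of the core dependencies listed above
# (scikit-learn is imported as "sklearn").
CORE_MODULES = ("numpy", "sklearn", "scipy", "cvxpy", "joblib")


def check_core_requirements(min_python=(3, 8)):
    """Map each requirement to True if it is satisfied in this environment."""
    report = {"python": sys.version_info >= min_python}
    for name in CORE_MODULES:
        # find_spec locates a top-level module without importing it.
        report[name] = find_spec(name) is not None
    return report


if __name__ == "__main__":
    for requirement, ok in check_core_requirements().items():
        print(f"{requirement}: {'ok' if ok else 'missing'}")
```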
Extras
pyDVL has a few extra dependencies that can be optionally installed:
Influence functions
To use the module on influence functions, pydvl.influence, run:
    pip install "pyDVL[influence]"
This adds a dependency on PyTorch (version 2.0 or later) and is therefore left out by default.
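To see whether an existing PyTorch installation already satisfies the version requirement, the installed version can be compared against 2.0. The following is a sketch under the assumption that the distribution is named "torch". Both function names are hypothetical, and only the leading major.minor part of the version string is compared.

```python
import re
from importlib import metadata


def meets_minimum(version: str, minimum=(2, 0)) -> bool:
    """Compare the leading 'major.minor' of a version string to a minimum."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= minimum


def torch_is_suitable() -> bool:
    """True if a PyTorch distribution with version >= 2.0 is installed."""
    try:
        return meets_minimum(metadata.version("torch"))
    except metadata.PackageNotFoundError:
        return False


if __name__ == "__main__":
    print("PyTorch >= 2.0 available:", torch_is_suitable())
```

Local version suffixes such as "+cu118" are ignored, since only the leading digits are parsed.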
CuPy
If you have a supported version of CUDA installed (v11.2 to 11.8 as of this writing), you can enable eigenvalue computations for low-rank approximations with CuPy on the GPU by using:
    pip install "pyDVL[cupy]"
This installs cupy-cuda11x.
If you use a different version of CUDA, please install CuPy manually.
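The version rule above can be expressed as a small helper that decides between the bundled extra and a manual CuPy installation. This is only an illustrative sketch: the function name is hypothetical, the CUDA version string has to be supplied by the caller, and the bounds reflect the range stated above, which may change.

```python
def cupy_extra_matches(cuda_version: str, low=(11, 2), high=(11, 8)) -> bool:
    """True if the CUDA version lies in the range covered by cupy-cuda11x."""
    parts = cuda_version.split(".")
    major = int(parts[0])
    # Treat a bare major version such as "12" as "12.0".
    minor = int(parts[1]) if len(parts) > 1 else 0
    return low <= (major, minor) <= high


if __name__ == "__main__":
    for candidate in ("11.2", "11.8", "12.1"):
        advice = "use the cupy extra" if cupy_extra_matches(candidate) else "install CuPy manually"
        print(f"CUDA {candidate}: {advice}")
```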
Ray
If you want to use Ray to distribute data valuation workloads across the nodes of a cluster, install pyDVL using:
    pip install "pyDVL[ray]"
Ray can also be used locally, but for that we recommend joblib instead. See the intro to parallelization for more details on how to use it.
Memcached
If you want to use Memcached for caching utility evaluations, use:
    pip install "pyDVL[memcached]"
This additionally installs pymemcache. Be aware that you still have to start a memcached server manually. See Setting up the Memcached cache.
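Since the server has to be started separately, it can help to confirm that something is listening before enabling the cache. The sketch below only tests TCP reachability and does not speak the memcached protocol. The function name is hypothetical, and the defaults assume a server on localhost at memcached's standard port 11211.

```python
import socket


def memcached_reachable(host: str = "localhost", port: int = 11211,
                        timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    print("memcached reachable:", memcached_reachable())
```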