Contributing to pyDVL¶
The goal of pyDVL is to be a repository of successful algorithms for the valuation of data, in a broader sense. Contributions are welcome from anyone in the form of pull requests, bug reports and feature requests.
We will consider for inclusion any (tested) implementation of an algorithm appearing in a peer-reviewed journal (even if the method does not improve the state of the art, for benchmarking and comparison purposes). We are also open to improvements to the currently implemented methods and other ideas. Please open a ticket with yours.
If you are interested in setting up a similar project, consider the template pymetrius.
Local development¶
This project uses black to format code and pre-commit to invoke it as a git pre-commit hook. Consider installing any of black's IDE integrations to make your life easier.
Run the following to set up the pre-commit git hook to run before pushes:
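For example (a sketch using the standard pre-commit CLI; the pre-push hook type matches the behaviour described above):
pre-commit install --hook-type pre-push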
Additionally, we use Git LFS for some files like images. Install with
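For example, after installing the git-lfs package for your platform:
git lfs install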
Setting up your environment¶
We strongly suggest using some form of virtual environment for working with the library. E.g. with venv:
python -m venv ./venv
. venv/bin/activate  # `venv\Scripts\activate` on Windows
pip install -r requirements-dev.txt -r requirements-docs.txt
With conda:
conda create -n pydvl python=3.8
conda activate pydvl
pip install -r requirements-dev.txt -r requirements-docs.txt
A very convenient way of working with the library during development is to install it in editable mode into your environment by running:
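For example, a standard editable pip install from the repository root:
pip install -e .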
In order to build the documentation locally (which is done as part of the tox suite) you need to install additional non-python dependencies as described in the documentation of mkdocs-material.
In addition, pandoc is required. Except for OSX, it should be installed automatically as a dependency with requirements-docs.txt. Under OSX you can install pandoc (you'll need at least version 2.11) with:
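For instance, with Homebrew (an assumption; any other package manager providing pandoc 2.11+ works as well):
brew install pandoc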
Remember to mark all autogenerated directories as excluded in your IDE. In particular docs_build and .tox should be marked as excluded to avoid slowdowns when searching or refactoring code.
If you use remote execution, don't forget to exclude data paths from deployment (unless you really want to sync them).
Testing¶
Automated builds, tests, generation of documentation and publishing are handled by CI pipelines. Before pushing your changes to the remote we recommend executing tox locally in order to detect mistakes early on and to avoid failing pipelines. tox will:
* run the test suite
* build the documentation
* build and test installation of the package.
* generate coverage and pylint reports in html, as well as badges.
You can configure pytest, coverage and pylint by adjusting pyproject.toml.
Besides the usual unit tests, most algorithms are tested using pytest. This requires ray for the parallelization and Memcached for caching. Please install both before running the tests. We run tests in CI as well.
It is possible to pass optional command line arguments to pytest, for example to run only certain tests using patterns (-k) or markers (-m).
There are a few important arguments:
- --memcached-service allows changing the default of localhost:11211 (memcached's default) to a different address. Memcached is needed for testing caching as well as for speeding up certain methods (e.g. Permutation Shapley). To start memcached locally in the background with Docker, use a command like the one sketched after this list.
- -n sets the number of parallel workers for pytest-xdist. There are two layers of parallelization in the tests: an inner one within the tests themselves, i.e. the parallelism in the algorithms, and an outer one by pytest-xdist. The latter is controlled by the -n argument. If you experience segmentation faults with the tests, try running them with -n 0 to disable parallelization.
- --slow-tests enables running slow tests. See below for a description of slow tests.
- --with-cuda sets the device fixture in tests/influence/torch/conftest.py to cuda if it is available. Using this fixture within tests, you can run parts of your tests on a cuda device. Be aware that you still have to take care of the usage of the device manually in a specific test. Setting this flag does not result in running all tests on a GPU.
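A sketch for starting memcached with Docker (the container name is arbitrary and only illustrative; the port mapping matches memcached's default 11211):
docker run --name pydvl-cache -p 11211:11211 -d memcached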
Markers¶
We use a few different markers to differentiate between tests and run groups of them separately. Use pytest --markers to get a list and description of all available markers.
Two important markers are:
- pytest.mark.slow which is used to mark slow tests and skip them by default. A slow test is any test that takes 45 seconds or more to run and that can be skipped most of the time. In some cases a test is slow, but it is required in order to ensure that a feature works as expected and that there are no bugs. In those cases, we should not use this marker. Slow tests are always run on CI. Locally, they are skipped by default but can be additionally run using pytest --slow-tests.
- pytest.mark.torch which is used to mark tests that require PyTorch. To test modules that rely on PyTorch, use:
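For example, by passing the marker selection through to pytest via the tests tox environment used elsewhere in this guide (a sketch; a plain pytest -m torch also works):
tox -e tests -- -m torch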
Other Things¶
To test the notebooks separately, run (see below for details):
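For instance, assuming a dedicated tox environment for the notebooks exists (the environment name below is an assumption; check tox.ini for the actual name):
tox -e notebook-tests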
To create a package locally, run:
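For example, using the standard Python build frontend (a sketch; the project may instead provide its own tox environment or script for packaging):
pip install build   # if not already installed
python -m build     # creates sdist and wheel under dist/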
Notebooks¶
We use notebooks both as documentation (copied over to docs/examples) and as integration tests. All notebooks in the notebooks directory are executed during the test run. Because run times are typically too long for large datasets, you must check for the CI environment variable to work with smaller ones. For example, you can select a subset of the data:
import os

# In CI we only use a subset of the training set
if os.environ.get('CI'):
    training_data = training_data[:10]
This switching should happen in a separate notebook cell tagged with hide to hide the cell's input and output when rendering it as part of the documents. We want to avoid as much clutter and boilerplate as possible in the notebooks themselves.
Because we want documentation to include the full dataset, we commit notebooks with their outputs running with full datasets to the repo. The notebooks are then added by CI to the section Examples of the documentation.
Hiding cells in notebooks¶
Switching between CI and local runs, importing generic modules and plotting results are all examples of boilerplate code irrelevant to a reader interested in pyDVL's functionality. For this reason we choose to isolate this code into separate cells which are then hidden in the documentation.
In order to do this, cells are marked with tags understood by the mkdocs plugin mkdocs-jupyter, namely by adding the appropriate tag to the metadata of the relevant cells: one tag hides the cell's input and output, one hides only the input, and one hides only the output. It is important to leave a warning at the top of the document to avoid confusion. Examples for hidden imports and plots are available in the notebooks, e.g. in notebooks/shapley_basic_spotify.ipynb.
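As a sketch, the metadata of a cell tagged with hide looks like the following (standard Jupyter cell metadata; the tags for hiding only the input or only the output are analogous and can be copied from the existing notebooks):
"metadata": {
    "tags": ["hide"]
}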
Plots in Notebooks¶
If you add a plot to a notebook that should also render nicely in browser dark mode, add the tag invertible-output to the cell:
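A sketch of the corresponding cell metadata (standard Jupyter tag metadata, shown for illustration only):
"metadata": {
    "tags": ["invertible-output"]
}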
This applies a simple CSS filter to the output image of the cell.
Documentation¶
API documentation and examples from notebooks are built with mkdocs, using a number of plugins, including mkdocstrings, with versioning handled by mike.
Notebooks are an integral part of the documentation as well; please read the section on notebooks above.
If you want to build the documentation locally, please make sure you followed the instructions in the section Setting up your environment.
Use the following command to build the documentation the same way it is done in CI:
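For example (a sketch; the CI pipeline may invoke this through tox, but a direct build with the mkdocs CLI produces the same site, assuming the docs requirements are installed):
mkdocs build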
Locally, you can use this command instead to continuously rebuild the documentation on changes to the docs and src folders:
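That is, the live-reloading development server referenced further below:
mkdocs serve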
This will rebuild the documentation on changes to .md files inside docs, notebooks and python files.
On OSX, it is possible that the cairo library is not properly linked when installed via Homebrew. In this case you might encounter an error like the following when running mkdocs build or mkdocs serve:
OSError: no library called "cairo-2" was found
no library called "cairo" was found
no library called "libcairo-2" was found
This can be resolved by setting the environment variable DYLD_FALLBACK_LIBRARY_PATH:
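A sketch, assuming Homebrew's library directory (/opt/homebrew/lib on Apple Silicon, /usr/local/lib on Intel Macs):
export DYLD_FALLBACK_LIBRARY_PATH=$DYLD_FALLBACK_LIBRARY_PATH:/opt/homebrew/lib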
Adding new pages¶
Navigation is configured in mkdocs.yaml using the nav section. We use the plugin mkdocs-literate-nav which allows fine-grained control of the navigation structure. However, most pages are explicitly listed and manually arranged in the nav section of the configuration.
Creating stable references for autorefs¶
mkdocstrings includes the plugin autorefs to enable automatic linking across pages with e.g. [a link][to-something]. Anchors are autogenerated from section titles, and are not guaranteed to be unique. In order to ensure that a link will remain valid, add a custom anchor to the section title:
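For example (a sketch using the attr_list syntax; the anchor name matches the reference shown below):
## Some section { #permanent-anchor-to-some-section }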
(note the space after the opening brace). You can then refer to it within another markdown file with [Some section][permanent-anchor-to-some-section].
Adding notes about new features, changes or deprecations¶
We use the admonition extension of Mkdocs Material to create admonitions, also known as call-outs, that hold information about when a certain feature was added, changed or deprecated and optionally a description with more details. We put the admonition directly in a module's, a function's or class' docstring.
We use the following syntax:
The description is useful when the note is about a smaller change such as a parameter.
- For a new feature, we use:
- For a change to an existing feature we use:
  For example, for a change in version 1.2.3 that adds kwargs to a class' constructor we would write:
- For a deprecation we use:
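As an illustration of the general pattern only (a sketch using Material for MkDocs admonition syntax; the exact admonition types and titles used in pyDVL are defined by the existing docstrings and may differ):
!!! info "Changed in version 1.2.3"
    Added kwargs to the constructor.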
Using bibliography¶
Bibliographic citations are managed with the plugin mkdocs-bibtex. To enter a citation, first add the entry to docs/pydvl.bib. For team contributors this should be an export of the Zotero folder software/pydvl in the TransferLab Zotero library. All other contributors just add the bibtex data, and a maintainer will add it to the group library upon merging.
To add a citation inside a markdown file, use the notation [@citekey]. Alas, because of when mkdocs-bibtex enters the pipeline, it won't process docstrings. For module documentation, we manually inject html into the markdown files. For example, in pydvl.value.shapley.montecarlo we have:
"""
Module docstring...
## References
[^1]: <a name="ghorbani_data_2019"></a>Ghorbani, A., Zou, J., 2019.
[Data Shapley: Equitable Valuation of Data for Machine
Learning](https://proceedings.mlr.press/v97/ghorbani19c.html).
In: Proceedings of the 36th International Conference on Machine Learning,
PMLR, pp. 2242–2251.
"""
and then later in the file, inside a function's docstring:
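A sketch of such a reference (the function name, signature and wording are illustrative only; the footnote label [^1] is the one defined in the module docstring above):
def permutation_montecarlo_shapley(u, done):
    """Computes Shapley values with Monte Carlo permutation sampling [^1]."""
    ...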
Writing mathematics¶
Use LaTeX delimiters $ and $$ for inline and displayed mathematics respectively.
Warning: backslashes must be escaped in docstrings! (although there are exceptions). For simplicity, declare the string as "raw" with the prefix r:
# This will work
def f(x: float) -> float:
    r""" Computes
    $${ f(x) = \frac{1}{x^2} }$$
    """
    return 1/(x*x)

# This throws an obscure error
def f(x: float) -> float:
    """ Computes
    $$\frac{1}{x^2}$$
    """
    return 1/(x*x)
Note how there is no space after the dollar signs. This is important! You can use braces for legibility like in the first example.
Abbreviations¶
We keep the abbreviations used in the documentation inside the docs_include/abbreviations.md file.
The syntax for abbreviations is:
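The standard Markdown abbreviations extension syntax is used, one definition per line, e.g. (the entry below is purely illustrative):
*[SV]: Shapley Value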
CI¶
We use workflows to:
- Run the tests.
- Publish documentation.
- Publish packages to testpypi / pypi.
- Mark issues as stale after 30 days. We do this only for issues with the label awaiting-reply, which indicates that we have answered a question / feature request / PR and are waiting for the OP to reply / update their work.
Tests¶
We test all algorithms with simple datasets in CI jobs. This can take a considerable amount of time, so care must be taken not to overdo it:
1. All algorithm tests must be on very simple datasets and as quick as possible.
2. We try not to trigger CI pipelines when unnecessary (see Skipping CI runs).
3. We split the tests based on their duration into groups and run them in parallel.
For that we use pytest-split to first store the duration of all tests with tox -e tests -- --store-durations --slow-tests in a .test_durations file. Alternatively, we can use pytest directly: pytest --store-durations --slow-tests.
Note: This does not have to be done each time a new test or test case is added. For new tests and test cases, pytest-split assumes the average test execution time (calculated based on the stored information) for every test which does not have duration information stored. Thus, there's no need to store durations after every change to the test suite. However, when there are major changes in the suite compared to what's stored in .test_durations, it's recommended to update the duration information with --store-durations to ensure that the splitting is in balance.
Then we can have as many splits as we want:
tox -e tests -- --splits 3 --group 1
tox -e tests -- --splits 3 --group 2
tox -e tests -- --splits 3 --group 3
Alternatively, we can use pytest directly: pytest --splits 3 --group 1.
Each one of these commands should be run in a separate shell/job to run the test groups in parallel and decrease the total runtime.
Running GitHub Actions locally¶
To run GitHub Actions locally we use act. It uses the workflows defined in .github/workflows and determines the set of actions that need to be run. It uses the Docker API to either pull or build the necessary images, as defined in our workflow files, and finally determines the execution path based on the dependencies that were defined.
You can install it manually using:
curl -s https://raw.githubusercontent.com/nektos/act/master/install.sh | sudo bash -s -- -d -b ~/bin
And then simply add it to your PATH variable: PATH=~/bin:$PATH
Refer to its official readme for more installation options.
act cheatsheet¶
By default, act will run all workflows in .github/workflows. You can use the -W flag to specify a specific workflow file to run, or you can rely on the job id to be unique (but then you'll see warnings for the workflows without that job id).
# Run only the main tests for python 3.8 after a push event (implicit)
act -W .github/workflows/run-tests-workflow.yaml \
-j run-tests \
--input tests_to_run=base \
--input python_version=3.8
Other common flags are:
# List all actions for all events:
act -l
# List the actions for a specific event:
act workflow_dispatch -l
# List the actions for a specific job:
act -j lint -l
# Run the default (`push`) event:
act
# Run a specific event:
act pull_request
# Run a specific job:
act -j lint
# Collect artifacts to the /tmp/artifacts folder:
act --artifact-server-path /tmp/artifacts
# Run a job in a specific workflow (useful if you have duplicate job names)
act -j lint -W .github/workflows/tox.yml
# Run in dry-run mode:
act -n
# Enable verbose-logging (can be used with any of the above commands)
act -v
Example¶
To run the publish job (the most difficult one to test) you would simply use:
- When triggered by a release:
  With events.json containing:
  This will use your current branch. If you want to test a specific branch you have to use the workflow_dispatch event (see below).
- To instead run it as if it had been manually triggered (i.e. workflow_dispatch) you would instead use:
  With events.json containing:
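As a rough sketch of the release-triggered variant only (the contents of events.json are not reproduced here; -e is act's standard flag for supplying an event payload):
act release -j publish -e events.json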
Skipping CI runs¶
Sometimes one would like to skip CI for certain commits (e.g. when updating the readme). In order to do this, simply prefix the commit message with [skip ci]. The string can be anywhere, but adding it to the beginning of the commit message makes it more evident when looking at commits in a PR.
Refer to the official GitHub documentation for more information.
Release processes¶
Automatic release process¶
In order to create an automatic release, a few prerequisites need to be satisfied:
- The project's virtualenv needs to be active
- The repository needs to be on the develop branch
- The repository must be clean (including no untracked files)
Then, a new release can be created using the script build_scripts/release-version.sh (leave out the version parameter to have bumpversion automatically derive the next release version by bumping the patch part):
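For example (the version number is purely illustrative):
./build_scripts/release-version.sh 0.2.0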
To find out how to use the script, pass the -h or --help flags:
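For instance:
./build_scripts/release-version.sh --help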
If running in interactive mode (without -y|--yes), the script will output a summary of pending changes and ask for confirmation before executing the actions.
Once this is done, a tag will be created on the repository. You should then create a GitHub release for that tag. That will trigger a CI pipeline that will automatically create a package and publish it from CI to PyPI.
Manual release process¶
If the automatic release process doesn't cover your use case, you can also create a new release manually by following these steps:
- (Repeat as needed) implement features on feature branches merged into develop. Each merge into develop will publish a new pre-release version to TestPyPI. These versions can be installed using pip install --pre --index-url https://test.pypi.org/simple/ pyDVL.
- When ready to release: from the develop branch create the release branch and perform release activities (update changelog, news, ...). For your own convenience, define an env variable for the release version (see the sketch after this list).
- Run bumpversion --commit release if the release is only a patch release, otherwise the full version can be specified using bumpversion --commit --new-version X.Y.Z release (the release part is ignored but required by bumpversion).
- Merge the release branch into master, tag the merge commit, and push back to the repo. The CI pipeline publishes the package based on the tagged commit.
- Switch back to the release branch release/vX.Y.Z and pre-bump the version: bumpversion --commit patch. This ensures that develop pre-releases are always strictly more recent than the last published release version from master.
- Merge the release branch into develop.
- Delete the release branch if necessary: git branch -d release/${RELEASE_VERSION}
- Create a GitHub release for the created tag.
- Pour yourself a cup of coffee, you earned it!
- A package will be automatically created and published from CI to PyPI.
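A minimal sketch of the branch-related shell commands from the list above, assuming the release branch naming used in this guide (whether RELEASE_VERSION includes the v prefix should be checked against the existing scripts; the version is illustrative):
export RELEASE_VERSION="v0.2.0"                      # illustrative version
git checkout -b release/${RELEASE_VERSION} develop   # create the release branch
# ... perform release activities, bumpversion, merge into master ...
git checkout develop
git merge release/${RELEASE_VERSION}                 # merge the release branch into develop
git branch -d release/${RELEASE_VERSION}             # delete the release branch if necessary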
CI and requirements for publishing¶
In order to publish new versions of the package from the development branch, the CI pipeline requires the following secret variables to be set up:
The first 2 are used after tests run on the develop branch's CI workflow to automatically publish packages to TestPyPI.
The last 2 are used in the publish.yaml CI workflow to publish packages to PyPI from develop after a GitHub release.
Publish to TestPyPI¶
We use bump2version to bump the build part of the version number without committing or tagging the change and then publish a package to TestPyPI from CI using Twine. The version has the GitHub run number appended.
For more details refer to the files .github/workflows/publish.yaml and .github/workflows/tox.yaml.