Contributing to pyDVL¶
The goal of pyDVL is to be a repository of successful algorithms for the valuation of data, in a broader sense. Contributions are welcome from anyone in the form of pull requests, bug reports and feature requests.
We will consider for inclusion any (tested) implementation of an algorithm appearing in a peer-reviewed journal (even if the method does not improve the state of the art, for benchmarking and comparison purposes). We are also open to improvements to the currently implemented methods and other ideas. Please open a ticket with yours.
If you are interested in setting up a similar project, consider the template pymetrius.
Local development¶
This project uses black to format code and pre-commit to invoke it as a git pre-commit hook. Consider installing any of black's IDE integrations to make your life easier.
Run the following to set up the pre-commit git hook to run before pushes:
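For example (a sketch using the standard pre-commit CLI; the pre-push hook type matches the behaviour described above):
pre-commit install --hook-type pre-push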
Additionally, we use Git LFS for some files like images. Install with
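For example, after installing the git-lfs package for your platform:
git lfs install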
Setting up your environment¶
We strongly suggest using some form of virtual environment for working with the library. E.g. with venv:
python -m venv ./venv
. venv/bin/activate  # `venv\Scripts\activate` on Windows
pip install -r requirements-dev.txt -r requirements-docs.txt
With conda:
conda create -n pydvl python=3.8
conda activate pydvl
pip install -r requirements-dev.txt -r requirements-docs.txt
A very convenient way of working with the library during development is to install it in editable mode into your environment by running:
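For example, a standard editable pip install from the repository root:
pip install -e .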
In order to build the documentation locally (which is done as part of the tox suite) you need to install additional non-python dependencies as described in the documentation of mkdocs-material.
In addition, pandoc is required. Except for OSX, it should be installed automatically as a dependency with requirements-docs.txt. Under OSX you can install pandoc (you'll need at least version 2.11) with:
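For instance, with Homebrew (an assumption; any other package manager providing pandoc 2.11+ works as well):
brew install pandoc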
Remember to mark all autogenerated directories as excluded in your IDE. In particular docs_build and .tox should be marked as excluded to avoid slowdowns when searching or refactoring code.
If you use remote execution, don't forget to exclude data paths from deployment (unless you really want to sync them).
Testing¶
Automated builds, tests, generation of documentation and publishing are handled by CI pipelines. Before pushing your changes to the remote we recommend executing tox locally in order to detect mistakes early on and to avoid failing pipelines. tox will:
* run the test suite
* build the documentation
* build and test installation of the package.
* generate coverage and pylint reports in html, as well as badges.
You can configure pytest, coverage and pylint by adjusting pyproject.toml.
Besides the usual unit tests, most algorithms are tested using pytest. This requires ray for the parallelization and Memcached for caching. Please install both before running the tests. We run tests in CI as well.
It is possible to pass optional command line arguments to pytest, for example to run only certain tests using patterns (-k) or markers (-m).
There are a few important arguments:
- --memcached-service allows changing the default of localhost:11211 (memcached's default) to a different address. Memcached is needed for testing caching as well as for speeding up certain methods (e.g. Permutation Shapley). To start memcached locally in the background with Docker, use a command like the one sketched after this list.
- -n sets the number of parallel workers for pytest-xdist. There are two layers of parallelization in the tests: an inner one within the tests themselves, i.e. the parallelism in the algorithms, and an outer one by pytest-xdist. The latter is controlled by the -n argument. If you experience segmentation faults with the tests, try running them with -n 0 to disable parallelization.
- --slow-tests enables running slow tests. See below for a description of slow tests.
- --with-cuda sets the device fixture in tests/influence/torch/conftest.py to cuda if it is available. Using this fixture within tests, you can run parts of your tests on a cuda device. Be aware that you still have to take care of the usage of the device manually in a specific test. Setting this flag does not result in running all tests on a GPU.
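A sketch for starting memcached with Docker (the container name is arbitrary and only illustrative; the port mapping matches memcached's default 11211):
docker run --name pydvl-cache -p 11211:11211 -d memcached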
Markers¶
We use a few different markers to differentiate between tests and run groups of them separately. Use pytest --markers to get a list and description of all available markers.
Two important markers are:
- pytest.mark.slow which is used to mark slow tests and skip them by default. A slow test is any test that takes 45 seconds or more to run and that can be skipped most of the time. In some cases a test is slow, but it is required in order to ensure that a feature works as expected and that there are no bugs. In those cases, we should not use this marker. Slow tests are always run on CI. Locally, they are skipped by default but can be additionally run using pytest --slow-tests.
- pytest.mark.torch which is used to mark tests that require PyTorch. To test modules that rely on PyTorch, use:
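For example, by passing the marker selection through to pytest via the tests tox environment used elsewhere in this guide (a sketch; a plain pytest -m torch also works):
tox -e tests -- -m torch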
Other Things¶
To test the notebooks separately, run (see below for details):
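For instance, assuming a dedicated tox environment for the notebooks exists (the environment name below is an assumption; check tox.ini for the actual name):
tox -e notebook-tests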
To create a package locally, run:
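For example, using the standard Python build frontend (a sketch; the project may instead provide its own tox environment or script for packaging):
pip install build   # if not already installed
python -m build     # creates sdist and wheel under dist/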
Notebooks¶
We use notebooks both as documentation (copied over to docs/examples) and as integration tests. All notebooks in the notebooks directory are executed during the test run. Because run times are typically too long for large datasets, you must check for the CI environment variable to work with smaller ones. For example, you can select a subset of the data:
import os

# In CI we only use a subset of the training set
if os.environ.get('CI'):
    training_data = training_data[:10]
This switching should happen in a separate notebook cell tagged with hide to hide the cell's input and output when rendering it as part of the documents. We want to avoid as much clutter and boilerplate as possible in the notebooks themselves.
Because we want documentation to include the full dataset, we commit notebooks with their outputs running with full datasets to the repo. The notebooks are then added by CI to the section Examples of the documentation.
Hiding cells in notebooks¶
Switching between CI and local runs, importing generic modules and plotting results are all examples of boilerplate code irrelevant to a reader interested in pyDVL's functionality. For this reason we choose to isolate this code into separate cells which are then hidden in the documentation.
In order to do this, cells are marked with tags understood by the mkdocs plugin mkdocs-jupyter, namely by adding the appropriate tag to the metadata of the relevant cells: one tag hides the cell's input and output, one hides only the input, and one hides only the output. It is important to leave a warning at the top of the document to avoid confusion. Examples for hidden imports and plots are available in the notebooks, e.g. in notebooks/shapley_basic_spotify.ipynb.
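As a sketch, the metadata of a cell tagged with hide looks like the following (standard Jupyter cell metadata; the tags for hiding only the input or only the output are analogous and can be copied from the existing notebooks):
"metadata": {
    "tags": ["hide"]
}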
Plots in Notebooks¶
If you add a plot to a notebook that should also render nicely in browser dark mode, add the tag invertible-output to the cell:
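A sketch of the corresponding cell metadata (standard Jupyter tag metadata, shown for illustration only):
"metadata": {
    "tags": ["invertible-output"]
}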
This applies a simple CSS filter to the output image of the cell.
Documentation¶
API documentation and examples from notebooks are built with mkdocs, using a number of plugins, including mkdocstrings, with versioning handled by mike.
Notebooks are an integral part of the documentation as well; please read the section on notebooks above.
If you want to build the documentation locally, please make sure you followed the instructions in the section Setting up your environment.
Use the following command to build the documentation the same way it is done in CI:
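For example (a sketch; the CI pipeline may invoke this through tox, but a direct build with the mkdocs CLI produces the same site, assuming the docs requirements are installed):
mkdocs build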
Locally, you can use this command instead to continuously rebuild the documentation on changes to the docs and src folders:
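That is, the live-reloading development server referenced further below:
mkdocs serve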
This will rebuild the documentation on changes to .md files inside docs, notebooks and python files.
On OSX, it is possible that the cairo library is not properly linked when installed via Homebrew. In this case you might encounter an error like the following when running mkdocs build or mkdocs serve:
OSError: no library called "cairo-2" was found
no library called "cairo" was found
no library called "libcairo-2" was found
This can be resolved by setting the environment variable DYLD_FALLBACK_LIBRARY_PATH:
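A sketch, assuming Homebrew's library directory (/opt/homebrew/lib on Apple Silicon, /usr/local/lib on Intel Macs):
export DYLD_FALLBACK_LIBRARY_PATH=$DYLD_FALLBACK_LIBRARY_PATH:/opt/homebrew/lib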
Adding new pages¶
Navigation is configured in mkdocs.yaml using the nav section. We use the plugin mkdocs-literate-nav which allows fine-grained control of the navigation structure. However, most pages are explicitly listed and manually arranged in the nav section of the configuration.
Creating stable references for autorefs¶
mkdocstrings includes the plugin autorefs to enable automatic linking across pages with e.g. [a link][to-something]. Anchors are autogenerated from section titles, and are not guaranteed to be unique. In order to ensure that a link will remain valid, add a custom anchor to the section title:
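For example (a sketch using the attr_list syntax; the anchor name matches the reference shown below):
## Some section { #permanent-anchor-to-some-section }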
(note the space after the opening brace). You can then refer to it within another markdown file with [Some section][permanent-anchor-to-some-section].
Adding notes about new features, changes or deprecations¶
We use the admonition extension of Mkdocs Material to create admonitions, also known as call-outs, that hold information about when a certain feature was added, changed or deprecated and optionally a description with more details. We put the admonition directly in a module's, a function's or class' docstring.
We use the following syntax:
The description is useful when the note is about a smaller change such as a parameter.
- For a new feature, we use:
- For a change to an existing feature we use:
  For example, for a change in version 1.2.3 that adds kwargs to a class' constructor we would write:
- For a deprecation we use:
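As an illustration of the general pattern only (a sketch using Material for MkDocs admonition syntax; the exact admonition types and titles used in pyDVL are defined by the existing docstrings and may differ):
!!! info "Changed in version 1.2.3"
    Added kwargs to the constructor.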
Using bibliography¶
Bibliographic citations are managed with the plugin mkdocs-bibtex. To enter a citation, first add the entry to docs/pydvl.bib. For team contributors this should be an export of the Zotero folder software/pydvl in the TransferLab Zotero library. All other contributors just add the bibtex data, and a maintainer will add it to the group library upon merging.
To add a citation inside a markdown file, use the notation [@citekey]. Alas, because of when mkdocs-bibtex enters the pipeline, it won't process docstrings. For module documentation, we manually inject html into the markdown files. For example, in pydvl.value.shapley.montecarlo we have:
"""
Module docstring...
## References
[^1]: <a name="ghorbani_data_2019"></a>Ghorbani, A., Zou, J., 2019.
[Data Shapley: Equitable Valuation of Data for Machine
Learning](https://proceedings.mlr.press/v97/ghorbani19c.html).
In: Proceedings of the 36th International Conference on Machine Learning,
PMLR, pp. 2242–2251.
"""
and then later in the file, inside a function's docstring:
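A sketch of such a reference (the function name, signature and wording are illustrative only; the footnote label [^1] is the one defined in the module docstring above):
def permutation_montecarlo_shapley(u, done):
    """Computes Shapley values with Monte Carlo permutation sampling [^1]."""
    ...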
Writing mathematics¶
Use LaTeX delimiters $ and $$ for inline and displayed mathematics respectively.
Warning: backslashes must be escaped in docstrings! (although there are exceptions). For simplicity, declare the string as "raw" with the prefix r:
# This will work
def f(x: float) -> float:
    r""" Computes
    $${ f(x) = \frac{1}{x^2} }$$
    """
    return 1/(x*x)

# This throws an obscure error
def f(x: float) -> float:
    """ Computes
    $$\frac{1}{x^2}$$
    """
    return 1/(x*x)
Note how there is no space after the dollar signs. This is important! You can use braces for legibility like in the first example.
Abbreviations¶
We keep the abbreviations used in the documentation inside the docs_include/abbreviations.md file.
The syntax for abbreviations is:
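The standard Markdown abbreviations extension syntax is used, one definition per line, e.g. (the entry below is purely illustrative):
*[SV]: Shapley Value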
CI¶
We use workflows to:
- Run the tests.
- Publish documentation.
- Publish packages to testpypi / pypi.
- Mark issues as stale after 30 days. We do this only for issues with the label awaiting-reply, which indicates that we have answered a question / feature request / PR and are waiting for the OP to reply / update their work.
Tests¶
We test all algorithms with simple datasets in CI jobs. This can take a considerable amount of time, so care must be taken not to overdo it:
1. All algorithm tests must be on very simple datasets and as quick as possible.
2. We try not to trigger CI pipelines when unnecessary (see Skipping CI runs).
3. We split the tests based on their duration into groups and run them in parallel.
For that we use pytest-split to first store the duration of all tests with tox -e tests -- --store-durations --slow-tests in a .test_durations file. Alternatively, we can use pytest directly: pytest --store-durations --slow-tests.
Note: This does not have to be done each time a new test or test case is added. For new tests and test cases, pytest-split assumes the average test execution time (calculated based on the stored information) for every test which does not have duration information stored. Thus, there's no need to store durations after every change to the test suite. However, when there are major changes in the suite compared to what's stored in .test_durations, it's recommended to update the duration information with --store-durations to ensure that the splitting is in balance.
Then we can have as many splits as we want:
tox -e tests -- --splits 3 --group 1
tox -e tests -- --splits 3 --group 2
tox -e tests -- --splits 3 --group 3
Alternatively, we can use pytest directly: pytest --splits 3 --group 1.
Each one of these commands should be run in a separate shell/job to run the test groups in parallel and decrease the total runtime.
Running GitHub Actions locally¶
To run GitHub Actions locally we use act. It uses the workflows defined in .github/workflows and determines the set of actions that need to be run. It uses the Docker API to either pull or build the necessary images, as defined in our workflow files, and finally determines the execution path based on the dependencies that were defined.
You can install it manually using:
curl -s https://raw.githubusercontent.com/nektos/act/master/install.sh | sudo bash -s -- -d -b ~/bin
And then simply add it to your PATH variable: PATH=~/bin:$PATH
Refer to its official readme for more installation options.
act cheatsheet¶
By default, act will run all workflows in .github/workflows. You can use the -W flag to specify a specific workflow file to run, or you can rely on the job id to be unique (but then you'll see warnings for the workflows without that job id).
# Run only the main tests for python 3.8 after a push event (implicit)
act -W .github/workflows/run-tests-workflow.yaml \
-j run-tests \
--input tests_to_run=base \
--input python_version=3.8
Other common flags are:
# List all actions for all events:
act -l
# List the actions for a specific event:
act workflow_dispatch -l
# List the actions for a specific job:
act -j lint -l
# Run the default (`push`) event:
act
# Run a specific event:
act pull_request
# Run a specific job:
act -j lint
# Collect artifacts to the /tmp/artifacts folder:
act --artifact-server-path /tmp/artifacts
# Run a job in a specific workflow (useful if you have duplicate job names)
act -j lint -W .github/workflows/tox.yml
# Run in dry-run mode:
act -n
# Enable verbose-logging (can be used with any of the above commands)
act -v
Example¶
To run the publish job (the most difficult one to test) you would simply use:
- When triggered by a release:
  With events.json containing:
  This will use your current branch. If you want to test a specific branch you have to use the workflow_dispatch event (see below).
- To instead run it as if it had been manually triggered (i.e. workflow_dispatch) you would instead use:
  With events.json containing:
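As a rough sketch of the release-triggered variant only (the contents of events.json are not reproduced here; -e is act's standard flag for supplying an event payload):
act release -j publish -e events.json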
Skipping CI runs¶
Sometimes one would like to skip CI for certain commits (e.g. when updating the readme). In order to do this, simply prefix the commit message with [skip ci]. The string can be anywhere, but adding it to the beginning of the commit message makes it more evident when looking at commits in a PR.
Refer to the official GitHub documentation for more information.
Release processes¶
Automatic release process¶
In order to create an automatic release, a few prerequisites need to be satisfied:
- The project's virtualenv needs to be active
- The repository needs to be on the develop branch
- The repository must be clean (including no untracked files)
Then, a new release can be created using the script build_scripts/release-version.sh (leave out the version parameter to have bumpversion automatically derive the next release version by bumping the patch part):
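For example (the version number is purely illustrative):
./build_scripts/release-version.sh 0.2.0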
To find out how to use the script, pass the -h or --help flags:
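For instance:
./build_scripts/release-version.sh --help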
If running in interactive mode (without -y|--yes), the script will output a summary of pending changes and ask for confirmation before executing the actions.
Once this is done, a tag will be created on the repository. You should then create a GitHub release for that tag. That will trigger a CI pipeline that will automatically create a package and publish it from CI to PyPI.
Manual release process¶
If the automatic release process doesn't cover your use case, you can also create a new release manually by following these steps:
- (Repeat as needed) implement features on feature branches merged into develop. Each merge into develop will publish a new pre-release version to TestPyPI. These versions can be installed using pip install --pre --index-url https://test.pypi.org/simple/ pyDVL.
- When ready to release: from the develop branch create the release branch and perform release activities (update changelog, news, ...). For your own convenience, define an env variable for the release version (see the sketch after this list).
- Run bumpversion --commit release if the release is only a patch release, otherwise the full version can be specified using bumpversion --commit --new-version X.Y.Z release (the release part is ignored but required by bumpversion).
- Merge the release branch into master, tag the merge commit, and push back to the repo. The CI pipeline publishes the package based on the tagged commit.
- Switch back to the release branch release/vX.Y.Z and pre-bump the version: bumpversion --commit patch. This ensures that develop pre-releases are always strictly more recent than the last published release version from master.
- Merge the release branch into develop.
- Delete the release branch if necessary: git branch -d release/${RELEASE_VERSION}
- Create a GitHub release for the created tag.
- Pour yourself a cup of coffee, you earned it!
- A package will be automatically created and published from CI to PyPI.
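A minimal sketch of the branch-related shell commands from the list above, assuming the release branch naming used in this guide (whether RELEASE_VERSION includes the v prefix should be checked against the existing scripts; the version is illustrative):
export RELEASE_VERSION="v0.2.0"                      # illustrative version
git checkout -b release/${RELEASE_VERSION} develop   # create the release branch
# ... perform release activities, bumpversion, merge into master ...
git checkout develop
git merge release/${RELEASE_VERSION}                 # merge the release branch into develop
git branch -d release/${RELEASE_VERSION}             # delete the release branch if necessary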
CI and requirements for publishing¶
In order to publish new versions of the package from the development branch, the CI pipeline requires the following secret variables to be set up:
The first 2 are used after tests run on the develop branch's CI workflow to automatically publish packages to TestPyPI.
The last 2 are used in the publish.yaml CI workflow to publish packages to PyPI from develop after a GitHub release.
Publish to TestPyPI¶
We use bump2version to bump the build part of the version number without committing or tagging the change and then publish a package to TestPyPI from CI using Twine. The version has the GitHub run number appended.
For more details refer to the files .github/workflows/publish.yaml and .github/workflows/tox.yaml.