Influence calculator
This module provides functionality for calculating influences for large amounts of data. The computation is based on a chunk computation model in the form of an instance of [InfluenceFunctionModel][pydvl.influence.base_influence_model.InfluenceFunctionModel], which is mapped over a collection of chunks.
DisableClientSingleThreadCheck
This type can be passed to the initializer of a DaskInfluenceCalculator instead of a distributed client object. It is useful when the user wants to disable the thread-safety check in the initialization phase, e.g. when using the single-machine synchronous scheduler for debugging purposes.
Example
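The idea of passing a type as a flag instead of a client object can be sketched in plain Python. This is an illustration only, not pyDVL's implementation: `SkipThreadCheck`, `FakeClient` and `validate_client` are hypothetical names standing in for the marker type, a distributed client, and the check performed at initialization.

```python
from dataclasses import dataclass
from typing import Type, Union


class SkipThreadCheck:
    """Hypothetical marker type; it is passed as a class and never instantiated."""


@dataclass
class FakeClient:
    """Hypothetical stand-in for a distributed client object."""

    threads_per_worker: int


def validate_client(
    client: Union[FakeClient, Type[SkipThreadCheck]],
    model_is_thread_safe: bool,
) -> bool:
    """Return True if the thread-safety check ran, False if it was skipped."""
    if client is SkipThreadCheck:
        # The caller opted out explicitly, e.g. for synchronous debugging.
        return False
    if not model_is_thread_safe and client.threads_per_worker != 1:
        raise ValueError("model is not thread-safe; use one thread per worker")
    return True


# The marker type bypasses the check, a single-threaded client passes it.
skipped = validate_client(SkipThreadCheck, model_is_thread_safe=False)
checked = validate_client(FakeClient(threads_per_worker=1), model_is_thread_safe=False)
```

Using a dedicated type rather than `None` makes the opt-out explicit at the call site, which is presumably why a marker type is offered here.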
DaskInfluenceCalculator(influence_function_model, converter, client)
This class is designed to compute influences over dask.array.Array collections, leveraging the capabilities of Dask for distributed computing and parallel processing. It requires an influence computation model of type [InfluenceFunctionModel][pydvl.influence.base_influence_model.InfluenceFunctionModel], which defines how influences are computed on a chunk of data. Essentially, this class functions by mapping the influence function model across the various chunks of a dask.array.Array collection.
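The mapping of a chunk-level function over the chunks of a collection can be illustrated without Dask. The sketch below uses plain lists; `split_into_chunks`, `map_over_chunks` and `toy_chunk_model` are hypothetical names, and the doubling function merely stands in for a per-chunk influence computation.

```python
from typing import Callable, List, Sequence

Chunk = List[float]


def split_into_chunks(data: Sequence[float], chunk_size: int) -> List[Chunk]:
    """Cut the data into consecutive chunks of at most chunk_size items."""
    return [list(data[i : i + chunk_size]) for i in range(0, len(data), chunk_size)]


def map_over_chunks(
    chunk_fn: Callable[[Chunk], Chunk], chunks: List[Chunk]
) -> List[Chunk]:
    """Apply the chunk-level function to every chunk independently."""
    return [chunk_fn(chunk) for chunk in chunks]


def toy_chunk_model(chunk: Chunk) -> Chunk:
    """Hypothetical per-chunk computation acting on each sample separately."""
    return [2.0 * value for value in chunk]


data = [1.0, 2.0, 3.0, 4.0, 5.0]
chunks = split_into_chunks(data, chunk_size=2)
chunk_results = map_over_chunks(toy_chunk_model, chunks)

# Concatenating the per-chunk results gives the result for the full data.
flat_result = [value for chunk in chunk_results for value in chunk]
```

Because every chunk is processed independently, the chunks can be distributed over workers, which is the property a distributed scheduler exploits.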
PARAMETER | DESCRIPTION |
---|---|
influence_function_model | Instance of type [InfluenceFunctionModel][pydvl.influence.base_influence_model.InfluenceFunctionModel] that specifies the computation logic for influence on data chunks. It determines how influence is computed and applied across the data array. |
converter | A utility for converting numpy arrays to TensorType objects, facilitating the interaction between numpy arrays and the influence function model. |
client | Either a distributed client object or the DisableClientSingleThreadCheck type. During initialization, the calculator verifies that all workers operate in single-threaded mode if the provided influence_function_model is designated as not thread-safe. To intentionally skip this safety check (e.g., for debugging purposes using the single-machine synchronous scheduler), supply the DisableClientSingleThreadCheck type. |
Warning
Make sure to set threads_per_worker=1 when using the distributed scheduler for computing, if your implementation of [InfluenceFunctionModel][pydvl.influence.base_influence_model.InfluenceFunctionModel] is not thread-safe.
Example
```python
import torch
from torch.utils.data import Dataset, DataLoader
from pydvl.influence import DaskInfluenceCalculator
from pydvl.influence.torch import CgInfluence
from pydvl.influence.torch.util import (
    torch_dataset_to_dask_array,
    TorchNumpyConverter,
)
from distributed import Client

# Possibly a large dataset that does not fit into memory.
# LargeDataSet, model and loss are placeholders defined elsewhere.
train_data_set: Dataset = LargeDataSet(...)
test_data_set: Dataset = LargeDataSet(...)

train_dataloader = DataLoader(train_data_set)
infl_model = CgInfluence(model, loss, hessian_regularization=0.01)
infl_model = infl_model.fit(train_dataloader)

# wrap your input data into dask arrays
chunk_size = 10
da_x, da_y = torch_dataset_to_dask_array(train_data_set, chunk_size=chunk_size)
da_x_test, da_y_test = torch_dataset_to_dask_array(
    test_data_set, chunk_size=chunk_size
)

# use only one thread per worker, due to non-thread safety of some torch
# operations
client = Client(n_workers=4, threads_per_worker=1)

infl_calc = DaskInfluenceCalculator(
    infl_model, TorchNumpyConverter(device=torch.device("cpu")), client
)
da_influences = infl_calc.influences(da_x_test, da_y_test, da_x, da_y)
# da_influences is a dask.array.Array

# trigger computation and write chunks to disk in parallel
da_influences.to_zarr("path/or/url")
```
Source code in src/pydvl/influence/influence_calculator.py
n_parameters
property
Number of trainable parameters of the underlying model used in the batch computation
influence_factors(x, y)
Computes the expression
\[ H^{-1}\nabla_{\theta} \ell(y, f_{\theta}(x)) \]
where the gradients are computed for the chunks of \((x, y)\).
PARAMETER | DESCRIPTION |
---|---|
x | model input to use in the gradient computations |
y | label tensor to compute gradients |
RETURNS | DESCRIPTION |
---|---|
Array | dask.array.Array representing the element-wise inverse Hessian matrix vector products for the provided batch. |
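As a worked illustration of an inverse Hessian vector product per sample, consider a toy linear model \(f_{\theta}(x) = \theta \cdot x\) with two parameters and the squared loss \(\ell = \tfrac{1}{2}(\theta \cdot x - y)^2\). The sketch below is not how pyDVL computes the factors; all function names are hypothetical, and the Hessian is inverted in closed form, which is only practical for such a tiny example.

```python
from typing import List, Sequence, Tuple

Vec = Tuple[float, float]
Mat = Tuple[Vec, Vec]


def loss_gradient(theta: Vec, x: Vec, y: float) -> Vec:
    """Gradient of 0.5 * (theta . x - y)^2 with respect to theta."""
    residual = theta[0] * x[0] + theta[1] * x[1] - y
    return (residual * x[0], residual * x[1])


def regularized_hessian(xs: Sequence[Vec], reg: float) -> Mat:
    """Mean of x x^T over the training inputs plus reg times the identity."""
    n = len(xs)
    a = sum(x[0] * x[0] for x in xs) / n + reg
    b = sum(x[0] * x[1] for x in xs) / n
    d = sum(x[1] * x[1] for x in xs) / n + reg
    return ((a, b), (b, d))


def solve_2x2(h: Mat, g: Vec) -> Vec:
    """Return H^{-1} g for a 2x2 matrix using the closed-form inverse."""
    (a, b), (c, d) = h
    det = a * d - b * c
    return ((d * g[0] - b * g[1]) / det, (a * g[1] - c * g[0]) / det)


def influence_factors(
    theta: Vec, h: Mat, xs: Sequence[Vec], ys: Sequence[float]
) -> List[Vec]:
    """One inverse Hessian vector product per sample (x, y)."""
    return [solve_2x2(h, loss_gradient(theta, x, y)) for x, y in zip(xs, ys)]


theta = (1.0, 0.0)
train_x = [(1.0, 0.0), (1.0, 1.0)]
hessian = regularized_hessian(train_x, reg=0.0)

test_x, test_y = [(2.0, 1.0)], [1.0]
factors = influence_factors(theta, hessian, test_x, test_y)
```

Multiplying the Hessian with a computed factor recovers the corresponding gradient, which is a convenient sanity check for any approximate solver.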
influences(x_test, y_test, x=None, y=None, mode=InfluenceMode.Up)
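Compute approximation of
\[ \langle H^{-1}\nabla_{\theta} \ell(y_{\text{test}}, f_{\theta}(x_{\text{test}})), \nabla_{\theta} \ell(y, f_{\theta}(x)) \rangle \]
for the case of up-weighting influence, resp.
\[ \langle H^{-1}\nabla_{\theta} \ell(y_{\text{test}}, f_{\theta}(x_{\text{test}})), \nabla_{x} \nabla_{\theta} \ell(y, f_{\theta}(x)) \rangle \]
for the perturbation type influence case. The computation is done block-wise for the chunks of the provided dask arrays.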
PARAMETER | DESCRIPTION |
---|---|
x_test | model input to use in the gradient computations of \(H^{-1}\nabla_{\theta} \ell(y_{\text{test}}, f_{\theta}(x_{\text{test}}))\) |
y_test | label tensor to compute gradients |
x | optional model input to use in the gradient computations \(\nabla_{\theta}\ell(y, f_{\theta}(x))\), resp. \(\nabla_{x}\nabla_{\theta}\ell(y, f_{\theta}(x))\); if None, use \(x=x_{\text{test}}\) |
y | optional label tensor to compute gradients |
mode | enum value of InfluenceMode, selecting the up-weighting or the perturbation type of influence; defaults to InfluenceMode.Up |
RETURNS | DESCRIPTION |
---|---|
Array | dask.array.Array representing the element-wise scalar products for the provided batch. |
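The block-wise structure of this computation can be sketched in plain Python for the up-weighting case. The sketch assumes that the test factors and the per-sample training gradients are already available as small vectors; `influence_block`, `chunked` and `blockwise_influences` are hypothetical helper names, not part of pyDVL.

```python
from typing import List, Sequence

Vec = Sequence[float]
Matrix = List[List[float]]


def dot(u: Vec, v: Vec) -> float:
    """Scalar product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def influence_block(test_factors: Sequence[Vec], train_grads: Sequence[Vec]) -> Matrix:
    """All scalar products between one test chunk and one train chunk."""
    return [[dot(f, g) for g in train_grads] for f in test_factors]


def chunked(items: Sequence[Vec], size: int) -> List[Sequence[Vec]]:
    """Cut a sequence into consecutive chunks of at most size items."""
    return [items[i : i + size] for i in range(0, len(items), size)]


def blockwise_influences(
    test_factors: Sequence[Vec], train_grads: Sequence[Vec], chunk_size: int
) -> Matrix:
    """Assemble the full matrix from the blocks of all chunk pairs."""
    rows: Matrix = []
    for test_chunk in chunked(test_factors, chunk_size):
        # One row of blocks: this test chunk against every train chunk.
        blocks = [
            influence_block(test_chunk, train_chunk)
            for train_chunk in chunked(train_grads, chunk_size)
        ]
        # Concatenate the blocks horizontally, row by row.
        for r in range(len(test_chunk)):
            rows.append([value for block in blocks for value in block[r]])
    return rows


test_factors = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
train_grads = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0)]

full = influence_block(test_factors, train_grads)
blockwise = blockwise_influences(test_factors, train_grads, chunk_size=2)
```

Each block depends only on one test chunk and one train chunk, so the blocks can be computed independently and written out one at a time.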
influences_from_factors(z_test_factors, x, y, mode=InfluenceMode.Up)
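Computation of
\[ \langle z_{\text{test\_factors}}, \nabla_{\theta} \ell(y, f_{\theta}(x)) \rangle \]
for the case of up-weighting influence, resp.
\[ \langle z_{\text{test\_factors}}, \nabla_{x} \nabla_{\theta} \ell(y, f_{\theta}(x)) \rangle \]
for the perturbation type influence case. The gradient is meant to be per sample of the batch \((x, y)\).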
PARAMETER | DESCRIPTION |
---|---|
z_test_factors | pre-computed array, approximating \(H^{-1}\nabla_{\theta} \ell(y_{\text{test}}, f_{\theta}(x_{\text{test}}))\) |
x | model input to use in the gradient computations \(\nabla_{\theta}\ell(y, f_{\theta}(x))\), resp. \(\nabla_{x}\nabla_{\theta}\ell(y, f_{\theta}(x))\) |
y | label tensor to compute gradients |
mode | enum value of InfluenceMode, selecting the up-weighting or the perturbation type of influence; defaults to InfluenceMode.Up |
RETURNS | DESCRIPTION |
---|---|
Array | dask.array.Array representing the element-wise scalar product of the provided batch |
SequentialInfluenceCalculator(influence_function_model)
This class serves as a simple wrapper for processing batches of data in a sequential manner. It is particularly useful in scenarios where parallel or distributed processing is not required or not feasible. The core functionality of this class is to apply a specified influence computation model, of type [InfluenceFunctionModel][pydvl.influence.base_influence_model.InfluenceFunctionModel], to batches of data one at a time.
PARAMETER | DESCRIPTION |
---|---|
influence_function_model | An instance of type [InfluenceFunctionModel][pydvl.influence.base_influence_model.InfluenceFunctionModel] that specifies the computation logic for influence on data chunks. |
Example
```python
from torch.utils.data import DataLoader
from pydvl.influence import SequentialInfluenceCalculator
from pydvl.influence.torch.util import (
    NestedTorchCatAggregator,
    TorchNumpyConverter,
)
from pydvl.influence.torch import CgInfluence

batch_size = 10
# model, loss and the datasets passed to DataLoader are placeholders
train_dataloader = DataLoader(..., batch_size=batch_size)
test_dataloader = DataLoader(..., batch_size=batch_size)

infl_model = CgInfluence(model, loss, hessian_regularization=0.01)
infl_model = infl_model.fit(train_dataloader)

infl_calc = SequentialInfluenceCalculator(infl_model)

# this does not trigger the computation
lazy_influences = infl_calc.influences(test_dataloader, train_dataloader)

# trigger computation and pull the result into main memory, result is the full
# tensor for all combinations of the two loaders
influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())
# or
# trigger computation and write results chunk-wise to disk using zarr in a
# sequential manner
lazy_influences.to_zarr("local_path/or/url", TorchNumpyConverter())
```
influence_factors(data_iterable)
Compute the expression
\[ H^{-1}\nabla_{\theta} \ell(y, f_{\theta}(x)) \]
where the gradients are computed for the chunks \((x, y)\) of the data_iterable in a sequential manner.
PARAMETER | DESCRIPTION |
---|---|
data_iterable | An iterable that returns tuples of tensors. Each tuple consists of a pair of tensors (x, y), representing input data and corresponding targets. |
RETURNS | DESCRIPTION |
---|---|
LazyChunkSequence | A lazy data structure representing the chunks of the resulting tensor |
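What "lazy" means for such a result can be sketched with a generator-backed wrapper. This is a simplified illustration under stated assumptions, not pyDVL's LazyChunkSequence: `LazyChunks`, `chunk_model` and `lazy_map` are hypothetical names, and squaring stands in for the per-chunk computation.

```python
from typing import Callable, Iterator, List

Chunk = List[float]


class LazyChunks:
    """Hypothetical stand-in for a lazy sequence of result chunks."""

    def __init__(self, generator_factory: Callable[[], Iterator[Chunk]]) -> None:
        # Only the recipe is stored; no chunk is computed here.
        self._generator_factory = generator_factory

    def compute(self) -> Chunk:
        """Evaluate the chunks one after another and concatenate them."""
        return [value for chunk in self._generator_factory() for value in chunk]


calls: List[int] = []


def chunk_model(batch: Chunk) -> Chunk:
    """Hypothetical per-chunk computation that records each invocation."""
    calls.append(len(batch))
    return [value * value for value in batch]


def lazy_map(batches: List[Chunk]) -> LazyChunks:
    """Describe the mapping over all batches without running it."""
    return LazyChunks(lambda: (chunk_model(batch) for batch in batches))


batches = [[1.0, 2.0], [3.0]]
lazy_result = lazy_map(batches)
calls_before_compute = list(calls)  # nothing has been evaluated yet

result = lazy_result.compute()  # chunks are evaluated sequentially here
```

Deferring the evaluation lets the caller decide whether to aggregate everything in memory or to process and store one chunk at a time.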
influences(test_data_iterable, train_data_iterable, mode=InfluenceMode.Up)
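Compute approximation of
\[ \langle H^{-1}\nabla_{\theta} \ell(y_{\text{test}}, f_{\theta}(x_{\text{test}})), \nabla_{\theta} \ell(y, f_{\theta}(x)) \rangle \]
for the case of up-weighting influence, resp.
\[ \langle H^{-1}\nabla_{\theta} \ell(y_{\text{test}}, f_{\theta}(x_{\text{test}})), \nabla_{x} \nabla_{\theta} \ell(y, f_{\theta}(x)) \rangle \]
for the perturbation type influence case. The computation is done block-wise for the chunks of the provided data iterables. The result is lazy: the blocks are only computed, and optionally aggregated into a single tensor in memory, once the returned structure is evaluated.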
PARAMETER | DESCRIPTION |
---|---|
test_data_iterable | An iterable that returns tuples of tensors. Each tuple consists of a pair of tensors (x, y), representing input data and corresponding targets. |
train_data_iterable | An iterable that returns tuples of tensors. Each tuple consists of a pair of tensors (x, y), representing input data and corresponding targets. |
mode | enum value of InfluenceMode, selecting the up-weighting or the perturbation type of influence; defaults to InfluenceMode.Up |
RETURNS | DESCRIPTION |
---|---|
NestedLazyChunkSequence | A lazy data structure representing the chunks of the resulting tensor |
influences_from_factors(z_test_factors, train_data_iterable, mode=InfluenceMode.Up)
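Computation of
\[ \langle z_{\text{test\_factors}}, \nabla_{\theta} \ell(y, f_{\theta}(x)) \rangle \]
for the case of up-weighting influence, resp.
\[ \langle z_{\text{test\_factors}}, \nabla_{x} \nabla_{\theta} \ell(y, f_{\theta}(x)) \rangle \]
for the perturbation type influence case. The gradient is meant to be per sample of the batch \((x, y)\).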
PARAMETER | DESCRIPTION |
---|---|
z_test_factors | Pre-computed iterable of tensors, approximating \(H^{-1}\nabla_{\theta} \ell(y_{\text{test}}, f_{\theta}(x_{\text{test}}))\) |
train_data_iterable | An iterable that returns tuples of tensors. Each tuple consists of a pair of tensors (x, y), representing input data and corresponding targets. |
mode | enum value of InfluenceMode, selecting the up-weighting or the perturbation type of influence; defaults to InfluenceMode.Up |
RETURNS | DESCRIPTION |
---|---|
NestedLazyChunkSequence | A lazy data structure representing the chunks of the resulting tensor |