pydvl.influence.influence_calculator
This module provides functionality for calculating influences for large amounts of data. The computation is based on a chunked computation model in the form of an instance of InfluenceFunctionModel, which is mapped over a collection of chunks.
DisableClientSingleThreadCheck
This type can be provided to the initialization of a DaskInfluenceCalculator instead of a distributed client object. It is useful in scenarios where the user wants to disable the thread-safety check during initialization, e.g. when using the single-machine synchronous scheduler for debugging purposes.
Example
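A minimal sketch: the names model, loss and train_dataloader are placeholders, assumed to be defined as in the DaskInfluenceCalculator example below, and DisableClientSingleThreadCheck is assumed importable from pydvl.influence like DaskInfluenceCalculator.
import torch
from pydvl.influence import DaskInfluenceCalculator, DisableClientSingleThreadCheck
from pydvl.influence.torch import CgInfluence
from pydvl.influence.torch.util import TorchNumpyConverter
infl_model = CgInfluence(model, loss, hessian_regularization=0.01)
infl_model = infl_model.fit(train_dataloader)
# pass the type itself instead of a distributed client, skipping the
# single-thread safety check (e.g. when debugging with the synchronous
# scheduler)
infl_calc = DaskInfluenceCalculator(
    infl_model,
    TorchNumpyConverter(device=torch.device("cpu")),
    DisableClientSingleThreadCheck,
)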
DaskInfluenceCalculator
DaskInfluenceCalculator(
influence_function_model: InfluenceFunctionModel,
converter: NumpyConverter,
client: Union[Client, Type[DisableClientSingleThreadCheck]],
)
This class is designed to compute influences over dask.array.Array collections, leveraging the capabilities of Dask for distributed computing and parallel processing. It requires an influence computation model of type InfluenceFunctionModel, which defines how influences are computed on a chunk of data. Essentially, this class functions by mapping the influence function model across the various chunks of a dask.array.Array collection.
PARAMETER | DESCRIPTION
---|---
influence_function_model | An instance of type InfluenceFunctionModel that specifies the computation logic for influence on data chunks. It is a pivotal part of the calculator, determining how influence is computed and applied across the data array. TYPE: InfluenceFunctionModel
converter | A utility for converting numpy arrays to TensorType objects, facilitating the interaction between numpy arrays and the influence function model. TYPE: NumpyConverter
client | Either a distributed Client object or the type DisableClientSingleThreadCheck. During initialization, the system verifies that all workers operate in single-threaded mode if the provided influence_function_model is designated as not thread-safe. To intentionally skip this safety check (e.g., for debugging purposes using the single-machine synchronous scheduler), supply the DisableClientSingleThreadCheck type. TYPE: Union[Client, Type[DisableClientSingleThreadCheck]]
Warning
Make sure to set threads_per_worker=1 when using the distributed scheduler for computing, if your implementation of InfluenceFunctionModel is not thread-safe.
Example
import torch
from torch.utils.data import Dataset, DataLoader
from pydvl.influence import DaskInfluenceCalculator
from pydvl.influence.torch import CgInfluence
from pydvl.influence.torch.util import (
torch_dataset_to_dask_array,
TorchNumpyConverter,
)
from distributed import Client
# potentially large datasets that do not fit into memory
train_data_set: Dataset = LargeDataSet(...)
test_data_set: Dataset = LargeDataSet(...)
train_dataloader = DataLoader(train_data_set)
infl_model = CgInfluence(model, loss, hessian_regularization=0.01)
infl_model = infl_model.fit(train_dataloader)
# wrap your input data into dask arrays
chunk_size = 10
da_x, da_y = torch_dataset_to_dask_array(train_data_set, chunk_size=chunk_size)
da_x_test, da_y_test = torch_dataset_to_dask_array(test_data_set,
chunk_size=chunk_size)
# use only one thread for scheduling, due to non-thread safety of some torch
# operations
client = Client(n_workers=4, threads_per_worker=1)
infl_calc = DaskInfluenceCalculator(infl_model,
TorchNumpyConverter(device=torch.device("cpu")),
client)
da_influences = infl_calc.influences(da_x_test, da_y_test, da_x, da_y)
# da_influences is a dask.array.Array
# trigger computation and write chunks to disk in parallel
da_influences.to_zarr("path/or/url")
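To continue working with the persisted result, it can be lazily re-opened, e.g. (a small sketch; the path is the placeholder used above):
import dask.array as da
# lazily re-open the influence values written above
da_influences = da.from_zarr("path/or/url")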
n_parameters
property
Number of trainable parameters of the underlying model used in the batch computation
influence_factors
influence_factors(x: Array, y: Array) -> Array
Computes the expression
\[ H^{-1}\nabla_{\theta} \ell(y, f_{\theta}(x)) \]
where the gradients are computed for the chunks of \((x, y)\).
PARAMETER | DESCRIPTION
---|---
x | model input to use in the gradient computations. TYPE: Array
y | label tensor to compute gradients. TYPE: Array
RETURNS | DESCRIPTION
---|---
Array | dask.array.Array representing the element-wise inverse Hessian matrix vector products for the provided batch.
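For illustration, a sketch that reuses infl_calc, da_x_test and da_y_test from the class example above:
# lazily compute the inverse Hessian-vector products for the test data
da_factors = infl_calc.influence_factors(da_x_test, da_y_test)
# da_factors is a dask.array.Array; trigger computation, e.g. by persisting
da_factors.to_zarr("path/or/url")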
influences
influences(
x_test: Array,
y_test: Array,
x: Optional[Array] = None,
y: Optional[Array] = None,
mode: InfluenceMode = InfluenceMode.Up,
) -> Array
Compute approximation of
\[ \langle H^{-1}\nabla_{\theta} \ell(y_{\text{test}}, f_{\theta}(x_{\text{test}})), \nabla_{\theta} \ell(y, f_{\theta}(x)) \rangle \]
for the case of up-weighting influence, resp.
\[ \langle H^{-1}\nabla_{\theta} \ell(y_{\text{test}}, f_{\theta}(x_{\text{test}})), \nabla_{x} \nabla_{\theta} \ell(y, f_{\theta}(x)) \rangle \]
for the perturbation type influence case. The computation is done block-wise for the chunks of the provided dask arrays.
PARAMETER | DESCRIPTION
---|---
x_test | model input to use in the gradient computations of \(H^{-1}\nabla_{\theta} \ell(y_{\text{test}}, f_{\theta}(x_{\text{test}}))\). TYPE: Array
y_test | label tensor to compute gradients. TYPE: Array
x | optional model input to use in the gradient computations \(\nabla_{\theta}\ell(y, f_{\theta}(x))\), resp. \(\nabla_{x}\nabla_{\theta}\ell(y, f_{\theta}(x))\); if None, use \(x=x_{\text{test}}\). TYPE: Optional[Array]
y | optional label tensor to compute gradients. TYPE: Optional[Array]
mode | enum value of InfluenceMode. TYPE: InfluenceMode
RETURNS | DESCRIPTION
---|---
Array | dask.array.Array representing the element-wise scalar products for the provided batch.
influences_from_factors
influences_from_factors(
z_test_factors: Array,
x: Array,
y: Array,
mode: InfluenceMode = InfluenceMode.Up,
) -> Array
Computation of
\[ \langle z_{\text{test\_factors}}, \nabla_{\theta} \ell(y, f_{\theta}(x)) \rangle \]
for the case of up-weighting influence, resp.
\[ \langle z_{\text{test\_factors}}, \nabla_{x} \nabla_{\theta} \ell(y, f_{\theta}(x)) \rangle \]
for the perturbation type influence case. The gradient is meant to be per sample of the batch \((x, y)\).
PARAMETER | DESCRIPTION
---|---
z_test_factors | pre-computed array, approximating \(H^{-1}\nabla_{\theta} \ell(y_{\text{test}}, f_{\theta}(x_{\text{test}}))\). TYPE: Array
x | model input to use in the gradient computations \(\nabla_{\theta}\ell(y, f_{\theta}(x))\), resp. \(\nabla_{x}\nabla_{\theta}\ell(y, f_{\theta}(x))\). TYPE: Array
y | label tensor to compute gradients. TYPE: Array
mode | enum value of InfluenceMode. TYPE: InfluenceMode
RETURNS | DESCRIPTION
---|---
Array | dask.array.Array representing the element-wise scalar product of the provided batch.
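A sketch of the resulting two-step workflow, reusing the names from the class example above: the factors are computed once and then combined with each training chunk.
# step 1: compute the test factors once
da_factors = infl_calc.influence_factors(da_x_test, da_y_test)
# step 2: combine the factors with the training data chunks
da_influences = infl_calc.influences_from_factors(da_factors, da_x, da_y)
# trigger computation and write chunks to disk in parallel
da_influences.to_zarr("path/or/url")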
SequentialInfluenceCalculator
SequentialInfluenceCalculator(influence_function_model: InfluenceFunctionModel)
This class serves as a simple wrapper for processing batches of data in a sequential manner. It is particularly useful in scenarios where parallel or distributed processing is not required or not feasible. The core functionality of this class is to apply a specified influence computation model, of type InfluenceFunctionModel, to batches of data one at a time.
PARAMETER | DESCRIPTION
---|---
influence_function_model | An instance of type InfluenceFunctionModel that specifies the computation logic for influence on data chunks. TYPE: InfluenceFunctionModel
Example
from torch.utils.data import DataLoader
from pydvl.influence import SequentialInfluenceCalculator
from pydvl.influence.torch.util import (
NestedTorchCatAggregator,
TorchNumpyConverter,
)
from pydvl.influence.torch import CgInfluence
batch_size = 10
train_dataloader = DataLoader(..., batch_size=batch_size)
test_dataloader = DataLoader(..., batch_size=batch_size)
infl_model = CgInfluence(model, loss, hessian_regularization=0.01)
infl_model = infl_model.fit(train_dataloader)
infl_calc = SequentialInfluenceCalculator(infl_model)
# this does not trigger the computation
lazy_influences = infl_calc.influences(test_dataloader, train_dataloader)
# trigger computation and pull the result into main memory, result is the full
# tensor for all combinations of the two loaders
influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())
# or
# trigger computation and write results chunk-wise to disk using zarr in a
# sequential manner
lazy_influences.to_zarr("local_path/or/url", TorchNumpyConverter())
influence_factors
influence_factors(
data_iterable: Iterable[Tuple[TensorType, TensorType]]
) -> LazyChunkSequence
Compute the expression
\[ H^{-1}\nabla_{\theta} \ell(y, f_{\theta}(x)) \]
where the gradients are computed for the chunks \((x, y)\) of the data_iterable in a sequential manner.
PARAMETER | DESCRIPTION
---|---
data_iterable | An iterable that returns tuples of tensors. Each tuple consists of a pair of tensors (x, y), representing input data and corresponding targets. TYPE: Iterable[Tuple[TensorType, TensorType]]
RETURNS | DESCRIPTION
---|---
LazyChunkSequence | A lazy data structure representing the chunks of the resulting tensor.
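For illustration, a sketch reusing infl_calc and test_dataloader from the class example above; any iterable of (x, y) batches works. The to_zarr call mirrors the one shown for NestedLazyChunkSequence in the class example and is an assumption here.
# lazily map the fitted influence model over the test batches
lazy_factors = infl_calc.influence_factors(test_dataloader)
# trigger computation and write the chunks to disk sequentially
lazy_factors.to_zarr("local_path/or/url", TorchNumpyConverter())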
influences
influences(
test_data_iterable: Iterable[Tuple[TensorType, TensorType]],
train_data_iterable: Iterable[Tuple[TensorType, TensorType]],
mode: InfluenceMode = InfluenceMode.Up,
) -> NestedLazyChunkSequence
Compute approximation of
\[ \langle H^{-1}\nabla_{\theta} \ell(y_{\text{test}}, f_{\theta}(x_{\text{test}})), \nabla_{\theta} \ell(y, f_{\theta}(x)) \rangle \]
for the case of up-weighting influence, resp.
\[ \langle H^{-1}\nabla_{\theta} \ell(y_{\text{test}}, f_{\theta}(x_{\text{test}})), \nabla_{x} \nabla_{\theta} \ell(y, f_{\theta}(x)) \rangle \]
for the perturbation type influence case. The computation is done block-wise for the chunks of the provided data iterables and aggregated into a single tensor in memory.
PARAMETER | DESCRIPTION
---|---
test_data_iterable | An iterable that returns tuples of tensors. Each tuple consists of a pair of tensors (x, y), representing input data and corresponding targets. TYPE: Iterable[Tuple[TensorType, TensorType]]
train_data_iterable | An iterable that returns tuples of tensors. Each tuple consists of a pair of tensors (x, y), representing input data and corresponding targets. TYPE: Iterable[Tuple[TensorType, TensorType]]
mode | enum value of InfluenceMode. TYPE: InfluenceMode
RETURNS | DESCRIPTION
---|---
NestedLazyChunkSequence | A lazy data structure representing the chunks of the resulting tensor.
influences_from_factors
influences_from_factors(
z_test_factors: Iterable[TensorType],
train_data_iterable: Iterable[Tuple[TensorType, TensorType]],
mode: InfluenceMode = InfluenceMode.Up,
) -> NestedLazyChunkSequence
Computation of
\[ \langle z_{\text{test\_factors}}, \nabla_{\theta} \ell(y, f_{\theta}(x)) \rangle \]
for the case of up-weighting influence, resp.
\[ \langle z_{\text{test\_factors}}, \nabla_{x} \nabla_{\theta} \ell(y, f_{\theta}(x)) \rangle \]
for the perturbation type influence case. The gradient is meant to be per sample of the batch \((x, y)\).
PARAMETER | DESCRIPTION
---|---
z_test_factors | Pre-computed iterable of tensors, approximating \(H^{-1}\nabla_{\theta} \ell(y_{\text{test}}, f_{\theta}(x_{\text{test}}))\). TYPE: Iterable[TensorType]
train_data_iterable | An iterable that returns tuples of tensors. Each tuple consists of a pair of tensors (x, y), representing input data and corresponding targets. TYPE: Iterable[Tuple[TensorType, TensorType]]
mode | enum value of InfluenceMode. TYPE: InfluenceMode
RETURNS | DESCRIPTION
---|---
NestedLazyChunkSequence | A lazy data structure representing the chunks of the resulting tensor.
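A sketch of the two-step workflow under the names of the class example above; here the test factors are computed eagerly per batch with the fitted influence model and then reused for the training data.
# step 1: compute factors for each test batch with the fitted model;
# the resulting list of tensors is a valid Iterable[TensorType]
z_test_factors = [
    infl_model.influence_factors(x, y) for x, y in test_dataloader
]
# step 2: lazily combine the factors with the training batches
lazy_influences = infl_calc.influences_from_factors(
    z_test_factors, train_dataloader
)
# trigger computation and pull the result into memory
influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())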