pydvl.influence.array
This module provides classes and utilities for handling large arrays that are chunked and lazily evaluated. It includes abstract base classes for converting between tensor types and NumPy arrays, for aggregating blocks of data, and for representing lazy arrays. Concrete implementations handle lazy arrays chunked along one or two dimensions, with support for efficient storage and retrieval using the Zarr library.
NumpyConverter
SequenceAggregator
Bases: Generic[TensorType], ABC
__call__
abstractmethod
__call__(tensor_sequence: LazyChunkSequence)
Aggregates tensors from a sequence.
Implement this method to define how a sequence of tensors, provided by a generator, should be combined.
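As an illustration of what a custom aggregation might look like, here is a minimal sketch in plain Python. It does not import pyDVL: `ChunkSource`, `Aggregator` and `ConcatAggregator` are hypothetical stand-ins for the classes described on this page, and plain lists stand in for tensors.

```python
from abc import ABC, abstractmethod
from typing import Callable, Generator, Generic, List, TypeVar

T = TypeVar("T")


class ChunkSource(Generic[T]):
    """Hypothetical stand-in for an object wrapping a generator factory."""

    def __init__(self, generator_factory: Callable[[], Generator[T, None, None]]):
        self.generator_factory = generator_factory


class Aggregator(Generic[T], ABC):
    """Hypothetical stand-in for an abstract aggregator interface."""

    @abstractmethod
    def __call__(self, source: ChunkSource[T]):
        """Combine the chunks yielded by the wrapped generator."""


class ConcatAggregator(Aggregator[List[float]]):
    """Concatenates list-valued chunks into one flat list."""

    def __call__(self, source: ChunkSource[List[float]]) -> List[float]:
        result: List[float] = []
        for chunk in source.generator_factory():
            result.extend(chunk)
        return result


def chunks() -> Generator[List[float], None, None]:
    # Two chunks of different sizes.
    yield [1.0, 2.0]
    yield [3.0]


flat = ConcatAggregator()(ChunkSource(chunks))
```

With real tensors, the `extend` call would presumably be replaced by the concatenation routine of the tensor library in use.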
ListAggregator
Bases: SequenceAggregator
__call__
__call__(tensor_sequence: LazyChunkSequence) -> List[TensorType]
Aggregates tensors from a single-level generator into a list. This method simply collects each tensor emitted by the generator into a single list.
PARAMETER | DESCRIPTION |
---|---|
tensor_sequence | Object wrapping a generator that yields TensorType objects. TYPE: LazyChunkSequence |
RETURNS | DESCRIPTION |
---|---|
List[TensorType] | A list containing all the tensors provided by tensor_sequence. |
Source code in src/pydvl/influence/array.py
NestedSequenceAggregator
Bases: Generic[TensorType], ABC
__call__
abstractmethod
__call__(nested_sequence_of_tensors: NestedLazyChunkSequence)
Aggregates tensors from a nested sequence of tensors.
Implement this method to specify how tensors, nested in two layers of generators, should be combined. Useful for complex data structures where tensors are not directly accessible in a flat list.
NestedListAggregator
Bases: NestedSequenceAggregator
__call__
__call__(
nested_sequence_of_tensors: NestedLazyChunkSequence,
) -> List[List[TensorType]]
Aggregates tensors from a nested generator structure into a list of lists. Each inner generator is converted into a list of tensors, resulting in a nested list structure.
PARAMETER | DESCRIPTION |
---|---|
nested_sequence_of_tensors | Object wrapping a generator of generators, where each inner generator yields TensorType objects. TYPE: NestedLazyChunkSequence |
RETURNS | DESCRIPTION |
---|---|
List[List[TensorType]] | A list of lists, where each inner list contains tensors returned from one of the inner generators. |
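The nested aggregation described above can be sketched in a few lines of plain Python. This is not the library's implementation: `nested_to_lists`, `row_chunks` and `blocks` are hypothetical names, and integers stand in for tensors.

```python
from typing import Callable, Generator, List

# Hypothetical type alias: a factory producing a generator of generators.
NestedFactory = Callable[[], Generator[Generator[int, None, None], None, None]]


def nested_to_lists(factory: NestedFactory) -> List[List[int]]:
    """Materialize each inner generator into a list, preserving the nesting."""
    return [list(inner) for inner in factory()]


def row_chunks(row: int) -> Generator[int, None, None]:
    # Inner generator: the chunks belonging to one outer element.
    for col in range(3):
        yield 10 * row + col


def blocks() -> Generator[Generator[int, None, None], None, None]:
    # Outer generator: one inner generator per "row" of chunks.
    for row in range(2):
        yield row_chunks(row)


nested = nested_to_lists(blocks)
```

Each inner generator is consumed fully before the outer generator advances, so the result mirrors the two-level structure of the input.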
LazyChunkSequence
LazyChunkSequence(
generator_factory: Callable[[], Generator[TensorType, None, None]],
len_generator: Optional[int] = None,
)
Bases: Generic[TensorType]
A class representing a chunked and lazily evaluated array, where the chunking is restricted to the first dimension.
This class is designed to handle large arrays that don't fit in memory. It works by generating chunks of the array on demand and can also convert these chunks to a Zarr array for efficient storage and retrieval.
ATTRIBUTE | DESCRIPTION |
---|---|
generator_factory | A factory function that, when called, returns a generator yielding chunks of the large array. |
len_generator | If the number of elements from the generator is known from the context, this optional parameter can be used to improve logging by adding a progress bar. |
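One plausible motivation for storing a factory rather than a generator is that a Python generator can only be consumed once, while a factory can restart iteration on every call. The sketch below shows the difference; `make_chunks` is a hypothetical chunk producer, not part of pyDVL.

```python
from typing import Generator, List


def make_chunks() -> Generator[List[int], None, None]:
    """Hypothetical chunk producer; each call starts a fresh generator."""
    for start in range(0, 6, 2):
        yield [start, start + 1]


# A bare generator is exhausted after one pass.
gen = make_chunks()
first_pass = list(gen)
second_pass = list(gen)  # empty: the generator has nothing left to yield

# A factory can be called again to restart iteration from the beginning.
factory = make_chunks
fresh_first = list(factory())
fresh_second = list(factory())
```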
compute
compute(aggregator: Optional[SequenceAggregator] = None)
Computes and optionally aggregates the chunks of the array using the provided aggregator. This method initiates the generation of chunks and then combines them according to the aggregator's logic.
PARAMETER | DESCRIPTION |
---|---|
aggregator | An optional aggregator for combining the chunks of the array. If None, a default ListAggregator is used to simply collect the chunks into a list. TYPE: Optional[SequenceAggregator] DEFAULT: None |
RETURNS | DESCRIPTION |
---|---|
The aggregated result of all chunks of the array, the format of which depends on the aggregator used. |
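The dispatch between the default and a user-supplied aggregator can be sketched as follows. `SketchChunkSequence` and `total` are hypothetical and only mimic the behaviour described above; the real class may differ in details such as progress logging.

```python
from typing import Any, Callable, Generator, Optional


class SketchChunkSequence:
    """Hypothetical, simplified stand-in for a lazily evaluated chunk sequence."""

    def __init__(self, generator_factory: Callable[[], Generator[Any, None, None]]):
        self.generator_factory = generator_factory

    def compute(
        self, aggregator: Optional[Callable[["SketchChunkSequence"], Any]] = None
    ) -> Any:
        # Without an aggregator, fall back to collecting the chunks in a list.
        if aggregator is None:
            return list(self.generator_factory())
        return aggregator(self)


def total(seq: SketchChunkSequence) -> int:
    """Custom aggregation: sum all entries without keeping the chunks around."""
    return sum(sum(chunk) for chunk in seq.generator_factory())


seq = SketchChunkSequence(lambda: ([i, i + 1] for i in range(0, 6, 2)))
as_list = seq.compute()
as_sum = seq.compute(total)
```

A reducing aggregator such as `total` only ever holds one chunk at a time, which is the main benefit when the full array does not fit in memory.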
to_zarr
to_zarr(
path_or_url: Union[str, StoreLike],
converter: NumpyConverter,
return_stored: bool = False,
overwrite: bool = False,
) -> Optional[Array]
Converts the array into Zarr format, a storage format optimized for large arrays, and stores it at the specified path or URL. This method is suitable for scenarios where the data needs to be saved for later use or for large datasets requiring efficient storage.
PARAMETER | DESCRIPTION |
---|---|
path_or_url | The file path or URL where the Zarr array will be stored. Also accepts instances of Zarr stores. |
converter | A converter for transforming blocks into NumPy arrays compatible with Zarr. TYPE: NumpyConverter |
return_stored | If True, the method returns the stored Zarr array; otherwise, it returns None. TYPE: bool DEFAULT: False |
overwrite | If True, overwrites existing data at the given path_or_url. If False, an error is raised in case of existing data. TYPE: bool DEFAULT: False |
RETURNS | DESCRIPTION |
---|---|
Optional[Array] | The Zarr array if return_stored is True; otherwise, None. |
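To make the interplay of `return_stored` and `overwrite` concrete, here is a sketch that uses a plain dictionary in place of a Zarr store. Everything in it is an assumption for illustration: `store_chunks` is hypothetical, nested lists stand in for NumPy arrays, and `FileExistsError` is only a placeholder for whatever error the real method raises.

```python
from typing import Dict, Iterable, List, Optional

Row = List[float]


def store_chunks(
    store: Dict[str, List[Row]],
    key: str,
    chunks: Iterable[List[Row]],
    return_stored: bool = False,
    overwrite: bool = False,
) -> Optional[List[Row]]:
    """Append chunks of rows under `key`, mimicking the flags described above."""
    if key in store and not overwrite:
        # Placeholder error type; the real method may raise something else.
        raise FileExistsError(f"Data already exists at {key!r}")
    stored: List[Row] = []
    for chunk in chunks:
        # Each chunk extends the stored array along the first dimension only.
        stored.extend(chunk)
    store[key] = stored
    return stored if return_stored else None


store: Dict[str, List[Row]] = {}
chunks = [[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]]]

nothing = store_chunks(store, "influences", chunks)
stored = store_chunks(store, "influences", chunks, return_stored=True, overwrite=True)

try:
    store_chunks(store, "influences", chunks)  # data exists, overwrite=False
    raised = False
except FileExistsError:
    raised = True
```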
NestedLazyChunkSequence
NestedLazyChunkSequence(
generator_factory: Callable[
[], Generator[Generator[TensorType, None, None], None, None]
],
len_outer_generator: Optional[int] = None,
)
Bases: Generic[TensorType]
A class representing a chunked and lazily evaluated array, where the chunking is restricted to the first two dimensions.
This class is designed for handling large arrays where individual chunks are loaded and processed lazily. It supports converting these chunks into a Zarr array for efficient storage and retrieval, with chunking applied along the first two dimensions.
ATTRIBUTE | DESCRIPTION |
---|---|
generator_factory | A factory function that returns a generator of generators. Each inner generator yields chunks. |
len_outer_generator | If the number of elements from the outer generator is known from the context, this optional parameter can be used to improve logging by adding a progress bar. |
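Chunking along the first two dimensions can be pictured as a grid of blocks: the outer generator walks over rows of blocks, and each inner generator yields the blocks of one row from left to right. The sketch below assembles such a grid into a full matrix. It assumes that layout, which this page does not spell out, and uses nested lists in place of tensors; `assemble`, `inner` and `outer` are hypothetical names.

```python
from typing import Generator, Iterable, List

Block = List[List[int]]  # a small 2D block; nested lists stand in for tensors


def assemble(nested_blocks: Iterable[Iterable[Block]]) -> List[List[int]]:
    """Join blocks along columns within a row of blocks, then stack the rows."""
    matrix: List[List[int]] = []
    for row_of_blocks in nested_blocks:
        row_block: List[List[int]] = []
        for block in row_of_blocks:
            if not row_block:
                # First block of this row: copy it so the input is not mutated.
                row_block = [list(r) for r in block]
            else:
                # Later blocks widen each existing row (second dimension).
                for target, extra in zip(row_block, block):
                    target.extend(extra)
        # A finished row of blocks is appended along the first dimension.
        matrix.extend(row_block)
    return matrix


def inner(i: int) -> Generator[Block, None, None]:
    # Column blocks for the i-th row of blocks.
    yield [[i, i]]
    yield [[i + 10]]


def outer() -> Generator[Generator[Block, None, None], None, None]:
    for i in range(2):
        yield inner(i)


full = assemble(outer())
```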
compute
compute(aggregator: Optional[NestedSequenceAggregator] = None)
Computes and optionally aggregates the chunks of the array using the provided aggregator. This method initiates the generation of chunks and then combines them according to the aggregator's logic.
PARAMETER | DESCRIPTION |
---|---|
aggregator | An optional aggregator for combining the chunks of the array. If None, a default NestedListAggregator is used to simply collect the chunks into a list of lists. TYPE: Optional[NestedSequenceAggregator] DEFAULT: None |
RETURNS | DESCRIPTION |
---|---|
The aggregated result of all chunks of the array, the format of which depends on the aggregator used. |
to_zarr
to_zarr(
path_or_url: Union[str, StoreLike],
converter: NumpyConverter,
return_stored: bool = False,
overwrite: bool = False,
) -> Optional[Array]
Converts the array into Zarr format, a storage format optimized for large arrays, and stores it at the specified path or URL. This method is suitable for scenarios where the data needs to be saved for later use or for large datasets requiring efficient storage.
PARAMETER | DESCRIPTION |
---|---|
path_or_url | The file path or URL where the Zarr array will be stored. Also accepts instances of Zarr stores. |
converter | A converter for transforming blocks into NumPy arrays compatible with Zarr. TYPE: NumpyConverter |
return_stored | If True, the method returns the stored Zarr array; otherwise, it returns None. TYPE: bool DEFAULT: False |
overwrite | If True, overwrites existing data at the given path_or_url. If False, an error is raised in case of existing data. TYPE: bool DEFAULT: False |
RETURNS | DESCRIPTION |
---|---|
Optional[Array] | The Zarr array if return_stored is True; otherwise, None. |