
General

This module contains influence calculation functions for general models, as introduced in (Koh and Liang, 2017) [1].

References


  1. Koh, P.W., Liang, P., 2017. Understanding Black-box Predictions via Influence Functions. In: Proceedings of the 34th International Conference on Machine Learning, pp. 1885–1894. PMLR. 

InfluenceType

Bases: str, Enum

Enum representation for the types of influence.

ATTRIBUTE DESCRIPTION
Up

Up-weighting a training point; see section 2.1 of (Koh and Liang, 2017) [1].

Perturbation

Perturbing a training point; see section 2.2 of (Koh and Liang, 2017) [1].
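Because the enum derives from both str and Enum, its members behave like strings. The following is a minimal sketch of that pattern, not the library's definition: the member values 'up' and 'perturbation' are assumed from the valid choices listed for influence_type further down this page.

```python
from enum import Enum


class InfluenceType(str, Enum):
    """Sketch of a string-valued enum; the member values are assumptions."""

    Up = "up"
    Perturbation = "perturbation"


# Members can be looked up from their string value ...
selected = InfluenceType("perturbation")
# ... and compare equal to plain strings because of the `str` base class.
is_plain_string_equal = InfluenceType.Up == "up"
```

Converting user input with `InfluenceType(value)` also rejects unknown names early, since an unrecognised string raises a ValueError.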

compute_influence_factors(model, training_data, test_data, inversion_method, *, hessian_perturbation=0.0, progress=False, **kwargs)

Calculates influence factors of a model for training and test data.

Given a test point \(z_{test} = (x_{test}, y_{test})\), a loss \(L(z_{test}, \theta)\) (\(\theta\) being the parameters of the model) and the Hessian of the model \(H_{\theta}\), influence factors are defined as:

\[ s_{test} = H_{\theta}^{-1} \operatorname{grad}_{\theta} L(z_{test}, \theta). \]

They are used for efficient influence calculation. This method first (implicitly) calculates the Hessian and then (explicitly) finds the influence factors for the model using the given inversion method. The parameter hessian_perturbation is used to regularize the inversion of the Hessian. For more information, refer to section 3 of (Koh and Liang, 2017) [1].

PARAMETER DESCRIPTION
model

A model wrapped in the TwiceDifferentiable interface.

TYPE: TwiceDifferentiable

training_data

DataLoader containing the training data.

TYPE: DataLoaderType

test_data

DataLoader containing the test data.

TYPE: DataLoaderType

inversion_method

Method used to compute inverse-Hessian-vector products.

TYPE: InversionMethod

hessian_perturbation

Regularization value added to the diagonal of the Hessian before inversion.

TYPE: float DEFAULT: 0.0

progress

If True, display progress bars.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
array

A result object that unpacks into two elements. The first is an array of shape (N, D) containing the influence factors, with one row per test sample (N) and one column per model parameter (D).

TYPE: InverseHvpResult
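The definition of the influence factors given for this function can be illustrated with a small NumPy sketch. It assumes a linear model with a squared loss, for which the gradient and the Hessian have closed forms, and it solves the linear system directly. All names are hypothetical and none of this is the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_params = 50, 4, 3

x_train = rng.normal(size=(n_train, n_params))
x_test = rng.normal(size=(n_test, n_params))
y_test = rng.normal(size=n_test)
theta = rng.normal(size=n_params)  # stand-in for the fitted parameters


def loss_grad(x, y, theta):
    """Gradient w.r.t. theta of 0.5 * (x @ theta - y) ** 2, one row per sample."""
    residual = x @ theta - y
    return residual[:, None] * x


# Hessian of the mean training loss; for this loss it does not depend on y.
hessian = x_train.T @ x_train / n_train
hessian_perturbation = 0.0

test_grads = loss_grad(x_test, y_test, theta)  # shape (N, D)

# Solve (H + lambda * I) s = grad for every test point instead of inverting H.
influence_factors = np.linalg.solve(
    hessian + hessian_perturbation * np.eye(n_params), test_grads.T
).T  # shape (N, D)
```

Each row of `influence_factors` corresponds to one test point, matching the (N, D) layout described above.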

Source code in src/pydvl/influence/general.py
def compute_influence_factors(
    model: TwiceDifferentiable,
    training_data: DataLoaderType,
    test_data: DataLoaderType,
    inversion_method: InversionMethod,
    *,
    hessian_perturbation: float = 0.0,
    progress: bool = False,
    **kwargs: Any,
) -> InverseHvpResult:
    r"""
    Calculates influence factors of a model for training and test data.

    Given a test point \(z_{test} = (x_{test}, y_{test})\), a loss \(L(z_{test}, \theta)\)
    (\(\theta\) being the parameters of the model) and the Hessian of the model \(H_{\theta}\),
    influence factors are defined as:

    \[
    s_{test} = H_{\theta}^{-1} \operatorname{grad}_{\theta} L(z_{test}, \theta).
    \]

    They are used for efficient influence calculation. This method first (implicitly) calculates
    the Hessian and then (explicitly) finds the influence factors for the model using the given
    inversion method. The parameter `hessian_perturbation` is used to regularize the inversion of
    the Hessian. For more info, refer to (Koh and Liang, 2017)<sup><a href="#koh_liang_2017">1</a></sup>, section 3.

    Args:
        model: A model wrapped in the TwiceDifferentiable interface.
        training_data: DataLoader containing the training data.
        test_data: DataLoader containing the test data.
        inversion_method: Name of method for computing inverse hessian vector products.
        hessian_perturbation: Regularization of the hessian.
        progress: If True, display progress bars.

    Returns:
        array: An array of size (N, D) containing the influence factors for each dimension (D) and test sample (N).

    """

    tensor_util: Type[TensorUtilities] = TensorUtilities.from_twice_differentiable(
        model
    )

    stack = tensor_util.stack
    unsqueeze = tensor_util.unsqueeze
    cat_gen = tensor_util.cat_gen
    cat = tensor_util.cat

    def test_grads() -> Generator[TensorType, None, None]:
        for x_test, y_test in maybe_progress(
            test_data, progress, desc="Batch Test Gradients"
        ):
            yield stack(
                [
                    model.grad(inpt, target)
                    for inpt, target in zip(unsqueeze(x_test, 1), y_test)
                ]
            )  # type:ignore

    try:
        # in case test_data is a torch DataLoader created from a Dataset,
        # we can pre-allocate the result tensor to reduce memory consumption
        resulting_shape = (len(test_data.dataset), model.num_params)  # type:ignore
        rhs = cat_gen(
            test_grads(), resulting_shape, model  # type:ignore
        )  # type:ignore
    except Exception as e:
        logger.warning(
            f"Failed to pre-allocate result tensor: {e}\n"
            f"Evaluate all resulting tensor and concatenate"
        )
        rhs = cat(list(test_grads()))

    return solve_hvp(
        inversion_method,
        model,
        training_data,
        rhs,
        hessian_perturbation=hessian_perturbation,
        **kwargs,
    )
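The listing above first tries to write the batched gradients into a pre-allocated result and falls back to collecting and concatenating them. A rough NumPy sketch of that pattern follows. The helper names `cat_gen_sketch` and `gradient_batches` are hypothetical and only illustrate the idea, they are not the library's `cat_gen`.

```python
from typing import Iterable, Tuple

import numpy as np


def cat_gen_sketch(batches: Iterable[np.ndarray], shape: Tuple[int, int]) -> np.ndarray:
    """Write the batches row-wise into one pre-allocated array."""
    result = np.empty(shape)
    row = 0
    for batch in batches:
        result[row : row + len(batch)] = batch
        row += len(batch)
    return result


def gradient_batches():
    """Stand-in generator yielding three batches of two gradients each."""
    for start in range(0, 6, 2):
        yield np.arange(start * 3, (start + 2) * 3, dtype=float).reshape(2, 3)


# Pre-allocation needs the total number of rows up front ...
preallocated = cat_gen_sketch(gradient_batches(), (6, 3))
# ... while the fallback holds every batch in a list before concatenating.
concatenated = np.concatenate(list(gradient_batches()))
```

Both paths give the same array. The pre-allocated variant avoids keeping all batches and the concatenated copy in memory at the same time.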

compute_influences_up(model, input_data, influence_factors, *, progress=False)

Given the model, the training points, and the influence factors, this function calculates the influences using the up-weighting method.

The procedure involves two main steps:

1. Calculating the gradient of the loss with respect to the model parameters for each sample in input_data (\(\operatorname{grad}_{\theta} L\), where \(L\) is the loss of a single point and \(\theta\) are the parameters of the model).
2. Multiplying each gradient with the influence factors.

For a detailed description of the methodology, see section 2.1 of (Koh and Liang, 2017) [1].

PARAMETER DESCRIPTION
model

A model that implements the TwiceDifferentiable interface.

TYPE: TwiceDifferentiable

input_data

DataLoader containing the samples for which the influence will be calculated.

TYPE: DataLoaderType

influence_factors

Array containing pre-computed influence factors.

TYPE: TensorType

progress

If set to True, progress bars will be displayed during computation.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
TensorType

An array of shape [NxM], where N is the number of influence factors, and M is the number of input samples.
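The multiplication step and the resulting [NxM] shape can be sketched with NumPy, using the same index pattern as the einsum call in the source listing for this function. The arrays are random stand-ins for real influence factors and per-sample gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
influence_factors = rng.normal(size=(4, 3))  # N = 4 factors, D = 3 parameters
input_grads = rng.normal(size=(5, 3))        # M = 5 input samples, D = 3 parameters

# Entry [t, v] is the dot product of factor t with the gradient of sample v.
influences = np.einsum("ta,va->tv", influence_factors, input_grads)
```

The contraction runs over the parameter axis, so it is equivalent to `influence_factors @ input_grads.T`.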

Source code in src/pydvl/influence/general.py
def compute_influences_up(
    model: TwiceDifferentiable,
    input_data: DataLoaderType,
    influence_factors: TensorType,
    *,
    progress: bool = False,
) -> TensorType:
    r"""
    Given the model, the training points, and the influence factors, this function calculates the
    influences using the up-weighting method.

    The procedure involves two main steps:
    1. Calculating the gradients of the model with respect to each training sample
       (\(\operatorname{grad}_{\theta} L\), where \(L\) is the loss of a single point and \(\theta\) are the
       parameters of the model).
    2. Multiplying each gradient with the influence factors.

    For a detailed description of the methodology, see section 2.1 of (Koh and Liang, 2017)<sup><a href="#koh_liang_2017">1</a></sup>.

    Args:
        model: A model that implements the TwiceDifferentiable interface.
        input_data: DataLoader containing the samples for which the influence will be calculated.
        influence_factors: Array containing pre-computed influence factors.
        progress: If set to True, progress bars will be displayed during computation.

    Returns:
        An array of shape [NxM], where N is the number of influence factors, and M is the number of input samples.
    """

    tensor_util: Type[TensorUtilities] = TensorUtilities.from_twice_differentiable(
        model
    )

    stack = tensor_util.stack
    unsqueeze = tensor_util.unsqueeze
    cat_gen = tensor_util.cat_gen
    cat = tensor_util.cat
    einsum = tensor_util.einsum

    def train_grads() -> Generator[TensorType, None, None]:
        for x, y in maybe_progress(
            input_data, progress, desc="Batch Split Input Gradients"
        ):
            yield stack(
                [model.grad(inpt, target) for inpt, target in zip(unsqueeze(x, 1), y)]
            )  # type:ignore

    try:
        # in case input_data is a torch DataLoader created from a Dataset,
        # we can pre-allocate the result tensor to reduce memory consumption
        resulting_shape = (len(input_data.dataset), model.num_params)  # type:ignore
        train_grad_tensor = cat_gen(
            train_grads(), resulting_shape, model  # type:ignore
        )  # type:ignore
    except Exception as e:
        logger.warning(
            f"Failed to pre-allocate result tensor: {e}\n"
            f"Evaluate all resulting tensor and concatenate"
        )
        train_grad_tensor = cat([x for x in train_grads()])  # type:ignore

    return einsum("ta,va->tv", influence_factors, train_grad_tensor)  # type:ignore

compute_influences_pert(model, input_data, influence_factors, *, progress=False)

Calculates the influence values based on the influence factors and training samples using the perturbation method.

The process involves two main steps:

1. Calculating the gradient of the loss with respect to the model parameters for each sample in input_data (\(\operatorname{grad}_{\theta} L\), where \(L\) is the loss of the model for a single data point and \(\theta\) are the parameters of the model).
2. Using the method TwiceDifferentiable.mvp to efficiently compute the product of the influence factors and \(\operatorname{grad}_x \operatorname{grad}_{\theta} L\).

For a detailed description of the methodology, see section 2.2 of (Koh and Liang, 2017) [1].

PARAMETER DESCRIPTION
model

A model that implements the TwiceDifferentiable interface.

TYPE: TwiceDifferentiable

input_data

DataLoader containing the samples for which the influence will be calculated.

TYPE: DataLoaderType

influence_factors

Array containing pre-computed influence factors.

TYPE: TensorType

progress

If set to True, progress bars will be displayed during computation.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
TensorType

A 3D array with shape [NxMxP], where N is the number of influence factors, M is the number of input samples, and P is the number of features.
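To make the [NxMxP] layout concrete, the sketch below computes the product of the influence factors with \(\operatorname{grad}_x \operatorname{grad}_{\theta} L\) analytically. It assumes a linear model with a bias term and a squared loss, so that the number of parameters (P + 1) differs from the number of features (P). All names are hypothetical. The library itself obtains this quantity through TwiceDifferentiable.mvp rather than a closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors, n_samples, n_features = 2, 3, 4

w = rng.normal(size=n_features)  # weights of the assumed linear model
b = 0.5                          # bias of the assumed linear model
x = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)
# One row per influence factor; the last column belongs to the bias.
factors = rng.normal(size=(n_factors, n_features + 1))


def factor_dot_param_grad(x_i, y_i, s):
    """s . grad_theta L for L = 0.5 * (w @ x + b - y) ** 2 and theta = (w, b)."""
    residual = w @ x_i + b - y_i
    return residual * (s[:-1] @ x_i + s[-1])


def perturbation_influence(x_i, y_i, s):
    """Analytic gradient of `factor_dot_param_grad` with respect to x_i."""
    residual = w @ x_i + b - y_i
    return w * (s[:-1] @ x_i + s[-1]) + residual * s[:-1]


# Stack per-sample results along axis 1, giving shape (N, M, P).
influences = np.stack(
    [
        np.stack([perturbation_influence(x_i, y_i, s) for s in factors])
        for x_i, y_i in zip(x, y)
    ],
    axis=1,
)
```

Each entry `influences[n, m]` is a vector with one value per input feature of sample m.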

Source code in src/pydvl/influence/general.py
def compute_influences_pert(
    model: TwiceDifferentiable,
    input_data: DataLoaderType,
    influence_factors: TensorType,
    *,
    progress: bool = False,
) -> TensorType:
    r"""
    Calculates the influence values based on the influence factors and training samples using the perturbation method.

    The process involves two main steps:
    1. Calculating the gradient of the model with respect to each training sample
       (\(\operatorname{grad}_{\theta} L\), where \(L\) is the loss of the model for a single data point and \(\theta\)
       are the parameters of the model).
    2. Using the method [TwiceDifferentiable.mvp][pydvl.influence.twice_differentiable.TwiceDifferentiable.mvp]
       to efficiently compute the product of the
       influence factors and \(\operatorname{grad}_x \operatorname{grad}_{\theta} L\).

    For a detailed methodology, see section 2.2 of (Koh and Liang, 2017)<sup><a href="#koh_liang_2017">1</a></sup>.

    Args:
        model: A model that implements the TwiceDifferentiable interface.
        input_data: DataLoader containing the samples for which the influence will be calculated.
        influence_factors: Array containing pre-computed influence factors.
        progress: If set to True, progress bars will be displayed during computation.

    Returns:
        A 3D array with shape [NxMxP], where N is the number of influence factors,
            M is the number of input samples, and P is the number of features.
    """

    tensor_util: Type[TensorUtilities] = TensorUtilities.from_twice_differentiable(
        model
    )
    stack = tensor_util.stack
    tu_slice = tensor_util.slice
    reshape = tensor_util.reshape
    get_element = tensor_util.get_element
    shape = tensor_util.shape

    all_pert_influences = []
    for x, y in maybe_progress(
        input_data,
        progress,
        desc="Batch Influence Perturbation",
    ):
        for i in range(len(x)):
            tensor_x = tu_slice(x, i, i + 1)
            grad_xy = model.grad(tensor_x, get_element(y, i), create_graph=True)
            perturbation_influences = model.mvp(
                grad_xy,
                influence_factors,
                backprop_on=tensor_x,
            )
            all_pert_influences.append(
                reshape(perturbation_influences, (-1, *shape(get_element(x, i))))
            )

    return stack(all_pert_influences, axis=1)  # type:ignore

compute_influences(differentiable_model, training_data, *, test_data=None, input_data=None, inversion_method=InversionMethod.Direct, influence_type=InfluenceType.Up, hessian_regularization=0.0, progress=False, **kwargs)

Calculates the influence of each input data point on the specified test points.

This method operates in two primary stages:

1. Computes the influence factors for all test points, using the model and its training data.
2. Uses these factors to derive the influences over the complete set of input data.

The influence calculation relies on the twice-differentiable nature of the provided model.

PARAMETER DESCRIPTION
differentiable_model

A model bundled with its corresponding loss in the TwiceDifferentiable wrapper.

TYPE: TwiceDifferentiable

training_data

DataLoader instance supplying the training data. It is used to compute the Hessian of the model's loss.

TYPE: DataLoaderType

test_data

DataLoader instance with the test samples. Defaults to training_data if None.

TYPE: Optional[DataLoaderType] DEFAULT: None

input_data

DataLoader instance holding samples whose influences need to be computed. Defaults to training_data if None.

TYPE: Optional[DataLoaderType] DEFAULT: None

inversion_method

An enumeration value selecting the method used to compute inverse-Hessian-vector products; see InversionMethod.

TYPE: InversionMethod DEFAULT: Direct

progress

A boolean indicating whether progress bars should be displayed during computation.

TYPE: bool DEFAULT: False

influence_type

Determines how influences are computed. Valid choices are 'up' (up-weighting) and 'perturbation'; see sections 2.1 and 2.2 of (Koh and Liang, 2017) [1].

TYPE: InfluenceType DEFAULT: Up

hessian_regularization

A lambda value used in Hessian regularization. The regularized Hessian, \( H_{reg} \), is computed as \( H + \lambda \times I \), where \( I \) is the identity matrix and \( H \) is the unmodified Hessian. Because this shifts every eigenvalue of \( H \) by \( \lambda \), it helps keep the Hessian positive definite for models whose Hessian is singular or ill-conditioned. The value is passed to compute_influence_factors as hessian_perturbation.

TYPE: float DEFAULT: 0.0
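The effect of the regularization \( H + \lambda \times I \) can be checked numerically. This sketch assumes a squared loss on a linear model with fewer samples than parameters, which makes the unregularized Hessian singular. The variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two samples but three parameters: X^T X / n has rank 2 and is singular.
x_train = rng.normal(size=(2, 3))
hessian = x_train.T @ x_train / len(x_train)

hessian_regularization = 0.1
regularized = hessian + hessian_regularization * np.eye(3)

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eig_plain = np.linalg.eigvalsh(hessian)
eig_regularized = np.linalg.eigvalsh(regularized)
```

The smallest eigenvalue moves from (numerically) zero to lambda, so the regularized matrix can be inverted.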

RETURNS DESCRIPTION
TensorType

The shape of this array depends on the influence_type. If 'up', the shape is [NxM], where N is the number of test points and M is the number of samples in input_data (the training points by default). If 'perturbation', the shape is [NxMxP], where P is the number of input features.

Source code in src/pydvl/influence/general.py
def compute_influences(
    differentiable_model: TwiceDifferentiable,
    training_data: DataLoaderType,
    *,
    test_data: Optional[DataLoaderType] = None,
    input_data: Optional[DataLoaderType] = None,
    inversion_method: InversionMethod = InversionMethod.Direct,
    influence_type: InfluenceType = InfluenceType.Up,
    hessian_regularization: float = 0.0,
    progress: bool = False,
    **kwargs: Any,
) -> TensorType:  # type: ignore # ToDO fix typing
    r"""
    Calculates the influence of each input data point on the specified test points.

    This method operates in two primary stages:
    1. Computes the influence factors for all test points concerning the model and its training data.
    2. Uses these factors to derive the influences over the complete set of input data.

    The influence calculation relies on the twice-differentiable nature of the provided model.

    Args:
        differentiable_model: A model bundled with its corresponding loss in the `TwiceDifferentiable` wrapper.
        training_data: DataLoader instance supplying the training data. This data is pivotal in computing the
                       Hessian matrix for the model's loss.
        test_data: DataLoader instance with the test samples. Defaults to `training_data` if None.
        input_data: DataLoader instance holding samples whose influences need to be computed. Defaults to
                    `training_data` if None.
        inversion_method: An enumeration value determining the approach for inverting matrices
            or computing inverse operations, see [.inversion.InversionMethod]
        progress: A boolean indicating whether progress bars should be displayed during computation.
        influence_type: Determines the methodology for computing influences.
            Valid choices include 'up' (for up-weighting) and 'perturbation'.
            For an in-depth understanding, see (Koh and Liang, 2017)<sup><a href="#koh_liang_2017">1</a></sup>.
        hessian_regularization: A lambda value used in Hessian regularization. The regularized Hessian, \( H_{reg} \),
            is computed as \( H + \lambda \times I \), where \( I \) is the identity matrix and \( H \)
            is the simple, unmodified Hessian. This regularization is typically utilized for more
            sophisticated models to ensure that the Hessian remains positive definite.

    Returns:
        The shape of this array varies based on the `influence_type`. If 'up', the shape is [NxM], where
            N denotes the number of test points and M denotes the number of training points. Conversely, if the
            influence_type is 'perturbation', the shape is [NxMxP], with P representing the number of input features.
    """

    if input_data is None:
        input_data = deepcopy(training_data)
    if test_data is None:
        test_data = deepcopy(training_data)

    influence_factors, _ = compute_influence_factors(
        differentiable_model,
        training_data,
        test_data,
        inversion_method,
        hessian_perturbation=hessian_regularization,
        progress=progress,
        **kwargs,
    )

    return influence_type_registry[influence_type](
        differentiable_model,
        input_data,
        influence_factors,
        progress=progress,
    )
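The final step of the listing above selects the computation through a registry keyed by the influence type. A minimal sketch of that dispatch pattern follows. The stub functions only return the documented result shapes, and every name apart from InfluenceType is hypothetical. The enum values are assumed from the valid choices listed for influence_type.

```python
from enum import Enum
from typing import Callable, Dict, Tuple


class InfluenceType(str, Enum):
    """Sketch enum; the member values are assumptions."""

    Up = "up"
    Perturbation = "perturbation"


def up_shape(n_test: int, n_input: int, n_features: int) -> Tuple[int, ...]:
    """Stub for the up-weighting branch: one value per (test, input) pair."""
    return (n_test, n_input)


def perturbation_shape(n_test: int, n_input: int, n_features: int) -> Tuple[int, ...]:
    """Stub for the perturbation branch: one value per input feature as well."""
    return (n_test, n_input, n_features)


registry: Dict[InfluenceType, Callable[[int, int, int], Tuple[int, ...]]] = {
    InfluenceType.Up: up_shape,
    InfluenceType.Perturbation: perturbation_shape,
}


def result_shape(influence_type, n_test: int, n_input: int, n_features: int):
    """Normalise the key to an enum member, then dispatch through the registry."""
    return registry[InfluenceType(influence_type)](n_test, n_input, n_features)
```

The key is converted with `InfluenceType(...)` before the lookup because Enum members hash by their name, so a plain string such as "up" is not guaranteed to find the member in a dictionary even though it compares equal to it.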

Last update: 2023-10-14
Created: 2023-10-14