General

This module contains influence calculation functions for general models, as introduced in (Koh and Liang, 2017)¹.

References

Koh, P.W., Liang, P., 2017. Understanding Black-box Predictions via Influence Functions. In: Proceedings of the 34th International Conference on Machine Learning, pp. 1885–1894. PMLR.

`InfluenceType`

Bases: str, Enum

Enum representation for the types of influence.

ATTRIBUTE	DESCRIPTION
`Up`	Up-weighting a training point, see section 2.1 of (Koh and Liang, 2017)¹
`Perturbation`	Perturbing a training point, see section 2.2 of (Koh and Liang, 2017)¹
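As a minimal sketch of how a `str`-based `Enum` like this behaves: the member values below are an assumption, inferred from the 'up' and 'perturbation' strings that the documentation of `influence_type` mentions, and may differ from the library's actual definition.

```python
from enum import Enum


class InfluenceType(str, Enum):
    # Member values are assumed for illustration, not copied from the library.
    Up = "up"
    Perturbation = "perturbation"


# Members can be looked up by value ...
assert InfluenceType("up") is InfluenceType.Up
# ... and, because the enum also derives from str, compared to plain strings.
assert InfluenceType.Perturbation == "perturbation"
```

Deriving from `str` means a member can be passed wherever a plain string is expected, for example as a dictionary key or a serialized configuration value.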

`compute_influence_factors(model, training_data, test_data, inversion_method, *, hessian_perturbation=0.0, progress=False, **kwargs)`

Calculates influence factors of a model for training and test data.

Given a test point \(z_{test} = (x_{test}, y_{test})\), a loss \(L(z_{test}, \theta)\) (\(\theta\) being the parameters of the model) and the Hessian of the model \(H_{\theta}\), influence factors are defined as:

\[ s_{test} = H_{\theta}^{-1} \operatorname{grad}_{\theta} L(z_{test}, \theta). \]

They are used for efficient influence calculation. This method first (implicitly) calculates the Hessian and then (explicitly) finds the influence factors for the model using the given inversion method. The parameter `hessian_perturbation` is used to regularize the inversion of the Hessian. For more info, refer to section 3 of (Koh and Liang, 2017)¹.
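The definition above can be illustrated with a small NumPy sketch. It assumes a linear model with squared loss, for which the Hessian is available in closed form, and it solves the linear system directly instead of going through the library's inversion methods. All variable names are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, dim = 50, 3, 4
x_train = rng.normal(size=(n_train, dim))
x_test = rng.normal(size=(n_test, dim))
y_test = rng.normal(size=n_test)
theta = rng.normal(size=dim)

# Assumed loss: L(z, theta) = 0.5 * (theta @ x - y) ** 2.
# The Hessian of the mean training loss is then X^T X / n.
hessian = x_train.T @ x_train / n_train

# Per-sample test gradients: (theta @ x - y) * x, stacked to shape (N, D).
residuals = x_test @ theta - y_test
test_grads = residuals[:, None] * x_test

# Regularize the Hessian before inverting it.
hessian_perturbation = 0.1
reg_hessian = hessian + hessian_perturbation * np.eye(dim)

# Solve H s = g for every test gradient instead of forming H^{-1} explicitly.
influence_factors = np.linalg.solve(reg_hessian, test_grads.T).T  # (N, D)
```

Each row of `influence_factors` is the vector \(s_{test}\) for one test point, so the result has one row per test sample and one column per model parameter.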

PARAMETER	DESCRIPTION
`model`	A model wrapped in the TwiceDifferentiable interface. TYPE: `TwiceDifferentiable`
`training_data`	DataLoader containing the training data. TYPE: `DataLoaderType`
`test_data`	DataLoader containing the test data. TYPE: `DataLoaderType`
`inversion_method`	Name of method for computing inverse hessian vector products. TYPE: `InversionMethod`
`hessian_perturbation`	Regularization of the hessian. TYPE: `float` DEFAULT: `0.0`
`progress`	If True, display progress bars. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`array`	An array of size (N, D) containing the influence factors, with one row per test sample (N) and one column per model parameter (D). TYPE: `InverseHvpResult`

Source code in src/pydvl/influence/general.py

def compute_influence_factors(
    model: TwiceDifferentiable,
    training_data: DataLoaderType,
    test_data: DataLoaderType,
    inversion_method: InversionMethod,
    *,
    hessian_perturbation: float = 0.0,
    progress: bool = False,
    **kwargs: Any,
) -> InverseHvpResult:
    r"""
    Calculates influence factors of a model for training and test data.

    Given a test point \(z_{test} = (x_{test}, y_{test})\), a loss \(L(z_{test}, \theta)\)
    (\(\theta\) being the parameters of the model) and the Hessian of the model \(H_{\theta}\),
    influence factors are defined as:

    \[
    s_{test} = H_{\theta}^{-1} \operatorname{grad}_{\theta} L(z_{test}, \theta).
    \]

    They are used for efficient influence calculation. This method first (implicitly) calculates
    the Hessian and then (explicitly) finds the influence factors for the model using the given
    inversion method. The parameter `hessian_perturbation` is used to regularize the inversion of
    the Hessian. For more info, refer to section 3 of (Koh and Liang, 2017)<sup><a href="#koh_liang_2017">1</a></sup>.

    Args:
        model: A model wrapped in the TwiceDifferentiable interface.
        training_data: DataLoader containing the training data.
        test_data: DataLoader containing the test data.
        inversion_method: Name of method for computing inverse hessian vector products.
        hessian_perturbation: Regularization of the hessian.
        progress: If True, display progress bars.

    Returns:
        array: An array of size (N, D) containing the influence factors for each dimension (D) and test sample (N).

    """

    tensor_util: Type[TensorUtilities] = TensorUtilities.from_twice_differentiable(
        model
    )

    stack = tensor_util.stack
    unsqueeze = tensor_util.unsqueeze
    cat_gen = tensor_util.cat_gen
    cat = tensor_util.cat

    def test_grads() -> Generator[TensorType, None, None]:
        for x_test, y_test in maybe_progress(
            test_data, progress, desc="Batch Test Gradients"
        ):
            yield stack(
                [
                    model.grad(inpt, target)
                    for inpt, target in zip(unsqueeze(x_test, 1), y_test)
                ]
            )  # type:ignore

    try:
        # in case test_data is a torch DataLoader created from a Dataset,
        # we can pre-allocate the result tensor to reduce memory consumption
        resulting_shape = (len(test_data.dataset), model.num_params)  # type:ignore
        rhs = cat_gen(
            test_grads(), resulting_shape, model  # type:ignore
        )  # type:ignore
    except Exception as e:
        logger.warning(
            f"Failed to pre-allocate result tensor: {e}\n"
            "Evaluating all resulting tensors and concatenating them instead"
        )
        rhs = cat(list(test_grads()))

    return solve_hvp(
        inversion_method,
        model,
        training_data,
        rhs,
        hessian_perturbation=hessian_perturbation,
        **kwargs,
    )
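The pre-allocation pattern used in the source above (fill a buffer of known size from a generator of batches, and fall back to plain concatenation otherwise) can be sketched in NumPy as follows. The helper `cat_into_preallocated` is hypothetical and only mimics the idea; it is not the library's `cat_gen`.

```python
from typing import Iterable, Iterator, Tuple

import numpy as np


def cat_into_preallocated(
    batches: Iterable[np.ndarray], shape: Tuple[int, int]
) -> np.ndarray:
    """Hypothetical helper: write batches into one pre-allocated array."""
    result = np.empty(shape)
    start = 0
    for batch in batches:
        stop = start + len(batch)
        result[start:stop] = batch  # raises ValueError if the batch does not fit
        start = stop
    if start != shape[0]:
        raise ValueError(f"Expected {shape[0]} rows, got {start}")
    return result


def batches() -> Iterator[np.ndarray]:
    # Three batches of two gradient rows each, with four parameters per row.
    for i in range(3):
        yield np.full((2, 4), float(i))


try:
    out = cat_into_preallocated(batches(), (6, 4))
except ValueError:
    # Fallback: materialize every batch, then concatenate.
    out = np.concatenate(list(batches()))
```

With pre-allocation, only one batch and the final buffer are held in memory at a time, whereas the fallback keeps every batch alive until the concatenation has finished.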

`compute_influences_up(model, input_data, influence_factors, *, progress=False)`

Given the model, the training points, and the influence factors, this function calculates the influences using the up-weighting method.

The procedure involves two main steps:

1. Calculating the gradients of the model with respect to each training sample (\(\operatorname{grad}_{\theta} L\), where \(L\) is the loss of a single point and \(\theta\) are the parameters of the model).
2. Multiplying each gradient with the influence factors.

For a detailed description of the methodology, see section 2.1 of (Koh and Liang, 2017)¹.
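The second step amounts to a pairwise dot product between influence factors and per-sample gradients. A minimal NumPy sketch, using random arrays in place of real factors and gradients and the same einsum pattern that appears in the source code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_test, n_train, n_params = 3, 5, 4

# Stand-ins for pre-computed influence factors and per-sample training gradients.
influence_factors = rng.normal(size=(n_test, n_params))
train_grads = rng.normal(size=(n_train, n_params))

# Entry [t, v] is the dot product of factor t with the gradient of sample v.
influences = np.einsum("ta,va->tv", influence_factors, train_grads)

assert influences.shape == (n_test, n_train)
# The einsum is equivalent to a plain matrix product.
assert np.allclose(influences, influence_factors @ train_grads.T)
```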

PARAMETER	DESCRIPTION
`model`	A model that implements the TwiceDifferentiable interface. TYPE: `TwiceDifferentiable`
`input_data`	DataLoader containing the samples for which the influence will be calculated. TYPE: `DataLoaderType`
`influence_factors`	Array containing pre-computed influence factors. TYPE: `TensorType`
`progress`	If set to True, progress bars will be displayed during computation. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`TensorType`	An array of shape [NxM], where N is the number of influence factors, and M is the number of input samples.

Source code in src/pydvl/influence/general.py

def compute_influences_up(
    model: TwiceDifferentiable,
    input_data: DataLoaderType,
    influence_factors: TensorType,
    *,
    progress: bool = False,
) -> TensorType:
    r"""
    Given the model, the training points, and the influence factors, this function calculates the
    influences using the up-weighting method.

    The procedure involves two main steps:
    1. Calculating the gradients of the model with respect to each training sample
       (\(\operatorname{grad}_{\theta} L\), where \(L\) is the loss of a single point and \(\theta\) are the
       parameters of the model).
    2. Multiplying each gradient with the influence factors.

    For a detailed description of the methodology, see section 2.1 of (Koh and Liang, 2017)<sup><a href="#koh_liang_2017">1</a></sup>.

    Args:
        model: A model that implements the TwiceDifferentiable interface.
        input_data: DataLoader containing the samples for which the influence will be calculated.
        influence_factors: Array containing pre-computed influence factors.
        progress: If set to True, progress bars will be displayed during computation.

    Returns:
        An array of shape [NxM], where N is the number of influence factors, and M is the number of input samples.
    """

    tensor_util: Type[TensorUtilities] = TensorUtilities.from_twice_differentiable(
        model
    )

    stack = tensor_util.stack
    unsqueeze = tensor_util.unsqueeze
    cat_gen = tensor_util.cat_gen
    cat = tensor_util.cat
    einsum = tensor_util.einsum

    def train_grads() -> Generator[TensorType, None, None]:
        for x, y in maybe_progress(
            input_data, progress, desc="Batch Split Input Gradients"
        ):
            yield stack(
                [model.grad(inpt, target) for inpt, target in zip(unsqueeze(x, 1), y)]
            )  # type:ignore

    try:
        # in case input_data is a torch DataLoader created from a Dataset,
        # we can pre-allocate the result tensor to reduce memory consumption
        resulting_shape = (len(input_data.dataset), model.num_params)  # type:ignore
        train_grad_tensor = cat_gen(
            train_grads(), resulting_shape, model  # type:ignore
        )  # type:ignore
    except Exception as e:
        logger.warning(
            f"Failed to pre-allocate result tensor: {e}\n"
            "Evaluating all resulting tensors and concatenating them instead"
        )
        train_grad_tensor = cat(list(train_grads()))  # type:ignore

    return einsum("ta,va->tv", influence_factors, train_grad_tensor)  # type:ignore

`compute_influences_pert(model, input_data, influence_factors, *, progress=False)`

Calculates the influence values based on the influence factors and training samples using the perturbation method.

The process involves two main steps:

1. Calculating the gradient of the model with respect to each training sample (\(\operatorname{grad}_{\theta} L\), where \(L\) is the loss of the model for a single data point and \(\theta\) are the parameters of the model).
2. Using the method `TwiceDifferentiable.mvp` to efficiently compute the product of the influence factors and \(\operatorname{grad}_x \operatorname{grad}_{\theta} L\).

For a detailed methodology, see section 2.2 of (Koh and Liang, 2017)¹.
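To make the product of an influence factor with \(\operatorname{grad}_x \operatorname{grad}_{\theta} L\) concrete, the sketch below works it out for one sample of a linear model with squared loss, where the mixed derivative has a closed form, and checks the result against central finite differences. The loss, the function names, and the use of NumPy instead of automatic differentiation are all assumptions made for illustration.

```python
import numpy as np


def grad_theta(theta: np.ndarray, x: np.ndarray, y: float) -> np.ndarray:
    # Gradient of L = 0.5 * (theta @ x - y) ** 2 with respect to theta.
    return (theta @ x - y) * x


def perturbation_influence(
    theta: np.ndarray, x: np.ndarray, y: float, factor: np.ndarray
) -> np.ndarray:
    # factor^T (d/dx grad_theta L), using
    # d/dx_j [(theta @ x - y) * x_i] = theta_j * x_i + (theta @ x - y) * delta_ij.
    return (factor @ x) * theta + (theta @ x - y) * factor


rng = np.random.default_rng(0)
dim = 4
theta, x, factor = rng.normal(size=(3, dim))
y = 0.5

analytic = perturbation_influence(theta, x, y, factor)

# Central finite differences of factor @ grad_theta with respect to each x_j.
eps = 1e-6
numeric = np.array(
    [
        (
            factor @ grad_theta(theta, x + eps * e, y)
            - factor @ grad_theta(theta, x - eps * e, y)
        )
        / (2 * eps)
        for e in np.eye(dim)
    ]
)

assert np.allclose(analytic, numeric, atol=1e-6)
```

The result has one entry per input feature, which is why stacking it over all influence factors and all input samples gives an array with three axes.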

PARAMETER	DESCRIPTION
`model`	A model that implements the TwiceDifferentiable interface. TYPE: `TwiceDifferentiable`
`input_data`	DataLoader containing the samples for which the influence will be calculated. TYPE: `DataLoaderType`
`influence_factors`	Array containing pre-computed influence factors. TYPE: `TensorType`
`progress`	If set to True, progress bars will be displayed during computation. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`TensorType`	A 3D array with shape [NxMxP], where N is the number of influence factors, M is the number of input samples, and P is the number of features.

Source code in src/pydvl/influence/general.py

def compute_influences_pert(
    model: TwiceDifferentiable,
    input_data: DataLoaderType,
    influence_factors: TensorType,
    *,
    progress: bool = False,
) -> TensorType:
    r"""
    Calculates the influence values based on the influence factors and training samples using the perturbation method.

    The process involves two main steps:
    1. Calculating the gradient of the model with respect to each training sample
       (\(\operatorname{grad}_{\theta} L\), where \(L\) is the loss of the model for a single data point and \(\theta\)
       are the parameters of the model).
    2. Using the method [TwiceDifferentiable.mvp][pydvl.influence.twice_differentiable.TwiceDifferentiable.mvp]
       to efficiently compute the product of the
       influence factors and \(\operatorname{grad}_x \operatorname{grad}_{\theta} L\).

    For a detailed methodology, see section 2.2 of (Koh and Liang, 2017)<sup><a href="#koh_liang_2017">1</a></sup>.

    Args:
        model: A model that implements the TwiceDifferentiable interface.
        input_data: DataLoader containing the samples for which the influence will be calculated.
        influence_factors: Array containing pre-computed influence factors.
        progress: If set to True, progress bars will be displayed during computation.

    Returns:
        A 3D array with shape [NxMxP], where N is the number of influence factors,
            M is the number of input samples, and P is the number of features.
    """

    tensor_util: Type[TensorUtilities] = TensorUtilities.from_twice_differentiable(
        model
    )
    stack = tensor_util.stack
    tu_slice = tensor_util.slice
    reshape = tensor_util.reshape
    get_element = tensor_util.get_element
    shape = tensor_util.shape

    all_pert_influences = []
    for x, y in maybe_progress(
        input_data,
        progress,
        desc="Batch Influence Perturbation",
    ):
        for i in range(len(x)):
            tensor_x = tu_slice(x, i, i + 1)
            grad_xy = model.grad(tensor_x, get_element(y, i), create_graph=True)
            perturbation_influences = model.mvp(
                grad_xy,
                influence_factors,
                backprop_on=tensor_x,
            )
            all_pert_influences.append(
                reshape(perturbation_influences, (-1, *shape(get_element(x, i))))
            )

    return stack(all_pert_influences, axis=1)  # type:ignore

`compute_influences(differentiable_model, training_data, *, test_data=None, input_data=None, inversion_method=InversionMethod.Direct, influence_type=InfluenceType.Up, hessian_regularization=0.0, progress=False, **kwargs)`

Calculates the influence of each input data point on the specified test points.

This method operates in two primary stages:

1. Computes the influence factors for all test points concerning the model and its training data.
2. Uses these factors to derive the influences over the complete set of input data.

The influence calculation relies on the twice-differentiable nature of the provided model.

PARAMETER	DESCRIPTION
`differentiable_model`	A model bundled with its corresponding loss in the `TwiceDifferentiable` wrapper. TYPE: `TwiceDifferentiable`
`training_data`	DataLoader instance supplying the training data. This data is pivotal in computing the Hessian matrix for the model's loss. TYPE: `DataLoaderType`
`test_data`	DataLoader instance with the test samples. Defaults to `training_data` if None. TYPE: `Optional[DataLoaderType]` DEFAULT: `None`
`input_data`	DataLoader instance holding samples whose influences need to be computed. Defaults to `training_data` if None. TYPE: `Optional[DataLoaderType]` DEFAULT: `None`
`inversion_method`	An enumeration value determining the approach for inverting matrices or computing inverse operations, see `InversionMethod`. TYPE: `InversionMethod` DEFAULT: `Direct`
`progress`	A boolean indicating whether progress bars should be displayed during computation. TYPE: `bool` DEFAULT: `False`
`influence_type`	Determines the methodology for computing influences. Valid choices include 'up' (for up-weighting) and 'perturbation'. For an in-depth understanding, see (Koh and Liang, 2017)¹. TYPE: `InfluenceType` DEFAULT: `Up`
`hessian_regularization`	A lambda value used in Hessian regularization. The regularized Hessian, \( H_{reg} \), is computed as \( H + \lambda \times I \), where \( I \) is the identity matrix and \( H \) is the simple, unmodified Hessian. This regularization is typically utilized for more sophisticated models to ensure that the Hessian remains positive definite. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`TensorType`	The shape of this array varies based on the `influence_type`. If 'up', the shape is [NxM], where N denotes the number of test points and M denotes the number of input points (the training points, unless `input_data` is given). Conversely, if the `influence_type` is 'perturbation', the shape is [NxMxP], with P representing the number of input features.
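The effect of `hessian_regularization` can be seen on a small, deliberately degenerate example. The sketch below assumes a linear model with squared loss, so that the Hessian is \(X^T X / n\), and uses two identical feature columns to make it singular. The variable names are illustrative.

```python
import numpy as np

# Two identical feature columns make X^T X rank deficient.
x = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
hessian = x.T @ x / len(x)

lam = 0.1  # plays the role of hessian_regularization
reg_hessian = hessian + lam * np.eye(2)

eig = np.linalg.eigvalsh(hessian)
eig_reg = np.linalg.eigvalsh(reg_hessian)

# The unregularized Hessian has a zero eigenvalue and cannot be inverted ...
assert np.isclose(eig.min(), 0.0)
# ... while adding lam * I shifts every eigenvalue up by exactly lam.
assert np.allclose(eig_reg, eig + lam)
assert eig_reg.min() > 0
```

Larger values of the regularization make the inversion better conditioned, at the cost of moving the result further from the influence values of the unregularized model.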

Source code in src/pydvl/influence/general.py

def compute_influences(
    differentiable_model: TwiceDifferentiable,
    training_data: DataLoaderType,
    *,
    test_data: Optional[DataLoaderType] = None,
    input_data: Optional[DataLoaderType] = None,
    inversion_method: InversionMethod = InversionMethod.Direct,
    influence_type: InfluenceType = InfluenceType.Up,
    hessian_regularization: float = 0.0,
    progress: bool = False,
    **kwargs: Any,
) -> TensorType:  # type: ignore  # TODO: fix typing
    r"""
    Calculates the influence of each input data point on the specified test points.

    This method operates in two primary stages:
    1. Computes the influence factors for all test points concerning the model and its training data.
    2. Uses these factors to derive the influences over the complete set of input data.

    The influence calculation relies on the twice-differentiable nature of the provided model.

    Args:
        differentiable_model: A model bundled with its corresponding loss in the `TwiceDifferentiable` wrapper.
        training_data: DataLoader instance supplying the training data. This data is pivotal in computing the
                       Hessian matrix for the model's loss.
        test_data: DataLoader instance with the test samples. Defaults to `training_data` if None.
        input_data: DataLoader instance holding samples whose influences need to be computed. Defaults to
                    `training_data` if None.
        inversion_method: An enumeration value determining the approach for inverting matrices
            or computing inverse operations, see [.inversion.InversionMethod]
        progress: A boolean indicating whether progress bars should be displayed during computation.
        influence_type: Determines the methodology for computing influences.
            Valid choices include 'up' (for up-weighting) and 'perturbation'.
            For an in-depth understanding, see (Koh and Liang, 2017)<sup><a href="#koh_liang_2017">1</a></sup>.
        hessian_regularization: A lambda value used in Hessian regularization. The regularized Hessian, \( H_{reg} \),
            is computed as \( H + \lambda \times I \), where \( I \) is the identity matrix and \( H \)
            is the simple, unmodified Hessian. This regularization is typically utilized for more
            sophisticated models to ensure that the Hessian remains positive definite.

    Returns:
        The shape of this array varies based on the `influence_type`. If 'up', the shape is [NxM], where
            N denotes the number of test points and M denotes the number of training points. Conversely, if the
            influence_type is 'perturbation', the shape is [NxMxP], with P representing the number of input features.
    """

    if input_data is None:
        input_data = deepcopy(training_data)
    if test_data is None:
        test_data = deepcopy(training_data)

    influence_factors, _ = compute_influence_factors(
        differentiable_model,
        training_data,
        test_data,
        inversion_method,
        hessian_perturbation=hessian_regularization,
        progress=progress,
        **kwargs,
    )

    return influence_type_registry[influence_type](
        differentiable_model,
        input_data,
        influence_factors,
        progress=progress,
    )

Last update: 2023-10-14
Created: 2023-10-14
