Influence functions for outlier detection ¶

This notebook shows how to calculate influences on a NN model using pyDVL for an arbitrary dataset, and how this can be used to find anomalous or corrupted data points.

It uses the wine dataset from sklearn: given a set of 13 different input parameters regarding a particular bottle, each related to some physical property (e.g. concentration of magnesium, malic acidity, alcoholic percentage, etc.), the model will need to predict to which of 3 classes the wine belongs to. For more details, please refer to the sklearn documentation .

If you are reading this in the documentation, some boilerplate has been omitted for convenience.

Let's start by loading the imports, the dataset and splitting it into train, validation and test sets. We will use a large test set to have a less noisy estimate of the average influence.

Imports ¶

%matplotlib inline

import os
import random

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn.functional as F
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, f1_score
from support.common import plot_losses
from support.influence import TorchMLP, fit_torch_model
from support.shapley import load_wine_dataset
from torch.optim import Adam, lr_scheduler
from torch.utils.data import DataLoader, TensorDataset

from pydvl.influence.torch import (
    ArnoldiInfluence,
    CgInfluence,
    DirectInfluence,
    EkfacInfluence,
    LissaInfluence,
    NystroemSketchInfluence,
)

plt.rcParams["figure.figsize"] = (16, 8)
plt.rcParams["font.size"] = 12
plt.rcParams["xtick.labelsize"] = 12
plt.rcParams["ytick.labelsize"] = 10
plt.rcParams["axes.facecolor"] = (1, 1, 1, 0)
plt.rcParams["figure.facecolor"] = (1, 1, 1, 0)

Dataset ¶

training_data, val_data, test_data, feature_names = load_wine_dataset(
    train_size=0.6, test_size=0.3
)

We will corrupt some of the training points by flipping their labels

num_corrupted_idxs = 10
training_data[1][:num_corrupted_idxs] = torch.tensor(
    [(val + 1) % 3 for val in training_data[1][:num_corrupted_idxs]]
)

and let's wrap it in a pytorch data loader

training_data_loader = DataLoader(
    TensorDataset(*training_data), batch_size=32, shuffle=False
)
val_data_loader = DataLoader(TensorDataset(*val_data), batch_size=32, shuffle=False)
test_data_loader = DataLoader(TensorDataset(*test_data), batch_size=32, shuffle=False)

Fit a neural network to the data ¶

We will train a 2-layer neural network. PyDVL has some convenience wrappers to initialize a pytorch NN. If you already have a model loaded and trained, you can skip this section.

feature_dimension = 13
num_classes = 3
network_size = [16, 16]
layers_size = [feature_dimension, *network_size, num_classes]
num_epochs = 300
lr = 0.005
weight_decay = 0.01

nn_model = TorchMLP(layers_size)
nn_model.to(device)

optimizer = Adam(params=nn_model.parameters(), lr=lr, weight_decay=weight_decay)
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

losses = fit_torch_model(
    model=nn_model,
    training_data=training_data_loader,
    val_data=val_data_loader,
    loss=F.cross_entropy,
    optimizer=optimizer,
    scheduler=scheduler,
    num_epochs=num_epochs,
    device=device,
)

Let's check that the training has found a stable minimum by plotting the training and validation loss

plot_losses(losses)

No description has been provided for this image

Since it is a classification problem, let's also take a look at the confusion matrix on the test set

And let's compute the f1 score of the model

f1_score(test_data[1], pred_y_test, average="weighted")

0.9633110554163186

Let's now move to calculating influences of each point on the total score.

Calculating influences for small neural networks ¶

The following cell calculates the influences of each training data point on the neural network. Neural networks have typically a very bumpy parameter space, which, during training, is explored until the configuration that minimises the loss is found. There is an important assumption in influence functions that the model lays at a (at least local) minimum of such loss, and if this is not fulfilled many issues can arise. In order to avoid this scenario, a regularisation term should be used whenever dealing with big and noisy models.

influence_model = DirectInfluence(
    nn_model,
    F.cross_entropy,
    regularization=0.1,
)
influence_model = influence_model.fit(training_data_loader)
train_influences = influence_model.influences(*test_data, *training_data, mode="up")

the returned matrix, train_influences, has a quantity of columns equal to the points in the training set, and a number of rows equal to the points in the test set. At each element \(a_{i,j}\) it stores the influence that training point \(j\) has on the classification of test point \(i\) .

If we take the average across every column of the influences matrix, we obtain an estimate of the overall influence of a training point on the total accuracy of the network.

mean_train_influences = np.mean(train_influences.cpu().numpy(), axis=0)
mean_train_influences.shape

(106,)

The following histogram shows that there are big differences in score within the training set (notice the log-scale on the y axis).

We can see that the corrupted points tend to have a negative effect on the model, as expected

Average influence of corrupted points:  -1.076019
Average influence of other points:  0.10683939

Influence of training features ¶

We have seen how to calculate the influence of single training points on each test point using influence_type 'up'. Using influence_type 'perturbation' we can also calculate the influence of the input features of each point. In the next cell we will calculate the average influence of each feature on training and test points, and ultimately assess which are the most relevant to model performance.

influence_model.regularization = 1.0
feature_influences = influence_model.influences(
    *test_data, *training_data, mode="perturbation"
)

Speeding up influences for big models ¶

The explicit calculation of the Hessian matrix is numerically challenging, and due to the high memory need infeasible for larger models. PyDVL allows to use several approximation methods for the action of the inverse Hessian matrix to overcome this bottleneck:

Iteration-based:
- Conjugate Gradients (Cg)
- Linear time Stochastic Second-Order Approximation ( LiSSA )
Low-rank Approximations:
- Arnoldi
- Nyström Sketch-and-Solve (Nyström)
Factorization-based:
- Eigenvalue-corrected Kronecker Factorization ( EKFAC )

In the following, we show the usage of these approximation methods and investigate their performance.

Cg ¶

Since the Hessian is symmetric and positive definite (at least after applying a sufficient regularization), we can utilize the Conjugate Gradients Algorithm to approximately solve the equations

\[ (H + \lambda \operatorname{I}) x = b\]

Most importantly, the algorithm do not require the computation of the full Hessian matrix, but only requires the implementation of Hessian vector products. pyDVL implements a stable block variant of preconditioned conjugate gradients algorithm.

from pydvl.influence.torch.preconditioner import NystroemPreconditioner

nn_model.to("cpu")
cg_influence_model = CgInfluence(
    nn_model,
    F.cross_entropy,
    regularization=0.1,
    progress=True,
    solve_simultaneously=True,
    preconditioner=NystroemPreconditioner(rank=5),
)
cg_influence_model = cg_influence_model.fit(training_data_loader)
cg_train_influences = cg_influence_model.influences(
    *test_data, *training_data, mode="up"
)
mean_cg_train_influences = np.mean(cg_train_influences.numpy(), axis=0)

Let's compare the results obtained through conjugate gradient with those from the direct method

Percentage error of Cg over direct method:30.799928307533264 %

Pearson Correlation Cg vs direct 0.9975120139679169
Spearman Correlation Cg vs direct 0.9968058039650353

Lissa ¶

The LiSSA method is a stochastic approximation of the inverse Hessian vector product. Compared to conjugate gradient it is faster but less accurate and typically suffers from instability.

In order to find the solution of the HVP, LiSSA iteratively approximates the inverse of the Hessian matrix with the following update:

\[H^{-1}_{j+1} b = b + (I - d) \ H - \frac{H^{-1}_j b}{s},\]

where \(d\) and \(s\) are a dampening and a scaling factor.

lissa_influence_model = LissaInfluence(
    nn_model,
    F.cross_entropy,
    regularization=0.1,
    progress=True,
)
lissa_influence_model = lissa_influence_model.fit(training_data_loader)
lissa_train_influences = lissa_influence_model.influences(
    *test_data, *training_data, mode="up"
)
mean_lissa_train_influences = np.mean(lissa_train_influences.numpy(), axis=0)

Percentage error of Lissa over direct method:43.32989752292633 %

Pearson Correlation Lissa vs direct 0.9940533986762107
Spearman Correlation Lissa vs direct 0.9942061112930448

Arnoldi ¶

The Arnoldi method leverages a low rank approximation of the Hessian matrix to reduce the memory requirements. It is generally much faster than the conjugate gradient method and can achieve similar accuracy.

arnoldi_influence_model = ArnoldiInfluence(
    nn_model,
    F.cross_entropy,
    rank=30,
    regularization=0.1,
)
arnoldi_influence_model = arnoldi_influence_model.fit(training_data_loader)
arnoldi_train_influences = arnoldi_influence_model.influences(
    *test_data, *training_data, mode="up"
)
mean_arnoldi_train_influences = np.mean(arnoldi_train_influences.numpy(), axis=0)

Percentage error of Arnoldi over direct method:44.838228821754456 %

Pearson Correlation Arnoldi vs direct 0.9876463041029632
Spearman Correlation Arnoldi vs direct 0.9772879562687357

Nyström ¶

Similar to the Arnoldi method. the Nyström method uses a low-rank approximation, which is computed from random projections of the Hessian matrix. In general the approximation is expected to be worse then the Arnoldi approximation, but is cheaper to compute.

nystroem_influence_model = NystroemSketchInfluence(
    nn_model,
    F.cross_entropy,
    rank=30,
    regularization=0.1,
)
nystroem_influence_model = nystroem_influence_model.fit(training_data_loader)
nystroem_train_influences = nystroem_influence_model.influences(
    *test_data, *training_data, mode="up"
)
mean_nystroem_train_influences = np.mean(nystroem_train_influences.numpy(), axis=0)

Percentage error of Nyström over direct method:52.49669551849365 %

Pearson Correlation Nyström vs direct 0.9949392350441726
Spearman Correlation Nyström vs direct 0.9883719172733456

EKFAC ¶

The EKFAC method is a more recent technique that leverages the Kronecker product structure of the Hessian matrix to reduce the memory requirements. It is generally much faster than iterative methods like conjugate gradient and Arnoldi and it allows for an easier handling of memory. Therefore, it is the only technique that can scale to very large models (e.g. billions of parameters). Its accuracy is however much worse. Let's see how it performs on our example.

ekfac_influence_model = EkfacInfluence(
    nn_model,
    update_diagonal=True,
    hessian_regularization=0.1,
)
ekfac_influence_model = ekfac_influence_model.fit(training_data_loader)
ekfac_train_influences = ekfac_influence_model.influences(
    *test_data, *training_data, mode="up"
)
mean_ekfac_train_influences = np.mean(ekfac_train_influences.numpy(), axis=0)

Percentage error of EK-FAC over direct method:1528.7002563476562 %

The accuracy is not good, and it is not recommended to use this method for small models. Nevertheless, a look at the actual influence values reveals that the EK-FAC estimates are not completely off.

The above plot shows a good correlation between the EK-FAC and the direct method. Corrupted points have been circled in red, and in both the direct and approximate case they are correcly identified as having negative influence on the model's accuracy. This is confirmed by explicit calculation of the Pearson and Spearman correlation coefficients.

Pearson Correlation EK-FAC vs direct 0.9579606771217988
Spearman Correlation EK-FAC vs direct 0.895971987807643

The correlation between the EK-FAC and the direct method is quite good, and it improves significantly if we just keep top-20 highest absolute influences.

Pearson Correlation EK-FAC vs direct - top-20 influences 0.9729110282877721
Spearman Correlation EK-FAC vs direct - top-20 influences 0.9759398496240601

When we calculate influence scores, typically we are more interested in assessing which training points have the highest or lowest impact on the model rather than having a precise estimate of the influence value. EK-FAC then provides a fast and memory-efficient way to calculate a coarse influence ranking of the training points which scales very well even to the largest neural networks.

Conclusions ¶

This was a quick introduction to the pyDVL interface for influence functions. Despite their speed and simplicity, influence functions are known to be a very noisy estimator of data quality, as pointed out in the paper "Influence functions in deep learning are fragile" . The size of the network, the weight decay, the inversion method used for calculating influences, the size of the test set: they all add up to the total amount of noise. Experiments may therefore give quantitative and qualitatively different results if not averaged across several realisations. Shapley values, on the contrary, have shown to be a more robust, but this comes at the cost of high computational requirements. PyDVL employs several parallelization and caching techniques to optimize such calculations.