Influence functions for outlier detection ¶
This notebook shows how to calculate influences on a NN model using pyDVL for an arbitrary dataset, and how this can be used to find anomalous or corrupted data points.
It uses the wine dataset from sklearn: given a set of 13 different input parameters regarding a particular bottle, each related to some physical property (e.g. concentration of magnesium, malic acidity, alcoholic percentage, etc.), the model will need to predict to which of 3 classes the wine belongs to. For more details, please refer to the sklearn documentation .
Let's start by loading the imports, the dataset and splitting it into train, validation and test sets. We will use a large test set to have a less noisy estimate of the average influence.
Imports ¶
%autoreload
%matplotlib inline
import os
import random
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn.functional as F
from support.common import plot_losses
from support.torch import TorchMLP, fit_torch_model
from pydvl.influence.torch import (
DirectInfluence,
CgInfluence,
ArnoldiInfluence,
EkfacInfluence,
NystroemSketchInfluence,
LissaInfluence,
)
from pydvl.influence import InfluenceMode
from support.shapley import load_wine_dataset
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, f1_score
from torch.optim import Adam, lr_scheduler
from torch.utils.data import DataLoader, TensorDataset
from scipy.stats import pearsonr, spearmanr
Dataset ¶
We will corrupt some of the training points by flipping their labels
and let's wrap it in a pytorch data loader
Fit a neural network to the data ¶
We will train a 2-layer neural network. PyDVL has some convenience wrappers to initialize a pytorch NN. If you already have a model loaded and trained, you can skip this section.
feature_dimension = 13
num_classes = 3
network_size = [16, 16]
layers_size = [feature_dimension, *network_size, num_classes]
num_epochs = 300
lr = 0.005
weight_decay = 0.01
nn_model = TorchMLP(layers_size)
nn_model.to(device)
optimizer = Adam(params=nn_model.parameters(), lr=lr, weight_decay=weight_decay)
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
losses = fit_torch_model(
model=nn_model,
training_data=training_data_loader,
val_data=val_data_loader,
loss=F.cross_entropy,
optimizer=optimizer,
scheduler=scheduler,
num_epochs=num_epochs,
device=device,
)
Let's check that the training has found a stable minimum by plotting the training and validation loss
Since it is a classification problem, let's also take a look at the confusion matrix on the test set
And let's compute the f1 score of the model
Let's now move to calculating influences of each point on the total score.
Calculating influences for small neural networks ¶
The following cell calculates the influences of each training data point on the neural network. Neural networks have typically a very bumpy parameter space, which, during training, is explored until the configuration that minimises the loss is found. There is an important assumption in influence functions that the model lays at a (at least local) minimum of such loss, and if this is not fulfilled many issues can arise. In order to avoid this scenario, a regularisation term should be used whenever dealing with big and noisy models.
the returned matrix, train_influences, has a quantity of columns equal to the points in the training set, and a number of rows equal to the points in the test set. At each element \(a_{i,j}\) it stores the influence that training point \(j\) has on the classification of test point \(i\) .
If we take the average across every column of the influences matrix, we obtain an estimate of the overall influence of a training point on the total accuracy of the network.
The following histogram shows that there are big differences in score within the training set (notice the log-scale on the y axis).
We can see that the corrupted points tend to have a negative effect on the model, as expected
Influence of training features ¶
We have seen how to calculate the influence of single training points on each test point using influence_type 'up'. Using influence_type 'perturbation' we can also calculate the influence of the input features of each point. In the next cell we will calculate the average influence of each feature on training and test points, and ultimately assess which are the most relevant to model performance.
Speeding up influences for big models ¶
The explicit calculation of the Hessian matrix is numerically challenging, and due to the high memory need infeasible for larger models. PyDVL allows to use several approximation methods for the action of the inverse Hessian matrix to overcome this bottleneck:
-
Iteration-based:
- Conjugate Gradients (Cg)
- Linear time Stochastic Second-Order Approximation ( LiSSA )
-
Low-rank Approximations:
- Arnoldi
- Nyström Sketch-and-Solve (Nyström)
-
Factorization-based:
- Eigenvalue-corrected Kronecker Factorization ( EKFAC )
In the following, we show the usage of these approximation methods and investigate their performance.
Cg ¶
Since the Hessian is symmetric and positive definite (at least after applying a sufficient regularization), we can utilize the Conjugate Gradients Algorithm to approximately solve the equations
Most importantly, the algorithm do not require the computation of the full Hessian matrix, but only requires the implementation of Hessian vector products. pyDVL implements a stable block variant of preconditioned conjugate gradients algorithm.
from pydvl.influence.torch.preconditioner import NystroemPreconditioner
nn_model.to("cpu")
cg_influence_model = CgInfluence(
nn_model,
F.cross_entropy,
regularization=0.1,
progress=True,
solve_simultaneously=True,
preconditioner=NystroemPreconditioner(rank=5),
)
cg_influence_model = cg_influence_model.fit(training_data_loader)
cg_train_influences = cg_influence_model.influences(
*test_data, *training_data, mode="up"
)
mean_cg_train_influences = np.mean(cg_train_influences.numpy(), axis=0)
Let's compare the results obtained through conjugate gradient with those from the direct method
Lissa ¶
The LiSSA method is a stochastic approximation of the inverse Hessian vector product. Compared to conjugate gradient it is faster but less accurate and typically suffers from instability.
In order to find the solution of the HVP, LiSSA iteratively approximates the inverse of the Hessian matrix with the following update:
where \(d\) and \(s\) are a dampening and a scaling factor.
lissa_influence_model = LissaInfluence(
nn_model,
F.cross_entropy,
regularization=0.1,
progress=True,
)
lissa_influence_model = lissa_influence_model.fit(training_data_loader)
lissa_train_influences = lissa_influence_model.influences(
*test_data, *training_data, mode="up"
)
mean_lissa_train_influences = np.mean(lissa_train_influences.numpy(), axis=0)
Arnoldi ¶
The Arnoldi method leverages a low rank approximation of the Hessian matrix to reduce the memory requirements. It is generally much faster than the conjugate gradient method and can achieve similar accuracy.
arnoldi_influence_model = ArnoldiInfluence(
nn_model,
F.cross_entropy,
rank=30,
regularization=0.1,
)
arnoldi_influence_model = arnoldi_influence_model.fit(training_data_loader)
arnoldi_train_influences = arnoldi_influence_model.influences(
*test_data, *training_data, mode="up"
)
mean_arnoldi_train_influences = np.mean(arnoldi_train_influences.numpy(), axis=0)
Nyström ¶
Similar to the Arnoldi method. the Nyström method uses a low-rank approximation, which is computed from random projections of the Hessian matrix. In general the approximation is expected to be worse then the Arnoldi approximation, but is cheaper to compute.
nystroem_influence_model = NystroemSketchInfluence(
nn_model,
F.cross_entropy,
rank=30,
regularization=0.1,
)
nystroem_influence_model = nystroem_influence_model.fit(training_data_loader)
nystroem_train_influences = nystroem_influence_model.influences(
*test_data, *training_data, mode="up"
)
mean_nystroem_train_influences = np.mean(nystroem_train_influences.numpy(), axis=0)
EKFAC ¶
The EKFAC method is a more recent technique that leverages the Kronecker product structure of the Hessian matrix to reduce the memory requirements. It is generally much faster than iterative methods like conjugate gradient and Arnoldi and it allows for an easier handling of memory. Therefore, it is the only technique that can scale to very large models (e.g. billions of parameters). Its accuracy is however much worse. Let's see how it performs on our example.
ekfac_influence_model = EkfacInfluence(
nn_model,
update_diagonal=True,
hessian_regularization=0.1,
)
ekfac_influence_model = ekfac_influence_model.fit(training_data_loader)
ekfac_train_influences = ekfac_influence_model.influences(
*test_data, *training_data, mode="up"
)
mean_ekfac_train_influences = np.mean(ekfac_train_influences.numpy(), axis=0)
The accuracy is not good, and it is not recommended to use this method for small models. Nevertheless, a look at the actual influence values reveals that the EK-FAC estimates are not completely off.
The above plot shows a good correlation between the EK-FAC and the direct method. Corrupted points have been circled in red, and in both the direct and approximate case they are correcly identified as having negative influence on the model's accuracy. This is confirmed by explicit calculation of the Pearson and Spearman correlation coefficients.
The correlation between the EK-FAC and the direct method is quite good, and it improves significantly if we just keep top-20 highest absolute influences.
When we calculate influence scores, typically we are more interested in assessing which training points have the highest or lowest impact on the model rather than having a precise estimate of the influence value. EK-FAC then provides a fast and memory-efficient way to calculate a coarse influence ranking of the training points which scales very well even to the largest neural networks.
Conclusions ¶
This was a quick introduction to the pyDVL interface for influence functions. Despite their speed and simplicity, influence functions are known to be a very noisy estimator of data quality, as pointed out in the paper "Influence functions in deep learning are fragile" . The size of the network, the weight decay, the inversion method used for calculating influences, the size of the test set: they all add up to the total amount of noise. Experiments may therefore give quantitative and qualitatively different results if not averaged across several realisations. Shapley values, on the contrary, have shown to be a more robust, but this comes at the cost of high computational requirements. PyDVL employs several parallelization and caching techniques to optimize such calculations.