Skip to content

Influence functions for outlier detection

This notebook shows how to calculate influences on a NN model using pyDVL for an arbitrary dataset, and how this can be used to find anomalous or corrupted data points.

It uses the wine dataset from sklearn: given a set of 13 different input parameters regarding a particular bottle, each related to some physical property (e.g. concentration of magnesium, malic acidity, alcoholic percentage, etc.), the model will need to predict to which of 3 classes the wine belongs to. For more details, please refer to the sklearn documentation.

If you are reading this in the documentation, some boilerplate has been omitted for convenience.

Let's start by loading the imports, the dataset and splitting it into train, validation and test sets. We will use a large test set to have a less noisy estimate of the average influence.

Imports

%autoreload
%matplotlib inline

import os
import random

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn.functional as F
from support.common import plot_losses
from support.torch import TorchMLP, fit_torch_model
from pydvl.influence.torch import (
    DirectInfluence,
    CgInfluence,
    ArnoldiInfluence,
    EkfacInfluence,
)
from support.shapley import load_wine_dataset
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, f1_score
from torch.optim import Adam, lr_scheduler
from torch.utils.data import DataLoader, TensorDataset
from scipy.stats import pearsonr, spearmanr

Dataset

training_data, val_data, test_data, feature_names = load_wine_dataset(
    train_size=0.6, test_size=0.3
)

We will corrupt some of the training points by flipping their labels

num_corrupted_idxs = 10
training_data[1][:num_corrupted_idxs] = torch.tensor(
    [(val + 1) % 3 for val in training_data[1][:num_corrupted_idxs]]
)

and let's wrap it in a pytorch data loader

training_data_loader = DataLoader(
    TensorDataset(*training_data), batch_size=32, shuffle=False
)
val_data_loader = DataLoader(TensorDataset(*val_data), batch_size=32, shuffle=False)
test_data_loader = DataLoader(TensorDataset(*test_data), batch_size=32, shuffle=False)

Fit a neural network to the data

We will train a 2-layer neural network. PyDVL has some convenience wrappers to initialize a pytorch NN. If you already have a model loaded and trained, you can skip this section.

feature_dimension = 13
num_classes = 3
network_size = [16, 16]
layers_size = [feature_dimension, *network_size, num_classes]
num_epochs = 300
lr = 0.005
weight_decay = 0.01
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

nn_model = TorchMLP(layers_size)
nn_model.to(device)

optimizer = Adam(params=nn_model.parameters(), lr=lr, weight_decay=weight_decay)
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

losses = fit_torch_model(
    model=nn_model,
    training_data=training_data_loader,
    val_data=val_data_loader,
    loss=F.cross_entropy,
    optimizer=optimizer,
    scheduler=scheduler,
    num_epochs=num_epochs,
)

Let's check that the training has found a stable minimum by plotting the training and validation loss

plot_losses(losses)
No description has been provided for this image

Since it is a classification problem, let's also take a look at the confusion matrix on the test set

No description has been provided for this image

And let's compute the f1 score of the model

f1_score(test_data[1], pred_y_test, average="weighted")
0.9633110554163186

Let's now move to calculating influences of each point on the total score.

Calculating influences for small neural networks

The following cell calculates the influences of each training data point on the neural network. Neural networks have typically a very bumpy parameter space, which, during training, is explored until the configuration that minimises the loss is found. There is an important assumption in influence functions that the model lays at a (at least local) minimum of such loss, and if this is not fulfilled many issues can arise. In order to avoid this scenario, a regularisation term should be used whenever dealing with big and noisy models.

influence_model = DirectInfluence(
    nn_model,
    F.cross_entropy,
    hessian_regularization=0.1,
)
influence_model = influence_model.fit(training_data_loader)
train_influences = influence_model.influences(*test_data, *training_data, mode="up")

the returned matrix, train_influences, has a quantity of columns equal to the points in the training set, and a number of rows equal to the points in the test set. At each element \(a_{i,j}\) it stores the influence that training point \(j\) has on the classification of test point \(i\).

If we take the average across every column of the influences matrix, we obtain an estimate of the overall influence of a training point on the total accuracy of the network.

mean_train_influences = np.mean(train_influences.numpy(), axis=0)

The following histogram shows that there are big differences in score within the training set (notice the log-scale on the y axis).

No description has been provided for this image

We can see that the corrupted points tend to have a negative effect on the model, as expected

Average influence of corrupted points:  -1.0782924
Average influence of other points:  0.10896263

Influence of training features

We have seen how to calculate the influence of single training points on each test point using influence_type 'up'. Using influence_type 'perturbation' we can also calculate the influence of the input features of each point. In the next cell we will calculate the average influence of each feature on training and test points, and ultimately assess which are the most relevant to model performance.

influence_model.hessian_regularization = 1.0
feature_influences = influence_model.influences(
    *test_data, *training_data, mode="perturbation"
)
No description has been provided for this image

Speeding up influences for big models

The calculation of the Hessian matrix (necessary to calculate the influences) can be quite numerically challenging, but there are some techniques to speed up its calculation. PyDVL allows to use the full method ("direct") or the conjugate gradient method ("cg"). The first one should be used only for very small networks (like our current example), while for bigger ones "cg" is advisable.

cg_influence_model = CgInfluence(
    nn_model,
    F.cross_entropy,
    hessian_regularization=0.1,
    progress=True,
)
cg_influence_model = cg_influence_model.fit(training_data_loader)
cg_train_influences = cg_influence_model.influences(
    *test_data, *training_data, mode="up"
)
mean_cg_train_influences = np.mean(cg_train_influences.numpy(), axis=0)

Let's compare the results obtained through conjugate gradient with those from the direct method

Percentage error of cg over direct method:0.00014557354006683454 %

Arnoldi

The Arnoldi method leverages a low rank approximation of the Hessian matrix to reduce the memory requirements. It is generally much faster than the conjugate gradient method and can achieve similar accuracy.

arnoldi_influence_model = ArnoldiInfluence(
    nn_model,
    F.cross_entropy,
    rank_estimate=30,
    hessian_regularization=0.1,
)
arnoldi_influence_model = arnoldi_influence_model.fit(training_data_loader)
arnoldi_train_influences = arnoldi_influence_model.influences(
    *test_data, *training_data, mode="up"
)
mean_arnoldi_train_influences = np.mean(arnoldi_train_influences.numpy(), axis=0)
Percentage error of Arnoldi over direct method:95.7504391670227 %

EK-FAC

The EK-FAC method is a more recent technique that leverages the Kronecker product structure of the Hessian matrix to reduce the memory requirements. It is generally much faster than iterative methods like conjugate gradient and Arnoldi and it allows for an easier handling of memory. Therefore, it is the only technique that can scale to very large models (e.g. billions of parameters). Its accuracy is however much worse. Let's see how it performs on our example.

ekfac_influence_model = EkfacInfluence(
    nn_model,
    update_diagonal=True,
    hessian_regularization=0.1,
)
ekfac_influence_model = ekfac_influence_model.fit(training_data_loader)
ekfac_train_influences = ekfac_influence_model.influences(
    *test_data, *training_data, mode="up"
)
mean_ekfac_train_influences = np.mean(ekfac_train_influences.numpy(), axis=0)
Percentage error of EK-FAC over direct method:1927.0627975463867 %

The accuracy is not good, and it is not recommended to use this method for small models. Nevertheless, a look at the actual influence values reveals that the EK-FAC estimates are not completely off.

No description has been provided for this image

The above plot shows a good correlation between the EK-FAC and the direct method. Corrupted points have been circled in red, and in both the direct and approximate case they are correcly identified as having negative influence on the model's accuracy. This is confirmed by explicit calculation of the Pearson and Spearman correlation coefficients.

print(
    f"Pearson Correlation EK-FAC vs direct",
    pearsonr(mean_ekfac_train_influences, mean_train_influences).statistic,
)
print(
    f"Spearman Correlation EK-FAC vs direct",
    spearmanr(mean_ekfac_train_influences, mean_train_influences).statistic,
)
Pearson Correlation EK-FAC vs direct 0.9608164875442669
Spearman Correlation EK-FAC vs direct 0.8946217598307178

The correlation between the EK-FAC and the direct method is quite good, and it improves significantly if we just keep top-20 highest absolute influences.

Pearson Correlation EK-FAC vs direct - top-20 influences 0.9901775015427601
Spearman Correlation EK-FAC vs direct - top-20 influences 0.9428571428571428

When we calculate influence scores, typically we are more interested in assessing which training points have the highest or lowest impact on the model rather than having a precise estimate of the influence value. EK-FAC then provides a fast and memory-efficient way to calculate a coarse influence ranking of the training points which scales very well even to the largest neural networks.

Conclusions

This was a quick introduction to the pyDVL interface for influence functions. Despite their speed and simplicity, influence functions are known to be a very noisy estimator of data quality, as pointed out in the paper "Influence functions in deep learning are fragile". The size of the network, the weight decay, the inversion method used for calculating influences, the size of the test set: they all add up to the total amount of noise. Experiments may therefore give quantitative and qualitatively different results if not averaged across several realisations. Shapley values, on the contrary, have shown to be a more robust, but this comes at the cost of high computational requirements. PyDVL employs several parallelization and caching techniques to optimize such calculations.