Influence Function Model
In almost every practical application it is not possible to construct, even less invert the complete Hessian in memory. pyDVL offers several implementations of the interface InfluenceFunctionModel , which do not compute the full Hessian (in contrast to DirectInfluence ).
Conjugate Gradient¶
This classical procedure for solving linear systems of equations is an iterative method that does not require the explicit inversion of the Hessian. Instead, it only requires the calculation of Hessian-vector products, making it a good choice for large datasets or models with many parameters. It is nevertheless much slower to converge than the direct inversion method and not as accurate.
More info on the theory of conjugate gradient can be found on Wikipedia, or in text books such as (Trefethen and Bau, 1997, Lecture 38)1.
pyDVL also implements a stable block variant of the conjugate gradient method, defined in (Ji and Li, 2017)2, which solves several right hand sides simultaneously.
Optionally, the user can provide a pre-conditioner to improve convergence, such as a Jacobi preconditioner , which is a simple diagonal pre-conditioner based on Hutchinson's diagonal estimator (Bekas et al., 2007)3, or a Nyström approximation based preconditioner , described in (Frangella et al., 2023)4.
from pydvl.influence.torch import CgInfluence, BlockMode, SecondOrderMode
from pydvl.influence.torch.preconditioner import NystroemPreconditioner
if_model = CgInfluence(
model,
loss,
regularization=0.0,
rtol=1e-7,
atol=1e-7,
maxiter=None,
solve_simultaneously=True,
preconditioner=NystroemPreconditioner(rank=10),
block_structure=BlockMode.FULL,
second_order_mode=SecondOrderMode.HESSIAN
)
if_model.fit(train_loader)
The additional optional parameters rtol
, atol
, maxiter
,
solve_simultaneously
and preconditioner
are respectively, the relative
tolerance, the absolute tolerance, the maximum number of iterations,
a flag indicating whether to use a variant of cg to
simultaneously solving the system for several right hand sides and an optional
preconditioner.
This implementation is capable of using a block-diagonal approximation, see Block-diagonal approximation, and can handle Gauss-Newton approximation.
Linear time Stochastic Second-Order Approximation (LiSSA)¶
The LiSSA method is a stochastic approximation of the inverse Hessian vector product. Compared to conjugate gradient it is faster but less accurate and typically suffers from instability.
In order to find the solution of the HVP, LiSSA iteratively approximates the inverse of the Hessian matrix with the following update:
where \(d\) and \(s\) are a dampening and a scaling factor, which are essential for the convergence of the method and they need to be chosen carefully, and I is the identity matrix. More info on the theory of LiSSA can be found in the original paper (Agarwal et al., 2017)5.
from pydvl.influence.torch import LissaInfluence, BlockMode, SecondOrderMode
if_model = LissaInfluence(
model,
loss,
regularization=0.0,
maxiter=1000,
dampen=0.0,
scale=10.0,
rtol=1e-4,
block_structure=BlockMode.FULL,
second_order_mode=SecondOrderMode.GAUSS_NEWTON
)
if_model.fit(train_loader)
with the additional optional parameters maxiter
, dampen
, scale
, and
rtol
,
being the maximum number of iterations, the dampening factor, the scaling
factor and the relative tolerance,
respectively. This implementation is capable of using a block-matrix
approximation, see
Block-diagonal approximation, and can handle
Gauss-Newton approximation.
Arnoldi¶
The Arnoldi method is a Krylov subspace method for approximating dominating eigenvalues and eigenvectors. Under a low rank assumption on the Hessian at a minimizer (which is typically observed for deep neural networks), this approximation captures the essential action of the Hessian. More concretely, for \(Hx=b\) the solution is approximated by
where \(D\) is a diagonal matrix with the top (in absolute value) eigenvalues of the Hessian and \(V\) contains the corresponding eigenvectors. See also (Schioppa et al., 2022)6.
from pydvl.influence.torch import ArnoldiInfluence, BlockMode, SecondOrderMode
if_model = ArnoldiInfluence(
model,
loss,
regularization=0.0,
rank=10,
tol=1e-6,
block_structure=BlockMode.FULL,
second_order_mode=SecondOrderMode.HESSIAN
)
if_model.fit(train_loader)
Eigenvalue Corrected K-FAC¶
K-FAC, short for Kronecker-Factored Approximate Curvature, is a method that approximates the Fisher information matrix FIM of a model. It is possible to show that for classification models with appropriate loss functions the FIM is equal to the Hessian of the model’s loss over the dataset. In this restricted but nonetheless important context K-FAC offers an efficient way to approximate the Hessian and hence the influence scores. For more info and details refer to the original paper (Martens and Grosse, 2015)7.
The K-FAC method is implemented in the class EkfacInfluence . The following code snippet shows how to use the K-FAC method to calculate the influence function of a model. Note that, in contrast to the other methods for influence function calculation, K-FAC does not require the loss function as an input. This is because the current implementation is only applicable to classification models with a cross entropy loss function.
from pydvl.influence.torch import EkfacInfluence
if_model = EkfacInfluence(
model,
hessian_regularization=0.0,
)
if_model.fit(train_loader)
A further improvement of the K-FAC method is the Eigenvalue Corrected
K-FAC (EKFAC) method (George et al., 2018)8, which allows to further re-fit the
eigenvalues of the Hessian, thus providing a more accurate approximation.
On top of the K-FAC method, the EKFAC method is implemented by setting
update_diagonal=True
when initialising EkfacInfluence
.
The following code snippet shows how to use the EKFAC method to calculate the
influence function of a model.
from pydvl.influence.torch import EkfacInfluence
if_model = EkfacInfluence(
model,
update_diagonal=True,
hessian_regularization=0.0,
)
if_model.fit(train_loader)
Nyström Sketch-and-Solve¶
This approximation is based on a Nyström low-rank approximation of the form
where \((\cdot)^{\dagger}\) denotes the Moore-Penrose inverse, in combination with the Sherman–Morrison–Woodbury formula to calculate the action of its inverse:
see also (Hataya and Yamada, 2023)9 and (Frangella et al., 2023)4. The essential parameter is the rank of the approximation.
from pydvl.influence.torch import NystroemSketchInfluence, BlockMode, SecondOrderMode
if_model = NystroemSketchInfluence(
model,
loss,
rank=10,
regularization=0.0,
block_structure=BlockMode.FULL,
second_order_mode=SecondOrderMode.HESSIAN
)
if_model.fit(train_loader)
Inverse Harmonic Mean¶
This implementation replaces the inverse Hessian matrix in the influence computation with an approximation of the inverse Gauss-Newton vector product and was proposed in (Kwon et al., 2023)10.
The approximation method comprises the following steps:
-
Replace the Hessian \(H(\theta)\) with the Gauss-Newton matrix \(G(\theta)\):
\[\begin{equation*} G(\theta)=n^{-1} \sum_{i=1}^n \nabla_{\theta}\ell_i\nabla_{\theta}\ell_i^T \end{equation*}\]which results in
\[\begin{equation*} \mathcal{I}(z_{t}, z) \approx \nabla_{\theta} \ell(z_{t}, \theta)^T (G(\theta) + \lambda I_d)^{-1} \nabla_{\theta} \ell(z, \theta) \end{equation*}\] -
Simplify the problem by breaking it down into a block diagonal structure, where each block \(G_l(\theta)\) corresponds to the l-th block:
\[\begin{equation*} G_{l}(\theta) = n^{-1} \sum_{i=1}^n \nabla_{\theta_l} \ell_i \nabla_{\theta_l} \ell_i^{T} + \lambda_l I_{d_l}, \end{equation*}\]which leads to
\[\begin{equation*} \mathcal{I}(z_{t}, z) \approx \nabla_{\theta} \ell(z_{t}, \theta)^T \operatorname{diag}(G_1(\theta)^{-1}, \dots, G_L(\theta)^{-1}) \nabla_{\theta} \ell(z, \theta) \end{equation*}\] -
Substitute the arithmetic mean of the rank-\(1\) updates in \(G_l(\theta)\), with the inverse harmonic mean \(R_l(\theta)\) of the rank-1 updates:
\[\begin{align*} G_l(\theta)^{-1} &= \left( n^{-1} \sum_{i=1}^n \nabla_{\theta_l} \ell(z_i, \theta) \nabla_{\theta_l} \ell(z_i, \theta)^{T} + \lambda_l I_{d_l}\right)^{-1} \\\ R_{l}(\theta)&= n^{-1} \sum_{i=1}^n \left( \nabla_{\theta_l} \ell(z_i, \theta) \nabla_{\theta_l} \ell(z_i, \theta)^{T} + \lambda_l I_{d_l} \right)^{-1} \end{align*}\] -
Use the Sherman–Morrison formula to get an explicit representation of the inverses in the definition of \(R_l(\theta):\)
\[\begin{align*} R_l(\theta) &= n^{-1} \sum_{i=1}^n \left( \nabla_{\theta_l} \ell_i \nabla_{\theta_l} \ell_i^{T} + \lambda_l I_{d_l}\right)^{-1} \\\ &= n^{-1} \sum_{i=1}^n \lambda_l^{-1} \left(I_{d_l} - \frac{\nabla_{\theta_l} \ell_i \nabla_{\theta_l} \ell_i^{T}}{\lambda_l + \\|\nabla_{\theta_l} \ell_i\\|_2^2}\right) , \end{align*}\]which means application of \(R_l(\theta)\) boils down to computing \(n\) rank-\(1\) updates.
from pydvl.influence.torch import InverseHarmonicMeanInfluence, BlockMode
if_model = InverseHarmonicMeanInfluence(
model,
loss,
regularization=1e-1,
block_structure=BlockMode.LAYER_WISE
)
if_model.fit(train_loader)
These implementations represent the calculation logic on in memory tensors. To scale up to large collection of data, we map these influence function models over these collections. For a detailed discussion see the documentation page Scaling Computation.
-
Trefethen, L.N., Bau, D., Iii, 1997. Numerical Linear Algebra. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898719574 ↩
-
Ji, H., Li, Y., 2017. A breakdown-free block conjugate gradient method. Bit Numer Math 57, 379–403. https://doi.org/10.1007/s10543-016-0631-z ↩
-
Bekas, C., Kokiopoulou, E., Saad, Y., 2007. An estimator for the diagonal of a matrix. Applied Numerical Mathematics, Numerical Algorithms, Parallelism and Applications (2) 57, 1214–1229. https://doi.org/10.1016/j.apnum.2007.01.003 ↩
-
Frangella, Z., Tropp, J.A., Udell, M., 2023. Randomized Nyström Preconditioning. SIAM J. Matrix Anal. Appl. 44, 718–752. https://doi.org/10.1137/21M1466244 ↩↩
-
Agarwal, N., Bullins, B., Hazan, E., 2017. Second-Order Stochastic Optimization for Machine Learning in Linear Time. JMLR 18, 1–40. ↩
-
Schioppa, A., Zablotskaia, P., Vilar, D., Sokolov, A., 2022. Scaling Up Influence Functions. Proc. AAAI Conf. Artif. Intell. 36, 8179–8186. https://doi.org/10.1609/aaai.v36i8.20791 ↩
-
Martens, J., Grosse, R., 2015. Optimizing Neural Networks with Kronecker-factored Approximate Curvature, in: Proceedings of the 32nd International Conference on Machine Learning. Presented at the International Conference on Machine Learning, PMLR, pp. 2408–2417. ↩
-
George, T., Laurent, C., Bouthillier, X., Ballas, N., Vincent, P., 2018. Fast Approximate Natural Gradient Descent in a Kronecker Factored Eigenbasis, in: Advances in Neural Information Processing Systems. Curran Associates, Inc. ↩
-
Hataya, R., Yamada, M., 2023. Nyström Method for Accurate and Scalable Implicit Differentiation, in: Proceedings of The 26th International Conference on Artificial Intelligence and Statistics. Presented at the International Conference on Artificial Intelligence and Statistics, PMLR, pp. 4643–4654. ↩
-
Kwon, Y., Wu, E., Wu, K., Zou, J., 2023. DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models. Presented at the The Twelfth International Conference on Learning Representations. https://doi.org/10.48550/arXiv.2310.00902 ↩