Does the computation of log-determinant of a matrix ever occur in the computational graph of a neural network?
It does, and one place it shows up is the Sparse Variational Gaussian Process (SVGP).
In SVGP, our goal is to maximize a quantity known as the evidence lower bound (ELBO). The ELBO comprises two terms, which basically say:
One could check, e.g. Wei Yi's excellent tutorial or James Hensman's Gaussian Process for big data for the derivation of ELBO for SVGP. In short, the KL divergence term in ELBO is expressed as:
(here and are the mean and covariance of a variational distribution, and are part of the model parameters in SVGP). We will disregard the grey terms, since Zygote handles them easily.
When is stored in CPU memory, the term involving the (positive definite) matrix works fine with Zygote.
using Zygote
X = randn(3,3); XᵀX = X'X
f(x) = logdet(x)
@show gradient(f, XᵀX)[1] This gives:
(gradient(f, XᵀX))[1] = [0.8429186329377117 -0.4507909324777994 -0.7811665008998808; -0.45079093247779933 0.48173303692414393 0.47267755816965557; -0.7811665008998809 0.4726775581696556 1.261152638854635]
But interestingly, it does not work at all when is stored in GPU memory.
using CUDA
CUDA.allowscalar(false)
@show gradient(f, cu(XᵀX))[1]
This will return:
Error: Scalar indexing is disallowed.
Let's try to get around this, using the Cholesky decomposition to calculate the log determinant of :
cholesky_log_det(X) = begin
C = cholesky(X)
return 2*sum(log.(diag(C.L)))
end
@show gradient(cholesky_log_det, cu(XᵀX))[1]
But this again gives:
Error: Scalar indexing is disallowed.
It turns out that the derivative of the log-determinant of a matrix has a very simple formula, i.e., for an invertible matrix , we have that
Using this handy fact, let's go ahead and make a customized adjoint:
using Zygote: @adjoint
function log_determinant(Q::CuMatrix)
A = cholesky(Q)
return 2*sum(log.(diag(A.L)))
end
@adjoint function log_determinant(Q::CuMatrix)
# Q positive definite so Q = LLᵀ by cholesky and thus Q⁻¹ = L⁻ᵀL⁻¹
# numerically stable way to invert a covariance matrix: https://mathoverflow.net/questions/9001/inverting-a-covariance-matrix-numerically-stable
A = cholesky(Q)
L_inv = inv(A.L)
A_inv = L_inv'L_inv
return 2*sum(log.(diag(A.L))), △ -> (△ * A_inv', )
end
@show gradient(log_determinant, cu(XᵀX))[1]
And now we have:
(gradient(log_determinant, cu(XᵀX)))[1] = Float32[0.8429185 -0.4507908 -0.78116626; -0.4507908 0.48173293 0.4726774; -0.78116626 0.4726774 1.2611524]
This works. Checking against the CPU version confirms that our adjoint returns the correct gradient of the log determinant of .
A side note: you may have noticed the line L_inv = inv(A.L). Inverting the triangular matrix A.L still has quadratic time complexity, which is pretty darn slow for big matrices. Fortunately, in SVGP the matrix (the input Q above) is defined from what's called the "inducing points", which keeps Q small. And since the inducing points are themselves model parameters of SVGP, you actually get to control the size of .
An adjoint example is here by Pbellive.
Note: This post is done on Zygote version 0.6.61 and Julia version 1.9.
The following code using ChainRulesCore produces the same result.
function log_determinant(Q::CuMatrix)
A = cholesky(Q)
return 2*sum(log.(diag(A.L)))
end
function ChainRulesCore.rrule(::typeof(log_determinant), Q::CuMatrix)
A = cholesky(Q)
L_inv = inv(A.L)
A_inv = L_inv' * L_inv
function log_determinant_pullback(R̄)
f̄ = NoTangent()
Q̄ = R̄ * A_inv'
return f̄, Q̄
end
return 2*sum(log.(diag(A.L))), log_determinant_pullback
end