We've all written the following test for our gradient code (known as the finite-difference approximation),
\[
\frac{\partial}{\partial x_i} f(\boldsymbol{x}) \approx \frac{f(\boldsymbol{x} + \varepsilon\,\boldsymbol{e_i}) - f(\boldsymbol{x} - \varepsilon\,\boldsymbol{e_i})}{2\,\varepsilon},
\]
where \(\varepsilon > 0\) and \(\boldsymbol{e_i}\) is a vector of zeros except at \(i\) where it is \(1\). This approximation is exact in the limit, and accurate to \(o(\varepsilon)\) additive error.
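Here's a minimal NumPy sketch of that check; the names (finite_difference_grad, f, eps) are just placeholders, not from any particular library:

```python
import numpy as np

def finite_difference_grad(f, x, eps=1e-5):
    """Central finite-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e_i = np.zeros_like(x, dtype=float)
        e_i[i] = 1.0
        # two function evaluations per coordinate
        g[i] = (f(x + eps * e_i) - f(x - eps * e_i)) / (2 * eps)
    return g
```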
This is a specific instance of a more general approximation! The dot product of the gradient and any (conformable) vector \(\boldsymbol{d}\) can be approximated with the following formula,
\[
\nabla f(\boldsymbol{x})^\top \boldsymbol{d} \approx \frac{f(\boldsymbol{x} + \varepsilon\,\boldsymbol{d}) - f(\boldsymbol{x} - \varepsilon\,\boldsymbol{d})}{2\,\varepsilon}.
\]
We get the special case above when \(\boldsymbol{d}=\boldsymbol{e_i}\). This is also exact in the limit and just as accurate.
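A sketch of this two-evaluation estimate, again with placeholder names:

```python
def grad_vector_product(f, x, d, eps=1e-5):
    """Approximate the gradient-vector dot product (directional derivative)
    of f at x along d using only two function evaluations."""
    return (f(x + eps * d) - f(x - eps * d)) / (2 * eps)
```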
Runtime? The finite-difference approximation is probably too slow for approximating a high-dimensional gradient because the number of function evaluations required is \(2 n\), where \(n\) is the dimensionality of \(x\). However, if the end goal is to approximate a gradient-vector product, a mere \(2\) function evaluations is probably faster than specialized code for computing the gradient.
How to set \(\varepsilon\)? The second approach is more sensitive to \(\varepsilon\) because \(\boldsymbol{d}\) is arbitrary, unlike \(\boldsymbol{e_i}\), which is a simple unit-norm vector. Luckily, some guidance is available. Andrei (2009) recommends
\[
\varepsilon = \sqrt{\epsilon_{\text{mach}}} \cdot \left(1 + \|\boldsymbol{x}\|_\infty\right),
\]
where \(\epsilon_{\text{mach}}\) is machine epsilon. (NumPy users: numpy.finfo(x.dtype).eps).
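In code, that rule looks something like the sketch below (the helper name is mine, not Andrei's):

```python
import numpy as np

def andrei_eps(x):
    """Step size per the rule above: sqrt(machine eps) * (1 + ||x||_inf)."""
    eps_mach = np.finfo(x.dtype).eps
    return np.sqrt(eps_mach) * (1 + np.linalg.norm(x, np.inf))
```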
Why do I care?

Well, I tend to work on sparse but high-dimensional problems where finite-difference would be too slow. Thus, my usual solution is to only test several randomly selected dimensions, biasing samples toward dimensions which should be nonzero. With the new trick, I can effectively test more dimensions at once by taking random vectors \(\boldsymbol{d}\). I recommend sampling \(\boldsymbol{d}\) from a spherical Gaussian so that we're uniform on the angle of the vector.
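A sketch of such a randomized check, assuming you already have a grad(x) routine you want to test:

```python
import numpy as np

def random_direction_check(f, grad, x, n_checks=10, eps=1e-5, rtol=1e-4):
    """Compare grad(x) . d against the two-point estimate for random directions d."""
    rng = np.random.default_rng(0)
    g = grad(x)
    for _ in range(n_checks):
        d = rng.standard_normal(x.shape)   # spherical Gaussian: uniform over angles
        analytic = g.dot(d)
        numeric = (f(x + eps * d) - f(x - eps * d)) / (2 * eps)
        assert np.isclose(analytic, numeric, rtol=rtol), (analytic, numeric)
```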

Sometimes the gradient-vector dot product is the end goal. This is the case with Hessian-vector products, which arise in many optimization algorithms, such as stochastic meta-descent.
Hessian-vector product
Hessian-vector products are an instance of the gradient-vector dot product because the Hessian is just the gradient of the gradient! Now you only need to remember one formula!
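Concretely, applying the two-point estimate to the gradient function (rather than to \(f\)) gives a Hessian-vector product; here's a minimal sketch, assuming grad(x) returns the exact gradient:

```python
def hessian_vector_product(grad, x, v, eps=1e-5):
    """Approximate H(x) @ v with two gradient evaluations:
    H v ~= (grad(x + eps*v) - grad(x - eps*v)) / (2*eps)."""
    return (grad(x + eps * v) - grad(x - eps * v)) / (2 * eps)
```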
With this trick you never have to actually compute the gnarly Hessian! More on Justin Domke's blog.