Automatic Differentiation: Forward and Reverse

sillysaurusx · on May 11, 2022

If you want to learn how to build autodiff from scratch in python, see autodidax: https://jax.readthedocs.io/en/latest/autodidax.html

It's light on math and heavy on simplicity. I've been studying it for about 9 months and I still learn new things.

sritchie · on May 11, 2022

If you want to try a literate programming take on forward-mode automatic differentiation in Lisp (Clojure), check out the Differential implementation in my sicmutils library: https://github.com/sicmutils/sicmutils/blob/main/src/sicmuti...

Here's the org-mode export of the file, with all TeX rendered: https://samritchie.io/dual-numbers-and-automatic-differentia...

I worked on this with Gerald Sussman, and it has the somewhat unique feature of being able to take derivatives of higher-order functions. The system is also extensible, so this feature could have been installed outside of the core library, and in fact would need to be to support, say, functions that returns JS futures, scala functions etc.

Enjoy!

FabHK · on May 11, 2022

Investment banks use this in production. You have a complex product whose price is a function of many inputs: P = f(x,y,z,...)

Your risks are basically the changes of the price when one of the inputs changes (delta is dP/dS where S is spot, vega is dP/dσ where σ is vol, etc.).

Using finite differences, you'd have to bump each of the inputs. With reverse AD, you get all the risks in one go.

KMag · on May 11, 2022

Expanding a bit more, risk calculations are basically calculating the partial derivatives of the current value of your books with respect to inputs observable in the outside world ("observables", almost exclusively current market prices). In other words, it's calculating the Jacobian matrix for the total value of everything on your books.

I worked on a project to compile a declarative description of complex financial products into GPU code to perform Monte Carlo simulations to estimate the fair market price of those products. We found that having a function return its present value and its N-dimensional first-order linear approximation costs about twice as much computing power as just calculating the present value. Approximating the first-order linearization by bumping inputs is exponential in the number of dimensions.

In particular, with local stochastic volatility (LSV) models, you're using these linearizations to work out the values of some hidden variables (implied volatilities at various strike prices, and implied volatilities of those volatilities ("vol on vol") going forward). That gives you a smooth, self-consistent volatility surface that you can use to estimate the value of things that don't trade very often based on the current prices of things that do trade very often. A lot of financial modelling isn't about predicting the future, but instead "predicting the present" that you can't see based on the present that you can see.

On a side note, before I did work with financial models, I didn't understand why so many models assume Gaussian distributions even though the distributions have been known for decades to be non-Gaussian. It turns out that most non-Gaussian distributions don't give the same distribution (with different parameters) if you take a linear combination of them. (i.e. most distributions aren't stable distributions) The Levy distribution fits most financial data better than the Gaussian distribution, and is stable, but has infinite variance. So, in many financial modelling situations, the most efficient way to model behavior is to make piecewise Gaussian approximations of your actual distribution. So, it sounds crazy at first to model something as a Gaussian distribution with different variances at different points, there is a method to the madness.

Side note: did we work together in Hong Kong at an investment bank? My YAMS initials were the same as my HN username. (Though, when I started, my YAMS initials were KAM, as we were limited to 3 initials at that time.)

KMag · on May 11, 2022

Err... sorry... bumping once in each of the directions is linear in the number of dimensions, not exponential. You don't bump all combinations, just one dimension at a time. In any case, the savings from automatic differentiation are often significant, but not exponential, unless you're doing something very wrong.

FabHK · on May 11, 2022

(I was FAB on YAMS, but based in London.)

fgh · on May 11, 2022

Just to add to this that AD was also a game changer for the banks in terms of necessary computing power and the time it needs to do these calculations. Antoine Savine describes this well in "Modern Computational Finance: AAD and Parallel Simulations", if someone is interested in this topic.

troelsSteegin · on May 11, 2022

This page is very nicely presented and was a pleasure to read.

"Automatic differentiation, on the other hand, is a solution to the problem of calculating derivatives without the downfalls of symbolic differentiation and finite differences. The key idea behind AD is to decompose calculations into elementary steps that form an evaluation trace, and combine each step’s derivative together through the chain rule."

JarreNael · on May 11, 2022

In PyTorch:

  a = torch.tensor([2., 3.], requires_grad=True)
  b = torch.tensor([6., 4.], requires_grad=True)
  Q = 3*a**3 - b**2
  external_grad = torch.tensor([1., 1.])
  Q.backward(gradient=external_grad)
  print(a.grad, b.grad) # the computed gradients.

All this is done on the GPU. Automatic Differentiation is the workhorse of modern NN.

chpatrick · on May 11, 2022

Sure, that's how you use it but it doesn't explain how it works, unlike the article. :)

hansvm · on May 12, 2022

Fun things to add:

(1) Despite being classified as separate things in TFA, symbolic differentiation really is the same thing as AD. The apparent explosion in terms comes not from some innate difference but because many symbolic differentiators don't allow you to express term reuse.

(2) It's easy to write locally (in code) non-differentiable functions that compose to something smooth, and AD usually has a bad time with that. Anything involving a branch on the symbols to be differentiated has a decent chance of blowing up this way. Consider x==0 ? x^2+x : x^2. A typical AD framework will either refuse to produce a derivative or will derive the first term and incorrectly produce 1 instead of 0. This sort of issue makes it challenging to derive variable-length algorithms even if they're known to have smooth results (like almost anything with a stable optimization subroutine).

ur-whale · on May 11, 2022

> I learned a lot by actually implementing the techniques, instead of just going over the mathematics.

This is such an absolute and profound insight and applies to so much of maths in general it would deserve an article on its own.

omnicognate · on May 11, 2022

I've found algorithmic (I prefer this to "automatic") differentiation strangely slippery to explain, given the concept is essentially rather simple. I think the main reason for this is the way people often think of what the "derivative" of a function is.

In high school you're taught that the derivative of the function f(x) = x^2 is the function f'(x) = 2x. That is, the derivative of the function is another function that computes its gradient. Algorithmic differentation is very confusing if you think in these terms.

When learning multivariable calculus, and when getting your head around AD, it's better to think of the derivative as a linear approximation of the original function around a particular point. In the single-variable case that means we interpret the derivative not as (a function that gives you) the gradient of the tangent line at a point, but as the tangent line itself at a particular point.

In the case of f(x) = x^2, then, the derivative at x=3 is the tangent line to the curve at the point x=3, y=9. It's best to define this in terms of offsets from the point where we're evaluating the derivative, so (y - 9) = 2 * (x - 3).

Key points to note:

1. This derivative is a linear function. Specifically, in this case in defining the tangent line we get y-9 as a linear function of x-3. (Incidentally an alternative and perhaps more obvious representation of this tangent line would give y in terms of x, i.e. y = 2x+3. This is a less useful representation though because y is not a linear function of x, it's merely an affine function - i.e. the line doesn't pass through the origin.)

2. The linear function we get depends on which value of x we evaluate it for. We have a different derivative at each point.

3. When we evaluate this linear function we must give it an offset, a value of (x-3) for which to compute the corresponding value of (y-9).[1]

This concept of derivative generalises naturally to multiple input and output variables. The derivative is still a linear function that approximates the original function around a particular (multidimensional) point. Being a linear function it can be represented as a matrix: the Jacobian matrix. It's worth nothing, though, that the Jacobian matrix is merely a representation of the linear function we are calling the derivative, and it is far from the only one possible. In a computer we can (and usually do in AD) represent the derivative as a (computer) function that takes an offset and returns an offset, and make a call to it to evaluate it rather than ever forming a matrix.

With this way of looking at things, it's relatively easy to understand AD. The key insight is that derivatives of functions compose in the same way as the original functions. If we compose two functions and we know their individual derivatives (also functions), we can evaluate the derivative of the composition by composing the two derivatives in the same way. More generally, if you have a program consisting of a call graph of numeric functions and you can make derivative functions that correspond to those individual functions they will form a call graph with the same structure, allowing you to evaluate the derivative function of the entire program. This is forward mode AD.[2]

This is related to the chain rule but in AD calling it that somewhat obscures the point and hides its intuitive obviousness. We shouldn't be surprised that derivatives compose in the same way as the original functions given they are linear approximations of those original functions!

Anyway, personally I found this to be the insight that let me get my head round AD. I hope it helps someone!

[1] This input exists in AD code when we evaluate the derivative, and its meaning is often very unclear to beginners. In forward mode AD we actually evaluate this linear approximation function. That is, we choose a point around which to linearly approximate and an offset from that point for which to evaluate the derivative function. See [2] for the purpose we use that offset for.

[2] Adjoint/reverse mode uses a different linear function which composes the opposite way round to the original function, making it less convenient to compute but bringing big computational complexity benefits in typical real-world cases where you have a small number of inputs but a large number of outputs. In forward mode a single evaluation of the graph of derivatives can give you the sensitivities of a single output to each of the inputs. If we want the sensitivities of all the outputs (i.e. to compute the entire Jacobian) we need one evaluation for each output. The purpose we use the offset discussed in [1] for is selecting which output we're currently computing the sensitivities for. In adjoint mode this is reversed, and we need to perform one evaluation of the graph for each input, of which there are usually many fewer than outputs.

derbOac · on May 11, 2022

One of the best explanations for me — maybe one of the first I encountered — was a paper posted here on HN about AD applied to abstract types. Getting away from numeric derivatives forced me to think about them differently.

Incidentally, this is also one area where prior experience with lisp can be helpful.

omnicognate · on May 12, 2022

Oops, realised I got my footnote [2] the wrong way round here, and it's too late to edit it. In forward mode we need one evaluation for each input to get all the sensitivities. In reverse/adjoint mode it's one per output. In real world situations we're often performing calculations on a large set of input numbers to produce a small set of output numbers we're interested in, so there may be orders of magnitude fewer outputs than inputs.

jhgb · on May 11, 2022

> I've found algorithmic (I prefer this to "automatic") differentiation

Isn't every differentiation algorithmic? It's typically integration where you often use intuition in substitution, unless you're using something like Risch's algorithm, but one learns to differentiate in school from the very beginning with what can only be described as an algorithm: at every step, there's a clear procedure to decide on what to do next, and when to terminate.

omnicognate · on May 11, 2022

There are 3 ways of doing differentiation in a computer: numeric, X and symbolic, where X is the technique we're discussing. Personally I don't think "automatic" really captures what's different about this technique. It's no more "automatic" than the other options.

"Algorithmic" differentiation is intended to evoke the idea that we are "differentiating the algorithm" piece by piece, although it's certainly true that it can be misunderstood as saying that the differentiation process itself is an algorithm, which is again no different to the other techniques.

It's a shame the terminology in this area is so fragmented. I assume it's because the technique has been rediscovered many times. In machine learning, for example, what I would call "adjoint algorithmic differentiation" is called "backpropagation".

jhgb · on May 11, 2022

Ah, I see what you mean, but if you contrast "automatic differentiation" and "algorithmic differentiation" with each other, it's counterintuitive to me that these terms should be "oriented" differently. Either I imagine that the differentiation is automated or algorithmized, or I could imagine that it's an automaton or algorithm being differentiated. But it's confusing to me if it's algorithm being differentiated vs. differentiation being automated. Not sure if that explains sufficiently clearly what confused me.

tomstuart · on May 11, 2022

Thank you! This is the clearest explanation I’ve read and it definitely shook something loose in my brain.

omnicognate · on May 11, 2022

Glad to hear it!

nsajko · on May 11, 2022

Too bad there's no chain rule for derivatives of (weighted) regular expressions!

mcovalt · on May 11, 2022

A weighted automata can be expressed as a sparse matrix and the matrix-vector product can be used to traverse the states. I think you could get a derivative out of that.