I've found algorithmic (I prefer this to "automatic") differentiation strangely slippery to explain, given the concept is essentially rather simple. I think the main reason for this is the way people often think of what the "derivative" of a function is.
In high school you're taught that the derivative of the function f(x) = x^2 is the function f'(x) = 2x. That is, the derivative of the function is another function that computes its gradient. Algorithmic differentiation is very confusing if you think in these terms.
When learning multivariable calculus, and when getting your head around AD, it's better to think of the derivative as a linear approximation of the original function around a particular point. In the single-variable case that means we interpret the derivative not as (a function that gives you) the gradient of the tangent line at a point, but as the tangent line itself at a particular point.
In the case of f(x) = x^2, then, the derivative at x=3 is the tangent line to the curve at the point x=3, y=9. It's best to define this in terms of offsets from the point where we're evaluating the derivative, so (y - 9) = 6 * (x - 3).
Key points to note:
1. This derivative is a linear function. Specifically, in defining the tangent line we get y-9 as a linear function of x-3. (Incidentally, an alternative and perhaps more obvious representation of this tangent line would give y in terms of x, i.e. y = 6x - 9. This is a less useful representation, though, because y is not a linear function of x, merely an affine one - the line doesn't pass through the origin.)
2. The linear function we get depends on which value of x we evaluate it for. We have a different derivative at each point.
3. When we evaluate this linear function we must give it an offset, a value of (x-3) for which to compute the corresponding value of (y-9).[1]
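To make points 1-3 concrete, here's a minimal sketch in plain Python (the names, like derivative_of_f_at, are just mine for illustration): the derivative of f at a chosen point is returned as a linear function that maps an input offset to an output offset.

    def f(x):
        return x ** 2

    def derivative_of_f_at(x0):
        """The derivative of f at x0, as a linear function of an offset."""
        slope = 2 * x0                 # f'(x0) = 2 * x0
        return lambda dx: slope * dx   # maps (x - x0) to an approximation of (y - f(x0))

    df_at_3 = derivative_of_f_at(3.0)  # a different linear function for each x0 (point 2)
    print(df_at_3(0.1))                # 0.6, given the offset 0.1 (point 3)
    print(f(3.0) + df_at_3(0.1))       # 9.6, the tangent-line estimate of f(3.1) = 9.61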
This concept of derivative generalises naturally to multiple input and output variables. The derivative is still a linear function that approximates the original function around a particular (multidimensional) point. Being a linear function it can be represented as a matrix: the Jacobian matrix. It's worth noting, though, that the Jacobian matrix is merely one representation of the linear function we are calling the derivative, and it is far from the only one possible. In a computer we can (and in AD usually do) represent the derivative as a (computer) function that takes an offset and returns an offset, and call it to evaluate the derivative rather than ever forming a matrix.
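To show what that looks like with more than one variable, here's a toy sketch of my own (the function f(x, y) = (x*y, x + y) and the names are made up for illustration): the derivative at a point is written directly as a function from input offsets to output offsets, and no Jacobian matrix is ever formed.

    def f(x, y):
        return (x * y, x + y)

    def derivative_of_f_at(x0, y0):
        """The linear map (dx, dy) -> (du, dv) approximating f near (x0, y0)."""
        def apply(dx, dy):
            du = y0 * dx + x0 * dy     # what would be the first row of the Jacobian
            dv = dx + dy               # what would be the second row
            return (du, dv)
        return apply

    df = derivative_of_f_at(2.0, 5.0)
    print(df(0.1, -0.2))               # (0.1, -0.1): predicted output offsets
    print(f(2.1, 4.8))                 # (10.08, 6.9), close to f(2.0, 5.0) plus the prediction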
With this way of looking at things, it's relatively easy to understand AD. The key insight is that derivatives of functions compose in the same way as the original functions. If we compose two functions and we know their individual derivatives (also functions), we can evaluate the derivative of the composition by composing the two derivatives in the same way. More generally, if you have a program consisting of a call graph of numeric functions and you can make derivative functions that correspond to those individual functions they will form a call graph with the same structure, allowing you to evaluate the derivative function of the entire program. This is forward mode AD.[2]
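As a rough sketch of forward mode (my own toy code, not any particular library's API): each value is carried together with its offset, and the derivative of each primitive composes through the call graph exactly as the primitive itself does.

    class Dual:
        """A value paired with the offset propagated by the linearised program."""
        def __init__(self, value, tangent):
            self.value, self.tangent = value, tangent

        def __add__(self, other):
            return Dual(self.value + other.value, self.tangent + other.tangent)

        def __mul__(self, other):
            return Dual(self.value * other.value,
                        self.tangent * other.value + self.value * other.tangent)

    def square(x):        # a primitive whose derivative we know how to apply
        return x * x

    def program(x):       # a composition: computes x^2 + x^3
        u = square(x)
        return u + u * x

    # Feed in the point (value=3) and the input offset (tangent=1); the same
    # call graph then evaluates the program and its derivative in one pass.
    result = program(Dual(3.0, 1.0))
    print(result.value, result.tangent)   # 36.0 and 33.0 (2x + 3x^2 at x=3)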
This is related to the chain rule but in AD calling it that somewhat obscures the point and hides its intuitive obviousness. We shouldn't be surprised that derivatives compose in the same way as the original functions given they are linear approximations of those original functions!
Anyway, personally I found this to be the insight that let me get my head round AD. I hope it helps someone!
[1] This input exists in AD code when we evaluate the derivative, and its meaning is often very unclear to beginners. In forward mode AD we actually evaluate this linear approximation function: we choose a point around which to linearly approximate, and an offset from that point at which to evaluate the derivative function. See [2] for what that offset is used for.
[2] Adjoint/reverse mode uses a different linear function which composes the opposite way round to the original function, making it less convenient to compute but bringing big computational complexity benefits in typical real-world cases where you have a large number of inputs but a small number of outputs. In forward mode a single evaluation of the graph of derivatives gives you the sensitivities of every output to a single input. If we want the sensitivities with respect to all the inputs (i.e. to compute the entire Jacobian) we need one evaluation for each input. The offset discussed in [1] is what selects which input we're currently computing sensitivities for. In adjoint mode this is reversed: we need one evaluation of the graph for each output, of which there are usually many fewer than inputs.
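As a rough illustration of the pass-counting in [2], here's a toy sketch of my own (same offset-in/offset-out style as above, nothing from a real library): with two inputs, forward mode needs two evaluations, one per input, each recovering one column of the Jacobian.

    def derivative_of_f_at(x0, y0):   # derivative of f(x, y) = (x*y, x + y), as before
        return lambda dx, dy: (y0 * dx + x0 * dy, dx + dy)

    df = derivative_of_f_at(2.0, 5.0)
    col_x = df(1.0, 0.0)   # sensitivities of both outputs to input x: (5.0, 1.0)
    col_y = df(0.0, 1.0)   # sensitivities of both outputs to input y: (2.0, 1.0)
    print(col_x, col_y)    # two forward passes = the full 2x2 Jacobian, column by column
    # Reverse mode composes the other way round: one pass per output, each pass
    # recovering a row of the Jacobian instead of a column.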
One of the best explanations for me — maybe one of the first I encountered — was a paper posted here on HN about AD applied to abstract types. Getting away from numeric derivatives forced me to think about them differently.
Incidentally, this is also one area where prior experience with lisp can be helpful.
To expand on footnote [2]: in forward mode we need one evaluation for each input to get all the sensitivities, while in reverse/adjoint mode it's one per output. In real-world situations we're often performing calculations on a large set of input numbers to produce a small set of output numbers we're interested in, so there may be orders of magnitude fewer outputs than inputs - which is exactly when adjoint mode pays off.
> I've found algorithmic (I prefer this to "automatic") differentiation
Isn't every differentiation algorithmic? It's typically integration where you rely on intuition (for example in choosing substitutions), unless you're using something like the Risch algorithm. One learns to differentiate in school from the very beginning with what can only be described as an algorithm: at every step there's a clear procedure for deciding what to do next and when to terminate.
There are 3 ways of doing differentiation in a computer: numeric, X and symbolic, where X is the technique we're discussing. Personally I don't think "automatic" really captures what's different about this technique. It's no more "automatic" than the other options.
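For what it's worth, here's a toy illustration of the three routes for f(x) = x^2 at x = 3 (entirely my own sketch; the symbolic rewriting step is stubbed out with the string "2*x"):

    def f(x):
        return x ** 2

    # Numeric: finite differences on the original function; only approximate.
    h = 1e-6
    numeric = (f(3.0 + h) - f(3.0)) / h

    # Symbolic: rewrite the expression x**2 into the expression 2*x, then evaluate it.
    symbolic = eval("2*x", {"x": 3.0})

    # Algorithmic: run the original program on (value, derivative) pairs.
    class Dual:
        def __init__(self, value, tangent):
            self.value, self.tangent = value, tangent
        def __pow__(self, n):
            return Dual(self.value ** n, n * self.value ** (n - 1) * self.tangent)

    algorithmic = f(Dual(3.0, 1.0)).tangent

    print(numeric, symbolic, algorithmic)   # ~6.000001, 6.0, 6.0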
"Algorithmic" differentiation is intended to evoke the idea that we are "differentiating the algorithm" piece by piece, although it's certainly true that it can be misunderstood as saying that the differentiation process itself is an algorithm, which is again no different to the other techniques.
It's a shame the terminology in this area is so fragmented. I assume it's because the technique has been rediscovered many times. In machine learning, for example, what I would call "adjoint algorithmic differentiation" is called "backpropagation".
Ah, I see what you mean, but if you contrast "automatic differentiation" and "algorithmic differentiation" with each other, it's counterintuitive to me that the terms should be "oriented" differently. I can imagine that the differentiation is being automated or algorithmised, or I can imagine that an automaton or algorithm is being differentiated. But it's confusing if one term means the algorithm is being differentiated while the other means the differentiation is being automated. Not sure if that explains sufficiently clearly what confused me.