
While some people here are citing 10% as "large" and 1% as "normal", there are optimizations like partial inlining of doubly recursive Fibonacci that can reduce the actual work (and so also the time) by enormous factors: >10X for double-digit arguments, i.e. 1000s of %. Technically, the saving is exponential in the reduction of the recursion depth, not in the problem size [1].

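To make that concrete, here is a hand-written sketch (not actual gcc/clang output) of the kind of transformation a partial inliner performs on the naive doubly recursive fib. One level of inlining folds two levels of the call tree into each function body: the number of additions stays the same, but the number of actual calls (and so the call overhead) drops by roughly 2X, and deeper inlining shrinks it further.

    /* naive doubly recursive Fibonacci */
    long fib(long n) {
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }

    /* roughly what one level of partial inlining yields: the bodies of
       fib(n-1) and fib(n-2) are expanded in place, so each remaining
       call now covers two levels of the original call tree */
    long fib_pi(long n) {
        if (n < 2) return n;
        long a = (n - 1 < 2) ? n - 1 : fib_pi(n - 2) + fib_pi(n - 3);
        long b = (n - 2 < 2) ? n - 2 : fib_pi(n - 3) + fib_pi(n - 4);
        return a + b;
    }

Whether a real compiler performs this at -O2/-O3, and to what depth, depends entirely on its inlining heuristics, which is exactly why the speed-up is so fragile.
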
C compilers can also be very finicky about their inlining heuristics/cost metrics. So, whether that enormous speed-up is realized can be very sensitive to the shape of your code.

So, while part of the problem is definitely that CPUs have gotten quite fancy/complex, another aspect is that compilers "beyond -O0 or -O1" have also gotten fancy/complex.

The article here, while good & worth reading, is but one of many examples of how two complex things interacting can lead to very surprising results (and this is true outside of computing, too). People have a rather strong tendency to oversimplify -- no matter how many times this lesson shows up.

EDIT: Also, the article at least uses two CPUs (Intel & Apple M1) and two compilers (gcc & clang), but realistic deployment scenarios span many more generations/implementations of Intel, AMD, and ARM, and probably other compilers, too. So it only samples a small part of the total complexity. Also, if you want to be more scientific, especially for differences like "1.01X", time measurements should really have error bars of some kind (either the std.dev of the mean or, maybe better for a case like this, the std.dev of the min [2]), and to minimize measurement noise you probably want to pin the benchmark to a core (CPU affinity scheduling in the OS); a rough sketch of such a harness follows the links.

[1] https://stackoverflow.com/questions/360748/computational-com...

[2] https://github.com/c-blake/bu/blob/main/doc/tim.md
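
For concreteness, a minimal harness along those lines might look like the sketch below (not the linked tim tool; the affinity call is Linux-specific, and core 0, fib(34), and 30 repetitions are arbitrary choices). It pins the process to one core, repeats the measurement, and reports the min plus the mean with its std.dev.

    /* build: cc -O2 bench.c -lm */
    #define _GNU_SOURCE                /* for sched_setaffinity & CPU_* macros */
    #include <math.h>
    #include <sched.h>
    #include <stdio.h>
    #include <time.h>

    static long fib(long n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }

    static double now(void) {          /* monotonic wall clock, in seconds */
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        cpu_set_t set;                 /* pin to core 0 so the scheduler does */
        CPU_ZERO(&set);                /* not migrate us mid-measurement */
        CPU_SET(0, &set);
        sched_setaffinity(0, sizeof set, &set);

        enum { REPS = 30 };
        double t[REPS], sum = 0.0, min = 1e300;
        for (int i = 0; i < REPS; i++) {
            double t0 = now();
            volatile long r = fib(34); /* volatile: discourage dropping the call */
            (void)r;
            t[i] = now() - t0;
            sum += t[i];
            if (t[i] < min) min = t[i];
        }
        double mean = sum / REPS, var = 0.0;
        for (int i = 0; i < REPS; i++) var += (t[i] - mean) * (t[i] - mean);
        printf("min %.6f s   mean %.6f s +/- %.6f (std.dev of the mean)\n",
               min, mean, sqrt(var / (REPS - 1) / REPS));
        return 0;
    }

Estimating the error of the min itself takes a bit more work (e.g. the approach in [2]), but even min + std.dev of the mean is enough to tell whether a "1.01X" is signal or noise.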


