I'm surprised no one has chimed in on how the program is "processing" 40GB of instructions with only 800 MB/s * 10s of disk read.
If I had to hazard a guess, there's some kind of smart caching going on at the OS level, but that would entail the benchmark of "n close to 2^32" not being run correctly.
...Or that the CPU is smart enough to jump millions of instructions ahead.
> my beefy gaming rig with a whopping 31.8 GB of memory
So a rerun of the program should only need to load ~8GB if the filesystem caching is somewhat loop/scan resistant.
My first thought was "probably the math is wrong", but looks like it adds up to something reasonable - especially as all numbers are rather vague / rounded (e.g. let it be 12s) and the number was just high, not absolute maximum.
My guess is either compression or stuff lingering in RAM. The CPU can't be smart here since it doesn't know what any of the future ifs will be. It doesn't know they're in order, or unique, or even valid instructions. You could (theoretically; the OS probably wouldn't let you) replace an if with an infinite loop while the program is running.
> Perhaps something like 'the branch for input x is somewhere around base+10X.
That's unlikely. Branch predictors are essentially hash tables that track statistics per branch. Since every branch is unique and only evaluated once, there's no chance for the BP to learn a sophisticated pattern. One thing that could be happening here is BP aliasing. Essentially all slots of the branch predictor are filled with entries saying "not taken".
So it's likely the BP tells the speculative execution engine "never take a branch", and we're jumping as fast as possible to the end of the code. The hardware prefetcher can catch on to the streaming loads, which helps mask load latency. Per-core bandwidth usually bottlenecks around 6-20GB/s, depending on if it's a server/desktop system, DRAM latency, and microarchitecture (because that usually determines the degree of memory parallelism). So assuming most of the file is in the kernel's page cache, those numbers check out.
I doubt it, branch predictors just predict where one instruction branches to, not the result of executing many branches in a row.
Even if they could, it wouldn’t matter, as branch prediction just lets you start speculatively executing the right instruction sooner. The branches need to be fully resolved before the following instructions can actually be retired. (All instructions on x86 are retired in order).
It can't be the CPU since it is actually memory-mapped code, and the branch predictor surely cannot cause page faults and so cannot page in the next page of code.
Truly curious. I guess the linear access pattern helps but 800 MiB/s?
It mmaps the program, unused pages then only take a page table entry and are not loaded. The only page actually loaded is the one he directly jumps into. Neat trick.
If I had to hazard a guess, there's some kind of smart caching going on at the OS level, but that would entail the benchmark of "n close to 2^32" not being run correctly.
...Or that the CPU is smart enough to jump millions of instructions ahead.