I'm surprised no one has chimed in on how the program is "processing" 40GB of in...

treffer · on Dec 28, 2023

> my beefy gaming rig with a whopping 31.8 GB of memory

So a rerun of the program should only need to load ~8GB if the filesystem caching is somewhat loop/scan resistant.

My first thought was "probably the math is wrong", but looks like it adds up to something reasonable - especially as all numbers are rather vague / rounded (e.g. let it be 12s) and the number was just high, not absolute maximum.

winwang · on Dec 29, 2023

I guess this would be the expected, though boring, answer -- "incorrect" benching (i.e. clear the cache first).

drewtato · on Dec 28, 2023

My guess is either compression or stuff lingering in RAM. The CPU can't be smart here since it doesn't know what any of the future ifs will be. It doesn't know they're in order, or unique, or even valid instructions. You could (theoretically; the OS probably wouldn't let you) replace an if with an infinite loop while the program is running.

rocqua · on Dec 28, 2023

Could it be the branch predictor of the CPU somehow catching on to this? Perhaps something like 'the branch for input x is somewhere around base+10X.

dist1ll · on Dec 28, 2023

> Perhaps something like 'the branch for input x is somewhere around base+10X.

That's unlikely. Branch predictors are essentially hash tables that track statistics per branch. Since every branch is unique and only evaluated once, there's no chance for the BP to learn a sophisticated pattern. One thing that could be happening here is BP aliasing. Essentially all slots of the branch predictor are filled with entries saying "not taken".

So it's likely the BP tells the speculative execution engine "never take a branch", and we're jumping as fast as possible to the end of the code. The hardware prefetcher can catch on to the streaming loads, which helps mask load latency. Per-core bandwidth usually bottlenecks around 6-20GB/s, depending on if it's a server/desktop system, DRAM latency, and microarchitecture (because that usually determines the degree of memory parallelism). So assuming most of the file is in the kernel's page cache, those numbers check out.

umanwizard · on Dec 28, 2023

I doubt it, branch predictors just predict where one instruction branches to, not the result of executing many branches in a row.

Even if they could, it wouldn’t matter, as branch prediction just lets you start speculatively executing the right instruction sooner. The branches need to be fully resolved before the following instructions can actually be retired. (All instructions on x86 are retired in order).

rzzzt · on Dec 28, 2023

There's also anticipatory paging, the OS can guess which pages will be requested the next time around.

toxik · on Dec 28, 2023

It can't be the CPU since it is actually memory-mapped code, and the branch predictor surely cannot cause page faults and so cannot page in the next page of code.

Truly curious. I guess the linear access pattern helps but 800 MiB/s?

dist1ll · on Dec 28, 2023

Very likely the majority of the file is paged in, and can be prefetched just fine. The author doesn't clear/disable the page cache.

jeffreygoesto · on Dec 28, 2023

It mmaps the program, unused pages then only take a page table entry and are not loaded. The only page actually loaded is the one he directly jumps into. Neat trick.

Liquid_Fire · on Dec 28, 2023

It's not jumping directly, it's executing sequentially, comparing against each consecutive number in sequence.

jeffreygoesto · on Dec 28, 2023

You are right, my bad. Thanks for clarifying it.

bob1029 · on Dec 28, 2023

I assume modern branch predictors are capable of picking up a trivial pattern like this.