Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I'm surprised no one has chimed in on how the program is "processing" 40GB of instructions with only 800 MB/s * 10s of disk read.

If I had to hazard a guess, there's some kind of smart caching going on at the OS level, but that would entail the benchmark of "n close to 2^32" not being run correctly.

...Or that the CPU is smart enough to jump millions of instructions ahead.



> my beefy gaming rig with a whopping 31.8 GB of memory

So a rerun of the program should only need to load ~8GB if the filesystem caching is somewhat loop/scan resistant.

My first thought was "probably the math is wrong", but looks like it adds up to something reasonable - especially as all numbers are rather vague / rounded (e.g. let it be 12s) and the number was just high, not absolute maximum.


I guess this would be the expected, though boring, answer -- "incorrect" benching (i.e. clear the cache first).


My guess is either compression or stuff lingering in RAM. The CPU can't be smart here since it doesn't know what any of the future ifs will be. It doesn't know they're in order, or unique, or even valid instructions. You could (theoretically; the OS probably wouldn't let you) replace an if with an infinite loop while the program is running.


Could it be the branch predictor of the CPU somehow catching on to this? Perhaps something like 'the branch for input x is somewhere around base+10X.


> Perhaps something like 'the branch for input x is somewhere around base+10X.

That's unlikely. Branch predictors are essentially hash tables that track statistics per branch. Since every branch is unique and only evaluated once, there's no chance for the BP to learn a sophisticated pattern. One thing that could be happening here is BP aliasing. Essentially all slots of the branch predictor are filled with entries saying "not taken".

So it's likely the BP tells the speculative execution engine "never take a branch", and we're jumping as fast as possible to the end of the code. The hardware prefetcher can catch on to the streaming loads, which helps mask load latency. Per-core bandwidth usually bottlenecks around 6-20GB/s, depending on if it's a server/desktop system, DRAM latency, and microarchitecture (because that usually determines the degree of memory parallelism). So assuming most of the file is in the kernel's page cache, those numbers check out.


I doubt it, branch predictors just predict where one instruction branches to, not the result of executing many branches in a row.

Even if they could, it wouldn’t matter, as branch prediction just lets you start speculatively executing the right instruction sooner. The branches need to be fully resolved before the following instructions can actually be retired. (All instructions on x86 are retired in order).


There's also anticipatory paging, the OS can guess which pages will be requested the next time around.


It can't be the CPU since it is actually memory-mapped code, and the branch predictor surely cannot cause page faults and so cannot page in the next page of code.

Truly curious. I guess the linear access pattern helps but 800 MiB/s?


Very likely the majority of the file is paged in, and can be prefetched just fine. The author doesn't clear/disable the page cache.


It mmaps the program, unused pages then only take a page table entry and are not loaded. The only page actually loaded is the one he directly jumps into. Neat trick.


It's not jumping directly, it's executing sequentially, comparing against each consecutive number in sequence.


You are right, my bad. Thanks for clarifying it.


I assume modern branch predictors are capable of picking up a trivial pattern like this.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: