It's true that these are completely different implementations. But Lua 5.1 is already one of the fastest dynamic language interpreters. There isn't room to optimize it a further 1.5-5x, as you would need to catch up to LuaJIT's interpreter. And as the link above shows, the LuaJIT 2.0 interpreter beats Mike's own LuaJIT 1.x JIT compiler in some cases.
Mike's post made lots of specific and concrete arguments for why it's hard for C compilers to compete. Most notably, Mike's hand-written interpreter keeps all important data in registers for all fast-paths, without spilling to the stack. My experience looking at GCC output is that it is not nearly so good at this.
Look at luaV_execute() here and tell me that GCC is really going to be able to keep the variable "pc" in a register, without spilling, in all fast paths, between iterations of the loop: http://www.lua.org/source/5.1/lvm.c.html
I don't agree with the talk's overall point, but if you are skeptical about pretty much anything Mike Pall says regarding performance, you need to look harder.
These numbers appear to have been generated without any profiling data, while the hand-optimized version has, in fact, had profiling data guiding it (the human profiled it).
Give me numbers with profile data, and file bugs about the differences in assembly generation, and I bet it could be pretty easily fixed.
Again, we've done this before for other interpreters.
My machine is an Intel(R) Xeon(R) CPU E5-1650 0 @ 3.20GHz.
To test LuaJIT with the JIT disabled I ran:
$ time luajit -j off benchmark.lua
To test regular and FDO builds for Lua 5.1.5 I ran (in the "src" directory of a Lua 5.1.5 tree):
$ make all
$ time ./lua benchmark.lua
$ make clean
$ make all MYCFLAGS=-fprofile-arcs MYLIBS=-fprofile-arcs
$ ./lua benchmark.lua
$ make clean (note: does not delete *.gcda)
$ make all MYCFLAGS=-fbranch-probabilities
$ time ./lua benchmark.lua
Because Lua's Makefiles use -O2 by default, I edited the Makefile to try -O3 also.
> and file bugs about the differences in assembly generation
It would be pretty hard to file bugs that specific since the two interpreters use different byte-code.
It would be an interesting exercise to write a C interpreter for the LuaJIT bytecode. That would make it easier to file the kinds of performance bugs you were mentioning.
Thank you for taking the time to perform these tests!
One thing that people advocating FDO often forget: this is statically tuning the code for a specific use case. Which is not what you want for an interpreter that has many, many code paths and is supposed to run a wide variety of code.
You won't get a 30% FDO speedup in any practical scenario. It does little for most other benchmarks and it'll pessimize quite a few of them, for sure.
Ok, so feed it with a huge mix of benchmarks that simulate typical usage. But then the profile gets flatter and FDO becomes much less effective.
Anyway, my point still stands: a factor of 1.1x - 1.3x is doable. Fine. But we're talking about a 3x speedup for my hand-written machine code vs. what the C compiler produces. And that's only a comparatively tiny speedup you get from applying domain-specific knowledge. Just ask the people writing video codecs about their opinion on C vector intrinsics sometime.
I write machine code, so you don't have to. The fact that I have to do it at all is disappointing. Especially from my perspective as a compiler writer.
But DJB is of course right: the key problem is not the compiler. We don't have a source language that's at the right level to express our domain-specific knowledge while leaving the implementation details to the compiler (or the hardware).
And I'd like to add: we probably don't have the CPU architectures that would fit that hypothetical language.
> One thing that people advocating FDO often forget: this is statically tuning the code for a specific use case.
Yes, I meant to mention this but forgot. The numbers I posted are a best-case example for FDO, because the FDO is specific to the one benchmark I'm testing.
> Ok, so feed it with a huge mix of benchmarks that simulate typical usage. But then the profile gets flatter and FDO becomes much less effective.
True, though I think one clear win with FDO is helping the compiler tell fast-paths from slow-paths in each opcode. I would expect this distinction to be relatively universal regardless of workload.
The theory of fast-path/slow-path optimization would say that fast-paths are much more common than slow-paths. Otherwise optimizing fast-paths would be useless because slow-paths would dominate.
The fact that a compiler can't statically tell a fast-path from a slow-path is a big weakness. FDO does seem like it should be able to help mitigate that weakness.
I assure you, the LuaJIT case is real.
Here is data about LuaJIT's interpreter (in assembly) vs. Lua 5.1's interpreter (in C):
http://luajit.org/performance_x86.html