It's true that these are completely different implementations. But Lua 5.1 is already one of the fastest dynamic language interpreters. There isn't room to optimize it a further 1.5-5x, as you would need to catch up to LuaJIT's interpreter. And as the link above shows, the LuaJIT 2.0 interpreter beats Mike's own LuaJIT 1.x JIT compiler in some cases.
Mike's post made lots of specific and concrete arguments for why it's hard for C compilers to compete. Most notably, Mike's hand-written interpreter keeps all important data in registers for all fast-paths, without spilling to the stack. My experience looking at GCC output is that it is not nearly so good at this.
Look at luaV_execute() here and tell me that GCC is really going to be able to keep the variable "pc" in a register, without spilling, in all fast paths, between iterations of the loop: http://www.lua.org/source/5.1/lvm.c.html
I don't agree with the talk's overall point, but if you are skeptical about pretty much anything Mike Pall says regarding performance, you need to look harder.
These numbers appear to have been generated without any profiling data, while the hand-optimized version has, in fact, had profiling data guiding it (the human profiled it).
Give me numbers with profile data, and file bugs about the differences in assembly generation, and I bet it could be pretty easily fixed.
Again, we've done this before for other interpreters.
My machine is an Intel(R) Xeon(R) CPU E5-1650 0 @ 3.20GHz.
To test LuaJIT with the JIT disabled I ran:
$ time luajit -j off benchmark.lua
To test regular and FDO builds for Lua 5.1.5 I ran (in the "src" directory of a Lua 5.1.5 tree):
$ make all
$ time ./lua benchmark.lua
$ make clean
$ make all MYCFLAGS=-fprofile-arcs MYLIBS=-fprofile-arcs
$ ./lua benchmark.lua
$ make clean (note: does not delete *.gcda)
$ make all MYCFLAGS=-fbranch-probabilities
$ time ./lua benchmark.lua
Because Lua's Makefiles use -O2 by default, I edited the Makefile to try -O3 also.
> and file bugs about the differences in assembly generation
It would be pretty hard to file bugs that specific since the two interpreters use different byte-code.
It would be an interesting exercise to write a C interpreter for the LuaJIT bytecode. That would make it easier to file the kinds of performance bugs you were mentioning.
Thank you for taking the time to perform these tests!
One thing that people advocating FDO often forget: this is statically tuning the code for a specific use case. Which is not what you want for an interpreter that has many, many code paths and is supposed to run a wide variety of code.
You won't get a 30% FDO speedup in any practical scenario. It does little for most other benchmarks and it'll pessimize quite a few of them, for sure.
Ok, so feed it with a huge mix of benchmarks that simulate typical usage. But then the profile gets flatter and FDO becomes much less effective.
Anyway, my point still stands: a factor of 1.1x - 1.3x is doable. Fine. But we're talking about a 3x speedup for my hand-written machine code vs. what the C compiler produces. And that's only a comparatively tiny speedup you get from applying domain-specific knowledge. Just ask the people writing video codecs about their opinion on C vector intrinsics sometime.
I write machine code, so you don't have to. The fact that I have to do it at all is disappointing. Especially from my perspective as a compiler writer.
But DJB is of course right: the key problem is not the compiler. We don't have a source language that's at the right level to express our domain-specific knowledge while leaving the implementation details to the compiler (or the hardware).
And I'd like to add: we probably don't have the CPU architectures that would fit that hypothetical language.
> One thing that people advocating FDO often forget: this is statically tuning the code for a specific use case.
Yes, I meant to mention this but forgot. The numbers I posted are a best-case example for FDO, because the FDO is specific to the one benchmark I'm testing.
> Ok, so feed it with a huge mix of benchmarks that simulate typical usage. But then the profile gets flatter and FDO becomes much less effective.
True, though I think one clear win with FDO is helping the compiler tell fast-paths from slow-paths in each opcode. I would expect this distinction to be relatively universal regardless of workload.
The theory of fast-path/slow-path optimization would say that fast-paths are much more common than slow-paths. Otherwise optimizing fast-paths would be useless because slow-paths would dominate.
The fact that a compiler can't statically tell a fast-path from a slow-path is a big weakness. FDO does seem like it should be able to help mitigate that weakness.
I assure you, the LuaJIT case is real.
Here is data about LuaJIT's interpreter (in assembly) vs. Lua 5.1's interpreter (in C):
http://luajit.org/performance_x86.html