I like the simplicity of this (very Go-like). However, what are the performance ...

jerf · on July 2, 2019

You'll get the overhead of a non-inlineable vtable-based method call for each pixel. How badly that hurts you depends on the ratio of expense of that call vs. how expensive the pixel is to generate. If you've already got all your pixel values manifested in memory and you're just indexing into an array, it's going to be fairly expensive. If you're generating noise with a random number generator, it's going to be noticeable but not necessarily catastrophic (since "generating a random number" and "making a method call" are somewhat comparable in size, varying based on the number generator in question). If you're generating a fractal the overhead will rapidly be lost in the noise.

But I'd also point out that the Go standard library does not necessarily claim to the "last word" for any give task; it's generally more an 80/20 sort of thing. If you've got a case where that's an unacceptable performance loss, go get or write a specialized library. There's nothing "un-Go-ic" about that.

diroussel · on July 2, 2019

I would expect the stable based dispatch to be handled quite well with branch prediction. And surely the cache misses from those nested loops would have a much worse impact. Even if a few extra instructions have to run per pixel it's going to be quicker than a fetch from main memory.

All the examples, in each language, could be rewritten as a cache oblivious algorithm to optimise cache usage. This would speed them all up. See https://en.wikipedia.org/wiki/Cache-oblivious_algorithm

jerf · on July 3, 2019

In general, I tend to agree there's a lot of people who have kind of picked up "vtables always bad and slow" and overestimate the actual overhead.

But I have actually benchmarked this before, and it is possible to have a function body so small (like, for example, a single slice array index lookup and return of a register-sized value like a machine word) that the function call overhead can dominate even so.

(Languages like Erlang and Go that emphasize concurrency have a constant low-level stream of posts on their forums from people who do an even more extreme version, when they try to "parallelize" the task of adding a list of integers together, and replace a highly-pipelineable int add operation that can actually come out to less than one cycle per add with spawning a new execution context, sending over the integers to add, adding them in the new context, and then synchronizing on sending them back. Then they wonder why Erlang/Go/threading in general sucks so much because this new program is literally hundreds of times slower than the original.)

But it is true this amortizes away fairly quickly, because the overhead isn't that large. Even the larger random number generators like the Mersenne Twister will be a long ways towards dominating the function call overhead. I don't even begin to worry about function call overhead unless I can see I'm trying to do several million per second, because generally, you can't do several million per second on a single core because the function bodies themselves are too large and doing too much stuff, such that even if function call overhead was 0 it would still be impossible in the particular program to do it.

bmurphy1976 · on July 2, 2019

Does it matter? Rendering a game at 60fps, sure. Saving a JPEG though? This is a great example where the simple approach is probably better.

If you need speed don't use the standard library, use something specialized.

This is a great example I didn't even know about that reinforces why I like Go. We don't have to over engineer everything.

gameswithgo · on July 2, 2019

images tend to be very large now. 4k monitors are normal, 16k monitors are planned in the future. So yeah I think it matters. Not always but often!

Crinus · on July 2, 2019

4K monitors are very rare, in regular desktop stats they are so insignificant statistically that they do not even show up:

http://gs.statcounter.com/screen-resolution-stats/desktop/wo...

In gaming they are just at 1.61% right below 1280x1024 and the increase is so low (+0.01) that might as well be zero (compare with 1080p's +2.04% which is the one increasing the most):

https://store.steampowered.com/hwsurvey/

Tech minded people are a bubble, gamers are a tiny bubble among those and /r/pcmasterrace 4K-or-die boasters are a tiny bubble among gamers. 4K, or even 1440p, matters way less in practice than tech minded people think.

lalaithion · on July 2, 2019

16K monitors? What size of monitor do you expect to use? If you sit 2 feet away from your monitor, you'd need a 57 inch monitor to resolve the pixels.

bitwize · on July 2, 2019

There is basically no upper bound to the display resolution people want, even if their eyes can't physically resolve it. Graphic designers and gamers will still swear there's a difference. It's like audiophilia for the visual system.

whatshisface · on July 2, 2019

At a certain point people will prefer large and more monitors over increased resolution. If I could buy two 16k monitors instead of one 32k monitor I'd do it, and that puts a soft upper limit on the resolution.

bmurphy1976 · on July 3, 2019

Again, where's the proof that the performance is a problem? The standard library should solve for the 80% case. I suspect it is well within "fast enough."

tedunangst · on July 2, 2019

Cast to image.RGBA (or whatever) and bang away on .Pix []uint8.

rwj · on July 2, 2019

This is another way that Go's approach works well. In practice, you will see library code check for common image formats, and then dispatch to optimized code. The advantage here being that optimization is a library concern. Callers still maintain flexibility.

hunterloftis · on July 2, 2019

You can certainly operate directly on pixel arrays as well. When the original example here is changed to operate on .Pix instead of using .At(), it runs about 2x faster:

https://github.com/nicolashahn/diffimg-go/pull/2

enriquto · on July 2, 2019

With a single modern CPU you can do a lot of operations at each pixel of a high resolution screen and still get a framerate that your monitor cannot churn. Then, you typically have a few more cores around to do other stuff.

Unless your language introduces an unreasonable overhead, a for loop over the pixels is perfectly appropriate and fast.

nemo1618 · on July 2, 2019

The problem in this case is not the looping over each pixel, but the overhead of invoking a dynamic method on each pixel. For example, if you're iterating over a []byte and setting each value to zero, the compiler can optimize that to a single memclr. Using an interface masks the underlying representation and consequently prevents any sort of inlining.

enriquto · on July 2, 2019

> Using an interface masks the underlying representation and consequently prevents any sort of inlining.

This sounds like a limitation of a particular optimizing compiler/interpreter rather than a problem of the language itself. For example, the plain lua interpreter incurs quite a lot of overhead for this, but the luajit interpreter is oblivious. The standard python interpreter definitely adds a lot of overhead.

jchw · on July 2, 2019

You can just optimize for underlying concrete types! I believe even the standard library does this.