I recently learned how Doom was ported to the SNES. It's quite impressive. The SNES hardware was nowhere near fast enough to do all the trig calculations needed for the game, but cartridge-based games had a trick up their sleeve: they could include actual hardware inside the cart that the game code could make use of. It was more expensive, but if you expected to sell a boatload of copies, it could be worth it. However, even using extra hardware wasn't enough in this case. So they pre-calculated lookup tables for sine, cosine, tangent etc. for every angle at the necessary precision. They were helped by the fact that the game's resolution was fairly low.
If you're interested, you can peruse the C code that was used to generate the tables. Here's the file for sine/cosine: https://github.com/RandalLinden/DOOM-FX/blob/master/source/m...
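If you just want the flavor of what such a generator looks like, here is a minimal sketch of the general idea rather than the actual DOOM-FX code from the link above; the ANGLES count and the 16.16 fixed-point scale are just illustrative assumptions:

```c
/* Minimal sketch of a trig-table generator: emit sine as 16.16 fixed-point
 * integers for a fixed number of angle steps. The real generator linked
 * above is more involved; ANGLES and FRACUNIT here are assumed values. */
#include <math.h>
#include <stdio.h>

#define ANGLES   1024    /* angle steps per full turn (assumed for the sketch) */
#define FRACUNIT 65536   /* 16.16 fixed point */

int main(void) {
    const double two_pi = 6.28318530717958647692;
    printf("static const int sine_table[%d] = {\n", ANGLES);
    for (int i = 0; i < ANGLES; i++)
        printf("    %ld,\n", lround(sin(two_pi * i / ANGLES) * FRACUNIT));
    printf("};\n");
    return 0;
}
```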
Yup, I remember watching a video about how the RAM bus is the bottleneck when running Super Mario 64 on the N64. The original implementation used trig lookup tables, but the person optimized it by instead using a Taylor series (I think) and some negation/shifting (roughly the shape sketched below).
Also enjoy his discovery that, while the game got flak online for years because the release build had files that were not optimized, it turns out most of the optimizations that were done actually made things worse due to the small instruction cache and limited RAM (I forget the exact details). Things like unrolling loops just increased the code size and required more slow code paging.
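On the trig substitution mentioned above: I haven't checked the actual SM64 changes, so this is only a sketch of what "a short polynomial plus some sign tricks" usually looks like, not the real patch:

```c
#include <stdio.h>

/* Degree-7 Taylor/odd-polynomial approximation of sine, via Horner's rule:
 * x - x^3/3! + x^5/5! - x^7/7!.
 * Assumes x has already been reduced to roughly [-pi, pi]; the symmetries
 * sin(-x) = -sin(x) and sin(pi - x) = sin(x) handle the rest with negation. */
static float sin_poly(float x) {
    float x2 = x * x;
    return x * (1.0f - x2 * (1.0f / 6.0f
                    - x2 * (1.0f / 120.0f
                    - x2 * (1.0f / 5040.0f))));
}

int main(void) {
    printf("%f\n", sin_poly(0.5f));   /* ~0.479426 */
    return 0;
}
```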
Abramowitz and Stegun's Handbook of Mathematical Functions[0] - I still use it to test whenever I need to implement any kind of fundamental math function that's not built into whatever environment I'm using. Haven't used it in some time but it comes in handy.
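A typical way to use it: spot-check an implementation against a handful of tabulated values. A rough sketch, where my_sin is just a stand-in for whatever function you're actually implementing and the reference constants are standard sine values of the kind the handbook tabulates:

```c
#include <math.h>
#include <stdio.h>

/* Stand-in for your own implementation under test. */
static double my_sin(double x) { return sin(x); }

int main(void) {
    /* { argument, reference value } */
    const double cases[][2] = {
        { 0.0, 0.000000000000000 },
        { 0.5, 0.479425538604203 },
        { 1.0, 0.841470984807897 },
        { 1.5, 0.997494986604054 },
    };
    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        double got = my_sin(cases[i][0]);
        printf("sin(%.1f) = %.15f  (abs err %.2e)\n",
               cases[i][0], got, fabs(got - cases[i][1]));
    }
    return 0;
}
```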
It turns out that SNES Doom missed out on a big optimization that people figured out later on. If you use the SNES's mosaic feature combined with scrolling tricks, you can nearly double the fill rate: rather than writing each pixel twice, you let the mosaic hardware do the pixel doubling.
The approach is older than that. I remember my grandfather's engineering books from the 1950s – nearly every one of them had a large appendix with precalculated sine, cosine, tangent and logarithm lookup tables. And there was at least one book that contained only such tables and nothing else.
That is how engineers used to calculate before the advent of computers.
The classic of this genre is Abramowitz and Stegun's "Handbook of Mathematical Functions" - although the two listed names are merely those of the compilation's editors, as calculating the numerous tables of values (and sheets of mathematical identities) required hundreds of human computers working for years.
Ironically, on publication in 1964 it was just in time to see the dawn of the electronic computer age that would supplant it.
Once I tested lookup tables for a path tracer (ray tracer).
It is interesting that you can get very decent results even with low-quality tables. Of course there will be artifacts, but due to the randomness of a path tracer they are not always very noticeable.
I always wonder, when hearing about these old optimizations, why they aren't used in contemporary code. Wouldn't you want to squeeze every bit of performance even on modern hardware?
The "processor-memory performance gap" is a big reason why lookup tables aren't as clear of a win on modern hardware as they were on the SNES.
If it takes two CPU cycles to read from RAM, a lookup table will basically always be faster than doing the math at runtime. If it takes fifty cycles (because, while your RAM may be faster, your CPU is a lot faster), and your processor has more advanced hardware that can do more math per cycle, maybe just doing the math is faster.
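If you want to see where your own machine lands on that tradeoff, something like the rough sketch below works; note it flatters the table, since a 4096-entry table stays resident in L1 cache here, unlike in real code with other data competing for it:

```c
#include <math.h>
#include <stdio.h>
#include <time.h>

#define N 4096                      /* table entries for one full turn */
static float table[N];
static const float TWO_PI = 6.2831853f;

/* Nearest-entry lookup, no interpolation, so accuracy is limited. */
static float lut_sin(float x) {
    int i = (int)(x * (N / TWO_PI)) & (N - 1);
    return table[i];
}

int main(void) {
    for (int i = 0; i < N; i++)
        table[i] = sinf(TWO_PI * i / N);

    volatile float sink = 0.0f;     /* keeps the loops from being optimized away */
    const int iters = 10000000;

    clock_t t0 = clock();
    for (int i = 0; i < iters; i++)
        sink += lut_sin((i & 1023) * 0.006f);
    clock_t t1 = clock();
    for (int i = 0; i < iters; i++)
        sink += sinf((i & 1023) * 0.006f);
    clock_t t2 = clock();

    printf("table: %.3fs  sinf: %.3fs\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}
```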
Other optimizations are now applied automatically by compilers. For example, all modern compilers optimize integer division by compile-time constants, here’s an example: https://godbolt.org/z/1b8r5c5MG
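As a concrete illustration (the divisor and magic constant below are my own example, not necessarily what that godbolt link shows): an unsigned division by 5 can be replaced by a multiply with a precomputed "magic" reciprocal plus a shift, which is essentially what compilers emit instead of a div instruction:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* What the compiler conceptually emits for x / 5 on a 32-bit unsigned:
 * multiply by ceil(2^34 / 5) = 0xCCCCCCCD, then shift right by 34. */
static uint32_t div5_magic(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 34);
}

int main(void) {
    /* Spot-check the equivalence on a sweep plus the extreme value. */
    for (uint64_t x = 0; x <= UINT32_MAX; x += 9973)   /* prime stride */
        assert(div5_magic((uint32_t)x) == (uint32_t)x / 5);
    assert(div5_magic(UINT32_MAX) == UINT32_MAX / 5u);
    puts("magic-number division matches x / 5");
    return 0;
}
```

Signed division and some other divisors need an extra fixup step, which is why the compiled output usually has a couple more instructions than this.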
Squeezing performance out of modern hardware requires doing very different things.
Here’s an example about numerical computations. On paper, each core of my CPU can do 64 single-precision FLOPs each cycle. In reality, to achieve that performance a program needs to spam _mm256_fmadd_ps instructions while only loading at most 1 AVX vector per FMA, and only storing at most 1 AVX vector per two FMAs.
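To make that concrete, here's roughly the shape such a kernel has to take (a sketch, not tuned code): evaluating a degree-7 polynomial per element gives 7 _mm256_fmadd_ps per one vector load and one store, which fits the ratios above; actually hitting peak would further need unrolling with independent accumulators to hide the FMA latency.

```c
/* Compile with e.g. gcc -O2 -mavx2 -mfma */
#include <immintrin.h>
#include <stddef.h>

/* y[i] = c7*x^7 + ... + c1*x + c0 via Horner's rule:
 * 7 FMAs per 8 floats, against 1 vector load and 1 vector store.
 * Assumes n is a multiple of 8 to keep the sketch short. */
void poly8(const float *x, float *y, size_t n, const float c[8]) {
    __m256 c0 = _mm256_set1_ps(c[0]), c1 = _mm256_set1_ps(c[1]);
    __m256 c2 = _mm256_set1_ps(c[2]), c3 = _mm256_set1_ps(c[3]);
    __m256 c4 = _mm256_set1_ps(c[4]), c5 = _mm256_set1_ps(c[5]);
    __m256 c6 = _mm256_set1_ps(c[6]), c7 = _mm256_set1_ps(c[7]);
    for (size_t i = 0; i < n; i += 8) {
        __m256 v   = _mm256_loadu_ps(x + i);        /* 1 load  */
        __m256 acc = _mm256_fmadd_ps(c7, v, c6);    /* 7 FMAs  */
        acc = _mm256_fmadd_ps(acc, v, c5);
        acc = _mm256_fmadd_ps(acc, v, c4);
        acc = _mm256_fmadd_ps(acc, v, c3);
        acc = _mm256_fmadd_ps(acc, v, c2);
        acc = _mm256_fmadd_ps(acc, v, c1);
        acc = _mm256_fmadd_ps(acc, v, c0);
        _mm256_storeu_ps(y + i, acc);               /* 1 store */
    }
}
```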
Artifacts are ugly, so why force them on modern hardware when GPUs are extremely fast?
For reference: I was doing a path tracer in PHP :) so yeah, that renders like ancient hardware.
(The browser requested different buckets of the image, and a PHP script rendered and returned each bucket. So it was a kind of multi-threading, but still very slow.)
> However, even using extra hardware wasn't enough in this case. So they pre-calculated lookup tables for sine, cosine, tangent etc. for every angle at the necessary precision.
Is this really the order of events? I imagine the pre-calculated route is what you'd try first, and only go for extra hardware if that failed somehow.