Nothing major, just some oddball decisions here and there.
Fused compare-and-branch extends only to the base integer instructions. Anything else must first generate a value that feeds into a compare-and-branch. And since all branches are compare-and-branch, they all need two register operands, which limits their reach to a mere +/- 4 kB.
The reach for position-independent code sequences (AUIPC + any load or store) is not quite +/- 2 GB. There is a hole on either end of the reach that is a consequence of pairing a sign-extended 12-bit offset for loads and stores with a sign-extended high 20-bit offset for AUIPC. ARM's adrp (address of page) + unsigned offsets is more uniform.
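The asymmetry comes from the %pcrel_hi/%pcrel_lo split that every toolchain performs; a minimal sketch in C:

    #include <stdint.h>

    /* Split a pc-relative delta so that hi + lo == delta, where hi is
     * the AUIPC immediate (a multiple of 0x1000) and lo is the
     * sign-extended 12-bit load/store immediate. The +0x800 rounds to
     * the nearest page because lo can be negative. Deltas in
     * [0x7FFFF800, 0x7FFFFFFF] have no valid split on RV64 -- that's
     * the hole at the top of the "+/- 2 GB" reach. */
    void pcrel_split(int32_t delta, int32_t *hi, int32_t *lo) {
        *hi = (int32_t)(((uint32_t)delta + 0x800u) & ~0xFFFu);
        *lo = delta - *hi; /* lands in [-0x800, 0x7FF] */
    }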
RV32 isn't a proper subset of RV64, which isn't a proper subset of RV128. If they were proper subsets, then RV64 programs could run unmodified on RV128 hardware. Not that it's ever going to happen, but if it did, the processor would have to mode-switch, not unlike the x86-64 transition of yore.
Floating point arithmetic spends three bits in the instruction encoding to support static rounding modes. I can count on zero hands the number of times I've needed that.
The integer ISA design goes to great lengths to avoid any instructions with three source operands, in order to simplify the datapaths on tiny machines. But... the floating point extension correctly includes fused multiply-add. So big chunks of any high-end processor will need three-operand datapaths anyway.
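For reference, C99 spells that operation fma(), and on hardware with the D extension it lowers to a single fmadd.d:

    #include <math.h>

    /* Three source operands, one rounding: a*b + c, fused. This is
     * exactly the three-operand datapath the integer ISA avoids. */
    double mac(double a, double b, double c) {
        return fma(a, b, c);
    }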
The base ISA is entirely too basic, a classic failure of 90% design: just because most code doesn't need all those other instructions doesn't mean that most systems don't. RISC-V is gathering extensions like a Katamari to fill in all those holes (B, Zfa, etc.).
None of those things make it bad, I just don't think it's nearly as shiny as the hype. ARM64+SVE and x86-64+AVX512 are just better.
> Floating point arithmetic spends three bits in the instruction encoding to support static rounding modes.
IMO this is way better than the alternative in x86 and ARM. The reason no one deals with rounding modes is that changing the mode is really slow, and you always need to change it back or else everything breaks. Encoding the mode in the instruction lets you do operations with non-standard modes much more simply. For example, round-to-nearest-ties-to-odd can be incredibly useful to prevent double rounding.
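For contrast, here's the dynamic-rounding dance C makes you do today (standard <fenv.h>; note that ties-to-odd isn't even an option):

    #include <fenv.h>
    #pragma STDC FENV_ACCESS ON

    /* Round one division toward zero, then restore the global mode.
     * Each fesetround() serializes the FP pipeline on most cores; a
     * static rounding mode in the encoding deletes both calls. */
    double div_toward_zero(double a, double b) {
        int old = fegetround();
        fesetround(FE_TOWARDZERO);
        double q = a / b;
        fesetround(old);
        return q;
    }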
You can't express static rounding modes in C. You can't even express them in LLVM's language-independent IR. Any attempt to use the static rounding modes will necessarily involve intrinsics and/or assembly.
There's a chicken-and-egg problem here. C can't express this, and there's no LLVM IR for it because, up till now, everyone has had global registers for configuring FPUs, which makes them slow and useless.
C probably won't support this for decades (ISO tends to be pretty conservative), but other languages (e.g. Rust or Julia) might support this soonish (especially if LLVM adds support).
IMO this is very wrong. The base ISA is excellent for microcontrollers and teaching, and the ~90% of real implementations that need more can add the extra 20 or so extensions to make a modern, fully featured CPU.
It's not complete nonsense, but it conveniently leaves out key relevant details and includes key pieces of misinformation needed to make the talking points hang together (as you would expect from rage bait).
> Early on, SLS designers made the catastrophic decision to reuse Shuttle hardware
The law that mandated that NASA build the SLS also required that it re-use that hardware. This wasn't a choice made by NASA designers but by a bipartisan Congress, and it was designed not so much to advance our space program as to keep funneling money to space contractors after the end of the Shuttle program.
Any article that proposes to discuss the "lunacy" of Artemis without ever mentioning Congress's role in that lunacy is pretty clearly rage bait.
ISPC suffers from poor scatter and gather support in hardware. The direct result is that it is hard to make programs that scale in complexity without resorting to shenanigans.
An ideal gather-load or scatter-store instruction should take time proportional to the number of cache lines that it touches. If all of the lane accesses are sequential and cache-line aligned, it should take the same amount of time as an aligned vector load or store. If the accesses have high cache locality such that only two cache lines are touched, it should cost exactly the same as loading those two cache lines and shuffling the results into place. That isn't what we have on x86-AVX512: the instructions are microcoded with inefficient lane-at-a-time implementations. If you know that there is good locality of reference in the access, it can be faster to hand-code your own cache-line-at-a-time load/shuffle/masked-merge loop than to rely on the hardware. This makes me sad.
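As a sketch of what that hand-coded loop looks like (assuming a 64-byte-aligned base pointer, in-bounds indices, and GCC/Clang for __builtin_ctz):

    #include <immintrin.h>
    #include <stdint.h>

    /* Hedged sketch, not production code: emulate a 16-lane gather by
     * touching each distinct 64-byte cache line exactly once. */
    static __m512 gather16_by_lines(const float *base, const int32_t idx[16]) {
        __m512i vidx = _mm512_loadu_si512(idx);
        __m512i line = _mm512_srli_epi32(vidx, 4); /* 16 floats per line */
        __m512i offs = _mm512_and_si512(vidx, _mm512_set1_epi32(15));
        __m512 out = _mm512_setzero_ps();
        __mmask16 todo = 0xFFFF;
        while (todo) {
            int lane = __builtin_ctz(todo); /* pick an unserved lane */
            /* all lanes living in the same cache line as this one */
            __mmask16 same = _mm512_mask_cmpeq_epi32_mask(
                todo, line, _mm512_set1_epi32(idx[lane] >> 4));
            /* one aligned line load, then route elements to their lanes */
            __m512 ld = _mm512_load_ps(base + ((size_t)(idx[lane] >> 4) << 4));
            out = _mm512_mask_permutexvar_ps(out, same, offs, ld);
            todo = (__mmask16)(todo & ~same);
        }
        return out;
    }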
ISPC's varying variables have no way to declare that they are sequential among all lanes. Therefore, without extensive inlining to expose the caller's access pattern, it issues scatters and gathers at the drop of a hat. You might like to write your program with a naive x[y] (x a uniform pointer, y a varying index) in a subroutine, but ISPC's language cannot infer that y is sequential along lanes. So, you have to carefully re-code it to say that y is actually a uniform offset into the array, and write x[y + programIndex]. Error-prone, yet utterly essential for decent performance. I resorted to munging my naming conventions for such indexes, not unlike the Hungarian notation of yesteryear.
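Concretely, the difference looks like this (a minimal ISPC sketch; the function names and `data` are hypothetical):

    // Naive: y is varying, so the compiler must assume arbitrary
    // lanes and emits a hardware gather.
    float load_gather(uniform float * uniform data, varying int y) {
        return data[y];
    }

    // Recoded: a uniform base plus programIndex is provably
    // unit-stride, so this compiles to one contiguous vector load.
    float load_unit_stride(uniform float * uniform data, uniform int y) {
        return data[y + programIndex];
    }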
Rewriting critical data structures in SoA format instead of AoS format is non-trivial, and it is a prerequisite to getting decent performance from ISPC. You cannot "just" replace some subroutines with ISPC routines; you need to make major refactorings that touch the rest of the program as well. This is neutral in an ISPC-versus-intrinsics (or even ISPC-versus-GPU) shootout, but it is worth mentioning to point out that ISPC is not a silver bullet in this regard, either.
Non-minor nit: The ISPC math library gives up far too much precision by default in the name of speed. Fortunately, Sleef is not terribly difficult to integrate and use for the 1-ulp max rounding error that I've come to expect from a competent libm.
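The shim is small. I'm assuming Sleef's usual <name><lanes>_u<ulp> naming for the AVX-512 flavor here; verify the exact symbol against your sleef.h:

    #include <immintrin.h>
    #include <sleef.h>

    /* 1.0-ULP sine over 16 floats, in place of the low-precision
     * default. Assumed symbol name; check your sleef.h. */
    static __m512 sin16(__m512 x) { return Sleef_sinf16_u10(x); }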
Another: The ISPC calling convention adheres rather strictly to the C calling convention... which doesn't provide any callee-saved vector registers, not even for the execution mask. So if you like to decompose your program across multiple compilation units, you will also notice much more register save and restore traffic than you would like or expect.
I want to like it, I can get some work done in it, and I did get significant performance improvements over scalar code when using it. But the resulting source code and object code are not great. They are merely acceptable.
> introduced by PWM dimming, but why would that be a low enough frequency to bother people?
The human fovea has a much lower effective refresh rate than your peripheral vision. So you might notice the flickering of tail lights (and daytime running lights) seen out of the corner of your eye even though you can't notice when looking directly at them.
For sure. I summarized my particular sensitivity too aggressively earlier, but I tend to see flicker straight on up to 65 Hz and peripherally up to 120 Hz if it's particularly egregious (i.e., long valleys) but usually up to something less. In any case, I've never noticed car tail lights flickering even peripherally, despite video artifacts revealing that they do flicker.
clang-format and clang-tidy are both excellent for C and C++ (and protobuf, if your group uses it). Since they are based on the clang front-end, they naturally have full support for both languages and all of their complexity.
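Adoption is about as cheap as it gets: one file at the repository root. A minimal .clang-format, as a sketch (option names are from the official style-options docs; the values are just examples):

    # .clang-format
    BasedOnStyle: LLVM     # or Google, Chromium, Mozilla, WebKit
    IndentWidth: 4
    ColumnLimit: 100

Then `clang-format -i file.cc` reformats in place, and clang-tidy picks up its checks from a .clang-tidy file in the same way.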
LUTs are commonly used in geodesy applications on or near the Earth's surface. The full multipole model is used for orbital applications to account for the way that local lumpiness in Earth's mass distribution is smoothed out with increasing distance from the surface. It might be reasonable to build a 3D LUT for use at Starlink scale or higher, but certainly not for individual satellites.
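The smoothing falls straight out of the spherical-harmonic form of the geopotential, where every degree-n term is attenuated by (R/r)^n:

    U(r, \varphi, \lambda) = \frac{GM}{r}\left[1 + \sum_{n=2}^{N}\sum_{m=0}^{n}\left(\frac{R}{r}\right)^{n}\bar{P}_{nm}(\sin\varphi)\left(\bar{C}_{nm}\cos m\lambda + \bar{S}_{nm}\sin m\lambda\right)\right]

At Starlink altitude (~550 km), R/r is about 0.92, so a degree-100 term is already suppressed by roughly 0.92^100, or about 3e-4, relative to its strength at the surface.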
Exactly what order and degree were you using to evaluate the model? For objects in LEO, variations in drag and solar pressure are more significant than the uncertainty in the gravity field somewhere well below 127th order (40 microseconds to evaluate on my machine; your mileage may vary), so you can safely truncate the model for simulations. GRACE worked by making many passes so that those perturbations could be averaged out to make the measurement. But for practical applications, those tiny terms are irrelevant.