
> I can only assume all these ideas have been widely implemented at all significant labs for years, right?

Nope.

I was also surprised, when joining one of these significant labs, at how much relatively low-hanging fruit was still available to work on. But the reality is that there is just too much work to do, each task seemingly super important, and not enough people to do it.



This is what killed IBM PowerPC in the ML market. IBM tried to get in with a faster CPU with NVLink embedded, hoping that would win market share. But what won wasn't a faster machine or a better architecture. A platform with more developers, fewer bugs, and familiarity to everyone wins almost all the time. ML/AI developers are less rare today, but still rare.


I'm very willing to believe that. When I hear that they just don't have enough staff for it, I get the impression that they set their hiring bar for engineers too high. Optimising CUDA is quite different from having experience training LLMs.


> they set their hiring bar for engineers too high

Not sure I agree. If you look at the headcount growth of companies like OpenAI, Anthropic, etc., it is super fast. It's already pretty hard to keep everything working smoothly at that rate of employee growth, so going even faster seems very risky.

Ultimately I think it's mostly caused by the field still being so new. Everything still needs to be optimized, and there just aren't that many very good CUDA programmers to start with. Then you need to find one who also has deep knowledge of ML and transformer architectures, which further drains the pool. And when you do find one of them, there are 50 different things they could be working on instead of what's in the article, all equally or more impactful. The architectures are also constantly evolving, which makes it hard to justify (and not a great ROI) going super deep on single-digit-% optimizations when new stuff comes out all the time that can be made an order of magnitude faster.

A good example of that is FlashAttention: it is maybe the most significant/impactful optimization in ML of the last few years. The tl;dr is: how do you fuse the entire attention pipeline together to make it much faster and avoid materializing massive tensors? The bottleneck was obvious to anyone who profiled a Transformer-based model, but there was no obvious solution because of how softmax works. Yet the paper that ultimately unblocked this was published back in 2019 [1], and it took 3 years for a team to connect the dots. Most people in pure ML engineering didn't know about the paper and don't have good enough CUDA knowledge / GPU architecture understanding; most people with good CUDA knowledge don't understand ML well enough; and even the authors of that 2019 paper said "[we] hypothesize that this reduction in memory accesses should improve Softmax performance on actual hardware", but didn't have the technical skills to test this or to see how it could be part of a bigger breakthrough, because that requires understanding core concepts of how GPUs work and the compute/memory imbalance. (A rough sketch of the streaming-softmax trick is below.)

[1]: https://arxiv.org/pdf/1805.02867
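To make the softmax point concrete, here is a minimal sketch of the online-softmax normalizer idea from that paper: the running max and the normalizer are carried in a single streaming pass instead of requiring the whole score vector up front. This is plain Python/NumPy with illustrative names, not the actual FlashAttention kernel; a fused kernel folds the final rescaling into the accumulation over the value vectors, whereas the sketch keeps a simple second pass for clarity.

  import numpy as np

  def online_softmax(scores):
      # Numerically stable softmax whose max and normalizer are computed in one
      # streaming pass over the scores (the core trick behind fused attention).
      running_max = float("-inf")  # largest score seen so far
      running_sum = 0.0            # sum of exp(score - running_max) seen so far
      for x in scores:
          new_max = max(running_max, x)
          # Rescale the old partial sum to the new max, then add the new term.
          running_sum = running_sum * np.exp(running_max - new_max) + np.exp(x - new_max)
          running_max = new_max
      # A fused kernel would fold this normalization into the weighted sum over
      # the value vectors; here it is a plain second pass for clarity.
      return np.exp(scores - running_max) / running_sum

  scores = np.random.randn(16)
  reference = np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()
  assert np.allclose(online_softmax(scores), reference)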


> I get the impression that they set their hiring bar for engineers too high.

whenever anyone says this they should be required to disclose 1) whether they've actually been employed to do this work and 2) how many LC rounds they've failed during their last job search ..... lol


> they set their hiring bar for engineers too high

You chase away your top engineers when you glom up the system with dumbfucks.



