If AMD fixes or open sources their proprietary firmware blob[0]. Geohot streamed all weekend on Twitch, reverse engineering the AMD firmware. It was quite entertaining to learn how that low-level hardware firmware works[1], and to hear his rants about AMD, of course.
Geohot is wrangling with unsupported consumer hardware.
The datacenter stuff is on a different architecture and driver stack.
The number one supercomputer on the TOP500 list (Frontier at ORNL) is based on AMD GPUs, and AMD is probably more invested in supporting that.
I work with Frontier and ORNL/OLCF. They have had and continue to have issues with AMD/ROCm but yes, they do of course get excellent support from AMD. The entire team at OLCF is incredible as well (obviously) and they do amazing work.
Frontier certainly has some unique quirks but the documentation is online[0] and most of these quirks are inherent to the kinds of fundamental issues you'll see on any system in the space (SLURM, etc).
However, most of the issues are fundamentally ROCm issues, and you'll run into them on any MIxxx anywhere. I run into them frequently on both supported and unsupported hardware, from consumer gear all the way up.
I mean, that's kinda Nvidia's whole shtick: anyone can play around synthesizing cat pictures on their gaming GPU, and if they make a breakthrough, the same software will transfer to X-million-dollar supercomputers.
Subscriber-only videos, so nobody can confirm that he did that, or archive whatever valuable information he released. At least not without paying some money in the next 7-14 days before they're deleted.
Geohot doesn't know what he's talking about and I'm kinda ashamed to see this lazy thinking leak onto HN. There was an article a couple weeks back on AMD open sourcing drivers in the Linux kernel tree that you should look into.
Firmware crashes => days-long "open source it and I'll fix it. No? Why does AMD hate its customers?"
I've got an appointment and have exactly one minute till I have to leave, apologies for brevity: they can't open source the full driver because then they'd have to release HDMI spec stuff that the consortium says they can't. (I don't support any of that; my only intent is to communicate that George isn't really locked in here when he starts casting aspersions or claiming AMD doesn't care.)
And AMD has ROCm. PyTorch is standard and PyTorch has ROCm support. And the Google TPU v5 also has PyTorch support.
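For anyone who wants to check which backend their own PyTorch install is using, here is a minimal sketch. It assumes a stock PyTorch build, where ROCm builds reuse the `torch.cuda` namespace and set `torch.version.hip` to a version string, while CUDA builds leave it as `None`. The helper names `classify_backend` and `detect_backend` are made up for illustration and are not part of PyTorch.

```python
import importlib.util
from typing import Optional


def classify_backend(gpu_available: bool, hip_version: Optional[str]) -> str:
    """Map PyTorch's reported state to a backend label (hypothetical helper)."""
    if not gpu_available:
        return "cpu"
    # ROCm builds expose a HIP version string; CUDA builds report None.
    return "rocm" if hip_version else "cuda"


def detect_backend() -> str:
    """Inspect the local PyTorch install, if there is one (hypothetical helper)."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch

    return classify_backend(
        torch.cuda.is_available(),
        getattr(torch.version, "hip", None),
    )


if __name__ == "__main__":
    print(detect_backend())
```

On a ROCm build the usual `torch.device("cuda")` code path is the one that targets the AMD GPU, so most model code does not need to change. Whether it then runs without hitting the ROCm issues described upthread is a separate question.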
We do have a couple of H100s, but I'd love to replace them with AMD GPUs.