What business use case requires a single cluster with thousands of pods? Wouldn't having multiple clusters, each hosting a few namespaces, be a better architecture?
This. I don't work with AI training workflows, but I struggle to understand why they supposedly require launching a thousand pods per second against GPUs that fundamentally have to be spread across different bare-metal machines. Once the GPUs are on different machines, and there are 1k+ such machines, just start putting them in different Kubernetes clusters. Build a scheduling layer above the Kubernetes control planes that decides which cluster each pod lands on (rough sketch below).
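For what it's worth, that "layer above" doesn't have to be exotic. Here's a toy sketch in Python using the official kubernetes client; the kubeconfig context names and the GPU-counting placement policy are made up for illustration, not how any real multi-cluster scheduler does it:

```python
from kubernetes import client, config

# Hypothetical member clusters, identified by kubeconfig context names.
CLUSTER_CONTEXTS = ["gpu-cluster-a", "gpu-cluster-b", "gpu-cluster-c"]

def pick_cluster():
    """Toy placement policy: pick the cluster advertising the most
    allocatable GPUs. A real scheduler would subtract what's already
    requested and consider topology, quotas, preemption, etc."""
    best_ctx, best_gpus = None, -1
    for ctx in CLUSTER_CONTEXTS:
        api = client.CoreV1Api(config.new_client_from_config(context=ctx))
        gpus = sum(
            int(node.status.allocatable.get("nvidia.com/gpu", 0))
            for node in api.list_node().items
        )
        if gpus > best_gpus:
            best_ctx, best_gpus = ctx, gpus
    return best_ctx

def schedule(pod_manifest, namespace="default"):
    """Submit the pod to whichever member cluster the policy picked."""
    ctx = pick_cluster()
    api = client.CoreV1Api(config.new_client_from_config(context=ctx))
    return api.create_namespaced_pod(namespace=namespace, body=pod_manifest)
```

Each member cluster stays a boring, normally-sized cluster; only the thin placement shim has to know about all of them.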
The whole thing stinks of: AI investors are throwing money at AI companies, so those companies go to GCP and tell them to solve the problem at any price, so that they can keep scaling without ever having to build that scheduling layer above the Kubernetes control planes themselves.
Yep, it's basically saying "you should now launch thousands of pods in a single cluster, just because we said it makes sense, and please don't look at the costs, the business case, or the operational issues."