
You are entirely mistaken. The TPU and GPU are organized very differently, particularly in how the memory subsystem works.

In the big picture, TPUs are systolic arrays. They don't have threading, divergence, or similar.

GPUs in the big picture are SIMT, a hybrid of SIMD and multithreading where individual data streams in SIMD are relaxed to allow them to diverge somewhat.
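To make the SIMT point concrete, here's a toy Python simulation (not real GPU code, names are my own) of how a branch is handled across lanes with an active mask: every lane runs both sides of the branch, and the mask picks which result each lane keeps. That's the cost of divergence the comment is alluding to.

```python
def simt_step(data, predicate, then_fn, else_fn):
    """Toy SIMT branch: one instruction stream, many lanes.

    All lanes execute BOTH paths; a per-lane mask selects which
    result each lane keeps. Divergent lanes therefore pay for the
    sum of both paths, which is why divergence hurts throughput."""
    mask = [predicate(x) for x in data]
    taken = [then_fn(x) for x in data]       # executed for every lane
    not_taken = [else_fn(x) for x in data]   # also executed for every lane
    return [t if m else n for m, t, n in zip(mask, taken, not_taken)]

lanes = [1, -2, 3, -4]
result = simt_step(lanes, lambda x: x > 0, lambda x: x * 10, lambda x: 0)
# Positive lanes take the 'then' path, negative lanes the 'else' path.
```

A systolic array has no analogue of this: there is no per-element control flow to diverge in the first place.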

Memory-wise, the TPU can keep the partial products right in the array. Parameters and weights are held in large on-die scratch memories, backed by streams coming from the HBM. The TPU acts as a single giant CISC coprocessor and has much more predictable memory and communication patterns than a GPU, which its design exploits for higher efficiency at inference.
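A rough sketch of "partial products stay in the array": in an output-stationary systolic matmul, each processing element (PE) holds its own accumulator in place while operands stream through on a skewed schedule. This is an illustrative toy only (real TPU MXUs are weight-stationary and far more elaborate; the function and schedule here are my own construction):

```python
def systolic_matmul(A, B):
    """Toy output-stationary systolic matmul on an n x m grid of PEs.

    Rows of A enter from the left edge and columns of B from the top,
    skewed in time so that at step t, PE (i, j) sees the operand pair
    indexed by s = t - i - j, multiplies them, and adds the product to
    its LOCAL accumulator. Partial sums never leave the array until
    the result is drained at the end."""
    n, k, m = len(A), len(A[0]), len(B[0])
    acc = [[0] * m for _ in range(n)]       # one accumulator per PE
    for t in range(n + m + k - 2):          # wavefront sweeps the grid
        for i in range(n):
            for j in range(m):
                s = t - i - j               # which dot-product term arrives now
                if 0 <= s < k:
                    acc[i][j] += A[i][s] * B[s][j]
    return acc
```

Note that every memory access here is a fixed function of (i, j, t); nothing is data-dependent. That's the "much more predictable memory and communication patterns" part.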

So even if they use the word "tensor" and both have HBM based memory systems, how those are actually architected is very different.


