On modern AWS instance types, so much is offloaded to dedicated hardware that the only shared (noisy) components between VMs are memory bandwidth and the higher levels of CPU cache (and I think Graviton doesn't even share CPU cache now).
I would suspect your performance difference is most likely showing that on metal the whole box runs the same software, so you aren't polluting caches the way a VM neighbour running unrelated software would.
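If you want to sanity-check the bandwidth side of that, a rough STREAM-style triad run on both the metal and the virtualized instance would show whether sustained memory bandwidth differs. Minimal sketch only; the array size, repeat count, and OpenMP threading are arbitrary choices on my part, nothing AWS-specific:

    /* Rough STREAM-style triad: run the same binary on the metal and the
       virtualized instance and compare GB/s.  Array size and repeat count
       are arbitrary; build with e.g. gcc -O2 -fopenmp triad.c -o triad
       (drop -fopenmp for a single-threaded run). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N    (64UL * 1024 * 1024)   /* 64M doubles = 512 MiB per array */
    #define REPS 10

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) { perror("malloc"); return 1; }

        /* Touch all pages before timing. */
        for (size_t i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double best = 1e30;
        for (int r = 0; r < REPS; r++) {
            double t0 = now_sec();
            #pragma omp parallel for
            for (size_t i = 0; i < N; i++)
                a[i] = b[i] + 3.0 * c[i];   /* 2 loads + 1 store per element */
            double dt = now_sec() - t0;
            if (dt < best) best = dt;
        }

        double gbytes = 3.0 * N * sizeof(double) / 1e9;   /* bytes moved per pass */
        printf("best pass %.3f s -> %.1f GB/s (a[0]=%g)\n", best, gbytes / best, a[0]);

        free(a); free(b); free(c);
        return 0;
    }

If the metal box sustains noticeably more GB/s on the same hardware size, the memory side is a good place to keep digging.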
The virtual instance is the exact same size as the metal one, which covers an entire physical machine, so I'd guess this is pure virtualization overhead rather than noisy neighbors.
That's a surprisingly large overhead. I've not measured that large an impact on AMD, particularly for compute-heavy workloads.
Did you profile at all? And have you checked whether it's actually compute-bound? If it's memory- or IO-bound, the gap can come from other virtualization overheads, such as memory encryption.
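A quick way to separate "compute got slower" from "memory got slower" without a full profile is to time the same arithmetic once over a buffer that stays in L1 cache and once streamed through a large array, on both the metal and the virtualized box. Rough sketch; the buffer sizes and constants are arbitrary assumptions on my part:

    /* Same total multiply-adds, once cache-resident and once streaming.
       If the virtualized instance only loses ground on the streaming case,
       the overhead is probably on the memory side (nested paging, memory
       encryption) rather than raw CPU throughput.
       Build: gcc -O2 kernels.c -o kernels */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define SMALL (4UL * 1024)            /* 32 KiB: stays in L1 */
    #define LARGE (64UL * 1024 * 1024)    /* 512 MiB: far beyond the LLC */

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static double kernel(double *x, size_t n, long passes) {
        double t0 = now_sec();
        for (long p = 0; p < passes; p++)
            for (size_t i = 0; i < n; i++)
                x[i] = x[i] * 1.0000001 + 0.5;   /* one multiply-add per element */
        return now_sec() - t0;
    }

    int main(void) {
        double *small = malloc(SMALL * sizeof *small);
        double *large = malloc(LARGE * sizeof *large);
        if (!small || !large) { perror("malloc"); return 1; }

        /* Touch all pages up front so page faults aren't in the timed region. */
        for (size_t i = 0; i < SMALL; i++) small[i] = 1.0;
        for (size_t i = 0; i < LARGE; i++) large[i] = 1.0;

        /* Same number of multiply-adds in both runs. */
        double t_cache  = kernel(small, SMALL, LARGE / SMALL);
        double t_stream = kernel(large, LARGE, 1);

        printf("cache-resident:   %.3f s\n", t_cache);
        printf("memory-streaming: %.3f s\n", t_stream);
        printf("stream/cache ratio: %.2f (checksums %g %g)\n",
               t_stream / t_cache, small[0], large[0]);

        free(small); free(large);
        return 0;
    }

If the cache-resident time is basically identical on metal and VM but the streaming time isn't, that points away from the cores themselves.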
We've been running some compute-heavy workloads on AWS, some on metal instances and some on virtualized instances of equal size.
Both were Intel 192-core machines.
The virtualized instances tended to perform 20-25% worse in terms of CPU throughput, which is quite significant and more than I'd have expected.
Where does the performance go? Is this an AWS thing, is it lost somewhere in the software stack, or is it a CPU-level issue?
I haven't tried other vendors, tbh, but would it be possible to mitigate this by switching to another architecture/vendor like AMD or Graviton?