In the "test setup", it says: "a t3a.micro" and "a t4g.micro".
To me, this implies they used a single EC2 instance of each type. However, EC2 tail latencies (p99 and up) can be impacted by "noisy neighbors", especially on the burstable types, which are intentionally oversubscribed.
It's still useful to know if, for example, t4gs are more prone to noisy neighbors, but with only one instance as a data point, you simply can't tell whether it was bad luck or not.
I think this test would be much better run either on dedicated instance types only, or with a large enough n that a single unlucky, noisy-neighbor-affected instance doesn't sway the results.
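To illustrate the large-n point, here's a toy sketch (all numbers are made up; this is only about the methodology): simulate a fleet where one instance has a noisy neighbor and compare its p99 to the fleet's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 20 instances where one of them has a noisy neighbor and is ~3x
# slower. A benchmark on the single unlucky instance reports a very different
# p99 than the fleet as a whole.
n_instances, requests = 20, 50_000
base = rng.lognormal(mean=0.0, sigma=0.3, size=(n_instances, requests))
slowdown = np.ones(n_instances)
slowdown[0] = 3.0  # the unlucky instance
latencies_ms = base * slowdown[:, None]

per_instance_p99 = np.percentile(latencies_ms, 99, axis=1)
print("unlucky instance p99:", per_instance_p99[0])
print("median p99 across the fleet:", np.median(per_instance_p99))
```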
This is the first thought I had as well. The t3/t4 class is notoriously problematic when you get to reasonable loads because of their burstable nature. Still an interesting test, but it would be better on a general instance class.
Aren't 't' instances burst instances? They need to be under constant load for a long time before their burst credits for CPU, memory, network and EBS run out, after which they fall back on their baseline performance.
> It does appear that the Arm-based instances can’t consistently maintain the same performance at high request rates.
I'm unwilling to trust that statement at face value for now given it's been tested against a 't' instance.
EDIT: Removed note about network burst credits in compute and memory optimized instances. I'm not sure if these instances have that.
The T3 instances are all burstable, but they come with the "unlimited" feature toggled on by default. The T2 instances had this feature toggled off by default. So with this instance type you never really run out of credits, but you could quickly end up paying more than with other general purpose instance types, like M series, if you burn through a lot of credits.
Thanks for pointing that out, I never noticed that setting. I just tried to launch an EC2 instance looking for this setting and here's what I found.
> A credit specification is only available for T2, T3, and T3a instances. Selecting Unlimited for the credit specification allows applications to burst beyond the baseline for as long as needed at any time.
So that would mean Unlimited is not a setting available for T4g (ARM instance) and therefore may explain inconsistent behavior in the ARM instance.
> T4g instances start in unlimited mode by default, giving users the ability to sustain high CPU performance over any desired time frame while keeping cost as low as possible. For most general purpose workloads, T4g instances in unlimited mode provide ample performance without any additional charges. If the average CPU utilization of a T4g instance is lower than the baseline over a 24-hour period, the hourly instance price automatically covers all interim spikes in usage. In the cases when a T4g instance needs to run at higher CPU utilization for a prolonged period, it can do so for a small additional charge of $0.04 per vCPU-hour. In standard mode, a T4g instance can burst until it uses up all of its earned credits. For more details on T4g credits, please see the EC2 documentation page.
That note did not show up for me when launching the EC2 instance. Probably because I was switching between ARM and AMD quite a bit, which caused the info pop-ups to stop refreshing; they stayed stuck on T2 and T3 and never showed anything for T4g.
I don't recall ever seeing pop-ups for that. I've always handled that via the "advanced details" section when launching instances manually through the web console.
I've just launched a test t3 instance and it didn't pop up anything.
My bad, not pop-ups; there's an info symbol next to each option under "advanced details". I was being lazy with the terminology since I don't know what such info buttons are actually called.
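If you'd rather not hunt for it in the console, the credit specification can also be read or flipped via the API. A rough boto3 sketch (untested; the region and instance ID are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # example region

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder, not a real instance

# See whether the instance is currently in "standard" or "unlimited" mode.
current = ec2.describe_instance_credit_specifications(InstanceIds=[INSTANCE_ID])
for spec in current["InstanceCreditSpecifications"]:
    print(spec["InstanceId"], spec["CpuCredits"])

# Flip it to unlimited (applies to the burstable T2/T3/T3a/T4g families).
ec2.modify_instance_credit_specification(
    InstanceCreditSpecifications=[
        {"InstanceId": INSTANCE_ID, "CpuCredits": "unlimited"}
    ]
)
```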
How is that implemented? If a particular machine is 2x oversubscribed and everyone selects “unlimited,” I doubt AWS has special hardware that can run twice as fast by sticking a bit of money in the machine. I assume some customer workloads get migrated, which has its own performance cost.
I don't think they document how they're doing things in the background. Based on personal experience, I've been notified that EBS-backed instances have been rebooted due to hardware failure. For instances with ephemeral storage you're given some time to migrate for scheduled maintenance; for hardware failures you're out of luck. I've never been notified about scheduled maintenance on an EBS-backed instance, which leads me to think that they just migrate it without telling you.
They're using KVM under the hood, which supports things like live migration, interrupts to facilitate memory transfer, and so on. At home I'm doing live migrations with less than 300ms of interruption even under high memory pressure. It would make sense for them to move things around, just for things like maintenance and distributing load.
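AWS doesn't document the scheduling side, but the credit accounting is at least observable: CloudWatch exposes the T-instance credit metrics, so you can see when an "unlimited" instance has burned past its earned credits and started racking up surplus charges. A rough boto3 sketch (untested; region and instance ID are placeholders):

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # example region
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder

now = datetime.now(timezone.utc)

# CPUCreditBalance is the earned-credit pool; CPUSurplusCreditBalance tracks
# credits "borrowed" in unlimited mode, which is what turns into extra charges.
for metric in ("CPUCreditBalance", "CPUSurplusCreditBalance"):
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName=metric,
        Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
        StartTime=now - timedelta(hours=24),
        EndTime=now,
        Period=3600,
        Statistics=["Average"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    latest = points[-1]["Average"] if points else None
    print(metric, latest)
```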
Personal experience: We moved multiple PostgreSQL servers including a large one using 32 vCPUs to the equivalent ARM based instances, and the performance was about the same, but of course ARM instances are less expensive.
Given the title, I would have expected a price/perf comparison across multiple tiers of servers. Focusing on two random (but similar) low performance instances makes it hard to generalize.
Instead of a histogram, a violin plot (especially "split") might be an excellent way to compare the distributions. And that solves the bucket size problem too IMO.
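For example, with seaborn a split violin is a one-liner once the per-request latencies are in a dataframe. A minimal sketch with synthetic data standing in for the two instance types:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)

# Synthetic latencies standing in for the two instance types; in practice
# you'd load the per-request measurements recorded during the benchmark.
df = pd.DataFrame({
    "latency_ms": np.concatenate([
        rng.lognormal(0.0, 0.3, 10_000),    # stand-in for t3a.micro
        rng.lognormal(0.05, 0.35, 10_000),  # stand-in for t4g.micro
    ]),
    "instance": ["t3a.micro"] * 10_000 + ["t4g.micro"] * 10_000,
    "endpoint": "all",  # single category so the violin is split by instance type
})

sns.violinplot(data=df, x="endpoint", y="latency_ms", hue="instance", split=True)
plt.show()
```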
Highly recommend ARM-based instances for RDS and ElastiCache in particular. That's an easy instance-type switch and nearly idiot-proof. Switching Kubernetes cluster worker nodes is another story (though ARM-built container adoption is getting better).
I was wondering as well. I am pretty sure it is a typo and `95.8` should be `94.8`; that would fit the 0.1ms to 0.2ms difference the `99.99999` and `max` values seem to have in the article.
We leverage Arm instances in Depot [0] to power native Docker image builds for Arm, and I would say we see a lot of performance improvement with machine start, requests per instance, and overall response rate. Granted, we aren't throwing the number of requests at our instances that this test is looking at. But we are throwing multiple concurrent Docker image builds onto instances and, generally speaking, they do great.
All of that to say, I think the t3/t4 instance used in this test is a bit problematic for getting a true idea of performance.
The r6g Arm vCPUs we tried in our AWS Neptune performance testing always seemed to perform worse than the equivalent-in-price r5d.4xlarge we normally use. Unfortunately I didn't have time to really dig into what it was about our design/workload that caused the different results. I wish I could have dug deeper, especially since there are now more instance types available than when we ran our tests: x2g, r6i, and x2iedn.
Right now the web framework of choice in Rust tends to be Axum. Also, there's no data on CPU utilization, which can look different when targeting ARM.
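On the CPU-utilization point, even just sampling it on the box during the run would help put the latency numbers in context. A minimal sketch, assuming psutil is available on the instance under test:

```python
import psutil  # assumes psutil is installed on the instance under test

# Sample CPU utilization once per second for a minute during the load test so
# the latency numbers can be read alongside how busy the machine actually was.
samples = [psutil.cpu_percent(interval=1.0) for _ in range(60)]
print(f"avg {sum(samples) / len(samples):.1f}%  max {max(samples):.1f}%")
```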
You may also want to include .NET which has really good support for ARM64.
Second, t4g instances use Graviton 2, which has, relatively speaking, weaker cores. To get the best picture you would need to compare against Graviton 3 (those instances are more expensive, but you can deploy to them more densely).
I find it more impressive that an Intel/AMD vCPU (i.e. each "vCPU" is actually an SMT thread) competes with or usually beats a Graviton vCPU (which, afaik, is always a full physical Arm core).
A lot, and that's why cloud providers push them so hard. I'd bet that AWS has more margin on those Arm instances than on x86. Much cheaper to run, a discount for the client, and everybody saves money.
Random numbers:
- an Ampere Altra Max Q80-30 (80 cores) has a 210W TDP
- for the same power draw, you get an AMD EPYC 9334 which is a 32c64t monster
At 2.5x the core count, even if the Ampere cores are individually less powerful, there is still a huge gap. Those are not the CPUs used by AWS, but this still gives us a ballpark: there is a substantial performance gain at the same power target, or substantial power savings at the same performance target.
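Back-of-the-envelope on those TDP figures (ballpark only, since real-world draw and per-core performance differ):

```python
# Per-core power at the quoted TDPs; treat this strictly as a ballpark.
altra_tdp_w, altra_cores = 210, 80
epyc_tdp_w, epyc_cores = 210, 32

print(f"Altra Max Q80-30: {altra_tdp_w / altra_cores:.1f} W/core")  # ~2.6 W/core
print(f"EPYC 9334:        {epyc_tdp_w / epyc_cores:.1f} W/core")    # ~6.6 W/core
print(f"core-count ratio: {altra_cores / epyc_cores:.1f}x")         # 2.5x
```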