In the "test setup", it says: "a t3a.micro" and "a t4g.micro".
To me, this implies they used a single EC2 instance of each type. However, EC2 tail latencies (p99 and up) can be impacted by "noisy neighbors", especially on the burstable types, which are intentionally oversubscribed.
It's still useful to know if, for example, t4gs are more prone to noisy neighbors, but with only one instance as a data point, you simply can't tell whether it was bad luck or not.
I think this test would be much better run either on dedicated instance types only, or with a large enough n that a single unlucky, noisy-neighbor-affected instance doesn't sway the results.
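To illustrate the large-n point, here's a toy sketch (all numbers are made up; this is only about the methodology): simulate a fleet where one instance has a noisy neighbor and compare its p99 to the fleet's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 20 instances where one of them has a noisy neighbor and is ~3x
# slower. A benchmark on the single unlucky instance reports a very different
# p99 than the fleet as a whole.
n_instances, requests = 20, 50_000
base = rng.lognormal(mean=0.0, sigma=0.3, size=(n_instances, requests))
slowdown = np.ones(n_instances)
slowdown[0] = 3.0  # the unlucky instance
latencies_ms = base * slowdown[:, None]

per_instance_p99 = np.percentile(latencies_ms, 99, axis=1)
print("unlucky instance p99:", per_instance_p99[0])
print("median p99 across the fleet:", np.median(per_instance_p99))
```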
This is the first thought I had as well. The t3/t4 class is notoriously problematic when you get to reasonable loads because of their burstable nature. Still an interesting test, but it would be better on a general instance class.
Aren't 't' instances burst instances? They need to be under constant load for a long time before their burst credits for CPU, memory, network and EBS run out, after which they fall back on their baseline performance.
> It does appear that the Arm-based instances can’t consistently maintain the same performance at high request rates.
I'm unwilling to trust that statement at face value for now given it's been tested against a 't' instance.
EDIT: Removed note about network burst credits in compute and memory optimized instances. I'm not sure if these instances have that.
The T3 instances are all burstable, but they come with the "unlimited" feature toggled on by default. The T2 instances had this feature toggled off by default. So with this instance type you never really run out of credits, but you could quickly end up paying more than with other general purpose instance types, like M series, if you burn through a lot of credits.
Thanks for pointing that out, I never noticed that setting. I just tried to launch an EC2 instance looking for this setting and here's what I found.
> A credit specification is only available for T2, T3, and T3a instances. Selecting Unlimited for the credit specification allows applications to burst beyond the baseline for as long as needed at any time.
So that would mean Unlimited is not a setting available for T4g (ARM instance) and therefore may explain inconsistent behavior in the ARM instance.
> T4g instances start in unlimited mode by default, giving users the ability to sustain high CPU performance over any desired time frame while keeping cost as low as possible. For most general purpose workloads, T4g instances in unlimited mode provide ample performance without any additional charges. If the average CPU utilization of a T4g instance is lower than the baseline over a 24-hour period, the hourly instance price automatically covers all interim spikes in usage. In the cases when a T4g instance needs to run at higher CPU utilization for a prolonged period, it can do so for a small additional charge of $0.04 per vCPU-hour. In standard mode, a T4g instance can burst until it uses up all of its earned credits. For more details on T4g credits, please see the EC2 documentation page.
That note did not show up for me when launching the EC2 instance. Probably because I was switching between ARM and AMD quite a bit, which caused the info pop-ups to stop refreshing; they stayed stuck on T2 and T3 and never showed anything for T4g.
I don't recall ever seeing pop-ups for that. I've always handled that via the "advanced details" section when launching instances manually through the web console.
I've just launched a test t3 instance and it didn't pop up anything.
My bad, not pop-ups; there's an info symbol next to each option under "advanced details". I was being lazy with the terminology since I don't know what such info buttons are actually called.
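If you'd rather not hunt for it in the console, the credit specification can also be read or flipped via the API. A rough boto3 sketch (untested; the region and instance ID are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # example region

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder, not a real instance

# See whether the instance is currently in "standard" or "unlimited" mode.
current = ec2.describe_instance_credit_specifications(InstanceIds=[INSTANCE_ID])
for spec in current["InstanceCreditSpecifications"]:
    print(spec["InstanceId"], spec["CpuCredits"])

# Flip it to unlimited (applies to the burstable T2/T3/T3a/T4g families).
ec2.modify_instance_credit_specification(
    InstanceCreditSpecifications=[
        {"InstanceId": INSTANCE_ID, "CpuCredits": "unlimited"}
    ]
)
```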
How is that implemented? If a particular machine is 2x oversubscribed and everyone selects “unlimited,” I doubt AWS has special hardware that can run twice as fast by sticking a bit of money in the machine. I assume some customer workloads get migrated, which has its own performance cost.
I don't think they document how they're doing things in the background. Based on personal experience, I've been notified that EBS-backed instances have been rebooted due to hardware failure. For instances with ephemeral storage you're given some time to migrate for scheduled maintenance; for hardware failures you're out of luck. I've never been notified about scheduled maintenance on an EBS-backed instance, which leads me to think that they just migrate it without telling you.
They're using KVM under the hood, which supports things like live migration, interrupts to facilitate memory transfer, and so on. At home I'm doing live migrations with less than 300ms of interruption even under high memory pressure. It would make sense for them to move things around, just for things like maintenance and distributing load.
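AWS doesn't document the scheduling side, but the credit accounting is at least observable: CloudWatch exposes the T-instance credit metrics, so you can see when an "unlimited" instance has burned past its earned credits and started racking up surplus charges. A rough boto3 sketch (untested; region and instance ID are placeholders):

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # example region
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder

now = datetime.now(timezone.utc)

# CPUCreditBalance is the earned-credit pool; CPUSurplusCreditBalance tracks
# credits "borrowed" in unlimited mode, which is what turns into extra charges.
for metric in ("CPUCreditBalance", "CPUSurplusCreditBalance"):
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName=metric,
        Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
        StartTime=now - timedelta(hours=24),
        EndTime=now,
        Period=3600,
        Statistics=["Average"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    latest = points[-1]["Average"] if points else None
    print(metric, latest)
```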
Personal experience: We moved multiple PostgreSQL servers including a large one using 32 vCPUs to the equivalent ARM based instances, and the performance was about the same, but of course ARM instances are less expensive.
Given the title, I would have expected a price/perf comparison across multiple tiers of servers. Focusing on two random (but similar) low performance instances makes it hard to generalize.
Instead of a histogram, a violin plot (especially "split") might be an excellent way to compare the distributions. And that solves the bucket size problem too IMO.
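For example, with seaborn a split violin is a one-liner once the per-request latencies are in a dataframe. A minimal sketch with synthetic data standing in for the two instance types:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)

# Synthetic latencies standing in for the two instance types; in practice
# you'd load the per-request measurements recorded during the benchmark.
df = pd.DataFrame({
    "latency_ms": np.concatenate([
        rng.lognormal(0.0, 0.3, 10_000),    # stand-in for t3a.micro
        rng.lognormal(0.05, 0.35, 10_000),  # stand-in for t4g.micro
    ]),
    "instance": ["t3a.micro"] * 10_000 + ["t4g.micro"] * 10_000,
    "endpoint": "all",  # single category so the violin is split by instance type
})

sns.violinplot(data=df, x="endpoint", y="latency_ms", hue="instance", split=True)
plt.show()
```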
Highly recommend ARM-based instances for RDS and ElastiCache in particular. That's an easy instance-type switch and nearly idiot-proof. Switching Kubernetes cluster worker nodes is another story (though ARM-built container adoption is getting better).
I was wondering as well. I am pretty sure it is a typo and `95.8` should be `94.8`; that would fit the 0.1ms to 0.2ms difference the `99.99999` and `max` values seem to have in the article.
We leverage Arm instances in Depot [0] to power native Docker image builds for Arm, and I would say we see a lot of performance improvement with machine start, requests per instance, and overall response rate. Granted, we aren't throwing the number of requests at our instances that this test is looking at. But we are throwing multiple concurrent Docker image builds onto instances and, generally speaking, they do great.
All of that to say, I think the t3/t4 instance used in this test is a bit problematic for getting a true idea of performance.
The r6g Arm vCPUs we tried in our AWS Neptune performance testing always seemed to perform worse than the equivalent-in-price r5d.4xlarge we normally use. Unfortunately I didn't have time to really dig into what it was about our design/workload that caused the different results. I wish I could have dug deeper, especially since there are now more instance types available than when we ran our tests: x2g, r6i, and x2iedn.
Right now the web framework of choice in Rust tends to be Axum. Also, there's no data on CPU utilization, which can look different when targeting ARM.
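On the CPU-utilization point, even just sampling it on the box during the run would help put the latency numbers in context. A minimal sketch, assuming psutil is available on the instance under test:

```python
import psutil  # assumes psutil is installed on the instance under test

# Sample CPU utilization once per second for a minute during the load test so
# the latency numbers can be read alongside how busy the machine actually was.
samples = [psutil.cpu_percent(interval=1.0) for _ in range(60)]
print(f"avg {sum(samples) / len(samples):.1f}%  max {max(samples):.1f}%")
```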
You may also want to include .NET which has really good support for ARM64.
Second, t4g instances use Graviton 2, which has, relatively speaking, weaker cores. To get the best picture you would need to compare against Graviton 3 (those instances are more expensive, but you can deploy to them more densely).
I find it more impressive that an Intel/AMD vCPU (i.e. each "vCPU" is actually an SMT thread) competes with or usually beats a Graviton vCPU (which, afaik, is always a full physical Arm core).
A lot, and that's why cloud providers push them so hard. I'd bet that AWS has more margin on those Arm instances than on x86. Much cheaper to run, a discount for the client, and everybody saves money.
Random numbers:
- an Ampere Altra Max Q80-30 (80 cores) has a 210W TDP
- for the same power draw, you get an AMD EPYC 9334 which is a 32c64t monster
At 2.5x the core count, even if the Ampere cores are individually less powerful, there is still a huge gap. Those are not the CPUs used by AWS, but this still gives us a ballpark: there is a substantial performance gain at the same power target, or substantial power savings at the same performance target.
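Back-of-the-envelope on those TDP figures (ballpark only, since real-world draw and per-core performance differ):

```python
# Per-core power at the quoted TDPs; treat this strictly as a ballpark.
altra_tdp_w, altra_cores = 210, 80
epyc_tdp_w, epyc_cores = 210, 32

print(f"Altra Max Q80-30: {altra_tdp_w / altra_cores:.1f} W/core")  # ~2.6 W/core
print(f"EPYC 9334:        {epyc_tdp_w / epyc_cores:.1f} W/core")    # ~6.6 W/core
print(f"core-count ratio: {altra_cores / epyc_cores:.1f}x")         # 2.5x
```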