TLDR: they saved 400 to 1000 cores by switching off the switching off of the GC ...

stmw · on May 28, 2018

Impressive and interesting, but it is really a great example of how in larger production systems, Garbage Collection turns into "manual memory management" akin to that required to manage malloc() and free() correctly -- little performance traps, ever-changing library behaviors, tuning parameters, etc. (This is not specific to Ruby, the JVM has the same kind of "here is a whole blog post about person-weeks or person-months that went into getting the GC to behave better).

tetha · on May 28, 2018

The thing is, the more I work with network based reactive systems, the more a GC seems like an inefficient hack in an application, arising from the fact that we had nothing better at the time it was invented and grew big.

I mean a reactive application needs three things:

- a fast network connection to persistence. If you can't talk quickly to your database, latency eats your response time no matter what you do.

- Maybe some LRU caches across requests, but caches inside an application tend to become a maintenance nightmare quickly. Stales reads are fun, and stalled cache writes are too.

- And beyond that, it should be possible to handle most request data by stack allocation/deallocation with references to a cache version if you have a cache with little overhead, depending on your security requirements.

I understand the ease of use and simplicity of a GC'd language, especially given the time GC'd languages came around big time. But I've always been wondering if you really need a GC if you think properly about request and object lifetimes.

Gibbon1 · on May 28, 2018

> handle most request data by stack allocation/deallocation

I just write crappy code for microcontrollers. Over 30 years I've seen the amount of ram available for stack grow. From maybe 20-30 bytes to a couple of k. That said it appears to me that stack allocation is severely under utilized for historical reasons. The idea that stack space is a precious resource. Which doesn't fly on machines with 10's of gigabytes of memory.

Think programs that put vast quantities of ephemeral objects and short strings on the heap. Using call tree analysis would allow you to put a lot of that on the stack resulting is much better performance and bounded latency.

sorokod · on May 28, 2018

Java performs escape analysis and may allocate non-escaping objects on the stack.

archi42 · on May 29, 2018

Quoting https://www.beyondjava.net/#/blog/escape-analysis-java

> Moving objects from the heap to the stack

> Many sources report that escape analysis moves Java objects from the heap to the stack. As Aleksey Shipilёv points out in his article about scalar replacement, the JVM does not do this implementation. It's just a misconception. But it's an interesting one, and I wonder why the JVM doesn't implement it.

MaxBarraclough · on May 29, 2018

Edit: Turns out that BeyondJava article is just poorly written, and tries so hard to emphasise its subtle distinctions that it ends up being actively harmful to the reader's understanding. I'll leave my comment anyway.

From that article:

> Java doesn't store any object in the stack. It does store local variables on the stack: primitive types like int or boolean. It also store the pointers to objects on the stack. But it doesn't store the objects themselves on the stack. Anything you create with new is always created on the heap.

Is that true? For years now there have been articles from serious sources discussing JVM escape-analysis-based optimisations.

Is there something mistaken in the analysis in this DZone article?

https://dzone.com/articles/do-not-let-your-java-objects-esca...

How about this StackOverflow answer, which even goes into the detail of distinguishing escape-analysis, stack allocation of objects, and object deconstruction+scalar replacement:

https://stackoverflow.com/a/43002529/

sorokod · on May 29, 2018

Link to Shipilev's article: https://shipilev.net/jvm-anatomy-park/18-scalar-replacement

If stack allocation was really done, it would allocate the entire object storage on the stack, including the header and the fields, and reference it in the generated code. The caveat in this scheme is that once the object is escaping, we would need to copy the entire object block from the stack to the heap, because we cannot be sure current thread stays in the method and keeps this part of the stack holding the object alive. Which means we have to intercept stores to the heap, in case we ever store stack-allocated object — that is, do the GC write barrier.

mhaymo · on May 29, 2018

Interesting article, but the distinction you're making is quite subtle. Using escape analysis, Java can avoid heap allocations by removing the creation of the Object altogether, treating its fields as if they were local variables which live in registers/the stack. The practical difference between that and simply allocating the object on the stack is that no space is required for the object header, and some reference-based operations are not possible.

MaxBarraclough · on May 29, 2018

But the article states "Anything you create with new is always created on the heap."

mhaymo · on May 29, 2018

Yes, but it then explains that escape analysis allows the runtime to avoid creating the Object entirely with scalar replacement. It's written in a confusing way, which is why I felt the need to clarify.

MaxBarraclough · on May 29, 2018

Ugh, you're right, of course.

In my defence the sentence I quoted isn't so much deliberately misleading, as outright false. It essentially states that 'new' always results in a heap allocation, which just isn't true.

A profoundly unhelpful writing style, going out of its way to punish skim-readers. I'm reminded of this TV Tropes page: http://tvtropes.org/pmwiki/pmwiki.php/Main/ByNoIMeanYes

twic · on May 28, 2018

Moreover, "removing OOBGC reduced average response times by about 25%". CPU utilisation is a pretty terrible metric on its own; once you've paid for a CPU, you might as well run it at 100%!

LukaAl · on May 28, 2018

Also average response time is a bad metrics. For one, the average is a bad measure if the variance is high. So I'll probably use 95% time instead of average.

But as a PM I have a different question. Which is the impact of the response time of Github on the user experience? 25% reduction on an average of 2 sec improves greatly the user experience, on a 200ms for a service that has basically no competition is just a marginal improvement, on a 50 ms average is not noticeable.

How is it distributed across multiple operations? Improving the latency of loading a diff is probably more important than the latency of approving a PR. In one case I'm trying to work, in the second I'm done with the work.

Now, my experience with GitHub is that it is already reasonably fast unless I'm doing something stupid. So the CPU utilization is a good metrics because 10% less CPU means 10% less server I need to pay and that goes to their bottom line (I don't know their economics to decide if it is substantial or not). The 25% reduction in latency is just icing on the cake...

majewsky · on May 28, 2018

Not really. Running CPUs at 100% load means you have no breathing room for high-load situations. It also makes them hotter, which may lead to worse performance because of throttling.

puzzle · on May 28, 2018

Not 100%, but running them at least 50% is not a bad idea. In these days of containers and QoS scheduling, it's easier to make good use of slack resources. Letting memory sit unused is an even bigger crime!

Someone told me that the team at his previous job with a "large cloud provider" based in Seattle was told not to go above 30% CPU usage, at which point they'd buy more hardware. I'm sure not all teams there do things that way, but, coming from Google, 30% (95th percentile over a week) is very low.

merb · on May 28, 2018

well if you use k8s you need to have some spare capacity for updates, etc.

consider a deployment (deployment app) update it will first create a new container and than kill the old one, which means depending how much memory you request at container creation is the least minimum you would have on spare.

if you schedule java apps with at least 1 Gi heap you would at least need 1Gi spare capacity and that is just with one pod/container (it will be worse if you need to do blue/green deploys, since you need the same capacity as your live cluster)

puzzle · on May 28, 2018

If you're running Kubernetes, you probably have multiple services running (or it's not really worth the complexity). Then whatever slack you have in the cluster can be amortized over all your services, if they all share the same resources (quota). It's also a good idea not to update too many deployments all at the same time.

Even then, deployment updates don't necessarily need to surge above their replica count. You can also configure them to terminate X replicas at a time before bringing up new ones. At Google, all teams have Borg quotas, so it's not unusual to max those up by running as many replicas as possible. During updates, Borg does not allow an user to temporarily oversubscribe their quota (unless you're changing replica count and replica footprint at the same time, but that's another fun story), so it will always take down Y tasks first.

vidarh · on May 28, 2018

Those 1000 cores seems to have been equivalent to 10% of their load. The 10% number isn't nearly as shocking.

Sean1708 · on May 29, 2018

> TLDR: they saved 400 to 1000 cores by switching off the switching off of the GC during requests.

Isn't that exactly the opposite of what the article is saying?

> An OOBGC is not really a Garbage Collector, but more of a technique to use when deciding when to collect garbage in your program. Instead of allowing the GC to run normally, the GC is stopped before processing a web request, then restarted after the response has been sent to the client. Meaning that garbage collection occurs “out of band” of request and response processing.

> This graph shows the difference in core utilization before and after OOBGC removal. In other words “number of cores used yesterday” minus “number of cores used today” ... We saw a savings of between 400 and around 1000 cores depending on usage at that point in the day.

So it sounds to me that by switching the GC on during requests they saved that many cores.

k__ · on May 28, 2018

Isn't Ruby known for these kind of issues?

vidarh · on May 28, 2018

In this case it was Github's hack to work around no-longer-existing issues with Ruby's GC that was the issue.... Note that the speedup came from removing their hack in favour of relying on the default behaviour of Ruby 2.4

jashmatthews · on May 28, 2018

It was actually Ruby 2.2 from 2014 that made this redundant. 2.2 introduced incremental marking, removing the last long GC phase in MRI Ruby.

Karunamon · on May 28, 2018

Sadly yes. It’s an absolute joy to code in, but it isn’t even kind of light.

nwmcsween · on May 28, 2018

Ruby as a language no, most of the implementations yes.