Performance Impact of Removing an Out-of-Band Garbage Collector

bluesnowmonkey · on May 28, 2018

Of course removing it reduced your CPU consumption. What did it do to p99 request latency? That was the trade-off being made by OOBGC.

Does Github use SOA? The more backend services that are involved in handling a user facing request, the more the latency of frontend requests will be dominated by the tail latency of backend services. So in distributed systems it makes much more sense to focus on those tail latencies. Who cares about average latency?

hinkley · on May 29, 2018

> Who cares about average latency?

People who are bad at statistics. So most of us programmers.

Seriously though, I don’t think many of us get around to acknowledging that microservices are distributed computing and then consider all of the limitations that go with it.

Latency is dominated by the slowest thing you can’t start earlier. So in real distributed computing you see a degree of duplicate effort in order to improve latency. It reduces throughout but increases the utility of the system.

When I took CS in college there was a proper class on distributed computing. When I got out I discovered not everybody had one of these (and I think in my program it was an elective. IIRC I chose it instead of kernel design). I had hoped that situation improved but I got tired of asking people disappointing questions about their education so I stopped asking.

srean · on May 29, 2018

> People who are bad at statistics.

Heh! sadly very true.

Many of those who do care about the p95s of an internal service often do it with a sense of chivalry that they have that 5% of the users back. The end user latency percentiles would be quite different from the service side latency, more so if its composed by a "Last_of" or a Max operator.

At times I feel tempted to write a library that offers primitives to glue microservice calls such as fan_out, retry, timeout etc but one that can reason about the percentiles of the output from the percentiles of the input.

BTW are statistical properties of latencies covered in distributed system courses ? I thought they were mostly about different kinds of guarantees, such as consensus, atleast-once, atmost-once etc etc.

hinkley · on May 29, 2018

Heck no, I got a little bit in a multivariate stats class but the rest I had to pick up in the field or a bit from books. I had a particularly illuminating moment with my ops guy trying to work out how often we could expect a drive to fail in a ten drive array and it was pretty shocking.

My knowledge is pretty thin, but the fact that I even know what questions to ask puts me in a better spot to contribute than 4 out of 5 people in the room. That shouldn’t be how things are, but that’s where we are. Every new (old) technique is a bunch of naive people crossing their fingers and hoping things go better than last time.

srean · on May 29, 2018

I thought as much. Your handle has anything to do with https://books.google.co.in/books/about/Theoretical_Statistic... ? very nice book BTW

hinkley · on May 29, 2018

Nope just random.

srean · on June 1, 2018

Spoken like a true statistician :)

tenderlove · on May 29, 2018

Hi. Sorry, I should have included p99 and p95 info in the post (I wrote the post and shipped the patches). We didn't see any change on p99 and p95 requests, which is why we decided to keep it. I will include that info next time I make a post like this, sorry.

taf2 · on May 29, 2018

There is definitely a trade off here... I noticed without OOBGC we did increase response times for the p99 and p95 requests however it was a few ms. So, for us the trade off of more CPU capacity and slightly slower in the p99 was worth the trade off. The cool thing about seeing this github post for me was to be reminded about OOBGC being a thing and that we should re-evaluate it after upgrading ruby.

twic · on May 28, 2018

I find this an interesting contrast to Go's move towards a "request-oriented collector":

https://docs.google.com/document/d/1gCsFxXamW8RRvOe5hECz98Ft...

Perhaps a key difference is this point from the Github document:

"Since the OOBGC runs the GC after the response is finished, it can cause the web worker to take longer in order to be ready to process the next incoming request. This means that clients can suffer from latency due to queuing wait times."

It seems like a little bit more intelligence in the load balancer that sits in front of the application servers would go a long way here. Could it only route requests to servers which have finished garbage collecting and are ready to serve again?

vidarh · on May 28, 2018

One option is a reversal: Make the worker request a request. Slow, unavailable or failed workers will in effect automatically back off.

tybit · on May 29, 2018

I’ve been wondering whether this approach would be viable for a while now. Are there any systems that do this? I know solo’s Gloo reverses the flow using NATS but that’s all I’ve seen use this approach.

vidarh · on May 29, 2018

I'm not sure. Mongrel2 uses ZeroMQ to talk to the backends to get it automatic handling of disconnecting backends, but I'm not sure if it's actually waiting for the backends to be ready before sending the next request.

jashmatthews · on May 28, 2018

Phusion Passenger has some intelligent routing to balance requests so you don't end up with one process idle and one process serving two requests using 2 threads.

vidarh · on May 29, 2018

Lots of systems do, but that's requiring lots of logic in the load balancer. Reversal makes this an emergent property of the system.

puzzle · on May 28, 2018

Now you have another problem: the load balancer needs to know accurately when the backend is ready. This can be prone to race conditions, too, especially for services that have very short response times, unless you manage to keep the LB/RPC protocol in lockstep with the garbage collector's behaviour. E.g. you might have to extend the LB/RPC protocol for responses in-band, to include a bit that says "those were the results for the request and, by the way, don't send me new work until I signal you again", which also assumes a permanent connection between the two ends, etc.

It is doable, but it benefits greatly if you have complete control over the whole stack. It gets harder and harder as more languages are thrown into the mix.

ASalazarMX · on May 28, 2018

Come on, we can solve that by adding a message queue manager in between. Whenever in doubt, just throw more layers at the problem.

adrianratnapala · on May 29, 2018

Yes! Especially when fixing latency problems.

zokier · on May 28, 2018

If the OOBGC pauses are anywhere near predictable, then maybe even a super stupid simple solution would work: just have the load balancer assume that backend is busy a fixed amount of time after each request. Even if you massively underestimate the busy time, it shouldn't hurt the latencies. Overestimating on the other hand leads to under-utilization, so it would make sense to set the busy time value to some small percentile of observed oobgc times.

twic · on May 29, 2018

You absolutely have to extend the protocol - or rather, at least one protocol in the stack - to communicate the status in-band. It would be a nightmare otherwise.

With HTTP/2 you could have a separate control channel carrying a text/event-stream payload, with messages indicating when the server was ready.

With HTTP/1.x, you could abuse HTTP trailers to indicate readiness - the load balancer could read the response until the end of the body, send that to the client, then wait until the server sends a trailer to say it's ready for more work. Or the load balancer could follow every real request with a request to /pleaseBlockUntilYouAreReadyForMoreWork, and only send another real request once it got a reponse to that.

With RPC protocols that aren't HTTP, and don't have a flexible presentation layer where this would fit, then you would probably have to add explicit messages to the protocol, which would be a shame.

jashmatthews · on May 28, 2018

Go's "request oriented collector" proposal is a generational GC, and Ruby has had a generational GC since 2.1. Generational GC is about minimizing GC overhead by avoiding re-marking the entire heap.

OOBC is very different. It's trading an increased overhead from collecting more often than necessary for a lower latency experienced by users.

twic · on May 29, 2018

The request-oriented collector is not just a generational GC by another name. The key point:

This write barrier design allows ROC to efficiently reclaim unpublished objects when a goroutine exits simply by resetting the current sweep pointer back to the initial sweep pointer.

If you use a disposable goroutine for a task, then when it exits, any memory it allocated but did not publish to other goroutines can be reclaimed without doing any further work. It's very much based on the idea that if you can put off collection to the end of a task, you can do it more efficiently, like OOBGC, but approached from a different angle.

I have no idea how well it actually works in practice. I've wanted something similar in Java for years, though!

jashmatthews · on May 29, 2018

This is still generational GC but with a young generation heap per coroutine. I've been working on multi-heap GC for MRI Ruby.

Compared to traditional concurrent generational GC, skipping the marking when a coroutine terminates is not really an advantage because there will be no live objects to mark and marking would immediately terminate anyway. After that, you just reset the nursery pointer and you're done.

smials · on May 28, 2018

And contrast again with Erlang's approach, where each process (actor) keeps its own heap, which can just be thrown away once the request handling process is done. Simple and clean, and requires no proofs, barriers or other juju... :)

jashmatthews · on May 28, 2018

It does have a lot of overhead in copying, though. Erlang has lower throughput than Ruby in a lot of cases.

It's somewhat mitigated by the reference counted global heap for passing large messages by reference but multi-threaded reference counting is extremely hard to do fast and AFAIK Erlang doesn't really try and do anything too complex.

derefr · on May 28, 2018

Erlang doesn’t usually need to do anything fancy with shared binary refcounts across processes, though, since in most workloads you’ve got a small number of processes sharing the binary (e.g. one transport + one request handler in HTTP), and the refcount in the shared-binaries table only needs to be synchronously updated when a process gets its first/loses its last reference.

Shared-binary refcounts are process-grained. That is: each Erlang process will allocate a single local handle to hold a reference to a given shared heap binary, and then all the slices of that shared binary in that process are pointers to that single process-arena-local handle. So you only actually have to update the refcount cross-thread when the entire binary goes out of scope, which will only happen once per request (either when all process-local references to it are found to be dead during a GC sweep on a long-lived process; or, more often, when the process’s arena is deallocated at process exit.)

TheSwordsman · on May 28, 2018

I thought this Request-Oriented Collector idea was abandoned? I seem to remember it being talked about in the Go community, where this hypothesis was tested and the results were sub-par.

zik · on May 29, 2018

Yes, and the current GC has sub millisecond pauses so it's not necessary in any case.

twic · on May 29, 2018

Ah, i stand corrected!

jonhohle · on May 28, 2018

A Least Connections strategy effectively does this without needing to know why slow servers are slow (be it GC, lemon host, networking issues, etc) or needing the server to communicate anything back to the load balancer. Hosts which are more performant will naturally get more traffic.

compsciphd · on May 29, 2018

I was thinking of this, if the backend handler would say "take me out of rotation for Xms as I'm about to do a garbage collection" and the load balancer would see this and act on it.

for http I'd imagine it can be a header. For https, I can imagine it might be harder.

ksec · on May 29, 2018

I am actually surprised GitHub still has OOBGC, which I thought was made redundant in 2014/2015 when Ruby 2.2 had incremental marking, and further improvement in GC 2.3, 2.4 and now 2.5.

In 2.6 they are introducing something similar called Sleepy GC

https://bugs.ruby-lang.org/issues/14723

There are even better memory default for 2.6 using less memory. ( Or better switch to Jemalloc )

https://bugs.ruby-lang.org/issues/14718

I cant find the thread, but there also a test showing how on Alpine Linux memory and performance were much better for Ruby Apps.

Meanwhile we are waiting for Truffle Ruby as a true drop in replacement, that should take lots of Ruby Pain point away. ( Native Image deployment, Better Memory handling and GC, Much faster JIT )

nneonneo · on May 28, 2018

Interesting work, thanks for sharing.

Nitpick:

> The blue line is CPU utilization for the day the patch went out. You can see a great drop around 15:20.

I see a dip at 18:20, not at 15:20 - if there is a dip at 15:20, maybe it would be worth highlighting it.

irishsultan · on May 28, 2018

I'd assume that it's a timezone issue. The author probably didn't check to see if the labels on the x-axis matched his own timezone and just pointed out where he knew the drop happened, instead of where it actually happened (on his graph).

chris_wot · on May 29, 2018

That's a remarkable author profile on that article.

titzer · on May 28, 2018

TLDR: they saved 400 to 1000 cores by switching off the switching off of the GC during requests.

The fact that they are running Ruby and are spending 1000 cores on GC is o_O.

stmw · on May 28, 2018

Impressive and interesting, but it is really a great example of how in larger production systems, Garbage Collection turns into "manual memory management" akin to that required to manage malloc() and free() correctly -- little performance traps, ever-changing library behaviors, tuning parameters, etc. (This is not specific to Ruby, the JVM has the same kind of "here is a whole blog post about person-weeks or person-months that went into getting the GC to behave better).

tetha · on May 28, 2018

The thing is, the more I work with network based reactive systems, the more a GC seems like an inefficient hack in an application, arising from the fact that we had nothing better at the time it was invented and grew big.

I mean a reactive application needs three things:

- a fast network connection to persistence. If you can't talk quickly to your database, latency eats your response time no matter what you do.

- Maybe some LRU caches across requests, but caches inside an application tend to become a maintenance nightmare quickly. Stales reads are fun, and stalled cache writes are too.

- And beyond that, it should be possible to handle most request data by stack allocation/deallocation with references to a cache version if you have a cache with little overhead, depending on your security requirements.

I understand the ease of use and simplicity of a GC'd language, especially given the time GC'd languages came around big time. But I've always been wondering if you really need a GC if you think properly about request and object lifetimes.

Gibbon1 · on May 28, 2018

> handle most request data by stack allocation/deallocation

I just write crappy code for microcontrollers. Over 30 years I've seen the amount of ram available for stack grow. From maybe 20-30 bytes to a couple of k. That said it appears to me that stack allocation is severely under utilized for historical reasons. The idea that stack space is a precious resource. Which doesn't fly on machines with 10's of gigabytes of memory.

Think programs that put vast quantities of ephemeral objects and short strings on the heap. Using call tree analysis would allow you to put a lot of that on the stack resulting is much better performance and bounded latency.

sorokod · on May 28, 2018

Java performs escape analysis and may allocate non-escaping objects on the stack.

archi42 · on May 29, 2018

Quoting https://www.beyondjava.net/#/blog/escape-analysis-java

> Moving objects from the heap to the stack

> Many sources report that escape analysis moves Java objects from the heap to the stack. As Aleksey Shipilёv points out in his article about scalar replacement, the JVM does not do this implementation. It's just a misconception. But it's an interesting one, and I wonder why the JVM doesn't implement it.

MaxBarraclough · on May 29, 2018

Edit: Turns out that BeyondJava article is just poorly written, and tries so hard to emphasise its subtle distinctions that it ends up being actively harmful to the reader's understanding. I'll leave my comment anyway.

From that article:

> Java doesn't store any object in the stack. It does store local variables on the stack: primitive types like int or boolean. It also store the pointers to objects on the stack. But it doesn't store the objects themselves on the stack. Anything you create with new is always created on the heap.

Is that true? For years now there have been articles from serious sources discussing JVM escape-analysis-based optimisations.

Is there something mistaken in the analysis in this DZone article?

https://dzone.com/articles/do-not-let-your-java-objects-esca...

How about this StackOverflow answer, which even goes into the detail of distinguishing escape-analysis, stack allocation of objects, and object deconstruction+scalar replacement:

https://stackoverflow.com/a/43002529/

sorokod · on May 29, 2018

Link to Shipilev's article: https://shipilev.net/jvm-anatomy-park/18-scalar-replacement

If stack allocation was really done, it would allocate the entire object storage on the stack, including the header and the fields, and reference it in the generated code. The caveat in this scheme is that once the object is escaping, we would need to copy the entire object block from the stack to the heap, because we cannot be sure current thread stays in the method and keeps this part of the stack holding the object alive. Which means we have to intercept stores to the heap, in case we ever store stack-allocated object — that is, do the GC write barrier.

mhaymo · on May 29, 2018

Interesting article, but the distinction you're making is quite subtle. Using escape analysis, Java can avoid heap allocations by removing the creation of the Object altogether, treating its fields as if they were local variables which live in registers/the stack. The practical difference between that and simply allocating the object on the stack is that no space is required for the object header, and some reference-based operations are not possible.

MaxBarraclough · on May 29, 2018

But the article states "Anything you create with new is always created on the heap."

mhaymo · on May 29, 2018

Yes, but it then explains that escape analysis allows the runtime to avoid creating the Object entirely with scalar replacement. It's written in a confusing way, which is why I felt the need to clarify.

MaxBarraclough · on May 29, 2018

Ugh, you're right, of course.

In my defence the sentence I quoted isn't so much deliberately misleading, as outright false. It essentially states that 'new' always results in a heap allocation, which just isn't true.

A profoundly unhelpful writing style, going out of its way to punish skim-readers. I'm reminded of this TV Tropes page: http://tvtropes.org/pmwiki/pmwiki.php/Main/ByNoIMeanYes

twic · on May 28, 2018

Moreover, "removing OOBGC reduced average response times by about 25%". CPU utilisation is a pretty terrible metric on its own; once you've paid for a CPU, you might as well run it at 100%!

LukaAl · on May 28, 2018

Also average response time is a bad metrics. For one, the average is a bad measure if the variance is high. So I'll probably use 95% time instead of average.

But as a PM I have a different question. Which is the impact of the response time of Github on the user experience? 25% reduction on an average of 2 sec improves greatly the user experience, on a 200ms for a service that has basically no competition is just a marginal improvement, on a 50 ms average is not noticeable.

How is it distributed across multiple operations? Improving the latency of loading a diff is probably more important than the latency of approving a PR. In one case I'm trying to work, in the second I'm done with the work.

Now, my experience with GitHub is that it is already reasonably fast unless I'm doing something stupid. So the CPU utilization is a good metrics because 10% less CPU means 10% less server I need to pay and that goes to their bottom line (I don't know their economics to decide if it is substantial or not). The 25% reduction in latency is just icing on the cake...

majewsky · on May 28, 2018

Not really. Running CPUs at 100% load means you have no breathing room for high-load situations. It also makes them hotter, which may lead to worse performance because of throttling.

puzzle · on May 28, 2018

Not 100%, but running them at least 50% is not a bad idea. In these days of containers and QoS scheduling, it's easier to make good use of slack resources. Letting memory sit unused is an even bigger crime!

Someone told me that the team at his previous job with a "large cloud provider" based in Seattle was told not to go above 30% CPU usage, at which point they'd buy more hardware. I'm sure not all teams there do things that way, but, coming from Google, 30% (95th percentile over a week) is very low.

merb · on May 28, 2018

well if you use k8s you need to have some spare capacity for updates, etc.

consider a deployment (deployment app) update it will first create a new container and than kill the old one, which means depending how much memory you request at container creation is the least minimum you would have on spare.

if you schedule java apps with at least 1 Gi heap you would at least need 1Gi spare capacity and that is just with one pod/container (it will be worse if you need to do blue/green deploys, since you need the same capacity as your live cluster)

puzzle · on May 28, 2018

If you're running Kubernetes, you probably have multiple services running (or it's not really worth the complexity). Then whatever slack you have in the cluster can be amortized over all your services, if they all share the same resources (quota). It's also a good idea not to update too many deployments all at the same time.

Even then, deployment updates don't necessarily need to surge above their replica count. You can also configure them to terminate X replicas at a time before bringing up new ones. At Google, all teams have Borg quotas, so it's not unusual to max those up by running as many replicas as possible. During updates, Borg does not allow an user to temporarily oversubscribe their quota (unless you're changing replica count and replica footprint at the same time, but that's another fun story), so it will always take down Y tasks first.

vidarh · on May 28, 2018

Those 1000 cores seems to have been equivalent to 10% of their load. The 10% number isn't nearly as shocking.

Sean1708 · on May 29, 2018

> TLDR: they saved 400 to 1000 cores by switching off the switching off of the GC during requests.

Isn't that exactly the opposite of what the article is saying?

> An OOBGC is not really a Garbage Collector, but more of a technique to use when deciding when to collect garbage in your program. Instead of allowing the GC to run normally, the GC is stopped before processing a web request, then restarted after the response has been sent to the client. Meaning that garbage collection occurs “out of band” of request and response processing.

> This graph shows the difference in core utilization before and after OOBGC removal. In other words “number of cores used yesterday” minus “number of cores used today” ... We saw a savings of between 400 and around 1000 cores depending on usage at that point in the day.

So it sounds to me that by switching the GC on during requests they saved that many cores.

k__ · on May 28, 2018

Isn't Ruby known for these kind of issues?

vidarh · on May 28, 2018

In this case it was Github's hack to work around no-longer-existing issues with Ruby's GC that was the issue.... Note that the speedup came from removing their hack in favour of relying on the default behaviour of Ruby 2.4

jashmatthews · on May 28, 2018

It was actually Ruby 2.2 from 2014 that made this redundant. 2.2 introduced incremental marking, removing the last long GC phase in MRI Ruby.

Karunamon · on May 28, 2018

Sadly yes. It’s an absolute joy to code in, but it isn’t even kind of light.

nwmcsween · on May 28, 2018

Ruby as a language no, most of the implementations yes.