20 years ago, in the programming languages course at university, we started learning about OOP using Java and Ada as examples. When the professor started describing Java, a fellow student interrupted him to inform the class that "Java has garbage collection" (and boast of his special knowledge).
After that incident his nickname was "the garbage collector"!
The best part is that it's faster than manual management. People will tell you they need to do malloc and free manually for performance, but when you actually run the numbers GC wins for a majority of use cases.
Tracing garbage collectors don’t generally win against reference counting collectors, especially when those reference counts are automatically elided via ARC (e.g. Swift and Objective-C) or are rarely needed by means of composition (C++ and Rust). Additionally, different allocation strategies are better depending on the use case (e.g. a pool allocator that you bulk drop at the end of some computation).
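To make that last point concrete, here's a minimal bulk-drop pool sketch in Rust (the `Pool` type and its methods are invented for illustration, not any particular library's API):

```rust
// Hypothetical bulk-drop pool: objects are pushed into one Vec during a
// computation and all freed together when the pool goes out of scope,
// with no per-object refcounting or tracing along the way.
struct Pool<T> {
    items: Vec<Box<T>>,
}

impl<T> Pool<T> {
    fn new() -> Self {
        Pool { items: Vec::new() }
    }

    // Hand out a stable index rather than a reference, to keep the
    // sketch borrow-checker friendly.
    fn alloc(&mut self, value: T) -> usize {
        self.items.push(Box::new(value));
        self.items.len() - 1
    }

    fn get(&self, idx: usize) -> &T {
        &self.items[idx]
    }
}

fn main() {
    let mut pool = Pool::new();
    let first = pool.alloc(42);
    for i in 0..1000 {
        pool.alloc(i); // cheap: no individual frees ever happen
    }
    println!("{}", pool.get(first));
    // The whole pool drops here in one pass, at the end of the computation.
}
```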
What papers are you referencing showing tracing GCs outperforming things? If it’s just the website, I think it’s an artifact of a micro-benchmark rather than something that holds true for non-trivial programs.
I've always heard of the "Swift elides reference counts" statements but I've never seen it substantiated. I don't claim to be a Swift GC expert by any means, but the impression I get from the two Swift GC papers I've read [1, 2] is that Swift has a very simple implementation of RC. The RC optimization document (albeit the document is incomplete) [3] also doesn't give me the impression that Swift is doing much eliding of reference counts (I'm sure it is doing it for simple cases).
Do you have any links which might explain what kind of eliding Swift is doing?
EDIT: The major RC optimizations I have seen which elide reference-count operations are deferral and coalescing, and I'm fairly certain that Swift is doing neither.
That is correct. Like Objective-C, the compiler statically removes ARC operations when it can prove they aren’t needed (or when you annotate it, because whatever you’re getting the reference from is manually managed and hands you ownership). So within a function or cross function with LTO (although it looks like swift doesn’t yet have LTO [1] so I’m not sure about cross-module optimizations).
> When receiving a return result from such a function or method, ARC releases the value at the end of the full-expression it is contained within, subject to the usual optimizations for local values.
Quote from [2] (section 6 has the full details about optimizations). I believe [3] might be the compiler pass.
There actually is some elision that happens at runtime if you install an autoreleasepool if I recall correctly.
I did actually work at Apple so that’s where my recollection comes from although it’s been 8 years since then and I didn’t work on the compiler side of things so my memory could be faulty.
> So within a function or cross function with LTO (although it looks like swift doesn’t yet have LTO [1] so I’m not sure about cross-module optimizations).
I'm not sure Swift supports "static library modules" so in practice any linked object consists of a single module.
> There actually is some elision that happens at runtime if you install an autoreleasepool if I recall correctly.
There is a trick in the ObjC ABI that elides autorelease-returns, but it's deterministic after compilation time so I wouldn't call it a runtime optimization.
Correct. But it’ll do it within functions in the same module.
> but it's deterministic after compilation time so I wouldn't call it a runtime optimization.
What do you mean? My understanding is that autoreleasepool is 100% at runtime. The compiler is not involved afaik except to know to register autorelease with the currently installed pool.
> What do you mean? My understanding is that autoreleasepool is 100% at runtime. The compiler is not involved afaik except to know to register autorelease with the currently installed pool.
The compiler emits calls that always put something in the autorelease pool or always don't; there are no smart decisions at runtime that skip it or make the ordering of releases nondeterministic. A garbage collector runs whenever it feels like it, and so the ordering of finalizations changes.
Oh sure. You have to explicitly indicate which resources should go to the autoreleasepool. It’s not as magic as ARC. It’s also generally fallen out of favor at Apple as far as I could tell.
The conventional wisdom is that evacuating (copying) GCs win over malloc/free since 1) the GC touches only the live data and not the garbage, and 2) it compacts the active memory periodically, which improves its cache and (when relevant) paging hit rates.
Obviously though, this will be situation dependent.
Then why does every performant managed language opt for a tracing GC when it can?
RC is used in lower level languages because it doesn’t require runtime support, and can be implemented as a library.
As I wrote in another comment, even with elisions you are still trading constant writes on the working thread against work a tracing GC can do in parallel, and you even have to pay for synchronization in parallel contexts.
Because tracing GCs can collect reference cycles, which RC can’t. So at the language level, where you have to handle all sorts of programs written by programmers of varying quality (+ mistakes), a tracing GC gives more predictable memory usage and performance across a broader range of programs.
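To illustrate the cycle problem for anyone who hasn't hit it: a minimal Rust sketch where a strong parent pointer would leak under plain RC, and `Weak` is the manual escape hatch (a tracing GC would reclaim the cycle without any of this):

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

struct Node {
    // If this were Rc<Node>, parent and child would keep each other alive
    // forever under reference counting; Weak breaks the cycle manually.
    parent: RefCell<Weak<Node>>,
    children: RefCell<Vec<Rc<Node>>>,
}

fn main() {
    let parent = Rc::new(Node {
        parent: RefCell::new(Weak::new()),
        children: RefCell::new(Vec::new()),
    });
    let child = Rc::new(Node {
        parent: RefCell::new(Rc::downgrade(&parent)),
        children: RefCell::new(Vec::new()),
    });
    parent.children.borrow_mut().push(Rc::clone(&child));

    // Strong counts: parent = 1, child = 2. No strong cycle, so both are
    // freed when they go out of scope. A tracing GC makes this a non-issue.
    println!("child strong count: {}", Rc::strong_count(&child));
}
```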
Seriously. A single threaded reference counter is super cheap. Cross thread reference counts shouldn’t be used and I think are an anti pattern - it’s better to have the owning thread be responsible for maintaining the reference count and passing a borrow via IPC that the borrower has to hand back. There is also hybrid RC where you Arc across threads but use RC within the thread. This gives you the best of both worlds with minimal cost. Which model you prefer is probably a matter of taste.
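In Rust terms the hybrid pattern looks something like this (a hand-rolled sketch of the idea, not WebKit's actual API): the atomic `Arc` is touched once at the thread boundary, and all intra-thread sharing goes through a non-atomic `Rc` wrapper.

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

fn main() {
    // Shared data crosses the thread boundary via one atomic Arc clone.
    let shared = Arc::new(vec![1u8, 2, 3]);

    let handle = thread::spawn({
        let shared = Arc::clone(&shared); // the only atomic increment
        move || {
            // Inside the thread, re-wrap in Rc: every further share is a
            // cheap non-atomic increment.
            let local = Rc::new(shared);
            let a = Rc::clone(&local);
            let b = Rc::clone(&local);
            a.len() + b.len() // auto-derefs through Rc<Arc<Vec<u8>>>
        }
    });

    println!("{}", handle.join().unwrap());
}
```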
CPUs are stupid fast at incrementing and decrementing a counter. Additionally, most allocations should be done on the stack, with a small amount done on the heap for things that are larger or need to outlive the current scope. I’ve written all sorts of performance-critical programs (including games) and never once has shared_ptr in C++ (which is atomic) popped up in the profiler, because the vast majority of allocations are on the stack, value composition, or unique_ptr (ie no GC of any kind needed).
The fastest kind of GC is one where you don’t need any (ie Box / unique_ptr). The second fastest is an integer increment that’s likely in your CPU cache. I don’t think anyone can claim that pointer chasing is “fast” and certainly not faster than ARC. Again, assuming you’re being careful and not throwing ARC around everywhere when it’s not needed in the first place. Value composition is much more powerful; leave RC / Arc for when you have a more complicated object graph with shared ownership (and even then, try to give ownership to the root uniquely or through RC, and hand out references only to children and RC to peers).
Even when I had this in the hot path 10 years ago, owning hundreds of objects in a particle filter and handing out ownership copies and creating new ones, it ended up accounting for only ~5% (ie the win from making it a contiguous vector without any shared_ptr). It can be expensive, but in those cases you probably shouldn’t be using shared_ptr.
Oh, and incrementing an integer by itself (non-atomically) is stupid fast. You can do a billion of them per second. The CPU doesn’t actually write that immediately to RAM, and you’re not adding a huge amount of extra cache pressure vs all the other things your program is doing normally.
If you write to cache, then depending on architecture that change has to be made visible to every other thread as well. Reading is not subject to such a constraint.
MESI cache coherency (and its derivatives) [1] means that you can have exclusive writes to cache if and only if no other core tries to access that data. I would think most if not all microarchitectures have moved to MESI (or equivalent) cache coherency protocols as they avoid unnecessary writes to memory.
Only for atomics. Single-threaded RC counts are fine and WebKit uses hybrid RC where you use an atomic shared ptr to share across threads and then you downgrade it to a non atomic version in-thread (and can hand out another atomic copy at any time).
Atomics are rarely needed as you should really try to avoid sharing ownership across threads and instead change your design to avoid that if you can.
>> If you write to cache, then depending on architecture that change has to be made visible to every other thread as well
> Only for atomics
I don't think cache coherency is aware of threads; they live at a level above it (IANAExpert though)
> where you use an atomic shared ptr to share across threads and then you downgrade it to a non atomic version in-thread (and can hand out another atomic copy at any time).
Your idea of GC is very different from others': you are happy to do a ton of manual stuff, while GC is generally about not doing a ton of manual stuff.
ARM also has an option for weaker-consistency atomics, which helps with speed (ARC does take advantage of this, but it's not something Apple specifically accelerated in silicon, afaik).
Proof about throughput? Stating something as fact isn’t quite as strong as showing your axes. Also, performance has multiple axes of measurement: throughput, latency, memory usage, etc. Java eating all your memory and spending a non-trivial amount of overall CPU on a tracing collector, which can negatively impact latency, isn’t necessarily what people want efficiency-wise; smearing that over total execution time tends to give better results.
The main advantage of a tracing GC is that you get an easier programming model (which isn’t a trivial trade-off), but the downside is that you don’t have deterministic destruction, which can make certain programming paradigms difficult (yes, Java introduced alternate mechanisms to get you there if you care, but it feels like a 100% bolt-on rather than something integrated more neatly and elegantly, and most developers don’t actually care).
I will say that you seem to be stating "smearing that over total execution time tends to give better results" without proof as well. I certainly don't think using atomic operations everywhere and converting every pointer read operation into a write operation is efficient. Yes, RC has smaller minimum heap sizes than tracing GC, but garbage collection is fundamentally a time-space tradeoff, and RC is highly optimized for memory efficiency. Also most naive RC implementations don't copy objects so you get heap fragmentation.
Comparing naive RC to tracing GC is non-trivial. In order to have a fair comparison, you'd have to implement naive RC and tracing GC in the same system and then compare them across a set of benchmarks. I personally have never come across a great performance study which compared naive RC to tracing GC.
As I’ve stated repeatedly, RC is super cheap. Seriously. You can do several billion of them per second. Your malloc/free call is going to be more expensive.
Sure. Atomic counters are relatively expensive. But in good designs there are very few of them. And they easily show up in hotspots if they’re a problem, and then you fix your object model. The problem with tracing GC is that you have no way to fix it. Most languages that use RC actually avoid most memory allocations/frees by leveraging value composition instead of referential ownership.
I did actually provide proof by the way. Apple’s phones use half the RAM of Android phones and are at least as fast, even if you discount better HW. Similarly, any big hyperscaler is unlikely to be using Java for their core performance-critical infrastructure. To me those are pretty clear performance advantages.
> I certainly don't think using atomic operations everywhere and converting every pointer read operation into a write operation is efficient.
I’m unaware of anyone who uses RC properly doing this. You only do this when you need to share ownership, and you should be doing that exceedingly sparingly. Unique ownership and referential sharing are by far the most common. If you’re passing RC into a function that doesn’t retain ownership beyond its call scope, you’re not using RC properly.
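In Rust terms, that discipline looks something like this (types invented for illustration): callees that don't retain the value take a plain borrow, and only code that genuinely stores shared ownership ever touches the refcount.

```rust
use std::rc::Rc;

struct Mesh {
    verts: Vec<f32>,
}

// Doesn't retain ownership past the call, so it takes a borrow:
// zero refcount traffic.
fn total(m: &Mesh) -> f32 {
    m.verts.iter().sum()
}

// Actually stores shared ownership, so taking an Rc is justified.
struct Scene {
    meshes: Vec<Rc<Mesh>>,
}

impl Scene {
    fn add(&mut self, m: Rc<Mesh>) {
        self.meshes.push(m);
    }
}

fn main() {
    let m = Rc::new(Mesh { verts: vec![1.0, 2.0] });
    let _ = total(&m); // deref only, no increment
    let mut scene = Scene { meshes: Vec::new() };
    scene.add(Rc::clone(&m)); // one increment, ownership genuinely shared
}
```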
> Also most naive RC implementations don't copy objects so you get heap fragmentation.
That’s only kind of true. Good allocators seem to mitigate this problem quite effectively (glibc’s is notably not good at this compared with mimalloc and the new tcmalloc). But sure, that is kind of a problem. It can be mitigated, though, by optimizing your allocation patterns once you know that’s the problem (noticing it is by far the hardest bit). And because it’s more manual, you can customize your allocator.
Look. I’m not disagreeing about the developer benefits being significant. I’m just saying that good memory management (made fairly easy in Rust) is always going to outperform tracing GC the same way optimized assembly will outperform the compiler. It’s possible that a tracing GC can provide better performance out of the gate with minimal optimization, vs needing to spend more time optimizing your allocation strategies if you make design mistakes. But remember: a tracing garbage collector still needs to do atomic reads of data structures which potentially requires cross-CPU shootdowns. And you still generally need to stop the world (I think that may not be true of some advanced Java GC algorithms, but that’s the exception rather than the rule, and you trade off even lower throughput). And I can’t belabor this point enough - languages with RC use it rarely as shared ownership is rarely needed and you can typically minimize where you use it.
Let me give you a real world example. When I was working on indoor positioning in iOS, I ported our original Java codebase verbatim, using shared_ptr for almost every Java allocation where I might even potentially be sharing ownership, as I wanted to start with safety and optimize later. Not only was the initial version faster than the equivalent Java code (not surprising, since C++ will outperform due to no auto-boxing, at least at the time), when I got rid of shared_ptr in the particle filter, which is a core hot path, it only showed a 5-10% improvement in perf (using Accelerate for the core linear algebra code was way more impactful). The vast majority of that actually came from the particles now being laid out contiguously within the vector, rather than from the overhead of the atomic counting. Just saying. People really overestimate the cost of RC because GC is rarely needed in the first place / when it is, ownership shouldn’t be being modified in your hot path. When I worked at Oculus on Link, we used shared_ptr liberally in places because, again, ownership is actually rarely shared - most allocations are unique_ptr.
Edit: note that I’m explicitly distinguishing RC (single-threaded) from ARC (atomic multi-threaded RC). Confusingly, ARC stands for automatic RC in Apple land, although it’s atomic there too. Automatic RC is trickier but again, as Swift and ObjC demonstrate, it’s generally good enough without any serious performance implications (throughput or otherwise).
> As I’ve stated repeatedly, RC is super cheap. Seriously. You can do several billion of them per second. Your malloc/free call is going to be more expensive.
But that's the whole point. Tracing GC doesn't do malloc/free, and that's where the performance advantages come from. Instead of a complex allocator that has to do lots of work on every free() as well, you get a bump-pointer allocator and move some complexity over to the tracing thread.
And this is especially true if you tend to have large linked data structures.
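For readers who haven't seen one: a toy bump allocator in Rust, just to show why that fast path is so cheap (a hand-rolled sketch, not any particular GC's nursery):

```rust
// Toy bump allocator: allocation is an aligned pointer increment plus a
// bounds check, and there is no per-object free. A real GC nursery would
// trigger a collection/evacuation when the arena fills instead of failing.
struct Bump {
    buf: Vec<u8>,
    next: usize,
}

impl Bump {
    fn new(capacity: usize) -> Self {
        Bump { buf: vec![0; capacity], next: 0 }
    }

    // `align` must be a power of two.
    fn alloc(&mut self, size: usize, align: usize) -> Option<&mut [u8]> {
        let start = (self.next + align - 1) & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None; // arena full
        }
        self.next = end;
        Some(&mut self.buf[start..end])
    }
}

fn main() {
    let mut arena = Bump::new(1 << 16);
    let slot = arena.alloc(64, 8).expect("arena has space");
    slot[0] = 42; // use the fresh allocation; nothing is ever freed singly
}
```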
With all due respect, you are talking about everything, but you never give any objective comparison where RC vs. tracing GC was actually the culprit.
> RC is super cheap. Seriously. You can do several billion of them per second. Your malloc/free call is going to be more expensive.
Yes, and it is completely irrelevant. Good tracing GCs can use a thread local bump allocator, and it can even defragment it automatically later.
> Sure. Atomic counters are relatively expensive. But in good designs there are very few of them. And they easily show up in hotspots
That’s just false, unless you are using something like cachegrind or so: any sane compiler will inline the 2-3 instructions of a counter increment/decrement. It is the stereotypical “death by a thousand cuts”, never showing up in profiling.
> Most languages that use RC actually avoid most memory allocations/frees by leveraging value composition instead of referential ownership.
I’m sorry but this has absolutely nothing to do with the topic here, there are plenty of tracing GCd languages that can do that as well, like D, Go, C#, Nim just from the top of my head. That’s just a completely different axis.
> I did actually provide proof by the way. Apple’s phones use half the RAM of Android phones and are at least as fast, even if you discount better HW
That only proves that RC uses less memory, which, as has been pointed out in the thread many times, is one of its few positives - it does make sense on a mobile phone. But then you finish off with “big hyperscalers are unlikely to be using Java for their core performance-critical infrastructure”, which is just bad logic, and straight up false. Half of the internet literally runs on Java, with heap sizes going up to the terabyte range. Alibaba, many parts of Google, practically all of Apple’s backend systems, Twitter, whole cloud providers(!) all run on the JVM.
> You only do this when you need to share ownership
That’s like... the point of garbage collection algorithms? Otherwise why don’t you just randomly increment a counter here and there?!
> Not only was the initial version faster than the equivalent Java code (not surprising, since C++ will outperform due to no auto-boxing, at least at the time
You literally give the reason why it was faster: you are comparing a sequentially laid-out data structure to one made of pointers. The effect size of that will trump any concern about which GC algorithm is used, so your point doesn’t apply here at all.
> RC is super cheap. Seriously. You can do several billion of them per second.
Right. You bump allocate faster, however. RC is an additional operation to the allocation request. Given naive RC can't move objects, it necessarily needs a free-list allocator. Free-list allocators can allocate pretty fast (for the most common object sizes), but can't reach the speeds of bump allocators. Furthermore, bump allocators have better locality of reference than free-list allocators.
Also, I've never questioned that you can do non-atomic increments/decrements efficiently. They are exceptionally fast. However, you are still converting every pointer read operation into a pointer write operation. The rule of thumb is that there are generally 10x more pointer read operations in a program than pointer write operations. This is why read barriers are also considered more costly than write barriers.
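A small Rust illustration of that read-becomes-write effect (function names are mine, nothing standard): taking a local strong reference to "just read" a value writes to the counter twice, while a borrow keeps the loop read-only.

```rust
use std::rc::Rc;

// Naive RC style: every iteration does an increment (a memory write) just
// to read the string, plus a decrement when `local` goes out of scope.
fn sum_lengths_counted(strings: &[Rc<String>]) -> usize {
    let mut total = 0;
    for s in strings {
        let local = Rc::clone(s); // write to the refcount
        total += local.len();
    } // another write to the refcount here
    total
}

// Borrowed style: the loop never touches the counter at all; this is the
// kind of elision Rust makes explicit and RC'd languages have to infer.
fn sum_lengths_borrowed(strings: &[Rc<String>]) -> usize {
    strings.iter().map(|s| s.len()).sum()
}

fn main() {
    let strings = vec![Rc::new(String::from("hi")), Rc::new(String::from("there"))];
    assert_eq!(sum_lengths_counted(&strings), sum_lengths_borrowed(&strings));
}
```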
> I did actually provide proof by the way. Apple’s phones use half the RAM of Android phones and are at least as fast, even if you discount better HW.
I don't really agree with this statement. There are way too many unknown/unaccounted variables. It could be better hardware, maybe Android has a terrible architecture, or it could be, as you said, that Swift RC is genuinely better than ART's GC, or whatever. Point is that it's not a scientific comparison. We don't know if the benefits in iOS come from RC. And we won't be able to know unless we have two systems where all parameters are the exact same _except_ one uses RC and the other a tracing GC, with both systems run on the same hardware and on the same set of benchmarks.
> That’s only kind of true. Good allocators seem to mitigate this problem quite effectively
Yes. Note that mimalloc literally has a concept of "deferred frees", effectively emulating GC, as otherwise freeing an object can result in an unbounded recursive free call (for example, dropping a large linked list).
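That linked-list failure mode is easy to reproduce, for what it's worth; here's a minimal Rust sketch of the usual fix, an explicit iterative `Drop` replacing the default recursive one:

```rust
struct Node {
    next: Option<Box<Node>>,
}

struct List {
    head: Option<Box<Node>>,
}

impl Drop for List {
    fn drop(&mut self) {
        // The default drop would recurse once per node and can blow the
        // stack on a long list; detaching nodes in a loop frees them one
        // at a time with constant stack depth.
        let mut cur = self.head.take();
        while let Some(mut node) = cur {
            cur = node.next.take();
        }
    }
}

fn main() {
    let mut list = List { head: None };
    for _ in 0..1_000_000 {
        list.head = Some(Box::new(Node { next: list.head.take() }));
    }
    // `list` drops here without deep recursion.
}
```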
> I’m just saying that good memory management (made fairly easy in Rust) is always going to outperform tracing GC the same way optimized assembly will outperform the compiler.
Sure. The perfect memory manager is omniscient and knows exactly when an object is not required and will free it. But unfortunately we don't have perfect memory managers yet. So yes I agree in principle, but we have to be pragmatic.
> And I can’t belabor this point enough - languages with RC use it rarely as shared ownership is rarely needed and you can typically minimize where you use it.
I feel like that entirely depends on the problem domain? I don't know where you're getting the "shared ownership is rarely needed" from? Maybe this is true for your problem domain but may not be for others. And if it's true for yours, then great! You can use RC and optimize your programs that way.
> A tracing garbage collector still needs to do atomic reads of data structures which potentially requires cross-CPU shootdowns.
Sure. GC metadata needs to be accessed and updated atomically. 100% agree with you. The order of magnitude of those operations is likely much less than what you would get with naive RC though.
> as Swift and ObjC demonstrate, it’s generally good enough without any serious performance implications (throughput or otherwise).
In one of my comments about Swift in this thread, the two papers I linked show up to 80% of benchmark execution time being dominated by ARC operations! I will note that the papers are around 6-7 years old, so the situation might have drastically changed since then, but I haven't personally found many contemporary evaluations of Swift RC.
Yes and no. LXR is a highly optimized deferred and coalesced reference counting (amongst many other optimizations) GC and it looks nothing like the RC implementations you see in other languages like Python, Nim, Swift, etc. So yes, reference counting can be performant, _but_ only if you are using a deferred and coalesced implementation as otherwise the simple implementation requires atomic operations for every pointer read/write operation (though compilers could optimize it to use non-atomic operations when it's sure semantics are preserved).
EDIT: The post you're responding to is referring to the simple standard implementation of RC.
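For readers unfamiliar with the terms: a toy sketch of what "deferred" counting means, with all names invented and the real machinery in something like LXR (stack scanning, zero-count buffering, concurrency) glossed over. The point is only that the hot path appends to a log, and counts are reconciled, and coalesce, in batches:

```rust
use std::collections::HashMap;

// Toy deferred RC: count mutations are buffered on the hot path and
// reconciled later. Real implementations are far more involved.
enum Op {
    Inc(usize), // object id
    Dec(usize),
}

struct DeferredRc {
    counts: HashMap<usize, isize>, // object id -> reference count
    log: Vec<Op>,                  // buffered increments/decrements
}

impl DeferredRc {
    // Hot path: just an append; no count update, no free.
    fn record(&mut self, op: Op) {
        self.log.push(op);
    }

    // Periodic flush: matching inc/dec pairs net out in the batch, and
    // reclamation happens here, off the mutator's hot path.
    fn flush(&mut self) -> Vec<usize> {
        let mut dead = Vec::new();
        for op in self.log.drain(..) {
            let (id, delta) = match op {
                Op::Inc(id) => (id, 1),
                Op::Dec(id) => (id, -1),
            };
            let count = self.counts.entry(id).or_insert(0);
            *count += delta;
            if *count == 0 {
                dead.push(id); // candidate for reclamation
            }
        }
        dead
    }
}

fn main() {
    let mut rc = DeferredRc { counts: HashMap::new(), log: Vec::new() };
    rc.record(Op::Inc(7));
    rc.record(Op::Dec(7)); // nets out against the increment at flush time
    println!("dead: {:?}", rc.flush()); // [7]
}
```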
I'm aware. Rust isn't a conventional "managed language" which is why I consciously omitted it from the above list. In an increasingly multi-threaded world, you just have to use the synchronized version of RC (aka Arc<T> in Rust).
Compared to what, though? And is that still the case if all OS components use whatever it is, as opposed to a few applications? Memory efficiency is crucial for overall system performance, and ARC is highly memory-efficient compared to every production GC I’m aware of.
Peak memory is not the only reason ARC is memory-efficient though, the main reason is that it has better memory reading behavior because it doesn't have to scan for pointers. Server GCs assume all memory containing pointers is cheap to access, which is not true if some of it is swapped out.
As for .NET and JVM implementations, that is a consequence of the underlying platform, just like C++/CLI and C++/CX, and none of them have been that big in the Ada compiler market.