EVCache is a disaster. The code base has no concept of a threading model. The code is almost completely untested* too. I was on call at least twice when EVCache blew up on us. I tried root-causing it and the code is a rat's nest. Avoid!
EVCache definitely has some sharp edges and can be hard to use, which is one of the reasons we are putting it behind gRPC abstractions like this Counter one or KeyValue [1], which offer CompletableFuture APIs with clean async and blocking modalities. We are also starting to add proper async APIs to EVCache itself, e.g. getAsync [2], which the abstractions use under the hood.
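To make the "one API, two modalities" point concrete, here's a minimal sketch of what a CompletableFuture-based Counter client can look like. The class and method names are illustrative only, not the actual Netflix abstraction; the idea is simply that the blocking path is just `join()` on the same future the async path returns.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical Counter client: one future-returning API serving both
// async and blocking callers. In-memory map stands in for the real store.
class CounterClient {
    private final ConcurrentHashMap<String, AtomicLong> store = new ConcurrentHashMap<>();

    // Async modality: callers compose on the returned future.
    CompletableFuture<Long> increment(String key, long delta) {
        return CompletableFuture.supplyAsync(() ->
                store.computeIfAbsent(key, k -> new AtomicLong()).addAndGet(delta));
    }

    // Blocking modality is just join() on the same future.
    long incrementBlocking(String key, long delta) {
        return increment(key, delta).join();
    }
}
```

A real client would of course issue a network call inside the future rather than touch a local map.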
At the same time, EVCache is the cheapest (by about 10x in our experiments) caching solution with global replication [3] and cache warming [4] we are aware of. Every time we've tried alternatives like Redis or managed services they either fail to scale (e.g. cannot leverage flash storage effectively [5]) or cost waaay too much at our scale.
I absolutely agree, though, that EVCache is probably the wrong choice for most folks - most folks aren't doing 100 million operations/second with 4-region full-active replication and applications that expect p50 client-side latency <500us. Similar, I think, to how most folks should probably start with PostgreSQL and not Cassandra.
Throwing out a clarification: EVCache is effectively a complex memcached client plus an internal ecosystem at Netflix. You can get much of its benefit with other systems (such as memcached's built-in proxy: https://docs.memcached.org/features/proxy/).
Other apps plugging in may only need a small slice of EVCache: just the local-then-far fetch, copying sets to multiple zones, etc. A greenfield client against the same backing store would be fairly trivial to write.
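The "copy sets to multiple zones" slice is small enough to sketch: a write simply fans out to one replica per availability zone and completes when all acknowledge. `ZoneClient` here is a hypothetical stand-in for a pooled memcached connection, not a real EVCache class.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-zone client; a map stands in for a memcached replica.
class ZoneClient {
    final Map<String, String> data = new ConcurrentHashMap<>();
    CompletableFuture<Void> set(String key, String value) {
        return CompletableFuture.runAsync(() -> data.put(key, value));
    }
}

class ReplicatingWriter {
    private final List<ZoneClient> zones;
    ReplicatingWriter(List<ZoneClient> zones) { this.zones = zones; }

    // Fan the set out to every zone; complete when all have acknowledged.
    CompletableFuture<Void> set(String key, String value) {
        CompletableFuture<?>[] writes = zones.stream()
                .map(z -> z.set(key, value))
                .toArray(CompletableFuture[]::new);
        return CompletableFuture.allOf(writes);
    }
}
```

A production version also has to decide what a partial failure means (fire-and-forget vs. quorum), which is where the complexity actually lives.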
That all said, I wouldn't advise people to copy their method of expanding cache clusters: it's possible to add or remove one instance at a time without rebuilding and re-warming the whole thing.
Every zone has a copy, and clients always read their local zone's copy first (via pooled memcached connections), falling back exactly once to another zone on a miss. The key is staying in-zone, the memcached protocol, and very fast server latencies. It's been a little while since we measured, but memcached has a time-to-first-byte of around 10us and then scales sublinearly with payload size [1]. Single-zone latency is variable but generally between 150 and 250us round trip; cross-AZ is terrible at up to a millisecond [2].
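The read path described above fits in a few lines: try the local zone, and fall back exactly once to a remote zone on a miss. The maps below are hypothetical stand-ins for pooled memcached connections.

```java
import java.util.Map;
import java.util.Optional;

// Sketch of the zone-local-first read with a single cross-zone fallback.
class ZoneReader {
    private final Map<String, String> localZone;
    private final Map<String, String> remoteZone;

    ZoneReader(Map<String, String> local, Map<String, String> remote) {
        this.localZone = local;
        this.remoteZone = remote;
    }

    Optional<String> get(String key) {
        String v = localZone.get(key);          // fast path: same-AZ replica, ~150-250us
        if (v == null) v = remoteZone.get(key); // one cross-AZ fallback on miss, up to ~1ms
        return Optional.ofNullable(v);
    }
}
```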
So you combine ~200us of network with ~30us of response time and get about 250us average latency. Of course the p99 tail is closer to a millisecond, and you have to do things like hedged requests to fight things like the hard-coded 200ms TCP retransmit timer (an eternity at these latencies)... but that's a whole other can of worms.
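A hedged request is simpler than it sounds: fire the primary request, and if it hasn't completed within a short hedge delay, fire a second copy and take whichever answers first. A minimal sketch (illustrative names, error handling omitted):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Sketch of request hedging: a late second request caps the tail latency
// caused by stalls like a TCP retransmit timeout on the first attempt.
class Hedger {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r);
                t.setDaemon(true); // don't keep the JVM alive for hedge timers
                return t;
            });

    <T> CompletableFuture<T> hedged(Supplier<CompletableFuture<T>> request,
                                    long hedgeDelayMs) {
        CompletableFuture<T> result = new CompletableFuture<>();
        request.get().thenAccept(result::complete);       // primary attempt
        scheduler.schedule(() -> {
            if (!result.isDone()) {                        // primary is slow:
                request.get().thenAccept(result::complete); // race a second copy
            }
        }, hedgeDelayMs, TimeUnit.MILLISECONDS);
        return result;
    }
}
```

Picking the hedge delay is the tuning knob: somewhere around the p95-p99 latency keeps the extra load small while still cutting off the worst stalls.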
* https://github.com/Netflix/EVCache