EVCache is a disaster. The code base has no concept of a threading model. The code is almost completely untested* too. I was on call at least twice when EVCache blew up on us. I tried root-causing it and the code is a rat's nest. Avoid!
EVCache definitely has some sharp edges and can be hard to use, which is one of the reasons we are putting it behind gRPC abstractions like this Counter one or KeyValue [1], which offer CompletableFuture APIs with clean async and blocking modalities. We are also starting to add proper async APIs to EVCache itself, e.g. getAsync [2], which the abstractions use under the hood.
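To make the "one API, two modalities" point concrete, here's a minimal sketch of what a CompletableFuture-based Counter client can look like. The class and method names are illustrative only, not the actual Netflix abstraction; the idea is simply that the blocking path is just `join()` on the same future the async path returns.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical Counter client: one future-returning API serving both
// async and blocking callers. In-memory map stands in for the real store.
class CounterClient {
    private final ConcurrentHashMap<String, AtomicLong> store = new ConcurrentHashMap<>();

    // Async modality: callers compose on the returned future.
    CompletableFuture<Long> increment(String key, long delta) {
        return CompletableFuture.supplyAsync(() ->
                store.computeIfAbsent(key, k -> new AtomicLong()).addAndGet(delta));
    }

    // Blocking modality is just join() on the same future.
    long incrementBlocking(String key, long delta) {
        return increment(key, delta).join();
    }
}
```

A real client would of course issue a network call inside the future rather than touch a local map.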
At the same time, EVCache is the cheapest (by about 10x in our experiments) caching solution with global replication [3] and cache warming [4] we are aware of. Every time we've tried alternatives like Redis or managed services they either fail to scale (e.g. cannot leverage flash storage effectively [5]) or cost waaay too much at our scale.
I absolutely agree, though, that EVCache is probably the wrong choice for most folks - most folks aren't doing 100 million operations/second with 4-region full-active replication and applications that expect p50 client-side latency <500us. Similar, I think, to how most folks should probably start with PostgreSQL and not Cassandra.
Throwing out a clarification: EVCache is effectively a complex memcached client plus an internal ecosystem at Netflix. You can get much of its benefit with other systems (such as memcached's built-in proxy: https://docs.memcached.org/features/proxy/).
Other apps plugging in may only need a small slice of EVCache: just the local-then-far fetch, copying sets to multiple zones, etc. A greenfield client against the same backing store would be fairly trivial to write.
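The "copy sets to multiple zones" slice is small enough to sketch: a write simply fans out to one replica per availability zone and completes when all acknowledge. `ZoneClient` here is a hypothetical stand-in for a pooled memcached connection, not a real EVCache class.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-zone client; a map stands in for a memcached replica.
class ZoneClient {
    final Map<String, String> data = new ConcurrentHashMap<>();
    CompletableFuture<Void> set(String key, String value) {
        return CompletableFuture.runAsync(() -> data.put(key, value));
    }
}

class ReplicatingWriter {
    private final List<ZoneClient> zones;
    ReplicatingWriter(List<ZoneClient> zones) { this.zones = zones; }

    // Fan the set out to every zone; complete when all have acknowledged.
    CompletableFuture<Void> set(String key, String value) {
        CompletableFuture<?>[] writes = zones.stream()
                .map(z -> z.set(key, value))
                .toArray(CompletableFuture[]::new);
        return CompletableFuture.allOf(writes);
    }
}
```

A production version also has to decide what a partial failure means (fire-and-forget vs. quorum), which is where the complexity actually lives.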
That all said, I wouldn't advise people to copy their method of expanding cache clusters: it's possible to add or remove one instance at a time without rebuilding and re-warming the whole thing.
Every zone has a copy, and clients always read their local zone's copy first (via pooled memcached connections), falling back exactly once to another zone on a miss. The key is staying in-zone, the memcached protocol, and very fast server latencies. It's been a little while since we measured, but memcached has a time-to-first-byte of around 10us and then scales sublinearly with payload size [1]. Single-zone latency is variable but generally between 150 and 250us round trip; cross-AZ is terrible at up to a millisecond [2].
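The read path described above fits in a few lines: try the local zone, and fall back exactly once to a remote zone on a miss. The maps below are hypothetical stand-ins for pooled memcached connections.

```java
import java.util.Map;
import java.util.Optional;

// Sketch of the zone-local-first read with a single cross-zone fallback.
class ZoneReader {
    private final Map<String, String> localZone;
    private final Map<String, String> remoteZone;

    ZoneReader(Map<String, String> local, Map<String, String> remote) {
        this.localZone = local;
        this.remoteZone = remote;
    }

    Optional<String> get(String key) {
        String v = localZone.get(key);          // fast path: same-AZ replica, ~150-250us
        if (v == null) v = remoteZone.get(key); // one cross-AZ fallback on miss, up to ~1ms
        return Optional.ofNullable(v);
    }
}
```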
So you combine ~200us of network with ~30us of response time and get about 250us average latency. Of course the p99 tail is closer to a millisecond, and you have to do things like hedged requests to fight things like the hard-coded 200ms TCP retransmit timer (an eternity at these latencies)... but that's a whole other can of worms.
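A hedged request is simpler than it sounds: fire the primary request, and if it hasn't completed within a short hedge delay, fire a second copy and take whichever answers first. A minimal sketch (illustrative names, error handling omitted):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Sketch of request hedging: a late second request caps the tail latency
// caused by stalls like a TCP retransmit timeout on the first attempt.
class Hedger {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r);
                t.setDaemon(true); // don't keep the JVM alive for hedge timers
                return t;
            });

    <T> CompletableFuture<T> hedged(Supplier<CompletableFuture<T>> request,
                                    long hedgeDelayMs) {
        CompletableFuture<T> result = new CompletableFuture<>();
        request.get().thenAccept(result::complete);       // primary attempt
        scheduler.schedule(() -> {
            if (!result.isDone()) {                        // primary is slow:
                request.get().thenAccept(result::complete); // race a second copy
            }
        }, hedgeDelayMs, TimeUnit.MILLISECONDS);
        return result;
    }
}
```

Picking the hedge delay is the tuning knob: somewhere around the p95-p99 latency keeps the extra load small while still cutting off the worst stalls.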
* https://github.com/Netflix/EVCache