
It makes me sad that getting these scalability numbers requires some secret sauce on top of Spanner, which nobody else in the k8s community can benefit from. etcd is the main bottleneck in upstream k8s, and there seems to be no real momentum behind building an upstream replacement for etcd/boltdb.

I did poke around a while ago to see what interfaces etcd has for calling into boltdb, but the boundary doesn’t seem super clean right now, so the first step in getting off boltdb would be defining a clean interface that could be implemented by another database.
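To make that concrete, here is a rough sketch of the kind of seam I mean -- not etcd's actual internal API, just the shape a pluggable backend interface could take:

    // Hypothetical sketch only -- not etcd's real backend interface.
    // The idea: the MVCC layer talks to this, and boltdb is just one implementation.
    package backend

    import "context"

    // KVBackend is the minimal surface a storage engine would need to expose.
    type KVBackend interface {
        // Put writes a key/value pair into a named bucket.
        Put(ctx context.Context, bucket, key, value []byte) error
        // Range returns up to limit keys in [start, end) from a bucket.
        Range(ctx context.Context, bucket, start, end []byte, limit int64) (keys, values [][]byte, err error)
        // Delete removes a key from a bucket.
        Delete(ctx context.Context, bucket, key []byte) error
        // Commit flushes pending writes durably (the fsync the Raft log relies on).
        Commit(ctx context.Context) error
        // Size reports on-disk size so quota enforcement stays possible.
        Size() int64
        Close() error
    }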



It's possible I'm talking out of my ass and totally wrong because I'm basing this on principles, not benchmarking, but I'm pretty sure the problem is more etcd itself than boltdb. Specifically, the Raft protocol requires that the cluster leader's log be replicated to a quorum of voting members, who need to write to disk, including a flush, and then respond to the leader, before a write is considered committed. That's floor(n/2) + 1 disk flushes and twice as many network roundtrips to write any value. When your control plane has to span multiple data centers because the electricity cost of the cluster is too large for a single building to handle, it's hard for that not to become a bottleneck. Other limitations include the 8GiB disk limit another comment mentions and etcd's default 1.5 MiB request size limit that prevents you from writing large object collections in a single bundle.
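Back-of-the-envelope version of that cost, with made-up latency numbers, just to show the arithmetic:

    package main

    import "fmt"

    // quorumCost: Raft needs floor(n/2)+1 members (leader included) to
    // persist an entry before it counts as committed.
    func quorumCost(members int, fsyncMs, rttMs float64) {
        quorum := members/2 + 1
        // Followers fsync in parallel, so commit latency is roughly one
        // round trip plus one fsync, but every write still spends
        // `quorum` fsyncs of disk budget across the cluster.
        latency := rttMs + fsyncMs
        fmt.Printf("members=%d quorum=%d fsyncs/write=%d approx latency=%.1fms\n",
            members, quorum, quorum, latency)
    }

    func main() {
        quorumCost(5, 2.0, 0.5) // AZs in the same region: ~0.5ms RTT
        quorumCost(5, 2.0, 8.0) // control plane stretched across distant DCs
    }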

etcd is fine for what it is, but it's a system meant to be reliable and simple to implement. Those are important qualities, but it wasn't built for scale or for speed. Ironically, etcd recommends 5 as the ideal number of cluster members and 7 as a maximum, based on Google's findings from running Chubby, because between-member latency gets too big otherwise. With 5, that means you can't ever store more than 40GiB of data. I have no idea what a typical ratio of cluster nodes to total data is, but that only gives you about 300KiB per node for 130,000 nodes, which doesn't seem like very much.

There are other options. The k3s folks made Kine, which acts as a shim intercepting the etcd API calls made by the apiserver and translating them into calls to some other DBMS. Originally this was to make a really small Kubernetes that used an embedded SQLite as its datastore, but you could do the same thing for any arbitrary backend by just changing one side of the shim.
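The core of the trick is that the apiserver only ever speaks etcd's gRPC KV API, so the shim implements that service and maps it onto SQL. A very rough sketch of the idea (Kine's real schema, revision handling, and Watch support are much more involved):

    // Rough sketch of the Kine idea: serve etcd's KV gRPC API, back it with SQL.
    package kvshim

    import (
        "context"
        "database/sql"

        pb "go.etcd.io/etcd/api/v3/etcdserverpb"
        "go.etcd.io/etcd/api/v3/mvccpb"
    )

    type sqlKV struct {
        db *sql.DB // sqlite, postgres, cockroach... anything behind database/sql
    }

    func (s *sqlKV) Put(ctx context.Context, r *pb.PutRequest) (*pb.PutResponse, error) {
        // Postgres-style placeholders, purely for illustration.
        _, err := s.db.ExecContext(ctx,
            `INSERT INTO kv (name, value) VALUES ($1, $2)
             ON CONFLICT (name) DO UPDATE SET value = $2`, r.Key, r.Value)
        return &pb.PutResponse{Header: &pb.ResponseHeader{}}, err
    }

    func (s *sqlKV) Range(ctx context.Context, r *pb.RangeRequest) (*pb.RangeResponse, error) {
        // Assumes a range query; a real shim also handles single-key gets, limits, revisions.
        rows, err := s.db.QueryContext(ctx,
            `SELECT name, value FROM kv WHERE name >= $1 AND name < $2`, r.Key, r.RangeEnd)
        if err != nil {
            return nil, err
        }
        defer rows.Close()
        resp := &pb.RangeResponse{Header: &pb.ResponseHeader{}}
        for rows.Next() {
            kv := &mvccpb.KeyValue{}
            if err := rows.Scan(&kv.Key, &kv.Value); err != nil {
                return nil, err
            }
            resp.Kvs = append(resp.Kvs, kv)
        }
        resp.Count = int64(len(resp.Kvs))
        return resp, rows.Err()
    }

    // DeleteRange, Txn, and Compact omitted; the real shim implements the full
    // pb.KVServer interface plus Watch, which is where most of the work is.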


I run several clusters at a bit over 10k nodes each, and the etcd DB size is about 30-50GiB depending on how long ago defragmentation was run.

It is kind of sad, as these nodes are doing around 2k IOPS to the disk and are mostly sitting idle at the hardware level, but etcd still regularly chokes.

I did look into kine in the past, but I have no idea if it is suitable for running a high performance data store.

> When your control plane has to span multiple data centers because the electricity cost of the cluster is too large for a single building to handle

The trick is you deploy your k8s clusters in multiple datacenters in the same region (think AZs in AWS terms). The control plane can span multiple AZs, which are in separate buildings but close in geography. In the setups I work on, the latency between datacenters in the same region is only about 500 microseconds.


> It makes me sad that to get these scalability numbers requires some secret sauce on top of spanner, which no body else in the k8s community can benefit from.

I'm not so sure. I mean, everything has tradeoffs, and what you need to do to put together the largest cluster known to man is not necessarily what you want to have to do to put together a mundane cluster.


For those not aware, if you create too many resources you can easily use up all of the 8GB hard-coded maximum size in etcd, which causes a cluster failure. With compaction and maintenance this risk is mitigated somewhat, but it just takes one misbehaving operator or integration (e.g. hundreds of thousands of Dex session resources created for pingdom/crawlers) to mess everything up. Backups of etcd are critical. That Dex example is why I stopped using it for my IdP.
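For what it's worth, the compaction/maintenance side can be scripted against the client API too. A sketch (endpoints are made up; in most clusters the apiserver's --etcd-compaction-interval does the compaction for you, and defragmentation blocks the member while it runs):

    // Sketch: compact etcd history, then defragment members to reclaim space.
    package main

    import (
        "context"
        "log"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    func main() {
        endpoints := []string{"https://10.0.0.1:2379", "https://10.0.0.2:2379", "https://10.0.0.3:2379"}
        cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
        if err != nil {
            log.Fatal(err)
        }
        defer cli.Close()

        ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
        defer cancel()

        // Find the current revision, then compact away older history.
        status, err := cli.Status(ctx, endpoints[0])
        if err != nil {
            log.Fatal(err)
        }
        log.Printf("db size %d bytes at revision %d", status.DbSize, status.Header.Revision)
        if _, err := cli.Compact(ctx, status.Header.Revision); err != nil {
            log.Fatal(err)
        }

        // Defragment one member at a time; it blocks that member's I/O.
        for _, ep := range endpoints {
            if _, err := cli.Defragment(ctx, ep); err != nil {
                log.Printf("defrag %s: %v", ep, err)
                continue
            }
            log.Printf("defragmented %s", ep)
        }
    }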


This is why I’ve always thought Tekton was a strange project. It feels inevitable that if you buy into Tekton CI/CD you will hit issues with etcd scaling due to the sheer number of resources you can wind up with.


What boundaries does this 8GB etcd limit cut across? We've been using Tekton for years now but each pipeline exists in its own namespace and that namespace is deleted after each build. Presumably that kind of wholesale cleanup process keeps the DB size in check, because we've never had a problem with Etcd size...

We have several hundred resources allocated for each build and do hundreds of builds a day. The current cluster has been doing this for a couple of years now.


Yeah, I mean if you’re deleting namespaces after each run then sure, that may solve it. They also have a pruner now that you can enable to set up retention periods for pipeline runs.

There are also some issues with large Results, though I think you have to manually enable that. From their site:

> CAUTION: the larger you make the size, more likely will the CRD reach its max limit enforced by the etcd server leading to bad user experience.

And then if you use Chains you’re opening up a whole other can of worms.

I contracted with a large institution that was moving all of their CI/CD to Tekton, and they hit scaling issues with etcd pretty early in the process and had to get Red Hat to address some of them. If they couldn’t get them addressed by RH, they were going to scrap the whole project.


Yeah, quite unfortunate. But maybe there is hope. Apparently k3s uses Kine, which is an etcd translation layer for relational databases, and there is another project called Netsy which persists into S3 (https://nadrama.com/netsy). Some interesting ideas. Hopefully native Postgres support gets added, since it's so ubiquitous and performant.


It's not hardcoded and you can increase it via flag.
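For anyone looking for it, the flag is --quota-backend-bytes (the default quota is 2GB, and going past the suggested 8GB ceiling only gets you a warning), e.g.:

    etcd --quota-backend-bytes=17179869184   # ~16GB, if you want to live dangerously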

There is a hard-coded warning which says safety is not guaranteed after 8GB. I have tried increasing this after a database had become full, and etcd didn’t start. It’s definitely not a recovery strategy for a full etcd by itself, maybe a way to eke out a little larger margin of safety.

This warning seems to be outdated. We have run etcd at much larger volumes without issues (at least without issues related to its size). Alibaba has been running 100GB etcd clusters for a while now, and probably others have too.

Thank you for the update

It's totally possible to run tens of thousands of QPS on etcd if your disks are NVMe (or if you disable fdatasync, which is not recommended). If you use kine+CockroachDB or TiDB you can go even higher, which I'm guessing is roughly equivalent to their Spanner setup.
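Roughly, that setup is just pointing kine at the database and the apiserver at kine. Something like this, from memory (double-check the flags against kine --help; I believe kine serves the etcd API on 2379 by default):

    # CockroachDB speaks the Postgres wire protocol, so kine treats it as postgres
    kine --endpoint "postgres://root@cockroach-lb:26257/kine?sslmode=disable"

    # then point the apiserver at kine instead of a real etcd
    kube-apiserver --etcd-servers=http://127.0.0.1:2379 ...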

There was a blog post about creating an alternative to etcd for super-high-scale Kubernetes clusters. All the code was open too. It was from someone named Benjamin, I think, but I'm not sure.

I’m not able to find the blog post but maybe someone else can!


This might be what you're thinking of: https://bchess.github.io/k8s-1m/

Yes thank you!

I’ve seen some talk of replacing etcd with FoundationDB, which could yield similar improvements.


