> Our physical hosts have hundreds of services exporting metrics. And many of those exported metrics are from untrusted sources. So we can both rewrite labels and decrease the scrape endpoint discoverability problem by aggregating them in one place.
OK, but Prometheus can do all of this just fine?
> Because it works incredibly well, it's easy to operate, and handles multi-tenancy for us.
Again, Prometheus itself ticks all of these boxes, too, if you're not trying to force it to be something it's not.
We're not forcing Prometheus to be anything, since we're not using it. What Prometheus wants to be is not really a relevant constraint in our design space. A topologically simple, scalable, multi-tenant cluster that presents as just a giant bucket of metrics to our users is what we wanted, and we got it.
There's an interesting discussion to be had about how our infrastructure works; for example, in the abstract, I'd prefer a "pure" pull-based design too. But things appear and disappear on our network a lot, and remote write simplifies a lot of configuration for us, so I don't think it's going anywhere.
I think you're reading a critique of Prometheus that isn't really present in what we're writing. Prometheus is great! Everyone should use it! Our needs are weird, since we're handling metrics as a feature of a PaaS that we're building.
> I think you're reading a critique of Prometheus that isn't really present in what we're writing.
I'm observing that you've used pull-based, horizontally-scaled tools to build a push-based, vertically-scaled telemetry infrastructure. It can be made to work, sure, but the solution is an impedance mismatch to the problem.
I agree with you here. Using Prometheus, federated Prometheus, and Thanos on top of it for good measure, would probably get you better results without using a hodge-podge of non-Prometheus-compatible tools.
So, just so you understand where our heads are at: we want our users to light their apps up with lots of Prometheus metrics. Then we want them to be able to pop up a free-mode Grafana Cloud site, aim it at our API, add a dashboard, start typing a metric name and have it autocomplete to all possible metrics.
That pretty much works now?
I see the ideological purity case you two are making for "true Prometheus", but it is not at all clear to me how doing a purer version of Prometheus would make any of our users happier.
Well, with the requisite glue code that would inform each user's Prometheus instance how to scrape the service instances -- yes, more or less.
> is not at all clear to me how doing a purer version of Prometheus would make any of our users happier.
If the only things you care about when you build systems are "works" and "direct impact on customers" then there's not really a point to this conversation. The things I'm speaking about, the architectural soundness of a distributed system, are largely orthogonal to those metrics, at least to the first derivative.
Oh, sorry, I misunderstood your meaning when you wrote "That pretty much works now?" — I thought it was a question as to whether a more traditional Prom architecture could do it, but I see now you're just saying you already have this set up.
Right. But also: I'm not trying to be dismissive. We both know that we're looking at this through different lenses. I'm genuinely curious how your lens could inform mine; like, is there something I'm missing? Where, by deploying a much more conventional Prometheus architecture, I could somehow make our users happier? I don't see it, but I'm a dummy; if there's something for me to learn, I'm happy to learn it.
I'm pretty confident that the end result would be simpler in an architectural sense (i.e. fewer components), it would be easier to understand and maintain, and it would behave both more predictably and more reliably.
But these are subjective claims! Not everyone thinks the same way!