> you can detect stalled metrics (per host or service), who didn't send the data on time, etc
Anything that _should_ be scraped is tagged a certain way and anything that doesn't respond to a scrape gets flagged. After a few flags, an operator is paged. When $thing is destroyed or re-provisioned, different tags lead to a different set of $things to scrape metrics from.
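To make that flow concrete, here's a rough sketch in Python. It is not how Prometheus implements it; `ScrapeMonitor`, `sync_targets`, `record`, and the threshold of 3 are all made-up names and values for illustration, and "discovery" is just a set of target names handed in by the caller.

```python
from dataclasses import dataclass, field

FLAG_THRESHOLD = 3  # assumed value for "a few flags" before paging


@dataclass
class ScrapeMonitor:
    """Hypothetical tracker of consecutive failed scrapes per discovered target."""

    threshold: int = FLAG_THRESHOLD
    flags: dict[str, int] = field(default_factory=dict)

    def sync_targets(self, discovered: set[str]) -> None:
        # Keep counts only for targets that discovery currently returns.
        # Destroyed or re-tagged targets drop out and stop being expected.
        self.flags = {t: self.flags.get(t, 0) for t in discovered}

    def record(self, target: str, responded: bool) -> bool:
        """Record one scrape result; return True if an operator should be paged."""
        if target not in self.flags:
            return False  # not tagged for scraping, so nothing is expected of it
        # A successful scrape clears the flags; a failed one adds a flag.
        self.flags[target] = 0 if responded else self.flags[target] + 1
        return self.flags[target] >= self.threshold


monitor = ScrapeMonitor()
monitor.sync_targets({"web-1", "web-2"})

# web-1 fails three scrapes in a row: only the third one pages.
pages = [monitor.record("web-1", responded=False) for _ in range(3)]

# web-1 is destroyed, so discovery no longer returns it and it is ignored.
monitor.sync_targets({"web-2"})
ignored = monitor.record("web-1", responded=False)
```

The point is that the list of expected targets comes from discovery on every sync, so nobody has to maintain it by hand.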
I guess the difference here is that we leverage service discovery in Prometheus for this instead of having to externally build an authoritative list of who/what should have pushed metrics.
> <...> and wait for a response.
As opposed to waiting for $thing to push metrics to you?
I guess I'm not convinced that one architecture is obviously better. A particular implementation might have some downsides, but generally both work, and only external constraints will dictate which one you use. E.g., if you're required to ship metrics to multiple places, pushing to Graphite and Datadog becomes easier.