
> Short answer: Prometheus + Grafana + Alertmanager.

Or, a higher-level recommendation appropriate for most SMBs: sign up for Grafana Cloud's managed Prometheus + Grafana (or any equivalent externally managed monitoring stack), then follow their setup instructions to install their grafana-agent package. The agent bundles node_exporter, several other optional exporters you can enable with config stanzas (e.g. redis_exporter, postgres_exporter, etc.), and a log shipper for their Loki logging service, which is to log-based metrics what Prometheus is to regular time-series metrics.

Why use a managed service? Because, unless your IT department is large enough to have its own softball team, the stability/fault-tolerance of the "monitoring and alerting infra" itself is going to be rather low-priority for the company compared to the other things you're being asked to manage; and it's also something you'll rarely need to touch... until it breaks. Which it will.

You really want some other IT department whose whole job is to just make sure your monitoring and alerting stay up, doing this as their product IT focus rather than their operations IT focus.

(You also want your alerting to be running on separate infra from the thing it watches, for the same reason that your status page should be on a separate domain and infra from the system it reports the status of. Having some other company own it is an easy way to achieve this.)

> Regarding the type of alert itself, I send myself mails for the persistence/reminders + Telegram messages for the instant notifications.

Again, higher-level rec appropriate for SMBs: sign up for PagerDuty, and configure it as the alert "Notification Channel" in Grafana Cloud. If you're an "ops team of one", their free plan will work fine for you.

Why is this better than Telegram messages? Because the PagerDuty app does "critical alerts" — i.e. its notifications pierce your phone's silent/do-not-disturb settings (and you can configure them to be really shrill and annoying.) You don't want people to be able to call you at 2AM — but you do want to be woken up if all your servers are on fire.
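(As an aside: if you ever want to fire a page from your own scripts instead of going through the Grafana Cloud integration, PagerDuty's Events API v2 is a single POST. A minimal Python sketch, with a placeholder routing key; the managed integration does the equivalent of this for you:)

    import json
    import urllib.request

    ROUTING_KEY = "YOUR_EVENTS_API_V2_ROUTING_KEY"  # placeholder from your PagerDuty service

    def trigger_pagerduty(summary, source, severity="critical"):
        """Open (or dedup into) a PagerDuty incident via the Events API v2."""
        event = {
            "routing_key": ROUTING_KEY,
            "event_action": "trigger",
            "dedup_key": f"{source}:{summary}",  # repeat sends update the same incident
            "payload": {"summary": summary, "source": source, "severity": severity},
        }
        req = urllib.request.Request(
            "https://events.pagerduty.com/v2/enqueue",
            data=json.dumps(event).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            print(resp.status, resp.read().decode())

    if __name__ == "__main__":
        trigger_pagerduty("All the servers are on fire", "monitoring-host")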

---

Also: if you're on a cloud provider like AWS/GCP/etc, it can be tempting to rely on their home-grown metrics + logging + alerting systems. Which works, right up until you grow enough to want to move to a hybrid "elastic load via cloud instances; base load via dedicated hardware leasing" architecture. At which point you suddenly have instances your "home cloud" refuses to allow you to install its monitoring agent on. Better to avoid this problem from the start, if you can sense you'll ever be going that way. (But if you know your systems aren't "scaling forever" and you'll stay in the cloud, the home-grown cloud monitoring + alerting systems are fine for what they do.)



So about 3 years ago we had a bunch of on-prem servers shutting down around March/April. We had even more servers that weren't shutting down, so we had to "move fast" before they all had issues.

I must have spent about a week trying to learn just enough about prometheus and grafana (I had used grafana before with influx but for a different purpose) so that we could monitor temperature, memory, cpu, and disk (the bare minimum).

The goal was to have a single dashboard showing these critical metrics for all servers (< 100), and be able to receive email or sms alerts when things turned red.

No luck. After a week I had nothing to show for it.

So I turned to Netdata. A one-liner on each server and we had a super sexy and fast dashboard for each server. No bird's-eye view, but fine. I then spent maybe 3-4 days trying to figure out how to get alerting to work (just email, but fine) and get temperature readings (or something like that).

No luck. By the end of week 2 I still had nothing to show, just a bunch of servers shutting down during peak hours.

Week 3 I said fuck it, I'll do the stupidest thing and write my own stack. A bunch of shell scripts, deployed via ansible and managed by systemd, capturing any metric I could think of and posting to a $5/month server running a single nodejs service that would do in-memory (only) averages, medians, etc., and trigger alerts (email, SMS, Slack maybe soon) when things got yellow or red.
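For anyone curious what that kind of service looks like, here's a rough sketch of the shape. It's Python purely for illustration (the service described above is nodejs), and every name, metric, and threshold is made up: agents POST one JSON sample at a time, the service keeps a rolling window per (host, metric) in memory, and fires an alert hook when the rolling average crosses a threshold.

    from collections import defaultdict, deque
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from statistics import mean
    import json

    WINDOW = 60                      # keep the last 60 samples per (host, metric)
    THRESHOLDS = {"cpu_temp_c": 80}  # example threshold; tune per metric

    samples = defaultdict(lambda: deque(maxlen=WINDOW))
    alerted = set()                  # crude de-dup so one breach doesn't spam alerts

    def send_alert(host, metric, avg):
        # Placeholder: wire this up to email / SMS / Slack as needed.
        print(f"ALERT {host} {metric} rolling avg={avg:.1f}")

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Agents (cron/systemd shell scripts) POST one sample at a time:
            # {"host": "web-01", "metric": "cpu_temp_c", "value": 83.5}
            body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
            m = json.loads(body)
            key = (m["host"], m["metric"])
            samples[key].append(float(m["value"]))

            avg = mean(samples[key])
            limit = THRESHOLDS.get(m["metric"])
            if limit is not None:
                if avg > limit and key not in alerted:
                    alerted.add(key)
                    send_alert(m["host"], m["metric"], avg)
                elif avg <= limit:
                    alerted.discard(key)  # recovered; re-arm the alert

            self.send_response(204)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()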

By week 4 we had monitoring for all servers and for any metric we really needed.

Super cheap, super stable and absolutely no maintenance required. Sure, we probably can't monitor hundreds of servers or thousands of metrics, but we don't need to.

I really wanted to use something else, but I just couldn't :(


> So I turned to Netdata. A one-liner on each server and we had a super sexy and fast dashboard for each server. No bird's-eye view, but fine. I then spent maybe 3-4 days trying to figure out how to get alerting to work (just email, but fine) and get temperature readings (or something like that).

I work at Netdata on ML. Just wanted to mention that, as of the last release, a parent node will show all its children in the agent dashboard, so if you were doing this again today a Netdata parent might have got you the bird's-eye view as a starting point: https://github.com/netdata/netdata/releases/tag/v1.41.0#v141...

(Of course we also have Netdata Cloud, which would probably have worked too, but maybe it was not as built out 3 years ago as it is now - but I don't want to go into sales mode and get blasted :) )


Hey! I subscribe to your github releases and was reading about all that the other day (the parent/child node stuff).

When/If I have the time I'll dig into Netdata some more as I like your approach. :)

I'm not a devops/sre/systems guy, I just do it because I have to, so it's a bit difficult for me to find the time to experiment with these tools.


Cool! - we're always looking for feedback, so feel free to hop into our Discord, forum, or GH discussions (links here: https://www.netdata.cloud/community/) to leave feedback or ask questions if you run into any issues.

(cheers for the mention here too - always nice to try to get some feedback and discussion going on HN as it's so candid :0 )


So... why were the servers shutting down, and what metric did your own system capture that the others didn't, which let you determine that?


Well, at first I was able to gather and correlate enough cpu, temperature, and entrypoint data for the apparently problematic servers.

The servers were shutting down due to high temperatures caused by persistent high cpu usage.

Knowing that, I installed Datadog with APM on just a couple of the servers (because $$), which led me to Postgres issues (indexing), WeasyPrint PDF generation issues (a Python lib), and some bad Django code (converting a queryset to a list before pagination).
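(For anyone wondering what that last one looks like, here's a sketch of the pattern with hypothetical names, not the actual code: materialising the queryset before slicing forces every row into memory on each request, while slicing the lazy queryset lets the database do LIMIT/OFFSET.)

    # Illustrative only; assumes a normal Django QuerySet.

    def page_badly(queryset, offset, page_size):
        # list() evaluates the entire queryset: every row is fetched and
        # held in Python memory on every request, then sliced.
        rows = list(queryset)
        return rows[offset:offset + page_size]

    def page_properly(queryset, offset, page_size):
        # Slicing an unevaluated QuerySet makes Django emit LIMIT/OFFSET
        # in the SQL, so only the requested page is ever fetched.
        return queryset[offset:offset + page_size]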


If you have a one-off server running nodejs, you've definitely got maintenance.


Why's that?

I think the only time I ssh'd to that server was last week when I added USB device monitoring and had to docker pull && docker up -d.

Other than that... Can't remember dealing with the "monitoring stack".


Alternative viewpoint.

Observability is hella expensive. Orgs should consider TCO when making such decisions. Paying a few hundred thousand more for the skills to self-run could literally chop tens of millions off vendor bills.


But then you aren't taking into account the server and storage costs of self-managed monitoring.

Unless it's Datadog. That's expensive.


Not in the post, but I think there are still some pretty large savings.

Pretty much anything SaaS-based is ridiculously priced. If you can swing self-hosted (managed, but in your account) there's potential for a discussion, but with many products it's the actual integration work that's the real work.

Don't get me wrong, there are specific "always going to be small" setups where it likely makes sense.


I'm just saying that SaaS provides a lot more than just the cost of having an engineer or two.

It may not be cost-effective, but if you think that hiring two people will be all you spend when you move everything on-prem, you'll be in for a bit of a shock.


I don't think we're saying different things, but I think you misunderstood my wider point because it was constructed as a hot take.

The main thing for me, as a prior exec, is that HR / the org will control your people cost, but they tend to be significantly more flexible over compute / vendor costs. I'm not saying you can add a non-forecasted 20% to your headline spend, folks would get upset at that, but if you decide to consolidate all your services into 5 beefy VMs as opposed to 100 smaller ones, nobody cares.

Do that with people though, and folks tend to lose their shit pretty quickly. The problem is this has several outcomes:

- You're getting people with less experience in the "been there, done that" category, which means the work takes longer.

- Since they've not yet experienced the pros & cons of decisions, it's likely they'll make some decisions that won't pan out.

- They'll leave once they've realised they've fucked up and they now have the "been there, done that" badge, so they can take that experience to a market that values their skills.

- The result is you end up hiring 2+ folks to do one person's job.

- Since you're falling foul of Brooks's law and unable to execute, you work with vendors.

- They charge astronomical figures; but since they're not a person, the politics of envy don't apply, thus the org may begrudgingly accept it.

- You then need more "cheap" resources to do / maintain the integration work.

The problem being your TCO goes through the roof because you're not hiring quality.

Now, going back to your point: besides economies of scale, the SaaS provider is actually deriving pretty stellar profit margins for a wrapper of people and compute. I would argue those economies of scale quickly dissipate when you're also funding sales, marketing, legal, founders, executives & investor concerns, and further when you're funding your own internal procurement, legal, and SMT to sign off the contracts.

That said, couple of additional points:

- I didn't mention on-prem. My early career was spent developing an IaaS provider (circa 2007). Folks spend a lot of unnecessary money doing on-prem, and it's a fairly large undertaking for a small dev team with a lack of hardware experience. Most folks should start in cloud unless they are strong on-prem already.

- I didn't mean all SaaS, the focus was on observability. Though anything you need a large number of seats for, and that has an SSO tax, should be scrutinised.

:)



