
> Short answer: Prometheus + Grafana + Alertmanager.

Or, a higher-level recommendation appropriate for most SMBs: sign up for Grafana Cloud's managed Prometheus + Grafana (or any equivalent externally managed monitoring stack), then follow their setup instructions to install their grafana-agent package. The agent bundles node_exporter, several other optional exporters you can enable with config stanzas (e.g. redis_exporter, postgres_exporter, etc.), and a log shipper for their Loki logging service, which is to log-based metrics what Prometheus is to regular time-series metrics.

Why use a managed service? Because, unless your IT department is large enough to have its own softball team, the stability/fault-tolerance of the "monitoring and alerting infra" itself is going to be rather low-priority for the company compared to the other things you're being asked to manage; and it's also something you'll rarely need to touch... until it breaks. Which it will.

You really want some other IT department whose whole job is to just make sure your monitoring and alerting stay up, doing this as their product IT focus rather than their operations IT focus.

(You also want your alerting to be running on separate infra from the thing it watches, for the same reason that your status page should be on a separate domain and infra from the system it reports the status of. Having some other company own it is an easy way to achieve this.)

> Regarding the type of alert itself, I send myself mails for the persistence/reminders + Telegram messages for the instant notifications.

Again, higher-level rec appropriate for SMBs: sign up for PagerDuty, and configure it as the alert "Notification Channel" in Grafana Cloud. If you're an "ops team of one", their free plan will work fine for you.

Why is this better than Telegram messages? Because the PagerDuty app does "critical alerts" — i.e. its notifications pierce your phone's silent/do-not-disturb settings (and you can configure them to be really shrill and annoying.) You don't want people to be able to call you at 2AM — but you do want to be woken up if all your servers are on fire.
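(As an aside: if you ever want to fire a page from your own scripts instead of going through the Grafana Cloud integration, PagerDuty's Events API v2 is a single POST. A minimal Python sketch, with a placeholder routing key; the managed integration does the equivalent of this for you:)

    import json
    import urllib.request

    ROUTING_KEY = "YOUR_EVENTS_API_V2_ROUTING_KEY"  # placeholder from your PagerDuty service

    def trigger_pagerduty(summary, source, severity="critical"):
        """Open (or dedup into) a PagerDuty incident via the Events API v2."""
        event = {
            "routing_key": ROUTING_KEY,
            "event_action": "trigger",
            "dedup_key": f"{source}:{summary}",  # repeat sends update the same incident
            "payload": {"summary": summary, "source": source, "severity": severity},
        }
        req = urllib.request.Request(
            "https://events.pagerduty.com/v2/enqueue",
            data=json.dumps(event).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            print(resp.status, resp.read().decode())

    if __name__ == "__main__":
        trigger_pagerduty("All the servers are on fire", "monitoring-host")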

---

Also: if you're on a cloud provider like AWS/GCP/etc, it can be tempting to rely on their home-grown metrics + logging + alerting systems. Which works, right up until you grow enough to want to move to a hybrid "elastic load via cloud instances; base load via dedicated hardware leasing" architecture. At which point you suddenly have instances your "home cloud" refuses to allow you to install its monitoring agent on. Better to avoid this problem from the start, if you can sense you'll ever be going that way. (But if you know your systems aren't "scaling forever" and you'll stay in the cloud, the home-grown cloud monitoring + alerting systems are fine for what they do.)



So about 3 years ago we had a bunch of on-prem servers shutting down around March/April. We had even more servers that weren't shutting down, so we had to "move fast" before they all had issues.

I must have spent about a week trying to learn just enough about prometheus and grafana (I had used grafana before with influx but for a different purpose) so that we could monitor temperature, memory, cpu, and disk (the bare minimum).

The goal was to have a single dashboard showing these critical metrics for all servers (< 100), and be able to receive email or sms alerts when things turned red.

No luck. After a week I had nothing to show for it.

So I turned to Netdata. A one-liner on each server and we had a super sexy and fast dashboard for each server. No bird's-eye view, but fine. I then spent maybe 3-4 days trying to figure out how to get alerting to work (just email, but fine) and get temperature readings (or something like that).

No luck. By the end of week 2 I still had nothing to show, just a bunch of servers shutting down during peak hours.

Week 3 I said fuck it, I'll do the stupidest thing and write my own stack. A bunch of shell scripts, deployed via ansible and managed by systemd, capturing any metric I could think of and posting to a $5/month server running a single nodejs service that would do in-memory (only) averages, medians, etc., and trigger alerts (email, SMS, Slack maybe soon) when things got yellow or red.
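For anyone curious what that kind of service looks like, here's a rough sketch of the shape. It's Python purely for illustration (the service described above is nodejs), and every name, metric, and threshold is made up: agents POST one JSON sample at a time, the service keeps a rolling window per (host, metric) in memory, and fires an alert hook when the rolling average crosses a threshold.

    from collections import defaultdict, deque
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from statistics import mean
    import json

    WINDOW = 60                      # keep the last 60 samples per (host, metric)
    THRESHOLDS = {"cpu_temp_c": 80}  # example threshold; tune per metric

    samples = defaultdict(lambda: deque(maxlen=WINDOW))
    alerted = set()                  # crude de-dup so one breach doesn't spam alerts

    def send_alert(host, metric, avg):
        # Placeholder: wire this up to email / SMS / Slack as needed.
        print(f"ALERT {host} {metric} rolling avg={avg:.1f}")

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Agents (cron/systemd shell scripts) POST one sample at a time:
            # {"host": "web-01", "metric": "cpu_temp_c", "value": 83.5}
            body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
            m = json.loads(body)
            key = (m["host"], m["metric"])
            samples[key].append(float(m["value"]))

            avg = mean(samples[key])
            limit = THRESHOLDS.get(m["metric"])
            if limit is not None:
                if avg > limit and key not in alerted:
                    alerted.add(key)
                    send_alert(m["host"], m["metric"], avg)
                elif avg <= limit:
                    alerted.discard(key)  # recovered; re-arm the alert

            self.send_response(204)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()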

By week 4 we had monitoring for all servers and for any metric we really needed.

Super cheap, super stable and absolutely no maintenance required. Sure, we probably can't monitor hundreds of servers or thousands of metrics, but we don't need to.

I really wanted to use something else, but I just couldn't :(


> So I turned to Netdata. A one-liner on each server and we had a super sexy and fast dashboard for each server. No bird's-eye view, but fine. I then spent maybe 3-4 days trying to figure out how to get alerting to work (just email, but fine) and get temperature readings (or something like that).

I work at Netdata on ML. Just wanted to mention that, as of the last release, a parent node will show all its children in the agent dashboard, so if you were doing this again today a Netdata parent might have got you the bird's-eye view as a starting point: https://github.com/netdata/netdata/releases/tag/v1.41.0#v141...

(Of course we also have Netdata Cloud, which would probably have worked too, but maybe it was not as built out 3 years ago as it is now - but I don't want to go into sales mode and get blasted :) )


Hey! I subscribe to your github releases and was reading about all that the other day (the parent/child node stuff).

When/If I have the time I'll dig into Netdata some more as I like your approach. :)

I'm not a devops/sre/systems guy, I just do it because I have to, so it's a bit difficult for me to find the time to experiment with these tools.


Cool! - we're always looking for feedback, so feel free to hop into our Discord, forum, or GH discussions (links here: https://www.netdata.cloud/community/) to leave feedback or ask questions if you run into any issues.

(cheers for the mention here too - always nice to try to get some feedback and discussion going on HN as it's so candid :0 )


So... why were the servers shutting down, and what metric did your own system capture that the others didn't, which let you determine that?


Well, at first I was able to gather and correlate enough cpu, temperature, and entrypoint data for the apparently problematic servers.

The servers were shutting down due to high temperatures caused by persistent high cpu usage.

Knowing that, I installed Datadog with APM on just a couple of the servers (because $$), which led me to Postgres issues (indexing), WeasyPrint PDF generation issues (a Python lib), and some bad Django code (converting a queryset to a list before pagination).
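(For anyone wondering what that last one looks like, here's a sketch of the pattern with hypothetical names, not the actual code: materialising the queryset before slicing forces every row into memory on each request, while slicing the lazy queryset lets the database do LIMIT/OFFSET.)

    # Illustrative only; assumes a normal Django QuerySet.

    def page_badly(queryset, offset, page_size):
        # list() evaluates the entire queryset: every row is fetched and
        # held in Python memory on every request, then sliced.
        rows = list(queryset)
        return rows[offset:offset + page_size]

    def page_properly(queryset, offset, page_size):
        # Slicing an unevaluated QuerySet makes Django emit LIMIT/OFFSET
        # in the SQL, so only the requested page is ever fetched.
        return queryset[offset:offset + page_size]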


If you have a one-off server running nodejs, you've definitely got maintenance.


Why's that?

I think the only time I ssh'd to that server was last week when I added USB device monitoring and had to docker pull && docker up -d.

Other than that... Can't remember dealing with the "monitoring stack".


Alternative viewpoint.

Observability is hella expensive. Orgs should consider TCO when making such decisions. Paying a few hundred thousand more for the skills to self-run could literally chop tens of millions off vendor bills.


But then you aren't taking into account the server and storage costs of self-managed monitoring.

Unless it's Datadog. That's expensive.


Not in the post, but I think there are still some pretty large savings.

Pretty much anything SaaS-based is ridiculously priced. If you can swing self-hosted (managed, but in your account) there's potential for a discussion, but with many products it's the actual integration work that's the real work.

Don't get me wrong, there are specific "always going to be small" setups where it likely makes sense.


I'm just saying that SaaS provides a lot more than just the cost of having an engineer or two.

It may not be cost-effective, but if you think that hiring two people will be all you spend when you move everything on-prem, you'll be in for a bit of a shock.


I don't think we're saying different things, but I think you misunderstood my wider point because it was constructed as a hot take.

The main thing for me, as a prior exec, is that HR / the org will control your people cost, but they tend to be significantly more flexible over compute / vendor costs. I'm not saying you can add a non-forecasted 20% to your headline spend, folks would get upset at that, but if you decide to consolidate all your services into 5 beefy VMs as opposed to 100 smaller ones, nobody cares.

Do that with people though, and folks tend to lose their shit pretty quickly. The problem is this has several outcomes:

- You're getting people with less experience in the "been there, done that" category, which means the work takes longer.

- Since they've not yet experienced the pros & cons of decisions, it's likely they'll make some decisions that won't pan out.

- They'll leave once they've realised they've fucked up and they now have the "been there, done that" badge, so they can take that experience to a market that values their skills.

- The result is you end up hiring 2+ folks to do one person's job.

- Since you're falling foul of Brooks's law and unable to execute, you work with vendors.

- They charge astronomical figures; but since they're not a person, the politics of envy don't apply, thus the org may begrudgingly accept it.

- You then need more "cheap" resources to do / maintain the integration work.

The problem being your TCO goes through the roof because you're not hiring quality.

Now, going back to your point: besides economies of scale, the SaaS provider is actually deriving pretty stellar profit margins for a wrapper of people and compute. I would argue those economies of scale quickly dissipate when you're also funding sales, marketing, legal, founders, executives & investor concerns, and further when you're funding your own internal procurement, legal, and SMT to sign off the contracts.

That said, couple of additional points:

- I didn't mention on-prem. My early career was spent developing an IaaS provider (circa 2007). Folks spend a lot of unnecessary money doing on-prem, and it's a fairly large undertaking for a small dev team with a lack of hardware experience. Most folks should start in cloud unless they are strong on-prem already.

- I didn't mean all SaaS, the focus was on observability. Though anything you need a large number of seats for, and that has an SSO tax, should be scrutinised.

:)



