
Yes, sorry! We're investigating, but my current theory is we got overloaded because I relaxed some of our anti-crawler protections a few days ago.

(The reason I did that is that the anti-crawler protections also unfortunately hit some legit users, and we don't want to block legit users. However, it seems that I turned the knobs down too far.)

In this case, though, we had a secondary failure: PagerDuty woke me up at 5:24am, I checked HN and it seemed fine, so I told PagerDuty the problem was resolved. But the problem wasn't resolved - at that point I was just sleeping through it.

I'll add more as we find out more, but it probably won't be till later this afternoon PST.

Edit: later than I expected, but for those still following, the main things I've learned are (1) pkill wasn't able to kill SBCL this time - we have a script that does that when HN stops responding, but it didn't work, so we'll revise the script; and (2) how to get PagerDuty not to let you go back to sleep if your site is actually still down.
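For anyone curious what a revised kill script could look like, here is a minimal sketch. It is not HN's actual script: the thread does not say which signal the existing script sends or why pkill failed, so this only illustrates the common pattern of asking politely with SIGTERM and escalating to SIGKILL after a grace period. `stop_process`, `STUBBORN_CHILD`, and `demo` are hypothetical names, and the stubborn child is a stand-in Python process rather than SBCL. POSIX only.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for an unresponsive server: it ignores SIGTERM and sleeps.
STUBBORN_CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def stop_process(proc, grace_seconds=5.0):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns "terminated" if SIGTERM was enough, "killed" if SIGKILL was needed.
    """
    proc.terminate()  # SIGTERM: a request the process may catch or ignore
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL: cannot be caught or ignored
        proc.wait()   # reap the child so it does not linger as a zombie
        return "killed"


def demo():
    """Start the stubborn child and stop it, returning (outcome, exit code)."""
    child = subprocess.Popen(
        [sys.executable, "-c", STUBBORN_CHILD], stdout=subprocess.PIPE
    )
    child.stdout.readline()  # block until the child has installed its handler
    outcome = stop_process(child, grace_seconds=0.5)
    child.stdout.close()
    return outcome, child.returncode


if __name__ == "__main__":
    print(demo())
```

One caveat: even SIGKILL only takes effect once the kernel can deliver it, so a process stuck in uninterruptible I/O can outlive this too.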





Crazy that Dang literally manages HN in his sleep!

We all knew that but I haven't seen any confirmation before this.


I like hacker news but I don't think this site is worth getting paged over lol

You might be underestimating HN's popularity.

> You might be underestimating HN's popularity.

I think you're confusing popularity with criticality. I'm sure everyone in here can withstand a few hours without browsing the page.


If you like the thing you're managing, then its health is critical for you, not your users.

It's dang's baby at this point, and this is a good thing, as long as HN doesn't affect his life in ways he doesn't want.


> If you like the thing you're managing, then its health is critical for you, not your users.

Get a grip and go touch some grass. Even FANGs understand the concept of business hours SEV.


Aw, please don't cross into personal attack. You can make your substantive points without that.

https://news.ycombinator.com/newsguidelines.html

Edit: it looks like you've been breaking the site guidelines quite a bit, unfortunately. Could you please not do that? We end up banning accounts that keep doing it and I don't want to ban you.


I have a pretty firm grip on life and touch plenty of grass both literally and figuratively.

However, when something I care about crashes and burns once in a blue moon, I make sure to put the fire out, at least to make it survive till regular hours. Things I care about can be both business and personal, and nobody bugs me for them.

Maybe we shouldn't make any assumptions about people we don't personally know, while we are at it.


> However, when something I care about crashes and burns once in a blue moon, I make sure to put the fire out, at least to make it survive till regular hours.

You are free to choose what you do with your personal life.

Meanwhile, it is pretty obvious that it's pointless to demand or expect personal sacrifice to maintain unrealistic levels of high-availability in services that are far from critical. I mean, do you honestly believe that these messages you and I are writing are so important to get out that someone must sacrifice their personal time to ensure it is served to the world in this very instant instead of, say, 3 or 6 or 13 hours? Absurd.


It looks like I failed to convey what I've tried to say in the first comment. Let me reiterate one more time.

    - I believe dang sees HN as his baby, so *voluntarily* monitors it as a critical infrastructure *for him*.
    - I personally like this kind of commitment from people who like their job, however *I don't expect or demand it in any way*.
    - I also hope that attention doesn't affect his life. *Especially negatively and/or in a crippling way*.
I don't care whether this site is down for 6 seconds or 6 hours. I just wanted to commend him for liking what he's doing this much. I demand nothing from any service provider I use. Let it be a small, one person operation or dang or Amazon/Google.

I also keep servers up in my daily job, and some are more important than others, but none of them requires me to wake up at 5AM to solve a problem (by design). So I don't demand from others something I won't do myself.

As long as nobody is dying, nobody should stop, drop, and work on something else regardless of time, date and location.


failing to manage HN in my sleep is more like it

Your sleep is more important than our work distraction.

I was curiously productive this morning..

I took a shower for the first time in one week

looks like a redditor snuck into HN

Bet that felt great.

came back just in time for me to spend the first hour of my work

Which is fine! I don't mind if it's down for a few hours. It reminds me that it's just a place to stop by for a bit before moving on. Like a digital coffee shop that sometimes has a leaky pipe and isn't open right at 7am.

I hope it doesn't change (much).


You're still a miracle worker. Single-handedly managing a well-known fully user-contributed site not just technically but moderation in contentious times like these and still keeping it working well and encouraging a positive user community can't be an easy task.

Thanks, I'll take it! except for the single-handedly part - gotta share the love with https://news.ycombinator.com/posts?id=tomhow.

No worries, please take care of your sleep and thanks for all your hard work

We all have our moments, and I personally consider HN to be “best effort”, almost like a volunteer project. I’m not certain I’m correct, but that’s the optics I have, so my expectations are adjusted to that.

So don’t beat yourself up please.

When I worked for “SaaS unicorn” we typically had multiple levels of escalation, and acknowledging would have done nothing because the alarm would continue firing until fixed. Not sure what’s changed in 15 years of ops; I had assumed it would be better now. I can’t imagine silencing an alert totally by acknowledging it if it’s still occurring.

I’m totally fine with how you handled it, if anything I am thankful. But that seems to be a system I would improve if I had the time.

“mute” is different than “resolve” to me, and both should exist. (Where mute is an acknowledgement of an issue as ongoing.)
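To make the mute/resolve distinction concrete, here is a minimal sketch of the behaviour described above. It is not PagerDuty's API or data model; `Alert`, `is_healthy`, and `snooze_seconds` are hypothetical names. Acknowledging only snoozes paging, and resolving is refused while the health probe still fails.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Alert:
    is_healthy: Callable[[], bool]  # hypothetical probe, e.g. an HTTP check
    snooze_seconds: float = 600.0
    state: str = "firing"           # "firing", "acknowledged", or "resolved"
    snoozed_until: float = 0.0

    def acknowledge(self, now: float) -> None:
        """Mute paging for a while; the incident stays open."""
        if self.state != "resolved":
            self.state = "acknowledged"
            self.snoozed_until = now + self.snooze_seconds

    def resolve(self) -> bool:
        """Close the incident only if the probe agrees the site is healthy."""
        if self.is_healthy():
            self.state = "resolved"
            return True
        return False

    def should_page(self, now: float) -> bool:
        """Page unless the alert is resolved or still inside its snooze window."""
        if self.state == "resolved":
            return False
        if self.state == "acknowledged" and now < self.snoozed_until:
            return False
        self.state = "firing"  # snooze expired and nobody resolved it: re-alert
        return True
```

With this shape, someone who acknowledges and goes back to sleep gets paged again once the snooze runs out, and "resolved" cannot be set while the probe still reports the site as down.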


Yeah we don't exactly pay to be on HN, not much to complain about. I appreciate everyone who works on HN.

We pay with content and with the fact that we attract the talent that eventually ends up powering ycombinator investment rounds.

It’s ad-supported. Any post with comments disabled is definitely an ad. Probably a lot of the others are, too.

Your comment makes me realize that I consume HN differently than many others, because I've never seen a post with comments disabled and I've been around here for at least ten years. It's not that I think they don't exist; they obviously do, because you're mentioning them. I've just never encountered one, primarily because I don't casually browse HN, ever. I subscribe to a pushbullet channel that notifies me when a post hits 500 upvotes. That's it. The list of submissions on the home page (even on reddit) is just overwhelming to me so I use the pushbullet channel as a sort of community curated "best of" or "trending" signal.

Not to say that I don't procrastinate or waste time doing other nonsense. I can definitely spend a lot of time reading HN comments, as I'm doing right now.

Anyway, anyone who finds themselves with a problem with HN should try that out :)
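For readers who would rather roll their own version of that threshold filter, here is a rough sketch using the public HN Firebase API (the `topstories` and `item/<id>` endpoints). How the pushbullet channel itself works is not described here, so this is an assumption about one possible approach; `fetch_json`, `newly_popular`, and `poll_once` are hypothetical names, and delivering the notification is left out.

```python
import json
from urllib.request import urlopen

API = "https://hacker-news.firebaseio.com/v0"


def fetch_json(path):
    """Fetch one resource from the HN API, e.g. 'topstories' or 'item/8863'."""
    with urlopen(f"{API}/{path}.json", timeout=10) as resp:
        return json.load(resp)


def newly_popular(items, threshold, already_seen):
    """Return items at or above the score threshold that were not reported before.

    already_seen is a set of item ids and is updated in place.
    """
    hits = []
    for item in items:
        if item is None:  # the API returns null for missing items
            continue
        if item.get("score", 0) >= threshold and item["id"] not in already_seen:
            already_seen.add(item["id"])
            hits.append(item)
    return hits


def poll_once(threshold=500, limit=30, already_seen=None):
    """Check the current top stories once and return the new high scorers."""
    seen = set() if already_seen is None else already_seen
    ids = fetch_json("topstories")[:limit]
    items = [fetch_json(f"item/{item_id}") for item_id in ids]
    return newly_popular(items, threshold, seen)


if __name__ == "__main__":
    for story in poll_once():
        print(story["score"], story.get("title", ""))
```

Running `poll_once` on a timer with a persistent `already_seen` set gives each story at most one notification.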


> Anyway, anyone who finds themselves with a problem with HN should try that out :)

To be clear, I wasn’t complaining. Just pointing it out. Aside from any more speculative benefit to YC for running the site, the site does run outright ads.


Sorry, I didn't mean to imply you had a problem with ads. By "problem" I meant "if you find yourself procrastinating a lot" (not you specifically, but the reader in general)

Apologies for the misunderstanding


Oh, no issue here, no apology needed at all. I didn’t take it as any kind of dig, or even necessarily aimed at me (and if it was, still not offensive in any way). I’d almost edited my original comment right after posting because it occurred to me the terseness might allow it to be interpreted as a complaint, and was just using your post as a jumping-off point to finally clarify that, is all.

I did miss exactly what you meant by “problem” in that passage, but get it now, so thanks for that.


I assumed the main purpose was to show off the ycombinator batches when they launch.

Actually, I'm doing my best alienating these kind of people :p

Good for you!

This. If it were a business-critical money fountain, I'd expect follow-the-sun SRE coverage. I don't think it is, so I can probably accept drinking my morning coffee without scrolling HN once in a while. There's only so much one can beat oneself up about a slow/incorrect response when the on-call is handled by what, just one person? maybe two people in the same time zone?

(Might be wise though to have PagerDuty configured to re-alert if the outage persists.)


And that is a good thing. Sleep tight!

I was starting to think you never slept, I remember that one time I emailed you at 1am. :)

Time to train an AI agent on your moderation activity and get some well deserved sleep!

We're working on it! well, some of it.

I'm pretty happy with how it's developing—the trendline is promising—but not ready to rely on it in prod yet.


Do you have nightmares of failing to manage HN when you sleep too?

I appreciate what you do. Hope you got some rest when it was all over.

You deserve a lot of rest!

Yeah, I mean how dare you?! I pay good money for high uptime SLAs! :)

I was today years old when I found out Dan sleeps.

I was today years old when I found out that dang's first name is Dan

You'll never guess what letter dang's last name starts with.

A as in Ang, clearly.

No, he's Asian. The n is doing double-duty. His last name is Ng :p

That's exactly what a French person with the last name of Les would say!

What if Dang made an AI agent of himself for when he sleeps?

By demonstration he didn't.

Hey dang, don't worry. It's just a site for reading articles and reacting to them.

Enjoy your deserved sleep and if for a couple of hours it's down, so be it.

Thanks for your continued service!


100%

Though I will say, HN is a pretty great source of information about major outages like the recent AWS and Cloudflare issues. I had a moment this morning where I thought, oh, is there a larger issue and then, oh, HN is down, huh, the next option is so far down my list that it's going to take me a moment to think of it.

I hope that serves as a testament to how great this site and the community is. Thanks for all your hard work keeping it that way!


> huh, the next option is so far down my list that it's going to take me a moment to think of it.

Option 4: take your grab bag with the tcp over IP shortwave radio, sextant and head for pre-cached month supply of food in the hills.


Maybe it would be fine if ops alerts were silenced during normal US sleeping hours?

HN is important, but unlikely much harm could be done before morning.

(Source: Lost a lot of sleep at one place, enough to realize that sleep interruption and deficit has significant costs.)


The first time Hacker News didn't load, I was personally worried that there was some major outage affecting the whole world. I didn't expect Hacker News itself to go down, so I assumed something even more catastrophic than an AWS outage must have happened (since this is where we see the major cloud outage posts).

https://downforeveryoneorjustme.com/hacker-news

That site showed many reports: the last count I saw was 52 reports in a short time frame, and the peak seems to have been 118.

> In this case, though, we had a secondary failure: PagerDuty woke me up at 5:24am, I checked HN and it seemed fine, so I told PagerDuty the problem was resolved. But the problem wasn't resolved - at that point I was just sleeping through it.

It's okay, I suppose. Have you figured out who is crawling Hacker News so much, though? Was it a DDoS attack or an AI company trying to get data? Hacker News supports an API, and I'm sure there are datasets for it too, so it's interesting that they would crawl at all, though we all know the reasons, as they have been discussed here.


No apology needed. We all needed to stop procrastinating anyways :)

During the last week my IP was banned for unknown reason. Glad to hear it might not be a problem from my side.

Yes, sorry! This is the problem - we don't want to block legit users, but if we loosen the bolts, we get flooded.

If you browse HN while logged in, that should immunize you against this happening. Also, if it does happen again, you can unban your IP as described at https://news.ycombinator.com/newsfaq.html. But you have to do that from a different IP address, of course.

If those things don't work, email hn@ycombinator.com and we'll get it sorted.


Thanks. It is so easy to change the IP using mobile that the unbanning is little hassle.

I’d love to know more about what running a site like HN involves, would be great to get a write up of what it’s like running something like this at this scale (and what kind of traffic you guys get)!

I can’t put my finger on anything within the last decade, but I seem to recall it running in something close to its current form on a single core on a single server for a long time:

https://news.ycombinator.com/item?id=5229522

Re: traffic, dang said (2022):

https://news.ycombinator.com/item?id=33454140

I took it as a good reminder that the hard part is the human part: that high-overhead features and UI fripperies are nice but not necessary (or sufficient) to keep a community healthy and vibrant over the decades.

(And on the subject of the human side, if you didn’t catch Anna Wiener’s 2019 profile, it’s here:

https://www.newyorker.com/news/letter-from-silicon-valley/th... )


From dang's 2022 comment about traffic:

The most interesting number is the 1300 submissions because that hasn't grown since 2011 - it just fluctuates. Everything else has been growing more or less linearly for a long time, which is how we like it.

I find that surprising, as 2011-2022 covers an exponential rise in SEO spam and "growth hackers" attempting to drive traffic and links.

Or was 1,300 the number of non-flagged submissions?


Nope, total submissions. And it's still very much within that same window!

The other reality is that as much as this industry is up its ass about scalability, you can run a very, very busy site on a single machine nowadays.

A lot of people out here designing their blogs like it's 1989.


This is completely wrong, everyone knows you should rewrite everything in microservices immediately :-D

The transparency is deeply appreciated by me and others. We don't pay to keep HN on, so we cannot complain. Thank you and the rest of the team for all you do to give us a corner of the internet that is quite 'different' from the rest of the wild west that is the web.

> The reason I did that is that the anti-crawler protections also unfortunately hit some legit users, and we don't want to block legit users.

it is a shame that it needs to be this way. as a lurker who doesn't stay logged in nor use incognito mode, i have seen "Sorry" page way too often, even when opening the "past" page from the homepage.

truly hope you find a solution that reduces friction for all. personally, it is back to "Sorry" situation for now.

PS: for others facing a similar situation, it all disappears after logging in, which has been the most reliable solution thus far.


Yes, and I'm sorry. We do our best but it's both a hard problem and a moving target.

In a situation like this one, good crisis leadership is essential. dang, HN will help you with tips from vast collected experience (please chip in):

1. Blame: The first thing to do is to point the finger. That doesn't mean analysing the technical issue, which can delay this step and limit your options, but figuring out who is politically easiest to blame. Often, that's the new guy, but outside contractors and vendors without good connections are also a common solution. Even if you are technically responsible for hiring them, you can always push them under the bus with a little skill. This small sacrifice helps unify, focus, and motivate the rest of the team.

2. Emotion: Inject your emotion into the situation and make that the implicit, but indisputable priority. Particularly, outrage and anger - This is completely _____. These people are utterly _____ (I'd use all caps, but that's not allowed on HN). Make sure everyone's attention is over their shoulder, on your emotion, and infect the team with it. Threats are an effective tool here - this is a crisis, and anyone who is calm is not emotionally engaged. Otherwise, they won't care enough about this problem - without you driving them, they probably wouldn't care much at all. Anyway, you don't have time for niceties like empathy or even basic respect.

3. Speed: Responsiveness to stakeholders is very important. People need answers now. Give them answers they want to hear, outcomes they will be comfortable with. Don't worry if different groups hear different things. Your team will find a way to make it all work - that's their job.

4. Communication: Good communication is essential. Make sure you clearly tell your team what they should be doing; repeat it several times to prevent misunderstanding. Especially people with experience can have minds of their own; keep them on track. The situation is a crisis so you can't take any risks; stay on top of them and everything they do, and give input if you're not certain they are doing exactly what you would be doing.

5. Victimhood: Find a way to turn the tables: Make it about you, and how you're the victim here, and feed the fire with more outrage. With this and outrage, nobody will undermine the team by challenging your ideas or authority, which is the most essential component of a successful outcome. Remember, without you this all falls apart.

Have I missed anything?


Engagement: make sure that every member of the team is either on the incident bridge or has dropped what they are doing to watch you diagnose the problem. The more eyes on the problem, the more awareness of the pain will be absorbed by all. If members need to leave to get food or put children to bed, tell them to order delivery and to ask their spouse to do their job. Demonstrate human touch by allowing them to turn off camera while they are eating.

Comprehensiveness: propose extreme, sweeping solutions, such as a lights-out restart of all services, shutting down all incoming requests, and restoring everything to yesterday's backup. This demonstrates that you are ready to address the problem in a maximally comprehensive way. If someone suggests a config change rollback, or a roll-forward patch, ask them why they are gambling company time with localized changes, and why they are willing to gamble company time on technical analysis.

Root Cause Analysis Meeting: spend the entire meeting time rehashing the events, pointing fingers and assigning blame. Be sure to mention how the incident could've been over sooner if you just restarted and rolled back every single thing. Be sure to demonstrate out-of-the-box thinking by discussing unrealistic grandiose solutions. When the time is up, run the meeting over by 30 minutes and force all to stay while realistic solution ideas are finally discussed in overtime. This makes it clear to the team that nothing is more important than this incident's RCA--their time surely is not. If someone asks to tap out to pick their kids up after school, remind them that they are making enough money to call them an Uber.

Alerting: be sure to identify anything remotely resembling leading indicators, and add Critical-level wake-you-up alerts with sensitive thresholds for those indicators. Database exceeding 50% CPU? Critical! Filesystem queue length exceeding 5? Critical! Heap usage over 50%? Critical! 100 errors in one minute on a 100000 requests per minute service? Critical! Single log line indicating DNS resolution failure anywhere in the system? Critical! (What if AWS's DNS is down again?) Service request rate 10% higher than typical peak? Critical! If anyone objects to such critical alerts, ask them why they want to be responsible for not preventing the next incident.


Frankly, I don't understand why someone would even try to crawl Hacker News.

There is an official dump which doesn't even require parsing HTML at all: https://console.cloud.google.com/marketplace/details/y-combi...


These are not, er, experienced crawlers.

https://www.youtube.com/watch?v=Sbpl3ywNlpA#t=56s


Short lived and driven by good intentions– all's good. Thanks again for keeping this thing going!

Even after providing a Firebase endpoint, crawlers still come to the site?

Most crawlers have no concept of what that is. They will follow links to this site and then follow links out of this site even after being told not to [1]. The majority of crawlers follow zero rules, RFCs, etc... The few platforms that do follow standards and rules are akin to a law-abiding citizen in Mos Eisley.

[1] - rel="nofollow"


Oh my god. It's the crawlopalypse.

Yes. It's hard to explain the experience of hosting a website since 2023.

A crazy amount of really dumb bots loading every url on the website in a full headless browser, with default Chrome user-agent string. All different ip addresses, various countries and ASNs.

These crawlers are completely automated and simply crawl _everything_ and don't care at all if there's value in what they're crawling or if there's duplicate content, etc.

There's no attempt at efficiency, just blindly crawl the entire internet 24/7. Every page load (1 per second or more?) is from a different ip address.


Unfortunately, the Firebase API is very bad, as they even acknowledge on their GitHub page.

> anti-crawler protections

Sometimes I could not open the comment section, receiving a blank page with "... We're sorry" or something along these lines when opening from new private window. It works when opening normally.

Logging in on the private window seems to resolve the issue. Can you take a look at this if possible?


Best to email your IP address to hn@ycombinator.com so we can see if it's blocked.

Can't speak for others, but I'm sure I'll be pretty fine if no one gets woken up if HN is down...

Of course, they'd better restore service after they wake up naturally, because I need my HN dose. But it's not worth losing sleep over it.


> the anti-crawler protections also unfortunately hit some legit users, and we don't want to block legit users

Was the blocking returning “Sorry.” instead of any page content? A couple of days ago there were a few hours when I could load the HN main page as a non-logged-in user. But if I tried to log in I would get “Sorry.” instead. I also got the sorry message if I tried to click on user profiles of other people and a few other pages.

I am assuming that the reason I could see the front page itself and discussions on posts on the front page is that they were in a shared cache for non-logged in users, but that when I clicked on some pages like some random user pages those were not in cache and hit the origin server and it blocked those with “Sorry.” like it did for log-in attempts.

I also tried to go to the unblock IP page, but that one also returned “Sorry.”

For a while I was scratching my head wondering if I had gotten some malware on one of my computers that was aggressively making requests to HN, and that I had become IP banned because of that. Since I think my actual request rate from browsing and commenting should be pretty average. I read HN a lot, but not that much :p

Later in the day, or the next day, things were back to normal and I could log in again. Presumably after those anti-crawler protections had been relaxed again.


> Was the blocking returning “Sorry.”

> Presumably after those anti-crawler protections had been relaxed again

Yup and yup. Apologies for the inconvenience! If it happens again you're welcome to email us at hn@ycombinator.com with your IP and we'll unblock it for you.


I didn't realize you were carrying the pager too! Kudos!

I feel such a sense of kinship for anyone who carries a pager, almost 7 years at my current role doing it. Super cool that dang is among our number :)

Yep, have been on constant "pager duty" for 2+ years, although I have more help now and I get paged 1-3 times a week instead of per night. Still, carry my lappy everywhere I go. Bought an ARM Windows laptop to get that 20hr battery life so I could worry less during my travels. You know, fancy things like going to get food or going grocery shopping.

Rough shift, my worst was every other week and my boss prior to hiring me was 24/7 just like you. I just carry a backpack with a few batteries + my work laptop, fortunately only a few really bad stories but hooooo boy me and that backpack have seen some fun times.

Do you carry a literal pager? We use the PagerDuty app.

My organization is, for now, using OpsGenie.

My pager noise: https://www.soundjay.com/transportation/sounds/train-crossin...

That will not only wake the dead, it'll wake me no matter how asleep I am.


Haha I made the mistake of using the default iPhone ringtone, now when strangers get called in public my heart rate spikes. Too scared to change it.

The "for now" is very important because it will be sunset in a little over a year. I can recommend Incident.io or Rootly as alternatives.

It may interest you to know that pagers are still a thing, Motorola still makes them, and I know that one major use case is volunteer fire departments

I used to work on Motorola Minitor 5 pagers. Looks like they recently released their newest model, the Minitor 7

I wonder if pagers are still used in hospitals? I imagine so


There's a company in England called "Cascode" who make firefighter alerters. These are really basic "beeper" pagers, which you can program to have a bunch of different tones and LED patterns based on the RIC and Subcode.

I look after several thousand of these across several hundred paging sites.

They're relatively inexpensive (70 quid or so in quantity) and they last about six weeks on a commonly-available AA battery. The batteries go flat enough to trigger the "low battery" beep at about 3am, for some reason. I don't know why.

There's no messaging involved, although the encoders are capable of sending a text string. The message is "get up and get down to the fire station right now", which generally needs no further explanation. POCSAG is unencrypted, so there would be privacy concerns with sending actual incident information in the clear with it.

While we're on the subject of old tech, until BT finally cut the last of them off, we used dialup modems to control the encoders (not dialup internet, just a hundreds-of-miles serial cable) as a backup, and dot-matrix printers to print out a hardcopy message for the crews to pick up.

All very low-tech. All very fixable. All stays working if you don't mess with it.

https://cascode.co.uk/products/2ar2-and-2ar3/


Encryption is easily doable even with one way pagers. With one way you will lose the perfect forward secrecy option but that's usually ok.

It's doable but it would be custom firmware and it's not really necessary. Two way paging isn't really worth doing because then you need a massive device with a massive battery, or something that uses uncontrolled mobile phone networks (and generally still has a massive battery, that lasts about a day).

You wouldn't even need particularly good encryption, you'd just need something adequate to stop casual eavesdropping really - "keep them busy for half an hour" would stop people from sniffing the POCSAG traffic and tweeting it, so that people show up at incidents and hang around filming it on their phones.

This incidentally is what a guy in England got arrested for a few years ago, exactly that. It's perfectly legal to listen to and decode pager messages (or any other radio messages), you're just not allowed to pass them on to people or act upon them, and posting them on twitter and then going round to rubberneck at the ongoing incident very much ticks those boxes. As with so many things in the UK, to paraphrase Aleister Crowley, "Don't Be A Dick shall be the whole of the law".


Doctors on call at hospitals also routinely still use pagers. There was a planet money episode on it a couple years ago: https://www.npr.org/2023/12/08/1197955913/doctors-pagers-bee...

Do doctors in the Middle East also carry pagers?

The AUBMC hospital is definitely using them as well as the paramilitary in that country, at least until recently.

Now, whenever I see a pager, I think of explosions. Haha.

Oh no, I just always hear it termed that way and it captures the “feeling” for me since it feels like a dedicated device. I just carry a work phone w/ PagerDuty during my shift.

I wish I could still buy a pager where I live :'(

Just out of curiosity, if HN is still running on one physical system, what does a daily or weekly traffic chart look like for the switch port facing it?

> The reason I did that is that the anti-crawler protections also unfortunately hit some legit users

How does this happen?


> How does this happen?

Not the person you are asking. Bot operators have an incentive to make crawlers look as much like a human as possible so they do not get blocked. Some of them fail miserably and some nearly succeed. That makes it trivial to accidentally block a real person. I am personally fine with that given I do not pay for this site and have no SLA or contract with it.


some humans also try their best to make themselves look like bots...

You're absolutely right!

beep boop.


Last week, if you were using a VPN plus a browser that limits fingerprinting, you were likely to see error messages when accessing HN.

Every filter process has false positives and false negatives, especially when crawlers are trying to fake their status.
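To illustrate how a false positive can arise, here is a minimal sketch of a per-IP sliding-window rate limiter. HN's actual protections are not described in this thread, so this is purely an assumed example of one common technique; `SlidingWindowLimiter` and its parameters are hypothetical, and the IP addresses are documentation-range placeholders.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per IP within any window_seconds span."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip, now):
        recent = self.hits[ip]
        # Forget requests that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False  # over the limit: this request is rejected
        recent.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=5, window_seconds=60)
    vpn_exit = "203.0.113.7"  # several unrelated readers share this address
    for second in range(6):
        print(second, limiter.allow(vpn_exit, now=second))
```

The limiter only sees the address, so six slow human readers behind one VPN exit look the same as one fast crawler, and the sixth request is refused. Loosening `max_requests` fixes that false positive at the cost of letting more crawler traffic through, which is the trade-off discussed above.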

> anti-crawler protections

what type of protections are used on HN? rate-limiting? ip range blacklist?


Looking forward to the post mortem. :)

dang

In my defense, I was commenting at 0 min, since then he made several updates explaining the situation.

Yes sorry! Normally I put in "[editing - bear with me...]" or some such.

I was just trolling, thanks for ur work

dang - just to say, we've all done it...

Decades ago I had to write a Perl script to auth to the site for proper downtime checking. Some things never change :) Good luck with the triage.

dang!


