
This post would be better with more concrete examples of their infrastructure. I read the whole post and still don't know how they survived, beyond picking up some general knowledge about distributed system design.


They had some good general points though, like fast retries. Which brings me to one of the worst Human Factors mistakes I can think of right now...

The new rent-a-bike scheme in London has POS terminals connected to the central system via bits of string and/or cellular modems. Every now and again these links fall over or the central system becomes unresponsive.

If you are attempting to get a bike (with an active card subscription) you drop your card into the terminal and it prints you a release code that lets you take a bike.

Unless the system is down... in which case it still reads your card, and then sits there and shows you a spinner for 5 minutes.

You can't walk away during this time, because if you do and the link comes back up it'll print a release code which anyone can use to take a £300+ bike on your account.

If you do stick around and try again? That'll be another 5 minutes which you could have spent walking to the next bike dispensary.

I think timeouts are one of those things you can only tune really well once you use the system in a live environment and see how it actually behaves. In this case a higher transaction failure rate would be vastly better than a five-minute timeout; on other systems, not so much.
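To illustrate the fail-fast idea: here is a minimal sketch (purely hypothetical, not the actual terminal's code) of how a terminal could give up quickly and tell the user to try the next station, instead of queueing the request and printing a release code minutes later. The host, port, and protocol are invented for illustration.

```python
import socket

def request_release_code(host, port, card_id, timeout_s=5.0):
    """Return a release code from the central system, or None on timeout.

    Hypothetical sketch: a short timeout turns an unresponsive link into
    an immediate "system down, try the next dispensary" message, rather
    than a five-minute spinner that may still print a code after the
    customer has walked away.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.settimeout(timeout_s)
            sock.sendall(card_id.encode() + b"\n")
            reply = sock.recv(64)
            return reply.decode().strip() or None
    except (socket.timeout, OSError):
        # Fail fast: report the outage now; never complete the
        # transaction later, when nobody is standing at the terminal.
        return None
```

The key design choice is that a timeout is treated as a hard failure, so a delayed link recovery can't release a bike on an absent customer's account.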



