Hacker News

We tried a similar chaos tool at our company, built in-house. We simulated most of the scenarios mentioned here using SSM and other scripts. At first everyone was interested, but after some time the interest faded. Our problem was a lack of visualization across the app ecosystem, i.e. how a batch of EC2 instances suddenly spiking on CPU would affect the rest of the ecosystem, and what the impact on the end user would be.

Turns out people care only if there is an end user impact and don't really care about random anomalies.

And building the capabilities required to measure the impact and automate the workflow of the actual chaos tests is a lot of work.
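
To make "measuring the impact" a bit more concrete, here is a rough sketch of the kind of check I mean: compare end user metrics from a baseline window against the window when the chaos test ran. Everything in it (the `Window` type, `user_impact`, the thresholds) is made up for illustration and is not part of AWSSSMChaosRunner or our in-house tool. It also assumes you already collect request latencies and error counts somewhere.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import List


@dataclass
class Window:
    """End user metrics for one time window (hypothetical shape)."""
    latencies_ms: List[float]  # per-request latencies, assumed positive
    errors: int                # failed requests in the window
    requests: int              # total requests in the window, assumed > 0


def p99(samples: List[float]) -> float:
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    return quantiles(samples, n=100, method="inclusive")[98]


def user_impact(baseline: Window, chaos: Window,
                max_latency_ratio: float = 1.5,
                max_error_delta: float = 0.01) -> dict:
    """Flag end user impact if p99 latency or error rate degrades too much.

    The thresholds are arbitrary placeholders, not recommendations.
    """
    latency_ratio = p99(chaos.latencies_ms) / p99(baseline.latencies_ms)
    error_delta = (chaos.errors / chaos.requests
                   - baseline.errors / baseline.requests)
    return {
        "latency_ratio": latency_ratio,
        "error_delta": error_delta,
        "impacted": (latency_ratio > max_latency_ratio
                     or error_delta > max_error_delta),
    }
```

Even a check this small needs the metrics pipeline behind it, which is where most of the work went for us.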



Stress testing a whole app ecosystem end-to-end and preventing/mitigating end user impact is generally part of "gamedays" - https://wa.aws.amazon.com/wat.concept.gameday.en.html

A library like AWSSSMChaosRunner would be a core component of building a gameday-like capability. But building a full gameday framework is out of scope for this discussion.



