Wednesday, November 12, 2014

Good post on testing server failures

tl;dr: make your system "pressure tolerant" by using system tools to simulate failures.
If there's one thing to know about distributed systems, it's that they have to be designed with the expectation of failure. It's also safe to say that most software these days is, in some form, distributed, whether it's a database, mobile app, or enterprise SaaS. If you have two different processes talking to each other, you have a distributed system, and it doesn't matter if those processes are local or intergalactically displaced.
Marc Hedlund recently had a great post on Stripe's game-day exercises, where they block off an afternoon, take a blunt instrument to their servers, and see what happens. We're talking about abruptly killing instances here: kill -9, ec2-terminate-instances, yanking on the damn power cord, that sort of thing. Everyone should be doing this type of stuff. You really don't know how your system behaves until you see it under failure conditions.
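
To make the kill -9 idea concrete, here is a minimal sketch of a one-process "game day" in Python. It is not Stripe's tooling, just an illustration. The worker command and the names start_worker, kill_abruptly, and game_day are made up for this example, and it assumes a POSIX system, since SIGKILL doesn't exist on Windows.

```python
import signal
import subprocess
import sys
import time

# Hypothetical stand-in for a real service: a child process that just sleeps.
WORKER_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]


def start_worker():
    """Launch the stand-in worker as a separate OS process."""
    return subprocess.Popen(WORKER_CMD)


def kill_abruptly(proc):
    """Send SIGKILL (what `kill -9` does) and return the worker's exit code."""
    proc.send_signal(signal.SIGKILL)
    return proc.wait(timeout=5)


def game_day():
    """Kill a worker without warning, then check that a replacement comes up."""
    worker = start_worker()
    time.sleep(0.2)  # give the worker a moment to start
    exit_code = kill_abruptly(worker)

    # A real supervisor would do this restart; here we do it by hand.
    replacement = start_worker()
    recovered = replacement.poll() is None  # None means it is still running

    # Clean up so the sketch leaves no stray process behind.
    replacement.kill()
    replacement.wait(timeout=5)
    return exit_code, recovered


if __name__ == "__main__":
    print(game_day())
```

On POSIX, subprocess reports a child killed by signal N with return code -N, so the first value tells you the worker really died from SIGKILL and not from a clean shutdown. The interesting part in a real system is everything this sketch skips: what the worker's clients saw while it was gone, and what state it left behind.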