The chaos monkey, a mythical creature. I had actually never heard of it until I read Jeff Atwood’s post (which is excellent and you should read it). Then I realized that I’ve worked with and fought against chaos monkeys my entire career. Tough love sometimes, but the monkey always pushed the software to become more robust each and every time it struck.
So what is a Chaos Monkey?
Before I continue, we should perhaps clarify what a chaos monkey is. I think that it all started 1983 during the development of MacPaint and MacPrint (the full story can be found here). What basically happened was that Steve Capps created a helper application that fed events to the other applications under development. The events represented things such as the keyboard being pressed and mouse clicks. While the application was running it kind of looked like a really frantic monkey was slapping the keyboard and playing with the mouse. Hence, it was named “The Monkey”.
The monkey was then used to test the applications and make sure that it did not crash whatever you threw at it. Much like when you sit down next to user that tries out your application for the first time and you know… he/she “uses it wrong” so that it provokes a crash. You could argue that it’s their own fault, but deep inside, you know that you probably should have handled those characters in the number field a little better. (Authors note: Neither I nor Jayway as a company compare end users to monkeys, this was just an example…)
Now, fast forward a couple of years and we have Netflix that started move onto Amazon Web Services (you can read more about their experience here). One of their tasks was to build a robust system that should be able to handle any unexpected problems, Rambo style (yes, they actually called it their “Rambo Architecture”). Unexpected problems typically includes servers going down or processes getting killed.
In order to build this resilient system, Netflix created a service that had as its sole task to randomly attack and kill other services and processes. The service was called the Chaos Monkey. Since the Netflix servers was now under constant attack it forced them to build robust and resilient services that could cope with the unleashed monkey. To prove the worth of the Chaos Monkey, Jeff mentions that
AWS had a huge multi-day outage last week, which took several major websites down, along with a constellation of smaller sites.
Notably absent from that list of affected AWS sites? Netflix.
I guess the monkey did its job, eh?
How does it work?
I think that a good metaphor for the monkey is a vaccine. A vaccine is defined as
Any preparation used as a preventive inoculation to confer immunity against a specific disease, usually employing an innocuous form of the disease agent, as killed or weakened bacteria or viruses, to stimulate antibody production.
Well that kind of sounds the same as our monkey, doesn’t it? We introduce a disease into our system in order to prepare it for the real thing. The disease agent in our case is the monkey and the antibodies is our own code written to cope with unstable system. Our role as developers suddenly become the same as our own bodies immune system. We find and protect all the weak spots in our system and when we are faced with the real deal we are prepared. Neat.
My very own monkey
I’m currently working with a project where we, some time ago, involuntary faced our own monkey. Our system is connected with several services and different hardware, such as barcode scanners, printers and card readers. All of which has a tendency to go offline or jam when you least expect it. Especially the printer, a device which reliability has generally not improved since Windows 95, gave us problems in the beginning.
It would stop working when you least expected it and it would certainly not inform anyone that it was now useless. Since we couldn’t figure out the root cause right away we started to implement a robust system that would verify that the printer was operational at, and before, critical moments and then inform the user in a friendly way (instead of just crashing or refusing to print). When we finally found and fixed the root cause of the problem we were well prepared for failures that we could not control, such as paper jam.
If the printer wouldn’t have failed so frequently in the beginning we never would have spent the effort to inoculate our system against it. Now, since the printer disease is gone we have an immune system that can cope with any unsuspected printer failures.
Important to note is that we, compared to the Netflix team, didn’t implement our own chaos monkey, we got one without asking for it. When we wrote the code to make our system more resilient, if anyone would have asked us if we where glad for the faulty printer we would most definitively say no. But now we’re reaping the harvest from our hard work and it would not be possible without the help of our very own little chaos monkey.