On February 18th, TechCrunch confirmed that its leading blog site was completely down for almost two hours because the site's host, WordPress.com, had been crippled by an unscheduled routing change to a core router.
TechCrunch is one of millions of blogs hosted on WordPress.com, and an estimated 10.2 million blogs went down because of the change, eliminating over 5.5 million pageviews.
Matt Mullenweg, WordPress founder, called it their worst outage in 4 years.
In the postmortem analysis, Mullunweg stated the initial diagnosis was, "an unscheduled change to a core router by one of our datacenter providers messed up our network in a way we haven't experienced before, and broke the site." He also added that even with the investments and mechanisms to prevent a total failure, this event tripped those up as well and the entire site crashed for 110 minutes.
Mullenweg concluded with the statement, "I hope it will be much longer than four years before we face a problem like this again."
The last statement caught my eye because of the word hope. When dealing with aspects such as a network infrastructure, hope should only be considered with things you cannot control such as an act of nature.
But an unscheduled change is something that can be addressed, monitored, and managed, so a major problem like this does not occur in the future. With the proper control, visibility, automation, and intelligence, the vast majority issues like this one can be completely eliminated. And for those that aren't eliminated, there is a huge difference if the problem is identified immediately and solved. In this example, think how much less severe the issue would be if it was caught and solved in a matter of minutes, not hours.
Organizations have invested huge sums in redundancy and back up plans to reduce the risk of catastrophic events, but often, they rely on manual processes and hope for the best when dealing with day to day management of the network infrastructure.
In situations like this, I think the old adage, "an ounce of prevention is worth more than a pound of cure," is never more relevant. Netcordia was founded on the premise of helping organizations reduce chaos and take control of network aspects such as configuration, change, and compliance.
While I also hope that outages like this never occur for organizations across the world, I think it's more important that IT organizations realize they don't need to rely on luck for managing their network infrastructure, and that with proper configuration and change management, they can reduce the risk of catastrophic events like this outage occurring again.