Every successful IT organization has built-in redundancy and disaster recovery plans, but many overlook how these safeguards can mask problems for a period of time. Then something else fails and, BAM, you have a bigger problem or a complete outage.
Wikipedia just reported a global outage that had a huge impact. When servers in its European data center overheated, the standard quick failover procedure rerouted traffic to its Florida cluster. However, shortly after the failover, it was determined that the failover mechanism itself was broken, causing Wikimedia sites to stop working globally. Even though the problem was found and resolved quickly, there are still performance delays hours after the event.
This outage brought back memories of the horror stories I've heard from organizations where redundancy created a false sense of security and made things worse when a failure finally hit. It reminded me of a story a friend told me the other day.
In his organization, an access-layer 2950 switch had dual fiber uplinks to two Catalyst 6500 switches, and one of his techs disturbed the wrong fiber patch. Because the uplinks were redundant, there was no service impact, so the broken link went unnoticed. Months later, a problem hit the other fiber patch, and this time the result was catastrophic because both uplinks were now out of commission. The team was completely blindsided; they thought a strong redundancy and disaster recovery plan had them covered. Once the outage happened, they spent a lot of time trying to figure out how it occurred and how to avoid it in the future.
The story reminded me of what happens when we get overconfident: we get complacent. Instead of looking for potential issues lurking on the network or its devices that could cause a problem hours, days, weeks or months later, we tend to assume that everything will work fine or that our failover plans will save us. As networks get more complex and more critical, I suspect these issues will continue until another major outage occurs or someone decides to get ahead of the curve and find a solution that looks for these potential issues before they cause a bigger problem later.
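To make "looking for potential issues" a little more concrete, here is a minimal Python sketch of the kind of check that might have caught my friend's dead uplink months earlier. It is only an illustration under my own assumptions: the `Uplink` and `Device` classes, the function names, and the sample inventory are all hypothetical, and in practice the link states would come from whatever monitoring you already run. The idea is simply to flag any device that still works but has lost part of its redundancy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Uplink:
    name: str      # hypothetical label for the link
    is_up: bool    # link state as reported by your monitoring


@dataclass
class Device:
    name: str
    uplinks: List[Uplink]


def redundancy_status(device: Device) -> str:
    """Classify a device by how many of its uplinks are still up."""
    active = sum(1 for link in device.uplinks if link.is_up)
    if active == 0:
        return "outage"      # nothing left: users are already affected
    if active < len(device.uplinks):
        return "degraded"    # still serving traffic, but one failure from an outage
    return "healthy"


def silent_failures(devices: List[Device]) -> List[str]:
    """Return names of devices that work today but have lost redundancy."""
    return [d.name for d in devices if redundancy_status(d) == "degraded"]


# Made-up inventory: one access switch is quietly running on a single uplink.
inventory = [
    Device("access-sw-1", [Uplink("fiber-a", True), Uplink("fiber-b", True)]),
    Device("access-sw-2", [Uplink("fiber-a", False), Uplink("fiber-b", True)]),
]

for name in silent_failures(inventory):
    print(f"WARNING: {name} is running without redundancy")
```

The point of the "degraded" state is that it raises an alert even though nobody is complaining yet, which is exactly the window in which the first fiber problem sat unnoticed.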