
Matt's Blog


How Redundancy Hides Issues and Can Cause Bigger Problems Later

Every successful IT organization has built-in redundancy and disaster recovery plans, but many overlook how these can mask problems for a period of time. Then something else fails and - BAM - there is a bigger problem or a complete outage.

Wikipedia just reported a global outage that had a huge impact. When servers in its European data center overheated, the standard quick failover procedure re-routed traffic to its Florida cluster. However, shortly after the failover, it was determined that the failover mechanism itself was broken, causing Wikimedia sites to stop working globally. Even though the problem was found and resolved quickly, there were still performance delays hours after the event.

This outage brought back memories of the horror stories I've heard from organizations where redundancy created a false sense of security and compounded the damage when a problem arose. It reminded me of a story a friend told me the other day.

His organization had an access-layer 2950 switch with dual fiber uplinks to two Catalyst 6500 switches, and one of his techs disturbed the wrong fiber patch. Because the uplinks were redundant, there was no service impact, so the fault went unnoticed. Months later, the other fiber patch developed a problem, and the result was catastrophic because both uplinks were now out of commission. They were completely blindsided because they thought a strong redundancy or disaster recovery plan kept them safe. After the outage, they spent a lot of time trying to figure out how it had occurred and how to avoid it in the future.

The story reminded me of what happens when we get overconfident: we get complacent. Instead of looking for potential issues lurking on the network or its devices that could cause a problem hours, days, weeks or months later, we tend to assume everything will work fine or that our failover plans will save us. As networks get more complex and more critical, I suspect these issues will continue until another major outage occurs or someone decides to get ahead of the curve and find a solution that looks for these potential issues before they cause a bigger problem later.
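
As a rough illustration of the kind of check that could have flagged my friend's dead uplink months earlier, here is a minimal Python sketch. It assumes you already collect each uplink's operational status by whatever polling method you use (for example the IF-MIB ifOperStatus value, where 1 means up). The Uplink record, the redundancy_state function and the "ok" / "degraded" / "outage" labels are hypothetical names made up for this example, not part of any product.

```python
from collections import defaultdict
from dataclasses import dataclass

# IF-MIB ifOperStatus value meaning "up"; any other value is treated as not usable.
IF_OPER_UP = 1


@dataclass(frozen=True)
class Uplink:
    """One redundant uplink as reported by your poller (hypothetical record)."""
    device: str
    port: str
    oper_status: int


def redundancy_state(uplinks):
    """Return {device: (state, down_ports)} for a list of Uplink records.

    state is "ok" when every uplink is up, "degraded" when at least one
    uplink is down but traffic can still flow, and "outage" when none is up.
    """
    links_by_device = defaultdict(list)
    for link in uplinks:
        links_by_device[link.device].append(link)

    report = {}
    for device, links in links_by_device.items():
        down = sorted(l.port for l in links if l.oper_status != IF_OPER_UP)
        if not down:
            state = "ok"
        elif len(down) < len(links):
            # The dangerous case: users see no impact, but redundancy is gone.
            state = "degraded"
        else:
            state = "outage"
        report[device] = (state, down)
    return report
```

The case worth alerting on is "degraded": service still works, so nobody complains, but the next failure takes the switch offline. Feeding this function the polled status of both uplinks on every access switch and raising an alert on anything other than "ok" is one simple way to keep a broken backup path from sitting unnoticed for months.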

Comments

 

Roland said:

I definitely agree with you, Matt. Many customers with redundant links but no monitoring tools don't really know whether the secondary link is up or down, and when the primary breaks, the network stops. Link redundancy hides the first link failure, but without any feedback and recovery it only delays the real downtime.

My advice is to always use a network monitoring tool and check that all the alternate paths are up and ready to take over in case of a primary path failure. A simple SNMP check of the status of the backup port is often good enough.

March 24, 2010 4:26 PM


About mgowarty

Matt Gowarty leads product marketing for Netcordia, positioning NetMRI in the Network Configuration and Change Management (NCCM) space. Matt has over 12 years of IT experience with a focus on network and application management, telecommunications and performance management. Prior to joining Netcordia, Matt worked with leading companies including Visual Networks, Verizon, GTE and Fluke Networks. Over the past decade, Matt has been a thought leader in the performance management space as a frequent speaker and contributor for tradeshows, seminars, webinars and whitepapers, with topics including MPLS management, VoIP, Managing the Impact of Change and Application Performance Management. Matt has his MBA from Penn State University and his BSBA from Robert Morris College.