Applied Infrastructure

Be Prepared: Handling Potential Network Failures

I was at VoiceCon two weeks ago, participating in a panel where I talked about network resiliency and presenting my VoIP Troubleshooting and Monitoring tutorial.  Both presentations included examples of how you should be prepared for network failures.  I'm a proponent of understanding the causes of network problems and being able to quickly diagnose failures by looking at the symptoms that they cause.

Let's say that you want to be prepared to identify and react to a spanning tree loop.  First, you need to be able to quickly identify that a forwarding loop has formed.  Your NMS should show a CPU spike on switches in the STP domain in which the loop exists, due to processing the BPDUs that are circulating.  Ports that are forwarding looping traffic will report high utilization.  A list of typical symptoms is given in the Cisco document "Troubleshooting STP on Catalyst Switches Running Cisco IOS System Software", Document ID: 28943.  Unidirectional links and similar problems are described in "Spanning Tree Protocol Problems and Related Design Considerations", Document ID: 10556.
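
To make the two symptoms concrete (a CPU spike plus ports reporting high utilization), here is a minimal Python sketch of how polled data could be screened for loop suspects.  The thresholds, the SwitchSample record, and the sample values are all assumptions for illustration; they don't come from any particular NMS, and you would feed in whatever your own monitoring system collects.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them against your own baseline.
CPU_THRESHOLD_PCT = 80.0
PORT_UTIL_THRESHOLD_PCT = 90.0


@dataclass
class SwitchSample:
    """One polling sample for a switch (hypothetical record layout)."""
    name: str
    cpu_pct: float
    port_util_pct: dict  # interface name -> utilization percentage


def loop_suspects(samples):
    """Return {switch: [busy ports]} for switches showing both symptoms."""
    suspects = {}
    for s in samples:
        busy = sorted(
            port for port, util in s.port_util_pct.items()
            if util >= PORT_UTIL_THRESHOLD_PCT
        )
        # Flag a switch only when high CPU and at least one saturated port coincide.
        if s.cpu_pct >= CPU_THRESHOLD_PCT and busy:
            suspects[s.name] = busy
    return suspects


# Made-up sample data for three switches.
samples = [
    SwitchSample("A", 97.0, {"Gi0/1": 99.0, "Gi0/2": 12.0}),
    SwitchSample("B", 95.0, {"Gi0/1": 98.0, "Gi0/3": 96.0}),
    SwitchSample("C", 15.0, {"Gi0/1": 20.0}),
]
print(loop_suspects(samples))  # {'A': ['Gi0/1'], 'B': ['Gi0/1', 'Gi0/3']}
```

The list of busy ports on each suspect switch is a starting point for deciding which links to look at first.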

Links must be shut down or disconnected in order to break the loop.  This is where planning will pay off.  Consider the topology diagram in the Cisco "Troubleshooting STP" document referenced above.  A loop between the ADB switches, the ACB switches, or the AEB switches is easily broken by disconnecting any link in the loop.  I would plan to take out the AB link because that would break any of the three loops that I identified.  If that doesn't take care of the loop, then the problem is likely due to a loop induced between VLANs or between two ports in one VLAN.  It could be due to a cabling mistake or a dual-homed server with bridging enabled between two interfaces.  In this case, you have to be prepared to isolate each switch until you find the combination that contains the loop (it may involve more than one switch).
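
The reasoning behind picking the AB link can be worked out ahead of time for any set of candidate loops.  The sketch below is only an illustration: the function names are mine, and it assumes each loop is written as a string of single-letter switch names in ring order, as in the ADB, ACB, and AEB loops above.

```python
def loop_links(loop):
    """Return the set of links in a loop given as an ordered string of switch names."""
    n = len(loop)
    # Pair each switch with the next one, wrapping around to close the ring.
    return {frozenset((loop[i], loop[(i + 1) % n])) for i in range(n)}


def links_breaking_all(loops):
    """Return the links that appear in every loop, as sorted name pairs."""
    common = set.intersection(*(loop_links(loop) for loop in loops))
    return sorted("".join(sorted(link)) for link in common)


loops = ["ADB", "ACB", "AEB"]
print(links_breaking_all(loops))  # ['AB']
```

If the result is empty, no single link breaks every candidate loop, and the plan has to name more than one link to pull.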



Now imagine an STP domain that spans ten or more switches and you have the potential for a time-consuming troubleshooting task if you're not well prepared.  This is one of the reasons why we at NetCraftsmen recommend that failure domains be limited in size.

If the STP loop you're troubleshooting is serious enough, you won't be able to use the network to access the switches.  Someone will need to physically unplug the network connections.  Clear identification of those connections, through cable colors, labels, and interface descriptions, will make your troubleshooting go faster.  And be prepared to properly reconnect the links if you've had to physically disconnect the cables.  It doesn't help if you quickly unplug three infrastructure links and then puzzle over which cables connect to which ports on the switch.

Now think about other common problems and how you'll tackle the troubleshooting tasks to quickly identify the source of the problem.  If your network uses a large number of static routes, be prepared to handle a routing loop created by the interaction between a static route and the dynamic routing protocol.
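
One way to prepare for the static-route case is to be able to trace, for a given destination prefix, the next hop that each router would use and see whether the path circles back on itself.  The following Python sketch assumes you have already collected that per-router next hop into a dictionary; the router names and the data are hypothetical, and gathering the real next hops from your routers is outside the sketch.

```python
def find_forwarding_loop(next_hop, start):
    """Follow next hops for one prefix from start; return the loop as a list, or None."""
    path = []
    node = start
    while node is not None:
        if node in path:
            # We have come back to a router already visited: report the cycle.
            return path[path.index(node):] + [node]
        path.append(node)
        node = next_hop.get(node)  # None means the prefix is delivered or unknown here
    return None


# Hypothetical next hops for a single destination prefix.
next_hop = {
    "R0": "R1",
    "R1": "R2",  # assumed to be a static route pointing at R2
    "R2": "R1",  # assumed to be learned from the dynamic routing protocol
    "R5": "R6",
    "R6": None,  # destination is directly connected to R6
}
print(find_forwarding_loop(next_hop, "R0"))  # ['R1', 'R2', 'R1']
print(find_forwarding_loop(next_hop, "R5"))  # None
```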

In a network supporting VoIP, you should understand the process used by phones to power up, register, and operate.  You can use the OSI model to segregate problems into physical layer, data link layer, network layer, and application layer.  Knowing the types of problems at each layer allows you to quickly pick the few troubleshooting tasks that will identify the source of a problem.  An example is one-way audio; think about how you would diagnose its cause and how you might fix it.
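
As a small illustration of working through the layers in order, the sketch below takes the outcome of one check per layer and reports the lowest layer that has not passed.  The specific checks named in the comments are examples I made up; substitute the checks that fit your own phones and call servers.

```python
# Layers in bottom-up order, as listed in the text.
LAYERS = ["physical", "data link", "network", "application"]


def first_failing_layer(results):
    """Return the lowest layer whose check has not passed, or None if all passed.

    results maps a layer name to True (passed) or False (failed).
    A layer missing from results is treated as not passed.
    """
    for layer in LAYERS:
        if not results.get(layer, False):
            return layer
    return None


phone_checks = {
    "physical": True,      # hypothetical: phone powers up and has link
    "data link": True,     # hypothetical: phone is in the expected voice VLAN
    "network": False,      # hypothetical: phone cannot reach the call server
    "application": False,  # hypothetical: phone is not registered
}
print(first_failing_layer(phone_checks))  # network
```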

How can you be prepared?  You need to know where and how you'll tackle specific problems.  What diagnostic tools do you need and are they in the appropriate locations?  Do you know the actions that you need to take to isolate problems or the diagnosis that you need to perform to gather enough information to characterize and identify the source of a problem?

  -Terry

About tslattery

Terry Slattery, CCIE #1026, is a senior network engineer with decades of experience in the internetworking industry. Prior to joining Chesapeake NetCraftsmen as a full-time consultant, Terry was the founder and CTO of Netcordia, and inventor of NetMRI, a suite of network management products. Terry started Netcordia as a consulting company in 2000 and transitioned to a network management product company in 2003. During the consulting days, he used his network design and implementation skills to lead a team in the design and implementation of a high-availability network at a brokerage clearing house. Terry is the former President and founder of Chesapeake Computer Consultants, Inc., a networking and computer systems training and consulting company. He co-invented and patented the vLab(tm) internet-based remote lab system. He is co-author of the McGraw Hill text Advanced IP Routing in Cisco Networks. Terry led the team that developed the current Cisco IOS user interface under contract to Cisco Systems. Terry is experienced in the design and installation of large TCP/IP based networks and is a successful network protocol instructor. He is the second Cisco Certified Internetwork Expert (CCIE #1026) and the first outside of Cisco. He enjoys membership on the Vanderbilt University Engineering School’s Industrial Advisory Board and the IEEE.
