Terry's Blog

Spanning Tree Protocol and Failure Domains

My co-worker Pete Welcher recently helped a customer whose network experienced a spanning tree loop (i.e., a meltdown). The experience offers several lessons on how to avoid one.

Lesson #1:  Adequately plan the task.
Rush jobs carry a higher risk of problems than well-planned tasks. In this case, the server operations team needed to bring up a new server for a project and decided not to wait for an access-layer switch to provide connectivity. Instead, a couple of ports on a core switch were used. Something (not sure what) was misconfigured and a spanning tree loop was created. The new server had high-speed interfaces, so nothing limited the volume of forwarded traffic.

Lesson #2: Limit failure domain size.
The new server was connected to the data center core switch, and the resulting forwarding loop took out the entire data center network, affecting all business operations. Implementing smaller spanning tree domains would limit the scope of potential failures, so that unaffected business operations keep running. Such separation may need to be pushed into the distribution or access layer to prevent a potential spanning tree loop from touching the core switches.
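
To make the idea of failure domain size concrete, here is a minimal Python sketch. It treats the network as a graph of Layer 2 links and counts how many switches a loop could reach from a given starting switch. The switch names and both topologies are hypothetical, and the model assumes that a routed (Layer 3) link stops the loop and so is simply left out of the link list.

```python
from collections import deque

def failure_domain(l2_links, start):
    """Return the set of switches reachable from `start` over Layer 2 links."""
    neighbors = {}
    for a, b in l2_links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in neighbors.get(node, ()):
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen

# Hypothetical flat design: every switch shares one STP domain through the core.
flat = [("core1", "core2"), ("core1", "dist-a"), ("core2", "dist-b"),
        ("dist-a", "acc-a1"), ("dist-b", "acc-b1")]

# Hypothetical segmented design: the distribution-to-core links are routed,
# so they do not appear in the list of Layer 2 links.
segmented = [("core1", "core2"), ("dist-a", "acc-a1"), ("dist-b", "acc-b1")]

flat_domain = failure_domain(flat, "acc-a1")
segmented_domain = failure_domain(segmented, "acc-a1")
print(len(flat_domain), len(segmented_domain))
```

In the flat design a loop that starts at acc-a1 can reach every switch, including both cores. In the segmented design it stays within acc-a1 and dist-a.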

Lesson #3: Implement safety mechanisms.
Take advantage of safety mechanisms such as UDLD, Loop Guard, Root Guard, and BPDU Guard to help prevent STP loops from forming. These mechanisms are not a replacement for limiting the size of an STP domain, though: a smaller domain limits the number of systems affected when a failure does occur.
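
One way to check that a safety mechanism is actually in place is to audit the switch configurations. The following is a rough Python sketch, assuming Cisco IOS-style configuration text in which each interface block is indented by one space. It only looks for an explicit per-interface `spanning-tree bpduguard enable` line and ignores any global defaults, so treat it as a starting point rather than a complete audit. The function name and the sample configuration are made up for illustration.

```python
def access_ports_missing_bpduguard(config_text):
    """Return access-mode interfaces that lack an explicit BPDU Guard line."""
    interfaces = {}
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("interface "):
            # Start collecting the lines that belong to this interface.
            current = line.split(None, 1)[1]
            interfaces[current] = []
        elif not raw.startswith(" "):
            # Any other unindented line (such as "!") ends the interface block.
            current = None
        elif current is not None:
            interfaces[current].append(line)
    return [name for name, lines in interfaces.items()
            if "switchport mode access" in lines
            and "spanning-tree bpduguard enable" not in lines]

# Hypothetical sample configuration.
sample = """\
interface GigabitEthernet1/0/1
 switchport mode access
 spanning-tree portfast
 spanning-tree bpduguard enable
!
interface GigabitEthernet1/0/2
 switchport mode access
 spanning-tree portfast
!
interface GigabitEthernet1/0/3
 switchport mode trunk
 spanning-tree guard root
!
"""

missing = access_ports_missing_bpduguard(sample)
print(missing)
```

Here only GigabitEthernet1/0/2 is reported: it is an access port with no BPDU Guard line, while the trunk port is skipped because the check applies to access ports only.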

Lesson #4: Don't put servers for a single function in a single subnet.
When one broadcast domain is affected by an STP problem or by a denial-of-service attack, the backup servers should remain reachable in a separate subnet, ideally in a backup data center. Minimizing common infrastructure reduces the chance of complete system failure when one or two key components fail. A key example is DNS servers, which many applications now need in order to function properly (hopefully the apps don't use hard-coded IP addresses).

Lesson #5: Know your STP topology.
Know how to quickly disconnect sections of the topology so that you can identify the part of the network that contains the source of the STP loop. You can then return the rest of the network to production while you fix the cause of the loop.
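
Knowing which links to pull is something you can work out ahead of time. Here is a small Python sketch of that preparation, assuming you keep a list of Layer 2 links and a mapping of switches to named sections. All names are hypothetical, and `storm_stops_when_isolated` stands in for whatever manual or automated check you use to see whether the rest of the network recovers once a section is disconnected.

```python
def boundary_links(links, section):
    """Links with exactly one end inside `section`; disabling them isolates it."""
    members = set(section)
    return [(a, b) for a, b in links if (a in members) != (b in members)]

def find_faulty_section(sections, storm_stops_when_isolated):
    """Isolate one section at a time; return the first whose isolation ends the storm."""
    for name, members in sections.items():
        if storm_stops_when_isolated(members):
            return name
    return None

# Hypothetical topology and section map.
links = [("core1", "core2"), ("core1", "dist-a"), ("core2", "dist-b"),
         ("dist-a", "acc-a1"), ("dist-b", "acc-b1")]
sections = {"row-a": ["dist-a", "acc-a1"], "row-b": ["dist-b", "acc-b1"]}

# Links to disable in order to disconnect row-a from the rest of the network.
row_a_cut = boundary_links(links, sections["row-a"])

# Pretend the loop lives behind acc-b1, so isolating its section ends the storm.
culprit = find_faulty_section(sections, lambda members: "acc-b1" in members)
print(row_a_cut, culprit)
```

With the cut list for each section computed in advance, isolating a suspect part of the network during an outage becomes a matter of shutting down a known, short list of links.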

In summary, STP loops will quickly congest a network and drive the switch CPUs to 100% utilization. Implement safety mechanisms and topologies that minimize their impact, and be prepared to diagnose them quickly when they occur.

  -Terry

Comments

 


Terry's Blog said:

While doing some other research recently, I ran across the Bridge Assurance feature in Cisco gear, which

January 6, 2010 11:00 AM

About tslattery

Terry Slattery, CCIE #1026, is a senior network engineer with decades of experience in the internetworking industry. Prior to joining Chesapeake NetCraftsmen as a full-time consultant, Terry was the founder and CTO of Netcordia, and inventor of NetMRI, a suite of network management products. Terry started Netcordia as a consulting company in 2000 and transitioned to a network management product company in 2003. During the consulting days, he used his network design and implementation skills to lead a team in the design and implementation of a high-availability network at a brokerage clearing house. Terry is the former President and founder of Chesapeake Computer Consultants, Inc., a networking and computer systems training and consulting company. He co-invented and patented the vLab(tm) internet-based remote lab system. He is co-author of the McGraw-Hill text Advanced IP Routing in Cisco Networks. Terry led the team that developed the current Cisco IOS user interface under contract to Cisco Systems. Terry is experienced in the design and installation of large TCP/IP-based networks and is a successful network protocol instructor. He holds CCIE #1026, making him the second Cisco Certified Internetwork Expert (CCIE) and the first outside of Cisco. He enjoys membership on the Vanderbilt University Engineering School’s Industrial Advisory Board and the IEEE.
