Terry's Blog

Cisco Fan Failure

Have you ever had a Cisco router or switch shut down due to a fan failure? While looking through NetMRI's daily list of analysis issues, I found a fan failure issue (it is named "Device Fan Problem" on NetMRI's Analysis page).

That surprised me, because a fan failure produces a syslog message, and the NOC, which uses other tools to identify important syslog and trap messages, should have caught it. The catch with syslog and SNMP traps is that they typically use UDP as their transport. UDP packets are not retransmitted if they are discarded due to congestion or damaged in transit. Most network people know that UDP packets may not arrive at their destination, but because most networks are pretty reliable, we rarely see it happen.

Because UDP messages may be lost in transit, what can we do about network management that depends on UDP for much of its operation? A good network management system will retry SNMP queries until it retrieves the data that it needs. In this case, NetMRI was able to gather information about a fan failure that had not made it into the logs. Using SNMP polling to retrieve information similar to what syslog reports may seem like a waste, but I think it is important for tracking transient values and for detecting problems where the syslog message didn't make it to the syslog server.
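The retry idea can be sketched in a few lines of Python. This is only an illustration of the general pattern, not how NetMRI does it: `poll_with_retries` and the `query` callable are hypothetical names, and `query` stands in for whatever SNMP library call you actually use, assumed here to raise `TimeoutError` when no response arrives.

```python
import time


def poll_with_retries(query, retries=3, timeout=2.0, backoff=0.0):
    """Call query(timeout) until it returns a value or attempts run out.

    `query` is a hypothetical stand-in for a real SNMP GET; it is assumed
    to raise TimeoutError when the UDP request or its reply is lost.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return query(timeout)
        except TimeoutError as exc:
            last_error = exc
            if attempt < retries:
                # Wait a little longer after each failed attempt.
                time.sleep(backoff * attempt)
    raise TimeoutError(f"no response after {retries} attempts") from last_error
```

The point is that a poller gets to ask again, while a syslog message or trap is sent once and, if dropped, is gone.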

When I saw the issue, I verified that the fan had actually failed. [I like to verify that my tools are operating correctly and that I can trust them - so many NMS products produce false alarms that I've grown accustomed to checking them for proper operation.] NetMRI was correct: the device CLI reported the failed fan. A quick email to the support team allowed them to dispatch someone to repair it before the device overheated and shut down, potentially causing an unplanned network outage.

I like to understand failure modes: how things should operate when a failure occurs and what I can do to minimize its impact. In the case of UDP, I like to use alternate collection methods that aren't as timely as a log message but that still let me know when things break.
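One way to picture the cross-check is to compare what polling found against what the syslog server actually received. The sketch below is a made-up illustration: the function name, the device names, and the "normal"/"failed" state strings are all assumptions, not NetMRI's data model.

```python
def missed_fan_alerts(polled_states, syslog_hosts):
    """Return devices whose polled fan state is not normal but for which
    no fan syslog message was received.

    polled_states: dict of device name -> fan state string (hypothetical)
    syslog_hosts:  set of device names that logged a fan message
    """
    return sorted(
        host
        for host, state in polled_states.items()
        if state != "normal" and host not in syslog_hosts
    )


# Hypothetical example data: two failed fans, only one syslog message arrived.
polled = {"core-sw1": "normal", "edge-rtr2": "failed", "dist-sw3": "failed"}
logged = {"dist-sw3"}
print(missed_fan_alerts(polled, logged))
```

Anything on that list is a failure the log-based tools never heard about, which is exactly the case described above.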

  -Terry

About tslattery

Terry Slattery, CCIE #1026, is a senior network engineer with decades of experience in the internetworking industry. Prior to joining Chesapeake NetCraftsmen as a full-time consultant, Terry was the founder and CTO of Netcordia, and inventor of NetMRI, a suite of network management products. Terry started Netcordia as a consulting company in 2000 and transitioned to a network management product company in 2003. During the consulting days, he used his network design and implementation skills to lead a team in the design and implementation of a high-availability network at a brokerage clearing house. Terry is the former President and founder of Chesapeake Computer Consultants, Inc., a networking and computer systems training and consulting company. He co-invented and patented the vLab(tm) internet-based remote lab system. He is co-author of the McGraw Hill text Advanced IP Routing in Cisco Networks. Terry led the team that developed the current Cisco IOS user interface under contract to Cisco Systems. Terry is experienced in the design and installation of large TCP/IP based networks and is a successful network protocol instructor. He holds Cisco Certified Internetwork Expert (CCIE) #1026, making him the second CCIE and the first outside of Cisco. He enjoys membership on the Vanderbilt University Engineering School’s Industrial Advisory Board and the IEEE.