The Network Monitor, Volume 7 Number 3
In this Issue:
The Top 25 Network Problems and Their Business Impact (Part 2)
We cover problems 14 through 25 and discuss how each problem impacts the business.
Knowing Your Network: Router Redundancy Protocols
We discuss three protocols in this article: HSRP, VRRP, and GLBP. Whichever you choose, a router redundancy protocol can increase network availability.
Customer Spotlight: Health First
Read how Health First was able to become proactive instead of reactive by using NetMRI as part of its winning team.
Network Analysis Tip #49: The 95th Percentile
This calculation provides a good measure of utilization. Learn more about this great tip!
The Top 25 Network Problems and Their Business Impact (Part 2)
It has been interesting to tell people why network analysis is important. We go through some examples, but they often get hung up on thinking about the problems we describe (and that NetMRI detects). For a business person, the problems often don’t mean much – what’s the business impact? For the network engineer, the problems are interesting, but need to be related to the business in order to communicate the importance to the business people.
To help bridge the gap between the business person and network engineer, we created a poster that shows the top 25 network problems and describes how each problem impacts the business. The poster is 24 inches x 36 inches and is suitable for framing or simply pinning up in your office or cube. You may use it to facilitate a conversation between business people, applications people, and network engineers. It allows everyone to see the problem and understand how it impacts the applications upon which the business operates.
Attendees at Interop this year were intrigued by it, looking through the problems to identify the one that bit them most recently. It would be interesting to play ‘poster bingo’ with it – how many of the problems do you find in your network over the coming few months? Would it help to automate the process of identifying and correcting the problems before they impact the business?
While each problem is numbered, the numbers themselves don’t indicate relative ranking. They are simply a means by which we can reference them. Since you can’t read the small poster image, let’s walk through each of the problems and discuss their business impact. We covered the first thirteen problems in The Network Monitor 7-2.
14 OSPF recalculations high
Routing protocol unstable; poor and inconsistent application performance. Link stability, link errors, or spanning tree stability can cause an OSPF topology to be unstable (see 7, 20). The routing protocol may intermittently select non-optimum paths (see 13). Applications experience high jitter or loss of connectivity if routes are flapping as a result.
15 Poor VoIP quality
Due to high jitter, delay, or packet loss; Choppy voice calls; Calls mysteriously disconnect. The root cause of poor VoIP quality can be many other problems. By monitoring delay, jitter, and packet loss, you can reduce the set of possible problems to examine. By identifying the range of phones that are reporting poor statistics, you can better identify the potential source of the problem.
16 Routing Neighbor changes high
Access via this router is negatively affected by a high number of neighbor changes (BGP, OSPF, EIGRP). Similar to problems 13 and 14, something is causing the neighbor relationships to change regularly, which affects the stability and reliability of the routing protocol. As a result, applications can experience high jitter or packets arriving out of order. Finding and fixing the cause of the neighbor changes will result in a more stable and efficient network.
17 OSPF area not connected to backbone
The disconnected OSPF area will not be reachable from other OSPF areas, impacting applications that need to communicate between areas. OSPF inter-area routing relies on connectivity through the backbone area (area 0). When an area is disconnected from the backbone, communications within the area work, but communications between systems in that area and systems in other areas do not (the inter-area routes don’t exist). Users and systems within the area will report what seems to be intermittent connectivity, depending on whether the destination is located within the area or in another area.
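As a rough illustration of how this check could be automated, the Python sketch below flags areas that no router joins to area 0. The data layout (a mapping of router names to the areas they participate in) and the function name are assumptions made for the example, and virtual links are ignored.

```python
def areas_not_connected_to_backbone(router_areas, backbone="0"):
    """Return the OSPF areas that no router attaches to the backbone.

    router_areas maps a router name to the set of areas it participates in
    (hypothetical input; how it is collected is not shown here).
    """
    all_areas = set()
    attached = {backbone}
    for areas in router_areas.values():
        all_areas |= set(areas)
        if backbone in areas:
            # This router borders the backbone, so its other areas are attached.
            attached |= set(areas)
    return sorted(all_areas - attached)


routers = {
    "r1": {"0", "1"},
    "r2": {"0", "2"},
    "r3": {"2", "3"},  # area 3 reaches area 2 but never touches area 0
}
print(areas_not_connected_to_backbone(routers))
```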
18 Unidirectional traffic flow
Typically the result of misconfigured routing, application traffic will be using non-optimum paths, increasing delay and potentially overloading other links and affecting other applications. Sometimes asymmetric routing is desired; however, it increases network complexity and complicates troubleshooting. Servers are often configured with incoming and outgoing interfaces, which may cause unicast flooding, a condition in which frames are sent to all ports in a VLAN. High traffic levels result, impacting the operation of all devices in the VLAN. In routed networks, a measure of zero packets in one direction on a link for long time periods indicates a potential routing misconfiguration.
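The zero-packets-in-one-direction test can be sketched in a few lines of Python. The input format (per-interval packet deltas for each link), the function name, and the default of 12 polling intervals are all assumptions for illustration; pick a window that matches what "long time periods" means in your network.

```python
def unidirectional_links(samples, min_intervals=12):
    """Flag links that carried packets in only one direction.

    samples maps a link name to a list of (in_packets, out_packets) deltas,
    one tuple per polling interval, most recent last (hypothetical format).
    """
    flagged = []
    for link, deltas in samples.items():
        if len(deltas) < min_intervals:
            continue  # not enough history to call it a long period
        recent = deltas[-min_intervals:]
        total_in = sum(d[0] for d in recent)
        total_out = sum(d[1] for d in recent)
        # Exactly one direction is silent; an idle link (both zero) is skipped.
        if (total_in == 0) != (total_out == 0):
            flagged.append(link)
    return sorted(flagged)


samples = {
    "core-wan1": [(0, 500)] * 12,
    "core-wan2": [(400, 380)] * 12,
    "spare": [(0, 0)] * 12,
}
print(unidirectional_links(samples))
```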
19 Router interface down
Any router interface that is administratively up but operationally down is likely to be a redundant connection that will cause an outage if the other connection also fails, affecting all applications that use it. Redundant networks hide first failures, so it is important to identify those failures before a second failure causes an outage. The best practice is to administratively shut down router interfaces that are not supposed to be active, so that any interface in the up/down state indicates that something has failed.
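A minimal sketch of the up/down check follows. It assumes the administrative and operational status of each interface (for example, ifAdminStatus and ifOperStatus from the IF-MIB) has already been collected into a dictionary; the collection step and the names used here are hypothetical.

```python
def failed_interfaces(interfaces):
    """Return interfaces that are administratively up but operationally down.

    interfaces maps an interface name to an (admin_status, oper_status) pair
    of strings such as ("up", "down").
    """
    return sorted(
        name
        for name, (admin, oper) in interfaces.items()
        if admin == "up" and oper == "down"
    )


interfaces = {
    "Gi0/0": ("up", "up"),
    "Gi0/1": ("up", "down"),      # should be working, but is not
    "Gi0/2": ("down", "down"),    # deliberately shut down, so not reported
}
print(failed_interfaces(interfaces))
```

The check is only meaningful if unused interfaces really are shut down; otherwise every unused port shows up as a failure.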
20 Unstable root bridge
Bridge priority not set; applications quit working over unstable VLANs. An inexpensive switch that has the same bridge priority as the desired root bridge in a spanning tree, but a lower MAC address, will try to become the root bridge. But in a busy VLAN, it may not have the backplane bandwidth or CPU to handle the task and may not send BPDUs as frequently as it should (every 2 seconds by default). When several BPDUs are missed, the other switches elect another switch as the root. The STP re-convergence will affect application connectivity. The change is difficult to troubleshoot because the network is working again by the time a network engineer looks at it. Application connectivity seems to be intermittent.
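The election rule behind this problem is simple to sketch: the switch with the lowest bridge ID wins, where the bridge ID is the priority followed by the MAC address. The Python example below uses made-up switch names and MAC addresses and is only an illustration of the comparison, not a spanning tree implementation.

```python
def elect_root(bridges):
    """Return the name of the switch with the lowest (priority, MAC) bridge ID.

    bridges maps a switch name to a (priority, mac_string) pair.
    """
    def bridge_id(item):
        name, (priority, mac) = item
        return (priority, int(mac.replace(":", ""), 16))

    return min(bridges.items(), key=bridge_id)[0]


bridges = {
    "core-sw": (32768, "00:1a:2b:00:00:10"),    # intended root, priority left at default
    "closet-sw": (32768, "00:05:9a:00:00:01"),  # inexpensive switch with a lower MAC
}
print(elect_root(bridges))   # the closet switch wins the tie on MAC address

# Setting a lower priority on the intended root removes the tie.
bridges["core-sw"] = (4096, "00:1a:2b:00:00:10")
print(elect_root(bridges))
```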
21 Duplex mismatch
Increasing link errors; Applications get slower as traffic volume increases. CRC errors, late collisions, and FCS errors are indicators of duplex mismatch. A server is installed and ping works, so it is declared functional, but as the traffic to it builds, errors increase. Finger pointing between the network, server, and application teams often results until the duplex mismatch is discovered. Vendor recommendations (Microsoft: fixed full duplex; Cisco: auto-negotiate) exacerbate the problem.
22 Downstream hub or switch
Unauthorized devices added to the network; Compromise to network integrity and security; See 20. Wireless routers, switches, hubs, and other network devices should be under a common administration in order to provide the best network security. Another switch could have a lower priority, making it the root bridge of a VLAN and causing stability problems (see 20). Rogue DHCP servers in wireless routers can cause intermittent connectivity problems within a subnet, unless specific configurations protect against it.
23 Port in ErrDisable state
The set of end stations connected via this port are disconnected from the network until the port is enabled (either automatically or by user control). A variety of configuration options allow switch ports to be disabled when certain conditions occur, such as receiving BPDUs or DHCP responses (see 20, 22). Some vendors will disable a port if it experiences too many errors. Automatically identifying these ports can avoid a trouble call from a user or server administrator who is having connectivity problems as a result of a port being disabled.
24 Unbalanced & unused ether-channels
Increased latency & jitter affecting sensitive applications like VoIP; Compromised redundancy. Packet distribution across an ether-channel may be unbalanced if a non-optimum packet distribution algorithm is selected. By changing the algorithm, the ether-channel packet distribution is more balanced and overall throughput increases. An unbalanced ether-channel will be more easily congested, resulting in application performance that’s less than expected.
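To see why the choice of algorithm matters, consider 40 clients talking to one server across a four-link ether-channel. The Python sketch below uses a deliberately simplified hash (XOR of the last octet of the selected addresses, modulo the number of links); real switches use their own hash functions, so treat the function names and the hash itself as assumptions for illustration.

```python
from collections import Counter


def link_for(flow, fields, n_links):
    """Pick a member link by XOR-ing the last octet of the selected address fields."""
    value = 0
    for field in fields:
        value ^= int(flow[field].split(".")[-1])
    return value % n_links


def distribution(flows, fields, n_links):
    """Count how many flows land on each member link."""
    return Counter(link_for(flow, fields, n_links) for flow in flows)


# Forty clients, one server.
flows = [{"src": f"10.0.0.{i}", "dst": "10.0.1.20"} for i in range(1, 41)]

by_dst = distribution(flows, ["dst"], 4)             # every flow hashes to the same link
by_src_dst = distribution(flows, ["src", "dst"], 4)  # flows spread across all four links
print(by_dst)
print(by_src_dst)
```

Hashing on the destination alone sends everything down one link, which is the unbalanced case described above.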
25 HSRP or VRRP peer not found
Redundancy configured and not operating correctly; Outage when a second failure occurs. A connectivity or application outage may have not yet occurred, because one device in the redundant pair is still running. But the backup device is not known. The cause may be a broken link between devices, a redundant device that has not yet been installed, or a redundant device or interface that has failed. When the second failure in the redundant configuration occurs, a network outage occurs, impacting applications. Knowing that a redundant configuration is not operational allows it to be corrected before important applications are affected.
Identifying and correcting these problems will allow your network to better service your business’ network requirements. Register to receive a copy of the Top 25 Network Problems and Their Business Impact poster here. See TNM 7-2 for the first half of the Top 25 Network Problems.
Knowing Your Network: Router Redundancy Protocols
Three protocols
Providing a reliable default router for clients and servers on each subnet is important for a high availability network. A good way to provide a reliable default router is to use one of the virtual router redundancy protocols, of which there are three:
- Hot Standby Router Protocol (HSRP)
- Virtual Router Redundancy Protocol (VRRP)
- Gateway Load Balancing Protocol (GLBP)
The advantage of these protocols is that the end stations don’t need any modification or special software. The routers perform the work. You only have to configure a couple of routers on each subnet and you’ve made a significant improvement in network reliability. We will explore these three protocols and learn the differences and potential failure modes so we can make smart choices about which to implement and how to monitor their operation.
In all three protocols, if one router or interface dies, the network continues to work, so there will not be a call from anyone using the network. The risk is that no one may notice when one failure happens. The second failure often happens months later. Corporate executives then begin asking why the additional money was spent to create a reliable network when it just failed – a very embarrassing situation. Monitoring of the configuration and its operation is important to avoid such failures.
HSRP
The oldest of the three protocols is Cisco’s HSRP. Cisco has a patent on the technology (#5,473,599) and in RFC2281, said “... Cisco will license such claims on reasonable, nondiscriminatory terms for use in practicing the standard. More specifically, such license will be available for a one-time, paid up fee.”
In HSRP, a set of routers (typically two) is configured such that one is the primary (active) router and the second is a backup (standby) router. A virtual IP and MAC address is shared between the two routers. The two routers exchange hello messages periodically, every 3 seconds by default, with a default hold time of 10 seconds. The minimum hello interval is 1 second, implying a minimum failure detection time of 3-4 seconds. If the backup router doesn’t receive a hello message within the hold time, it will become the active router and begin forwarding traffic addressed to the virtual IP and MAC address.
The end stations on the subnet are configured to use the virtual IP address as the default router. The backup router will take over the virtual IP and MAC address if the primary router fails, within the parameters of the hold timer. There are other enhancements that allow tracking the state of other interfaces on the router so that if the forwarding path is blocked, a failover will occur. This is a great mechanism with minimal configuration overhead.
Load sharing is accomplished by configuring multiple HSRP groups, each with its own virtual IP address. Each end station will then need to be assigned one of several virtual IP addresses to use as its default router.
HSRP Failure Modes
The HSRP MIB (CISCO-HSRP-MIB) can be used to track HSRP groups and to make sure that each HSRP group contains at least two routers. A group comprised of a single router has had a failure or has not been fully configured. The second router must be fixed or added to the group before an outage hits the existing router. It’s sad, but common, to do a failure analysis in a redundant network to find that one router or interface died long ago, but no one noticed.
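Once group membership has been collected, the single-router check is a simple grouping exercise. The Python sketch below assumes the data has already been gathered (from the MIB or the CLI) into tuples of router name, group number, and virtual IP address; that tuple layout and the function name are hypothetical.

```python
from collections import defaultdict


def single_router_groups(entries):
    """Return the (group, virtual_ip) pairs that have fewer than two member routers.

    entries is an iterable of (router, group, virtual_ip) tuples.
    """
    members = defaultdict(set)
    for router, group, virtual_ip in entries:
        members[(group, virtual_ip)].add(router)
    return sorted(key for key, routers in members.items() if len(routers) < 2)


entries = [
    ("rtr-a", 10, "10.1.10.1"),
    ("rtr-b", 10, "10.1.10.1"),
    ("rtr-a", 20, "10.1.20.1"),  # no second router reports this group
]
print(single_router_groups(entries))
```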
Another failure mode is that of an incorrect HSRP configuration. If you don’t specify a group number, it defaults to 0. The same applies to the virtual IP address. So a configuration statement of the form ip standby preempt 90 will cause HSRP group 0 to be created, using a virtual IP address of 0.0.0.0. In this case, HSRP is in initial state and in some IOS releases, it won’t appear in the output of show standby, but will appear in the MIB. If you’re running 12.2(25)S or later, there is a command enhancement to see groups in init state. Search for “Enhancement to the show standby Command” on Cisco’s web site.
VRRP
VRRP is the IETF version of router redundancy and is vendor independent (RFC 3768). However, in some documents referenced from the Wikipedia description of VRRP, a letter from someone at Cisco stated that they consider VRRP to be covered by the HSRP patent.
In VRRP, the master router’s IP address is used as the virtual IP address and the virtual MAC address will be 00-00-5E-00-01-{VRID}, where VRID is the virtual router ID. This allows for up to 255 virtual routers on a single LAN segment. Note that the real MAC address of a router will never be used with VRRP, so network discovery tools cannot classify the vendor from the MAC address. If the master router or its interface dies, then a backup router will handle traffic to the virtual IP and MAC addresses. The protocol is preemptive, allowing the master router to resume service when it is repaired. Preemption prevents the problem where a router that has been operational the longest becomes the master.
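The virtual MAC address is easy to derive from the VRID, which helps when you need to recognize VRRP addresses in switch forwarding tables. A small Python sketch (the function name is ours):

```python
def vrrp_virtual_mac(vrid):
    """Build the VRRP virtual MAC address 00-00-5E-00-01-{VRID} for a VRID of 1 to 255."""
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be between 1 and 255")
    return f"00-00-5E-00-01-{vrid:02X}"


print(vrrp_virtual_mac(10))
```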
VRRP’s default timers are hello: 1 sec; hold: 3 sec, and can be set to less than a second on Cisco routers. There is a skew timer, based on the router’s priority, that prioritizes the backup routers when the master fails. Authentication exists in the Cisco implementation, but there is no way to stop a hostile router from transmitting VRRP packets, and creating multiple masters. A more complete discussion exists in section 10 of the RFC.
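The skew timer is worth a quick worked example. RFC 3768 defines the skew time as (256 - priority) / 256 seconds and the master down interval as three advertisement intervals plus the skew time, so a higher-priority backup times out slightly sooner and takes over first. The Python sketch below applies those two formulas; the function names are ours.

```python
def skew_time(priority):
    """Skew time in seconds: (256 - priority) / 256, per RFC 3768."""
    if not 1 <= priority <= 255:
        raise ValueError("priority must be between 1 and 255")
    return (256 - priority) / 256


def master_down_interval(priority, advertisement_interval=1.0):
    """Seconds a backup waits without hearing the master before taking over."""
    return 3 * advertisement_interval + skew_time(priority)


for priority in (100, 200):
    print(priority, master_down_interval(priority))
```

With the default 1-second advertisement interval, a backup at priority 200 waits about 3.2 seconds and one at priority 100 waits about 3.6 seconds.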
VRRP Failure Modes
As with HSRP, the key failure mode is that a single failure takes out half of a redundancy pair. Operationally, VRRP is less chatty than HSRP in that only the master router sends hello packets. The backup routers only send packets when negotiating a new master. Unfortunately, this makes it more difficult to detect all routers that are part of a redundancy group within a subnet.
To detect the failure of half of a redundancy pair, you have to find all routers on the subnet and make sure that they are configured for VRRP and that the backup routers are hearing hello packets from the master router (the VRRP MIB is RFC2787).
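Because no single router knows the whole group, the correlation has to happen in the management system. The Python sketch below assumes the VRRP state of every router found on a subnet has already been collected into a dictionary; the input format, the function name, and the finding strings are illustrative assumptions.

```python
def vrrp_subnet_findings(states):
    """Summarize redundancy problems for one subnet and one VRID.

    states maps a router name to "master", "backup", or None when VRRP
    is not configured on that router's interface in the subnet.
    """
    masters = sorted(r for r, s in states.items() if s == "master")
    backups = sorted(r for r, s in states.items() if s == "backup")
    unconfigured = sorted(r for r, s in states.items() if s is None)

    findings = []
    if not masters:
        findings.append("no master")
    if len(masters) > 1:
        findings.append("multiple masters: " + ", ".join(masters))
    if not backups:
        findings.append("no backup router")
    if unconfigured:
        findings.append("VRRP not configured: " + ", ".join(unconfigured))
    return findings


print(vrrp_subnet_findings({"rtr-a": "master", "rtr-b": None}))
print(vrrp_subnet_findings({"rtr-a": "master", "rtr-b": "backup"}))
```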
If only one router within a subnet has VRRP configured, then it will be the master and you need to fix or install the backup router. (It would have been good for the protocol to use a bi-directional hello packet that lists all routers in the group, much like the OSPF Hello protocol. The routers could then report redundancy failure via syslog or SNMP traps even if there was no outage.)
Another error to detect is high numbers of transitions to the Master state, which indicates that something on the subnet is unstable and is causing the master to change periodically. More than a couple of changes per day or maybe up to 5 or 10 per week would be cause to investigate a possible stability problem.
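These rules of thumb translate directly into a threshold check. The Python sketch below assumes you already have a count of transitions to the Master state for each day (for example, derived from successive readings of a counter in the VRRP MIB); the default limits of 2 per day and 5 per week are one reading of the guidance above, not fixed values.

```python
def master_transition_alert(daily_counts, per_day_limit=2, per_week_limit=5):
    """Return True if recent transitions to Master suggest a stability problem.

    daily_counts lists the transitions to Master per day, most recent last.
    """
    last_week = daily_counts[-7:]
    too_many_in_a_day = any(count > per_day_limit for count in last_week)
    too_many_in_a_week = sum(last_week) > per_week_limit
    return too_many_in_a_day or too_many_in_a_week


print(master_transition_alert([0, 0, 1, 0, 0, 0, 0]))  # one transition: quiet week
print(master_transition_alert([0, 3, 0, 0, 0, 0, 0]))  # three in one day
print(master_transition_alert([1, 1, 1, 1, 1, 1, 0]))  # six spread over the week
```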
Authentication failures are counted and every time a failure occurs, an SNMP trap is generated. The combination of these two mechanisms can help identify invalid authentication settings that prevent a VRRP peer relationship from forming. Beware when using authentication because if there is a mismatch of authentication types, it may be possible for two routers to become masters because the mismatched VRRP packets are simply discarded upon receipt (see the security note above about authentication).
The RFC explicitly says that a master VRRP router that does not ‘own’ the virtual IP address must not accept packets addressed to the virtual IP address itself, although it still forwards traffic sent to the virtual MAC address (remember, the virtual IP address is configured on an interface of the router that owns it). This prevents a backup router from taking over connections that were established to the old master router. It also implies that you should not use the virtual IP address for other functions such as network management, ssh, telnet, etc. If a connection to the master using its virtual IP address fails, it is likely that a backup router has taken over as master and cannot accept the connection. By defining and using specific loopback addresses, you will always connect to the desired router.
GLBP
GLBP operates quite differently from the other protocols in that the backup routers are used for load distribution on a per-host basis, so the link capacity of the backup routers is available. It is a Cisco proprietary protocol in which one virtual IP address is used, so all the hosts have one default gateway. The Active Virtual Gateway (AVG - the master router) is chosen from among the GLBP group members and it assigns a virtual MAC address to each Active Virtual Forwarder (AVF - the backup router). The AVG replies to ARP requests with one of the AVF MAC addresses. In this way, the hosts on the network will use different default routers, even though only one default gateway IP address is assigned.
The GLBP routers communicate with each other using hello messages on UDP port 3222 every 3 seconds on multicast address 224.0.0.102. The timers can be set to sub-second values. If one member of the group fails, the other routers in the group take over forwarding for its MAC address. There can be up to four routers in the GLBP group, allowing a good distribution of host packets across a set of default routers.
Cisco provides options for disabling the load balancing option, in which case GLBP operates like HSRP. The default distribution is round-robin. There are two other interesting distribution options. One is Host Dependent, in which the host MAC address is used to determine which default gateway to use. The other option is Weighted, which allows routers that have greater link capacity to have greater weighting for default gateway assignment. Weighting can be driven by a tracked interface so that the router is removed as a default gateway if a tracked interface goes down. Note that the assignment is by host, not by the load offered by the host, so think of it as a load distribution protocol.
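The three distribution options can be compared with a toy model of the AVG answering ARP requests. The Python sketch below is an illustration only: the class name, the forwarder labels, the CRC-based host hash, and the way weights are expanded into a repeating sequence are all assumptions, not a description of Cisco's implementation.

```python
import itertools
import zlib


class AvgSketch:
    """Toy model of an AVG choosing which forwarder MAC to return in an ARP reply."""

    def __init__(self, forwarders, mode="round-robin"):
        # forwarders maps a forwarder's virtual MAC label to an integer weight.
        self.mode = mode
        self._macs = sorted(forwarders)
        self._round_robin = itertools.cycle(self._macs)
        # Weighted mode: repeat each MAC in proportion to its weight.
        self._weighted = itertools.cycle(
            [mac for mac in self._macs for _ in range(forwarders[mac])]
        )

    def arp_reply(self, host_mac):
        if self.mode == "host-dependent":
            # The same host always receives the same forwarder.
            return self._macs[zlib.crc32(host_mac.encode()) % len(self._macs)]
        if self.mode == "weighted":
            return next(self._weighted)
        return next(self._round_robin)


avg = AvgSketch({"vmac-1": 1, "vmac-2": 1})
print([avg.arp_reply(f"host-{i}") for i in range(4)])

weighted = AvgSketch({"vmac-1": 2, "vmac-2": 1}, mode="weighted")
print([weighted.arp_reply(f"host-{i}") for i in range(6)])
```

In every mode the unit of assignment is the host, not the traffic it generates, which is why one busy server can still dominate a single forwarder.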
GLBP Failure Modes
If you’ve selected the default operation, the high priority router will not preempt when it joins the group. Over time, the router that has the greatest uptime will become the AVG. There is an option to enable preemption, so that the desired router becomes the AVG. In networks with two identical routers and links, this is not likely to be important, unless other factors drive the selection of the AVG.
GLBP does not require that all routers in the group be configured with the virtual IP address. A router that isn’t configured with it will learn it from the hello messages of its neighbors. But what happens if the router that has the defined virtual IP address dies? The neighbors continue to pass around the virtual IP address until they are all rebooted, which may be many months later. Suddenly, GLBP fails because there is no defined virtual IP address! In practice, configure all the routers in the group with the desired virtual IP address. Sure, it is a bit of additional work, but well worth the effort to prevent the virtual address from disappearing.
As with the other first hop redundancy protocols, it is important to know when a router in the group has failed. GLBP doesn’t have a MIB yet, so you will need to have a management station that can issue CLI commands to monitor its configuration and operation.
Cisco recommends that GLBP be completely configured before enabling it. If you enable the group first, and then enter commands to change weightings or per-host mappings, you may find that the protocol started and made assignments that are contrary to what you desired. You should configure all the parameters for the group first, then enter the command glbp group ip [ip-address [secondary]] to specify the virtual IP address and enable the group.
As with the other protocols, regularly check for stability of the routers in the redundancy group. Don’t be embarrassed by a network outage due to all routers in the redundancy group being down at one time because you weren’t watching them. A good network management system is key to knowing when your redundant configuration has failed.
Summary
These protocols are used in high availability networks as a First Hop Redundancy Protocol (FHRP). Organizations should be aware of the key operational characteristics and failure modes of each protocol when deciding which one to implement. The key factors in making a choice are manageability, load sharing, and interoperability.
In a high availability network, knowing when the redundant configuration has been compromised is key. A monitoring system should identify redundancy groups that contain a single router or groups in which the routers or their interfaces are not stable. HSRP has the best SNMP support for determining existing group members. VRRP has a MIB, but the master does not know about backup routers, so data from all routers in the same subnet must be correlated – an additional step for network management systems to perform. GLBP does not yet have a MIB, so a CLI mechanism must be used to collect information about its configuration and operation.
Load distribution can be done with all three protocols, with varying levels of configuration. The load distribution mechanisms and configuration difficulty may be a deciding factor for some organizations.
Finally, interoperability with routers from other vendors may drive the selection of VRRP over the other two protocols.
Whichever you choose, a router redundancy protocol can increase network availability, even in the case of occasional device or interface failures.
Customer Spotlight: Health First
Hospital Network Hires NetMRI as ‘Other Guy on the Team’
On Florida’s Space Coast, Health First is the face of health care. Three not-for-profit hospitals—Cape Canaveral Hospital in Cocoa Beach, Holmes Regional Medical Center in Melbourne, and Palm Bay Community Hospital in Palm Bay—form the core of Health First’s family in Brevard County. In addition, 60 clinics provide care throughout the region. A small team of network engineers maintains consistency across all locations and approximately 7,000 network devices.
“We’re moving toward being paperless at every site, so uptime is essential for accessing medical records,” said Mark Davidson, senior network engineer. “We don’t have a lot of time to check ourselves. We have to make every minute count when we work. If someone leaves something out of a configuration, then the switch will act differently and impact uptime.”
Inevitably, the team lacked the manpower to check all configurations manually—especially with frequent changes such as new clinics and services. “A lot of our time was tied up in being reactive, instead of proactive,” Davidson said.
Automated Consistency Checking
Davidson recalled learning about NetMRI from Netcordia at VoiceCon. The solution enables IT staff to maintain consistent configurations across all devices, and quickly identify and solve potential issues with routing, subnets, VoIP, VLANs and other areas.
“I didn’t see anything comparable to NetMRI in terms of automated consistency checking,” Davidson said. “That capability is like adding another guy on our team that we don’t have to feed or keep cool.”
Health First also chose to implement the Policy Management Module and IP Telephony for Cisco. The Policy Management Module automatically detects configuration changes and deviations from policy, allowing Health First to ensure that configurations stay consistent with company and industry best practices.
IP Telephony for Cisco automates data collection across the entire infrastructure, proactively simplifying voice management, improving service quality and reducing risk.
Within an hour of taking NetMRI out of the box, Davidson had it up and running on his own. Use of the solution proved likewise simple for the entire team. “The product is so straightforward. It’s intuitive, so we have not needed any training so far.”
Fast Restoration of Configurations
Previously, engineers assessed the configurations of individual network devices by hand. With NetMRI, Health First automates the process of evaluating devices for configuration consistency. The solution scans the network every night and reports on any changes. Network engineers begin their days by looking at the network scorecard in NetMRI, which provides a high-level view of overall network performance. They can then drill down for more detail on any issues.
As IT staff go about their days, NetMRI continuously gathers data and remains on the lookout for changes. The team receives alerts, allowing them to troubleshoot issues immediately.
NetMRI also automatically restores configurations according to Health First’s standards. To date, the solution has proven very valuable in several situations where switches failed. NetMRI retains those configurations, allowing the team to restore them as needed.
“Before, we would have restored configurations manually, or someone would need more training to use another product. That’s too much work,” Davidson said.
At any time, the staff can run real-time or historical reports on general performance or specific aspects. Davidson regularly provides a link to show the department head the status of the network. The team also looks at historical data to identify trends, and then takes steps to proactively protect the network and avoid incidents.
Automation = Another Team Member
At once, NetMRI enables the Health First team to be more proactive, while also reducing the amount of time staff spend maintaining configurations. “NetMRI is the first place we go when we discover something is acting up,” Davidson said. “It quickly provides more detail about the devices involved. In some cases, that has saved us a few hours of diagnosis time.”
The level of automation also impresses Davidson, even down to filling out “Help” emails that automatically go to Netcordia. “You buy this tool to save you time, and it continues to do so,” he said.
Without NetMRI, Davidson feels the small team would otherwise have needed to grow to meet demand and frequent organization change. “NetMRI is that other person on the team, which costs considerably less than adding another person,” he said.
Decrease in Downtime
Most significantly, the Health First team notes downtime decreases in this uptime-critical environment. “Uptime has definitely improved with NetMRI,” Davidson said. “Since acquiring NetMRI three years ago, we have not had any configuration-related events.”
Looking ahead, the organization expects to improve the quality of its Voice-over-Internet Protocol network with the IP Telephony for Cisco module. IT staff are proactively alerted about issues, and can access detailed information to identify the source of problems.
Copyright © 2008 Netcordia, Inc. All rights reserved.
