Health First Uses NetMRI
"Uptime has definitely improved with NetMRI. Since acquiring NetMRI three years ago, we have not had any configuration-related events...You buy this tool to save you time, and it continues to do so. NetMRI is that other person on the team, which costs considerably less than adding another person."
—Mark Davidson, Senior Network Engineer
The Network Monitor, Volume 7 Number 2
In this Issue:
The Top 25 Network Problems and Their Business Impact (Part 1)
We cover problems 1 through 13 and discuss how each problem impacts the business.
Knowing Your Network: Rapid Spanning Tree Protocol
Why you want to use RSTP and a brief overview of how it works.
Customer Spotlight: Wells’ Dairy
How this renowned company implemented NetMRI to improve productivity and significantly reduce operational costs.
Network Analysis Tip #17: Redundant Routing Peer Not Found
Daily verification of routing redundancy avoids surprises. We discuss the Manual and Automatic Determination.
The Top 25 Network Problems and Their Business Impact (Part 1)
It has been interesting to tell people why network analysis is important. We go through some examples, but they often get hung up on the problems we describe (and that NetMRI detects). For a business person, the problems often don’t mean much: what’s the business impact? For the network engineer, the problems are interesting, but they need to be related to the business in order to communicate their importance to the business people.
To help bridge the gap between the business person and network engineer, we created a poster that shows the top 25 network problems and describes how each problem impacts the business. The poster is 26 inches x 38 inches and is suitable for framing or simply pinning up in your office or cube. You may use it to facilitate a conversation between business people, applications people, and network engineers. It allows everyone to see the problem and understand how it impacts the applications upon which the business operates.
Attendees at Interop this year were intrigued by it, looking through the problems to identify the one that bit them most recently. It would be interesting to play ‘poster bingo’ with it: how many of the problems do you find in your network over the coming few months? Would it help to automate the process of identifying and correcting the problems before they impact the business?
While each problem is numbered, the numbers themselves don’t indicate relative ranking. They are simply a means by which we can reference them. Since you can’t read the small poster image, let’s walk through each of the problems and discuss their business impact. We’ll cover problems 1-13 in this issue and problems 14-25 in the next issue.
- 1. Configuration not saved: Reboot will cause the new configuration to be lost. Due to a power outage on a network device, the operation of the network changes because the new configuration is replaced by the old one upon reboot.
- 2. Saved configurations don’t meet corporate policy: Source of many problems, from performance to reliability to security. Corporate policy may be due to regulatory policies (PCI, HIPAA, SOX), or may be based on accepted best practices. Checking that they are consistently applied across hundreds of routers and switches is nearly impossible to do with manual processes.
- 3. Bloated firewall rule set, unused ACL entries: Poor firewall performance; open, unused rules create potential security problems. Identifying unused firewall rules makes rule sets much easier to understand and maintain, and shows which rules can be safely removed, resulting in improved network security.
- 4. Firewall connection count exceeded: New connections via the firewall fail; Business applications exhibit intermittent failure at high firewall loads; VPNs begin to fail. When the connection count of a busy firewall is exceeded, new connections are refused. The applications experience intermittent network connectivity as the connection count is exceeded and then drops, making it difficult to troubleshoot.
- 5. Link hog - downloading music or videos: Slower application response, impacting user productivity. When one application or user is consuming most of the bandwidth on a link, it impacts the other applications and users of that link. NetMRI uses its Getflow to immediately collect netflow data on a link that’s suddenly running at high utilization to identify applications and users of the link, allowing the network engineer to quickly understand the cause of the slowdown to other applications and take action if necessary.
- 6. Interface traffic congestion: Unpredictable application performance, impacting user productivity. When a router interface is congested, it starts discarding packets, so monitoring packet discards is an early indicator that the applications using the link need more bandwidth, or that a rogue application is now consuming bandwidth that’s needed by business applications.
- 7. Link problems & stability: Physical or DataLink errors cause slow or intermittent application performance; Link or interface stability can impact routing and spanning tree (see 13, 14, 15, 16, 20). Whenever a link has high errors or is unstable, applications will have problems making effective use of the link. When routing or spanning-tree protocols are impacted, the effects may spread to other parts of the network, depending on the network’s design.
- 8. Environmental limits exceeded: Fan failure, power supply problems, and high temperatures are indicators of problems that will likely cause a network device to reboot, affecting any applications relying on the device. Identifying and correcting environmental problems will make the network, and the applications that depend on it, more reliable.
- 9. Memory utilization increasing: A bug in the device’s operating system is consuming more memory and when no free memory exists, the device will reboot, disrupting applications that are transiting the device. Imagine troubleshooting a network problem that occurs every two weeks as the device runs out of memory and reboots. We’ve seen this happen in production networks. The business impact depends on how often it occurs and what applications are affected.
- 10. Incorrect serial bandwidth setting: Causes routing protocols to make non-optimum routing decisions. If the bandwidth is too low, it can affect the operation of the routing protocol itself, making routes unstable. Remote branches will experience unreliable application operation, which will be difficult to troubleshoot because you’ll have to catch it when it is happening. As applications begin using more link bandwidth, the routing protocol can become unstable. If you need to alter network traffic paths, use policy based routing mechanisms instead of changing link bandwidth parameters. Also make sure tunnels have accurate bandwidth settings.
- 11. No QoS: Important business applications are not prioritized, yielding unpredictable or poor performance during times of interface congestion. Applications like VoIP or SAP are susceptible to high jitter and packet loss when QoS is not used. Configurations that match corporate policy for QoS deployment are important (see 2).
- 12. QoS Queue Drops: Important business applications are slow; Business needs have changed since the queue definitions were created. A network design for four concurrent VoIP calls will not perform well when more people are hired and the number of concurrent calls increases. Similar conditions exist for other applications. Queue drops are an early indicator of potential problems that require a network change.
- 13. Route flaps: Poor application performance as packets take the wrong or inefficient paths in the network. It may be caused by unstable links or improperly configured routing protocol timers (see 2, 7). Packets may also arrive out of order, which some applications cannot tolerate. Varying paths will also cause high jitter, which affects time-sensitive applications like VoIP and SAP. Studies have shown that people can deal with relatively high delay as long as the delay is consistent, but high variance in application response will drive people crazy.
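Problem 9 (memory utilization increasing) lends itself to a simple trend check. The following is a minimal sketch, not how NetMRI does it: it assumes you already poll each device's free memory at intervals, and the function name and sample values are hypothetical. It fits a straight line to the samples and estimates how long until free memory reaches zero.

```python
from typing import Optional, Sequence, Tuple


def hours_until_exhaustion(samples: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Estimate hours until free memory reaches zero.

    samples: (hours_since_first_poll, free_bytes) pairs, oldest first.
    Returns None when free memory is flat or growing, or when there are
    fewer than two distinct sample times.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope: bytes of free memory gained (or lost) per hour.
    slope = sum((t - mean_t) * (m - mean_m) for t, m in samples) / var_t
    if slope >= 0:
        return None
    intercept = mean_m - slope * mean_t
    # Time at which the fitted line crosses zero, relative to the last sample.
    zero_at = -intercept / slope
    return max(0.0, zero_at - samples[-1][0])


# Hypothetical polls: free memory drops 10 MB every 24 hours.
polls = [(0.0, 140e6), (24.0, 130e6), (48.0, 120e6), (72.0, 110e6)]
remaining = hours_until_exhaustion(polls)
print(f"Estimated hours until memory is exhausted: {remaining:.0f}")
```

A straight-line fit is a deliberate simplification; a real leak may be bursty, so treat the estimate as an early warning rather than a prediction.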
Identifying and correcting these problems will allow your network to better serve your business’s network requirements. Register to receive a copy of the Top 25 Network Problems and Their Business Impact poster here.
Knowing Your Network: Rapid Spanning Tree Protocol
If you’re not running the Rapid Spanning Tree Protocol in your switched network, you should consider it. There are a number of significant improvements over the 802.1D spanning tree protocol that decrease the convergence to sub-second times when a spanning tree change occurs. Note that RSTP is used for both per-VLAN spanning tree and for multi-VLAN spanning tree. We’ll discuss only the rapid-pvst mode.
The Rapid Spanning Tree Protocol (RSTP) was standardized as IEEE 802.1w in 2001 and folded into the IEEE 802.1D standard in its 2004 revision. It is based on the Cisco extensions to 802.1D: Port Fast, Uplink Fast, and Backbone Fast. There are several new port roles in addition to the Root and Designated roles, shown in Figure 1. An Alternate port is an alternate path to the root bridge, while a Backup port is one that receives a BPDU from another port on the same switch (implying that there is a downstream hub).
The port states have been reduced to only three: Discarding, Learning, and Forwarding. The old states of Disabled, Blocking, and Listening are mapped into the Discarding state, while Learning and Forwarding are unchanged. The result is reduced time for a port to make it to the Forwarding state.
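As a quick reference, the state mapping defined by 802.1w can be written as a lookup table. This is only an illustrative sketch; the dictionary and function names are hypothetical and not part of any switch API.

```python
# 802.1D port state -> RSTP (802.1w) port state
RSTP_STATE = {
    "disabled": "discarding",
    "blocking": "discarding",
    "listening": "discarding",
    "learning": "learning",
    "forwarding": "forwarding",
}


def to_rstp_state(legacy_state: str) -> str:
    """Translate an 802.1D port state name into its RSTP equivalent."""
    return RSTP_STATE[legacy_state.strip().lower()]


for state in ("Blocking", "Listening", "Learning", "Forwarding"):
    print(f"{state:<10} -> {to_rstp_state(state)}")
```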
The big difference with RSTP, however, is in how BPDUs operate. Every switch generates BPDUs and downstream switches acknowledge upstream BPDUs. This basically implements a keepalive mechanism that quickly identifies failed links or ports and is the fundamental mechanism that drives the faster convergence time.
There is a fall-back mode that allows a switch to properly communicate with another switch that is running the older style 802.1D spanning tree algorithm. The compatibility mode is on a per-port basis, so the existence of a legacy switch will affect only the link to its neighbors, not the entire spanning tree. The Cisco document, Understanding Rapid Spanning Tree Protocol (802.1w), is a good explanation of the differences and how RSTP works. The important concept is that RSTP’s improvements reduce convergence time when a failure occurs.
To enable and verify RSTP on a Cisco IOS switch, use the following commands (the Juniper config is similar enough that I don’t need to repeat it here). Note that a mode change will cause the spanning tree in the switch to reset, so a connectivity outage will occur.
Router(config)#spanning-tree mode rapid-pvst
Router#show spanning-tree summary
Switch is in rapid-pvst mode
...
Note that CatOS doesn’t allow rapid-pvst if BackboneFast is enabled. A Cisco document, Spanning Tree from PVST+ to Rapid-PVST Migration Configuration Example, contains an interesting comment about running a mixed STP domain:
"In mixed mode, you do not receive the complete advantages of rapid-PVST+. The overall convergence time is the same as the convergence time of PVST+ mode. In order to take full advantage of rapid-PVST+, all the switches in the spanning tree topology must run the rapid-PVST+"
Given this constraint, how do you know that a spanning tree is running RSTP and that it isn’t constrained to slower convergence times? First, make sure that all switches are running RSTP by running the command show spanning-tree summary or by using SNMP to query the value of stpxSpanningTreeType (CISCO-STP-EXTENSIONS-MIB). Next, make sure that all interfaces are connected to peers that are also running RSTP. With rapid-pvst+, this is done on a per-VLAN basis:
Switch#show spanning-tree vlan 80
VLAN0080
  Spanning tree enabled protocol rstp
  Root ID    Priority    24656
             Address     001e.133b.8480
             This bridge is the root
             Hello Time   2 sec  Max Age 20 sec  Forward Delay 15 sec
  Bridge ID  Priority    24656  (priority 24576 sys-id-ext 80)
             Address     001e.133b.8480
             Hello Time   2 sec  Max Age 20 sec  Forward Delay 15 sec
             Aging Time 300
Interface        Role Sts Cost      Prio.Nbr Type
---------------- ---- --- --------- -------- ----------------
Gi1/0/10         Desg FWD 19        128.10   P2p Peer(STP)
Gi1/0/14         Desg FWD 4         128.14   P2p
The differences in the Type field indicate that the peer on interface Gi1/0/10 is running STP and needs to have its config updated while the peer via interface Gi1/0/14 is running RSTP. The corresponding SNMP variable is stpxRPVSTPortStatus (CISCO-STP-EXTENSIONS-MIB):
[root@QAecampus03 root]# snmpwalk -c qasnmp 220.20.40.5 1.3.6.1.4.1.9.9.82.1.13.1.1.3.80
SNMPv2-SMI::enterprises.9.9.82.1.13.1.1.3.80.10 = Hex-STRING: 10
SNMPv2-SMI::enterprises.9.9.82.1.13.1.1.3.80.14 = Hex-STRING: 00
The value of 10 indicates a peer switch not running Rapid PVST+, while a value of 00 indicates a peer that is correctly configured.
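Both checks are easy to script once the output has been collected. The sketch below is a hedged illustration only: the function names are hypothetical, it assumes the text formats shown above, it treats the last two OID components as VLAN and port index (an assumption based on the sample walk), and it interprets only the two values discussed here (10 and 00), reporting anything else as unknown.

```python
import re
from typing import Dict, List

# Matches the tail of an snmpwalk line: "...<vlan>.<port> = Hex-STRING: <bytes>"
_WALK_LINE = re.compile(r"\.(\d+)\.(\d+)\s*=\s*Hex-STRING:\s*([0-9A-Fa-f ]+)$")


def legacy_peers_from_cli(output: str) -> List[str]:
    """Return interfaces whose Type column in 'show spanning-tree' has Peer(STP)."""
    return [line.split()[0] for line in output.splitlines() if "Peer(STP)" in line]


def peer_status_from_walk(output: str) -> Dict[str, str]:
    """Classify each walked port using the two values described in the article."""
    status = {}
    for line in output.splitlines():
        match = _WALK_LINE.search(line.strip())
        if not match:
            continue
        vlan, port, raw = match.groups()
        value = raw.replace(" ", "").upper()
        if value == "00":
            verdict = "rapid-pvst peer"
        elif value == "10":
            verdict = "legacy STP peer"
        else:
            verdict = "unknown"
        status[f"vlan {vlan} port {port}"] = verdict
    return status


cli = """\
Gi1/0/10 Desg FWD 19 128.10 P2p Peer(STP)
Gi1/0/14 Desg FWD 4 128.14 P2p
"""
walk = """\
SNMPv2-SMI::enterprises.9.9.82.1.13.1.1.3.80.10 = Hex-STRING: 10
SNMPv2-SMI::enterprises.9.9.82.1.13.1.1.3.80.14 = Hex-STRING: 00
"""
print(legacy_peers_from_cli(cli))
print(peer_status_from_walk(walk))
```

Running either check daily across all switches and VLANs would flag a legacy peer as soon as it appears, rather than after a slow convergence event.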
Customer Spotlight: Wells’ Dairy
Wells’ Dairy, Inc. was started in 1913 by Fred H. Wells when he purchased a horse, delivery wagon, a few cans and jars, and the good will of the business from Ray Bowers, a dairy farmer in Le Mars, Iowa, all for $250. Wells’ Dairy has grown and evolved over the years, including the establishment of its nationally known “Blue Bunny” ice cream in 1935.
Fast forward to 2007. Despite its down-on-the-farm roots, Wells’ Dairy is now the world’s largest family-owned and managed manufacturer of dairy products in one location.
The Challenge
Wells’ Dairy has a sophisticated enterprise network in place to support the facilities, technology, and people that are necessary to keep growing. The company needed a network management solution that could keep up with increasing demands on its network infrastructure.
The IT department had to deal with very time-consuming issues: tracking down improperly configured devices, locating new servers, and tracing IP addresses to switchports. Spending an inordinate amount of time locating issues made it difficult for the IT staff to be proactive.
The Solution
A trusted friend informed Wells’ Dairy network architect, Jim Kirby, about a proactive network management solution, NetMRI. “We weren’t originally looking for this particular solution, but NetMRI is one of those products that once you see it in action, you wonder how you can live without it,” said Jim. “We saw a good ROI and that was all it took to make the decision to install it on our network.”
NetMRI is a proactive analysis solution that monitors, detects, and reports on network issues before they become problems for IT administrators or end-users. It plugged right in to Wells’ Dairy’s existing network and gathered data from all network elements regardless of vendor. Immediately upon installation, NetMRI’s built-in analytics engine notified Wells’ Dairy of potential problems and vulnerabilities. It discovered duplex mismatches and misconfigured devices, which previously took an excessive amount of time to find.
“NetMRI plays a vital role in the ongoing maintenance of our growing network,” said John Johnson, network engineer at Wells’ Dairy. “It acts as an extension of our IT department, which provides an immediate return on investment for our company.”
NetMRI provides Wells’ Dairy with a complete audit of the network on a daily basis. The automated reports provide great insight into the overall health of the network. The network scorecard gives a quick snapshot of Wells’ Dairy’s network strengths and the areas that need attention.
NetMRI’s ability to store all configuration files along with history of changes to the device makes it a great asset to Wells’ Dairy. Research indicates that 60% of network problems are caused by incorrect configuration changes. By comparing Wells’ Dairy’s currently running configurations with its previous configurations, NetMRI aids in identifying the most recent configuration changes when a network problem is reported.
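The idea of comparing a running configuration against a stored one can be illustrated with Python's standard difflib module. This is a generic sketch, not NetMRI's implementation; the function name, hostname, and configuration lines are hypothetical.

```python
import difflib
from typing import List


def config_changes(previous: str, current: str) -> List[str]:
    """Return only the added ('+ ') and removed ('- ') configuration lines."""
    diff = difflib.ndiff(previous.splitlines(), current.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


previous_config = """\
hostname core-sw1
spanning-tree mode pvst
"""
running_config = """\
hostname core-sw1
spanning-tree mode rapid-pvst
"""
for change in config_changes(previous_config, running_config):
    print(change)
```

Keeping one such diff per stored revision gives a change history to consult first when a network problem is reported.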
Summary
Since implementing NetMRI, the Wells’ Dairy IT staff has improved productivity and significantly reduced operational costs, resulting in more efficient management of the enterprise network. The IT department is more proactive and responsive in identifying issues on the network, ultimately saving Wells’ Dairy valuable time and money. NetMRI simplifies the complex challenges of managing the network infrastructure so that Wells’ Dairy can focus on what’s important: selling dairy products around the world.
Copyright © 2008 Netcordia, Inc. All rights reserved.
