
Applied Infrastructure

Handling NMS Performance Data, Part 2

I described collecting network performance data in last week's blog.  This week, I want to describe how to efficiently store the collected data.  I have heard stories about vendors who used a relational DB to store interface performance data and whose systems didn't perform well at large scale (over 50,000 interfaces per polling engine).

Most NMS developers are good database developers, so they naturally prefer to store data directly in a relational database.  It makes their lives easier because SQL queries do a lot of the work for them, and it gives them one common interface for all their interactions with the data.  But this approach has a cost.  The DB API is relatively heavyweight because of its relational capabilities.  This is a typical optimization tradeoff: is the developers' time more important than the time the system spends handling the data?  A number of NMS development efforts have had poor performance because the wrong tradeoff was selected.

What causes the slow performance?  A relational database is powerful because it allows the developer to easily create relations between data and make powerful queries against that data and its relationships.  It reduces data storage in many cases because it can store metadata in one place and reference it from multiple places.  In a network, the metadata might be the device's name, its management addresses, location, etc., all referenced by a unique device ID.  An interface or configuration entry in the DB can simply reference the device by its ID to get access to the higher-level metadata about the device.  One change in the metadata is reflected immediately in all references to that data instead of having it duplicated for each interface.  This is all good.
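
As a rough illustration of that device-ID relationship, here is a minimal sketch using Python's built-in sqlite3 module.  The table names, column names, and sample values are invented for the example and are not taken from any particular NMS schema.

```python
import sqlite3

# In-memory database; the schema is a made-up example, not a real NMS schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE device (
        device_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        mgmt_addr TEXT NOT NULL,
        location  TEXT
    );
    CREATE TABLE interface (
        if_id     INTEGER PRIMARY KEY,
        device_id INTEGER NOT NULL REFERENCES device(device_id),
        if_name   TEXT NOT NULL
    );
""")

# The device metadata is stored once...
conn.execute("INSERT INTO device VALUES (1, 'core-rtr-1', '10.0.0.1', 'Site A')")
# ...and each interface only carries the device ID.
conn.executemany(
    "INSERT INTO interface (device_id, if_name) VALUES (?, ?)",
    [(1, "Gi0/1"), (1, "Gi0/2")],
)

# A single update to the device row is seen through every interface.
conn.execute("UPDATE device SET location = 'Site B' WHERE device_id = 1")


def interface_locations(conn):
    """Return (interface name, device name, device location) for each interface."""
    return conn.execute(
        """SELECT i.if_name, d.name, d.location
           FROM interface AS i JOIN device AS d USING (device_id)
           ORDER BY i.if_name"""
    ).fetchall()


rows = interface_locations(conn)
```

After the single UPDATE, both interface rows report the new location, because the location lives only in the device table.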

The problem occurs when high volumes of data need to be handled.  A relational DB needs to index the data as it is inserted so that it can be extracted quickly later, and that indexing slows down every insert.  If indexing is not done, the DB read operations take longer.  So there's a performance penalty on either the inserts or the reads (which are called 'selects' in the SQL language).  On top of the insert operation, we need to add DB logging, which is similar to real-time backups (most DBs allow the log to be played back from a known checkpoint to bring the DB back up to date after a system crash).  Even though the log may be (and should be) on a different disk than the DB itself, the DB uses memory and CPU to perform the logging.  The ease of use comes with a price.
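
The read side of this tradeoff can be seen in a small sketch, again using Python's sqlite3 module as a stand-in for whatever database an NMS actually uses.  The table and index names (sample, sample_by_if) are assumptions made for the example.  It asks SQLite how it plans to run the same lookup before and after an index exists; it does not measure the insert cost of maintaining that index.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable detail strings from SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the fourth column holds the description


conn = sqlite3.connect(":memory:")
# Hypothetical raw-sample table: one row per interface per poll.
conn.execute("CREATE TABLE sample (if_id INTEGER, ts INTEGER, in_octets INTEGER)")
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?)",
    [(if_id, ts, 0) for if_id in range(100) for ts in range(0, 3600, 300)],
)

lookup = "SELECT ts, in_octets FROM sample WHERE if_id = ?"

# Without an index, the database has to examine every row.
plan_without_index = query_plan(conn, lookup, (42,))

# With an index, reads are targeted, but every later insert must also update it.
conn.execute("CREATE INDEX sample_by_if ON sample (if_id)")
plan_with_index = query_plan(conn, lookup, (42,))
```

The first plan describes a scan of the whole table and the second a search using the index; the exact wording of the plan text varies between SQLite versions.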

Is there an alternative?  Yes.  All NMSs roll up the collected data over longer time intervals, typically an hour.  The roll-up calculations typically record values such as MIN, MAX, AVG, and 95th percentile.  These are the values used in performance thresholding, error rate thresholds, trend analysis, and correlation.  Keep the collected data that is required for the roll-up period in an in-memory cache (memory is inexpensive these days, so use it to optimize system performance).  An efficient data structure will allow very rapid access to the data in the cache.  The roll-up data is created from the cache and stored in the DB.  This approach applies the power of the relational DB to the summaries, which is the data that is normally queried.  The raw data in the cache is then written directly to the filesystem, using an on-disk data structure that makes the raw data easy to access.
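
A minimal sketch of such a cache is shown below.  The class name, method names, and the nearest-rank definition of the 95th percentile are my assumptions for illustration; a real polling engine would pick its own data structure and percentile method.

```python
from collections import defaultdict


class SampleCache:
    """Hypothetical in-memory cache of raw samples for one roll-up period."""

    def __init__(self):
        # interface ID -> list of (timestamp, value) in arrival order
        self._samples = defaultdict(list)

    def add(self, if_id, ts, value):
        self._samples[if_id].append((ts, value))

    def rollup(self, if_id):
        """Return MIN, MAX, AVG and 95th percentile, or None if no samples."""
        values = sorted(value for _, value in self._samples.get(if_id, ()))
        if not values:
            return None
        n = len(values)
        # Nearest-rank percentile: rank = ceil(0.95 * n), in integer arithmetic.
        rank = (95 * n + 99) // 100
        return {
            "min": values[0],
            "max": values[-1],
            "avg": sum(values) / n,
            "p95": values[rank - 1],
        }

    def drain(self, if_id):
        """Remove and return the raw samples so they can be written to disk."""
        return self._samples.pop(if_id, [])


# Example: twenty utilization samples for one interface, polled every 3 minutes.
cache = SampleCache()
for i in range(1, 21):
    cache.add(42, i * 180, float(i))

summary = cache.rollup(42)   # this small dict is what would go into the DB
raw = cache.drain(42)        # the raw samples go to the filesystem instead
```

Only the four summary values per interface per period reach the database; the raw samples leave the cache through drain() and never pass through the DB API.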

Why does this work well?  In normal use, the raw data is rarely accessed.  Its main purpose is to create the roll-up summary data that is used for network performance trending.  The network staff typically examines only a few interfaces each day, so it makes sense to optimize the raw data storage mechanism for fast writes rather than for flexible queries.  The result is a big performance boost over using the DB to store raw data.
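
One simple way to store raw data outside the DB is an append-only file of fixed-size binary records per interface.  The sketch below assumes a record layout of a 32-bit timestamp followed by a 64-bit float, and a one-file-per-interface naming scheme; both are illustrative choices, not a description of any product's on-disk format.

```python
import os
import struct
import tempfile

# Assumed record layout: little-endian uint32 epoch seconds + float64 value.
RECORD = struct.Struct("<Id")


def append_samples(path, samples):
    """Append (timestamp, value) pairs to the end of a raw data file."""
    with open(path, "ab") as f:
        for ts, value in samples:
            f.write(RECORD.pack(ts, value))


def read_samples(path, start_ts, end_ts):
    """Return the samples with start_ts <= timestamp < end_ts."""
    result = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break  # end of file (or a truncated final record)
            ts, value = RECORD.unpack(chunk)
            if start_ts <= ts < end_ts:
                result.append((ts, value))
    return result


# Example: write one interface's samples, then read back a time range.
with tempfile.TemporaryDirectory() as raw_dir:
    path = os.path.join(raw_dir, "if_42.raw")
    append_samples(path, [(300, 12.5), (600, 14.0), (900, 13.25)])
    recent = read_samples(path, 600, 1200)
```

Writes are plain sequential appends with no indexing or logging.  Because every record is the same size, a reader could also seek straight to a record by position instead of scanning, which is enough for the occasional detailed look at a single interface.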

What are the advantages of this approach?
* Reduced database storage requirements.
* Improved database performance.
* Less contention for database resources and disk I/O.
* Raw data is more efficiently stored.
* Historical raw data can be easily moved to a SAN for long-term storage.
* Detailed displays of performance data are easy to produce as long as the raw data is easy to access.
* Micro sampling of specific interfaces can be done without a major impact on the polling engine.
* Remote collectors can perform the periodic roll-up calculations and forward only the required data to the NMS analysis engine.  Or, even better, they can keep all the data locally and have the central analysis system download rules to the polling engine, which performs preliminary identification by matching interfaces against the given criteria.
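
To make the last point concrete, here is a small sketch of a collector applying a downloaded rule to its local roll-up data.  The rule format (metric, comparison, threshold), the function name, and the interface names are all hypothetical, chosen only to show the idea.

```python
import operator

# Comparison operators a hypothetical rule may use.
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matching_interfaces(rollups, rule):
    """Return the sorted IDs of interfaces whose roll-up data satisfies the rule.

    rollups: {interface ID: {"avg": ..., "p95": ...}} held on the collector
    rule:    (metric name, operator string, threshold) sent by the central system
    """
    metric, op, threshold = rule
    compare = OPS[op]
    return sorted(
        if_id
        for if_id, stats in rollups.items()
        if compare(stats[metric], threshold)
    )


# Local hourly roll-ups (percent utilization), kept on the remote collector.
rollups = {
    "rtr1:Gi0/1": {"avg": 41.0, "p95": 88.0},
    "rtr1:Gi0/2": {"avg": 12.0, "p95": 30.5},
    "rtr2:Gi0/1": {"avg": 63.0, "p95": 91.0},
}

# Rule downloaded from the central analysis system: 95th percentile above 80%.
rule = ("p95", ">", 80.0)
flagged = matching_interfaces(rollups, rule)
```

Only the short list of flagged interfaces needs to travel back to the central system; the roll-ups and raw data stay on the collector.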

Using these techniques, an NMS can increase its data collection performance and decrease its database storage requirements.  The end result is an increase in overall system performance, which can be applied to making the UI run faster.  And that's a good thing.

  -Terry
 

Comments

 

dyarashus said:

What about a compromise of configuring the database to keep the raw data tables in RAM rather than on disk? That would dramatically improve access times, while not requiring programmers to recreate functionality they already have.

The obvious risk is that all the raw data would be lost when the system was interrupted, but you could mitigate against that by some sort of periodic snapshot to disk.  That sounds to me like it would have some promise unless you're saying that it's the computation of the indexing that's the performance issue, not the I/O involved with all the transactions.

October 12, 2009 3:48 PM
 

tslattery said:

I am suggesting that the most recent data samples be kept in RAM to make the calculations of statistical data run very fast.  As the data becomes old enough to no longer be needed for statistical calculations, write it to disk.

It is the combination of indexing and the I/O of writing, then reading, the data from the database that is the problem.  It is possible to have a very fast system - see what Statseeker does in handling data from thousands of interfaces with a fast polling cycle.

 -Terry

November 11, 2009 8:15 PM

About tslattery

Terry Slattery, CCIE #1026, is a senior network engineer with decades of experience in the internetworking industry. Prior to joining Chesapeake NetCraftsmen as a full-time consultant, Terry was the founder and CTO of Netcordia, and inventor of NetMRI, a suite of network management products. Terry started Netcordia as a consulting company in 2000 and transitioned to a network management product company in 2003. During the consulting days, he used his network design and implementation skills to lead a team in the design and implementation of a high availability network at a brokerage clearing house. Terry is the former President and founder of Chesapeake Computer Consultants, Inc., a networking and computer systems training and consulting company. He co-invented and patented the vLab(tm) internet-based remote lab system. He is co-author of the McGraw Hill text Advanced IP Routing in Cisco Networks. Terry led the team that developed the current Cisco IOS user interface under contract to Cisco Systems. Terry is experienced in the design and installation of large TCP/IP based networks and is a successful network protocol instructor. He holds CCIE #1026, making him the second Cisco Certified Internetwork Expert (CCIE) and the first outside of Cisco. He enjoys membership on the Vanderbilt University Engineering School’s Industrial Advisory Board and the IEEE.
