rebuilding nagios at SGI
At work (SGI) I'm currently building a new
monitoring infrastructure using
nagios. The prototype server I've got set
up is also using
nagvis. I've been using nagios for close
to eight years now at a variety of places, but I'm new to nagvis. It's a
really great tool!
Basically, there's an add-on called ndomod that will hook into nagios and
log all events to a mysql database. The nagvis tool, then, will hook into
this database and display selected "maps". A "map" in nagvis (from what I've
seen so far) is basically an image background, on top of which are dropped
icons, gadgets, text-annotations, etc, which are interfaced with the nagios
events logged in mysql. For instance, some people have taken a picture of
several racks of equipment, then dropped little status icons on the picture
in nagvis, each icon being a live link back to the nagios host for the
device. Hovering the mouse cursor over an icon brings up all sorts of useful
info on the host/service in question, and the icon's color is determined by
the status of the nagios host/service. Slick.
Even better, you can drop "gadgets" on a map. A gadget is basically just a
web link (typically to a PHP script) which is passed info about the
host/service it's supposed to represent - things like the current value,
and the warning/critical thresholds for the host/service check in question.
The example gadget that comes with nagvis is a speedometer-type gauge but
one could easily make other custom gadgets. Pretty cool.
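
To make the gadget idea a bit more concrete, here's a rough sketch of the logic a custom gadget would need: take the check's value plus its warning/critical thresholds and decide how to draw itself. This is Python rather than the PHP that nagvis gadgets are normally written in, and it is not the nagvis gadget API. It assumes the input is one item in the standard nagios plugin performance-data format (label=value[UOM];warn;crit;min;max) and that the thresholds are plain numbers where higher is worse. The names parse_perfdata and gauge_state are made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# Matches a leading number, so a unit suffix such as '%' or 'ms' is ignored.
_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


class PerfValue(NamedTuple):
    label: str
    value: float
    warn: Optional[float]
    crit: Optional[float]


def _to_float(field: str) -> Optional[float]:
    """Return the leading number in a field, or None if there isn't one."""
    match = _NUMBER.match(field)
    return float(match.group()) if match else None


def parse_perfdata(item: str) -> PerfValue:
    """Parse one 'label=value[UOM];warn;crit;min;max' item (hypothetical helper)."""
    label, _, rest = item.partition("=")
    # Pad so that missing warn/crit fields simply come back as None.
    fields = rest.split(";") + ["", ""]
    value = _to_float(fields[0])
    if value is None:
        raise ValueError(f"no numeric value in perfdata item: {item!r}")
    return PerfValue(label.strip("'"), value,
                     _to_float(fields[1]), _to_float(fields[2]))


def gauge_state(perf: PerfValue) -> str:
    """Classify the value against its thresholds (assumes higher is worse)."""
    if perf.crit is not None and perf.value >= perf.crit:
        return "critical"
    if perf.warn is not None and perf.value >= perf.warn:
        return "warning"
    return "ok"
```

For example, gauge_state(parse_perfdata("load1=4.2;5;10;0;")) returns "ok", since 4.2 is below the warning threshold of 5. A real gadget would go on to render an image from that state. Range-style thresholds (like "10:20") are deliberately not handled here.
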
Basically, nagvis is a nice way to make user-friendly or management-friendly
"dashboards" for visualizing the status of complex services. For instance,
I'll have a nagvis map for our Clarify server (where support calls and
tickets are logged) containing one icon for the overall status of Clarify,
plus icons for the Clarify webservers, appservers, database servers, the
SAN switches the DB servers use, perhaps the StorageArrays on the SAN as well,
the web service/s on the webservers and appservers, the Oracle database
service on the DB servers, etc. All the stuff that has to be up and running
in order for someone to say "Clarify is working fine" will be summarized on
this map. And we'll have maps for top-level views of all the critical things
at SGI.
I'm also trying out a variety of web-based admin tools for nagios. One is
called lilac. It's based on the old
"fruity" web interface, which appears to be a stale project now. At first
glance it's functional, but its templates don't generate nagios templates -
they're used strictly for templating within lilac itself. Also, I haven't been
able to coerce it into letting me set hostinfo icons for hosts. I can set
the external-info URLs but not the icons. Go figure.
I'm in the middle of trying to test another one called
nagiosQL but I don't have it running
just yet.
I may simply stick to the text-file configs for now. They offer a lot of
flexibility (especially since the latest nagios allows for multiple
inheritance) but I'm also considering integrating our nagios infrastructure
with DCSi. At SGI, I set up a custom LDAP schema with object classes and
attributes for tracking servers/devices and applications, so we can quickly
see who owns which servers, where they're physically located (what building,
room, and footprint/rack they're in), what apps they run, who owns those
apps, who the emergency contacts are for servers/apps, any notes about servers
and apps that might be useful to whoever is on-call, and how each server's
console is connected (DRAC, IRIXconsole, cyclade, etc).
Our nagios server
uses the external info URLs (hostextinfo stuff) so that when nagios sends
us a page we can simply click on a link next to the host in nagios and get
a quick view of all this info. Anyway, I'm considering banging out some
quick perl, php, or java code to query the LDAP server and generate nagios
configs from it. It would be nice to be able to drop a record into LDAP
about a new server/service and have nagios automagically start doing the
right set of default host/service checks for it...
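
Here's a minimal sketch of what that generator might look like, in Python rather than the perl/php/java mentioned above. It skips the LDAP query entirely and starts from a list of plain dicts, which is roughly the shape a query result could be massaged into. The attribute names (cn, ipHostNumber, description, template), the generic-host template name, and the info URL are all assumptions for illustration, not the actual schema. The host directives themselves (use, host_name, alias, address, notes_url) are standard nagios object directives.

```python
from typing import Dict, Iterable


def render_host(record: Dict[str, str], info_url: str) -> str:
    """Render one nagios host definition from an LDAP-style record.

    info_url is a template containing '{host}', pointing at whatever page
    shows the LDAP details for that host (a hypothetical URL here).
    """
    name = record["cn"]
    directives = [
        # Fall back to a default template when the record doesn't name one.
        ("use", record.get("template", "generic-host")),
        ("host_name", name),
        ("alias", record.get("description", name)),
        ("address", record["ipHostNumber"]),
        ("notes_url", info_url.format(host=name)),
    ]
    body = "\n".join(f"    {key:<12}{value}" for key, value in directives)
    return "define host {\n" + body + "\n}\n"


def render_config(records: Iterable[Dict[str, str]], info_url: str) -> str:
    """Render all hosts, sorted by name so regenerated files diff cleanly."""
    ordered = sorted(records, key=lambda record: record["cn"])
    return "\n".join(render_host(record, info_url) for record in ordered)


if __name__ == "__main__":
    sample = [
        {"cn": "web02", "ipHostNumber": "10.0.0.12"},
        {"cn": "db01", "ipHostNumber": "10.0.0.20", "description": "Clarify DB"},
    ]
    print(render_config(sample, "http://ldapinfo.example.com/host?cn={host}"))
```

The default host/service checks would come from whatever template the "use" line points at, so adding a server to LDAP and regenerating the file would be enough to get it monitored. Generating the matching service definitions would follow the same pattern.
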