You Gotta Stop, Look And Listen, Baby...
Greg Bromage: Australia

It sneaks up on you, when you least expect it. Around about 3:15pm. Probably
on a Thursday.

It begins with a phone call at your desk. "I've got this strange message
on my screen. It says that I can't access my mail server. Why can't I get my
mail?" While you're going through the usual troubleshooting steps, your
mobile rings. "I've been working on this document for, like, 5 hours and my
machine has frozen."

A quick ping tells you the cause of the problem - the server has crashed. So
you start it up and the reason is obvious.

Only 150KB of disk space free.

This story is repeated hundreds of times every day, all around the world,
whether it's running out of disk space, database servers straining under
load or forgetting to update the anti-virus software until it's too late.

System Administrators are continually being asked (or told) to do more with
less, and amongst all the additional duties piled on to us, the one thing
that is always pushed to the bottom of the list is perhaps the most
essential one of all: System Monitoring.

Remember the days when you used to actually look at the log files? That was
back in the days when you knew the configuration of each and every machine
from memory and could guess, to within a few megabytes, the available
capacity. Those halcyon days can be yours again, with just a few simple
steps.

System Monitoring needs to be a priority. Block it out in your calendar if
you have to. Lock the door. Forward your phone to someone else. Do whatever
it takes, but make it known to everyone (especially your boss) that for, say,
one afternoon a week you are incommunicado. They won't be happy about it.
Convince them that having 1 person unavailable for 2 hours a week is better
and cheaper than having the entire network offline for 7 or 8 hours whilst
you fix a simple problem that could have been avoided.

Make a checklist of what you need to check so that you go through the same
routine every time. Include disk space, memory and CPU utilisation for every
server, plus traffic statistics, either as gross levels or broken down by
protocol. Just how much e-mail does your site receive per day? If you don't
know, then how can you tell at what point your internet link can't cope?
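
As a rough illustration of one checklist item, here is a minimal Python
sketch that reports free disk space for a path and flags it when it drops
below a threshold. The function name disk_report and the 10% threshold are
my own assumptions, not a recommendation. Memory and CPU figures need
platform-specific tools and are left out.

```python
import shutil


def disk_report(path=".", warn_below_pct=10.0):
    """Summarise disk usage for `path` and flag low free space.

    `warn_below_pct` is an illustrative threshold, not a recommended value.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    # Guard against odd filesystems that report a total size of zero.
    free_pct = 100.0 * usage.free / usage.total if usage.total else 0.0
    return {
        "path": path,
        "total_mb": usage.total // (1024 * 1024),
        "free_mb": usage.free // (1024 * 1024),
        "free_pct": round(free_pct, 1),
        "warning": free_pct < warn_below_pct,
    }
```

Calling disk_report("/var") on each server, for example, gives you one row
of numbers to record per visit.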

Keep a spreadsheet of the numbers you find. This is a great way to justify
not only the time you're taking to do the monitoring, but also your budget.
Management people (especially accounts) like colourful graphs. Learn phrases
like: "Based on the current trend, we'll need to buy more disk space in
September next year." Knowing that sort of thing, and having documentation
to back it up, also makes for a better working relationship come budget
time.
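
The "based on the current trend" sentence can come straight from the
recorded numbers. The sketch below fits a straight line to weekly
free-space readings and estimates how many weeks remain until the line
reaches zero. It assumes one reading per week, evenly spaced, and roughly
linear growth; the name weeks_until_full is made up for this example.

```python
def weeks_until_full(free_mb_readings):
    """Estimate weeks from the last reading until free space reaches zero.

    Uses an ordinary least-squares line through equally spaced readings.
    Returns None if there are fewer than two readings or if free space
    is not shrinking.
    """
    n = len(free_mb_readings)
    if n < 2:
        return None
    xs = range(n)  # week index: 0, 1, 2, ...
    mean_x = sum(xs) / n
    mean_y = sum(free_mb_readings) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(xs, free_mb_readings))
    slope = sxy / sxx  # change in free MB per week
    if slope >= 0:
        return None  # flat or growing free space: no projected date
    intercept = mean_y - slope * mean_x
    zero_at = -intercept / slope  # week index where the line crosses zero
    return zero_at - (n - 1)
```

With readings of 1000, 900, 800 and 700 MB free, the line drops by 100 MB a
week, so the estimate is 7 weeks after the last reading.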

Establish a baseline. How are you going to know what's out of the ordinary
until you know what "normal" is? Get to know your network traffic.
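
One simple way to put a number on "normal" is to compare today's figure with
the average and spread of past readings. This is only a sketch: the
three-standard-deviation tolerance is an assumption, and real traffic with
weekly cycles may need something smarter. The function name is_unusual is
hypothetical.

```python
import statistics


def is_unusual(history, today, tolerance=3.0):
    """Return True if `today` is far outside the range seen in `history`.

    `history` needs at least two readings. `tolerance` is the number of
    standard deviations treated as normal (an illustrative default).
    """
    mean = statistics.mean(history)
    spread = statistics.stdev(history)  # sample standard deviation
    if spread == 0:
        # Every past reading was identical, so any difference stands out.
        return today != mean
    return abs(today - mean) > tolerance * spread
```

Given a history of 1000, 1100, 950, 1050 and 1000 messages a day, a day
with 1080 messages passes quietly, while a day with 5000 gets flagged.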

Automate it where possible. Remember how computers were going to make life
easier for us humans? Yeah, me neither... But consider the virtues of the
humble command scheduler (at or cron, depending on whether you're in an MS
or *nix world). Why go to each server to collect the data when you can have
each server automatically collect the statistics and e-mail them to you?
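
Here is a minimal sketch of what such a scheduled job might look like in
Python, using only the standard library. The function names, the addresses
you pass in and the SMTP host "localhost" are all assumptions; adapt them to
whatever mail relay your site actually uses.

```python
import shutil
import smtplib
import socket
from email.message import EmailMessage


def build_report(paths, sender, recipient):
    """Build an e-mail listing free disk space for each path on this host."""
    host = socket.gethostname()
    lines = []
    for path in paths:
        usage = shutil.disk_usage(path)
        lines.append(
            f"{path}: {usage.free // 2**20} MB free "
            f"of {usage.total // 2**20} MB"
        )
    msg = EmailMessage()
    msg["Subject"] = f"Disk report for {host}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(lines))
    return msg


def send_report(msg, smtp_host="localhost"):
    """Hand the message to an SMTP relay (assumed to accept local mail)."""
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)
```

Saved as a script that calls send_report(build_report(...)), it could be
run from cron with an entry such as
"0 7 * * 1 python3 /path/to/report.py" (a placeholder path), which fires
at 7am every Monday.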

Do you have any thoughts on the subject? Or some time-saving scripts to
automate your monitoring? Send them along to me at greg.bromage@js.com.au.