Tuesday, March 27, 2012

Adding custom services for a Windows host monitored by Nagios (Monitoring an MS SQL Database Table)

Nagios ships with many built-in checks that can easily be configured for a host. For a Windows host, checks such as CPU load, drive space and running processes are available through the various parameters of the check_nt command.
Sometimes, however, you need to monitor something more specific: a service that is part of an application, the size of a particular file, a table in a database, and so on. In these cases you have to write your own program or script, and Nagios has to be configured to run it, read its output and show the resulting service status.

NRPE is the Nagios agent that can run your own custom programs on a monitored host. On Windows, NRPE functionality is built into NSClient++. This post discusses monitoring a table in an MS SQL database.
Let's assume there is an MS SQL table named ‘job’ in a database named ‘mydatabase’. We want to count the rows whose ‘job_date’ falls within the last hour and whose ‘job_done_status’ is 1. If there are no such rows, the script reports CRITICAL to Nagios; otherwise it reports OK. The batch script for monitoring this table is given below.
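@cls
@echo off
rem job status check

rem Count the jobs done in the last hour and write the result to the file Job_Status
osql /U myuser /P mypass /S db_server /d mydatabase -h-1 /Q "SELECT COUNT(*) as Result FROM job where job_date BETWEEN DATEADD(hh,-1,GETDATE()) AND GETDATE() AND job_done_status=1" /o Job_Status
rem Read the first line of the output file (the count) into num
set /p num=<Job_Status
echo Number of job done in last 1 hour is %num%
rem Exit code 2 = CRITICAL, 0 = OK
if %num% == 0 (
exit 2
) else (
exit 0
)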

The batch script above uses the osql command-line tool to query the MS SQL database; please refer to the osql documentation for the parameters used. The count of rows with ‘job_done_status’ 1 in the last hour is stored in the variable ‘num’. Note that the value of ‘num’ is printed with the echo command: this echoed line is what Nagios shows in the Status Information column of its web interface. Nagios determines the status of the service from the exit code returned by the program.
Status      Exit code
Critical    2
Warning     1
Ok          0
The script above does not use the WARNING status. If ‘num’ is 0, it exits with 2, which is the CRITICAL status; if ‘num’ is greater than 0, it exits with 0, which is the OK status.
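The mapping from row count to status can also be sketched in Python, which makes it easier to see where a WARNING level would fit. This is only an illustration of the logic, not a replacement for the batch script: the function names and the warn_below threshold are hypothetical and do not appear in the script above.

```python
# Nagios plugin exit codes, as listed in the table above
OK, WARNING, CRITICAL = 0, 1, 2


def parse_count(first_line):
    """Convert the first line of the osql output file to an integer.

    osql pads the number with spaces, so the line is stripped first.
    """
    return int(first_line.strip())


def evaluate(count, warn_below=0):
    """Return (exit_code, message) for a given job count.

    warn_below is a hypothetical threshold that the batch script does not
    have: counts above 0 but below it give WARNING. With the default of 0
    the result matches the batch script (CRITICAL on 0, OK otherwise).
    """
    message = "Number of job done in last 1 hour is %d" % count
    if count == 0:
        return CRITICAL, message
    if count < warn_below:
        return WARNING, message
    return OK, message
```

A plugin built on this sketch would print the message and then exit with the returned code, which is exactly what the echo and exit lines of the batch script do.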
Save the script above as a batch file named check_job.bat in C:\. Now open nsc.ini and add the following line to the [NRPE Handlers] section.
[NRPE Handlers]
command[check_job_status]="C:\check_job.bat"

Note that the NRPE command that runs our batch file is named check_job_status. After adding the line above, restart NSClient++.
On the Nagios server, open your configuration file in the objects folder. Windows hosts are generally kept in the windows.cfg file. Add the following lines to that file.
define service{
use generic-service
host_name My_DB_Server
service_description Check Job Status
check_command check_nrpe!check_job_status
}

Restart Nagios: /etc/init.d/nagios restart


My_DB_Server is the host on which we placed our script check_job.bat. Note that the service uses the same command name, check_job_status, that we declared in nsc.ini. Also note that it uses the check_nrpe command. For check_nrpe to work, NRPE must be installed on the Nagios server as well, and after installing it you have to add the following lines to the commands.cfg file.
define command{
command_name check_nrpe
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
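Before relying on the web interface, it can help to run the same check by hand from the Nagios server. The following is a minimal sketch of that in Python, assuming the check_nrpe plugin lives in /usr/local/nagios/libexec (a common location, but an assumption here; use whatever directory $USER1$ points to on your system). The helper names are hypothetical. Exit code 3 (UNKNOWN) is a standard Nagios status that the script in this post does not use.

```python
import subprocess

# Nagios status names by plugin exit code
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

# Assumed plugin location; adjust to your installation
DEFAULT_PLUGIN = "/usr/local/nagios/libexec/check_nrpe"


def build_nrpe_command(host, command, plugin=DEFAULT_PLUGIN):
    """Build the same argument list that the check_nrpe command definition uses."""
    return [plugin, "-H", host, "-c", command]


def status_name(exit_code):
    """Translate a plugin exit code into a Nagios status name."""
    return STATUS_NAMES.get(exit_code, "UNKNOWN")


def run_check(host, command):
    """Run check_nrpe against a host and return (status, output text)."""
    result = subprocess.run(
        build_nrpe_command(host, command),
        capture_output=True,
        text=True,
    )
    return status_name(result.returncode), result.stdout.strip()
```

Calling run_check with the address of My_DB_Server and check_job_status should return the status together with the line echoed by check_job.bat, provided NSClient++ accepts NRPE connections from the Nagios server.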

Don't forget to restart Nagios every time you change one of its configuration files.

Once all of the steps above are done, a new service ‘Check Job Status’ appears for the host ‘My_DB_Server’. If the Status shown in the web interface is OK, Status Information reads ‘Number of job done in last 1 hour is XX’, where XX is a value greater than 0. If the Status is CRITICAL, Status Information reads ‘Number of job done in last 1 hour is 0’ and the row is colored red.
