Over at INEX we’ve embarked on a forklift upgrade of the primary peering LAN using Extreme Networks Summit X670s and X460s. As usual, we need to monitor these 24/7, so we have just written a new Extreme Networks chassis monitoring script which should work with most Extreme devices.
It will check and generate alerts on the following items:
- a warning if the device was recently rebooted;
- a warning / critical if any found temperature sensors are in a non-normal state;
- a warning / critical if any found fans are in a non-normal state;
- a warning / critical if any found PSUs are in a non-normal state (or missing);
- a warning / critical if the 5 sec CPU utilisation is above set thresholds;
- a warning / critical if the memory utilisation is above set thresholds.
You’ll find the script in this Github repository: barryo/nagios-plugins.
Some sample verbose output follows:
./check_chassis_extreme.php -c -h 10.11.12.13 -v
CPU: 5sec - 4%
Last reboot: 4007.7666666667 minutes ago
Uptime: 2.8 days.
PSU: 1 - presentOK (Serial: 1429W-xxxxx; Source: ac)
PSU: 2 - presentOK (Serial: 1429W-xxxxx; Source: ac)
Fan: 101 - OK (4388 RPM)
Fan: 102 - OK (9273 RPM)
Fan: 103 - OK (4428 RPM)
Fan: 104 - OK (9273 RPM)
Fan: 105 - OK (4551 RPM)
Fan: 106 - OK (9452 RPM)
Over temp alert: NO
Memory used in slot 1: 29%
OK - CPU: 5sec - 4%. Uptime: 2.8 days. PSUs: 1 - presentOK;
2 - presentOK;. Overall system power state: redundant power
available. Fans: [101 - OK (4388 RPM)]; [102 - OK (9273 RPM)];
[103 - OK (4428 RPM)]; [104 - OK (9273 RPM)]; [105 - OK (4551
RPM)]; [106 - OK (9452 RPM)];. Temp: 39'C.
Memory (slot:usage%): 1:29%.
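To hook the script into Nagios, a command and service definition along the following lines should do it. The plugin path, service template and host name are assumptions for illustration, and I’m assuming -c takes the SNMP community (its value appears redacted in the sample invocation above):

```
# 'check_chassis_extreme' command definition (path is an assumption)
define command {
    command_name    check_chassis_extreme
    command_line    /usr/lib/nagios/plugins/check_chassis_extreme.php \
                        -c $ARG1$ -h $HOSTADDRESS$
}

define service {
    use                 generic-service     ; assumed template
    host_name           peering-switch1     ; example host
    service_description Chassis
    check_command       check_chassis_extreme!public
}
```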
I get caught by the following quite often (too many Nagios installations!):
Error: Could not stat() command file ‘/var/lib/nagios3/rw/nagios.cmd’!
The external command file may be missing, Nagios may not be running, and/or Nagios may not be checking external commands. An error occurred while attempting to commit your command for processing.
The correct way to fix this in Ubuntu is:
service nagios3 stop
dpkg-statoverride --update --add nagios www-data 2710 /var/lib/nagios3/rw
dpkg-statoverride --update --add nagios nagios 751 /var/lib/nagios3
service nagios3 start
I came across Pushover recently which makes it easy to send real-time notifications to your Android and iOS devices. And easy it is. It also allows you to set up applications with logos so that you can have multiple Nagios installations shunting alerts to you via Pushover, with each one easily identifiable. After just a day playing with this, it’s much nicer than SMS.
So, to set up Pushover with Nagios, first register for a free Pushover account. Then create a new application for your Nagios instance. I set the type to Script and also uploaded a logo. After this, you will be armed with two crucial pieces of information: your application API token/key ($APP_KEY) and your user key ($USER_KEY).
To get the notification script, clone this GitHub repository or just download this file – notify-by-pushover.php.
You can test this immediately with:
echo "Test message" | \
./notify-by-pushover.php HOST $APP_KEY $USER_KEY RECOVERY OK
The parameters are:
USAGE: notify-by-pushover.php <HOST|SERVICE> <$APP_KEY> \
    <$USER_KEY> <$NOTIFICATIONTYPE> <$STATE>
Now, set up the new notifications in Nagios / Icinga:
# 'notify-by-pushover-service' command definition
define command {
    command_name notify-by-pushover-service
    command_line /usr/bin/printf "%b" "$NOTIFICATIONTYPE$: \
        $SERVICEDESC$@$HOSTNAME$: $SERVICESTATE$ \
        ($SERVICEOUTPUT$)" | \
        /path/to/notify-by-pushover.php \
        SERVICE $APP_KEY $CONTACTADDRESS1$ \
        $NOTIFICATIONTYPE$ $SERVICESTATE$
}
# 'notify-by-pushover-host' command definition
define command {
    command_name notify-by-pushover-host
    command_line /usr/bin/printf "%b" "Host '$HOSTALIAS$' \
        is $HOSTSTATE$: $HOSTOUTPUT$" | \
        /path/to/notify-by-pushover.php \
        HOST $APP_KEY $CONTACTADDRESS1$ $NOTIFICATIONTYPE$ \
        $HOSTSTATE$
}
Then, in your contact definition(s) add / update as follows:
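The essentials for the contact are to point the notification commands at the Pushover ones and to put your Pushover user key in address1, which is what the commands pass along as $CONTACTADDRESS1$. A sketch, with an assumed template and contact name:

```
define contact {
    use                             generic-contact         ; assumed template
    contact_name                    joe-pushover            ; example contact
    service_notification_commands   notify-by-pushover-service
    host_notification_commands      notify-by-pushover-host
    address1                        YOUR_PUSHOVER_USER_KEY
}
```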
Make sure you break something to test that this works!
MySQL Master-Master replication is a common practice and is implemented by having the auto-increment on primary keys increase by n where n is the number of master servers. For example (in my.cnf):
auto-increment-increment = 2
auto-increment-offset = 1
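The effect of these two settings is that each master hands out interleaved, non-colliding primary keys (offset + n × increment). A quick sketch of the arithmetic:

```python
# Sketch: primary keys generated by each master under
# auto-increment-increment = 2 and differing offsets.
def generated_ids(offset, increment, count):
    """First `count` auto-increment values a master will hand out."""
    return [offset + i * increment for i in range(count)]

print(generated_ids(1, 2, 5))  # master with offset 1: [1, 3, 5, 7, 9]
print(generated_ids(2, 2, 5))  # master with offset 2: [2, 4, 6, 8, 10]
```

With two masters, one takes the odd keys and the other the even keys, so concurrent inserts on both never collide on the primary key.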
This article is not about implementing this but rather about recovering from it when it fails. A word of caution – this form of master-master replication is little more than a useful hack that tends to work. It is typically used to implement hot stand-by master servers along with a VRRP-like protocol on the database IP. If you implement this with a high volume of writes, or with the expectation of writing to both masters without the application being aware of it, you can expect a world of pain!
It’s also essential that you use Nagios (or another tool) to monitor the slave replication on all masters so you know when an issue crops up.
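The stock check_mysql plugin can do this via its slave-check option; a command definition along these lines should work (the plugin path and credentials are assumptions for your own setup):

```
# 'check_mysql_slave' command definition
define command {
    command_name    check_mysql_slave
    command_line    /usr/lib/nagios/plugins/check_mysql -H $HOSTADDRESS$ \
                        -u nagios -p $ARG1$ --check-slave
}
```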
So, let’s assume we have two master servers and one has failed. We’ll call these the Good Server (GS) and the Bad Server (BS). It may be the case that replication has failed on both and then you’ll have the nightmare of deciding which to choose as the GS!
- You will need the BS to not process any queries from here on in. This may already be the case in a FHRP (e.g. VRRP) environment; but if not, use combinations of stopping services, firewalls, etc to stop / block access to the BS. It is essential that the BS does not process any queries besides our own during this process.
- On the BS, execute STOP SLAVE to prevent it replicating from the GS during the process.
- On the GS, execute:
- STOP SLAVE; (to stop it taking replication information from the bad server);
- FLUSH TABLES WITH READ LOCK; (to stop it updating for a moment);
- SHOW MASTER STATUS; (and record the output of this);
- Switch to the BS and import all the data from the GS via something like: mysqldump -h GS -u root -psoopersecret --all-databases --quick --lock-all-tables | mysql -h BS -u root -psoopersecret. Note that I am assuming that you are replicating all databases here. Change as appropriate if not.
- You can now switch back to the GS and execute UNLOCK TABLES to allow it to process queries again.
- On the BS, set the master status with the information you recorded from the GS via: CHANGE MASTER TO master_log_file='mysql-bin.xxxxxx', master_log_pos=yy;
- Then, again on the BS, execute START SLAVE. The BS should now be replicating from the GS again and you can verify this via SHOW SLAVE STATUS.
- We now need to have the GS replicate from the BS again. On the BS, execute SHOW MASTER STATUS and record the information. Remember that we have stopped the execution of queries on the BS in step 1 above. This is essential.
- On the GS, using the information just gathered from the BS, execute: CHANGE MASTER TO master_log_file='mysql-bin.xxxxxx', master_log_pos=yy;
- Then, on the GS, execute START SLAVE. You should now have two way replication again and you can verify this via SHOW SLAVE STATUS on the GS.
- If necessary, undo anything from step 1 above to put the BS back into production.
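Pulling the SQL from the steps above together, the sequence looks like this (the log file and position come from the SHOW MASTER STATUS output at each step):

```
-- On the BS: stop replicating, and (per step 1) ensure no other queries run
STOP SLAVE;

-- On the GS: stop replication, freeze writes, note the master position
STOP SLAVE;
FLUSH TABLES WITH READ LOCK;
SHOW MASTER STATUS;   -- record File and Position

-- (shell) dump the data from the GS into the BS, then, back on the GS:
UNLOCK TABLES;

-- On the BS: point it at the recorded GS position and restart replication
CHANGE MASTER TO master_log_file='mysql-bin.xxxxxx', master_log_pos=yy;
START SLAVE;
SHOW SLAVE STATUS;

-- Finally, repeat the CHANGE MASTER / START SLAVE on the GS using the
-- SHOW MASTER STATUS output gathered on the BS.
```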
There is a --master-data switch for mysqldump which would remove the requirement to lock the GS server above but, in our practical experience, there are various failure modes for the BS and the --master-data method does not work for them all.
It is good practice to separate Nagios checks of your web server being available from checking SSL certificate expiry. The latter need only be run once per day and should not add unnecessary noise to a more immediately important web service failure.
To use check_http to monitor SSL certificate expiry dates, first ensure you have a daily service definition – let’s call this
service-daily. Now create two service commands as follows:
# 'check-ssl-cert' command definition
define command {
    command_name    check-ssl-cert
    command_line    /usr/lib/nagios/plugins/check_http -S \
                        -I $HOSTADDRESS$ -w 5 -c 10 -p $ARG1$ -C $ARG2$
}

# 'check-ssl-cert-ip' command definition
define command {
    command_name    check-ssl-cert-ip
    command_line    /usr/lib/nagios/plugins/check_http -S \
                        -I $ARG3$ -w 5 -c 10 -p $ARG1$ -C $ARG2$
}
The second is useful for checking named certificates on additional IP addresses on web servers serving multiple SSL domains.
We can use these to check SSL certificates for POP3, IMAP, SMTP and HTTP:
service_description POP3 SSL Certificate
service_description IMAP SSL Certificate
service_description SMTP SSL Certificate
service_description SSL Cert: www.example.com
service_description SSL Cert: www.example.net
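A complete service definition for one of these might look as follows – the host name is an example, and I’m assuming the first of the two commands above is given a command_name of check-ssl-cert. The arguments are the port and the days-before-expiry threshold (with the second command taking the extra IP as a third argument):

```
define service {
    use                 service-daily           ; the daily template from above
    host_name           webserver01             ; example host
    service_description SSL Cert: www.example.com
    check_command       check-ssl-cert!443!21   ; port 443, warn 21 days out
}
```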
We’ve just added a check_rsnapshot.php script to our nagios-plugins bundle on Github. This script will verify rsnapshot backups via Nagios using a number of checks / tests:
- minfiles – checks the number of files in a snapshot against a minimum expected number;
- minsize – checks the size of a snapshot against a minimum expected size;
- log – parses the rsnapshot log to ensure the most recent runs for each retention period completed successfully;
- timestamp – checks for files created server side containing a timestamp and thus ensuring snapshots are succeeding;
- rotation – checks that retention directories are being rotated; and
- dir-creation – checks that retention directories are being created.
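To illustrate the idea behind the timestamp test (this is a sketch of the concept, not the plugin’s actual code): the backed-up host periodically writes a file containing the current UNIX timestamp into the tree being snapshotted, and the check confirms that the copy in the latest snapshot is recent enough.

```python
# Assumed maximum age for a daily snapshot's timestamp file:
# one day plus a couple of hours' slack (an assumption for this sketch).
MAX_AGE_SECONDS = 26 * 3600

def timestamp_fresh(contents: str, now: float,
                    max_age: int = MAX_AGE_SECONDS) -> bool:
    """True if the UNIX timestamp recorded in `contents` is recent enough."""
    return (now - int(contents.strip())) <= max_age

# One hour old: snapshots are running; a week old: they are failing.
print(timestamp_fresh("1700000000", 1700000000 + 3600))       # True
print(timestamp_fresh("1700000000", 1700000000 + 7 * 86400))  # False
```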
Please see this Github wiki page for more information including instructions.
Over at Open Solutions, we create a lot of Nagios installations – for our own systems, for customer systems which we manage, and as a service. We’ve written a lot of custom Nagios plugins over the years as part of this process.
We are now making a concerted effort to find them, clean them, maintain them centrally and release them for the good of others.
To that end, we have created a repository on GitHub for the task with a detailed readme file:
The main goals for the Nagios plugins we write and release are:
- a BSD (or BSD-like) license so you can hack away and mould them into something that may be more suitable for your own environment;
- scalable in that if we are polling power supply units (PSUs) in a Cisco switch then it should not matter if there is one or a hundred – the script should handle them all;
- WARNINGs are designed for email notifications during working hours; CRITICAL means an out of hours text / SMS message;
- each script should be an independent unit with no dependencies on the others or unusual Perl module requirements;
- the scripts should all be run with --verbose on new kit first. This will provide an inventory of what is found as well as show anything that is being skipped. OIDs searched for by the script but reported as not supported on the target device should be skipped via the appropriate command line options;
- useful help should be available via --help.
Over the past ten years I have left many many new and hacked Nagios plugins on many servers around the globe. I’m now making a concerted effort to find them, clean them, maintain them centrally and release them.
To that end, I have created a repository on GitHub for the task with a detailed readme file:
As a starting point, there are four plugins available now:
- check_chassis_cisco.pl – a script to poll a Cisco switch or router and check if the device was recently rebooted; its temperature sensors; its fans; its PSU; its CPU utilisation; and its memory usage.
- check_chassis_server.pl – a script to poll a Linux / BSD server and check its load average; memory and swap usage; and if it has been recently rebooted.
- check_portsecurity.pl – a script which checks all ports on a Cisco switch and issues a critical alert if port security has been triggered, resulting in a shutdown port on the device.
- check_portstatus.pl – a script which will issue warnings if the port status on any Ethernet (by default) port on a Cisco switch has changed within the last hour (by default), i.e. a port up or a port down event.