Thursday, March 18, 2010

Monitoring MySQL Replication with mk-heartbeat.

A quick howto on using mk-heartbeat to monitor MySQL Replication and graphing it with zenoss/any other snmp grapher/monitor


Master End

  • Download Mk-heartbeat

wget http://www.maatkit.org/get/mk-heartbeat
  • Create a database and grant for maatkit

create database maatkit;
grant all on maatkit.* to maatkit@localhost identified by 'Whatever';
  • Start the heartbeat and let it auto-create the heartbeat table

perl mk-heartbeat -u maatkit -pWhatever --database maatkit --table heartbeat --update --create-table

Slave End

  • Verify that the DDL replicated

mysql> select * from maatkit.heartbeat;
+----+---------------------+
| id | ts |
+----+---------------------+
| 1 | 2010-03-02 16:30:32 |
+----+---------------------+
  • Let mk-heartbeat run

mk-heartbeat -u maatkit -pWhatever --database maatkit --table heartbeat --monitor  --interval 15 --file /tmp/heartbeat --daemonize
  • Fix file selinux context (I don't think it's the right one but it works )

chcon system_u:object_r:snmpd_var_run_t:s0 /tmp/heartbeat
  • Why not look at how far we are lagging by querying (Wow! We are lagging)

mysql> select ts,now() from maatkit.heartbeat;
+---------------------+---------------------+
| ts | now() |
+---------------------+---------------------+
| 2010-03-16 12:52:45 | 2010-03-17 17:59:23 |
+---------------------+---------------------+
1 row in set (0.00 sec)

mysql> select ts,now() from maatkit.heartbeat;
+---------------------+---------------------+
| ts | now() |
+---------------------+---------------------+
| 2010-03-16 12:53:15 | 2010-03-17 17:59:29 |
+---------------------+---------------------+
1 row in set (0.00 sec)

mysql> select ts,now() from maatkit.heartbeat;
+---------------------+---------------------+
| ts | now() |
+---------------------+---------------------+
| 2010-03-16 12:53:30 | 2010-03-17 17:59:32 |
+---------------------+---------------------+
  • Integrate with SNMP using dedicated to this purpose

  • Extend snmpd.conf

extend .1.3.6.1.4.1.30630.2.2.1 mk-heartbeat-lag /usr/local/bin/repl_lag 1
extend .1.3.6.1.4.1.30630.2.2.2 mk-heartbeat-lag /usr/local/bin/repl_lag 2
extend .1.3.6.1.4.1.30630.2.2.3 mk-heartbeat-lag /usr/local/bin/repl_lag 3
extend .1.3.6.1.4.1.30630.2.2.4 mk-heartbeat-lag /usr/local/bin/repl_lag 4
  • Create repl_lag script

#!/bin/sh
#Monitor file
LAGFILE=/tmp/heartbeat
case "$1" in
'1')
#Current LAG
cat $LAGFILE|awk '{print $1}'|sed -e "s/s//"
;;
'2')
#1 minute average
cat $LAGFILE|awk -F[ '{print $2}'|awk '{print $1}'|sed -e "s/s,//"
;;
'3')
#5 Minute average
cat $LAGFILE|awk -F[ '{print $2}'|awk '{print $2}'|sed -e "s/s,//"
;;
'4')
cat $LAGFILE|awk -F[ '{print $2}'|awk '{print $3}'|sed -e "s/s//"
;;
esac
  • Restart snmpd and test

snmpwalk -c Community -v2c my.mu.company.net .1.3.6.1.4.1.30630.2.2.4
SNMPv2-SMI::enterprises.30630.2.2.4.1.0 = INTEGER: 1
SNMPv2-SMI::enterprises.30630.2.2.4.2.1.2.16.109.107.45.104.101.97.114.116.98.101.97.116.45.108.97.103 = STRING: "/usr/local/bin/repl_lag"
SNMPv2-SMI::enterprises.30630.2.2.4.2.1.3.16.109.107.45.104.101.97.114.116.98.101.97.116.45.108.97.103 = STRING: "4"
SNMPv2-SMI::enterprises.30630.2.2.4.2.1.4.16.109.107.45.104.101.97.114.116.98.101.97.116.45.108.97.103 = ""
SNMPv2-SMI::enterprises.30630.2.2.4.2.1.5.16.109.107.45.104.101.97.114.116.98.101.97.116.45.108.97.103 = INTEGER: 5
SNMPv2-SMI::enterprises.30630.2.2.4.2.1.6.16.109.107.45.104.101.97.114.116.98.101.97.116.45.108.97.103 = INTEGER: 1
SNMPv2-SMI::enterprises.30630.2.2.4.2.1.7.16.109.107.45.104.101.97.114.116.98.101.97.116.45.108.97.103 = INTEGER: 1
SNMPv2-SMI::enterprises.30630.2.2.4.2.1.20.16.109.107.45.104.101.97.114.116.98.101.97.116.45.108.97.103 = INTEGER: 4
SNMPv2-SMI::enterprises.30630.2.2.4.2.1.21.16.109.107.45.104.101.97.114.116.98.101.97.116.45.108.97.103 = INTEGER: 1
SNMPv2-SMI::enterprises.30630.2.2.4.3.1.1.16.109.107.45.104.101.97.114.116.98.101.97.116.45.108.97.103 = STRING: "0.41"
SNMPv2-SMI::enterprises.30630.2.2.4.3.1.2.16.109.107.45.104.101.97.114.116.98.101.97.116.45.108.97.103 = STRING: "0.41"
SNMPv2-SMI::enterprises.30630.2.2.4.3.1.3.16.109.107.45.104.101.97.114.116.98.101.97.116.45.108.97.103 = INTEGER: 1
SNMPv2-SMI::enterprises.30630.2.2.4.3.1.4.16.109.107.45.104.101.97.114.116.98.101.97.116.45.108.97.103 = INTEGER: 0
SNMPv2-SMI::enterprises.30630.2.2.4.4.1.2.16.109.107.45.104.101.97.114.116.98.101.97.116.45.108.97.103.1 = STRING: "0.41"

Monitor End

  • Log into zenoss and navigate to Devices->Server->[Linux o r whatever you use]

  • Click on the Templates tab

  • From the Available Perfomance Templates tab, click on add template

  • Type in an in. Mine was MySQL Replication

  • Add a Description and save

  • Under Data sources, Add a new snmp datasource. Example: ID: MySQL Replicationg Lag

  • In the Data Souce add the OID derived from the snmpwalk e.g 1.3.6.1.4.1.30630.2.2.1.4.1.2.16.109.107.45.104.101.97.114.116.98.101.97.116.45.108.97.103.1

  • The type is a gauge (Like a speed-o-meter ) rather than a counter (Which is like the miles covered)

  • Click Save

  • You may add the IP/Hostname of a test device and click on test (One of your MySQL snmpd hosts)

  • Save and iterate for the other counters (1min avg, 5 min avg and 15 min avg)

  • Go back to the Performance Template page

  • Under Thresholds, Add a new threshold say MySQL Lag Warn

  • Click on the datapoints to apply the threshold to. I opted to track the realtime Lag

  • Put the number of seconds on the max field. 120 is 2 minutes of lag. Any more and I will be paged

  • Add a critical threshold

  • Go back to the Performance Template page

  • Add a Graph Definition.

  • Add the datapoints created earlier under Graph points. The relevant associated thresholds should also be auto-added.

  • You may want to resequence the points

  • Click save.

  • Naviagate back to Device List using the menu bar on the left.

  • Click on a MySQL host that you want to add a replication lag graph for

  • On the dropdown menu, select More->Templates

  • Under Perfomance Templates, Select Bind Templates

  • Select all the templates you are interested in (Mine was Device and MySQL Replication)

  • Click OK

  • Click on the perf tab should give you a graph.

Here's mine showing a tunnel break that caused a > 24 hour lag and the subsequent catching up

And the network graph.


Yeah... It's a slow link :)