Hello!
MySQL replication has been known to easily break, as a result of a large multitude of potential causes.
Sometimes the replication can even break if an erroneous query is executed on the master server.
With all the potential issues that may break replication, we thought it prudent to write an automated check script that can run on a scheduled basis (i.e. every 10-15 minutes), check the Slave status, report on any errors if applicable and attempt to repair replication.
We have built this script to exit and send mail alerts if any step of the checking and repairing process fails or generates an error in itself.
The script also generates a lock file to ensure that no more than one check process can run at any given time. We feel this script could be best used for scenarios for remote MySQL slaves, for example. Adding this extra layer may ensure a more reliable replication.
The repair process is simply 3 MySQL Commands :
stop slave; reset slave; slave start;
The above directives assume that you have a master.info with the mysql master server information statically set. No CHANGE MASTER commands have to be executed as a result. Resetting the slave clears the error and resumes replication, and all the queries missed during the time it failed should be queued and applied after it starts again.
Here is the script :
#!/bin/sh # Slave replication auto recovery and alert # Star Dot Hosting 2012 currentmonth=`date "+%Y-%m-%d"` lock_file=/tmp/slave_alert.lck echo "MySQL Replication Check Script" > /var/log/replication_check.log 2>&1 echo "------------------------------" >> /var/log/replication_check.log 2>&1 echo "$currentmonth" >> /var/log/replication_check.log 2>&1 echo "" >> /var/log/replication_check.log 2>&1 # Check if lock file exists if [ -f $lock_file ]; then echo "Lock file exists! Possible conflict!" >> /var/log/replication_check.log 2>&1 mail_alert exit 1 else touch $lock_file fi # Fix slave function fix_replication () { mysql -u root --password="XXXXX" -Bse "stop slave" >> /var/log/replication_check.log 2>&1 if [ "$?" -eq 0 ]; then echo "Stop slave succeeded..." >> /var/log/replication_check.log 2>&1 else echo "Slave recover function failed" >> /var/log/replication_check.log 2>&1 mail_alert exit 1 fi mysql -u root --password="XXXXX" -Bse "reset slave" >> /var/log/replication_check.log 2>&1 if [ "$?" -eq 0 ]; then echo "Reset slave succeeded..." >> /var/log/replication_check.log 2>&1 else echo "Slave recover function failed" >> /var/log/replication_check.log 2>&1 mail_alert exit 1 fi mysql -u root --password="XXXXX" -Bse "slave start" >> /var/log/replication_check.log 2>&1 if [ "$?" -eq 0 ]; then echo "Slave start succeeded." >> /var/log/replication_check.log 2>&1 else echo "Slave recover function failed" >> /var/log/replication_check.log 2>&1 mail_alert exit 1 fi } # Alert function function mail_alert () { cat /var/log/replication_check.log | mail -s "Replication check errors!" your@email.com } # Check if Slave is running properly Slave_IO_Running=`mysql -u root --password="XXXXX" -Bse "show slave statusG" | grep Slave_IO_Running | awk '{ print $2 }'` Slave_SQL_Running=`mysql -u root --password="XXXXX" -Bse "show slave statusG" | grep Slave_SQL_Running | awk '{ print $2 }'` Last_error=`mysql -u root --password="XXXXX" -Bse "show slave statusG" | grep Last_error | awk -F : '{ print $2 }'` # If no values are returned, slave is not running if [ -z $Slave_IO_Running -o -z $Slave_SQL_Running ]; then echo "Replication is not configured or you do not have the required access to MySQL" exit 1 fi # If everythings running, remove lockfile if it exists and exit if [ $Slave_IO_Running == 'Yes' ] && [ $Slave_SQL_Running == 'Yes' ]; then rm $lock_file echo "Replication slave is running" >> /var/log/replication_check.log 2>&1 echo "Removed Alert Lock" >> /var/log/replication_check.log 2>&1 elif [ $Slave_SQL_Running == 'No' ] || [ $Slave_IO_Running == 'No' ]; then echo "SQL thread not running on server `hostname -s`!" >> /var/log/replication_check.log 2>&1 echo "Last Error:" $Last_error >> /var/log/replication_check.log 2>&1 fix_replication mail_alert rm $lock_file fi echo "Script complete!" >> /var/log/replication_check.log 2>&1 exit 0