...
On restoration of the service we reviewed the state of the box and found a large number of relay logs in /var/lib/mysql. This was unusual, so we checked the MySQL replication status and found that replication from the master (prod-cacti01-fra-de.geant.org) was broken, and according to mysqld.log had been broken since 25th February, due to a mismatch in the default values of the id column in three tables into which inserts from the master were being written. We fixed the mismatch and replication started and continued to flow. There was a further issue with replication from the prod-cacti02-vie-at.geant.org slave to the master prod-cacti01-fra-de.geant.org: the backup showed an earlier binlog number than the master expected for its replication. As that was momentary, we reset the slave process on prod-cacti01-fra-de.geant.org and replication again continued from vie to fra. As no further issues were being alerted systemically, we closed the incident and made the NOC aware that this situation would need review on Monday morning.
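The diagnosis and recovery described above can be sketched with standard MySQL replication statements. This is an illustrative outline only, not the exact commands run during the incident; the specific table names, column defaults, and binlog coordinates involved are not reproduced here.

```sql
-- On the broken replica: inspect Slave_IO_Running, Slave_SQL_Running
-- and Last_SQL_Error to find the failing insert from the master.
SHOW SLAVE STATUS\G

-- After correcting the mismatched column defaults on the replica,
-- restart the replication threads.
STOP SLAVE;
START SLAVE;

-- On prod-cacti01-fra-de.geant.org (as a replica of the vie slave),
-- clear the stale relay-log state; depending on the situation a
-- CHANGE MASTER TO with fresh binlog coordinates may be needed
-- before restarting replication.
STOP SLAVE;
RESET SLAVE;
START SLAVE;
```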
Incident severity: CRITICAL
Data loss: YES
The timeline of events as they unfolded is below:
...
- SWD to propose a solution to avoid the multi-master split-brain issue in future, so that the system can be made more robust.
- Set up monitoring of the two-way synchronisation of RRD files.
- During the investigation a bug relating to the Cacti GCS plugin was also identified; debug and fix this issue.