...
On restoration of the service we reviewed the state of the box and found a large number of relay logs in /var/lib/mysql. This was unusual, so we checked the MySQL replication status and found that replication from the master (prod-cacti01-fra-de.geant.org) was broken and, according to mysqld.log, had been broken since 25th February. The cause was a mismatch in the default values of id in three tables that inserts from the master were trying to write to. We fixed the mismatch, and replication restarted and continued to flow. There was a further issue with replication from the prod-cacti02-vie-at.geant.org slave to the master prod-cacti01-fra-de.geant.org. This was because the restored backup showed an earlier binlog number than the master expected for its replication. As that was momentary, we reset the slave process on prod-cacti01-fra-de.geant.org and replication again continued from vie to fra. As no further issues were being alerted systemically, we closed the incident and made NOC aware that this situation would need review on Monday morning.
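The replication status check described above can be illustrated with a minimal sketch. It assumes the vertical (`\G`) output of `SHOW SLAVE STATUS` has been captured as text. The helper names `parse_slave_status` and `replication_problems` are hypothetical, and the sample values are invented for illustration; they are not taken from the incident logs.

```python
# Invented sample of SHOW SLAVE STATUS\G output (illustrative only,
# not captured from the servers involved in this incident).
SAMPLE = """\
*************************** 1. row ***************************
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
                Last_IO_Error:
               Last_SQL_Error: example error text (illustrative only)
"""


def parse_slave_status(text):
    """Parse 'Key: value' lines from vertical status output into a dict."""
    status = {}
    for line in text.splitlines():
        # The row separator line contains no colon and is skipped here.
        if ":" not in line:
            continue
        # Split on the first colon only, since error text may contain colons.
        key, _, value = line.partition(":")
        status[key.strip()] = value.strip()
    return status


def replication_problems(status):
    """Return a list of problem descriptions; an empty list means healthy."""
    problems = []
    # Both replication threads must report "Yes" for replication to flow.
    for thread in ("Slave_IO_Running", "Slave_SQL_Running"):
        if status.get(thread) != "Yes":
            problems.append(f"{thread} is {status.get(thread, 'missing')}")
    # A non-empty error field explains why a thread stopped.
    for field in ("Last_IO_Error", "Last_SQL_Error"):
        if status.get(field):
            problems.append(f"{field}: {status[field]}")
    return problems


if __name__ == "__main__":
    for problem in replication_problems(parse_slave_status(SAMPLE)):
        print(problem)
```

A check along these lines, run periodically against each slave, is one possible way to surface a stopped SQL thread sooner than a build-up of relay logs would.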
The timeline of events as they unfolded is as follows:
Date | Time | Notes |
---|---|---|
06/03/2020 | 11:21 | First critical alert received. Decision made to review and see how fast the partition would consume space. |
06/03/2020 | 17:10 | Alert was reviewed again and the partition was found to be consuming more space than expected. |
06/03/2020 | 17:30 | Logged on and added a new disk via the VMware UI. Logged onto the server and attempted to extend the existing LVM in the usual manner. The server produced errors when the physical volume was created, with a message about a missing UUID, which pvs confirmed. Remediation attempts were unsuccessful and a reboot was requested to confirm whether a device rescan would fix the issue or provide more information. |
06/03/2020 | 19:54 | Emergency ticket raised to perform a reboot at 21:00; it was approved by NOC. |
06/03/2020 | 21:00 | Unfortunately the VM did not boot. Multiple fsck options were tried without success, so we were forced to restore from backup. |
06/03/2020 | 21:30 | Restore from backup was requested. |
06/03/2020 | 22:25 | After issues with the restore, a good VM version was restored and booted. |
06/03/2020 | 22:30 | Investigated the mass of relay logs in /var/lib/mysql |
06/03/2020 | 22:33 | Logged in to mysql on vie and reset the id default value in the data_template_data_rra, poller_item and data_input_data tables. |
06/03/2020 | 22:53 | Logged into prod-cacti01-fra-de.geant.net to fix the replication break caused by the restore showing an older binlog entry than expected. |