...
The timeline of events, as they unfolded, is as follows:
Date | Time | Notes |
---|---|---|
06/03/2020 | 11:21 | First critical alert received. Decided to review how quickly the partition was consuming space. |
06/03/2020 | 17:10 | Alert was reviewed again; the partition was found to be consuming more space than expected. |
06/03/2020 | 17:30 | Logged on and added a new disk via the VMware UI. Logged onto the server and attempted to extend the existing LVM volume in the usual manner. The server produced errors when the physical volume was created, with a message about a missing UUID, which pvs confirmed. Attempts to remediate the situation were unsuccessful, and a reboot was requested to confirm whether a device rescan would fix the issue or provide more information. |
06/03/2020 | 19:54 | Emergency ticket raised to perform a reboot at 21:00; it was approved by NOC. |
06/03/2020 | 21:00 | The VM did not boot. Multiple fsck options were tried without success, so we were forced to restore from backup. |
06/03/2020 | 21:30 | Restore from backup was requested. |
06/03/2020 | 22:25 | After issues with the restore, a good version of the VM was restored and booted. |
06/03/2020 | 22:30 | Investigated the mass of relay logs in /var/lib/mysql |
06/03/2020 | 22:33 | Logged in to MySQL on vie and reset the id default value in the data_template_data_rra, poller_item and data_input_data tables. |
06/03/2020 | 22:53 | Logged into prod-cacti01-fra-de.geant.net to fix the replication break caused by the restore, which showed an older binlog entry than expected. |
06/03/2020 | 23:14 | Notified NOC that this fix will need review on Monday. |
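
The first two timeline entries describe judging how fast the partition was filling. A minimal sketch of how that judgement could be made from two usage samples is shown below. It assumes linear growth between samples; the function name `estimate_time_to_full` and all usage figures are hypothetical, since the report does not record the actual numbers.

```python
from datetime import datetime, timedelta
from typing import Optional


def estimate_time_to_full(
    t1: datetime, used1_gb: float,
    t2: datetime, used2_gb: float,
    capacity_gb: float,
) -> Optional[timedelta]:
    """Linearly extrapolate two usage samples to the point the partition fills.

    Returns None when usage is flat or shrinking, because no fill time
    can be projected in that case.
    """
    elapsed_hours = (t2 - t1).total_seconds() / 3600
    if elapsed_hours <= 0:
        raise ValueError("t2 must be later than t1")

    growth_gb_per_hour = (used2_gb - used1_gb) / elapsed_hours
    if growth_gb_per_hour <= 0:
        return None

    remaining_gb = max(capacity_gb - used2_gb, 0.0)
    return timedelta(hours=remaining_gb / growth_gb_per_hour)


if __name__ == "__main__":
    # Sample times taken from the timeline (dates read as DD/MM/YYYY, an assumption).
    # The usage and capacity values are invented for illustration only.
    first_sample = datetime(2020, 3, 6, 11, 21)
    second_sample = datetime(2020, 3, 6, 17, 10)
    remaining = estimate_time_to_full(first_sample, 80.0, second_sample, 92.0, 100.0)
    print("Projected time until full:", remaining)
```

A projection like this is only as good as the linear-growth assumption. A burst of relay logs, as seen later in the timeline, could fill the partition much sooner than two samples suggest.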