...
Date | Time | Notes |
---|---|---|
06/03/2020 | 11:21 | First critical alert received. Decision made to monitor how quickly the partition would consume the remaining space. |
06/03/2020 | 17:10 | Alert was reviewed again; the partition was consuming space faster than expected. |
06/03/2020 | 17:30 | Logged on and added a new disk via the VMware UI. Logged onto the server and attempted to extend the existing LVM volume in the usual manner. The server produced errors when the physical volume was created, with a message about a missing UUID, which `pvs` confirmed. Remediation attempts were unsuccessful, and a reboot was requested to confirm whether a device rescan would fix the issue or provide more information. |
06/03/2020 | 19:54 | Emergency ticket raised to perform a reboot at 21:00; approved by the NOC. |
06/03/2020 | 21:00 | Unfortunately the VM did not boot. Multiple `fsck` options were tried without success, so we were forced to restore from backup. |
06/03/2020 | 21:30 | Restore from backup was requested. |
06/03/2020 | 22:25 | After initial issues with the restore, a known-good VM image was restored and booted. |
06/03/2020 | 22:30 | Investigated the large volume of relay logs in /var/lib/mysql. |
06/03/2020 | 22:33 | Logged in to MySQL on vie and reset the `id` default value in the `data_template_data_rra`, `poller_item`, and `data_input_data` tables. |
06/03/2020 | 22:53 | Logged into prod-cacti01-fra-de.geant.net to fix the replication break caused by the restore, which was showing an older binlog entry than expected. |
06/03/2020 | 23:14 | Notified the NOC that this fix would need review on Monday, but that replication was fixed. There was no indication at this point that anything was still broken. |
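
For reference, the LVM extension attempted at 17:30 normally follows the pattern sketched below. The device path `/dev/sdb`, the volume group `vg0`, and the logical volume `lv_data` are placeholder names, not taken from the incident.

```shell
# Rescan the SCSI bus so the newly added VMware disk appears
# (host0 is an assumption; repeat for other hosts if needed)
echo "- - -" > /sys/class/scsi_host/host0/scan

# Create a physical volume on the new disk -- this is the step that
# failed during the incident with the "missing UUID" error
pvcreate /dev/sdb

# List physical volumes; during the incident pvs confirmed the missing UUID
pvs

# Extend the volume group and logical volume, then grow the filesystem
# (vg0 and lv_data are hypothetical names)
vgextend vg0 /dev/sdb
lvextend -l +100%FREE /dev/vg0/lv_data
resize2fs /dev/vg0/lv_data   # use xfs_growfs instead for XFS
```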
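
The report does not record which `fsck` options were attempted at 21:00; a typical escalation on an unbootable ext filesystem looks like the following (the device name is a placeholder):

```shell
# Interactive check first
fsck /dev/sda1

# Answer yes to all repair prompts
fsck -y /dev/sda1

# Force a full e2fsck pass even if the filesystem is marked clean
e2fsck -f -y /dev/sda1

# Last resort: repair using an alternate superblock
# (32768 is a common backup location; mke2fs -n lists the actual ones)
e2fsck -b 32768 /dev/sda1
```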
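
The `id` reset performed at 22:33 was most likely an `AUTO_INCREMENT` adjustment on the affected Cacti tables; the sketch below is an assumption, and the counter value `1000` and database name `cacti` are placeholders:

```shell
# Hypothetical reset of the id counters on the vie Cacti database;
# the table names are from the incident, the value 1000 is a placeholder
mysql -u root -p cacti -e "ALTER TABLE data_template_data_rra AUTO_INCREMENT = 1000;"
mysql -u root -p cacti -e "ALTER TABLE poller_item AUTO_INCREMENT = 1000;"
mysql -u root -p cacti -e "ALTER TABLE data_input_data AUTO_INCREMENT = 1000;"
```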
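
Repairing the replication break at 22:53 would typically mean repointing the replica at the correct binlog coordinates. This is a hedged sketch; the binlog file name and position are placeholders, not the values used during the incident:

```shell
# Stop replication, repoint to the binlog position that matches the
# restored state, and restart (file/position values are placeholders)
mysql -e "STOP SLAVE;"
mysql -e "CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000123', MASTER_LOG_POS=4;"
mysql -e "START SLAVE;"

# Confirm Slave_IO_Running and Slave_SQL_Running are both "Yes"
mysql -e "SHOW SLAVE STATUS\G"
```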
Cacti runs "unison" to perform a two-way synchronization. Unison stopped working the first time as a consequence of the filesystem corruption, and did not work with the restored system either, because the two VMs were out of sync. We removed the DB created by Unison and started Unison from scratch on both systems, after which the sync started working again.
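
The Unison reset described above amounts to deleting its archive files (the "DB" it keeps of the last synced state) on both hosts and re-running the sync. The profile name `cacti` is a placeholder; the archive file naming is standard Unison behaviour:

```shell
# Unison stores its sync state as "ar*" (archive) and "fp*" (fingerprint)
# files under ~/.unison; removing them forces a from-scratch sync
rm -f ~/.unison/ar* ~/.unison/fp*

# Re-run the sync non-interactively; Unison rebuilds its archives
# (profile name "cacti" is an assumption)
unison cacti -batch
```

Run the removal on both VMs before restarting the sync, otherwise the surviving archive on one side will still disagree with the rebuilt one.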