IT was alerted to a disk space issue on prod-cacti02-vie-at.geant.org on Friday 6th March 2020, affecting the /var partition.

The partition continued to fill reasonably quickly, and on assessment it was estimated that it would likely run out of space before the end of the weekend.

A new disk was added in preparation to grow the partition on to the new disk space.

An emergency maintenance ticket was created, and a member of the NOC approved the maintenance to go ahead on Friday night at 21:00, to avoid interruption to users of the system should anything go wrong.

During the addition of the disk it was found that the LVM physical volume could not be created due to an error about a missing UUID. Despite efforts to fix the issue while the machine was online, none proved successful.
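For reference, the usual online extension procedure attempted here looks roughly like the following sketch. The device, volume group and logical volume names (`/dev/sdc`, `vg_root`, `lv_var`) and the SCSI host path are illustrative assumptions, not the actual names on the box; the `pvcreate` step is where the missing-UUID error surfaced.

```shell
# Rescan the SCSI bus so the guest sees the newly added VMware disk
# (host adapter path is an assumption; repeat per adapter if needed)
echo "- - -" > /sys/class/scsi_host/host0/scan

# Initialise the new disk as an LVM physical volume
# -- this is the step that failed with the missing-UUID error
pvcreate /dev/sdc
pvs                                 # confirm whether the PV is visible

# Extend the volume group, then the /var logical volume, onto the new PV
vgextend vg_root /dev/sdc
lvextend -l +100%FREE /dev/vg_root/lv_var

# Grow the filesystem online (resize2fs for ext4; xfs_growfs for XFS)
resize2fs /dev/vg_root/lv_var
```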

We took the decision to reboot the machine in the hope that a rescan of the devices would fix the issue or make its cause clear. Unfortunately the OS would not boot, and we were forced to revert to the last known good backup (Thursday 5th, 23:10).

After a few issues with VMware not releasing information about the newly added third disk, we had to delete the VM prod-cacti02-vie-at.geant.org altogether and let the backup replace the box. This proved successful and the box booted.

On restoration of the service we reviewed the state of the box and found a large number of relay logs in /var/lib/mysql. This was unusual, so we checked the MySQL replication status and found that replication from the master (prod-cacti01-fra-de.geant.org) was broken, and had been broken since 25th February according to mysqld.log, due to a mismatch in the default values of the id column in three tables that inserts from the master were trying to write to. We fixed the mismatch, and replication started and continued to flow.

There was a further issue with replication in the other direction, from the prod-cacti02-vie-at.geant.org slave to the master prod-cacti01-fra-de.geant.org. The restored backup presented an earlier binlog number than the master expected for its replication. As that was momentary, we reset the slave process on prod-cacti01-fra-de.geant.org and replication again continued from vie to fra. As no further issues were being alerted systemically, we closed the incident and made the NOC aware that this situation would need review on Monday morning.
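The diagnosis and fix described above ran roughly along these lines; this is a hedged sketch, not a replay of command history, and the per-table DDL is elided because it depends on the Cacti schema.

```shell
# On the slave (vie): check why replication from fra is broken
mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Last_SQL_Error'

# The SQL thread was failing on inserts from the master because the id
# column defaults differed in three tables (data_template_data_rra,
# poller_item, data_input_data); realigning the column definitions
# (actual ALTER TABLE statements elided) lets the relay logs replay
mysql cacti -e "ALTER TABLE poller_item ... ;"   # one per affected table

# Restart the SQL thread and confirm the backlog of relay logs drains
mysql -e "START SLAVE; SHOW SLAVE STATUS\G"

# On the master (fra): the restore left vie advertising an older binlog
# position than fra expected for the reverse (vie -> fra) direction, so
# reset that slave process and let it re-attach
mysql -e "STOP SLAVE; RESET SLAVE; START SLAVE;"
```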

The timeline of events as they unfolded is below:


Date        Time    Notes
06/03/2020  11:21   First critical alert received. Decision made to monitor how quickly the partition was consuming space.
06/03/2020  17:10   Alert reviewed again; the partition was found to be consuming more space than expected.
06/03/2020  17:30   Logged on and added a new disk via the VMware UI. Logged onto the server and attempted to extend the existing LVM in the usual manner. Creating the physical volume produced an error about a missing UUID, which pvs confirmed. Attempts to remediate the situation were unsuccessful, and a reboot was requested to confirm whether a device rescan would fix the issue or provide more information.
06/03/2020  19:54   Emergency ticket raised to perform a reboot at 21:00; approved by NOC.
06/03/2020  21:00   The VM did not boot, so we were forced to restore from backup. Multiple fsck options were tried without success.
06/03/2020  21:30   Restore from backup requested.
06/03/2020  22:25   After issues with the restore, a good VM version was restored and booted.
06/03/2020  22:30   Investigated the mass of relay logs in /var/lib/mysql.
06/03/2020  22:33   Logged in to mysql on vie and reset the id default value in the data_template_data_rra, poller_item and data_input_data tables.
06/03/2020  22:53   Logged into prod-cacti01-fra-de.geant.net to fix the replication break; the restore presented an older binlog entry than expected.
06/03/2020  23:14   Notified NOC that this fix would need review on Monday, but that replication was fixed. No notification at this point that anything was broken.


Cacti runs "unison" to perform a two-way synchronization. Unison stopped working the first time as a consequence of the filesystem corruption, and did not work with the restored system either, because the two VMs were no longer in sync. We removed the database created by Unison and started Unison from scratch on both systems, and the sync started working again.
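Resetting Unison amounted to clearing its local archive files (the database of last-synced state) on both hosts and re-running the first sync. The paths are Unison's defaults and the profile name is an assumption.

```shell
# On both VMs: remove Unison's archive and fingerprint files, which were
# inconsistent after the restore
rm -f ~/.unison/ar* ~/.unison/fp*

# Re-run the sync; Unison rebuilds its archives on the first pass and
# two-way synchronization resumes ("cacti" profile name is assumed)
unison cacti -batch
```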
