IT was alerted to a disk space issue on prod-cacti02-vie-at.geant.org on Friday 6th March 2020, affecting the /var partition.
Usage of the partition continued to grow quickly, and on assessment it was estimated that the partition might not last the weekend.
A new disk was added in preparation for extending the partition onto the new disk space.
An emergency maintenance ticket was created and a member of the NOC approved the maintenance to go ahead on Friday night at 9pm, to avoid disruption to users of the system should anything go wrong.
During the addition of the disk it was found that the LVM setup of the new physical disk could not be completed, due to an error about a missing UUID. Despite efforts to fix the issue while the machine was online, none proved successful.
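For context, extending an LVM-backed /var onto a newly added disk normally follows steps like the below. This is a hedged sketch with assumed device and volume names (/dev/sdc, vg_system, lv_var), not a record of the exact commands run on the night:

```shell
# Rescan the SCSI bus so the guest OS sees the newly added virtual disk
# (the host bus number may differ on a given VM).
echo "- - -" > /sys/class/scsi_host/host0/scan

# Initialise the new disk as an LVM physical volume. This is the kind of
# step that failed on the night, reporting a missing UUID.
pvcreate /dev/sdc

# Add the new physical volume to the existing volume group.
vgextend vg_system /dev/sdc

# Grow the /var logical volume into the new free space and resize the
# filesystem online in the same step (-r).
lvextend -r -l +100%FREE /dev/vg_system/lv_var
```

These commands require root and a live LVM configuration, so they are shown here purely as an outline of the intended procedure.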
We took the decision to reboot the machine in the hope that a rescan of the devices would fix the issue or make its cause clear. Unfortunately the OS would not boot, and we were forced to revert to the last known good backup (Thursday 5th, 23:10).
After a few issues with VMware not relinquishing information about the newly added 3rd disk, we had to delete the VM prod-cacti02-vie-at.geant.org altogether and let the backup replace the box. This proved more successful and the box booted.
On restoration of the service we reviewed the state of the box and found a large number of relay logs in /var/lib/mysql. This was unusual, so we checked the MySQL replication status and found that replication from the master (prod-cacti01-fra-de.geant.org) was broken, and according to mysqld.log had been broken since 25th February. The cause was a mismatch in the default values of the id column in three tables that inserts from the master were trying to write to. We fixed the mismatch and replication started and continued to flow.

There was a further issue with replication from the prod-cacti02-vie-at.geant.org slave back to the master prod-cacti01-fra-de.geant.org: the restored backup presented an earlier binlog number than the master expected for its replication. As this was momentary, we reset the slave process on prod-cacti01-fra-de.geant.org and replication again continued from vie to fra.

With no further issues being alerted by monitoring, we closed the incident and made the NOC aware that this situation would need review on Monday morning.
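The replication checks and repairs described above follow the standard MySQL binlog-replication pattern. The sketch below shows the general shape of those steps; the binlog file name and position are illustrative placeholders, not values from the incident:

```shell
# On the slave (prod-cacti02-vie-at): inspect replication state. The
# fields of interest are Slave_IO_Running, Slave_SQL_Running and
# Last_SQL_Error, which in this incident pointed at inserts failing
# against mismatched column defaults.
mysql -e "SHOW SLAVE STATUS\G"

# After aligning the column defaults on the affected tables, restart
# the slave threads and confirm replication resumes.
mysql -e "START SLAVE;"
mysql -e "SHOW SLAVE STATUS\G"

# On prod-cacti01-fra-de: the slave thread expected a binlog position
# older than the restored box now served. Resetting the slave and
# repointing it at the master's current coordinates clears this
# (placeholders shown; real values come from SHOW MASTER STATUS on
# the other box).
mysql -e "STOP SLAVE; RESET SLAVE;"
mysql -e "CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000123', MASTER_LOG_POS=4; START SLAVE;"
```

These statements need a running MySQL instance with replication configured, so they are offered as an outline of the recovery sequence rather than a reproducible script.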