Incident description
The server that runs all the WordPress sites (wordpress1.geant.org) became unreachable at 12:10:59 CET.
Incident severity: CRITICAL
Data loss: NO
Monitoring alerted: YES
Timeline
Time (CET) | Event |
---|---|
12:10 | Apache server stopped accepting incoming requests |
12:12 | Chris Atherton reported on the #it channel that the site aac-project.eu was not working correctly |
12:21 | Konstantin Lepikhov confirmed the issue with the wordpress1 site on the #devops channel |
12:23 | Dick Visser connected to the VM via console and confirmed that the network was down (gateway not reachable) |
12:29 | Massimiliano Adamo restarted the network service inside the VM, after which the network came up and everything started working |
12:30 | Konstantin Lepikhov announced that the problem was fixed |
Total downtime: 20 minutes.
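The downtime figure can be reproduced from the timeline. The following is a minimal sketch, assuming the outage is counted from the first timeline entry (12:10) to the announcement of the fix (12:30); the function name `downtime_minutes` is hypothetical and used only for illustration.

```python
from datetime import datetime


def downtime_minutes(start: str, end: str) -> int:
    """Return the whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Assumed outage window, taken from the first and last timeline entries.
print(downtime_minutes("12:10", "12:30"))
```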
Analysis
As part of BAU and the handover of his responsibilities, Dick Visser was working on migrating a VM from the University of Amsterdam into the GEANT VMware cluster in Frankfurt.
...
According to the logs, a minute later the wordpress1.geant.org VM (which has IP address 83.97.92.46) was live migrated from fra-prd-esx01 to fra-prd-esx02. This process started at 12:10:49 and finished 10 seconds later at 12:10:59.
...
The reason for this needs to be investigated further, because DRS moving VMs around is common practice and should not impact VMs at all.
Logs/screendump
DRS migration: