Incident description
The server that runs all the WordPress sites (wordpress1.geant.org) became unreachable at 12:10:59 CET.
Incident severity: CRITICAL
Data loss: NO
Monitoring alerted: YES
Timeline
Time (CET) | Event |
---|---|
12:10 | Apache server stopped accepting incoming requests |
12:12 | Chris Atherton reported on the #it channel that the site aac-project.eu was not working correctly |
12:21 | Konstantin Lepikhov confirmed the issue with the wordpress1 site on the #devops channel |
12:23 | Dick Visser connected to the VM via the console and confirmed that the network was down (gateway not reachable) |
12:29 | Massimiliano Adamo restarted the network service inside the VM, after which the network came up and everything started working |
12:30 | Konstantin Lepikhov announced that the problem was fixed |
Total downtime: 20 minutes.
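The downtime figure can be cross-checked against the timestamps recorded in this report. The following is a minimal sketch, assuming the date 2018-02-22 (taken from the Apache log entry) and assuming the outage window runs from the moment the server became unreachable to the network restart; the variable names are illustrative only.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps (CET) copied from this report; the window boundaries are an assumption.
outage_start = datetime.strptime("2018-02-22 12:10:59", FMT)      # server became unreachable
network_restored = datetime.strptime("2018-02-22 12:29:18", FMT)  # network service restarted
fix_announced = datetime.strptime("2018-02-22 12:30:00", FMT)     # timeline entry, minute precision

# Outage measured with second precision.
measured = network_restored - outage_start

# Downtime as the timeline reports it, rounded to whole minutes (12:10 to 12:30).
reported = fix_announced - outage_start.replace(second=0)

print(f"measured outage:         {measured}")
print(f"timeline-based downtime: {reported}")
```

The second-precision figure is slightly shorter than the reported 20 minutes because the timeline rounds both ends to the minute and ends at the announcement rather than at the restart.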
Current situation
We're currently investigating the nature of the issue. It could be either a VMware network adapter issue or something related to the network configuration in the VMware cluster.
Analysis
As part of BAU and the handover of his responsibilities, Dick Visser was working on migrating a VM from the University of Amsterdam into the GEANT VMware cluster in Frankfurt.
The IP address that was allocated for this purpose was 83.97.92.49. This address was still assigned to a test VM called eventr2-test.geant.org. That test VM was no longer used, so it was halted and subsequently deleted at 11:51:15.
The migration of the VM (name: nagios.terena.org) into the Frankfurt cluster was then started at 11:57, and finished at 12:01.
This went smoothly and the VM was powered up at 12:02:10. After signing in to the console, its IP address and gateway were assigned and tested.
Once connectivity was confirmed, the VM was halted and powered up again at 12:09:48 to make sure that it comes back up correctly if it is ever rebooted unintentionally.
According to the logs, a minute later the wordpress1.geant.org VM (which has IP address 83.97.92.46) was live migrated from fra-prd-esx01 to fra-prd-esx02. This process started at 12:10:49 and finished 10 seconds later at 12:10:59.
This was done by DRS, the Distributed Resource Scheduler. Its purpose is to optimise performance by distributing VMs evenly across the various hypervisors. This is a standard feature of VMware and happens fully automatically.
The last known-good log entry in the Apache log file is from 5 seconds before this:
www.eduroam.org:443 2018-02-22 12:10:54.342563 105.57.49.4 "GET /wp-content/plugins/templatesnext-toolkit/css/owl.carousel.css?ver=2.2.1 HTTP/1.1" 200 934 "https://www.eduroam.org/what-is-eduroam/" "Mozilla/5.0 (Linux; Android 5.0; Zn1 Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/34.0.0.0 Mobile Safari/537.36"
Logging in to the console of the VM showed that everything was running, but there was no network connectivity. Even the gateway was not reachable.
After Massimiliano Adamo restarted the network on the VM at 12:29:18, everything started working again.
The timing of all events indicates that powering on the newly migrated VM triggered DRS to migrate the wordpress1.geant.org VM to another hypervisor, and that network connectivity to the VM was lost during this process.
The reason for this needs to be investigated further, because DRS moving VMs around is common practice and should not impact VMs at all.
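The timing argument above can be made explicit by ordering the recorded events and computing the gaps between them. The following is a minimal sketch using only timestamps quoted in this analysis; the date 2018-02-22 is taken from the Apache log entry, and the helper names `ts` and `gaps` are hypothetical.

```python
from datetime import datetime

def ts(value: str) -> datetime:
    """Parse a CET time from this report (date assumed to be 2018-02-22)."""
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in value else "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(f"2018-02-22 {value}", fmt)

# Events and timestamps as stated in the analysis.
events = [
    ("eventr2-test.geant.org deleted", ts("11:51:15")),
    ("nagios.terena.org first power-on", ts("12:02:10")),
    ("nagios.terena.org second power-on", ts("12:09:48")),
    ("DRS migration of wordpress1 started", ts("12:10:49")),
    ("last known-good Apache log entry", ts("12:10:54.342563")),
    ("DRS migration of wordpress1 finished", ts("12:10:59")),
    ("network restarted on wordpress1", ts("12:29:18")),
]

def gaps(timeline):
    """Return (earlier event, later event, seconds between) for consecutive events."""
    ordered = sorted(timeline, key=lambda event: event[1])
    return [
        (a[0], b[0], (b[1] - a[1]).total_seconds())
        for a, b in zip(ordered, ordered[1:])
    ]

for before, after, seconds in gaps(events):
    print(f"{seconds:9.1f}s  {before} -> {after}")
```

With these inputs, the DRS migration starts 61 seconds after the second power-on, and the last known-good log entry falls inside the 10 second migration window. That ordering is consistent with the conclusion above, but it does not by itself establish the cause.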
Logs/screendump
DRS migration: