...
Date | Time (UTC) | Description | |||||||
---|---|---|---|---|---|---|---|---|---|
| 12:52:37 | The first evidence of this incident appeared in the logs of prod-poller-processor.geant.org
| |||||||
| afternoon | Several performance issues started to become apparent:
| |||||||
| 19:08 | Keith Slater (and others) reported on the #brian channel that data was missing in the BRIAN GUI | |||||||
| 20:30 | Bjarke Madsen replied that it seemed related to service problems seen earlier in the day | |||||||
| 21:12 | Massimiliano Adamo replied on #swd-private that we had raised an issue with VMware | |||||||
| 23:28 | Linda Ness sent an email to gn4-3-all@lists.geant.org indicating that several services were down | |||||||
| 12:53 | Continuous failures writing to Influx or resolving the hostname:
| |||||||
| 08:12 | ||||||||
| 02:34-08:11 | Also many I/O errors in the logs
| |||||||
| 07:34 | Keith Slater took ownership of informing APMs | |||||||
| 08:12 | Pete Pedersen stopped the system and fixed the corrupt partition. | |||||||
| 08:26:55 | System was rebooted. | |||||||
| 08:26:55 | haproxy failed to start because it couldn't resolve prod-inventory-provider0x.geant.org
| |||||||
| 08:27:07 | Kapacitor tasks failed to run because the haproxy service wasn't running, for example:
| |||||||
| 08:41:11 | Puppet ran and restarted haproxy. This time DNS resolution was back to normal and haproxy started successfully, but Kapacitor tasks were still in a non-executing state | |||||||
| 09:27:10 | Manual restart of Kapacitor. Normal system behavior restored. | |||||||
| 10:39 | Sam Roberts copied the lost data points from UAT to production
| |||||||
31 May 2022 | 11:56 | Keith Slater informed APMs - BRIAN is back to normal operation. |
...