...
| Date | Time (UTC) | Description |
|---|---|---|
| | 12:52:37 | The first evidence of this incident appeared in the logs of prod-poller-processor.geant.org |
| | afternoon | Several performance issues started to become apparent: |
| | 19:08 | Keith Slater (and others) alerted on the #brian channel that data was missing in the BRIAN GUI |
| | 20:30 | Bjarke Madsen replied that it seemed related to service problems seen earlier in the day |
| | 21:12 | Massimiliano Adamo replied on #swd-private that we had raised an issue with VMware |
| | 23:28 | Linda Ness sent a mail to gn4-3-all@lists.geant.org indicating several services were down |
| | 12:53 | Continuous failures writing to Influx, or resolving the hostname: |
| | 08:12 | |
| | 02:34-08:11 | Also many I/O errors in the logs |
| | 07:34 | Keith Slater took ownership of informing APMs |
| | 08:12 | Pete Pedersen stopped the system and fixed the corrupt partition. |
| | 08:26:55 | System was rebooted. |
| | 08:26:55 | haproxy failed to start because it couldn't resolve prod-inventory-provider0x.geant.org |
| | 08:27:07 | Kapacitor tasks failed to run because the haproxy service wasn't running, for example: |
| | 08:41:11 | Puppet ran and restarted haproxy. This time DNS resolution was back to normal and haproxy started successfully, but Kapacitor tasks were still in a non-executing state. |
| | 09:27:10 | Manual restart of Kapacitor. Normal system behavior restored. |
| | 10:39 | Sam Roberts copied the lost data points from UAT to production |
...