...
- Failure to connect or write data to InfluxDB
- Local system partition errors
- cf. IT incident: 30052022
...
Date | Time (UTC) | Description
---|---|---
| 12:52:37 | The first evidence of this incident appeared in the logs of prod-poller-processor.geant.org
| 12:53 | Continuous failures writing to InfluxDB, or resolving the hostname:
| 02:34-08:11 | Also many I/O errors in the logs
| 08:12 | Pete Pedersen stopped the system and fixed the corrupt partition.
| 08:26:55 | System was rebooted.
| 08:26:55 | haproxy failed to start because it could not resolve prod-inventory-provider0x.geant.org
| 08:27:07 | Kapacitor tasks failed to run because the haproxy service wasn't running, for example:
| 08:41:11 | Puppet ran and restarted haproxy. This time DNS resolution was back to normal and haproxy started successfully, but Kapacitor tasks were still in a non-executing state.
| 09:27:10 | Kapacitor was restarted manually. Normal system behavior was restored.
| 10:39 | Sam Roberts copied the lost data points from UAT to production.
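
Between 08:41:11 and 09:27:10 the Kapacitor tasks were enabled but not executing, and this was only resolved by a manual restart. The following is a minimal sketch of how that state could be detected automatically. It assumes the JSON shape returned by Kapacitor's task list endpoint (`/kapacitor/v1/tasks`), with a `status` and an `executing` field per task; the task IDs in the sample are hypothetical and fetching the JSON is left out.

```python
def stalled_tasks(tasks_response):
    """Return the IDs of tasks that are enabled but not executing.

    `tasks_response` is the parsed JSON body of a task list response,
    assumed to look like {"tasks": [{"id": ..., "status": ..., "executing": ...}]}.
    """
    return [
        task.get("id")
        for task in tasks_response.get("tasks", [])
        if task.get("status") == "enabled" and not task.get("executing", False)
    ]


# Hypothetical sample data, not taken from the incident logs.
sample = {
    "tasks": [
        {"id": "task_a", "status": "enabled", "executing": True},
        {"id": "task_b", "status": "enabled", "executing": False},
        {"id": "task_c", "status": "disabled", "executing": False},
    ]
}

print(stalled_tasks(sample))  # ['task_b']
```

A non-empty result could be used to raise an alert, so that a stalled Kapacitor is noticed sooner than it was in this incident.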
Proposed Solution
- The core issue seems to be related to VMware, and IT need to provide a solution
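
Independently of the VMware fix, the timeline shows haproxy failing at boot because DNS was not yet resolving. One possible complementary safeguard, sketched here only as an illustration and not as an agreed action, is to wait for the backend hostnames to resolve before starting haproxy. The function name `wait_for_dns`, its parameters, and the timeout values are all assumptions; hostnames are passed on the command line.

```python
import socket
import sys
import time


def wait_for_dns(hostnames, timeout_s=120.0, interval_s=5.0,
                 resolve=socket.getaddrinfo, sleep=time.sleep,
                 clock=time.monotonic):
    """Return True once every hostname resolves, False if the timeout expires."""
    deadline = clock() + timeout_s
    pending = list(hostnames)
    while True:
        unresolved = []
        for host in pending:
            try:
                resolve(host, None)
            except OSError:  # socket.gaierror is a subclass of OSError
                unresolved.append(host)
        pending = unresolved
        if not pending:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Exit non-zero if any hostname given as an argument still fails to resolve.
    sys.exit(0 if wait_for_dns(sys.argv[1:]) else 1)
```

Run as a pre-start step, a non-zero exit code would delay or block the service start instead of letting haproxy fail on an unresolved backend.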