Incident Description
New BRIAN bitrate traffic data was not computed, saved or available for approximately 21 hours.
...
The reasons for the degradation:
- cf. IT incident: 30052022
- Local system partition errors/corruption
- Failure to connect or write data to InfluxDB
...
Incident severity:
CRITICAL (intermittent, temporary service outage)
Data loss: No
...
Timeline
All times are in UTC
Date | Time (UTC) | Description |
---|---|---|
30 May 2022 | 12:52:37 | The first evidence of this incident appeared in the logs of ... |
 | afternoon | Several performance issues started being reported across the network. |
 | 19:08 | Keith Slater (and others) alerted on the ... |
 | 20:30 | Bjarke Madsen replied that it seemed related to service problems seen earlier in the day. |
 | 21:12 | Massimiliano Adamo replied regarding a VMWare storage device failure. |
 | 23:28 | Linda Ness sent a mail to gn4-3-all@lists.geant.org indicating that several services were down. |
 | 12:53 | For the duration of this event, Kapacitor continuously logged failures writing to or communicating with InfluxDB. This means that while Kapacitor was receiving live network counters in real time, the results of the rate calculations weren't being saved to InfluxDB (see the detection sketch below the timeline). |
31 May 2022 | 02:34-08:11 | There were many incidents of disk I/O failure logged over the duration of the event, indicating filesystem/disk corruption. |
 | 07:34 | Keith Slater took ownership of informing APMs. |
 | 08:12 | Pete Pedersen stopped the system and fixed the corrupt partition. |
 | 08:26:55 | System was rebooted. |
 | 08:26:55 | There was a network DNS failure during the boot process, and haproxy failed to parse its configuration because server addresses couldn't be resolved (log excerpt below the timeline). |
 | 08:27:07 | ... Since the Kapacitor tasks weren't running, network counters were still not being processed or saved to InfluxDB. |
 | 08:41:11 | At this time DNS resolution was back to normal, but ... |
 | 09:27:10 | Manual restart of Kapacitor. Normal BRIAN processing of real-time data was restored. |
 | 10:39 | Sam Roberts copied the data points lost during the incident from UAT to production. |
31 May 2022 | 11:56 | Keith Slater informed APMs that BRIAN is back to normal operation. |

haproxy log excerpt from the boot at 08:26:55:

```
May 31 08:26:55 prod-poller-processor haproxy[976]: [ALERT] 150/082655 (976) : parsing [/etc/haproxy/haproxy.cfg:30] : 'server : could not resolve address
May 31 08:26:55 prod-poller-processor haproxy[976]: [ALERT] 150/082655 (976) : parsing [/etc/haproxy/haproxy.cfg:31] : 'server : could not resolve address
```
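For context on how the gap could have been caught sooner: the write failures above meant that the most recent point stored in InfluxDB grew steadily older while Kapacitor kept receiving counters. A minimal staleness check along the following lines, run periodically, would have flagged that. This is only a sketch: it assumes an InfluxDB 1.x-style `/query` endpoint on localhost, and the `brian` database and `counters` measurement names are placeholders that would need to match the real deployment.

```python
"""Minimal staleness check: alert if no new points have been written recently.

Assumptions (hypothetical, adjust to the real deployment): InfluxDB 1.x HTTP API
on localhost:8086, database "brian", measurement "counters".
"""
import json
import sys
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone

INFLUX_URL = "http://localhost:8086/query"   # assumed InfluxDB 1.x query endpoint
DATABASE = "brian"                           # hypothetical database name
MEASUREMENT = "counters"                     # hypothetical measurement name
MAX_AGE = timedelta(minutes=15)              # how stale is too stale


def latest_point_time() -> datetime | None:
    """Return the timestamp of the most recently written point, or None if empty."""
    query = f'SELECT * FROM "{MEASUREMENT}" ORDER BY time DESC LIMIT 1'
    params = urllib.parse.urlencode({"db": DATABASE, "q": query, "epoch": "s"})
    with urllib.request.urlopen(f"{INFLUX_URL}?{params}", timeout=10) as resp:
        body = json.load(resp)
    series = body.get("results", [{}])[0].get("series")
    if not series:
        return None
    epoch_seconds = series[0]["values"][0][0]   # first column is "time"
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


def main() -> int:
    latest = latest_point_time()
    now = datetime.now(timezone.utc)
    if latest is None or now - latest > MAX_AGE:
        print(f"ALERT: no new {MEASUREMENT} points since {latest} (now {now})")
        return 1
    print(f"OK: latest point at {latest}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Run from cron or an existing monitoring agent, a check like this would have raised an alert within minutes, rather than the gap being noticed from missing graphs hours later.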
Proposed Solution
- The core issue seems to be related to VMWare, and IT need to provide a solution. S.M.A.R.T. alerts have been found in the vCenter, but monitoring has not been configured to detect these alerts.
- A previously-known issue with the Kapacitor tasks stopping due to unchecked errors (cf. POL1-529) meant that the services were left not executing for longer than necessary (a minimal watchdog sketch illustrating one possible mitigation is included after this list).
- This incident suggests that POL1-529, a previously logged technical-debt issue which has been considered medium/low priority, could be prioritized for development:
  - fixing this issue could generally help with temporary DNS resolution errors; however, the DNS issues were secondary in this incident and fixing it would not have prevented the overall outage
- While VMWare disk corruption and network DNS failures are external events outside the control of SWD, a further investigation into potential improvements in processing resiliency is described in POL1-607.
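As a rough illustration of the POL1-529 / POL1-607 direction, the sketch below polls the Kapacitor HTTP API for tasks that are enabled but not executing and bounces them, rather than leaving them silently stopped after an unchecked error (such as the transient DNS failure seen during this incident). It assumes the Kapacitor 1.x REST API on localhost:9092; the host, the bounce-on-detection policy and the absence of authentication are assumptions for illustration, not the agreed design.

```python
"""Watchdog sketch for POL1-529: detect Kapacitor tasks that have stopped executing.

Assumptions (not the agreed design): Kapacitor 1.x HTTP API on localhost:9092,
and that bouncing a task (disable, then re-enable) is an acceptable recovery action.
"""
import json
import urllib.request

KAPACITOR = "http://localhost:9092/kapacitor/v1"   # assumed Kapacitor 1.x API base


def _request(method: str, path: str, payload: dict | None = None) -> dict:
    """Send a JSON request to the Kapacitor API and return the decoded response."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        f"{KAPACITOR}{path}",
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = resp.read()
    return json.loads(body) if body else {}


def stalled_tasks() -> list[dict]:
    """Return tasks that are enabled but not currently executing."""
    tasks = _request("GET", "/tasks").get("tasks", [])
    return [t for t in tasks if t.get("status") == "enabled" and not t.get("executing")]


def bounce(task_id: str) -> None:
    """Disable and re-enable a task so Kapacitor restarts its execution."""
    _request("PATCH", f"/tasks/{task_id}", {"status": "disabled"})
    _request("PATCH", f"/tasks/{task_id}", {"status": "enabled"})


if __name__ == "__main__":
    for task in stalled_tasks():
        print(f"task {task['id']} is enabled but not executing: {task.get('error', '')}")
        bounce(task["id"])
```

Whether automatic re-enabling is appropriate, or whether such a watchdog should only alert, is a design decision to be settled as part of POL1-529 / POL1-607.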