Incident Description
New BRIAN bitrate traffic data was not computed or saved for approximately 21 hours.
...
The reasons for the degradation were:
- Local system partition corruption/errors (cf. IT incident 30052022)
- Failure to connect or write data to InfluxDB
The impact of this service degradation was:
...
Incident severity: CRITICAL (intermittent/temporary service outage)
Data loss: No
Total duration of incident: 21 hours / ongoing (as of 22:22 UTC)
Timeline
All times are in UTC
| Date | Time (UTC) | Description |
| --- | --- | --- |
| | 12:52:37 | The first evidence of this incident appeared in the logs of … |
| | afternoon | Several performance issues started being reported across the network |
| | 19:08 | Keith Slater (and others) alerted on the … |
| | 20:30 | Bjarke Madsen replied that it seemed related to service problems seen earlier in the day |
| | 21:12 | Massimiliano Adamo replied on VMWare regarding storage device failure |
| | 23:28 | Linda Ness sent a mail to gn4-3-all@lists.geant.org indicating several services are down |
12:53: For the duration of this event, Kapacitor continuously logged failures writing to or communicating with InfluxDB. This means that while Kapacitor was receiving live network counters in real time, the results of the rate calculations weren't being saved to InfluxDB.
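Write failures of this kind can be surfaced before users notice missing graphs by probing InfluxDB independently of Kapacitor. A minimal sketch, assuming InfluxDB 1.x's standard `/ping` endpoint; the URL, threshold, and alert hook are illustrative, not the actual BRIAN monitoring configuration:

```python
import urllib.request
import urllib.error


def influxdb_is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Probe InfluxDB's /ping endpoint; HTTP 204 means the server is up."""
    try:
        with urllib.request.urlopen(f"{base_url}/ping", timeout=timeout) as resp:
            return resp.status == 204
    except (urllib.error.URLError, OSError):
        return False


def check_and_alert(base_url, alert, failures, threshold=3, probe=influxdb_is_healthy):
    """Track consecutive probe failures and fire `alert` once the threshold is hit.

    `probe` is injectable so the escalation logic can be tested offline.
    Returns the updated consecutive-failure count (0 after a success).
    """
    if probe(base_url):
        return 0
    failures += 1
    if failures == threshold:
        alert(f"InfluxDB at {base_url} unreachable {failures} times in a row")
    return failures
```

Run periodically (e.g. from cron or the existing alerting stack), this would have flagged the InfluxDB outage within a few probe intervals instead of 21 hours.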
02:34-08:11: Many incidents of disk I/O failure were logged over the duration of the event, indicating filesystem/disk corruption.
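The original log excerpt was not preserved, but Linux kernel disk and filesystem errors follow well-known patterns ("Buffer I/O error", "blk_update_request: I/O error", "EXT4-fs error"). A sketch of how the journal could be scanned for them automatically; the patterns and sample lines are illustrative, not taken from the incident logs:

```python
import re
from typing import Iterable, List

# Patterns the Linux kernel commonly logs for failing block devices
# and corrupt ext4 filesystems.
IO_ERROR_RE = re.compile(
    r"(Buffer I/O error|blk_update_request: I/O error|EXT4-fs error)",
    re.IGNORECASE,
)


def find_io_errors(lines: Iterable[str]) -> List[str]:
    """Return journal/syslog lines that look like disk or filesystem errors."""
    return [line for line in lines if IO_ERROR_RE.search(line)]
```

Feeding `journalctl -k` output through such a filter and alerting on any match would detect this class of corruption long before downstream services fail.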
| Date | Time (UTC) | Description |
| --- | --- | --- |
| | 07:34 | Keith Slater took ownership of informing APMs |
| | 08:12 | Pete Pedersen stopped the system and fixed the corrupt partition. |
| | 08:26:55 | System was rebooted. |
08:26:55: There was a network DNS failure during the boot process, and haproxy failed to start because the server addresses in its configuration couldn't be resolved:

```text
May 31 08:26:55 prod-poller-processor haproxy[976]: [ALERT] 150/082655 (976) : parsing [/etc/haproxy/haproxy.cfg:30] : 'server : could not resolve address
May 31 08:26:55 prod-poller-processor haproxy[976]: [ALERT] 150/082655 (976) : parsing [/etc/haproxy/haproxy.cfg:31] : 'server : could not resolve address
```
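One way to avoid this start-up failure (a sketch, assuming the boot ordering is scriptable; the hostname is a placeholder) is to wait for DNS to become available before starting services that require it. HAProxy's `default-server init-addr` setting is another option for tolerating unresolved addresses at start-up.

```python
import socket
import time


def wait_for_dns(hostname: str, attempts: int = 10, delay: float = 3.0,
                 resolver=socket.getaddrinfo, sleep=time.sleep) -> bool:
    """Retry name resolution until it succeeds or attempts are exhausted.

    `resolver` and `sleep` are injectable so the retry logic can be tested
    without touching the network.
    """
    for attempt in range(attempts):
        try:
            resolver(hostname, None)
            return True
        except socket.gaierror:
            if attempt < attempts - 1:
                sleep(delay)
    return False
```

A boot script could call `wait_for_dns("influx.example.org")` (hypothetical name) and only then start haproxy and Kapacitor, turning a hard start-up failure into a short delay.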
| Date | Time (UTC) | Description |
| --- | --- | --- |
| | 08:27:07 | Since the Kapacitor tasks weren't running, network counters were still not being processed or saved to InfluxDB. |
| | 08:41:11 | At this time DNS resolution was back to normal, but the Kapacitor tasks had still not resumed. |
| | 09:27:10 | Manual restart of Kapacitor. Normal BRIAN processing of real-time data was restored. |
| | 10:39 | Sam Roberts copied the data points lost during the incident from UAT to production |
| 31 May 2022 | 11:56 | Keith Slater informed APMs that BRIAN was back to normal operation. |
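The 10:39 backfill amounts to re-querying the unaffected UAT database for the outage window and writing the points back to production. A sketch of the point-conversion step, assuming the influxdb 1.x Python client; the database, measurement, and hostnames are illustrative assumptions, not the actual BRIAN deployment:

```python
from typing import Dict, Iterable, List


def to_write_points(measurement: str, rows: Iterable[Dict]) -> List[Dict]:
    """Convert rows (as returned by a SELECT) into write_points payloads,
    dropping null fields so they aren't written as empty values."""
    points = []
    for row in rows:
        fields = {k: v for k, v in row.items() if k != "time" and v is not None}
        points.append({
            "measurement": measurement,
            "time": row["time"],
            "fields": fields,
        })
    return points


# Against real servers (all names below are placeholders) the flow might be:
#   from influxdb import InfluxDBClient
#   uat = InfluxDBClient(host="uat-influx.example.org", database="brian")
#   prod = InfluxDBClient(host="prod-influx.example.org", database="brian")
#   rows = uat.query("SELECT * FROM bitrates WHERE time >= '...' AND time < '...'").get_points()
#   prod.write_points(to_write_points("bitrates", rows), batch_size=5000)
```

Keeping this as a tested script rather than a one-off manual step would shorten recovery the next time a write outage causes a gap.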
Proposed Solution
- The core issue seems to be related to VMWare, and IT need to provide a solution. S.M.A.R.T. alerts have been found in vCenter, but monitoring has not been configured to detect these alerts.
- A previously-known issue with the Kapacitor tasks stopping due to unchecked errors meant that the services were not executing for longer than necessary.
- This incident suggests that a previously logged technical debt issue (POL1-529), which has been considered medium/low priority, could be prioritized for development:
  - fixing this issue could generally help with temporary DNS resolution errors; however, the DNS issues were secondary in this incident, and fixing it wouldn't have prevented the overall outage
- While VMWare disk corruption and network DNS failures are external events outside the control of SWD, a further investigation into potential improvements in processing resiliency is described in POL1-607.
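The "tasks stopping due to unchecked errors" failure mode above could be caught by a watchdog that polls Kapacitor's task list. A sketch assuming Kapacitor 1.x's REST API (`/kapacitor/v1/tasks`); the exact response fields (`status`, `executing`) should be verified against the deployed version before relying on this:

```python
import json
import urllib.request


def find_stalled(tasks):
    """Pure helper: IDs of tasks that are enabled but report not executing."""
    return [t["id"] for t in tasks
            if t.get("status") == "enabled" and not t.get("executing", False)]


def stalled_tasks(base_url: str, timeout: float = 5.0):
    """Fetch the task list from Kapacitor's REST API and return stalled tasks."""
    with urllib.request.urlopen(f"{base_url}/kapacitor/v1/tasks",
                                timeout=timeout) as resp:
        tasks = json.load(resp)["tasks"]
    return find_stalled(tasks)
```

Alerting whenever `stalled_tasks(...)` is non-empty would have flagged both the overnight outage and the post-reboot window when Kapacitor was up but not processing, instead of relying on a manual restart at 09:27.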