...
- cf. IT incident: 30052022
- Local system partition corruption
- Failure to connect or write data to InfluxDB
...
Incident severity: Temporary service outage
Data loss: No
...
Timeline
All times are in UTC
| Date | Time (UTC) | Description |
|---|---|---|
| 30 May 2022 | 12:52:37 | The first evidence of this incident appeared in the logs of prod-poller-processor.geant.org, e.g. `ts=2022-05-30T12:52:37.802Z ... task=remove_spikes_gwsd_rates` and `ts=2022-05-30T12:52:38.069Z ... task=remove_spikes_interface_rates`. remove_spikes_interface_rates is one of several stream functions in the data processing pipeline required for the data displayed in BRIAN. |
| | afternoon | Several performance issues started being reported across the network: EMS was failing to resolve the Influx cluster hostnames, and puppet was failing or taking a very long time to complete on many VMs. |
| | 19:08 | Keith Slater (and others) alerted on the #brian channel that data was missing in the BRIAN GUI. |
| | 20:30 | Bjarke Madsen replied that it seemed related to the service problems seen earlier in the day. |
| | 21:12 | Massimiliano Adamo replied on #swd-private that we had raised an issue with VMWare. |
| 31 May 2022 | 00:49:08 | For the duration of this event, Kapacitor continuously logged failures regarding writing to or communicating with InfluxDB (including failures to resolve its hostname), e.g. `ts=2022-05-31T00:49:08.133Z ...` and `ts=2022-05-31T01:26:44.163Z ... host"`. This means that while Kapacitor was receiving live network counters in real time, the results of the rate calculations weren't being saved to InfluxDB. |
| | 02:34-08:11 | There were many incidents of disk I/O failure logged over the duration of the event, indicating filesystem/disk corruption (the log excerpts reference `dm-0-8`). |
| | 07:34 | Keith Slater took ownership of informing APMs. |
| | 08:12 | Pete Pedersen stopped the system and fixed the corrupt partition. |
| | 08:26:55 | There was a network DNS failure during the boot process and haproxy failed to start, because it couldn't resolve prod-inventory-provider01.geant.org and prod-inventory-provider0x (`/etc/haproxy/haproxy.cfg:30`, `/etc/haproxy/haproxy.cfg:31`). See the name-resolution check sketched below the timeline. |
| | 08:27:07 | Kapacitor tasks failed to run because the haproxy service wasn't running, for example: `ts=2022-05-31T08:27:07.962Z ... node=inventory_enrichment2 ... text="urllib3.exceptions.MaxRetryError: ... HTTPConnectionPool(host='localhost', ... NewConnectionError('<urllib3.connection.HTTPConnection ... refused',))"`. Since the Kapacitor tasks weren't running, network counters were still not being processed or saved to InfluxDB. |
| | 08:41:11 | puppet ran automatically and restarted haproxy. At this time DNS resolution was back to normal and haproxy started successfully, but the Kapacitor tasks were still in a non-executing state, therefore data was still not being processed. |
| | 09:27:10 | Manual restart of Kapacitor. BRIAN processing of real-time data was restored. |
| | 10:39 | Data points lost during the incident were copied from UAT to production for the following measurements: interface_rates, dscp32_rates, gwsd_rates, multicast_rates (see the backfill sketch below the timeline). |
| | 11:56 | Keith Slater informed APMs that BRIAN is back to normal operation. |
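The 08:26:55 failure reflects how haproxy handles hostnames: server addresses in haproxy.cfg are resolved when the configuration is parsed at startup, and by default haproxy refuses to start if resolution fails. Below is a minimal sketch of the kind of pre-flight check that could surface this before haproxy is (re)started; it is illustrative only, the hostname list is taken from the timeline entry (the second hostname is truncated in this report) and is not a complete list of the real backends.

```python
import socket
import sys

# Hostnames referenced by the haproxy backends; provider01 is taken from the
# timeline above, the second hostname is truncated in the incident report.
BACKEND_HOSTNAMES = [
    "prod-inventory-provider01.geant.org",
]

failures = []
for name in BACKEND_HOSTNAMES:
    try:
        # We only care whether the name resolves; the port is irrelevant here.
        socket.getaddrinfo(name, None)
    except socket.gaierror as exc:
        failures.append(f"{name}: {exc}")

if failures:
    print("haproxy backend hostnames that do not resolve:", file=sys.stderr)
    for line in failures:
        print(f"  {line}", file=sys.stderr)
    sys.exit(1)  # non-zero exit lets a wrapper or unit delay the haproxy start

print("all haproxy backend hostnames resolve")
```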
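The 10:39 backfill step amounts to reading the affected measurements from the UAT InfluxDB for the incident window and writing them into production. A rough sketch of that idea is shown below, assuming InfluxDB 1.x and the `influxdb` Python client; the hostnames, database name and time window are placeholders, and a real run would also need to re-declare tag columns, since the naive query returns tags and fields merged together.

```python
from influxdb import InfluxDBClient

# Placeholder connection details -- not the real BRIAN hosts or credentials.
uat = InfluxDBClient(host="uat-influx.example.org", port=8086, database="brian")
prod = InfluxDBClient(host="prod-influx.example.org", port=8086, database="brian")

MEASUREMENTS = ["interface_rates", "dscp32_rates", "gwsd_rates", "multicast_rates"]
# Approximate incident window, taken from the timeline above.
WINDOW = "time >= '2022-05-30T12:52:00Z' AND time <= '2022-05-31T09:28:00Z'"

for measurement in MEASUREMENTS:
    result = uat.query(f'SELECT * FROM "{measurement}" WHERE {WINDOW}')
    points = [
        {
            "measurement": measurement,
            "time": row["time"],
            # Caveat: tags and fields come back merged; a real backfill would
            # split out the tag keys used by these measurements.
            "fields": {k: v for k, v in row.items() if k != "time" and v is not None},
        }
        for row in result.get_points()
    ]
    # Write back to production in batches to keep request sizes reasonable.
    for i in range(0, len(points), 5000):
        prod.write_points(points[i : i + 5000])
    print(f"{measurement}: rewrote {len(points)} points")
```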
Proposed Solution
- The core issue seems to be related to VMWare, and IT needs to provide a solution.
- This incident suggests that a previously logged technical debt issue (POL1-529), which has been considered medium/low priority, could be prioritized for development:
  - Fixing this issue could generally help with temporary DNS resolution errors; however, the DNS issues were secondary in this incident, and fixing it would not have prevented the overall outage.
- A previously known issue with the Kapacitor tasks stopping due to unchecked errors (cf. POL1-529) meant that the tasks were not executing for longer than necessary; one possible interim mitigation is sketched below.
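As an illustration of that interim mitigation, the sketch below polls Kapacitor's HTTP API for tasks that are enabled but not executing and re-enables them. This is a sketch only: it assumes Kapacitor 1.x on its default port and the `requests` library, and re-enabling an individual task may not always be sufficient (in this incident the whole Kapacitor service was restarted).

```python
import requests

# Assumption: Kapacitor 1.x reachable locally on its default port.
KAPACITOR = "http://localhost:9092/kapacitor/v1"

def stalled_tasks():
    """Return tasks that are enabled but whose execution pipeline is not running."""
    resp = requests.get(f"{KAPACITOR}/tasks")
    resp.raise_for_status()
    return [
        t for t in resp.json()["tasks"]
        if t["status"] == "enabled" and not t["executing"]
    ]

def reenable(task_id):
    # Disabling and re-enabling a task makes Kapacitor rebuild its pipeline.
    requests.patch(f"{KAPACITOR}/tasks/{task_id}", json={"status": "disabled"}).raise_for_status()
    requests.patch(f"{KAPACITOR}/tasks/{task_id}", json={"status": "enabled"}).raise_for_status()

if __name__ == "__main__":
    for task in stalled_tasks():
        print(f"re-enabling stalled task {task['id']} (last error: {task.get('error') or 'none'})")
        reenable(task["id"])
```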