...
Bjarke Madsen confirmed that the UAT data processing pipeline had fortunately not failed for the affected time period and the data could therefore be copied from there into the production system. This was done on <TBD>.
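For reference, copying the already-computed rates from the UAT InfluxDB into production can be scripted with the InfluxDB Python client. The sketch below is illustrative only: the hostnames, database, measurement name and outage window are assumptions rather than values taken from this report, and in a real copy the tag columns would need to be written as tags rather than fields.

```python
from influxdb import InfluxDBClient

# Assumed hosts, database and measurement; the real names are not in this report.
uat = InfluxDBClient(host="uat-influx.example.org", port=8086, database="poller")
prod = InfluxDBClient(host="prod-influx.example.org", port=8086, database="poller")

# Assumed outage window (UTC); the report only gives times of day, not full timestamps.
OUTAGE_START = "2023-11-13T09:17:00Z"
OUTAGE_END = "2023-11-13T09:58:00Z"

# Read the rates that UAT managed to compute for the missing window.
query = (
    "SELECT * FROM gwsd_rates "
    f"WHERE time >= '{OUTAGE_START}' AND time <= '{OUTAGE_END}'"
)
result = uat.query(query)

# Re-shape the returned rows as write points. For simplicity every column is
# written as a field here; a real copy would split tag columns out into "tags".
points = [
    {
        "measurement": "gwsd_rates",
        "time": row["time"],
        "fields": {k: v for k, v in row.items() if k != "time"},
    }
    for row in result.get_points()
]

# Write the recovered points into production in batches.
prod.write_points(points, batch_size=5000)
```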
A prior attempt at data restoration was made around 12:30 UTC on , to have Kapacitor re-process counters into rates for the timespan with missing data. Counters were available on production for the outage timespan, but had not been converted to rates.
During this attempt, which used Kapacitor's ability to re-process older counters (first attempted on GWS Direct data), an incorrect argument was given to the replay command that was intended to limit the processing to the outage timespan.
As a result, GWS Direct data outside the outage timespan was also modified: the tags were altered, making the data temporarily unavailable until around 15:30 UTC while the measurement was re-created with the tags restored.
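For comparison, a correctly bounded replay passes the outage window explicitly to the replay command. A minimal sketch, assuming the replay is issued through Kapacitor's replay-live batch CLI command; the task name and timestamps below are hypothetical, not taken from this report:

```python
import subprocess

# Assumed outage window and task id; neither value is taken from this report.
OUTAGE_START = "2023-11-13T09:17:00Z"
OUTAGE_END = "2023-11-13T09:58:00Z"
TASK_ID = "gws_direct_rates"

# -start/-stop confine the re-processing to the outage window; the incident
# involved an incorrect argument of this kind, so data outside the window
# was touched as well.
subprocess.run(
    [
        "kapacitor", "replay-live", "batch",
        "-task", TASK_ID,
        "-start", OUTAGE_START,
        "-stop", OUTAGE_END,
    ],
    check=True,
)
```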
Timeline
...
Date/time (UTC) | Action | Actor
09:17 | Ashley Brown noticed that the most recent Inventory Provider update failed and asked on the dashboard users Slack channel. Erik Reid investigated. | Ashley Brown, Erik Reid
09:31 | Noticed this in the production Inventory Provider logs and asked Sam Roberts on Slack to investigate. |
09:41 | Bjarke Madsen asked on Slack if anyone had information about the Sensu check failure notifications. Erik Reid shared the critical error info: | Bjarke Madsen
09:44 | Bjarke Madsen noticed that the Kapacitor speed removal process was failing because the Inventory Provider /poller/speeds API was returning errors: | Bjarke Madsen
09:50 | The Inventory Provider update that occurred on (TT#2023111334002463) included the code changes that were failing. It was decided to roll this back. |
09:58 | The Inventory Provider was rolled back in production and the data processing pipeline functionality was restored. |
10:13 | The team decided there were two issues: |
10:28 | Sam Roberts found that the failure occurred when the /poller/speeds processor computed the aggregate speed for ae6 on mx2.zag.hr.geant.net. | Sam Roberts
12:17 | Sam Roberts found that the failure in computing the aggregate speed for mx2.zag.hr.geant.net/ae6 arose because the Inventory cache data included et-4/0/1, et-4/0/2 and et-4/0/2.0. A logical interface in the list was unexpected and the processing failed when parsing its name (see the sketch after this timeline). Sam Roberts heard from Robert Latta that the OC had been testing on this interface, but the details weren't clear. | Sam Roberts
12:30 | Bjarke Madsen attempted to restore GWS Direct rates in the outage timespan, but an error in a command caused data outside the outage duration to be modified as well, rendering the data temporarily unavailable. | Bjarke Madsen
15:02 | Sam Roberts prepared an MR for Inventory Provider to fix both of the issues above. | Sam Roberts
16:03 | Ashley Brown explained to Robert Latta and Erik Reid the test configuration that was enabled on mx2.zag.hr.geant.net. The details are described in | Ashley Brown
15:30 | Bjarke Madsen restored availability of GWS Direct rates and copied over the missing data in the outage duration from UAT. | Bjarke Madsen
13:30 | Bjarke Madsen restored (interface/scid) rates by copying from UAT to production. | Bjarke Madsen
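For context on the 12:17 finding above, the aggregate-speed computation needs to tolerate logical sub-interfaces (such as et-4/0/2.0) appearing in the member list of a bundle. The sketch below is not the actual Inventory Provider code; it only illustrates the kind of defensive handling involved, and the member speeds shown are assumed values.

```python
def aggregate_bundle_speed(members):
    """Sum the speeds of the physical members of an aggregate (ae) interface.

    Logical units such as 'et-4/0/2.0' carry a '.<unit>' suffix; they are
    skipped here instead of being parsed as physical members, which is the
    case that made /poller/speeds fail for ae6 on mx2.zag.hr.geant.net.
    """
    total = 0
    for member in members:
        name = member["name"]
        if "." in name:
            # Logical sub-interface (e.g. et-4/0/2.0): not a physical member.
            continue
        total += member["speed"]
    return total


# Illustrative member list resembling the one seen for ae6 (speeds are assumed).
members = [
    {"name": "et-4/0/1", "speed": 100_000},
    {"name": "et-4/0/2", "speed": 100_000},
    {"name": "et-4/0/2.0", "speed": 0},  # the unexpected logical interface
]
assert aggregate_bundle_speed(members) == 200_000
```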
Root Cause Identification
...