...
Date/time (UTC) | Action | Actor | ||||||||
09:17 | Ashley Brown noticed that the most recent Inventory Provider update failed and asked on the dashboard users Slack channel. Erik Reid investigated. | |||||||||
09:31 | noticed this in the production Inventory Provider logs and asked Sam Roberts on Slack to investigate
| |||||||||
09:41 | Bjarke Madsen asked on Slack if anyone had information about Sensu check failure notifications. Erik Reid shared the critical error info:
| Bjarke Madsen | ||||||||
09:44 | Bjarke Madsen noticed that the Kapacitor speed removal process was failing because the Inventory Provider /poller/speeds API was returning errors:
| |||||||||
09:50 | The Inventory Provider update (TT#2023111334002463) included the failing code changes. It was decided to roll this update back. | |||||||||
09:58 | The Inventory Provider was rolled back in production and the data processing pipeline functionality was restored. | |||||||||
10:13 | The team decided there were 2 issues:
| |||||||||
10:28 | Sam Roberts found that the failure occurred when the /poller/speeds processor computed the aggregate speed for ae6 on mx2.zag.hr.geant.net | |||||||||
12:17 | Sam Roberts found that computing the aggregate speed for mx2.zag.hr.geant.net/ae6 failed because the Inventory cache data included et-4/0/1, et-4/0/2 and et-4/0/2.0. The logical interface et-4/0/2.0 was unexpected in this list, and processing failed when parsing its name. Sam Roberts heard from Robert Latta that the OC had been testing on this interface, but the details weren't clear. | |||||||||
12:30 | Bjarke Madsen attempted to restore GWS Direct rates for the outage timespan, but an error in a command modified data beyond the outage period, making the data temporarily unavailable. | |||||||||
15:02 | Sam Roberts prepared an MR for Inventory Provider to fix both of the issues above. | |||||||||
16:03 | Ashley Brown explained to Robert Latta and Erik Reid the test configuration that was enabled on mx2.zag.hr.geant.net. The details are described in
| |||||||||
21 15:30 | Bjarke Madsen restored availability of GWS Direct rates and copied the data missing for the outage period over from UAT. | |||||||||
13:30 | Bjarke Madsen restored (interface/scid) rates by copying from UAT to production | |||||||||
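
The 12:17 entry describes a parsing failure caused by a logical interface name (et-4/0/2.0) appearing among the names used to compute the aggregate speed of ae6. The sketch below illustrates one defensive way to handle such input. It is not the fix from the MR: the function names, the regular expression, and the speed table are all hypothetical, and the per-prefix speeds are assumed values chosen only for illustration.

```python
import re

# Hypothetical speed table (bits per second), keyed by interface-name prefix.
# These values are assumptions for illustration, not Inventory Provider data.
PREFIX_SPEED_BPS = {
    "ge": 1_000_000_000,
    "xe": 10_000_000_000,
    "et": 100_000_000_000,
}

# Matches a physical interface name such as 'et-4/0/2'.
_PHYSICAL_NAME = re.compile(r"^(?P<prefix>[a-z]+)-\d+/\d+/\d+$")


def physical_name(interface_name):
    """Drop a logical unit suffix, e.g. 'et-4/0/2.0' -> 'et-4/0/2'."""
    return interface_name.split(".", 1)[0]


def aggregate_speed_bps(member_names):
    """Sum the speed of each distinct physical member.

    Logical units are folded into their physical interface so they are
    not counted twice. Names that still cannot be parsed are skipped
    instead of raising an exception.
    """
    total = 0
    for name in sorted({physical_name(n) for n in member_names}):
        match = _PHYSICAL_NAME.match(name)
        if match is None:
            continue  # unexpected name: ignore it rather than fail
        total += PREFIX_SPEED_BPS.get(match.group("prefix"), 0)
    return total


members = ["et-4/0/1", "et-4/0/2", "et-4/0/2.0"]
print(aggregate_speed_bps(members))
```

Whether an unexpected name should be skipped silently, logged, or reported as an error is a design choice. Skipping keeps one malformed entry from failing the whole /poller/speeds response, at the cost of possibly hiding bad cache data.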
...