...
Bjarke Madsen confirmed that the UAT data processing pipeline had fortunately not failed for the affected time period and the data could therefore be copied from there into the production system. This was done on <TBD>.
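For reference, copying the already-computed rates from the UAT InfluxDB into production can be scripted with the InfluxDB Python client. The sketch below is illustrative only: the hostnames, database, measurement name and outage window are assumptions rather than values taken from this report, and in a real copy the tag columns would need to be written as tags rather than fields.

```python
from influxdb import InfluxDBClient

# Assumed hosts, database and measurement; the real names are not in this report.
uat = InfluxDBClient(host="uat-influx.example.org", port=8086, database="poller")
prod = InfluxDBClient(host="prod-influx.example.org", port=8086, database="poller")

# Assumed outage window (UTC); the report only gives times of day, not full timestamps.
OUTAGE_START = "2023-11-13T09:17:00Z"
OUTAGE_END = "2023-11-13T09:58:00Z"

# Read the rates that UAT managed to compute for the missing window.
query = (
    "SELECT * FROM gwsd_rates "
    f"WHERE time >= '{OUTAGE_START}' AND time <= '{OUTAGE_END}'"
)
result = uat.query(query)

# Re-shape the returned rows as write points. For simplicity every column is
# written as a field here; a real copy would split tag columns out into "tags".
points = [
    {
        "measurement": "gwsd_rates",
        "time": row["time"],
        "fields": {k: v for k, v in row.items() if k != "time"},
    }
    for row in result.get_points()
]

# Write the recovered points into production in batches.
prod.write_points(points, batch_size=5000)
```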
A prior attempt at data restoration was made around 12:30 UTC on , to have Kapacitor re-process counters into rates for the timespan with missing data. Counters were available on production for the outage timespan, but had not been converted to rates.
During this attempt, which used Kapacitor's ability to re-process older counters (first attempted on GWS Direct data), an incorrect argument was given to the replay command that was intended to limit the processing to the outage timespan.
As a result, GWS Direct data outside the outage timespan was also modified: the tags were altered, making the data temporarily unavailable until around 15:30 UTC while the measurement was re-created with the tags restored.
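For comparison, a correctly bounded replay passes the outage window explicitly to the replay command. A minimal sketch, assuming the replay is issued through Kapacitor's replay-live batch CLI command; the task name and timestamps below are hypothetical, not taken from this report:

```python
import subprocess

# Assumed outage window and task id; neither value is taken from this report.
OUTAGE_START = "2023-11-13T09:17:00Z"
OUTAGE_END = "2023-11-13T09:58:00Z"
TASK_ID = "gws_direct_rates"

# -start/-stop confine the re-processing to the outage window; the incident
# involved an incorrect argument of this kind, so data outside the window
# was touched as well.
subprocess.run(
    [
        "kapacitor", "replay-live", "batch",
        "-task", TASK_ID,
        "-start", OUTAGE_START,
        "-stop", OUTAGE_END,
    ],
    check=True,
)
```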
Timeline
...
Date/time (UTC) | Action | Actor
09:17 | Ashley Brown noticed that the most recent Inventory Provider update failed and asked on the dashboard users Slack channel. Erik Reid investigated. | Ashley Brown, Erik Reid
09:31 | Noticed this in the production Inventory Provider logs and asked Sam Roberts on Slack to investigate. |
09:41 | Bjarke Madsen asked on Slack if anyone had information about the Sensu check failure notifications. Erik Reid shared the critical error info: | Bjarke Madsen
09:44 | Bjarke Madsen noticed that the Kapacitor speed removal process was failing because the Inventory Provider /poller/speeds API was returning errors: | Bjarke Madsen
09:50 | The Inventory Provider update that occurred on (TT#2023111334002463) included the code changes that were failing. It was decided to roll this back. |
09:58 | The Inventory Provider was rolled back in production and the data processing pipeline functionality was restored. |
10:13 | The team decided there were two issues: |
10:28 | Sam Roberts found that the failure occurred when the /poller/speeds processor computed the aggregate speed for ae6 on mx2.zag.hr.geant.net. | Sam Roberts
12:17 | Sam Roberts found that the failure in computing the aggregate speed for mx2.zag.hr.geant.net/ae6 arose because the Inventory cache data included et-4/0/1, et-4/0/2 and et-4/0/2.0. A logical interface in the list was unexpected and the processing failed when parsing its name (see the sketch after this timeline). Sam Roberts heard from Robert Latta that the OC had been testing on this interface, but the details weren't clear. | Sam Roberts
12:30 | Bjarke Madsen attempted to restore GWS Direct rates in the outage timespan, but an error in a command caused data outside the outage duration to be modified as well, rendering the data temporarily unavailable. | Bjarke Madsen
15:02 | Sam Roberts prepared an MR for Inventory Provider to fix both of the issues above. | Sam Roberts
16:03 | Ashley Brown explained to Robert Latta and Erik Reid the test configuration that was enabled on mx2.zag.hr.geant.net. The details are described in | Ashley Brown
15:30 | Bjarke Madsen restored availability of GWS Direct rates and copied over the missing data in the outage duration from UAT. | Bjarke Madsen
13:30 | Bjarke Madsen restored (interface/scid) rates by copying from UAT to production. | Bjarke Madsen
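For context on the 12:17 finding above, the aggregate-speed computation needs to tolerate logical sub-interfaces (such as et-4/0/2.0) appearing in the member list of a bundle. The sketch below is not the actual Inventory Provider code; it only illustrates the kind of defensive handling involved, and the member speeds shown are assumed values.

```python
def aggregate_bundle_speed(members):
    """Sum the speeds of the physical members of an aggregate (ae) interface.

    Logical units such as 'et-4/0/2.0' carry a '.<unit>' suffix; they are
    skipped here instead of being parsed as physical members, which is the
    case that made /poller/speeds fail for ae6 on mx2.zag.hr.geant.net.
    """
    total = 0
    for member in members:
        name = member["name"]
        if "." in name:
            # Logical sub-interface (e.g. et-4/0/2.0): not a physical member.
            continue
        total += member["speed"]
    return total


# Illustrative member list resembling the one seen for ae6 (speeds are assumed).
members = [
    {"name": "et-4/0/1", "speed": 100_000},
    {"name": "et-4/0/2", "speed": 100_000},
    {"name": "et-4/0/2.0", "speed": 0},  # the unexpected logical interface
]
assert aggregate_bundle_speed(members) == 200_000
```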
Root Cause Identification
...