...
- cf. IT incident: 30052022
- Local system partition corruption
- Failure to connect or write data to InfluxDB
...
Incident severity: Temporary service outage
Data loss: No
...
Timeline
All times are in UTC
| Date | Time (UTC) | Description |
|---|---|---|
| 30 May 2022 | 12:52:37 | The first evidence of this incident appeared in the logs of prod-poller-processor.geant.org, e.g. `ts=2022-05-30T12:52:37.802Z ... task=remove_spikes_gwsd_rates` and `ts=2022-05-30T12:52:38.069Z ... task=remove_spikes_interface_rates`. remove_spikes_interface_rates is one of several stream functions in the data processing pipeline required for the data displayed in BRIAN. |
| | afternoon | Several performance issues started being reported across the network: EMS was failing to resolve the Influx cluster hostnames, and puppet was failing or taking a very long time to complete on many VMs. |
| | 19:08 | Keith Slater (and others) alerted on the #brian channel that data was missing in the BRIAN GUI. |
| | 20:30 | Bjarke Madsen replied that it seemed related to the service problems seen earlier in the day. |
| | 21:12 | Massimiliano Adamo replied on #swd-private that we had raised an issue with VMWare. |
| 31 May 2022 | 00:49:08 | For the duration of this event, Kapacitor continuously logged failures regarding writing to or communicating with InfluxDB (including failures to resolve its hostname), e.g. `ts=2022-05-31T00:49:08.133Z ...` and `ts=2022-05-31T01:26:44.163Z ... host"`. This means that while Kapacitor was receiving live network counters in real time, the results of the rate calculations weren't being saved to InfluxDB. |
| | 02:34-08:11 | There were many incidents of disk I/O failure logged over the duration of the event, indicating filesystem/disk corruption (the log excerpts reference `dm-0-8`). |
| | 07:34 | Keith Slater took ownership of informing APMs. |
| | 08:12 | Pete Pedersen stopped the system and fixed the corrupt partition. |
| | 08:26:55 | There was a network DNS failure during the boot process and haproxy failed to start, because it couldn't resolve prod-inventory-provider01.geant.org and prod-inventory-provider0x (`/etc/haproxy/haproxy.cfg:30`, `/etc/haproxy/haproxy.cfg:31`). See the name-resolution check sketched below the timeline. |
| | 08:27:07 | Kapacitor tasks failed to run because the haproxy service wasn't running, for example: `ts=2022-05-31T08:27:07.962Z ... node=inventory_enrichment2 ... text="urllib3.exceptions.MaxRetryError: ... HTTPConnectionPool(host='localhost', ... NewConnectionError('<urllib3.connection.HTTPConnection ... refused',))"`. Since the Kapacitor tasks weren't running, network counters were still not being processed or saved to InfluxDB. |
| | 08:41:11 | puppet ran automatically and restarted haproxy. At this time DNS resolution was back to normal and haproxy started successfully, but the Kapacitor tasks were still in a non-executing state, therefore data was still not being processed. |
| | 09:27:10 | Manual restart of Kapacitor. BRIAN processing of real-time data was restored. |
| | 10:39 | Data points lost during the incident were copied from UAT to production for the following measurements: interface_rates, dscp32_rates, gwsd_rates, multicast_rates (see the backfill sketch below the timeline). |
| | 11:56 | Keith Slater informed APMs that BRIAN is back to normal operation. |
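The 08:26:55 failure reflects how haproxy handles hostnames: server addresses in haproxy.cfg are resolved when the configuration is parsed at startup, and by default haproxy refuses to start if resolution fails. Below is a minimal sketch of the kind of pre-flight check that could surface this before haproxy is (re)started; it is illustrative only, the hostname list is taken from the timeline entry (the second hostname is truncated in this report) and is not a complete list of the real backends.

```python
import socket
import sys

# Hostnames referenced by the haproxy backends; provider01 is taken from the
# timeline above, the second hostname is truncated in the incident report.
BACKEND_HOSTNAMES = [
    "prod-inventory-provider01.geant.org",
]

failures = []
for name in BACKEND_HOSTNAMES:
    try:
        # We only care whether the name resolves; the port is irrelevant here.
        socket.getaddrinfo(name, None)
    except socket.gaierror as exc:
        failures.append(f"{name}: {exc}")

if failures:
    print("haproxy backend hostnames that do not resolve:", file=sys.stderr)
    for line in failures:
        print(f"  {line}", file=sys.stderr)
    sys.exit(1)  # non-zero exit lets a wrapper or unit delay the haproxy start

print("all haproxy backend hostnames resolve")
```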
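The 10:39 backfill step amounts to reading the affected measurements from the UAT InfluxDB for the incident window and writing them into production. A rough sketch of that idea is shown below, assuming InfluxDB 1.x and the `influxdb` Python client; the hostnames, database name and time window are placeholders, and a real run would also need to re-declare tag columns, since the naive query returns tags and fields merged together.

```python
from influxdb import InfluxDBClient

# Placeholder connection details -- not the real BRIAN hosts or credentials.
uat = InfluxDBClient(host="uat-influx.example.org", port=8086, database="brian")
prod = InfluxDBClient(host="prod-influx.example.org", port=8086, database="brian")

MEASUREMENTS = ["interface_rates", "dscp32_rates", "gwsd_rates", "multicast_rates"]
# Approximate incident window, taken from the timeline above.
WINDOW = "time >= '2022-05-30T12:52:00Z' AND time <= '2022-05-31T09:28:00Z'"

for measurement in MEASUREMENTS:
    result = uat.query(f'SELECT * FROM "{measurement}" WHERE {WINDOW}')
    points = [
        {
            "measurement": measurement,
            "time": row["time"],
            # Caveat: tags and fields come back merged; a real backfill would
            # split out the tag keys used by these measurements.
            "fields": {k: v for k, v in row.items() if k != "time" and v is not None},
        }
        for row in result.get_points()
    ]
    # Write back to production in batches to keep request sizes reasonable.
    for i in range(0, len(points), 5000):
        prod.write_points(points[i : i + 5000])
    print(f"{measurement}: rewrote {len(points)} points")
```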
Proposed Solution
- The core issue seems to be related to VMWare, and IT needs to provide a solution.
- This incident suggests that a previously logged technical debt issue (POL1-529), which has been considered medium/low priority, could be prioritized for development:
  - Fixing this issue could generally help with temporary DNS resolution errors; however, the DNS issues were secondary in this incident, and fixing it would not have prevented the overall outage.
- A previously known issue with the Kapacitor tasks stopping due to unchecked errors (cf. POL1-529) meant that the tasks were not executing for longer than necessary; one possible interim mitigation is sketched below.
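As an illustration of that interim mitigation, the sketch below polls Kapacitor's HTTP API for tasks that are enabled but not executing and re-enables them. This is a sketch only: it assumes Kapacitor 1.x on its default port and the `requests` library, and re-enabling an individual task may not always be sufficient (in this incident the whole Kapacitor service was restarted).

```python
import requests

# Assumption: Kapacitor 1.x reachable locally on its default port.
KAPACITOR = "http://localhost:9092/kapacitor/v1"

def stalled_tasks():
    """Return tasks that are enabled but whose execution pipeline is not running."""
    resp = requests.get(f"{KAPACITOR}/tasks")
    resp.raise_for_status()
    return [
        t for t in resp.json()["tasks"]
        if t["status"] == "enabled" and not t["executing"]
    ]

def reenable(task_id):
    # Disabling and re-enabling a task makes Kapacitor rebuild its pipeline.
    requests.patch(f"{KAPACITOR}/tasks/{task_id}", json={"status": "disabled"}).raise_for_status()
    requests.patch(f"{KAPACITOR}/tasks/{task_id}", json={"status": "enabled"}).raise_for_status()

if __name__ == "__main__":
    for task in stalled_tasks():
        print(f"re-enabling stalled task {task['id']} (last error: {task.get('error') or 'none'})")
        reenable(task["id"])
```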