
Incident Description

New BRIAN bitrate traffic data was not computed or saved for approximately 21 hours.

...

The reasons for the degradation were:

  • Local system partition corruption (cf. IT incident: 30052022)
  • Failure to connect or write data to InfluxDB


The impact of this service degradation was:

...

Incident severity: CRITICAL (temporary service outage)

Data loss: No

Total duration of incident: 21 hours


Timeline

All times are in UTC

May 30, 12:52:37

The first evidence of this incident appeared in the logs of prod-poller-processor.geant.org. remove_spikes_interface_rates is one of several stream functions in the data processing pipeline required for the data displayed in BRIAN:


May 30 12:52:37 prod-poller-processor kapacitord[124994]: ts=2022-05-30T12:52:37.802Z lvl=error msg="failed to write points to InfluxDB" service=kapacitor task_master=main task=remove_spikes_gwsd_rates node=influxdb_out3 err=timeout
May 30 12:52:38 prod-poller-processor kapacitord[124994]: ts=2022-05-30T12:52:38.069Z lvl=error msg="encountered error" service=kapacitor task_master=main task=remove_spikes_interface_rates node=remove_spikes2 err="keepalive timedout, last keepalive received was: 2022-05-30 12:52:28.069298439 +0000 UTC"

 

afternoon

Several performance issues started being reported across the network:

  • EMS was failing to resolve the influx cluster hostnames
  • puppet was failing or taking a very long time to complete on many VMs

 

19:08

Keith Slater (and others) alerted on the #brian channel that data was missing in the BRIAN GUI.

 

20:30

Bjarke Madsen replied that it seemed related to service problems seen earlier in the day.

 

21:12

Massimiliano Adamo replied on #swd-private that we had raised an issue with VMWare regarding storage device failure.

 

23:28

Linda Ness sent a mail to gn4-3-all@lists.geant.org indicating that several services were down.

 

12:53

For the duration of this event, Kapacitor continuously logged failures regarding writing to or communicating with InfluxDB, as below:


May 31 00:49:08 prod-poller-processor kapacitord[54933]: ts=2022-05-31T00:49:08.133Z lvl=error msg="failed to write points to InfluxDB" service=kapacitor task_master=main task=interface_rates node=influxdb_out12 err=timeout
May 31 01:26:44 prod-poller-processor kapacitord[54933]: ts=2022-05-31T01:26:44.163Z lvl=error msg="failed to connect to InfluxDB, retrying..." service=influxdb cluster=read err="Get https://influx-cluster.service.ha.geant.org:8086/ping: dial tcp: lookup influx-cluster.service.ha.geant.org on 83.97.93.200:53: no such host"

This means that while Kapacitor was receiving live network counters in real time, the results of the rate calculations weren't being saved to InfluxDB.
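
For context, the rate calculation itself is simple: each bitrate point is derived from the difference between two successive traffic counter samples divided by the sampling interval. The following is only a hedged Python sketch of that arithmetic (assuming octet counters and hypothetical arguments), not the actual Kapacitor task logic:

# Hedged sketch of a counter-to-rate calculation; not BRIAN's actual code.
def bit_rate(prev_octets, prev_time, curr_octets, curr_time):
    """Return bits per second between two counter samples, or None if unusable."""
    interval = curr_time - prev_time              # seconds between samples
    if interval <= 0 or curr_octets < prev_octets:
        return None                               # clock skew, or counter wrap/reset
    return (curr_octets - prev_octets) * 8 / interval

# Example: 1,250,000 octets in 10 seconds is 1 Mbit/s
print(bit_rate(1_000_000, 0, 2_250_000, 10))      # 1000000.0

During the outage the results of these calculations could not be written to InfluxDB, which is why data was missing from the BRIAN GUI.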

 


 

02:34-08:11

There were many incidents of disk I/O failure logged over the duration of the event, indicating filesystem/disk corruption. For example:


May 31 02:34:59 prod-poller-processor kernel: [1754481.561423] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561427] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 02 05 63 f0 00 00 08 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561442] EXT4-fs warning (device dm-0): ext4_end_bio:302: I/O error -5 writing to inode 149650 (offset 0 size 0 starting block 702078)
May 31 02:34:59 prod-poller-processor kernel: [1754481.561494] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561497] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 ef d3 18 00 00 10 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561529] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561532] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 af dd b0 00 00 10 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561550] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561553] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 f3 b3 d0 00 00 08 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561567] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1086s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561569] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 aa 75 90 00 00 18 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561579] EXT4-fs warning (device dm-4): ext4_end_bio:302: I/O error -5 writing to inode 73 (offset 4927488 size 8192 starting block 1005746)
May 31 02:34:59 prod-poller-processor kernel: [1754481.561632] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561635] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 ef b1 90 00 00 08 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561675] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561678] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 ef d3 48 00 00 10 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561696] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561699] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 ef d3 68 00 00 a0 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561758] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561761] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 af c6 68 00 00 20 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561789] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561792] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 ef b4 f0 00 00 10 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561814] EXT4-fs warning (device dm-0): ext4_end_bio:302: I/O error -5 writing to inode 131114 (offset 0 size 0 starting block 1029894)
May 31 02:34:59 prod-poller-processor kernel: [1754481.561852] EXT4-fs warning (device dm-0): ext4_end_bio:302: I/O error -5 writing to inode 131105 (offset 14843904 size 12288 starting block 261672)
May 31 02:34:59 prod-poller-processor kernel: [1754481.626924] JBD2: Detected IO errors while flushing file data on dm-0-8

 

07:34
Keith Slater took ownership of informing APMs.

 

08:12

Pete Pedersen stopped the system and fixed the corrupt partition.

 

08:26:55

System was rebooted.

 

08:26:55

There was a network DNS failure during the boot process, and haproxy failed to start because prod-inventory-provider01.geant.org and prod-inventory-provider02.geant.org couldn't be resolved:


May 31 08:26:55 prod-poller-processor haproxy[976]: [ALERT] 150/082655 (976) : parsing [/etc/haproxy/haproxy.cfg:30] : 'server : could not resolve address
May 31 08:26:55 prod-poller-processor haproxy[976]: [ALERT] 150/082655 (976) : parsing [/etc/haproxy/haproxy.cfg:31] : 'server : could not resolve address

 

08:27:07

Kapacitor tasks failed to run because the haproxy service wasn't running, for example:

May 31 08:27:07 prod-poller-processor kapacitord[839]: ts=2022-05-31T08:27:07.962Z lvl=info msg="UDF log" service=kapacitor task_master=main task=service_enrichment node=inventory_enrichment2 text="urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /poller/interfaces (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f749f4a2978>: Failed to establish a new connection: [Errno 111] Connection refused',))"


Since the Kapacitor tasks weren't running, network counters were still not being processed or saved to InfluxDB.
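
The MaxRetryError above came from the enrichment UDF's HTTP call to localhost:8080 (the local haproxy front-end for the inventory provider). As a hedged illustration of the kind of client-side resiliency discussed under POL1-607 in the Proposed Solution below (the endpoint, timeout and retry parameters here are assumptions, not the actual UDF code), such a call could retry transient connection failures:

# Hedged illustration of a retrying HTTP call; not the actual UDF implementation.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def fetch_poller_interfaces(base_url="http://localhost:8080"):
    """Fetch /poller/interfaces, retrying transient connection failures."""
    retry = Retry(total=5, backoff_factor=1.0, status_forcelist=(502, 503, 504))
    session = requests.Session()
    session.mount("http://", HTTPAdapter(max_retries=retry))
    response = session.get(f"{base_url}/poller/interfaces", timeout=30)
    response.raise_for_status()
    return response.json()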

 

08:41:11

puppet ran automatically and restarted haproxy. At this time DNS resolution was back to normal and haproxy successfully started. But Kapacitor tasks were still in a non-executing state, so data was still not being processed.

 

09:27:10
Manual restart of Kapacitor. BRIAN data processing of real-time data was restored.

 

10:39

Sam Roberts copied the data points lost during the incident from UAT to production (a hedged sketch of this kind of copy follows the list below):

  • interface_rates
  • dscp32_rates
  • gwsd_rates
  • multicast_rates
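
The exact copy procedure isn't recorded here. Purely as a hedged sketch (hostnames, database name and time window below are placeholders, and tags are simplified into fields), back-filling one measurement from a UAT InfluxDB into production with the influxdb Python client could look like:

# Hedged sketch: copy points for one measurement between two InfluxDB instances.
# Hostnames, database and time range are placeholders, not the real values.
from influxdb import InfluxDBClient

uat = InfluxDBClient(host="uat-influx.example.org", port=8086, database="brian")
prod = InfluxDBClient(host="prod-influx.example.org", port=8086, database="brian")

query = (
    "SELECT * FROM interface_rates "
    "WHERE time >= '2022-05-30T12:52:00Z' AND time <= '2022-05-31T09:28:00Z'"
)
result = uat.query(query, epoch="ns")

points = [
    {
        "measurement": "interface_rates",
        "time": row.pop("time"),
        "fields": row,  # simplification: remaining columns treated as fields
    }
    for row in result.get_points()
]
prod.write_points(points, time_precision="n", batch_size=5000)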
31 May 2022, 11:56

Keith Slater informed APMs that BRIAN was back to normal operation.

Proposed Solution

  • The core issue seems to be related to VMWare, and IT need to provide a solution. S.M.A.R.T. alerts have been found in the vCenter, but monitoring has not been configured to detect these alerts.
  • A previously-known issue with the Kapacitor tasks stopping due to unchecked errors meant that the services were not executing for longer than necessary (cf. POL1-529).
  • This incident suggests that a previously logged technical debt issue (POL1-529), which has been considered medium/low priority, could be prioritized for development:
    • Fixing this issue could generally help with temporary DNS resolution errors; however, the DNS issues were secondary in this incident, and fixing it would not have prevented the overall outage.
  • While VMWare disk corruption and network DNS failures are external events outside the control of SWD, a further investigation into potential improvements in processing resiliency is described in POL1-607 (an illustrative watchdog sketch follows this list).
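
As an illustration of the kind of resiliency improvement being considered (not a committed design; the Kapacitor port, API path and alerting hook below are assumptions), a periodic watchdog could list tasks through Kapacitor's HTTP API and flag any enabled task that is not actually executing, rather than waiting for users to notice missing data in the GUI:

# Hedged sketch of a Kapacitor task watchdog; port, path and alerting are assumptions.
import requests

KAPACITOR_URL = "http://localhost:9092"  # assumed default Kapacitor HTTP port

def find_stalled_tasks():
    """Return ids of tasks that are enabled but not currently executing."""
    resp = requests.get(f"{KAPACITOR_URL}/kapacitor/v1/tasks", timeout=10)
    resp.raise_for_status()
    return [
        task["id"]
        for task in resp.json().get("tasks", [])
        if task.get("status") == "enabled" and not task.get("executing", False)
    ]

if __name__ == "__main__":
    stalled = find_stalled_tasks()
    if stalled:
        # Placeholder: in practice this would alert (or attempt a restart) instead of printing.
        print("Stalled Kapacitor tasks:", ", ".join(stalled))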