...

EMS (via https://events.geant.org) and other services were unavailable for a few minutes at a time throughout the day. This was due to the PostgreSQL service being unavailable.


The reason for degradation:


The impact of this service degradation was that the following services were unavailable:

...


Incident severity: CRITICAL (Intermittent service outage)

...

Total duration of incident: 21 hours


Timeline

All times are in UTC

Date / Time / Description

 

13:10:00 

First error in indico.log of PostgreSQL being unavailable:

OperationalError: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address: Name or service not known

 

13:20

First user query about EMS login problem (Slack #it)

 

13:24

Service restored and acknowledged by users on Slack #it

(The first of many periods in which the service became unavailable and then available again)

 

13:27

Ian Galpin starts investigating and finds the DNS resolution error: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address

 

16:08

IT confirms that there is a VMware storage issue, via Massimiliano Adamo on Slack #swd-private:

it's a storage issue.

 

20:50

Additional outages occur; IT is still working on the VMware issue

...


 

23:28

Linda Ness sent an email to gn4-3-all@lists.geant.org indicating that several services were down

 

07:33 - 07:46

Pete Pedersen, Massimiliano Adamo and Allen Kong work on restoring the AD service by shutting down the FRA node and starting it back up

 

08:20

Ian Galpin can't log into the EMS servers; Pete Pedersen, Massimiliano Adamo and Allen Kong are looking into it

Allen Kong on #service-issue-30052022:

@massimiliano.adamo I'm looking at prod-event01 and seeing lots of nasty logical block warnings...can you check?

 

08:50

Pete Pedersen and Massimiliano Adamo restarted EMS VMs and ran filesystem checks



PETE AND MAX FIXED puppet/postgres PLEASE FILL IN

 

09:45

Ian Galpin tested and verified that service was restored for EMS/Map/CompendiumDB

Mandeep Saini updated #it







Proposed Solutions

  • The core issue appears to be related to VMware storage; IT needs to provide a solution for monitoring the health of the VM storage
  • The HA of primary services (EMS, Map, etc.) and dependent core services (PostgreSQL etc.) needs to be improved:
    • Additional monitoring configuration or setup changes are required to make the root cause of an outage easier to identify
      • If a core service goes down (e.g. no PostgreSQL primary set), that alarm is hard to spot among the hundreds of subsequent alarms
  • PETE AND MAX TO PLEASE FILL IN
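
As an illustration of the monitoring point above (spotting the core-service alarm among many downstream alarms), the following is a minimal sketch, not a description of the existing monitoring setup. It assumes a hand-maintained dependency map between services; the map contents, the name DEPENDS_ON and the function root_alarms are hypothetical and only loosely based on the services named in this report.

```python
from typing import Dict, Set

# Hypothetical dependency map (an assumption for illustration only):
# each service maps to the services it directly depends on.
DEPENDS_ON: Dict[str, Set[str]] = {
    "EMS": {"PostgreSQL"},
    "Map": {"PostgreSQL"},
    "CompendiumDB": {"PostgreSQL"},
    "PostgreSQL": {"VM storage"},
    "VM storage": set(),
}


def root_alarms(alarming: Set[str], depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Return the alarming services that have no alarming direct dependency.

    These are the most likely root causes; alarms on services whose direct
    dependency is also alarming are treated as downstream noise.
    Only direct dependencies are checked, so a silent intermediate service
    will cause both ends of the chain to be reported.
    """
    return {
        service
        for service in alarming
        if not (depends_on.get(service, set()) & alarming)
    }


if __name__ == "__main__":
    # Example: the frontends and the database are all alarming.
    print(root_alarms({"EMS", "Map", "CompendiumDB", "PostgreSQL"}, DEPENDS_ON))
    # prints {'PostgreSQL'}
```

With such a filter, an alarm view could show the likely root alarm first and collapse the dependent alarms underneath it. Whether the dependency map is maintained by hand or derived from existing configuration is an open design choice.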