...
EMS (via https://events.geant.org), along with other services, was unavailable for a few minutes at a time throughout the day. This was due to the PostgreSQL service being unavailable.
The reasons for the degradation:
- prod-events01.geant.org and prod-events02.geant.org could not resolve the hostname for the PostgreSQL server (prod-postgres.geant.org)
- The first PostgreSQL server (prod-postgres01.geant.org) and the replication witness (prod-postgres-witness.geant.org) failed to start up correctly due to VMware storage problems
- This resulted in there being no PostgreSQL primary node, and therefore the DNS entry prod-postgres-witness.geant.org -> primary.prod-postgres.service.ha.geant.net. did not resolve to an actual host
- Multiple sites in GÉANT's infrastructure experienced service interruption
- The underlying cause appears to be a storage error on VMware; IT are investigating together with VMware support
- TODO: add link to the IT incident page: 30052022
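The failure chain above can be reproduced in miniature: with no primary elected, the HA DNS name had no address record, so client-side lookups failed exactly as psycopg2 later reported in indico.log. A minimal sketch (the hostnames are those from this report; the helper function is illustrative, not part of any GÉANT tooling):

```python
import socket

def resolve_or_none(hostname: str):
    """Return the first resolved IP for hostname, or None if resolution fails.

    This mirrors the step psycopg2 performs before connecting: if the name
    has no address (as happened while no PostgreSQL primary was elected),
    the lookup fails and the connection attempt aborts with
    'could not translate host name ... to address'.
    """
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

# During the outage, resolve_or_none("prod-postgres.geant.org") would have
# returned None on the EMS hosts, while names with valid records (e.g.
# "localhost") continued to resolve normally.
```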
The impact of this service degradation was that the following services were unavailable:
...
- https://events.geant.org
- https://map.geant.com
- https://compendiumdatabase.geant.org

Users could not access EMS.
Incident severity:
CRITICAL (Red) - intermittent service outage
...
Total duration of incident: 21 hours (still ongoing at the 22:22 UTC update)
Timeline
All times are in UTC
| Date | Time | Description |
|---|---|---|
| 30 May 2022 | 13:10 | First error in indico.log of PostgreSQL being unavailable: OperationalError: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address: Name or service not known |
| | 13:20 | First user query about EMS login problems (Slack #it) |
| | 13:24 | Service restored and acknowledged by users on Slack #it (the first of many unavailable-then-available-again periods) |
| | 13:27 | Ian Galpin starts investigating and finds the DNS resolution error |
| | 16:08 | IT confirms that there is a VMware storage issue |
| | 20:50 | Additional outages occur; IT still working on the issue with VMware |
...
| | 23:28 | Linda Ness sent an email to gn4-3-all@lists.geant.org indicating that several services were down |
| 31 May 2022 | 07:33 - 07:46 | Pete Pedersen, Massimiliano Adamo, and Allen Kong work on restoring the AD service by shutting down the FRA node and starting it back up |
| | 08:20 | Ian Galpin cannot log into the EMS servers; Pete Pedersen, Massimiliano Adamo, and Allen Kong investigate (discussion on #service-issue-30052022) |
| | 08:50 | Pete Pedersen and Massimiliano Adamo restarted the EMS VMs and ran filesystem checks |
| | | TODO: Pete and Max to fill in details of the puppet/PostgreSQL fix |
| | 09:45 | Ian Galpin tested and verified that service was restored for EMS/Map/CompendiumDB; Mandeep Saini updated #it |
Proposed Solutions
- The core issue appears to be related to VMware storage, and IT need to provide a solution for monitoring the health of the VM storage
- The HA of primary services (EMS, Map, etc.) and dependent core services (PostgreSQL etc.) needs to be improved:
- Additional monitoring configuration or setup changes are required to more easily identify the root cause of an outage
- If a core service goes down (e.g. no PostgreSQL primary set), that root-cause alarm will be hard to spot among the hundreds of subsequent alarms
- TODO: Pete and Max to fill in
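One way to make the root-cause alarm stand out, as suggested above, is dependency-aware alarm suppression: an alarm is only surfaced if none of the services it depends on are also alarming. A hypothetical sketch, assuming a hand-maintained dependency map for the services involved in this incident (the function and map are illustrative, not an existing monitoring configuration):

```python
def root_cause_alarms(alarms, depends_on):
    """Return only the alarms that are likely root causes.

    alarms: list of currently alarming service names.
    depends_on: maps each service to the services it requires.
    An alarm is suppressed when any of its dependencies is also
    alarming, since it is then presumed to be a downstream effect.
    """
    failing = set(alarms)
    return [
        svc for svc in alarms
        if not any(dep in failing for dep in depends_on.get(svc, []))
    ]

# Hypothetical dependency map for the services in this incident:
deps = {
    "ems": ["postgres"],
    "map": ["postgres"],
    "compendiumdb": ["postgres"],
    "postgres": ["vmware-storage"],
}

# During the outage all five would alarm at once, but only the storage
# alarm survives suppression:
# root_cause_alarms(["ems", "map", "compendiumdb", "postgres",
#                    "vmware-storage"], deps)  -> ["vmware-storage"]
```

This does not replace the individual service alarms; it only ranks them, so the on-call engineer sees the storage failure first instead of hundreds of downstream DNS and connection errors.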