Incident Description
EMS (via https://events.geant.org) and other services (GÉANT Connectivity Map, Compendium Database) were unavailable for a few minutes at a time throughout the day. This was due to the PostgreSQL service being unavailable.
The reasons for the degradation:
- prod-events01.geant.org and prod-events02.geant.org could not resolve the hostname for the PostgreSQL server (prod-postgres.geant.org)
- The first PostgreSQL server (prod-postgres01.geant.org) and the replication witness (prod-postgres-witness.geant.org) failed to start up correctly due to the VMWare storage problems
- This resulted in there being no PostgreSQL primary node, and therefore the DNS entry prod-postgres.geant.org -> primary.prod-postgres.service.ha.geant.net. did not resolve to an actual host (a minimal resolution check is sketched after this list)
- Multiple sites in GÉANT's infrastructure experienced service interruption
- The underlying cause seems to be a storage error on VMWare. IT are investigating along with VMWare support
- cf. IT incident: 30052022
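The 13:10 log line in the timeline shows that the failure surfaced on the application side as a name-resolution error rather than a database error. Below is a minimal sketch of how that condition could be confirmed from one of the EMS hosts; the hostname is taken from this report, everything else is illustrative.

```python
#!/usr/bin/env python3
"""Minimal check of the database alias used by the EMS servers.

Sketch only: the hostname is taken from this report; the probe itself is an
assumption about how the failure could be confirmed from a client host.
"""
import socket

DB_HOST = "prod-postgres.geant.org"   # Consul-backed alias from this report
DB_PORT = 5432

def resolves(host: str, port: int) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        addrs = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # Same condition as the psycopg2 error in the timeline:
        # 'could not translate host name ... Name or service not known'
        print(f"DNS failure for {host}: {exc}")
        return False
    print(f"{host} resolves to {[a[4][0] for a in addrs]}")
    return True

if __name__ == "__main__":
    resolves(DB_HOST, DB_PORT)
```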
The impact of this service degradation was that the following services were intermittently unavailable: EMS (https://events.geant.org), the GÉANT Connectivity Map and the Compendium Database.
Incident severity: CRITICAL (intermittent service outage)
Data loss: NO
Total duration of incident: 21 hours
Timeline
All times are in UTC
| Date | Time | Description |
|---|---|---|
| 2022-05-30 | 13:10 | First error in indico.log of PostgreSQL being unavailable: OperationalError: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address: Name or service not known |
| 2022-05-30 | 13:20 | First user query about EMS login problem (Slack #it) |
| 2022-05-30 | 13:24 | Service restored and acknowledged by users on Slack #it (first of many periods where the service became unavailable and then available again) |
| 2022-05-30 | 13:27 | Ian Galpin starts investigating and finds the DNS resolving error |
| 2022-05-30 | 16:08 | IT confirms that there is a VMWare storage issue |
| 2022-05-30 | 20:50 | Additional outages occur; IT still working on the issue with VMWare |
| 2022-05-30 | 23:28 | Linda Ness sent a mail to gn4-3-all@lists.geant.org indicating several services are down |
| 2022-05-31 | 07:33 - 07:46 | Pete Pedersen, Massimiliano Adamo and Allen Kong work on restoring the AD service by shutting down the FRA node and starting it back up |
| 2022-05-31 | 08:20 | Ian Galpin can't log into the EMS servers; Pete Pedersen, Massimiliano Adamo and Allen Kong look into it; Allen Kong posts an update on #service-issue-30052022 |
| 2022-05-31 | 08:50 | Pete Pedersen and Massimiliano Adamo restarted the EMS VMs and ran filesystem checks |
| 2022-05-31 | | Pete Pedersen and Massimiliano Adamo fixed Puppet/PostgreSQL (details to be filled in) |
| 2022-05-31 | 09:45 | Ian Galpin tested and verified that service was restored for EMS/Map/Compendium DB; Mandeep Saini updated #it |
Proposed Solutions
- The core issue seems to be related to VMWare storage; IT needs to provide a solution for monitoring the health of the VM storage
- The HA of primary services (EMS, Map, etc.) and dependent core services (PostgreSQL, etc.) needs to be improved
- Additional monitoring configuration or setup changes are required to more easily identify the root cause of an outage:
  - If a core service goes down (e.g. no PostgreSQL primary set), that alarm will be hard to spot among the hundreds of subsequent alarms (a root-cause probe is sketched at the end of this section)
- Postgres
  - FAILURE: fail-over failed to work because it required the witness to be active, and the witness had also failed because it runs on the same VMWare cluster as the primary DB server
  - Proposed solutions:
    - Add two more witness servers so that there is one on every site
    - Alternatively, change the way the PostgreSQL clients connect so that they are cluster-aware (i.e. all nodes are listed in the connection string with the targetServerType=primary option) and connect directly to PostgreSQL; this removes the need for the Consul config / intermediate layer (fewer parts, less to break; see the connection sketch below)
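As a sketch of the second suggestion, the snippet below opens a cluster-aware connection from Python. Note that targetServerType=primary is the pgJDBC spelling; psycopg2 goes through libpq, where the closest equivalent is target_session_attrs. The second hostname, database name and user are illustrative assumptions, not the production configuration.

```python
#!/usr/bin/env python3
"""Cluster-aware client connection that bypasses the Consul DNS alias.

Sketch only: the report quotes the pgJDBC option targetServerType=primary;
psycopg2 uses libpq, where the closest equivalent is target_session_attrs.
The second hostname, database name and user are illustrative assumptions,
and credentials are assumed to come from a pgpass/service file.
"""
import psycopg2

# All cluster members are listed explicitly; libpq tries them in order and
# keeps the first connection that accepts read-write sessions, i.e. the primary.
DSN = (
    "host=prod-postgres01.geant.org,prod-postgres02.geant.org "
    "port=5432,5432 "
    "dbname=indico user=indico "
    "target_session_attrs=read-write"
)

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT inet_server_addr(), pg_is_in_recovery()")
        addr, in_recovery = cur.fetchone()
        print(f"Connected to {addr}, in_recovery={in_recovery}")
```

With the nodes listed in the client configuration, the client itself locates the primary, so a stale or missing Consul DNS record can no longer make the whole service unreachable.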
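For the monitoring point above ("no PostgreSQL primary" being buried under cascade alarms), a dedicated probe could check for the root-cause condition directly. The sketch below is an assumption about how such a check might look; node names and the monitoring account are hypothetical.

```python
#!/usr/bin/env python3
"""Root-cause probe: is there a reachable PostgreSQL primary at all?

Sketch only: node names, the monitoring account and the alert wording are
hypothetical; the check simply asks each node whether it is in recovery.
"""
import psycopg2

NODES = ["prod-postgres01.geant.org", "prod-postgres02.geant.org"]

def find_primary(nodes):
    """Return the first node that reports it is not in recovery (the primary)."""
    for host in nodes:
        try:
            with psycopg2.connect(host=host, dbname="postgres",
                                  user="monitor", connect_timeout=3) as conn:
                with conn.cursor() as cur:
                    cur.execute("SELECT pg_is_in_recovery()")
                    (in_recovery,) = cur.fetchone()
                    if not in_recovery:
                        return host
        except psycopg2.OperationalError:
            continue  # node down or unreachable; keep looking
    return None

if __name__ == "__main__":
    primary = find_primary(NODES)
    if primary is None:
        # This is the one alarm that matters; everything else cascades from it.
        print("CRITICAL: no PostgreSQL primary reachable")
    else:
        print(f"OK: primary is {primary}")
```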