...
Date | Time | Description |
---|---|---|
| 13:10:00 | First error in indico.log of PostgreSQL being unavailable: OperationalError: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address: Name or service not known |
| 13:20 | First user query about EMS login problem (Slack #it) |
| 13:24 | Service restored and acknowledged by users on Slack #it (First of many service unavailable then available again periods) |
| 13:27 | Ian Galpin starts investigating and finds the DNS resolving error: |
| 16:08 | IT confirms that there is a VMWare storage issue, via
|
| 20:50 | Additional outages occur, IT still working on issue with VMWare |
| 23:28 | Linda Ness sent a mail to gn4-3-all@lists.geant.org indicating several services are down |
| 07:33 - 07:46 | Pete Pedersen , Massimiliano Adamo , Allen Kong work on restoring AD service by shutting down the FRA node and starting it back up |
| 08:20 | Ian Galpin can't log into EMS servers, Pete Pedersen , Massimiliano Adamo , Allen Kong looking into it Allen Kong on #service-issue-30052022:
|
| 08:50 | Pete Pedersen and Massimiliano Adamo restarted EMS VMs and ran filesystem checks |
09:30 | PETE AND MAX FIXED puppet/postgres PLEASE FILL IN Massimiliano Adamo and Pete Pedersen recover the primary db server by rebooting the server in recovery mode and running fsck to recover the data the same thing had to be done the witness node . As for the secondary postgres server has shutdown because it had no witness. | |
| 09:45 | Ian Galpin tested and verified that service was restored for EMS/Map/CompendiumDB Mandeep Saini updated #it |
...
- The core issue seems to be related to VMWare storage and IT need to provide a solution for monitoring the health of the VM storage
- The HA of primary services (EMS, Map, etc) and dependant core services (PostgreSQL etc) needs to be improved:
- Additional monitoring configuration or setup changes are required to more easily identify the root cause of an outage
- If a core service goes down (e.g. no PostgreSQL primary set), that alarm will be hard to spot in the hundreds of subsequent alarms
- Additional monitoring configuration or setup changes are required to more easily identify the root cause of an outage
- Postgres
- FAILURE: fail-over fail to work because the it required a witness to be active and it had failed because it is the same VM cluster as the primary DB server
Jira server Jira serverId 5228d933-268f-3077-a879-21fb01eb8d41 key DEVOPS-27 - Proposed solution:
- add two more witness servers so we have one on every site
- second suggestion would be to change the way the postgres clients connect so the are cluster aware (ie: all node are listed in connection string and have the targetServerType=primary option) and connect directly to postgres, this will remove the the need for the consul config / intermediate (less parts, less to break)
- FAILURE: fail-over fail to work because the it required a witness to be active and it had failed because it is the same VM cluster as the primary DB server