...
The reason for degradation:
- cf. IT Incident: 30052022. The underlying cause appears to be a storage error on VMware; IT are investigating together with VMware support.
- Multiple sites in GÉANT's infrastructure experienced service interruption
- prod-events01.geant.org and prod-events02.geant.org could not resolve the hostname for the PostgreSQL server (prod-postgres.geant.org)
- The first PostgreSQL server (prod-postgres01.geant.org) and the replication witness (prod-postgres-witness.geant.org) failed to start up correctly due to the VMware storage problems
- As a result there was no PostgreSQL primary node, so the DNS entry prod-postgres-witness.geant.org->primary.prod-postgres.service.ha.geant.net. did not resolve to an actual host
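The resolution failure described above can be checked for directly. The following is a minimal sketch only, assuming a plain lookup through the system resolver is representative of what the application hosts saw; the function name `resolves` and the injectable `resolver` parameter are illustrative and not part of any existing monitoring setup.

```python
import socket


def resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves to at least one address.

    `resolver` defaults to the system resolver and can be replaced
    (e.g. in tests) by any callable with the same (host, port) arguments.
    """
    try:
        # getaddrinfo raises socket.gaierror when the name cannot be resolved
        return bool(resolver(hostname, None))
    except socket.gaierror:
        return False
```

Calling `resolves("prod-postgres.geant.org")` from prod-events01 or prod-events02 would report whether the PostgreSQL service name currently points at a host, independently of whether PostgreSQL itself is accepting connections.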
The impact of this service degradation was that the following services were unavailable:
...
- The core issue seems to be related to VMware storage, and IT need to provide a solution for monitoring the health of the VM storage
- However, the HA of primary services (EMS, Map, etc.) and dependent core services (PostgreSQL etc.) needs to be improved:
- Additional monitoring configuration or setup changes are required to more easily identify the root cause of an outage
- If a core service goes down (e.g. no PostgreSQL primary set), that alarm will be hard to spot in the hundreds of subsequent alarms
- Postgres fail-over failed to work because it required a witness to be active, and the witness failed because it is hosted on the same VM cluster as the primary DB server. The proposed solution is therefore to add two more witness servers so that we have one on every site (FRA, PRA, LON)
Jira: DEVOPS-27
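The reasoning behind one witness per site can be illustrated with a simple majority model. This is a sketch under stated assumptions, not the logic of the actual fail-over tooling: it assumes fail-over needs a strict majority of all voting nodes to be reachable, and the node names and site assignments below are hypothetical.

```python
def has_quorum(voters, failed_sites):
    """Return True if a strict majority of voters is outside the failed sites.

    voters: mapping of node name -> site name (hypothetical layout).
    failed_sites: set of site names that are currently down.
    """
    alive = [node for node, site in voters.items() if site not in failed_sites]
    return len(alive) > len(voters) / 2


# Hypothetical layout resembling the incident: primary and witness share a site.
before = {"primary": "FRA", "witness": "FRA", "standby": "LON"}

# Hypothetical layout after the proposed change: one witness on every site.
after = {
    "primary": "FRA",
    "standby": "LON",
    "witness-fra": "FRA",
    "witness-pra": "PRA",
    "witness-lon": "LON",
}

quorum_before = has_quorum(before, {"FRA"})  # 1 of 3 voters alive
quorum_after = has_quorum(after, {"FRA"})    # 3 of 5 voters alive
```

Under this model, losing the site that hosts both the primary and the only witness leaves the standby without a majority, whereas with a witness on each site the loss of any single site still leaves a majority of voters reachable.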