Incident Description
EMS (via https://events.geant.org) has been unavailable for a few minutes at a time, throughout the day.
The reason for degradation:
- prod-events01.geant.org and prod-events02.geant.org could not resolve the hostname for the PostgreSQL server (prod-postgres.geant.org)
- Multiple sites in GÉANT's infrastructure experienced service interruption
- The underlying cause seems to be a storage error on VMware. IT are investigating along with VMware support
- TODO: Add link to IT incident page
The impact of this service degradation was:
- Users could not access EMS
Incident severity: CRITICAL Intermittent service outage
Data loss: NO
Total duration of incident: 13 hours, ongoing (as of 22:22 UTC)
Timeline
All times are in UTC
Date | Time | Description |
---|---|---|
| 13:10:00 | First error in indico.log of PostgreSQL being unavailable: OperationalError: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address: Name or service not known |
| 13:20 | First user query about EMS login problem (Slack #it) |
| 13:24 | Service restored and acknowledged by users on Slack #it |
| 13:27 | Ian Galpin starts investigating and finds the DNS resolution error |
| 16:08 | IT confirms that there is a VMware storage issue, via |
| 20:50 | Additional outages occur; IT is still working on the issue with VMware |
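
The name-resolution failure logged at 13:10 can be checked independently of Indico. The following is a minimal sketch, not the check that was actually used during the incident: the helper name `can_resolve` and the port `5432` (the PostgreSQL default) are assumptions, and only the hostname comes from the log entry above.

```python
import socket

# Hostname taken from the indico.log error above.
DB_HOST = "prod-postgres.geant.org"
# Assumption: PostgreSQL listens on its default port.
DB_PORT = 5432


def can_resolve(host, port=DB_PORT, resolver=socket.getaddrinfo):
    """Return (True, [addresses]) if host resolves, else (False, error text).

    `resolver` is injectable so the function can be exercised without
    touching real DNS.
    """
    try:
        results = resolver(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # Same class of failure that psycopg2 reported as
        # "could not translate host name ... to address".
        return False, str(exc)
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address.
    addresses = sorted({info[4][0] for info in results})
    return True, addresses
```

Calling `can_resolve(DB_HOST)` on prod-events01.geant.org or prod-events02.geant.org during an outage window would be expected to return `False` with the resolver's error text, and `True` with a list of addresses once resolution recovers.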
Proposed Solution
- The core issue seems to be related to VMware, and IT need to provide a solution