Incident Description
EMS (via https://events.geant.org) has been intermittently unavailable, for a few minutes at a time, throughout the day.
...
Total duration of incident: 13 hours, ongoing (as of 22:22 UTC)
Timeline
All times are in UTC
Date | Time | Description |
---|---|---|
| 13:10:00 | First error in indico.log reporting PostgreSQL as unavailable: OperationalError: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address: Name or service not known |
| 13:20 | First user query about EMS login problem (Slack #it) |
| 13:24 | Service restored and acknowledged by users on Slack #it |
| 13:27 | Ian Galpin starts investigating and finds the DNS resolution error |
| 16:08 | IT confirms that there is a VMware storage issue, via |
| 20:50 | Additional outages occur; IT is still working on the VMware issue |
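
The "could not translate host name" error in the 13:10 entry is a name-resolution failure, so it can be checked independently of Indico and PostgreSQL. The following is a minimal sketch, not part of the original investigation: the function name `can_resolve` is hypothetical, and port 5432 (the PostgreSQL default) is an assumption about the production instance.

```python
import socket


def can_resolve(hostname, port=5432, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves to an address, False otherwise.

    `port` defaults to 5432, the PostgreSQL default (an assumption here).
    `resolver` is injectable so the check can be exercised without a network.
    """
    try:
        resolver(hostname, port)
        return True
    except socket.gaierror:
        # Raised when the system resolver cannot translate the name,
        # e.g. "Name or service not known".
        return False
```

Calling `can_resolve("prod-postgres.geant.org")` from the EMS host during an outage window would indicate whether the system resolver is failing at that moment, which may help correlate the outages with the underlying infrastructure issue.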
Proposed Solution
- The core issue appears to be related to VMware, and IT need to provide a solution