Incident Description
Users could not log in to EMS. They were presented with the login screen, which took them to the IDP selection page as normal. After successfully authenticating with the IDP, they were redirected to EMS. However, instead of being logged in to EMS, they were logged out.
The reason for degradation:
- EMS/Indico stores user sessions in Redis
- prod-events01.geant.org and prod-events02.geant.org could not resolve the hostname of the Redis server (master.production-events-redis.service.ha.geant.net)
- With the connection to Redis lost, Indico could not create or manage user sessions, so logins could not complete
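The failure chain above (name resolution fails first, so no connection to Redis is ever attempted) can be reproduced with a short diagnostic. The following is a minimal sketch, not a tool used during the incident: the function name `diagnose` and the injectable `resolver` parameter are hypothetical, and only the Python standard library is used.

```python
import socket


def diagnose(host, port, timeout=3.0, resolver=socket.getaddrinfo):
    """Return (status, detail), where status is one of
    'dns_failure', 'connect_failure' or 'ok'.

    `resolver` is injectable so the function can be exercised
    without touching the real DNS (hypothetical design choice).
    """
    try:
        # Step 1: name resolution. This is the step that failed in the
        # incident ("Error -2 ... Name or service not known").
        addrinfo = resolver(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return "dns_failure", str(exc)

    family, socktype, proto, _canonname, sockaddr = addrinfo[0]
    try:
        # Step 2: TCP connection to the first resolved address.
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
    except OSError as exc:
        return "connect_failure", str(exc)

    return "ok", "connected to %s port %s" % (sockaddr[0], sockaddr[1])
```

Run from one of the application hosts, `diagnose("master.production-events-redis.service.ha.geant.net", 6379)` would distinguish a DNS problem from a Redis outage: a `dns_failure` status points at the resolver rather than at Redis itself.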
The impact of this service degradation was:
- Users could not manage their events, for example:
  - Editing the event
  - Accessing registration lists
  - Sending out reminder emails
Incident severity:
Partial service degradation (status: Medium, yellow)
...
Total duration of incident: 15 hours
Timeline
All times are in UTC
Date | Time | Description |
---|---|---|
| 21:55:53 | First error in indico.log reporting Redis as unavailable: ConnectionError: Error -2 connecting to master.production-events-redis.service.ha.geant.net:6379. Name or service not known. |
| 10:42 | First user query about the EMS login problem (Slack #general) |
| 11:14 | Ian Galpin identified the DNS resolution problem |
| 12:06 | Service degradation incident email sent to the product owner (Steffie Bosman) |
| 12:12 | Massimiliano Adamo identified a problem with PowerDNS. Consul DNS resolution itself seemed to work. |
| 12:30 | Massimiliano Adamo resolved the PowerDNS issue by disabling the packet cache (disable-packetcache config option). The following GitHub issue might explain the problem: https://github.com/PowerDNS/pdns/issues/8160 |
| 13:01 | Service restored email sent to the product owner (Steffie Bosman) |
Proposed Solution
- Additional monitoring (Sensu checks) will be added
...
- These will check that specific hostnames resolve
- This is an action item for the DevOps team
- The issue was resolved by disabling the packet cache in PowerDNS, which is enabled by default: https://docs.powerdns.com/recursor/settings.html#disable-packetcache
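
The hostname-resolution check proposed above could look roughly like the sketch below. It is an illustration only, not the check the DevOps team deployed: the names `check_hostnames` and `main` are hypothetical, and the sketch relies on the convention that Sensu reads a check's exit status as 0 (OK), 1 (WARNING), 2 (CRITICAL) and anything else as UNKNOWN.

```python
import socket

# Exit statuses as interpreted by Sensu for check results.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_hostnames(hostnames, resolver=socket.gethostbyname):
    """Try to resolve every hostname; return (exit_status, message).

    `resolver` is injectable for testing (hypothetical design choice).
    """
    failed = []
    for name in hostnames:
        try:
            resolver(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            failed.append("%s (%s)" % (name, exc))
    if failed:
        return CRITICAL, "CRITICAL: cannot resolve " + ", ".join(failed)
    return OK, "OK: resolved %d hostname(s)" % len(hostnames)


def main(argv, resolver=socket.gethostbyname):
    """Print a one-line result and return the exit status for Sensu."""
    if not argv:
        print("UNKNOWN: no hostnames given")
        return UNKNOWN
    status, message = check_hostnames(argv, resolver)
    print(message)
    return status
```

To run this as a check command, the script would end with `sys.exit(main(sys.argv[1:]))` (after `import sys`) and be invoked with the hostnames to watch, for example master.production-events-redis.service.ha.geant.net. Running it on the application hosts themselves matters, because in this incident the resolution failure was seen from prod-events01 and prod-events02.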