Incident description
To fix the automatic renewal of the OV certificates, we manually renewed one of our existing certificates: *.dante.net
When OV renewal started working again, the new certificate was put in place and triggered an HAProxy service restart.
A missing DNS record for dndrdc01.win.dante.org.uk caused the HAProxy service to fail to restart. The server dndrdc01 was decommissioned on 22 May 2020, but its DNS record was only cleaned up recently. We still don't know exactly when the record was expunged (we don't have access to the Windows DNS server), and this morning uat-haproxy went down for the same reason.
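One way to make HAProxy tolerant of a backend hostname that no longer resolves at startup is the `init-addr` server option. A minimal sketch, assuming HAProxy 1.7 or later; the backend and port below are illustrative, not taken from the real configuration:

```
# Sketch: let HAProxy start even when a backend hostname does not resolve.
# Backend name and port are illustrative assumptions.
backend ldap_backend
    # Try the last known address, then libc resolution, and finally start
    # the server with no address instead of aborting the whole service.
    default-server init-addr last,libc,none
    server dndrdc01 dndrdc01.win.dante.org.uk:389 check
```

With `none` as the final fallback, a deleted DNS record marks that one server DOWN instead of preventing the whole service from restarting.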
- A change request was not raised because this was considered a low-risk operation, which normally runs as an unattended nightly job. The action itself didn't cause any issue, but it triggered another one.
- The first certificate due to expire would have caused the same issue, possibly as soon as the coming night.
- We could not foresee that a DNS record would be deleted on the same day.
Incident severity: CRITICAL
Data loss: NO
Timeline
Time (CET) | Event
---|---
17 Mar, 10:47 | /var/log/haproxy_1.log shows the error about HAProxy being down
17 Mar, 10:55 | Disabled Puppet on prod-haproxy02 and failed connections over to it
Total downtime: 7 minutes
Proposed Solution
Using a test certificate would not have protected us.
Had the renewal run unattended, we would have experienced about 4 hours of downtime: the renewal happens overnight, and we would have discovered the problem only the following morning. By forcing a manual renewal we had only a few minutes of downtime.
Possible solution: do not trigger an unattended service restart.
Sensu will start alerting 30 days before expiration, and we can run the procedure manually (replacing an expiring certificate will not need an RFC).
The downside is that we will do it during working hours, while the cron job runs overnight.
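The manual check itself is simple: `openssl x509 -checkend` reports whether a certificate is still valid a given number of seconds from now, which is the same 30-day window Sensu would alert on. A sketch, assuming openssl is available; for demonstration it generates a throwaway self-signed certificate, whereas in production `CERT` would point at the live PEM file:

```shell
#!/bin/sh
# Sketch of the manual expiry check that replaces the unattended restart.
# We generate a throwaway 90-day self-signed cert purely for demonstration;
# in production CERT would be the live certificate path.
CERT="$(mktemp)"
openssl req -x509 -newkey rsa:2048 -nodes \
    -keyout /dev/null -out "$CERT" -days 90 \
    -subj "/CN=demo.example" 2>/dev/null

# -checkend N exits 0 if the certificate is still valid N seconds from now
if openssl x509 -checkend $((30 * 86400)) -noout -in "$CERT" >/dev/null; then
    STATUS="ok"           # more than 30 days left: nothing to do
else
    STATUS="expiring"     # renew manually, during working hours
fi
echo "certificate status: $STATUS"
```

When the check reports `expiring`, the operator runs the renewal and restarts HAProxy under supervision, so a failed restart is noticed in minutes rather than hours.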
Meanwhile, the LDAP backend kept working, because HAProxy had been relying on the hot standby for a year:
- we need to monitor each HAProxy backend (and check why one backend is in Warning state)
- this change in May 2020 would have brought down the IdP, but HAProxy used a standby node and was able to guarantee service continuity
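To monitor each backend rather than only the frontend, HAProxy's state can be exposed to Sensu via the runtime socket and the stats page. A minimal sketch; the socket path, port, and URI below are assumptions, not the real configuration:

```
# Sketch: expose per-backend state so monitoring catches a server stuck in
# Warning/DOWN. Paths and ports are illustrative assumptions.
global
    # admin-level runtime socket; "show servers state" on this socket
    # reports each server's operational status (UP / DOWN / MAINT)
    stats socket /var/run/haproxy.sock mode 660 level admin

listen stats
    bind :8404
    stats enable
    stats uri /stats
    stats refresh 10s
```

A Sensu check can then query the socket (or scrape the stats page) and alert on any backend server that is not UP, instead of discovering it only when the last healthy node fails.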