Incident description
A missing DNS record for dndrdc01.win.dante.org.uk caused the HAProxy service to fail to restart (the server dndrdc01 was decommissioned on 22 May 2020, but its DNS record was only cleaned up recently).
To fix the automatic renewal of the OV certificates, we renewed one of our existing certificates: *.dante.net.
When OV renewal started working again, the new certificate was put in place and triggered a HAProxy service restart. In the event of an unattended renewal, we would have experienced 4 hours of downtime (the renewal happens overnight, and we would have discovered the problem the morning after). By forcing a manual renewal, we had only a few minutes of downtime.
A change request was not raised because this is an unattended job that runs every night; I only ran in advance the job that would otherwise have been triggered overnight. The first certificate due to expire would have caused the same issue.
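The restart failure described above comes down to a hostname in the HAProxy configuration that no longer resolves. As an illustration only, a minimal Python sketch of a pre-restart check could look like the following. The function names are hypothetical and not part of our tooling, the parser assumes simple `server <name> <host>:<port>` lines with hostnames or IPv4 addresses, and the resolver is injectable so the check can be exercised without real DNS.

```python
import socket
from typing import Callable, List


def server_hosts(config_text: str) -> List[str]:
    """Collect the host part of every `server <name> <host>[:port]` line.

    Assumption: hosts are hostnames or IPv4 addresses (no bracketed IPv6).
    """
    hosts = []
    for line in config_text.splitlines():
        # Drop trailing comments, then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 3 and fields[0] == "server":
            hosts.append(fields[2].rsplit(":", 1)[0])
    return hosts


def resolves(host: str) -> bool:
    """Return True if the system resolver can resolve `host`."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def unresolved_hosts(
    config_text: str, resolver: Callable[[str], bool] = resolves
) -> List[str]:
    """List the server hosts in the configuration that do not resolve."""
    return [h for h in server_hosts(config_text) if not resolver(h)]
```

A wrapper around the renewal job could call `unresolved_hosts` on the configuration text and skip the restart (and alert instead) when the returned list is not empty.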
Incident severity: CRITICAL
...
Total downtime: 7 minutes
Proposed Solution
Not available.
Using a test certificate would not protect us.
Either way, the real certificate would have been changed overnight, causing at least 4 hours of downtime in the event of an unattended renewal (the renewal happens overnight, and we would have discovered the problem the morning after). By forcing a manual renewal, we had only a few minutes of downtime.
Possible solution: do not trigger an unattended service restart.
Sensu will start alerting 30 days before expiration, and we can run the procedure manually (replacing an expiring certificate does not need an RFC).
The downside is that we will do it during working hours, while the cron job runs overnight.
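The 30-day window mentioned above can be illustrated with a small Python sketch. This is not the actual Sensu check: the function names are hypothetical, and the input is assumed to be a certificate `notAfter` string in the format used by Python's `ssl` module (for example `Jun 30 00:00:00 2020 GMT`).

```python
import ssl
from datetime import datetime, timezone

WARNING_DAYS = 30  # alert window stated in this report


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` (timezone-aware) and the certificate expiry."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def needs_manual_renewal(
    not_after: str, now: datetime, warning_days: int = WARNING_DAYS
) -> bool:
    """True once the certificate is inside the warning window."""
    return days_until_expiry(not_after, now) <= warning_days
```

Called with the current UTC time, `needs_manual_renewal` would tell an operator whether the manual procedure is due for a given certificate.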