Incident description
On 27th July 2019, the host names opsdb1.dante.net and opsdb2.dante.net could not be resolved from either of the Dashboard boxes.
Incident severity: CRITICAL
Data loss: NO
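A check of the kind described above can be sketched as follows. This is a minimal illustration, not the monitoring actually used on the Dashboard boxes: the function name `check_resolution` and the injectable `resolver` parameter are assumptions made for the example, and only the standard-library `socket.getaddrinfo` call is relied on.

```python
import socket


def check_resolution(hosts, resolver=socket.getaddrinfo):
    """Try to resolve each host name.

    Returns a dict mapping host -> None if it resolved,
    or host -> error text if resolution failed.
    `resolver` is injectable (hypothetical parameter) so the
    function can be exercised without real DNS lookups.
    """
    results = {}
    for host in hosts:
        try:
            # port=None: we only care whether the name resolves
            resolver(host, None)
            results[host] = None
        except socket.gaierror as exc:
            # gaierror is what getaddrinfo raises on lookup failure
            results[host] = str(exc)
    return results
```

Run from each Dashboard box, `check_resolution(["opsdb1.dante.net", "opsdb2.dante.net"])` would return an error string for any name that does not resolve there.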
Timeline
Time (CET) | Event |
---|---|
28 Jul, 22:00 | Issue reported by OC. |
28 Jul, 22:30 | Picked up by Robert L. |
28 Jul, 22:45 | Fixed by updating the Dashboard application to point at prod-opsdb01.geant.net (and 02). |
29 Jul, 08:10 | The issue was reported to DevOps for root cause analysis. |
29 Jul, 09:30 | Fixed by turning off SSL temporarily to restore the service. Initial investigation suggested that a certificate had expired, but this later turned out not to be the case. Further investigations were carried out to avoid such failures in future. The actual cause identified for the failure: certificates were automatically changed as a result of IT patching. A proposal was discussed between IT and SWD to avoid such failures in future. |
30 Jul, 11:30 | |
20 Jun, 16.30 | |
Total downtime: 09:14 hours.
...