Incident description
Crowd uses Active Directory (AD) as the back end to authenticate users. The connection between Crowd and AD runs over an SSL channel (LDAPS, port 636) and is secured by certificates in the Java keystore (cacerts) on the Crowd server. Authentication fails if a certificate has expired or is wrong. In this case, communication between the AD domain controllers and Crowd broke because the certificates on the domain controllers had changed, which left Dashboard unable to authenticate users.
The certificates on the AD domain controllers changed because the IT team patched and upgraded the subordinate PKI server, which is the certificate-issuing authority for all Microsoft Windows machines. This triggered an automatic change of certificates even though the previous certificates had not yet expired. This is not expected behaviour after a patching window, so it should not happen every time the IT team patches servers.
Incident severity: CRITICAL
...
Time (CET) | Event |
---|---|
18 Jun, 23:16 | Issue Reported by OC |
19 Jun, 07:35 | Picked up by Michael H the following morning |
19 Jun, 08:30 | Service restored by temporarily turning off SSL while further investigation was carried out. Initial investigation suggested that the certificate had expired, but this later turned out not to be the case. |
19 Jun, 10:30 | Further investigations were carried out to avoid such failures in future |
19 Jun, 16:08 | Actual cause of the failure identified: the certificates were changed automatically as a result of IT patching. |
20 Jun, 09:30 | A proposal to avoid such failures in future was discussed between IT and SWD. |
16:45 | DevOps confirmed that there are no backups or extra copies on VMware storage |
17:00 | Konstantin Lepikhov called Qaiser Ahmed in Slack, no response. |
17:00 | Dick Visser confirmed that he has backups on a server at Amsterdam university (daily backups taken directly by the VMs themselves). |
18:26 | Qaiser Ahmed confirmed on #devops channel that whole folder called AMS_UBUNTU on vmware cluster is not backed up and there's no data left. |
18:30 | Dick Visser recreated new VMs in the VMWare cluster and started the restore process |
20:30 | Dick Visser restored the backup and brought all sites online. |
20:45 | Konstantin Lepikhov made an official announcement on the #it and #general Slack channels about the incident and the resolution. |
21:00 | Dick Visser started restore of filesender-prod.geant.org. |
21:50 | Dick Visser finished the restore of filesender-prod.geant.org, with the exception of user files, which aren't backed up because of privacy issues and because this is a demonstration service. |
Total downtime: 5:39 hours.
Proposed Solution
Nagios checks which could help to prevent authentication problems:
...
20 Jun, 11:30 | Part one of Nagios check in the proposed solution implemented |
20 Jun, 16:30 | New certs provided by IT installed on the Crowd servers. SSL switched back on (Crowd ↔ AD). |
Total downtime: 09:14 hours.
Proposed Solution
1) Any PKI patches on the domain controllers should go through change control. IT should communicate the changes to DevOps so that preventive measures can be taken in a timely manner.
STATUS: IT agreed
2) In this particular case the certificate had not expired, but an expired certificate would cause a similar issue. As a proactive measure, some automated Nagios checks can be introduced:
- "check_keystore": checks the expiry date of the cert and issues a warning (email alert) if the cert expires within 30 days.
STATUS - This check has already been implemented, from 20th Jun 2018, 11:30
...
- "check_ssl_connection": checks that an SSL connection can be established from Crowd to each of the AD servers. An alert email is sent when the connection fails. By then the connection is already broken, but we'll know sooner and can react more quickly. (There's a Java class, "SSLPoke.class", that can be used for this.)
I already have a draft check_keystore.sh script that does the first check; I just need to plug it into Nagios. I haven't done the second check yet.
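The two checks described above can be sketched roughly as follows. This is only an illustration in Python, not the draft check_keystore.sh script or the SSLPoke Java class mentioned above. It assumes that the issuing CA certificate has been exported from the Java keystore to a PEM file (Python cannot read cacerts directly) and that the AD host name is passed as an argument; the function names, the script name and the argument layout are hypothetical. It reads the expiry date from the certificate presented by the server rather than from the keystore.

```python
import socket
import ssl
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def expiry_status(not_after, now, warn_days=30):
    """Map a certificate's notAfter time (epoch seconds) to a Nagios status."""
    seconds_left = not_after - now
    if seconds_left <= 0:
        return CRITICAL, "certificate has expired"
    days_left = int(seconds_left // 86400)
    if seconds_left <= warn_days * 86400:
        return WARNING, "certificate expires in %d day(s)" % days_left
    return OK, "certificate valid for another %d day(s)" % days_left


def check_ssl_connection(host, port=636, cafile=None, timeout=10):
    """Attempt a verified TLS handshake with host:port.

    cafile is an assumed PEM export of the trusted CA; when it is None the
    system default trust store is used instead.
    """
    context = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
    except OSError as exc:
        # Covers refused connections, timeouts and certificate
        # verification failures (ssl.SSLError is a subclass of OSError).
        return CRITICAL, "TLS connection to %s:%d failed: %s" % (host, port, exc)
    # The handshake succeeded, so also report how long the server cert has left.
    not_after = ssl.cert_time_to_seconds(cert["notAfter"])
    return expiry_status(not_after, time.time())


def main(argv):
    """argv is [HOST, optional CAFILE]; prints one status line, returns the exit code."""
    if not argv:
        print("UNKNOWN - usage: check_ssl_connection.py HOST [CAFILE]")
        return UNKNOWN
    cafile = argv[1] if len(argv) > 1 else None
    status, message = check_ssl_connection(argv[0], cafile=cafile)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[status]
    print("%s - %s" % (label, message))
    return status
```

To run it as a plugin, the script would end with `sys.exit(main(sys.argv[1:]))` (plus `import sys`) and be called once per AD server. A server certificate issued by a CA that is not in the trusted bundle would make the handshake fail and return CRITICAL, which is the failure mode seen in this incident.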
Lessons learned
...
- STATUS - Need to be discussed with Konstantin to plan implementation.
3) Crowd authentication has proved unreliable on various occasions, whereas federated login has proved more reliable and works successfully with various operational services. As a long-term solution, authentication for the Dashboard application should be changed from Crowd to federated login.
STATUS: Planning would be influenced by the decision on Dashboard V 3.0
...