Incident description
During testing, a CloudBolt configuration was applied to our configuration management (Puppet) server. CloudBolt is an interface from which people can order self-service VMs. The change altered the configuration of some nodes and pointed them to the wrong environment (test instead of production). In this configuration, Puppet sees CloudBolt as an ENC (External Node Classifier). The ENC pushes some configuration to all the servers managed by Puppet, and when an ENC is in use, the configuration cannot be switched back from the Puppet side; it must be changed from CloudBolt. As a result, the services picked up their configuration from the wrong environment, which caused the outage.
Some other production services were also pointed to the test environment, but there was no loss of service because their test and production configurations were the same.
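To illustrate the ENC mechanism described above, here is a minimal sketch of a classifier script. It is not CloudBolt's implementation. Puppet runs an ENC executable with the node's certname as its only argument and reads a YAML document from standard output. An `environment` key in that output takes precedence over the environment set in the agent's own configuration, which is why the change had to be undone on the CloudBolt side. The node names, the class name, and the mapping are hypothetical.

```python
#!/usr/bin/env python3
"""Minimal sketch of a Puppet External Node Classifier (ENC).

All node names, class names and mappings are hypothetical; a real ENC
such as CloudBolt keeps this data in its own database.
"""
import sys

# Hypothetical node-to-environment mapping.
NODE_ENVIRONMENTS = {
    "prod-idp01.example.org": "production",
    "test-idp01.example.org": "test",
}
DEFAULT_ENVIRONMENT = "production"
BASE_CLASSES = ["profile::base"]  # hypothetical class name


def classify(certname):
    """Return the classification for one node as a plain dict."""
    return {
        "classes": list(BASE_CLASSES),
        "environment": NODE_ENVIRONMENTS.get(certname, DEFAULT_ENVIRONMENT),
    }


def render_yaml(classification):
    """Render the small classification dict as the YAML Puppet expects."""
    lines = ["---"]
    classes = classification["classes"]
    if classes:
        lines.append("classes:")
        lines.extend(f"  - {name}" for name in classes)
    else:
        lines.append("classes: []")
    lines.append(f"environment: {classification['environment']}")
    return "\n".join(lines) + "\n"


# Puppet invokes the script as: enc.py <certname>
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.write(render_yaml(classify(sys.argv[1])))
```

If the mapping in such a classifier is wrong, every node it classifies is given the wrong environment on its next Puppet run, regardless of what the node itself is configured to request.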
Incident severity: CRITICAL
Data loss: NO
Affected Services
The following services were inaccessible to GEANT staff members during the outage because they all use the GEANT Staff IdP for authentication:
- SharePoint (e.g. Intranet, Partner Portal)
- GEANT wiki
- EventR
- WordPress sites
- Compendium
- FileSender
- FoD
- Lifesize
- BOX
- Sympa (lists on the prod-lists01.geant.net server)
Cause
- Failure to follow the change management process. This needs further investigation.
- Lack of planning. Preparing an isolated environment would have taken up to a week of work, but we were asked to make this work the same day. Massimiliano Adamo and Michael Haller asked to postpone and to meet on Friday.
- Massimiliano Adamo asked twice in the meeting whether there were objections and whether we intended to take action. It was a team decision.
- Severe lack of knowledge of the Puppet integration with CloudBolt on the part of the consultants. The consultants were repeatedly asked certain questions and either evaded them or gave wrong answers (e.g. CloudBolt doesn't provide Agent Bootstrap, but they said it does, which implies that they don't know the implementation details). The consultants were asked clearly about the environment settings.
Agents must be pre-installed and configured in the image (from the CloudBolt documentation: http://docs.cloudbolt.io/configuration-managers/puppet/index.html )
Timeline
Time (CET) | Event |
---|---|
03 Aug, 12:36 | Issue reported by Cristian Bandea on Slack channel #techies |
03 Aug, 13:02 | Andrew Jarvis sent direct Slack messages to Dick Visser and Konstantin Lepikhov but got no response |
03 Aug, 13:25 | Massimiliano Adamo started investigating |
03 Aug, 13:35 | Andrew Jarvis contacted Massimiliano Adamo |
03 Aug, 13:46 | user-7da5d pointed to CloudBolt and found the issue (wrong Puppet environment) |
03 Aug, 14:00 | user-7da5d switched off the prod-idp01 VM, leaving only prod-idp02 functioning |
03 Aug, 16:00 | SaltStack was used to fix all the servers at once |
Total downtime: 1.5 hours
Resolution
The CloudBolt changes were reverted, and all production and UAT VMs were restored to their correct environments.
6 Comments
Dick Visser
If you had followed the change management process, this would still have occurred, but hopefully just outside office hours.
The real cause could be that the environments are too closely coupled...
Massimiliano Adamo
Agree. But, honestly, the consultants also had a complete lack of knowledge about Puppet.
They didn't advise us properly, and I made the mistake of trusting them, having asked them the same question many times.
Massimiliano Adamo
I am thinking again about it. We asked to postpone and discuss it before going ahead.
And in the end it was a team decision (the same team that would have said "yes" within the change management flow?).
I asked the people in the meeting twice and we all agreed. IMO a team failure.
Konstantin Lepikhov
Do we have a list of servers affected by this change?
Massimiliano Adamo
Every production server running Puppet in the FRA datacenter was impacted, but that doesn't mean that it failed to run its service.
Some services would switch to test without even failing.
Some other services would get different passwords (different values in Hiera for different environments).
Some other services may break entirely.
I don't know what would happen to keepalive nodes when they have different configurations (for instance, memcached on the test IdP connects to localhost, but on production the configuration across nodes was completely messed up).
Temoor Khan
The CNIS & BOD servers were down as well; they were fixed by Michael Haller.