Incident description
During testing of the CloudBolt framework, the Puppet server was connected to the CloudBolt service. CloudBolt configuration was applied to our configuration management (Puppet) server, which changed the configuration on some nodes and pointed them to the wrong environment (test instead of production). This change triggered the Puppet agents running on those nodes to apply the test configuration, which broke service functions. In this setup, CloudBolt is seen by Puppet as an ENC (External Node Classifier). The ENC pushes some configuration to all the servers managed by Puppet, and when an ENC is in use the configuration cannot be switched back on the Puppet side; it must be changed from CloudBolt. As a result, the services picked up their configuration from the wrong environment, which caused the outage.
Some other production services were also pointed to the test environment, but there was no loss of service because the test and production configurations were the same.
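For context, a Puppet ENC is an executable that the Puppet server calls with a node name and that prints a YAML document; the `environment` key in that output decides which environment the node is given. The sketch below is illustrative only: the node-to-environment mapping, the default, and the script name are hypothetical, and this is not CloudBolt's actual output.

```python
#!/usr/bin/env python3
"""Illustrative ENC sketch: Puppet passes a node name and reads YAML from stdout."""
import sys

# Hypothetical mapping. A real ENC (CloudBolt in this incident) would look
# this up in its own database rather than in a hard-coded dictionary.
NODE_ENVIRONMENTS = {
    "prod-idp01": "production",
    "prod-idp02": "production",
}
DEFAULT_ENVIRONMENT = "production"  # assumption made for this sketch


def classify(node_name: str) -> str:
    """Return the ENC YAML document for one node."""
    environment = NODE_ENVIRONMENTS.get(node_name, DEFAULT_ENVIRONMENT)
    # The class list is left empty here; the environment key is the part
    # that is relevant to this incident.
    return f"---\nclasses: {{}}\nenvironment: {environment}\n"


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: enc.py <node_name>")
    sys.stdout.write(classify(sys.argv[1]))
```

An ENC is enabled on the Puppet server with `node_terminus = exec` and `external_nodes = <path to script>` in `puppet.conf`. Because the environment returned by the ENC takes precedence over the one set on the agent, a wrong value here cannot be corrected from the node itself, which matches what was observed.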
Incident severity: CRITICAL
Data loss: NO
OTRS tickets: 1 (in SWD queue), 0 in OTRS
Cause
Failure to follow the change management process - needs further investigation.
Timeline
Time (CET) | Event
---|---
03 Aug, 12:36 | Issue reported by Cristian Bandea on Slack channel #techies
03 Aug, 13:25 | Picked up by Konstantin Lepikhov on his return from lunch
03 Aug, 13:40 | Massimiliano Adamo started looking into this because the problem was related to the memcache configuration which he had introduced recently
03 Aug, 13:46 | user-7da5d pointed to CloudBolt and found the issue (wrong Puppet environment)
03 Aug, 14:00 | user-7da5d switched off the prod-idp01 VM, leaving only prod-idp02 functioning; this at least restored IDP service operation
03 Aug, 14:43 | Sympa can't connect to CAMS; other servers affected (reported by Linda)
03 Aug, 16:00 | Saltstack was used to fix all the servers at once
Total downtime: 1.5 hours?
...
Resolution
The CloudBolt changes were reverted and all production and UAT VMs were restored to their correct environments.