
Incident description

During testing, a CloudBolt configuration was applied to our configuration management (Puppet) server. It reassigned some nodes to the wrong environment (test instead of production). In this setup, Puppet treats CloudBolt as an ENC (External Node Classifier). The ENC pushes configuration to all servers managed by Puppet, and while an ENC is in use the environment assignment cannot be switched back on the Puppet side; it must be changed from CloudBolt. As a result, services picked up configuration from the wrong environment, which caused the outage.
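
For illustration only, here is a minimal sketch of how an ENC decides a node's environment. Puppet runs the ENC executable with the node name as its single argument and reads a YAML document from standard output. The script name (`enc.py`), the node map, and the default environment below are hypothetical and do not reflect the actual CloudBolt integration.

```python
#!/usr/bin/env python3
"""Minimal sketch of a Puppet External Node Classifier (ENC)."""
import sys

# Hypothetical mapping: node names are taken from the timeline,
# the environment assignments are assumed for the example.
NODE_ENVIRONMENTS = {
    "prod-idp01": "production",
    "prod-idp02": "production",
}
DEFAULT_ENVIRONMENT = "production"  # assumed fallback for unknown nodes


def classify(node_name):
    """Return the classification for one node as a plain dict."""
    return {
        "classes": {},
        "environment": NODE_ENVIRONMENTS.get(node_name, DEFAULT_ENVIRONMENT),
    }


def to_yaml(classification):
    """Render the small classification dict as YAML text."""
    lines = [
        "---",
        "classes: {}",
        "environment: " + classification["environment"],
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: enc.py <node-name>")
    sys.stdout.write(to_yaml(classify(sys.argv[1])))
```

When the ENC returns an `environment` key, that value takes precedence over the environment set in the agent's own configuration. This is why a wrong assignment has to be corrected at the classifier rather than on the individual nodes.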

...

Incident severity: CRITICAL

Data loss: NO

Cause

Failure to follow the change management process. Needs further investigation.

Timeline

Time (CET)	Event
03 Aug, 12:36	Issue reported by Cristian Bandea in Slack channel #techies
03 Aug, 13:25	Picked up by Massimiliano Adamo
03 Aug, 13:46	user-7da5d pointed to CloudBolt and found the issue (wrong Puppet environment)
03 Aug, 14:00	user-7da5d switched off the prod-idp01 VM, leaving only prod-idp02 running
03 Aug, 16:00	SaltStack was used to fix all the servers at once

Total downtime: 1.5 hours

Resolution

The CloudBolt changes were reverted and all production and UAT VMs were restored to their correct environments.
