Incident description
A missing upgrade to the Sensu DB backend caused our monitoring backend and agents to misbehave, which led us to investigate the issue on several VMs comparable to our Sensu server.
Solving it required a Puppet module upgrade. The problem is that this module is shared among several applications, so the upgrade procedure involves Sensu, Poller, and Brian.
Brian uses its own implementation of Sensu, and I had to verify whether it showed the same issue. Solving the issue took a huge effort (the new Puppet module had differences, and the perimeter firewall was denying access to the Sensu API), and while attempting the fix InfluxDB received a major upgrade (from version 1.8 to version 2.0).
The second cause was that the application's package version was not pinned.
Incident severity: CRITICAL
Data loss: YES
Timeline
Time (CET) | Event |
---|---|
27 Jan, 00:11 | Time reported in /var/log/yum.log when the update was triggered |
27 Jan, 03:00 | At about 3 am the problem was solved |
27 Jan, 09:17 | Time reported in /var/log/yum.log when the InfluxDB version was rolled back |
Total downtime: 9 hours 6 minutes.
Proposed Solution
Package pinning is always good practice for our core applications. The InfluxDB packages have now been pinned everywhere, and any further upgrade must be applied deliberately by changing the version number in Puppet.
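
As an illustration of the pinning described above, a minimal Puppet sketch could look like the following. The class name `profile::influxdb` and the version string are placeholders, not taken from our actual manifests; the real value should be the exact version that yum reports for the 1.8 package currently deployed.

```puppet
# Hypothetical profile class; class name and default version are placeholders.
class profile::influxdb (
  # Exact package version to keep installed. Replace the placeholder with
  # the version string yum reports for the deployed 1.8 package.
  String $package_version = '1.8.10-1',
) {
  package { 'influxdb':
    # An explicit version (instead of 'latest' or 'installed') means Puppet
    # only changes the package when this value is changed on purpose.
    ensure => $package_version,
  }
}
```

With this approach, an upgrade becomes a reviewed change to `$package_version`. Note that this only controls what Puppet enforces: a manual or scheduled `yum update` could still touch the package unless it is also locked or excluded on the yum side.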