Incident description

A missing upgrade to the DB backend of Sensu started causing misbehavior in our monitoring backend and agents, which led us to investigate the issue on several VMs similar to our Sensu server.

Solving this issue required a Puppet module upgrade. The problem is that this module is shared among several applications, so the upgrade procedure involves Sensu, Poller, and Brian.

Brian uses its own implementation of Sensu, and I had to verify whether it showed the same issue. Solving the issue took a huge effort (the new Puppet module had differences, and the perimeter firewall was denying access to the Sensu API), and while attempting to fix it, InfluxDB received a major upgrade (from version 1.8 to version 2.0).

The second cause was the lack of package pinning for the application.


Incident severity: CRITICAL

Data loss: YES

Timeline


Time (CET)       Description
27 Jan, 00:11    Time reported in /var/log/yum.log when the update was triggered.
27 Jan, 03:00    At about 3 am the problem was solved.
27 Jan, 09:17    Time reported in /var/log/yum.log when the Influx version was rolled back.

Total downtime: 09:06 hours.
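
The downtime figure can be checked with a short calculation. This is a minimal sketch, assuming the downtime runs from the update entry (27 Jan, 00:11) to the rollback entry, and that the rollback also falls on 27 Jan at 09:17. The year and the function name `downtime` are placeholders chosen for illustration; they do not come from the report.

```python
from datetime import datetime

# Assumption: the report does not state the year. Any year gives the same
# result because both timestamps fall on the same day.
ASSUMED_YEAR = 2021
FMT = "%Y %d %b, %H:%M"


def downtime(start: str, end: str) -> str:
    """Return the elapsed time between two timeline entries as HH:MM."""
    t0 = datetime.strptime(f"{ASSUMED_YEAR} {start}", FMT)
    t1 = datetime.strptime(f"{ASSUMED_YEAR} {end}", FMT)
    minutes = int((t1 - t0).total_seconds()) // 60
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


# Update triggered at 00:11, Influx rolled back at 09:17 (both CET).
print(downtime("27 Jan, 00:11", "27 Jan, 09:17"))
```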

Proposed Solution

Package pinning is always a good practice for our core applications. The InfluxDB packages are now pinned everywhere, and further upgrades must be applied deliberately by changing the version number in Puppet.
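
As an illustration of what the pin protects against, the check below flags an installed version whose major.minor series differs from the pinned one. It is only a sketch: the function names and version strings are hypothetical, and it is not part of our Puppet code, where the pin itself lives.

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract the (major, minor) pair from a dotted version string."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def violates_pin(installed: str, pinned: str) -> bool:
    """Return True when the installed version left the pinned major.minor series."""
    return major_minor(installed) != major_minor(pinned)


# Hypothetical version strings: a jump from the 1.8 series to 2.0 is flagged,
# while a patch-level change within 1.8 is not.
print(violates_pin("2.0.3", "1.8.10"))
print(violates_pin("1.8.9", "1.8.10"))
```

A check of this kind could run after a package update to confirm that the pin held. How it would be wired into monitoring is left open here.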

