BRIAN 2023-11-18 Data Collection Outage

1. Incident Summary

Notify the Digital Services Managers of the incident in the Slack channel.
Create a Slack channel named after the incident.

Write a summary of the incident in a few sentences. Include what happened, why, the severity of the incident and how long the impact lasted.

EXAMPLE:

Between the hours of {time range of incident, e.g., 15:45 and 16:35} on {DATE}, {NUMBER} users encountered {EVENT SYMPTOMS}.

The event was triggered by a {CHANGE} at {TIME OF CHANGE THAT CAUSED THE EVENT}.

The {CHANGE} contained {DESCRIPTION OF OR REASON FOR THE CHANGE, such as a change in code to update a system}.

A bug in this code caused {DESCRIPTION OF THE PROBLEM}.

The event was detected by {MONITORING SYSTEM}. The team started working on the event by {RESOLUTION ACTIONS TAKEN}.

This {SEVERITY LEVEL} incident affected {X%} of users.

There was further impact, as noted by the {e.g., NUMBER OF SUPPORT TICKETS SUBMITTED, SOCIAL MEDIA MENTIONS, CALLS TO ACCOUNT MANAGERS} raised in relation to this incident.

2. Leadup

Describe the sequence of events that led to the incident, for example, previous changes that introduced bugs that had not yet been detected.

EXAMPLE:

At {16:00} on {MM/DD/YY} ({AMOUNT OF TIME BEFORE CUSTOMER IMPACT, e.g. 10 days before the incident in question}), a change was introduced to {PRODUCT OR SERVICE} in order to {THE CHANGES THAT LED TO THE INCIDENT}.

This change resulted in {DESCRIPTION OF THE IMPACT OF THE CHANGE}.

3. Fault

Describe how the change that was implemented didn't work as expected. If available, attach screenshots of relevant data visualisations that illustrate the fault.

4. Impact

Describe how the incident impacted internal and external users during the incident. Include how many support cases were raised.

EXAMPLE:

For {XXhrs XX minutes} between {XX:XX UTC and XX:XX UTC} on {MM/DD/YY}, our users experienced {SUMMARY OF INCIDENT}.

This incident affected {XX} customers (X% OF {SYSTEM OR SERVICE} USERS), who experienced {DESCRIPTION OF SYMPTOMS}.

{XX NUMBER OF SUPPORT TICKETS AND XX NUMBER OF SOCIAL MEDIA POSTS} were submitted.
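The duration and percentage placeholders above are simple arithmetic. A minimal Python sketch, assuming hypothetical timestamps and user counts (the function name and every value below are illustrative and do not come from a real incident), might look like this:

```python
from datetime import datetime, timezone


def impact_summary(start, end, affected_users, total_users):
    """Return (hours, minutes, percent_affected) for an incident window."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    # Whole minutes between the first and last known impact.
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    percent = round(100 * affected_users / total_users, 1)
    return hours, minutes, percent


# Hypothetical example values, for illustration only.
start = datetime(2023, 1, 1, 15, 45, tzinfo=timezone.utc)
end = datetime(2023, 1, 1, 16, 35, tzinfo=timezone.utc)
hours, minutes, percent = impact_summary(start, end, affected_users=120, total_users=4000)
print(f"For {hours}hrs {minutes} minutes, {percent}% of users were affected.")
```

Using timezone-aware UTC timestamps keeps the window consistent with the UTC times the template asks for.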

5. Detection

When did the team detect the incident? How did they know it was happening? How could we improve time-to-detection? Consider: How could we have cut that time in half?

EXAMPLE:

This incident was detected when the {ALERT TYPE} was triggered and {TEAM/PERSON} were paged.

Next, {SECONDARY PERSON} was paged, because {FIRST PERSON} didn't own the service writing to the disk, delaying the response by {XX MINUTES/HOURS}.

{DESCRIBE THE IMPROVEMENT} will be set up by {TEAM OWNER OF THE IMPROVEMENT} so that {EXPECTED IMPROVEMENT}.

6. Response

Who responded to the incident? When did they respond, and what did they do? Note any delays or obstacles to responding.

EXAMPLE:

After receiving a page at {XX:XX UTC}, {ON-CALL ENGINEER} came online at {XX:XX UTC} in {SYSTEM WHERE INCIDENT INFO IS CAPTURED}.

This engineer did not have a background in the {AFFECTED SYSTEM}, so a second alert was sent at {XX:XX UTC} to {ESCALATIONS ON-CALL ENGINEER}, who came into the room at {XX:XX UTC}.

7. Recovery

Describe how the service was restored and the incident was deemed over. Detail how the service was successfully restored, and how you knew what steps you needed to take to get to recovery.

Depending on the scenario, consider these questions: How could you improve time to mitigation? How could you have cut that time in half?

EXAMPLE:

We used a three-pronged approach to the recovery of the system:

{DESCRIBE THE ACTION THAT MITIGATED THE ISSUE, WHY IT WAS TAKEN, AND THE OUTCOME}

Increasing the size of the BuildEng EC2 ASG to increase the number of nodes available to support the workload and reduce the likelihood of scheduling on oversubscribed nodes.
Disabling the Escalator autoscaler to prevent the cluster from aggressively scaling down.
Reverting the Build Engineering scheduler to the previous version.

8. Timeline

Detail the incident timeline.

Include any notable lead-up events, any starts of activity, the first known impact, and escalations. Note any decisions or changes made, and when the incident ended, along with any post-impact events of note.

Date/time	Action	Actor
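If the timeline is assembled from notes taken out of order, a small script can sort the entries and emit rows in the column layout above. This is a minimal sketch, not a prescribed tool: the function name, the sample events, and the assumption that all timestamps are already in UTC are illustrative.

```python
from datetime import datetime, timezone


def render_timeline(events):
    """Render (timestamp, action, actor) tuples as tab-separated rows, oldest first."""
    lines = ["Date/time\tAction\tActor"]
    for when, action, actor in sorted(events, key=lambda event: event[0]):
        # Assumes every timestamp is already expressed in UTC.
        stamp = when.strftime("%Y-%m-%d %H:%M UTC")
        lines.append(f"{stamp}\t{action}\t{actor}")
    return "\n".join(lines)


# Hypothetical events, for illustration only.
events = [
    (datetime(2023, 1, 1, 16, 5, tzinfo=timezone.utc), "Alert fired and on-call paged", "Monitoring"),
    (datetime(2023, 1, 1, 15, 45, tzinfo=timezone.utc), "Change deployed", "Release engineer"),
    (datetime(2023, 1, 1, 16, 35, tzinfo=timezone.utc), "Change reverted, service restored", "On-call engineer"),
]
print(render_timeline(events))
```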

9. Root Cause Identification

The Five Whys is a root cause identification technique. Here’s how you can use it:

Begin with a description of the impact and ask why it occurred.
Note the impact that it had.
Ask why this happened, and why it had the resulting impact.
Then, continue asking “why” until you arrive at a root cause.
List the "whys" in your post-mortem documentation.
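The steps above amount to recording a chain of answers, where each answer becomes the subject of the next "why" and the last answer is the candidate root cause. A minimal sketch of that record, using an invented chain of answers (the function name and all text in the example are hypothetical):

```python
def five_whys(impact, answers):
    """Format a chain of 'why' answers and return (lines, root_cause).

    The last answer in the chain is treated as the candidate root cause.
    """
    if not answers:
        raise ValueError("at least one 'why' answer is required")
    lines = [f"Impact: {impact}"]
    for depth, answer in enumerate(answers, start=1):
        lines.append(f"Why {depth}: {answer}")
    return lines, answers[-1]


# Hypothetical chain of answers, for illustration only.
lines, root_cause = five_whys(
    "Users could not submit data for 50 minutes",
    [
        "The service ran out of database connections",
        "Connections were not returned to the pool after failed requests",
        "A bug in connection pool handling leaked connections under failure conditions",
    ],
)
print("\n".join(lines))
print(f"Candidate root cause: {root_cause}")
```

The chain may be shorter or longer than five entries; the technique only requires continuing until the answer is something that can actually be changed.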

10. Root Cause

Note the final root cause of the incident, the thing identified that needs to change in order to prevent this class of incident from happening again.

EXAMPLE:

A bug in connection pool handling led to leaked connections under failure conditions, combined with lack of visibility into connection state.

11. Backlog Check

Review your engineering backlog to find out if there was any unplanned work there that could have prevented this incident, or at least reduced its impact.

A clear-eyed assessment of the backlog can shed light on past decisions around priority and risk.

12. Recurrence

Now that you know the root cause, can you look back and see any other incidents that could have the same root cause? If yes, note what mitigation was attempted in those incidents and ask why this incident occurred again.

13. Lessons Learned

Discuss what went well in the incident response, what could have been improved, and where there are opportunities for improvement.

14. Corrective Actions

Describe the corrective action ordered to prevent this class of incident in the future. Note who is responsible and when they must complete the work and where that work is being tracked.
