Incident Management Process (draft)
Establish who the affected users and stakeholders are
- The stakeholder list in the Service Catalogue is a good starting point
Communicate information about the incident to the affected users and stakeholders
- Do this before taking any other action
The relevant team members should look into the issue
- First priority is to restore service
Create an Incident Report
- Start with one of the previous Incident Reports as a template: Incidents
- Save the new Incident Report here as a new child page
- Basic information:
  - Timeline (how/when it was identified, when service was restored, etc.)
  - Other information
  - Optional future mitigations
- If the issue is taking a long time to resolve, we must update the users every 3-4 hours. Linda Ness can probably help or advise with this.
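The 3-4 hour update rule can be turned into a simple reminder check. The following is a minimal sketch only: it assumes the rule is read as "aim for an update after 3 hours, and never let more than 4 hours pass". The function names and that interpretation are illustrative assumptions, not part of the agreed process.

```python
from datetime import datetime, timedelta
from typing import Tuple

# Assumption: "every 3-4 hours" is read as "aim for 3 hours, never exceed 4".
TARGET_INTERVAL = timedelta(hours=3)
MAX_INTERVAL = timedelta(hours=4)


def next_update_window(last_update: datetime) -> Tuple[datetime, datetime]:
    """Return (target, deadline) for the next update to users."""
    return last_update + TARGET_INTERVAL, last_update + MAX_INTERVAL


def update_overdue(last_update: datetime, now: datetime) -> bool:
    """True once more than 4 hours have passed since the last update."""
    return now - last_update > MAX_INTERVAL
```

For example, if the last update went out at 09:00, the next one should be aimed at 12:00 and must be sent by 13:00.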
Index
Severity
- CRITICAL: Complete service outage
- MED: Partial service degradation
- LOW: Virtually no user impact
Data Loss
- YES: Data has been lost
- NO: No data was lost
Service | Start Date | End Date | Severity | Data Loss | Incident Page |
---|---|---|---|---|---|
DNS | | | CRITICAL | NO | DNS Outage 2019-02-27 |
SharePoint | | | CRITICAL | NO | SharePoint Outage 2020-01-08 |
SharePoint | | | MED | NO | RSS Feed in Jobs page Geant.org was down - 17/01/2020 |
BRIAN | | | CRITICAL | YES | Brian Outage 2020-01-26 |
Cacti | | | CRITICAL | YES | Cacti production incident - 06-03-2020 |
Cacti | | | CRITICAL | YES | Cacti Production Instance - July 2020 |
HAProxy | | | CRITICAL | NO | Haproxy Outage 2021-03-17 |
ProxySQL | | | CRITICAL | YES | ProxySQL Outage 2021-07-12 |
EMS | | | CRITICAL | NO | EMS - 2022-03-14 - Service Outage |
EMS/DNS | | | MED | NO | EMS - 2022-04-20 - Service Degradation |
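If the index is ever generated or validated by a script, one entry could be modelled roughly as below. This is a sketch under assumptions: the names `Severity`, `IndexEntry` and `to_row` are hypothetical, and only the severity levels, the YES/NO data loss flag and the column order are taken from this page.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # Levels and descriptions as defined in the Severity list on this page.
    CRITICAL = "Complete service outage"
    MED = "Partial service degradation"
    LOW = "Virtually no user impact"


@dataclass
class IndexEntry:
    service: str
    severity: Severity
    data_lost: bool
    incident_page: str
    start_date: str = ""  # left empty when the date is not recorded
    end_date: str = ""

    def to_row(self) -> str:
        """Render the entry in the index's column order."""
        data_loss = "YES" if self.data_lost else "NO"
        cells = [
            self.service,
            self.start_date,
            self.end_date,
            self.severity.name,
            data_loss,
            self.incident_page,
        ]
        return " | ".join(cells) + " |"
```

Using an enum for severity means a typo such as `Severity["MEDIUM"]` fails immediately with a `KeyError` instead of ending up in the index.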