Incident description

At 08:47 on Wednesday the 8th of January, our SharePoint servers suffered an outage with the message:

"404 Not Found, site not found"

This was a message from SharePoint informing us that it was experiencing unusually high traffic.


Incident severity: CRITICAL

Data loss: NO

Affected Services 

The following services were affected:

Cause

Our SharePoint servers receive requests from a “load balancer”, which routes each request to the server with the least load. SharePoint communicates with the load balancer using a service called the Request Management Service.

We found that the load balancer was configured with only one server to handle the entire user request load. This overloaded that server, causing it to reach a threshold after which it stops accepting requests.
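
As a rough illustration of the failure mode described above, the following Python sketch models least-load routing over a pool of servers that stop accepting requests at a threshold. It is not SharePoint's Request Management code. The names Server, route and simulate, the server names, and the threshold of 100 concurrent requests are all assumptions made for the example, and requests are treated as never completing, to keep the model simple.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    active_requests: int = 0
    # Hypothetical threshold; the real limit is not stated in this report.
    max_requests: int = 100

    def accepting(self) -> bool:
        # A server stops accepting requests once it reaches its threshold.
        return self.active_requests < self.max_requests


def route(pool):
    """Assign one request to the least-loaded server that is still accepting.

    Returns the chosen server, or None if every server is at its threshold.
    """
    candidates = [s for s in pool if s.accepting()]
    if not candidates:
        return None
    target = min(candidates, key=lambda s: s.active_requests)
    target.active_requests += 1
    return target


def simulate(server_count, request_count, max_requests=100):
    """Send request_count concurrent requests and return how many are rejected."""
    pool = [
        Server(f"server{i + 1}", max_requests=max_requests)
        for i in range(server_count)
    ]
    rejected = 0
    for _ in range(request_count):
        if route(pool) is None:
            rejected += 1
    return rejected


if __name__ == "__main__":
    # One server in the pool: everything past the threshold is rejected.
    print("1 server, rejected:", simulate(server_count=1, request_count=150))
    # Two servers in the pool: the same load fits within total capacity.
    print("2 servers, rejected:", simulate(server_count=2, request_count=150))
```

Under these assumed numbers, a pool of one server rejects every request beyond its threshold, while a pool of two absorbs the same load. Real capacity depends on the actual threshold and traffic pattern, which this sketch does not attempt to model.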

Resolution

To bring the SharePoint servers back online, we killed the IIS worker process so that the SharePoint server would resume accepting new requests.

Future Mitigation

We have added an additional server to the load balancer to handle user requests.

In total, we now have 2 servers handling user requests, and we have not seen any downtime in the 4 months since.
