Incident description
At 14:46 on Thursday the 7th of February, our SharePoint servers suffered an outage and returned the message:
"The server is busy now. Try again later."
This message is SharePoint's way of reporting that it is under unusually high traffic.
Incident severity: CRITICAL
Data loss: NO
Affected Services
The following services were affected:
- SharePoint (e.g. Intranet, Partner Portal, www.geant.org, etc.)
Cause
Our SharePoint servers receive requests from a “load balancer”, which allocates each request to the server with the least load. SharePoint communicates with the load balancer using a service called the Request Management Service.
It appears that the load balancer was sending all requests to only one of the servers. This overloaded that server, causing it to hit a threshold after which it stops accepting requests. It does this to force the load balancer to send requests to the other servers.
However, either the load balancer itself malfunctioned, or the Request Management Service stopped, which in turn caused the load balancer to malfunction. We are still investigating, but the problem appears to lie with the “load balancer”.
This may or may not be linked to high traffic from a DDoS attack. The load balancer may have been failing for a while, with request volumes too low to trigger the threshold.
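The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the function names, the number of servers, and the arrival and service rates are all invented for illustration, and the code does not reflect how SharePoint or our load balancer is actually implemented. It compares least-load routing with routing that is stuck on a single server, and reports when a queue first reaches the throttling threshold.

```python
from typing import Callable, List, Optional


def least_load(queues: List[int]) -> int:
    """Healthy routing: pick the server with the fewest queued requests."""
    return min(range(len(queues)), key=queues.__getitem__)


def stuck_on_first(queues: List[int]) -> int:
    """Faulty routing (hypothetical): every request goes to server 0."""
    return 0


def simulate(
    route: Callable[[List[int]], int],
    servers: int = 3,             # invented value
    arrivals_per_tick: int = 30,  # invented value
    served_per_tick: int = 12,    # invented per-server capacity
    ticks: int = 100,
    threshold: int = 500,         # queued-request threshold from this report
) -> Optional[int]:
    """Return the tick at which a queue first reaches the threshold, else None."""
    queues = [0] * servers
    for tick in range(1, ticks + 1):
        # Route each incoming request to the server chosen by the policy.
        for _ in range(arrivals_per_tick):
            queues[route(queues)] += 1
        # A server whose queue reaches the threshold stops accepting requests.
        if max(queues) >= threshold:
            return tick
        # Each server works through part of its queue.
        queues = [max(0, q - served_per_tick) for q in queues]
    return None


print(simulate(least_load))                      # None
print(simulate(stuck_on_first))                  # 28
print(simulate(stuck_on_first, threshold=1000))  # 55
```

With these invented numbers, balanced routing never reaches the threshold, while routing stuck on one server does. In this toy model a higher threshold delays the point at which the server stops accepting requests but does not prevent it while the routing remains stuck.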
Resolution
To bring the SharePoint servers back online, we turned off the throttling so that the servers would continue to accept requests.
Later, we raised the threshold from 500 queued requests to 1000.
Future Mitigation
We are still investigating the issue.
However, there are plans in place to speed up our response to such issues and to change the way we host SharePoint to eliminate these kinds of problems.
