Purpose of the document
GÉANT provides a large number of services that are used both by the community and internally. In most cases these publicly exposed services depend on further hidden services, such as authentication systems and databases. The expectation is that these services are available 100% of the time, but in practice they occasionally fail. Duplicate copies of some services therefore exist to provide redundancy. While these are designed to protect against loss of data, failing over the user interface requires manual intervention, which incurs a delay until an operator carries it out (this can sometimes extend to a day or two if, for instance, the outage occurs at the weekend). A further problem is that, even where a service is redundant at the data and user-interface level, the services on which it depends are often not redundant. In that case there is nothing to fail over to, and the underlying service must be repaired before the service is fully restored.
The following describes structures and facilities that will be introduced to improve service resilience and reliability in GÉANT.
Problems being addressed
As outlined above, a number of services hosted by GÉANT have suffered from time to time with poor availability
...
Generally, all service deployment can be done in the context of redundant services providing high availability. The impact on SWD is very low. Services should be deployed such that each instance is essentially the primary in every case.
Design goals
Provide an infrastructure which supports automatic service failover. Service failover should be invisible to users of a service.
Provide an infrastructure which provides service discoverability, where services:
...
Provide a zero-downtime scheduled maintenance framework: in a system which employs redundant services, maintenance should be scheduled such that there is no loss of service and users are unaware that maintenance is ongoing.
Automate service recovery, tolerant of system failure, within the context of the service reliability infrastructure. What this means is that, where possible, services should self-heal.
Monitor service availability. Services should be monitored for availability and redundancy.
Follow-up work: identify and document remaining single points of failure.
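As an illustration of the "Monitor service availability" goal, the sketch below classifies each service by how many healthy instances it currently has. It is a minimal example under stated assumptions: the function name, the input format (a mapping from service name to healthy-instance count) and the threshold of two instances are hypothetical and not part of the design above.

```python
def redundancy_report(healthy_counts, minimum=2):
    """Classify each service by its number of healthy instances.

    healthy_counts: dict mapping service name -> healthy instance count
                    (hypothetical input; how it is gathered is out of scope).
    minimum:        assumed number of instances needed to count as redundant.
    """
    report = {}
    for service, count in healthy_counts.items():
        if count == 0:
            report[service] = "down"       # no instance can serve requests
        elif count < minimum:
            report[service] = "degraded"   # available, but nothing to fail over to
        else:
            report[service] = "ok"         # available and redundant
    return report


# Example with made-up service names and counts.
report = redundancy_report({"web": 2, "auth": 1, "db": 0})
print(report)
```

Treating "degraded" separately from "down" matters here because a service with a single healthy instance is still available to users but has lost its redundancy.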
Structure
Each server will run the Consul agent with a configuration listing the services it runs and how to monitor them (to test whether they are serviceable). This configuration should be maintained in Puppet.
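To make the per-server configuration concrete, the sketch below builds a Consul service definition with a single HTTP health check and prints it as JSON, in the form a Consul agent can load from its configuration directory. The service name, port, health-check path and timings are illustrative assumptions, and in practice Puppet would be expected to template the resulting file rather than a script.

```python
import json


def service_definition(name, port, health_path="/health",
                       interval="10s", timeout="2s"):
    """Build a Consul service definition with one HTTP health check.

    All argument values are illustrative; the health endpoint is assumed
    to answer on localhost at the service's own port.
    """
    return {
        "service": {
            "name": name,
            "port": port,
            "check": {
                # The agent polls this URL; a 2xx response marks the check as passing.
                "http": f"http://localhost:{port}{health_path}",
                "interval": interval,
                "timeout": timeout,
            },
        }
    }


# Hypothetical service used only for illustration.
definition = service_definition("example-web", 8080)
print(json.dumps(definition, indent=2))
```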
The Consul servers create, update and push a DNS zone file. This should occur as frequently as is reasonable, to minimise the chance of a DNS query for a service returning a stale or missing record. Perhaps every 5 minutes or more frequently?
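A rough sketch of the zone-generation step is given below: it renders zone file text containing one A record per healthy service instance. How the instance list is obtained from Consul and how the file is pushed to the DNS servers are not specified above, so the input format, the function name, the SOA timer values and the example names and addresses are all assumptions.

```python
def render_zone(origin, nameserver, contact, serial, instances, ttl=60):
    """Render DNS zone file text with one A record per healthy instance.

    origin, nameserver, contact: fully qualified names ending in a dot.
    instances: list of dicts with 'service', 'address' and 'healthy' keys
               (hypothetical format for this sketch).
    ttl:       kept short so resolvers notice a failover quickly.
    """
    lines = [
        f"$ORIGIN {origin}",
        f"$TTL {ttl}",
        # SOA fields: serial, refresh, retry, expire, negative-caching TTL.
        f"@ IN SOA {nameserver} {contact} ({serial} 300 60 86400 {ttl})",
        f"@ IN NS {nameserver}",
    ]
    # Sort so that repeated runs over the same data give identical output.
    for inst in sorted(instances, key=lambda i: (i["service"], i["address"])):
        if inst["healthy"]:  # unhealthy instances are left out of the zone
            lines.append(f"{inst['service']} IN A {inst['address']}")
    return "\n".join(lines) + "\n"


# Example data using documentation-only names and addresses.
zone = render_zone(
    origin="service.example.org.",
    nameserver="ns1.example.org.",
    contact="hostmaster.example.org.",
    serial=2024010101,
    instances=[
        {"service": "web", "address": "192.0.2.10", "healthy": True},
        {"service": "web", "address": "192.0.2.11", "healthy": False},
        {"service": "auth", "address": "192.0.2.20", "healthy": True},
    ],
)
print(zone)
```

The record TTL and the regeneration interval together bound how long clients may keep resolving to a failed instance, so the two values are best chosen together.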
Components
Consul: provides service discovery
...