...
GÉANT presents a large number of services which are used by the community and internally. These publicly exposed services have, in most cases, further dependencies on hidden system services such as authentication systems and databases. The expectation is that these services are available 100% of the time, but in practice they occasionally fail. For this reason, duplicate copies of some services exist to provide redundancy. While these copies protect against loss of data, failing over the user interface requires manual intervention, which incurs a delay until an operator carries it out (this can sometimes extend to a day or two if the outage occurs at the weekend, for instance). A further problem is that even where a service provides redundancy at the data and user-interface level, the services upon which it depends often are not redundant, meaning there is nothing to fail over to and the underlying service must be repaired before the dependent service is fully restored.
...
A couple of examples of specific problems which have occurred in the recent past, and which the work proposed in this document hopes to address, are listed here.
Dashboard
...
Many systems, such as JIRA and Dashboard, depend on the Crowd server for user authentication and access control. The Crowd server has failed occasionally in the past, preventing user access to Dashboard and JIRA. Providing a second Crowd server which can be failed over to is relatively straightforward (indeed, the uat-crowd server is configured identically to prod-crowd), but failing over to it is still a manual task in the current infrastructure.
...
Generally, all services can be deployed redundantly to provide high availability, and the impact on SWD is low. Services should be deployed such that every instance is capable of acting as the primary.
Design goals
Provide an infrastructure which supports automatic service failover. Service failover should be invisible to users of a service.
...
Monitor service availability. Services should be monitored for availability and redundancy.
Follow-up work: identify and document remaining single points of failure.
...
The current infrastructure tightly couples each service to the server it is deployed on. The following describes a setup which maintains the coupling of a service to a specific server but uses service discovery tools and DNS to advertise services dynamically.
Each server will run the Consul agent and include a configuration listing the services it runs and how to monitor them (to test whether they are serviceable). This configuration should be maintained in Puppet.
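As an illustrative sketch (using the crowd service, which appears in the example later in this document; the port and health-check URL are assumptions and would need to be confirmed against the actual deployment), the service definition dropped into the agent's configuration directory by Puppet might look like:

  {
    "service": {
      "name": "crowd",
      "port": 8095,
      "check": {
        "http": "http://localhost:8095/crowd/console/",
        "interval": "30s",
        "timeout": "5s"
      }
    }
  }

The check stanza tells the local agent how to poll the service so that its health can be reported to the Consul servers.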
The service discovery tool is an agent called Consul, developed by HashiCorp (the same company responsible for Vagrant and Terraform).
The Consul agent operates in either client or server mode. Consul clients, running on GÉANT Linux servers, register services with the Consul server. The Consul server then has a record of the services running on the network and where they are located. With this information it can either act as a source for DNS lookups (via a DNS forward zone) or update the network DNS server with the information it holds about network services. In the latter case the Consul servers create, update and push a DNS zone file. This should occur frequently enough to minimise stale DNS answers for a service. Perhaps every 5 minutes or more frequently?
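For the DNS-lookup option, note that Consul servers answer DNS queries natively on port 8600 under the built-in .consul domain, so a registered service can be resolved directly, for example (using the crowd service and server names from the example below):

  dig @consul-server01.geant.org -p 8600 crowd.service.consul A

Only healthy instances are returned in the answer.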
In this setup the Consul client agent is installed on every node. Included with the service when it is deployed on a server is the Consul configuration for that service, and the agent uses this to register the service with the Consul server. The configuration includes a health check stanza which the Consul server uses to confirm the health of the particular service and, thus, to determine whether the service is advertised for use.
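How the agent is started is an implementation detail, but a plausible invocation on each node (the paths and join address are assumptions; in practice this would live in a systemd unit managed by Puppet) is:

  consul agent -config-dir=/etc/consul.d \
               -data-dir=/var/lib/consul \
               -retry-join=consul-server01.geant.org

The -config-dir flag points at the directory where the per-service definitions described above are picked up.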
Services are advertised in DNS. The Consul servers can be queried directly (via a DNS forwarder, for instance); alternatively, the Consul server can update GÉANT's DNS infrastructure by periodically pushing DNS zone files to Infoblox. A DNS query port on a Consul server is a single point of failure, so pushing a zone file to Infoblox is the more appropriate solution.
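The zone file pushed to Infoblox would simply contain one A record per healthy service instance. A plausible fragment for the crowd example below (the 300-second TTL is an assumption, kept low so that clients pick up a failover quickly):

  crowd.geant.services.  300  IN  A  1.1.1.10
  crowd.geant.services.  300  IN  A  1.1.1.11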
Example for the crowd service
As an example consider the Atlassian Crowd service deployed onto two servers:
prod-crowd01.geant.org
- location: Frankfurt PoP
- IP: 1.1.1.10
- service name: crowd
- consul agent
prod-crowd02.geant.org
- location: City House PoP
- IP: 1.1.1.11
- service name: crowd
- consul agent
consul-server01 (02 and 03)
- consul agent (server mode)
Infoblox (DNS):
- zone: geant.services
  - crowd
    - IP: 1.1.1.10
    - IP: 1.1.1.11
In the above setup, a client connects to crowd.geant.services and the DNS lookup will return either 1.1.1.10 or 1.1.1.11 (it doesn't matter which, as long as the crowd service works on both). If the crowd service on prod-crowd02 fails (or the server fails, or there are connectivity problems), the Consul server detects this and updates the zone file to look like this:
Infoblox:
- zone: geant.services
  - crowd
    - IP: 1.1.1.10
In this case, a lookup of crowd.geant.services will return 1.1.1.10 only.
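From a client's point of view the failover is nothing more than a changed DNS answer. Illustratively, before the failure a lookup would return both addresses, and afterwards only the healthy one:

  $ dig +short crowd.geant.services
  1.1.1.10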
Components
Consul: provides service discovery
Infoblox: DNS for services
Install the Consul agent on all servers which have services to be discovered
...