The following describes the structure and facilities that provide service resilience and reliability in GÉANT.
Components
Consul: provides service discovery
Infoblox: DNS for services
Install the Consul agent on all servers that have services to be discovered
Have a quorum of Consul servers
The Consul servers generate a DNS zone file, which is pushed to Infoblox.
Problems being addressed
Primary and Backup dashboard - CNAME change is required to point to the operational dashboard (which is usually primary).
Crowd authentication: prod-crowd and uat-crowd contain identical information, but systems are currently configured to use one or the other.
In general, every service can be deployed as a set of redundant instances that together provide high availability. The impact on SWD is very low. Each instance should be deployed as though it were the primary.
Structure
Each server will run the Consul agent with a configuration listing the services it runs and how to monitor them (to test whether they are serviceable). This configuration should be maintained in Puppet.
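As an illustration of what such a per-server configuration might contain, the sketch below builds a Consul service definition with a single HTTP health check and prints it as JSON. The service name "dashboard", the port 8080 and the /health URL are hypothetical placeholders, not values taken from the GÉANT deployment, and the check interval and timeout are assumed defaults. In practice Puppet would template the resulting file into the agent's configuration directory.

```python
import json


def service_definition(name, port, health_url, interval="10s", timeout="2s"):
    """Build a Consul service definition with one HTTP health check.

    All argument values used below are illustrative assumptions.
    """
    return {
        "service": {
            "name": name,
            "port": port,
            "check": {
                "http": health_url,    # URL the agent polls to test the service
                "interval": interval,  # how often the check runs
                "timeout": timeout,    # how long to wait before failing the check
            },
        }
    }


if __name__ == "__main__":
    # Hypothetical service: a dashboard listening on port 8080.
    definition = service_definition(
        "dashboard", 8080, "http://localhost:8080/health"
    )
    print(json.dumps(definition, indent=2))
```

A service whose check fails is marked unhealthy by the agent, which is what would allow it to be left out of the generated DNS data.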
The Consul servers create, update and push a DNS zone file. This should happen as frequently as is reasonable, to minimise the chance of a service DNS query miss. Perhaps every 5 minutes, or more frequently?
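The zone file generation step could look roughly like the sketch below, which renders A records for the healthy instances of each service. This is a minimal sketch under stated assumptions: the zone name "service.example", the 60-second TTL and the documentation-range addresses are hypothetical, the output is only a record fragment (no SOA or NS records), and how the healthy addresses are collected from Consul and how the result is pushed to Infoblox are deliberately left out.

```python
def render_zone(zone, healthy, ttl=60):
    """Render a zone file fragment with one A record per healthy instance.

    `healthy` maps a service name to the list of IPv4 addresses whose
    health checks currently pass. A service with no healthy instances
    gets no records at all.
    """
    lines = [f"$ORIGIN {zone}.", f"$TTL {ttl}"]
    for service in sorted(healthy):
        for address in sorted(healthy[service]):
            lines.append(f"{service} IN A {address}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical input: two healthy dashboard instances, no healthy crowd.
    print(render_zone(
        "service.example",
        {"dashboard": ["192.0.2.11", "192.0.2.10"], "crowd": []},
    ))
```

Sorting the services and addresses keeps the output stable between runs, so an unchanged set of healthy instances produces an identical file and a push can be skipped. The record TTL probably wants to be short relative to the push interval, otherwise resolvers may keep serving a failed instance after the zone has been updated.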