RHQ High Availability - Design and Goals
RHQ 1.1.0 is scheduled to introduce RHQ High Availability (HA). The initial goal of RHQ HA is to support multiple RHQ servers configured for a single database repository. Agent load can then be distributed among the available servers, and agents whose server goes down for any reason will fail over to another server. In this way RHQ HA will introduce both scalability and fault tolerance.
The sections below describe features of RHQ High Availability that are currently in plan for version 1.1.0. Although accurate at the time of writing, the content and features below are subject to change or omission.
Server Cloud
The foundation of RHQ HA is a cloud of RHQ servers. The cloud is made up of one or more loosely coupled RHQ servers. Cloud members must:
- Have a unique name.
- Have a unique endpoint [non-translated address/(secure)port].
- Be configured for the same database.
- Have compatible RHQ versions (initially, all must be version 1.1.0).
A single-server installation is considered a 1-member HA server cloud and must therefore supply valid name and endpoint information. The RHQ installer will handle multiple scenarios: single server installation, the addition of a cloud member, and the upgrade/re-installation of an existing cloud member. The server distribution will remain a .zip file in RHQ 1.1.0; the archive will unzip into a new directory.
Load Balancing
HA will provide load balancing by distributing agents to different servers in the cloud. There are various factors that will contribute to the distribution algorithm:
Affinity
An "Affinity Group" is a tag (just a unique name) created by the administrator. It can then be assigned to any number of RHQ servers and agents. For example, in an environment with several data centers it may be useful to assign each data center an affinity group (e.g. "DC1", "DC2"). Agents and servers co-located in a single data center can then set their affinity groups appropriately. The distribution algorithm will then prefer to assign agents to servers within the same affinity group. Affinity is a preference, not a guarantee: depending on load and availability, an agent may be assigned to a server outside its affinity group.
- A server or agent can be assigned to only one affinity group.
- By default a server or agent is assigned no affinity group.
- A server can be assigned an affinity group at install time or via the HA Administration Console.
- An agent can be assigned an affinity group via the HA Administration Console.
Round Robin
The goal is to distribute agents across servers while taking into consideration aspects such as affinity. Additionally, the distribution algorithm will take into consideration failover topology. Agents will be assigned server lists (see below) in a weighted round-robin fashion.
For example, in a 3-Server cloud with no affinity, agents would be assigned server lists similar to:
A1 : S1, S2, S3
A2 : S2, S3, S1
A3 : S3, S1, S2
A4 : S1, S3, S2
If S1, S2, A1, A2, A3 all had affinity then the assigned server lists may look similar to:
A1 : S1, S2, S3
A2 : S2, S1, S3
A3 : S1, S2, S3
A4 : S3, S2, S1
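The two examples above can be approximated with a short sketch. This is not the planned RHQ algorithm; it is a minimal illustration that assumes each server's weight is simply the number of agents already using it as the head of their list. The function name assign_server_lists and the dictionary inputs are hypothetical.

```python
from collections import Counter


def assign_server_lists(servers, agents):
    """Build an ordered server list for each agent.

    servers, agents: dict mapping name -> affinity group (None for no group).
    Assumption (not from the RHQ design): a server's load is the number of
    agents for which it is currently the head of the list.
    """
    server_order = list(servers)
    primary_load = Counter()
    server_lists = {}
    for agent, group in agents.items():
        def rank(server):
            # Servers sharing the agent's affinity group sort first,
            # then less-loaded servers, then original order as tie-breaker.
            same_group = group is not None and servers[server] == group
            return (0 if same_group else 1,
                    primary_load[server],
                    server_order.index(server))

        ordered = sorted(server_order, key=rank)
        server_lists[agent] = ordered
        primary_load[ordered[0]] += 1
    return server_lists


if __name__ == "__main__":
    servers = {"S1": "DC1", "S2": "DC1", "S3": None}
    agents = {"A1": "DC1", "A2": "DC1", "A3": "DC1", "A4": None}
    for agent, server_list in assign_server_lists(servers, agents).items():
        print(agent, ":", ", ".join(server_list))
```

Under these assumptions the affinity preference dominates the ordering and load only breaks ties within a preference tier, which is one simple way to express "affinity is preferred but not guaranteed".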
Server Assignment
Perhaps the best way to understand the proposed behavior for Agent and Server assignment is to look at various use cases for how an RHQ agent determines its server. To do this, a few terms need to be defined:
- Token
The Agent's "Token" is an identifier provided by the server to, and persisted by, an agent at registration time. An agent will not have a token on initial startup. The agent's token will be deleted if it is started with the --clean option.
- Server List (a.k.a Failover List)
The Agent's "Server List" is provided by the server to, and persisted by, an agent upon request. It is an ordered list of servers the agent will use for connection.
- Primary Server
The Agent's "Primary Server" is the Server from the Server List to which a running agent is currently connected.
- Setup Server (a.k.a. Registration Server)
The Agent's "Setup Server" is the server defined (address/port) in the agent setup questions. The agent setup questions are presented on initial startup or if the agent is started with the --clean or --setup option.
Agent startup logic:
If the agent:
- has no assigned token
Then
- removes its server list (in the unlikely case that it has one persisted)
- attempts to register with the setup server
- if the agent name is not known to the server a new token is generated, otherwise the existing token is supplied to the agent.
- if the setup server can not be contacted the agent will not be able to connect to an RHQ server and must be reconfigured or wait for the setup server to come online.
If the agent:
- has a token
- has no server list
Then
- requests a server list from the setup server
- if the setup server can not be contacted the agent will not be able to connect to an RHQ server and must be reconfigured or wait for the setup server to come online.
If the agent:
- has a token
- has a server list
Then
- attempts to connect to servers on the server list, in order, starting with the head of the list, until a connection succeeds or the server list is exhausted.
- When a server list is exhausted it will be reprocessed, from the head of the list, after some (configurable) delay.
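The three startup cases above can be summarized as a small decision function. This is only a sketch of the described logic; the function name and the step names it returns are hypothetical labels, not agent internals.

```python
def next_startup_step(token, server_list):
    """Decide what the agent does at startup.

    token: the persisted token, or None if the agent has none.
    server_list: the persisted ordered server list (possibly empty).
    Returns (step, server_list_to_keep). Step names are illustrative only.
    """
    if token is None:
        # No token: discard any stale server list and register
        # with the setup server.
        return "register_with_setup_server", []
    if not server_list:
        # Token but no server list: ask the setup server for one.
        return "request_server_list_from_setup_server", []
    # Token and server list: walk the list in order from the head.
    return "connect_using_server_list", list(server_list)


def connect_in_order(server_list, try_connect):
    """Return the first server that accepts a connection, or None if the
    list is exhausted. try_connect is a caller-supplied callable."""
    for server in server_list:
        if try_connect(server):
            return server
    return None


if __name__ == "__main__":
    print(next_startup_step(None, ["S1", "S2"]))
    print(next_startup_step("abc123", []))
    print(next_startup_step("abc123", ["S1", "S2"]))
```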
Failover
After successful startup the agent will be connected to its primary server. If the agent loses its connection to the primary server it will first attempt to reconnect, to ensure the connection loss was not just temporary (e.g. a network blip). If reconnection does not succeed, the agent will attempt to fail over to a different server, starting with the head of its server list, until a connection is made or until the server list is exhausted. When a server list is exhausted it will be reprocessed, from the head of the list, after some (configurable) delay.
Upon connection to a new primary server the agent will scale its workload incrementally. This is to prevent overwhelming a particular server after a large-scale failover. For example, in a 2-Server cloud, if one server goes down all agents will fail over to the remaining server, which would otherwise receive the full workload of every agent at once.
Messages from agent-to-server that were marked for reliable delivery will be sent to the new server once a connection is established.
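The failover walk over the server list might look like the following sketch. The parameter names, the default delay, and the max_passes bound are assumptions made to keep the example finite and testable; the design above only states that the list is reprocessed after a configurable delay.

```python
import time


def fail_over(server_list, try_connect, retry_delay_seconds=30.0,
              max_passes=3, sleep=time.sleep):
    """Walk the server list from the head until a connection succeeds.

    try_connect: caller-supplied callable returning True on success.
    max_passes: hypothetical bound on how many times the list is
    reprocessed; it exists only so this example terminates.
    Returns the new primary server, or None if every pass failed.
    """
    for attempt in range(max_passes):
        for server in server_list:
            if try_connect(server):
                return server
        # List exhausted: wait before reprocessing from the head.
        if attempt < max_passes - 1:
            sleep(retry_delay_seconds)
    return None


if __name__ == "__main__":
    available = {"S3"}
    primary = fail_over(["S1", "S2", "S3"], lambda s: s in available,
                        sleep=lambda seconds: None)
    print("new primary:", primary)
```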
Redistribution
The HA Server Cloud will, in certain circumstances, redistribute agents. RHQ will periodically review the server-agent topology and decide whether redistribution is necessary. If so, RHQ will re-balance agent load across available servers. This will result in new server lists being delivered to connected agents. It is important to note that the redistribution algorithm will seek to limit connection churn, and as such will ask the minimal number of connected agents to change servers to accomplish the re-balancing.
Redistribution can occur for the following reasons:
- a new server is added to the cloud.
- an existing server comes online in NORMAL operating mode.
- an existing server goes down.
- an existing server is deleted from the cloud.
- an existing server is put into MAINTENANCE operating mode.
- an administrator requests it via the HA Administration Console.
Upon reception of a new server list the agent will reconnect to its new primary server (the head of the just received list) if it differs from its current primary server.
Note that a server going down does not force immediate redistribution but rather relies on failover. A server being removed from the cloud, or being put into Maintenance Mode, similarly relies on failover. The lost server will be removed from server lists in a future redistribution, whether periodic or administrator-requested.
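To illustrate the goal of limiting connection churn, the sketch below re-balances primary-server assignments while moving as few agents as possible. It is an assumption-laden simplification: it balances by agent count only, ignores affinity, and the rebalance function and its inputs are hypothetical.

```python
def rebalance(assignment, servers):
    """Re-balance agents across available servers with minimal moves.

    assignment: dict agent -> current primary server.
    servers: non-empty list of servers currently available.
    Returns a new dict agent -> primary server.
    """
    new_assignment = dict(assignment)
    by_server = {server: [] for server in servers}
    movable = []
    for agent, server in assignment.items():
        if server in by_server:
            by_server[server].append(agent)
        else:
            # Primary is no longer available; this agent must move.
            movable.append(agent)

    # Even split; busier servers keep the larger quota to avoid moves.
    base, extra = divmod(len(assignment), len(servers))
    ranked = sorted(servers, key=lambda s: -len(by_server[s]))
    quota = {s: base + (1 if i < extra else 0) for i, s in enumerate(ranked)}

    # Take agents only from servers that exceed their quota.
    for server in servers:
        while len(by_server[server]) > quota[server]:
            movable.append(by_server[server].pop())

    # Hand them to servers that are below their quota.
    for server in servers:
        while len(by_server[server]) < quota[server]:
            agent = movable.pop()
            by_server[server].append(agent)
            new_assignment[agent] = server
    return new_assignment


if __name__ == "__main__":
    current = {"A1": "S1", "A2": "S1", "A3": "S1",
               "A4": "S2", "A5": "S2", "A6": "S2"}
    updated = rebalance(current, ["S1", "S2", "S3"])
    moved = [a for a in current if current[a] != updated[a]]
    print("agents asked to change servers:", sorted(moved))
```

In this usage example a third server joins a cloud of two evenly loaded servers, and only the agents needed to fill the new server's share are asked to change servers.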
Server Maintenance Mode
An HA Server can be taken out of the cloud for maintenance reasons without actually being shut down. This is done via the HA Administration Console and effectively shuts down all agent communication with the server, although the server remains up and the RHQ GUI remains usable. Agents will treat this as a downed server and will apply reconnect and failover logic as needed.
There are two Server Operating Modes, NORMAL and MAINTENANCE. A Server comes up in the same operating mode it was set to when it went down.
HA Administration Console
The RHQ GUI will offer an HA Administration Console (HAAC), available to RHQ users with management permissions. It will be accessed via the Administration Page in the existing GUI. The Administration Console will offer the following features:
- List of HA Servers
- All configured details
- Active agent count
- Ability to change Operating Mode
- HA Server Detail
- Limited editing (Endpoint address, port, secure port, affinity group)
- Agent list
- List of Known Agents
- Assign affinity group
- Agent Detail
- Server List
- Primary Server
- HA Options
- Forced Redistribution
GUI
Note that it doesn't matter which server you connect to in the Server Cloud to use the RHQ GUI; the viewable resources and available options will be identical regardless of which you choose.
RHQ Agent
Commands
The RHQ Agent will have new commands introduced with HA:
- Server List View
View the current server list for the agent.
- Server List Regenerate
Request a new server list be generated for the agent if for some reason the current list is stale.
Upgrade
Automated agent updates are not in plan for the next version of RHQ. The RHQ agent will need to run the same version as the RHQ server, and will need to be re-installed with the new version. Ease of installation and agent backward compatibility are high-priority goals for future versions.
Future
The following features are currently not in scope for RHQ 1.1.0 but are in plan for subsequent releases of RHQ High Availability.
Load Balancing (Future)
- Relative server power (number of cores)
- Agent load
Relative Server Power
If the server cloud is made up of servers with unequal compute power it makes sense to assign more agent load to the servers with more compute power.
Agent Load
RHQ agents can vary significantly in the load they put on a server based on number of inventoried resources, measurement collection (schedule) frequency, and other factors. HA will base agent assignment not on number of agents but on relative agent load.
Database Failure Handling (Future)
On database failure, all RHQ servers configured for that database will be moved to Maintenance Mode, on a best-effort basis of detecting the failure. When the database is restored, servers that are still running can be reset to Normal operating mode via the HAAC in the GUI.
Failover (Future)
Initially a server will have no hard limit on how much agent load can be assigned to it. A potential future feature is the ability to define limits on server load which, when enforced, will cause the server to deny agent connection requests.
Redistribution (Future)
Redistribution can occur for the following reasons:
- in response to unbalanced agent load.
RHQ Agent (Future)
- Maintenance Mode
Put the agent into Maintenance Mode. This will suppress failover in situations where it is known that the primary may be down temporarily.