Elasticsearch: Comparison of New and Old Primary Election Algorithms

Es6 vs Es8 primary election contrast

Aug 22, 2023

Preview

I recently started investigating various distributed system internals, and since my daytime work, has very high coupling with Elasticearch. This article spawned from a combination of those real-life collaborations! Without any further ado, let’s jump right into it. Please keep in mind, This article already assumes on basic understanding of Elasticsearch. I will directly jump into leader elections.

I am still learning these systems, if you see any enhancement I can do or fix an issue. Please feel free to shout out!

Introduction

Elasticsearch, a powerful search engine built on top of Apache Lucene, has undergone numerous improvements over the years. One of the areas that has witnessed changes is the primary selection algorithm, the method used to determine which node becomes the leading or "master" node. Before Elasticsearch 7.0, the "Bully" algorithm dominated this task, but with the introduction of the newer versions, this changed.

The Old Bully Algorithm

History: Until Elasticsearch 7.0, Elasticsearch employed the Bully algorithm for primary selection. The name "Bully" suggests a kind of dominance, implying that the algorithm is assertive or aggressive in its approach.

Principle: The Bully algorithm determines the leader based on node ID values. In essence, the node with the highest ID takes precedence and becomes the leader.

Message Types in the Bully Algorithm: During the election process, the Bully algorithm uses three types of messages:

Election Message: Announces the initiation of an election.
Answer (Alive) Message: Serves as a response to the Election message.
Coordinator (Success) Message: Sent by the winning node to announce its victory.

Example

When selecting a leader, node1 sends out an Election message. If no response is received, it's either because the other nodes have a higher ID or they're also broadcasting their election messages. The nodes will then exchange Answers and Coordinator messages to finalize the master.

Explanation:

Node 1 initiates an election by sending a message to Node 3, containing its own serial number, which is 1.
Upon receiving the message, both Node 2 and Node 3 compare the provided sequence number with their own. Realizing that their sequence numbers are higher, they reply to Node 1 with an "Alive" message, indicating that Node 1 is not the primary in the sequence.
Node 2 forwards an election request to Node 3. However, Node 3 is unable to identify any node with a higher sequence number to which it should forward this election request.
Node 3 sends a response back to Node 2. More importantly, Node 3 does not receive any response messages from other nodes during this period.
Now recognized as the leader(or Primary), Node 3 dispatches a Coordinator Message to the other nodes, declaring its successful election and announcing itself as the master node to both Node 1 and Node 2

How Bully Algorithm was used in Elasticsearch 6

For the Bully algorithm to select the Elasticsearch master, it primarily follows these steps:

Identify active master nodes. This is determined from the discovery.zen.ping.unicast.hosts configuration in the elasticsearch.yml file.
1. In this process, we'll ping the hosts directly. Any nodes that respond to the ping and are available and active, excluding the current node, will be added to the activeMasters list.

List<DiscoveryNode> activeMasters = new ArrayList<>();
for (ZenPing.PingResponse pingResponse : pingResponses) {
if (.. !localNode.equals(pingResponse.master()..)
                activeMasters.add(pingResponse.master());
...

Determine potential master nodes. A list masterCandidates is built containing nodes eligible to be master.

// nodes discovered during pinging
List<MasterCandidate> masterCandidates = new ArrayList<>();
for (PingResponse pingResponse : pingResponses) {
  if (pingResponse.node().isMasterNode()) {
       masterCandidates.add(new MasterCandidate(pingResponse.node(), ..)
...

Conduct the Bully algorithm election. If no active master nodes exist and the number masterCandidates is sufficient, the node with the lowest ID is elected as the master.

 if (activeMasters.isEmpty()) {
    if (electMaster.hasEnoughCandidates(masterCandidates)) {
       final MasterCandidate winner = ...electMaster(masterCandidates);
       logger.trace("candidate {} won election", winner);
       return winner.getNode();
    } else {
      // if we don't have enough master nodes, we bail
      logger.warn("not enough master nodes discovered ...",...)
      return null;
    }
 } else {
    assert !activeMasters.contains(localNode) :
                "local node should never be elected as master when other nodes indicate an active master";
            // lets tie break between discovered nodes
    return electMaster.tieBreakActiveMasters(activeMasters);
}

If you’re interested in the master id comparison, you can further grab it from here and here.

The Bully algorithm, being straightforward, was favored for its simplicity. It's easy to implement, and its decision-making is swift based on comparing nodes. This simplicity and speed probably explain its adoption in earlier versions of Elasticsearch.

Limitations: However, its simplicity can come with drawbacks. The Bully algorithm may not be suitable for more complex scenarios.

For instance, when nodes frequently join or leave the cluster, master nodes might change often. This can increase the metadata size, making it unwieldy over time.
Moreover, in complicated scenarios, a robust primary selection algorithm is vital to ensure stability, which the Bully algorithm might not guarantee.

Why the Bully Algorithm Was Replaced

The ES7 (Elasticsearch 7) version brought significant changes, not only to its master election algorithm but even some types were deleted. Many users were still on ES5 or ES6. At my daytime work, we were stuck in Es2.4 in some projects, quite embarrassing. It took a phenomenal effort to upgrade it to Es8.4.3. That upgrade journey is a different story by itself.

Anyway, A deeper dive into Elasticsearch’s official documentation reveals the reasons for the shift:

If you want to directly get it from the source here

The Shortcomings of Zen Discovery (Es6 election)

Let’s summarize the article. In Elasticsearch versions 6.x and earlier, Zen Discovery was the mechanism for cluster coordination. Over the years, it effectively supported clusters of all sizes. According to Elastic, for desired improvements, fundamental changes in its operation were necessary.

Zen Discovery required users to manually configure the discovery.zen.minimum_master_nodes to determine how many eligible nodes could form a quorum.
1. It was crucial to correctly configure this parameter on every node. If the cluster scaled dynamically, this had to be updated accordingly.
The system couldn't detect a misconfiguration of this parameter. Furthermore, after adding or removing nodes, adjusting this parameter was often overlooked.
Zen Discovery tried to mitigate misconfigurations by waiting for a few seconds during every master election process. This meant that if a chosen master node failed, the cluster would be unavailable for at least several seconds before an alternative was chosen. A big headache, if you want to maintain very high availability.
If the cluster couldn’t elect a master node, understanding why was often painstakingly challenging.

The Introduction of the Raft-like Algorithm in Es 7

From ES7 onwards, the following were the changes to the cluster coordination subsystem:

Automated Quorum Selection: Instead of relying on the manual configuration of minimum_master_nodes in Zen Discovery, the Elasticsearch Raft can autonomously determine which nodes can form a quorum, simplifying cluster coordination and reducing the risk of human error.
Enhanced Cluster: Standard master node elections are now much faster. Higher availability
Improved Cluster Scalability: Scaling the cluster is safer and easier. There are fewer system configurations that might lead to data loss.
Clearer Node Diagnostics: Nodes log their states more clearly, assisting in diagnosing why they can't join a cluster or why a master can't be elected. Easier to diagnose or investigate issues related to coordination.

It appears they investigated many recognized distributed consensus algorithms like Raft, each had its trade-offs. After careful evaluation, Elasticsearch chose Raft, but it wasn't a ditto implementation.

TLDR: How Raft Works in a very high-level

Raft is designed to solve the consistency problem in distributed systems:

Roles in Raft: Nodes can have one of three roles: Follower, Candidate, or Leader.
Initially, all nodes start as Followers.
If a node doesn't receive a message from the Leader, it becomes a Candidate.
The Candidate then sends election requests to other nodes.
If a majority of nodes approve, the Candidate becomes the Leader.

Election criteria include not receiving heartbeats from the Leader or the Leader timing out.

TLDR: Raft-like Algorithm in Elasticsearch

Elasticsearch's election mechanism is Raft-inspired but differs in several ways:

Nodes start as Candidates.
Multiple votes are allowed, i.e., nodes can cast multiple votes.
Candidates can vote.
Multiple primary nodes might be elected.

Example

For example, if node1, node2, node3 elect the master

If node1 becomes the leader and node2 requests votes, node1 abdicates, and node2 becomes the leader. If node3 also requests votes, node2 abdicates for node3. The main essence is to converge and decide on the final leader node.

In Summary

The initial state of the node is Candidate. When joining the cluster, if our discovery finds that there is already a leader in the cluster, the newly added node is automatically converted to a follower.

Here are some of the voting transitions within the algorithm

Advantages of the Raft-like Algorithm

1. Elimination of the need for discovery.zen.minimum_master_nodes. Elasticsearch selects nodes that can form a quorum. Only an initial master node list needs configuration.

2. Newer leader elections are much faster than Zen Discovery.

3. It resolves challenges faced in Zen Discovery, such as potential cluster state update losses due to network partitions.

Limitations

1. In clusters with many nodes, repeated leader elections might slow down the process.

Other Implementations of Raft

Well-known projects like ETCD, Kafka, and RocketMQ have also adopted the Raft protocol.

ETCD uses Raft at its core.
Kafka initially relied on the ZAB protocol (mix of Paxos-like, raft-like, atomic broadcast) provided by Zookeeper. However, from version 2.8 onwards, it switched to Raft, indicating a preference over ZAB.
RocketMQ internally is based on the dledger source code, which is Raft-based.

Conclusion

The adoption of the Raft-like algorithm in Elasticsearch was essential for efficiency, scalability, and reliability. This overview, while simplified, should provide a clear understanding of the changes and reasons behind them.

The hope is that continuous learning and effort will lead to better opportunities and outcomes for all.

Venkat’s Substack

Discussion about this post