Nowadays, there are many ways to use Elasticsearch depending on your scale and budget. Elasticsearch hosting options fall into two categories:
Host Elasticsearch yourself, either on-premises or in a cloud provider like AWS
Use a premium hosting solution like Elastic Cloud or Amazon Elasticsearch Service
If you are already paying for a premium solution, you can assume that your cluster is already production-ready! However, if you are hosting your own Elasticsearch clusters, how do you know whether they are optimized and ready for production?
What is Production-Ready Elasticsearch?
The Elasticsearch documentation has its own definition of production-ready Elasticsearch in terms of Elasticsearch settings and system configuration. As soon as an Elasticsearch node detects that it is being used in production, it executes a set of bootstrap checks. The checks verify the recommended production settings against your OS and existing configuration. Unfortunately, all bootstrap checks execute at the node level, so there are several cluster-level best practices that cannot be detected during the startup process of a single node.
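In practice, a node switches from development mode to production mode (and the bootstrap checks become fatal) as soon as it is configured to form a cluster over a non-loopback address, typically by setting network.host. A minimal elasticsearch.yml sketch, with a placeholder address:

```yaml
# elasticsearch.yml
# Binding to a non-loopback address is what puts the node in production mode,
# so any bootstrap check that fails will now prevent the node from starting.
network.host: 192.0.2.10   # placeholder address for illustration
```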
Production-Ready Elasticsearch Redefined
My definition of a production-ready Elasticsearch cluster is a cluster that:
Is fault-tolerant against different types of node failures
Consists of nodes optimized for performance
The rest of the post will focus on the fault-tolerance aspects of Elasticsearch, which often get neglected until node failures start leading to downtime and potentially data loss.
Fault Tolerance For Elasticsearch Clusters
Elasticsearch offers data redundancy by replicating the primary shards of an index across the cluster. The following Elasticsearch features address cluster fault tolerance beyond simple shard replication.
Dedicated Master Nodes
If the active master node goes down, your entire cluster becomes unavailable until another master node is elected. That is why it is important to avoid putting any significant load on your master-eligible nodes and to create dedicated master nodes instead. The processing requirements of master nodes are low, allowing you to allocate cheaper resources to host them.
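As a rough sketch, a dedicated master node in 6.x (and early 7.x) is simply a node with the data and ingest roles switched off in elasticsearch.yml; later 7.x releases can express the same idea with node.roles:

```yaml
# elasticsearch.yml for a dedicated master-eligible node (6.x / early 7.x syntax)
node.master: true    # eligible to be elected as the cluster master
node.data: false     # stores no shard data
node.ingest: false   # runs no ingest pipelines
```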
Minimum Eligible Master Nodes
Elasticsearch 7.x was recently released, introducing major changes to node discovery. But if you are running version 6.x or older, there is a discovery setting that is often neglected and can cause data corruption when the active master node fails. The issue it guards against is called split-brain, and it occurs when more than one master-eligible node is elected as master at the same time.
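The setting in question is discovery.zen.minimum_master_nodes, which should be set to a quorum of your master-eligible nodes. A sketch assuming a cluster with three master-eligible nodes:

```yaml
# elasticsearch.yml on every master-eligible node (6.x and older;
# the setting was removed in 7.x, where voting quorums are managed automatically)
# quorum = (master_eligible_nodes / 2) + 1 = (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
```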
Shard Allocation Awareness
By default, Elasticsearch will never place a primary shard and its replicas on the same node, in order to avoid losing all copies when that node fails. Shard allocation awareness takes the same idea further. By tagging each node with attributes that describe your infrastructure (e.g. availability zone, region, server rack, datacenter), Elasticsearch will attempt to spread primary shards and their replicas across those attributes as much as possible.
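For example, to make the allocator aware of availability zones, you tag every node with a custom attribute and list that attribute in the awareness settings (the attribute name "zone" and its values below are just examples):

```yaml
# elasticsearch.yml on a node running in the us-east-1a availability zone
node.attr.zone: us-east-1a

# On every node (or dynamically via the cluster settings API): keep copies of
# the same shard on nodes with different values of the "zone" attribute.
cluster.routing.allocation.awareness.attributes: zone
```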
Forced Awareness
Elasticsearch always tries to maintain the required number of replicas. At the same time, the shard allocation awareness described above tries to spread primaries and their replicas as far apart as possible. Now consider the scenario where a big part of your infrastructure becomes unavailable. Can the remaining nodes handle the entire load and the extra disk space, or will the replication process bring down the rest of your cluster? To set boundaries on replication, you can use forced awareness to prevent replication from making things worse when large failures occur.
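Continuing the hypothetical "zone" example from above, forced awareness lists the full set of attribute values up front; if one zone disappears, the replicas that belong there are left unassigned instead of being rebuilt on the surviving zone:

```yaml
# elasticsearch.yml (or the cluster settings API)
cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: us-east-1a,us-east-1b
```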
Local Gateway
The local gateway module controls how cluster state and shards are recovered in the event of a full cluster restart. Its settings describe the expected final size of the cluster so that recovery does not begin too early in the restart process. For example, starting recovery and balancing shards across 4 nodes is a waste of resources when you know your cluster will eventually consist of 100 nodes once fully restarted.
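A sketch of the relevant elasticsearch.yml settings for the hypothetical 100-node cluster mentioned above (the thresholds are illustrative):

```yaml
# elasticsearch.yml on every node
gateway.expected_nodes: 100      # start recovery immediately once all 100 nodes have joined
gateway.recover_after_nodes: 90  # otherwise require at least 90 nodes...
gateway.recover_after_time: 5m   # ...and wait this long before starting recovery anyway
```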
Optimize Cluster Then Nodes
My recommendation is to optimize your cluster for fault tolerance before attempting to optimize your nodes. You can always add more nodes to a fault-tolerant cluster, but there is no quick fix for a cluster that becomes unavailable or loses data due to node failures. For more on fault tolerance, check out our article on dedicated node types and better shard allocation, and also read over the official documentation on recommended Elasticsearch settings and system configuration for production. Did you like this article? Subscribe to our blog by adding your email address to the form below. You can also email me at andreas@inventaconsulting.net or schedule a call to find out how Inventa Consulting can help you with Elasticsearch.