Elasticsearch is a technology that has been gaining popularity lately at Zalando Tech. We’ve learned that it’s a state-of-the-art tool that, in the hands of a data artist, can be used to design data models that conquer information retrieval challenges at very large scale in a performant, distributed manner.
The success of Elasticsearch as a technology is due to the various use cases it fits. Whether it is full-text search, structured search, or data analytics, Elasticsearch has no competitor in solving so many diverse data problems.
For the Zalando Fashion Store, Elasticsearch has become the foundation for serving data to customer-facing applications and other services. Because of the important role it plays in our architecture, we needed a well-founded and scalable setup for our Elasticsearch clusters.
Operating a distributed data store comes with a certain set of challenges. Stability, growth, and availability of our data must be guaranteed for our stakeholders, customers, and consumers.
This is why we built Elasticsearch Express.
Elasticsearch Express is an appliance with a toolkit enabling quick deployment and management of Elasticsearch clusters. It enforces the best practices of operating clusters over cloud-based infrastructures. It is designed specifically to run on AWS over STUPS (Zalando PaaS).
Elasticsearch Express features:
- Easy deployment of Elasticsearch across multiple availability zones
- Cluster deployment in less than 10 minutes
- Full data availability guarantee on each availability zone
- Monitoring dashboard template of Elasticsearch metrics
- Role separation of nodes
- Stable master election
- No manual configuration on AWS
- Automatic data backups in S3 bucket
- Automatic recovery from possible infrastructure failures
We built Elasticsearch Express to allow Zalando teams to focus on solving data problems, without worrying about operational configuration or infrastructure setup. The setup we offer is designed to accommodate the failures that can occur when running in a cloud environment and hosting a cluster whose data spans multiple availability zones. In this article we will cover the deployment plan of Elasticsearch Express, along with failure scenarios and automatic recovery based on the guarantees we promised.
Let’s dive into our deployment plan.
The Initial State
We start with an empty environment. Our target is to deploy a cluster distributed across a region consisting of two availability zones.
1. Deploy master nodes
The first step in forming a cluster is to deploy a master stack across both zones with an odd number of master nodes, to guarantee a quorum on one of the zones. Master election will take place and one node will be set as a master node. The other master-eligible nodes will be idling as long as the cluster is healthy, and no network disruption occurs.
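The quorum arithmetic behind this step can be sketched in a few lines of Python. The function names and the example split of three master nodes over two zones are assumptions made for illustration; they are not taken from Elasticsearch Express itself.

```python
def quorum(master_eligible: int) -> int:
    """Smallest strict majority of the master-eligible nodes."""
    return master_eligible // 2 + 1


def zones_with_quorum(masters_per_zone: dict[str, int]) -> list[str]:
    """Return the zones that hold a majority of master-eligible nodes on their own."""
    needed = quorum(sum(masters_per_zone.values()))
    return [zone for zone, count in masters_per_zone.items() if count >= needed]


# Hypothetical layout: three master-eligible nodes over two zones.
print(zones_with_quorum({"z1": 2, "z2": 1}))  # ['z1']
# An even split leaves no zone with a majority of its own.
print(zones_with_quorum({"z1": 2, "z2": 2}))  # []
```

With an odd number of master nodes over two zones, exactly one zone can hold the majority, which is why an even count would defeat the purpose.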
2. Deploy data stack to the first zone
Step two occurs after the master nodes are deployed and the cluster is formed. The first stack of data nodes is deployed to one availability zone with a configuration identifying the zone ID, so that Elasticsearch can understand the cluster topology and apply shard allocation awareness.
3. Deploy data stack to the second zone
Similarly, a following stack of data nodes will be deployed to the second zone, with a configuration identifying the second zone ID.
At this point, the cluster is formed successfully and ready for usage. We mentioned earlier that Elasticsearch obtained shard allocation awareness with this configuration. This means that whenever you create an index with any replication factor bigger than zero, Elasticsearch will distribute the shards across the two availability zones, guaranteeing that each zone contains the full data set.
To enable shard allocation awareness, each node within a stack is tagged with the following configuration.
Master nodes are configured with: “cluster.routing.allocation.awareness.attributes: zone”.
This informs Elasticsearch that shard allocation awareness for data nodes is defined by the custom attribute “zone”. Each data node is tagged with its zone ID, such as “node.zone: z1”. This way, the master node will know the location of each data node on the cluster across the availability zones, and distribute the data accordingly.
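As a rough illustration, the two settings quoted above could be rendered per node role like this. The helper names are hypothetical, and how Elasticsearch Express actually generates its configuration is not described here; only the setting keys and the example value "z1" come from the text.

```python
def master_settings(attribute: str = "zone") -> dict[str, str]:
    """Setting that names the attribute used for shard allocation awareness."""
    return {"cluster.routing.allocation.awareness.attributes": attribute}


def data_node_settings(zone_id: str, attribute: str = "zone") -> dict[str, str]:
    """Setting that tags a data node with the ID of the zone it runs in."""
    return {f"node.{attribute}": zone_id}


def render(settings: dict[str, str]) -> str:
    """Render settings as 'key: value' lines, one per setting."""
    return "\n".join(f"{key}: {value}" for key, value in settings.items())


print(render(master_settings()))
# cluster.routing.allocation.awareness.attributes: zone
print(render(data_node_settings("z1")))
# node.zone: z1
```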
When you’ve reached this point, it is guaranteed that your data set is complete at both sides of the cluster.
The whole deployment process takes less than 10 minutes. You are now ready to ingest or fetch data from your cluster.
Operating a distributed data store in a live environment requires constant awareness of the cluster performance based on data growth, ingestion, and consumption rates. Elasticsearch exposes monitoring metrics that will represent the health of the cluster and its nodes. These metrics are presented by Elasticsearch Express on a Grafana monitoring dashboard, showing the performance of individual nodes. It’s very handy to have an overview of the performance of each node in comparison to others, as it makes it easier to spot and foresee issues before they occur.
Some metrics are best read as instantaneous values, while others are more useful when viewed over longer periods to show how performance changes over time.
Data ingestion directly impacts the number of segments per node and merging rate. A healthy node will contain a number of segments within a reasonable constant range. A growing graph is a sign of insufficient merging or sudden growth of data magnitude.
Operations performed by nodes such as executing queries, merging segments, and shard allocation consume CPU and are definitely visible on node load and CPU graphs. Letting the nodes operate on low load is recommended to account for sudden extra load or traffic spikes.
Monitoring JVM and OS memory is also very important. Keep in mind that Elasticsearch uses the JVM heap to provide fast operations, while Lucene uses the underlying OS memory for caching in-memory data structures. Memory and garbage collection graphs will tell you about the health of the interaction between Elasticsearch and Lucene.
There are more metrics to account for, but these are the most important for operating a stable long lasting cluster.
Now let’s cover some failure scenarios that might occur to see how Elasticsearch Express will react.
Data link failure
Connection loss between availability zones might not happen too often, but inevitably it will. With the wrong configuration, you can end up with a split brain situation where two clusters are formed out of the original one. This situation is disastrous on many levels in terms of data consistency and availability. Data replication will no longer occur between the newly formed clusters and their data states will completely diverge. Your application's view of the data will depend on which cluster the application instance is connected to.
Luckily, Elasticsearch Express is designed to handle such vulnerabilities. A configuration parameter dictates that master election can only take place on the side of the cluster that holds a quorum of master-eligible nodes, so a split brain between the two zones can never happen. In this case, a new master will be elected and the quorum side will form a cluster. The other side will not form a cluster and will simply wait until the link between the zones is back again.
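The effect of that election threshold during a partition can be sketched as follows. The text does not name the parameter; in Elasticsearch releases before 7.0 the `discovery.zen.minimum_master_nodes` setting plays this role, so the sketch borrows that name as an assumption. The zone layout and function names are hypothetical.

```python
def side_can_elect(visible_masters: int, minimum_master_nodes: int) -> bool:
    """A side may elect a master only if it sees enough master-eligible nodes."""
    return visible_masters >= minimum_master_nodes


def partition_outcome(
    masters_per_zone: dict[str, int], minimum_master_nodes: int
) -> dict[str, bool]:
    """For each isolated zone, report whether it could elect its own master."""
    return {
        zone: side_can_elect(count, minimum_master_nodes)
        for zone, count in masters_per_zone.items()
    }


layout = {"z1": 2, "z2": 1}  # hypothetical: three master-eligible nodes

# Threshold too low: both sides elect a master, which is a split brain.
print(partition_outcome(layout, minimum_master_nodes=1))
# {'z1': True, 'z2': True}

# Threshold set to the majority (2 of 3): only the quorum side elects.
print(partition_outcome(layout, minimum_master_nodes=2))
# {'z1': True, 'z2': False}
```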
Availability zone failure
A complete availability zone outage is unlikely to happen. However, a long period of connection loss between zones can lead to the same situation.
Remember enabling shard allocation awareness? It’s now time to kick it into action. The data living on the cluster is safe since it is fully contained on the surviving zone. Indeed you may have lost a lot of shards, but Elasticsearch is aware of that and will soon start recreating them on the surviving zone.
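To see why the surviving zone still holds every shard, here is a toy placement model. It is not Elasticsearch's allocator: it only assumes that the copies of a shard (primary plus replicas) are spread over distinct zones, which is what allocation awareness aims for. All names and numbers are illustrative.

```python
def allocate(num_shards: int, replicas: int, zones: list[str]) -> dict[int, list[str]]:
    """Toy placement: spread each shard's copies round-robin over the zones."""
    placement = {}
    for shard in range(num_shards):
        copies = replicas + 1  # one primary plus its replicas
        placement[shard] = [zones[(shard + i) % len(zones)] for i in range(copies)]
    return placement


def shards_available_after(placement: dict[int, list[str]], failed_zone: str) -> set[int]:
    """Shards that still have at least one copy outside the failed zone."""
    return {
        shard
        for shard, shard_zones in placement.items()
        if any(zone != failed_zone for zone in shard_zones)
    }


zones = ["z1", "z2"]

# One replica per shard: every shard survives the loss of z2.
print(sorted(shards_available_after(allocate(4, 1, zones), "z2")))  # [0, 1, 2, 3]

# No replicas: the shards that lived only in z2 are gone.
print(sorted(shards_available_after(allocate(4, 0, zones), "z2")))  # [0, 2]
```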
Survive with one zone
Survival with one zone is guaranteed only if you are sure that directing your full application load at half of your cluster instances won’t overwhelm the cluster and start killing nodes with immense traffic. Make sure your cluster always has the capacity to handle more load than it normally serves. For stateful machines, you need to over-allocate resources, because scaling while maintaining data state consistency during a failure is a tough challenge.
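A back-of-the-envelope check for this headroom might look like the sketch below. It assumes that nodes and load are spread evenly across zones and that utilization scales linearly with load; both are simplifications, and the function names and the 100% limit are illustrative.

```python
def utilization_after_zone_loss(current_utilization: float, zones: int = 2) -> float:
    """Per-node utilization once one zone is gone and the rest absorb its load."""
    if zones < 2:
        raise ValueError("need at least two zones to lose one")
    return current_utilization * zones / (zones - 1)


def survives_zone_loss(current_utilization: float, zones: int = 2, limit: float = 1.0) -> bool:
    """True if the surviving nodes stay at or below the utilization limit."""
    return utilization_after_zone_loss(current_utilization, zones) <= limit


# Two zones at 40% utilization: the survivor ends up at 80%.
print(survives_zone_loss(0.4))  # True
# Two zones at 60% utilization: the survivor would need 120%.
print(survives_zone_loss(0.6))  # False
```

With two zones, the rule of thumb under these assumptions is to keep normal utilization below half of what a node can sustain.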
Restore the original state on one zone
Elasticsearch will soon replicate the surviving shards to reach the original state before the failure, but now only on one zone. At this point, the cluster is running on half the number of nodes. Because Elasticsearch Express places each availability zone in its own scaling group, auto scaling will soon add more nodes to the surviving zone based on the current traffic, and in no time the cluster will be fully fit and functional as if nothing had happened.
The whole cycle of failure, survival, and recovery occurs within minutes. You might be alerted somehow, but it will all be resolved before your intervention is actually needed. You might not even notice until the next day.
You’ll just need to make sure that your cluster has enough resources to handle higher traffic and perform survival tricks before you call it a night.
The zone outage is not forever and the missing zone will eventually see the light again, whether that happens automatically or through your own manual intervention. Nodes will start joining the cluster from the restored zone. Elasticsearch will recognize the new nodes and their place in the topology from the configured zone ID.
Once again, Elasticsearch will redistribute the data across the two zones to guarantee the full data set is contained by each.
Keep in mind that you might end up with a lot more nodes than before the incident, because the surviving zone was autoscaled. The Elasticsearch Express scaling strategy is to scale up automatically and to scale down only manually. You don't want nodes to be removed without your consent.
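The "automatically up, only manually down" rule can be expressed as a small decision function. This is a sketch of the policy as described, not the Elasticsearch Express implementation; the function and parameter names are hypothetical.

```python
def next_node_count(current: int, desired: int, operator_approved: bool = False) -> int:
    """Grow freely, but shrink only when the operator has explicitly approved it."""
    if desired > current:
        return desired  # scaling up is automatic
    if desired < current and operator_approved:
        return desired  # scaling down requires consent
    return current  # otherwise keep the cluster as it is


print(next_node_count(current=6, desired=10))  # 10
print(next_node_count(current=10, desired=6))  # 10
print(next_node_count(current=10, desired=6, operator_approved=True))  # 6
```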
We are still researching the possibilities to automatically scale down to a certain threshold defined by the cluster operator for saving costs, but until this is supported, you will have to manually decommission and shut down the unwanted nodes. We’ve done our best so far in letting you sleep while disaster was about to strike and end the life of your cluster.
Finally, Elasticsearch Express is back up and running just as smoothly as the first time you deployed it.
Elasticsearch Express is currently being used in production by many teams at Zalando Tech. It is helping development teams deliver their ambitious requirements of serving and retrieving data at a large scale for various use cases.
If you have any further questions about Elasticsearch Express and how we use it at Zalando, you can contact me via Twitter at @alaa_elhadba. I’d love to hear from you.