Recently i wrote about Elasticsearch since then, over the last week I’ve worked on an application that ships data to Elasticsearch and another one, that searches on it. As well I’ve come in touch with the whole ELK stack. This article is a compilation of things I’ve learned regarding Cluster setup and management- since I had to improve performance and stability issues there. So let’s look at it.
Of course, everyone knows, that there is never only one way to build a cluster, and let me say it in the beginning, I agree with that. The good news is, that Elasticsearch is relatively flexible and you can shape your cluster according to your needs. But this also implies, the better you understand your Use Case and of course the better you understand some internals of Elasticsearch, the better cluster you can potentially have (budget might be a constraint ;)).
Hardware for Elasticsearch
Let’s start with Hardware. Yes even in the Cloud age, we need to understand the needs here. Again the good news is, that Elasticsearch was designed for commodity hardware (more or less), so generally speaking, you can start small and grow (and learn) as you need.
First, I would state that Elasticsearch (in a common ELK like Use-Case) is far more affected by IO than by RAM or CPU. I agree this statement is very opinionated, but it’s what you can expect for most ELK-like load scenarios. Some (complex) queries might eat up all of your CPU and memory and in some cases, your IO will not be a problem at all, but normally keep an eye on IO first.
SSD over Spinning Disks
Nowadays the most problems with SSDs are solved and prices dropped a lot. Having in mind that IO bottleneck can really slow your queries, you should not spare money here. SSDs have a lot more IOPS/$ than spinning disks, so just use them! However, HDDs are still improving and if you need a lot of TB of space then HDDs are probably not avoidable, they have more GB per Dollar. But you can mix different machines in the cluster as well. And you can have different indexes (shards) be on particular machines as well.
RAID & Co
Using RAID 0 is an effective way to increase disk speed, for both spinning disks and SSD. There is no need to use mirroring or parity variants of RAID, since high availability is built into Elasticsearch via replicas (don’t forget to use it properly). Furthermore, there is a way to define several locations to path.data
path.data: - /mnt/data-disk1/ - /mnt/data-disk2/
If you mount two physical discs to these two different locations, potentially Elasticsearch can then double read and write throughput. But only potentially! Indeed we have such a setup with up to 4 local HDDs and there are two things are don’t like her.
- First the IO distribution on the disks happens per shard, so you only profit from it, when your queries are hitting several shards on different disks on every node. Same pattern for writing. Therefore it likely more affecting the overall throughput, but may have little or no effect for a single search query.
- Another huge drawback is that if one disk fails, you don’t know how the Elasticsearch node reacts( or how LVM or OS present this problem to Elasticsearch) I’ve been facing this situation last week, one of 4 disks passed away, and the behavior of the Elasticsearch node (v. 5.6.3) was the worst on of possible: It just got slow on everything it did: some timeouts in Kibana, some breaking bulk inserts, and at some point there where short status “red” blinking. This kind of failure is hard to monitor and find problems quickly. Also, Elasticsearch could not react as it would on clearly failed node. So I would not advise.
Finally, avoid network-attached storage (NAS). NAS is often slower but most important it has larger latencies with a wider deviation in average latency. Another point to mention NAS is a single point of failure and potential IO bottleneck.
RAM is important for caching! Elasticsearch can easily fill up a lot of RAM with caches and internal memory structures to serve quick results to you. RAM size of 32GB and 64GB are common for Elasticsearch. You can go with 16GB as well but if you have 128GB it might too much See why
It’s important that the CPU’s performing well with java. For example, newer Xeons, that have AVX-512 Extensions, perform better when running Java. Constantly heavy CPU load can cause nodes to fail out of cluster one by another.
General word to Cluster size. In production, you probably end up with at least 3 nodes and more. Not less.
There are several node roles your machines can have. Remember Data nodes are working horses of your Data, so in small clusters (3-5) you might not need separate ingest or master-only nodes.
As mention above IO is critical to Elasticsearch, so you may ask yourself about filesystem for Elasticsearch. Generally, Ext4 is a fast, reliable option that does not need tuning and Elasticsearch would work very well. Some people say that on more than 1TB of data per node, well-tuned XFS shows better performance. So Ext4 and XFS are fine. Saying that it’s also important to have the newest kernel because a lot of performance of Filesystem is constantly improving there. So for XFS it should be newer than 4.9.25. Several not minor bugs are fixed in these.
A common problem that happens - is configuring a heap that is too large! Elasticsearch uses Heap and off Heap memory. Lucene - the core search engine of every Elasticsearch-Shard is designed to leverage the underlying OS for caching in-memory data structures. Lucene segments are immutable and are stored in individual files that never change. This makes them very cache-friendly, and the underlying OS will happily keep hot segments resident in memory for faster access. These segments include both the inverted index (for full text search) and doc values (for aggregations).
Lucene’s performance relies on this interaction with the OS. But if you give all available memory to Elasticsearch’s heap, there won’t be any leftover for Lucene. This can seriously impact the performance.
SO feel free to give 50% of the available memory to the Elasticsearch heap, while leaving the other 50% free. It won’t go unused since Lucene will grab everything left over. To be precise, it’s the (Linux) OS efficient cache that Lucene relies on here.
Whatever you decide on a machine with a lot of RAM - don’t cross Compressed Oops border that is around 32GB. Read Heap: Sizing and Swapping - and don’t cross the border. We using 64 GB machines and tried 64GB heap and 32Gb with the rest for Lucene and Linux Buffers. The 64GB was way more problematic, with GB pauses up to a 1 second and slower queries.
You can check the JVM setting of your cluster with:
curl -X GET http://hostname:9200/_nodes/jvm -H 'content-type: application/json'
You might guess swapping is a performance killer for JVM’s heap. You should disable swapping on the machine (dangerous sometimes) or at least for the Elasticsearch process. You can do it with:
in your elasticsearch.yml. This allows the JVM to lock its memory and prevent it from being swapped by the OS. But Your system’s security configuration might prevent this by default. You will see meaningful errors in the logs on failed start. Then disable these limits by including the following lines:
elasticsearch soft memlock unlimited elasticsearch hard memlock unlimited
into your ‘/etc/security/limits.conf’ file (On CentOS). While speaking of limits.conf also add
elasticsearch hard nofile 65536 elasticsearch soft nofile 65536
that will allow Lucene to work with a lot of files simultaneously and Lucene creates a lot of files!
This query helps you to check whether
"mlockall": true is set.
curl -X GET http://hostname:9200/_nodes/process/ -H 'content-type: application/json'
Results are listed per node.
A node holds several thread pools in order to control how to thread memory consumption is managed within a node. Many of these pools also have queues associated with them, which allow pending requests to be held instead of discarded. In my use case, we have a custom application, that performs bulk indexing with some spikes of load. I had to increase default queue size for indexing and bulk operations to prevent cluster to reject document on spikes:
thread_pool.bulk.queue_size: 500 thread_pool.index.queue_size: 1000
You can check the current configuration of thread pools with:
curl -X GET http://hostname:9200/_nodes/thread_pool -H 'content-type: application/json'
Read more about different thread pools and associated queues. But be careful with these, increasing queue size can’t help with general throughput problems.
The query cache is responsible for caching the results of queries but only queries that are being used in a filter context. There is one cache per node that is unfortunately shared by all shards. The cache implements an LRU eviction policy. So if you have several Use Cases on one cluster that for example use different indexes, they will affect each other, you might want to increase the default value of 10% of the heap to say 20%
If your index is updated constantly, the caches are invalidated as well. So check your cache statistics as well.
curl -X GET http://hostname:9200/_stats/request_cache?human -H 'content-type: application/json'