Recently i wrote about Elasticsearch since then, over the last week i've worked on an application that ships data to Elasticsearch and another one, that searches on it. As well i've came in touch with the whole ELK stack. Thus article is a compilation of things i've learned regarding Cluster setup and management- since i had to improve performance and stability issues there. So let's look at it.
Of course everyone knows, that there is never only one way to build a cluster and let me say it in the beginning, i agree to that. The good news is, that Elasticsearch is relative flexible and you can shape your cluster according to your needs. But this also implies, the better you understand your Use Case and of course the better you understand some internals of Elasticsearch, the better cluster you can potentially have (budget might be a constraint ;)).
Hardware for Elasticsearch
Let's start with a Hardware. Yes even in the Cloud age, we need to understand the needs here. Again good news is, that Elasticsearch was designed for commodity hardware (more or less), so generally speaking you can start small and grow (and learn) as you need.
First i would state that Elasticsearch (in a common ELK like Use-Case) is fare more affected by IO then by RAM or CPU. I agree this statement is very opinionated, but it's what you can expect for most of ELK like load scenaries.
Some (complex) queries might eat up all of your CPU and memory and in some cases your IO will not be a problem at all, but normally keep an eye on IO first.
SSD over Spinning Disks
Nowadays the most problems with SSDs are solved and prices dropped a lot. Having in mind that IO bottleneck can really slow your queries, you should not spare money here. SSDs has a lot more IOPS/$ than spinnign disks, so just use them! However, HDDs are still improving and if you need a a lot of TB of space then HDDs are probalye not avoidable, the have more GB per Dollar. But you can mix different machine in the cluster as well. And you can have different indexes (shards) be on particular machines as well.
RAID & Co
Using RAID 0 is an effective way to increase disk speed, for both spinning disks and SSD. There is no need to use mirroring or parity variants of RAID, since high availability is built into Elasticsearch via replicas (don't forget to use it properly).
Furthermore there is a way to define several locations to path.data
path.data: - /mnt/data-disk1/ - /mnt/data-disk2/
If you mount two physical discs to these two different locations, potentially Elasticsearch can then double read and write throughput. But only potentially! Indeed we have such a setup with up to 4 local HDDs and there are two things are don't like her.
- First the IO distribution on the disks happens per shard, so you only profit from it, when your queries are hitting several shards on different disks one every node. Same patter for writing. Therefore it likely more affecting the overall throughput, but may have little or no effect for single search query.
- Another huge drawback is that if one disk fails, you don't know how Elasticsearch node reacts( or how LVM or OS present this problem to Elasticsearch) I've been facing this situation last week, one of 4 disk passed away, and the behaviour of the Elasticsearch node (v. 5.6.3) was the worst on of possible: It just got slow on everything it did: some timeouts in Kibana, some breaking bulk inserts, and at some point there where short status "red" blinking. This kind of failure is hard to monitor and find problem quick. Also Elasticsearch could not react as it would on clerly failed-node. So i would not advise.
Finally, avoid network-attached storage (NAS). NAS is often slower but most important it has larger latencies with a wider deviation in average latency. Another point to mention NAS is a single point of failure and potential IO bottleneck.
RAM is important for caching! Elasticsearch can easily fill up a lot of RAM with caches and internal memory structures to serve quick results to you. RAM size of 32GB and 64GB are common for Elasticsearch. You can go with 16GB as well but if you have 128GB it might to much See why
It's important that CPU's performing well with java. For example newer Xeons, that have AVX-512 Extensions, perform better when running Java. Constantly heavy CPU load can cause nodes failing out of cluster one by another.
General word to Cluster size. In production you probably end up with at least 3 nodes and more. Not less.
There are several node roles your machines can have. Remember Data nodes are working horses of your Data, so in small clusters (3-5) you might not need separate ingest or master only nodes.
As mention above IO is critical to Elasticsearch, so you may ask yourself about filesystem for Elasticsearch. Generally Ext4 is a fast, reliable option that does not need tuning and Elasticsearch would work very well. Some people say that on more than 1TB of data per node, well tuned XFS shows better performance. So Ext4 and XFS are fine.
Saying that it's also important to have newest kernel, because a lot of performance of Fielsystem is constantly improving there. So for xfs it should be newer than 4.9.25. Several not minor bugs are fixed in these.
A common problem that happens - is configuring a heap that is too large! Elasticsearch useses
Heap and off Heap memory.
Lucene - the core search engine of every Elasticsearch-Shard is designed to leverage the underlying OS for caching in-memory data structures. Lucene segments are immutable and are stored in individual files that never changes. This makes them very cache friendly, and the underlying OS will happily keep hot segments resident in memory for faster access. These segments include both the inverted index (for fulltext search) and doc values (for aggregations).
Lucene’s performance relies on this interaction with the OS. But if you give all available memory to Elasticsearch’s heap, there won’t be any left over for Lucene. This can seriously impact the performance.
SO feel free to give 50% of the available memory to Elasticsearch heap, while leaving the other 50% free. It won’t go unused since Lucene will grap everythingis left over. To be precise, it's the (linux) OS efficient cache that Lucene relies on here.
Whatever you decide on a machine with a lot of RAM - don't cross Compressed Oops border that is around 32GB. Read Heap:Sizing and Swapping - and don't cross the border. We using 64 GB machines and tried 64GB heap and 32Gb with the rest for Lucene and Linux Buffers. The 64GB was a way more problematic, with GB pauses up to a 1 second and slower queries.
You can check JVM setting of your cluster with:
curl -X GET http://hostname:9200/_nodes/jvm -H 'content-type: application/json'
You might guess swapping is performance killer for jvm's heap. You should disable swapping on machine (dangerous sometimes) or at least for Elasticsearch process. You can do it with:
in your elasticsearch.yml. This allows the JVM to lock it's memory and prevent it from being swapped by the OS. BUt Your system's security configuration might prevent this by default. You will see meaningful errors in the logs on failed start. Then disable these limits by including following lines:
elasticsearch soft memlock unlimited elasticsearch hard memlock unlimited
into your '/etc/security/limits.conf' file (On CentOS). While speaking of limits.conf also add
elasticsearch hard nofile 65536 elasticsearch soft nofile 65536
that will allow Lucene to work with a lot of files simultaneosly and Lucene creates a lot of files!
This query helps you to check whether
"mlockall": true is set.
curl -X GET http://hostname:9200/_nodes/process/ -H 'content-type: application/json'
Results are listed per node.
A node holds several thread pools in order to control how threads memory consumption are managed within a node. Many of these pools also have queues associated with them, which allow pending requests to be held instead of discarded. In my use case we have a custom application, that performs bulk indexing with some spikes of load. I had to increase default queue size for indexing and bulk operations to prevent cluster to reject document on spikes:
thread_pool.bulk.queue_size: 500 thread_pool.index.queue_size: 1000
You can check the current configuration of thread pools with:
curl -X GET http://hostname:9200/_nodes/thread_pool -H 'content-type: application/json'
Read more about different thread pools and associated queues. But be careful with these, increasing queue size can't help with general throughput problems.
The query cache is responsible for caching the results of queries but oly queries which are being used in a filter context. There is one cache per node that is unfortunately shared by all shards. The cache implements an LRU eviction policy. So if you have several Use Cases on one cluster that for example use different indexes, they will affect each other, you might wan't to increase the default value of 10% of heap to say 20%
If your index is updated constantly, the caches are invalidated as well. So check your cache statistics as well.
curl -X GET http://hostname:9200/_stats/request_cache?human -H 'content-type: application/json'