Elasticsearch: Working with Indices

Continuing article series on Elasticsearch this article explains things around indices.

Creating an index

Grab you'r favorite REST tool and let's and make sure you can access your cluster via rest if you like to execute these examples.
with HTTP PUT on /article-11-17 we will create index with name article-11-17 pretty convenient.

PUT /article-11-17  
{
    "settings" : {
        "index" : {
            "number_of_shards" : 3,
            "number_of_replicas" : 2,
        "refresh_interval": "5s",
        "priority": "10"
        }
    }
}

Here under the settings.index we define some important properties:

  • number_of_shards - the number of primary shards. Default is 5. Static setting.
  • number_of_replicas - teh number of replica shards. Default is 1
  • refresh_interval - default is 1s
  • priority - Unallocated shards are recovered in order of priority, whenever possible. Indices are sorted into priority order as follows:
    • the optional index.priority setting
    • the index creation date
    • the index name

These setting are frequently used and they can have significant impact on the performace of your index and cluster.

Static and dynamic settings

  • static settings can only be set at index creation time or on a closed index. There are not many static settings and the most importand one is number_of_shards.

  • dynamic settings can be changed on a live index using the update-index-settings API. E.g.:

PUT /article-11-17  
{
    "index" : {
        "number_of_replicas" : 3
    }
}

Nuber of shards & things to consider

The question regarding number of primary shards and replicas is probably the most important one. But there is not simple answer, because the answe depends on things like query patterns, number of nodes in the cluster, overal number of documents in the index.

  • Every shard is an Lucene index. An Lucene Index is a very effcient "search engine"
    • So many too small shard potentially produce to much overhead.
    • Knowing that, the question shoul be: when is the shard to big?
  • You can query several indices at once, consider this by planing the shape the indices.
    • For exmple if you'r Logstash creates daily index, that has 5 shards. Then with 7 days you end up with 35 primary shards. If you search back for 7 days potentialy 35 shards are participating in your query.
  • If you expect few documents in the index, let say less than 3 million - it's fine to have only one shard per index.
    • Depending on node count (and space) you could experiment with number of replicas, to increase serach speed and throughput.
    • the only problem might arise here, is that indexing performance would be limited to one node, because indexing is always happening to primary shards. If you have only one shard index all writes go to one node, holding that shard.
  • I would say that 5 millions of docs per shard is a good size to plan with. But you might also not notice any drop in performance with 30 millions per shard as well. There is never one size fit's all.
  • Replicas may help with throughput. Consider 2 shards and to replicas, on 4 nodes, ES will distribute them like on the picture below. In this case parallel queries on same index would probably utilize all 4 nodes, also reading from replicas.

An index with two primary shards and one replica can scale out across four nodes
(Pictuere from Elasticsearch: The Definitive Guide [2.x])

  • One of the advanced optimization to be mentioned here is possiblity to define on which nodes of the cluster shard of the index should be created. This makes sence in heterogeneus clusters that differ in Reouce Configuration. For example SSD machines v.s. large HDDs. or machines in paricular Rack or DC. Detail can be found in the Documentatio under Shard allocation Filtering.

Refresh interval

refresh_interval - is very important on heavy indexing. In many cases you don't need the result of the index to be visible imediately (e.g. logs index), but making refresh every second, might strog affect the overal performance of the cluster. So you can go with 5s or 30s in such a case.

Reindexing

Once an index created with unlucky number of primary shards you cannot change it on existing Index. However Elasticsearch provides very helpful Reindex API, that allows you "re-index" any documents of any existing indexes (even form remote clusters) to new index. Here an example of reindexing of an Subset of Documents from articles-11-17 to a acricles_experiment index:

{
  "source": {
    "index":  ["articles-11-17"],
    "query": {
         "bool" : {
            "must": [
                   { "range" : { "timestamp" : {  "gte": "10.11.2017", "lte": "17.11.2017", "format": "dd.MM.yyyy" } } }
              ]
         }
    }
  },
  "dest": {
    "index": "acricles_experiment"
  }
}

Outlook

Next i wan't to spend some words on Document type mappings and things like Index templates