Some thoughts on Elasticsearch

Let me say it up front: Elasticsearch is great for searching. I'm currently working on improving a search over millions of objects, so Elasticsearch seems like a good idea. I stumbled upon an existing cluster (3 x 64 GB RAM, 32 cores) used for logs (ELK stack), which looks like a good place to explore the existing data and to build new indices (document collections). First, however, I need to get to know Elasticsearch more closely.

What I've learned so far

In the first week

  • Search performance is really impressive. Even when searching unoptimized raw log indices with billions (yes, billions) of documents, you still get results in acceptable time.

  • Search queries (Search APIs) are expressive, but it takes time to understand them. A lot of time!

  • You should also care about mappings and types, and understand how fields are indexed. This takes time as well.

  • An inappropriate index structure (too many or too few shards) can also hurt you quickly.

  • If Logstash is used, it should be well understood too, and you should develop the Logstash config (filters) test-driven from the start!

  • Sometimes you want to create a new index from data that exists in another one. This can be done with the great Reindex API, and it runs in the background while ES stays responsive. For example, I was able to create a new index with 4,198,761 documents out of a source index with more than 120,000,000 documents by executing a REST call (see below) against the Reindex API. It took 30 minutes.

Examples

Some examples for those who have never seen it. These are typical REST calls.

Search query

POST /index/_search
{
    "_source": ["entry_id", "contract_id", "name", "description", "score", "country"],
    "from": 10, "size": 200,
    "sort": [{ "@timestamp": { "order": "asc" } }],
    "query": {
        "bool": {
            "must": [
                { "match_phrase": { "entry_type": "score processing" } },
                { "term": { "contract_id": "1000" } },
                { "range": { "@timestamp": { "gte": "17:08:2017", "lte": "17:08:2017", "format": "dd:MM:yyyy" } } },
                { "match": { "name": "fantastic" } }
            ]
        }
    }
}

We see a bool query with a single must clause that contains several queries: match, match_phrase, term, and range.
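One refinement worth considering: clauses that only need a yes/no answer (term, range) can go into the bool query's filter section instead of must, so they do not contribute to the relevance score and Elasticsearch can cache them. Below is a minimal sketch in Python that builds such a request body from the query above. The function name build_search_body is made up for illustration; the field names and values are taken from the example above.

```python
import json


def build_search_body(contract_id, day, name):
    """Build a bool query: scored clauses in must, yes/no clauses in filter."""
    return {
        "_source": ["entry_id", "contract_id", "name", "description", "score", "country"],
        "from": 10,
        "size": 200,
        "sort": [{"@timestamp": {"order": "asc"}}],
        "query": {
            "bool": {
                # Full-text clauses stay in "must" so they influence the score.
                "must": [
                    {"match_phrase": {"entry_type": "score processing"}},
                    {"match": {"name": name}},
                ],
                # Exact-value and range clauses only include or exclude documents.
                "filter": [
                    {"term": {"contract_id": contract_id}},
                    {"range": {"@timestamp": {"gte": day, "lte": day, "format": "dd:MM:yyyy"}}},
                ],
            }
        },
    }


body = build_search_body("1000", "17:08:2017", "fantastic")
# Print the JSON that would be sent as the body of POST /index/_search.
print(json.dumps(body, indent=2))
```

Since the example sorts by @timestamp rather than by score, moving the non-scoring clauses should not change which documents come back, only how they are evaluated.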

New type mapping

PUT /index_name/_mapping/type_name
{
    "type_name": {
        "properties": {
            "entry_id": { "type": "long" },
            "key": { "type": "text" },
            "name": { "type": "text" },
            "description": { "type": "text" },
            "country": { "type": "text" },
            "@timestamp": { "type": "date", "format": "date_optional_time||yyyy-MM-dd HH:mm:ss" },
            "state": { "type": "byte" },
            "contract_id": { "type": "long" }
        }
    }
}

This creates a new type type_name inside the index index_name.
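One thing to keep in mind about field types: a text field is analyzed (split into terms) at index time, which suits full-text search but not exact filtering, sorting, or aggregations. For those, Elasticsearch 5.x and later offer the keyword type. The sketch below assumes that country should be usable both ways and uses a multi-field for that. The sub-field name raw and the value "Germany" are arbitrary choices for this illustration, not something from the mapping above.

```python
import json

# Mapping fragment: "country" is indexed twice, once analyzed and once as-is.
country_mapping = {
    "properties": {
        "country": {
            "type": "text",  # analyzed, for full-text queries such as match
            "fields": {
                # "raw" is an arbitrary sub-field name chosen for this sketch;
                # keyword values are not analyzed, which suits term queries
                # and aggregations.
                "raw": {"type": "keyword"}
            },
        }
    }
}

# An exact-match query then targets the sub-field as "country.raw".
# "Germany" is a made-up example value.
exact_country_query = {"query": {"term": {"country.raw": "Germany"}}}

print(json.dumps(country_mapping, indent=2))
print(json.dumps(exact_country_query, indent=2))
```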

Re-indexing

POST /_reindex
{
    "source": {
        "index": "logstash-2017.08.17",
        "_source": ["entry_id", "contract_id", "name", "description", "score", "country"],
        "sort": { "@timestamp": "desc" },
        "query": {
            "bool": {
                "must": [{ "match_phrase": { "entry_type": "score processing" } }]
            }
        }
    },
    "dest": {
        "index": "new_index", "type": "new_type"
    }
}

This request creates new_index and fills it with the documents that match the query section.
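For long-running jobs such as the 30-minute one mentioned earlier, the Reindex API accepts the URL parameter wait_for_completion=false. The call then returns a task id right away, and progress can be polled through the Tasks API. A small sketch of the two requests involved follows. The helper names reindex_request and task_status_request are hypothetical, and the task id in the usage lines is a placeholder, not a real one.

```python
import json


def reindex_request(source_index, dest_index, query):
    """Return (method, path, body) for a reindex call that does not block."""
    path = "/_reindex?wait_for_completion=false"
    body = {
        "source": {"index": source_index, "query": query},
        "dest": {"index": dest_index},
    }
    return "POST", path, body


def task_status_request(task_id):
    """Return (method, path) to poll the task started by the reindex call."""
    return "GET", "/_tasks/" + task_id


query = {"bool": {"must": [{"match_phrase": {"entry_type": "score processing"}}]}}
method, path, body = reindex_request("logstash-2017.08.17", "new_index", query)
print(method, path)
print(json.dumps(body, indent=2))

# "node_id:12345" stands in for the task id returned by the reindex call.
print(*task_status_request("node_id:12345"))
```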

Challenges

Struggling with the search.
I still don't know how to retrieve all elements but only one child per (parent) id, a kind of GROUP BY.
I also don't know whether it is even possible to retrieve all elements, again grouped by the parent id field, while specifying the group-by function myself.

Going further, what I need are, for example, new synthetic fields computed while grouping:

  • inDate -> max(child.timestamp)
  • outDate -> min(child.timestamp)

I have no clue how to achieve that yet.
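One approach that may fit here is a terms aggregation on the parent id with metric sub-aggregations: max and min produce the two synthetic dates, and a top_hits sub-aggregation with size 1 returns a single child per group. The sketch below is untested against the cluster described here. The field name parent_id is an assumption (the real name is not given above), and group_by_body, by_parent, and one_child are names invented for this illustration.

```python
import json


def group_by_body(group_field, time_field, max_groups=100):
    """Build a search body that groups documents by group_field."""
    return {
        "size": 0,  # return no plain hits, only aggregation results
        "aggs": {
            "by_parent": {
                # One bucket per distinct value of group_field.
                "terms": {"field": group_field, "size": max_groups},
                "aggs": {
                    "inDate": {"max": {"field": time_field}},
                    "outDate": {"min": {"field": time_field}},
                    # One representative child per bucket (the newest one).
                    "one_child": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{time_field: {"order": "desc"}}],
                        }
                    },
                },
            }
        },
    }


# "parent_id" is an assumed field name; "@timestamp" is taken from the examples.
body = group_by_body("parent_id", "@timestamp")
print(json.dumps(body, indent=2))
```

Note that a terms aggregation only returns the top buckets up to its size parameter, so this gives "all groups" only when the number of distinct parent ids stays below that limit.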

There are also not that many examples of advanced queries. Questions about search queries on Stack Overflow or on Elastic's Discuss platform are often not answered well, or not answered at all, which surprises me a bit.

The same applies to Reindex. I would probably like to use the same "GROUP BY" expression to build a new index and to insert new fields. It looks like this is possible with "Pipelines", but I have not tried that so far, and it is not easy to understand without examples.
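As a starting point, an ingest pipeline is defined with PUT /_ingest/pipeline/<id> and can then be referenced from the dest section of a reindex request. The sketch below only adds a constant field to each document. The pipeline id add-origin and the field name origin are hypothetical. Keep in mind that ingest processors see one document at a time, so a pipeline alone cannot compute a GROUP BY across documents.

```python
import json

PIPELINE_ID = "add-origin"  # hypothetical pipeline id

# Body for: PUT /_ingest/pipeline/add-origin
pipeline_body = {
    "description": "Add a constant field to every reindexed document",
    "processors": [
        # "origin" is a made-up field name for this sketch.
        {"set": {"field": "origin", "value": "logstash-2017.08.17"}}
    ],
}

# Body for: POST /_reindex, routing each document through the pipeline.
reindex_body = {
    "source": {"index": "logstash-2017.08.17"},
    "dest": {"index": "new_index", "pipeline": PIPELINE_ID},
}

print("PUT /_ingest/pipeline/" + PIPELINE_ID)
print(json.dumps(pipeline_body, indent=2))
print("POST /_reindex")
print(json.dumps(reindex_body, indent=2))
```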

If you have some tips for beginners or any other feedback, please comment.