Internal data structures of Elasticsearch

If you work intensively with Elasticsearch, you cannot avoid understanding its internal data structures. Here I'll try to explain them as clearly as possible:

Inverted Index

Key Characteristics of Inverted Index

  • Allows very fast full-text searches
  • Not a good structure for sorting
  • Created at index-time
  • Serialized to disk

An inverted index is the basic search structure. It consists of a list of all the unique words that appear in any document and, for each word, a list of the documents in which it appears. Consider the following structure.

Term    | Doc_1 | Doc_2 | ... | Doc_X
--------------------------------------
hello   |   X   |   X   |     |
world   |   X   |   X   |     |
java    |       |   X   |     |
perl    |   X   |       |     |
golang  |       |       | ... |   X
...
--------------------------------------

Here, every term has a list of the documents that contain it. Now, if we want to search for "world perl", we just need to find the documents in which each term appears:

Term    | Doc_1 | Doc_2
-------------------------
world   |   X   |   X
perl    |   X   |
-------------------------
Total   |   2   |   1

Both documents match, but the first document has more matches than the second. Keep in mind that at index time the values are subject to tokenization and normalization, a process called analysis.
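The lookup above can be sketched in a few lines of Python. This is only a toy model, not how Lucene actually stores or scores anything: the `analyze` function is an assumed stand-in for real analysis (it just lowercases and splits on whitespace), and the two documents are the ones from the tables above.

```python
from collections import Counter, defaultdict

# Toy documents matching the tables above.
docs = {
    "Doc_1": "hello world perl",
    "Doc_2": "hello world java",
}

def analyze(text):
    """Rough stand-in for analysis: lowercase and split on whitespace."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each unique term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Count, per document, how many of the query terms it contains."""
    hits = Counter()
    for term in analyze(query):
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return hits

index = build_inverted_index(docs)
hits = search(index, "world perl")
print(hits.most_common())  # [('Doc_1', 2), ('Doc_2', 1)]
```

Finding the documents for a term is a single dictionary lookup, which is why this structure is fast for search. Going the other way (which terms does Doc_1 contain?) would require scanning every term, which is why it is a poor fit for sorting.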

Doc Values

Key Characteristics of Doc Values

  • Good for sorting operations
  • Stores all the values for a single field together in a single column of data
  • Doc values are enabled by default for all field types except text
  • Created at index-time
  • Serialized to disk

While indexing, Elasticsearch adds the tokens to the inverted index for search. It also extracts the terms and adds them to a columnar store called doc values.

Doc      Terms  
-----------------------------------------------------------------
Doc_1 | hello, world, perl  
Doc_2 | hello, world, java  
Doc_3 | We, need, more, golang, tutorials  
-----------------------------------------------------------------

Doc values are used for several purposes in Elasticsearch:

  • Sorting
  • Aggregations on a field
  • Certain filters (for example, geolocation filters)
  • Scripts that refer to fields
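
To see why a column layout suits these operations, here is a minimal Python sketch. It is an illustration under assumptions, not the on-disk format: the fields `language` and `views` and their values are invented for the example, and each column is a plain list indexed by document position.

```python
from collections import Counter

# One column per field; the list position is the document number.
# Field names and values are made up for illustration.
doc_ids = ["Doc_1", "Doc_2", "Doc_3"]
columns = {
    "language": ["perl", "java", "golang"],
    "views": [120, 45, 300],
}

def sort_by(field, reverse=False):
    """Sorting only has to read one column, not whole documents."""
    column = columns[field]
    order = sorted(range(len(column)), key=lambda i: column[i], reverse=reverse)
    return [doc_ids[i] for i in order]

def terms_aggregation(field):
    """Count how often each value occurs in the column."""
    return Counter(columns[field])

by_views = sort_by("views", reverse=True)
language_counts = terms_aggregation("language")
print(by_views)  # ['Doc_3', 'Doc_1', 'Doc_2']
print(language_counts)
```

Both functions answer "what is the value of this field for this document?" by reading a single column, which is the access pattern that sorting, aggregations, and scripts share.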

When the "working set" is smaller than the available memory on a node, the OS will naturally keep all the doc values hot in memory, leading to very fast access. When the "working set" is much larger than the available memory, the OS will naturally start to page doc values to and from disk, which makes access slower.

Fielddata

Key Characteristics of Fielddata

  • Supports the same operations as doc values (sorting, aggregations, scripts)
  • Available for text fields only
  • Created at query-time
  • In-memory data structure
  • Not serialized to disk
  • Disabled by default (expensive to build and to keep in the heap)

Most fields can use index-time, on-disk doc values for this data access pattern, but text fields do not support doc values.

Instead, text fields use a query-time in-memory data structure called fielddata. This data structure is built on demand the first time that a field is used for aggregations, sorting, or in a script. It is built by reading the entire inverted index for each segment from disk, inverting the term ↔︎ document relationship, and storing the result in memory, in the JVM heap.
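
The "inverting" step can be pictured with a small Python sketch. It is a simplification under assumptions: the segment's inverted index is a plain dictionary, the field name `body` is hypothetical, and the module-level `_cache` dictionary stands in for memory held in the JVM heap.

```python
from collections import defaultdict

# A tiny inverted index for one segment: term -> documents containing it.
inverted_index = {
    "hello": {"Doc_1", "Doc_2"},
    "world": {"Doc_1", "Doc_2"},
    "java": {"Doc_2"},
    "perl": {"Doc_1"},
}

def build_fielddata(inverted_index):
    """Walk every term and flip the relationship to document -> terms."""
    fielddata = defaultdict(list)
    for term in sorted(inverted_index):
        for doc_id in inverted_index[term]:
            fielddata[doc_id].append(term)
    return dict(fielddata)

_cache = {}  # stands in for the heap: lives only as long as the process

def get_fielddata(field):
    """Build on first use, then serve the same object from memory."""
    if field not in _cache:
        _cache[field] = build_fielddata(inverted_index)
    return _cache[field]

fielddata = get_fielddata("body")  # "body" is a hypothetical field name
print(fielddata["Doc_1"])  # ['hello', 'perl', 'world']
```

Note that the build has to touch every term in the index before the first request can be answered, and the result then stays in memory. That is the cost the bullet list above refers to.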

Before you enable fielddata, consider why you are using a text field for aggregations, sorting, or in a script. It usually doesn't make sense to do so, since building fielddata and keeping it in the heap is expensive in both memory and computation.
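
As a rough illustration of the two choices, the mapping bodies below are written as Python dictionaries. The field name `title` and the sub-field name `raw` are assumptions made for the example; `fielddata` is the mapping parameter that switches fielddata on for a text field, and a `keyword` sub-field is a common alternative because keyword fields have doc values.

```python
import json

# Option 1: enable fielddata on the text field itself (costs heap).
fielddata_mapping = {
    "properties": {
        "title": {"type": "text", "fielddata": True},
    }
}

# Option 2: keep the text field for search and add a keyword sub-field,
# which has doc values, for sorting and aggregations.
multi_field_mapping = {
    "properties": {
        "title": {
            "type": "text",
            "fields": {"raw": {"type": "keyword"}},
        }
    }
}

# Serialize to the JSON that would be sent as the mapping body.
print(json.dumps(multi_field_mapping, indent=2))
```

With the second mapping, full-text queries would go against `title`, while sorting and aggregations would go against `title.raw`.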

P.S. Did I forget something? Your comments are welcome!