Internal data structures of Elasticsearch

If you start working intensively with Elasticsearch, you cannot avoid understanding its internal data structures. Here I'll try to make this very simple for you.

Inverted Index

Key Characteristics of Inverted Index

Allow very fast full-text searches
Not a good structure for sorting
Created at index-time
Serialized to disk

An inverted index is the basic data structure behind full-text search. It consists of a list of all the unique words that appear in any document and, for each word, a list of the documents in which it appears. Consider the following structure.

Term    | Doc_1 | Doc_2 | ... | Doc_X
--------------------------------------
hello   |   X   |   X   |     |
world   |   X   |   X   |     |
java    |       |   X   |     |
perl    |   X   |       |     |
golang  |       |       | ... |   X
...
--------------------------------------

Here, every term maps to the list of documents that contain it. Now, if we want to search for “world perl”, we just need to find the documents in which each term appears:

Term    | Doc_1 | Doc_2
------------------------
world   |   X   |   X
perl    |   X   |
------------------------
Total   |   2   |   1

Both documents match, but the first document matches more terms than the second. Keep in mind that at index time the values are tokenized and normalized, a process called analysis.
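The lookup described above can be sketched in a few lines of Python. This is only a toy model, not how Lucene actually stores its index: the analyze function (lowercasing and splitting on word characters) stands in for real analysis, and the function names are made up for this example.

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analysis: lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_inverted_index(docs):
    """Map each unique term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Count, per document, how many of the query terms it contains."""
    hits = defaultdict(int)
    for term in analyze(query):
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    # Documents with more matching terms come first.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))

docs = {
    "Doc_1": "Hello world Perl",
    "Doc_2": "Hello world Java",
}
index = build_inverted_index(docs)
print(search(index, "world perl"))  # [('Doc_1', 2), ('Doc_2', 1)]
```

Note that the query goes through the same analyze step as the documents, which is why “Perl” in the document and “perl” in the query end up as the same term.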

Doc Values

Key Characteristics of Doc Values

Good for sorting operations
Stores all the values for a single field together in a single column of data
Enabled by default for all field types except text
Created at index-time
Serialized to disk

While indexing, Elasticsearch adds the tokens to the inverted index for search. It also extracts the terms and adds them to a columnar store called doc values.

Doc      Terms
-----------------------------------------------------------------
Doc_1 | hello, world, perl 
Doc_2 | hello, world, java
Doc_3 | We, need, more, golang, tutorials
-----------------------------------------------------------------

Doc values are used for several use cases in Elasticsearch:

Sorting
Aggregations on a field
Certain filters (for example, geolocation filters)
Scripts that refer to fields
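To see why a column per field suits these use cases, here is a minimal Python sketch. The documents and the field names price and lang are invented for illustration, and a plain list stands in for the real on-disk format, which is far more compact.

```python
from collections import Counter

# Hypothetical documents with a numeric field and a keyword-like field.
docs = [
    {"price": 30, "lang": "perl"},
    {"price": 10, "lang": "java"},
    {"price": 20, "lang": "perl"},
]

def build_doc_values(docs, field):
    """Store one field's values in a single column, indexed by document number."""
    return [doc.get(field) for doc in docs]

price_column = build_doc_values(docs, "price")
lang_column = build_doc_values(docs, "lang")

# Sorting: order document numbers by their value in the column.
sorted_doc_ids = sorted(range(len(docs)), key=lambda doc_id: price_column[doc_id])

# Aggregation: count how often each value occurs in the column.
lang_counts = Counter(lang_column)

print(sorted_doc_ids)     # [1, 2, 0]
print(dict(lang_counts))  # {'perl': 2, 'java': 1}
```

Both operations only read the one column they need and go from document to value, which is exactly the direction the inverted index is bad at.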

When the “working set” is smaller than the available memory on a node, the OS will naturally keep all the doc values hot in memory, leading to very fast access. When the “working set” is much larger than the available memory, the OS will naturally start to page doc values in and out of memory.

Fielddata

Key Characteristics of Fielddata

Good for the same operations as doc values
But for text fields only
Created at query-time
In-memory data structure
Not serialized to disk
Disabled by default (expensive to build and to keep on the heap)

Most fields can use index-time, on-disk doc values for sorting, aggregations, and scripts, but text fields do not support doc values.

Instead, text fields use a query-time, in-memory data structure called fielddata. This data structure is built on demand the first time a field is used for aggregations, sorting, or in a script. It is built by reading the entire inverted index for each segment from disk, inverting the term ↔ document relationship, and storing the result in memory, in the JVM heap.
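The "inverting" step can be pictured with a small Python sketch. It is a simplified model under a few assumptions: a plain dict plays the role of the on-disk inverted index of one segment, a module-level cache plays the role of the heap, and the field name body is hypothetical.

```python
from collections import defaultdict

# A toy inverted index: term -> set of document IDs.
inverted_index = {
    "hello": {"Doc_1", "Doc_2"},
    "world": {"Doc_1", "Doc_2"},
    "java": {"Doc_2"},
    "perl": {"Doc_1"},
}

def build_fielddata(inverted_index):
    """Invert term -> documents into document -> terms, held in memory."""
    fielddata = defaultdict(list)
    for term in sorted(inverted_index):  # every term has to be visited
        for doc_id in inverted_index[term]:
            fielddata[doc_id].append(term)
    return dict(fielddata)

_cache = {}

def get_fielddata(field):
    """Build the structure on first use only, then serve it from the cache."""
    if field not in _cache:
        _cache[field] = build_fielddata(inverted_index)
    return _cache[field]

print(get_fielddata("body")["Doc_1"])  # ['hello', 'perl', 'world']
```

The sketch makes the cost visible: the first request has to walk the whole index, and the result then stays in memory for as long as it is cached.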

Warning: Before you enable fielddata, consider why you are using a text field for aggregations, sorting, or in a script. It usually doesn’t make sense to do so, since fielddata is quite expensive in both memory and computation.
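A common way to avoid fielddata altogether is to index the same value twice: as text for full-text search and as a keyword sub-field that is backed by doc values. The mapping body below is a sketch of that idea, written as a Python dict; the field name title is hypothetical, and how you send the mapping to your cluster is left out.

```python
import json

# Hypothetical field "title": searchable as text, sortable via "title.keyword".
mapping = {
    "properties": {
        "title": {
            "type": "text",  # analyzed, used for full-text search
            "fields": {
                # not analyzed, stored in doc values, usable for sorting and aggregations
                "keyword": {"type": "keyword"}
            },
        }
    }
}

print(json.dumps(mapping, indent=2))
```

With a mapping like this you would search on title and sort or aggregate on title.keyword, so the text field itself never needs fielddata.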

P.S. Did I forget something? Your comments are welcome!