Friday, June 19, 2020

Elasticsearch - how many indices and shards? provides many best practices regarding Elasticsearch configurations. A lot of the decisions around how to best distribute your data across indices and shards will however depend on the use-case specifics, and it can sometimes be hard to determine how to best apply the advice available.

Use multiple indexes.
The ES stack usually creates daily indexes by default, which is a good practice. You can then use aliases to limit searches to specific date ranges, use Curator to remove old indexes as they age, and change index settings for new indexes as your data grows, without having to reindex the old data.
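
As a rough illustration of the daily-index pattern, the sketch below builds a daily index name and picks out the indexes that have aged past a retention window, which is the kind of selection Curator performs. The `prefix-YYYY.MM.DD` naming scheme, the function names, and the retention value are assumptions for illustration, not something the stack mandates.

```python
from datetime import date, datetime, timedelta


def daily_index_name(prefix: str, day: date) -> str:
    """Build a daily index name such as 'logs-2020.06.19' (assumed naming scheme)."""
    return f"{prefix}-{day:%Y.%m.%d}"


def expired_indices(names, prefix, today, retention_days):
    """Return the daily indexes for `prefix` that are older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in names:
        if not name.startswith(prefix + "-"):
            continue  # belongs to a different index family
        suffix = name[len(prefix) + 1:]
        try:
            day = datetime.strptime(suffix, "%Y.%m.%d").date()
        except ValueError:
            continue  # not a daily index; leave it alone
        if day < cutoff:
            expired.append(name)
    return expired


if __name__ == "__main__":
    today = date(2020, 6, 19)
    names = ["logs-2020.06.01", "logs-2020.06.18", "metrics-2020.01.01"]
    print(daily_index_name("logs", today))
    print(expired_indices(names, "logs", today, retention_days=7))
```

The returned names are only candidates for deletion; actually removing them is left to Curator or a delete-index request.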

Data with a longer retention period often uses weekly or monthly indices to keep the shard size up, especially if the daily volume does not warrant daily indices.
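
One way to reason about that choice is to estimate how large each primary shard would get per period. The sketch below is only a heuristic under stated assumptions: the 20GB floor is borrowed from the typical shard sizes mentioned in this post, a month is approximated as 30 days, and the function name is hypothetical.

```python
def pick_index_period(daily_gb: float, primary_shards: int = 1,
                      min_shard_gb: float = 20.0) -> str:
    """Pick the shortest period whose shards reach the assumed minimum size."""
    for label, days in (("daily", 1), ("weekly", 7), ("monthly", 30)):
        shard_gb = daily_gb * days / primary_shards
        if shard_gb >= min_shard_gb:
            return label
    # Even monthly shards stay small; monthly is the coarsest option considered here.
    return "monthly"


if __name__ == "__main__":
    for volume in (25.0, 5.0, 0.5):
        print(volume, "GB/day ->", pick_index_period(volume))
```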

With the rollover API, it is now possible to switch to a new index once the current one reaches a specific size, which makes it easier to achieve an even shard size across all indices.

Avoid big index and big shard.
A shard larger than 40% of the size of a data node is probably too big, and as a general rule shards should be no larger than 50GB. If an index has oversized shards, reindex it into an index with more primary shards.
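
The arithmetic behind "more shards" is a simple ceiling division. This is a minimal sketch that assumes data is spread evenly across primary shards; the function name is made up for illustration, and the 50GB ceiling is the guideline quoted above.

```python
import math


def primary_shards_needed(index_size_gb: float, max_shard_gb: float = 50.0) -> int:
    """Smallest primary shard count that keeps each shard at or below max_shard_gb."""
    if index_size_gb <= 0:
        return 1  # an index always has at least one primary shard
    return max(1, math.ceil(index_size_gb / max_shard_gb))


if __name__ == "__main__":
    for size in (30, 120, 400):
        print(f"{size} GB index -> {primary_shards_needed(size)} primary shard(s)")
```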

Avoid too many indexes and shards.
Having a large number of indices and shards in a cluster can result in a large cluster state, especially if mappings are large. The cluster state can then become slow to update, because all updates go through a single thread to guarantee consistency before the changes are distributed across the cluster.

In order to reduce the number of indices and avoid large and sprawling mappings, consider storing data with similar structure in the same index.

Indices and shards are not free from a cluster perspective: each index and shard carries some resource overhead. The more heap space a node has, the more data and shards it can handle.

Small shards result in small segments, which increases overhead. For use-cases with time-based data, it is common to see shards between 20GB and 40GB in size.

The number of shards a node can hold is proportional to the heap it has available. As a rule of thumb, keep the number of shards per node below 20 per GB of heap: a data node with a 10GB heap should have no more than 200 shards.
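
The rule of thumb is easy to turn into a quick check. The sketch below assumes you already know each node's heap size and shard count (for example, from the cat APIs); the function names are hypothetical.

```python
def max_shards_for_heap(heap_gb: float, shards_per_gb: int = 20) -> int:
    """Upper bound on shards for a node, using the 20-shards-per-GB-heap guideline."""
    return int(heap_gb * shards_per_gb)


def is_over_budget(shard_count: int, heap_gb: float) -> bool:
    """True if the node holds more shards than the guideline suggests."""
    return shard_count > max_shards_for_heap(heap_gb)


if __name__ == "__main__":
    print(max_shards_for_heap(10))   # 10GB heap
    print(is_over_budget(250, 10))   # 250 shards on a 10GB-heap node
```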

Manage the index lifecycle.
  • Use the rollover API to avoid overly large or overly small shards when volumes are unpredictable. It rolls an alias over to a new index when the existing index meets one of the rollover conditions, such as size, age, or document count.
  • Use the shrink API to shrink an existing index into a new index with fewer primary shards.
  • Force merge an index to reduce its number of segments and purge deleted documents. When run as an index lifecycle action, this makes the index read-only.
  • Freeze an index to minimize its memory footprint.
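
To make the rollover conditions concrete, the sketch below only builds the request (method, path, and JSON body) for the rollover API; it does not send anything. The alias name and threshold values are placeholders, and the helper function is hypothetical.

```python
import json


def rollover_request(alias, max_size=None, max_age=None, max_docs=None):
    """Build (method, path, body) for a rollover call on the given alias."""
    conditions = {}
    if max_size is not None:
        conditions["max_size"] = max_size   # e.g. "50gb"
    if max_age is not None:
        conditions["max_age"] = max_age     # e.g. "7d"
    if max_docs is not None:
        conditions["max_docs"] = max_docs   # e.g. 100000000
    return "POST", f"/{alias}/_rollover", {"conditions": conditions}


if __name__ == "__main__":
    method, path, body = rollover_request("logs-write", max_size="50gb", max_age="7d")
    print(method, path)
    print(json.dumps(body, indent=2))
```

The index rolls over as soon as any one of the listed conditions is met, so size and age limits can be combined safely.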
