Strange that I haven't received any suggestions on my query. Anyway, following are some steps I performed to reduce index size. Hope it helps someone. Please feel free to add more in case I missed something.
1) Delete unnecessary fields (or do not index unwanted fields; I am handling this at the Logstash (LS) level)
2) Delete the @message field (if the message field is not in use you can delete it) — see the sketch below for dropping such fields before indexing.
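If you index documents from your own code rather than through Logstash, the same idea can be applied by stripping the unwanted fields before they ever reach Elasticsearch. This is only a minimal sketch assuming the Python `requests` library, a local node at localhost:9200, and example index/type/field names (including @message); adjust them to your own data.

```python
import json
import requests

ES_URL = "http://localhost:9200"            # assumed local ES endpoint
UNWANTED = {"@message", "headers", "tags"}  # example field names to drop

def slim_event(event):
    """Drop fields we never search or aggregate on before indexing."""
    return {k: v for k, v in event.items() if k not in UNWANTED}

event = {
    "@timestamp": "2016-06-01T12:00:00Z",
    "@message": "raw log line duplicating the parsed fields below",
    "name": "vikas",
    "status": 200,
}

# Index the slimmed-down document (index and type names are examples).
resp = requests.post(
    ES_URL + "/logstash-2016.06.01/logs",
    data=json.dumps(slim_event(event)),
    headers={"Content-Type": "application/json"},
)
print(resp.json())
```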
3) Disable the _all field (be careful with this setting)
It is a special catch-all field which concatenates the values of all of the other fields into one big string, using space as a delimiter. It requires extra CPU cycles and uses more disk space. If not needed, it can be completely disabled. 
Benefit of having the _all field enabled :- allows you to search for values in documents without knowing which field contains the value, but at the cost of extra CPU.
Downside of disabling this field :- the Kibana search bar will no longer act as a full-text search bar, so users have to fire queries like name:"vikas" or name:vika* (provided name is an analyzed field). Also, the _all field loses the distinction between field types (string, integer, or IP) because it stores all the values as strings.
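A minimal sketch of disabling _all when creating an index, using the ES 1.x/2.x mapping syntax; the index name, type name, and endpoint are assumptions for illustration.

```python
import json
import requests

ES_URL = "http://localhost:9200"  # assumed local ES endpoint

# Create the index with _all disabled for the "logs" type.
mapping = {
    "mappings": {
        "logs": {
            "_all": {"enabled": False}
        }
    }
}

resp = requests.put(ES_URL + "/logstash-2016.06.01", data=json.dumps(mapping))
print(resp.json())

# With _all disabled, a bare query like "vikas" in Kibana will not match;
# you must target a field explicitly, e.g. name:vikas or name:vika*.
```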
4) Analyzed and not_analyzed fields :- Be very careful when deciding whether a field is analyzed or not_analyzed, because a partial search (name:vik*) needs an analyzed field, but analyzed fields consume more disk space. The recommended option is to make all string fields not_analyzed in the first go and then make a field analyzed later only if needed.
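One way to apply the "not_analyzed by default" recommendation is an index template with a dynamic template for string fields (ES 1.x syntax). This is a hedged sketch; the template name, index pattern, and type name are assumptions.

```python
import json
import requests

ES_URL = "http://localhost:9200"  # assumed local ES endpoint

# Map every newly seen string field as not_analyzed by default; individual
# fields can be re-mapped as analyzed later if partial search is really needed.
template = {
    "template": "logstash-*",
    "mappings": {
        "logs": {
            "dynamic_templates": [
                {
                    "strings_not_analyzed": {
                        "match_mapping_type": "string",
                        "mapping": {"type": "string", "index": "not_analyzed"}
                    }
                }
            ]
        }
    }
}

resp = requests.put(ES_URL + "/_template/slim_strings", data=json.dumps(template))
print(resp.json())
```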
5) Doc values :- Doc values are an on-disk data structure, built at document index time, which makes the column-oriented access needed for sorting and aggregations possible. Doc values offload the heap burden by writing the fielddata to disk at index time, thereby allowing Elasticsearch to load the values outside of your Java heap as they are needed. In the latest versions of ES this feature is already enabled by default. In our case we are on ES 1.7.1, and we have to enable it explicitly; this consumes extra disk space but does not degrade performance at all. The overall benefits of doc values significantly outweigh the cost.
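On ES 1.x doc values are switched on per field in the mapping (they apply to not_analyzed strings and to numeric/date fields). A minimal sketch, with the index, type, and field names being examples only:

```python
import json
import requests

ES_URL = "http://localhost:9200"  # assumed local ES endpoint

# Enable doc values explicitly for the fields we sort or aggregate on.
mapping = {
    "mappings": {
        "logs": {
            "properties": {
                "name":   {"type": "string", "index": "not_analyzed", "doc_values": True},
                "status": {"type": "integer", "doc_values": True},
            }
        }
    }
}

resp = requests.put(ES_URL + "/logstash-2016.06.01", data=json.dumps(mapping))
print(resp.json())
```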
Thanks
VG
Source: https://discuss.elastic.co/t/how-to-reduce-index-size-on-disk/49415
Some additional notes on doc values:
As we know, the basic data structure of a search engine is the inverted index: a mapping from each term to the documents that contain it, with all the terms kept in a sorted list. At search time you only need to match the term in that sorted list to find every document containing it, so the inverted index is extremely efficient for keyword search. Aggregation analysis, however, is very different from search. In a typical scenario, such as counting how often each term occurs in a given document, the inverted index is of no help: you would first have to scan the whole term-to-document mapping to find all the terms that document contains, and only then aggregate them (this example is not entirely accurate, since Lucene redundantly stores term frequencies in the inverted index for relevance scoring). In other words, it would require a full scan of the inverted index, and with large data volumes performance would obviously be poor.
So Elasticsearch introduced a data structure called fielddata for aggregation: essentially a forward index obtained by inverting the inverted index again, i.e. a mapping from documents to terms. Because aggregations and sorting usually target specific fields, what actually gets built is a set of per-field columnar indexes from document to field values, hence the name fielddata. When aggregating over the terms inside a document, you then only need to look them up in fielddata by document ID. Fielddata is kept in memory: the upside is that it takes no storage, the downside is of course that memory can run out. Moreover, this memory is allocated from the JVM heap, and given how JVM garbage collection behaves with large heaps, it poses a real challenge to stability; with large data volumes the occasional OutOfMemory is no joke. Since memory is limited, fielddata cannot be built in advance for every field; it is built lazily, triggered by the actual search request. If a request misses, fielddata first has to be built in memory, which hurts response time.
The problems with fielddata are the limited amount of memory and the stability risk that large-heap JVM garbage collection brings. So a new mechanism was later introduced: doc values. Structurally, doc values are the same per-column forward index as fielddata, but the implementation differs: doc values are persisted to disk and built up front, i.e. when data enters Elasticsearch the inverted index and the doc values are generated at the same time. This costs extra storage space, but it drastically reduces the JVM memory requirements, leaving the remaining memory to the operating system's file cache. And because doc values are prebuilt, queries no longer pay the cost of building fielddata on a miss. Overall, doc values are only about 10-25% slower than in-memory fielddata, while stability improves dramatically. Starting with Elasticsearch 2.0, doc values are generated by default for every field except analyzed string fields (this can be controlled per field with the doc_values boolean in the index mapping).
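To make the access pattern concrete, below is a hedged sketch of a terms aggregation, the kind of query that reads field values column-wise per document and is therefore served by fielddata (in heap) or doc values (on disk). The endpoint, index, and field names are assumptions for illustration.

```python
import json
import requests

ES_URL = "http://localhost:9200"  # assumed local ES endpoint

# Count documents per value of the "name" field; this reads values
# document-by-document, which is what fielddata / doc values provide.
query = {
    "size": 0,
    "aggs": {
        "docs_per_name": {
            "terms": {"field": "name", "size": 10}
        }
    }
}

resp = requests.post(ES_URL + "/logstash-2016.06.01/_search", data=json.dumps(query))
print(json.dumps(resp.json().get("aggregations", {}), indent=2))
```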
Simply put, Elasticsearch uses the inverted index for search and doc values (columnar storage) for analytics, unifying the search and analytics scenarios in a single distributed system, which is quite promising. But analytics is more than aggregation, and that is a direction Elasticsearch still needs to work on. Today, through the Elasticsearch-Hadoop project, Elasticsearch search results can be used as Spark RDDs for deeper analysis with Spark. If in the future the distributed computing layer can be integrated even more deeply with a computing framework like Spark, it may well become another heavyweight in the big data space.
Source: https://yq.aliyun.com/articles/6902
How to reduce index size on disk? Some small tricks for shrinking Elasticsearch indexes
Original post: http://www.cnblogs.com/bonelee/p/6401317.html