analyzer
An analyzer is made up of three parts:
Character Filters, Tokenizers, and Token Filters
Character Filters handle character filtering. The official explanation: a character filter can be used to convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789), or to strip HTML content such as <b> tags.
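For instance (a minimal sketch using the built-in html_strip character filter, not part of the original post), the _analyze API shows the effect of a character filter directly:

GET _analyze
{
  "tokenizer": "standard",
  "char_filter": ["html_strip"],
  "text": "<b>hello</b> world"
}

The <b> tags are removed before tokenization, leaving the tokens hello and world.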
Tokenizers handle tokenization; commonly used tokenizers include whitespace and standard.
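As a quick comparison (a sketch added here, not from the original post), the two tokenizers split the same text differently:

GET _analyze
{
  "tokenizer": "whitespace",
  "text": "Brown-Foxes jumped!"
}

The whitespace tokenizer splits only on whitespace, yielding Brown-Foxes and jumped!, whereas the standard tokenizer would split on word boundaries and produce Brown, Foxes, and jumped.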
Token Filters process the tokens emitted by the tokenizer, for example:
The length filter removes words that are too long or too short;
min defines the minimum length
max defines the maximum length
Usage:
GET _analyze
{
  "tokenizer": "standard",
  "filter": [{ "type": "length", "min": 1, "max": 3 }],
  "text": "this is a test"
}
Result:
"tokens": [ { "token": "is", "start_offset": 5, "end_offset": 7, "type": "<ALPHANUM>", "position": 1 }, { "token": "a", "start_offset": 8, "end_offset": 9, "type": "<ALPHANUM>", "position": 2 } ]
{ "tokenizer" : "standard", "filter": [{"type": "stop", "stopwords": ["this", "a"]}], "text" : ["this is a test"] }
Output:
# The stopwords this and a have been filtered out
"tokens": [
  { "token": "is", "start_offset": 5, "end_offset": 7, "type": "<ALPHANUM>", "position": 1 },
  { "token": "test", "start_offset": 10, "end_offset": 14, "type": "<ALPHANUM>", "position": 3 }
]
A custom analyzer is defined in the index settings by combining a tokenizer with token filters; here a light_german stemmer is registered as my_stemmer and added to the filter chain:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "my_stemmer"]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "name": "light_german"
        }
      }
    }
  }
}
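Assuming the index above has been created, the custom analyzer can be checked with the index-level _analyze API (this call is a sketch added for illustration, not from the original post):

GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Häuser"
}

The standard tokenizer splits the text, lowercase lowercases it, and my_stemmer applies the German light stemmer to each token.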
The reverse filter reverses each token. Call:

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["reverse"],
  "text": ["hello world"]
}
Result:
"tokens": [ { "token": "olleh", "start_offset": 0, "end_offset": 5, "type": "<ALPHANUM>", "position": 0 }, { "token": "dlrow", "start_offset": 6, "end_offset": 11, "type": "<ALPHANUM>", "position": 1 } ]
The unique filter removes duplicate tokens:

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["unique"],
  "text": ["this is a test test test"]
}

Of the repeated test tokens at the end, only one appears in the final output.
Output:
"tokens": [ { "token": "this", "start_offset": 0, "end_offset": 4, "type": "<ALPHANUM>", "position": 0 }, { "token": "is", "start_offset": 5, "end_offset": 7, "type": "<ALPHANUM>", "position": 1 }, { "token": "a", "start_offset": 8, "end_offset": 9, "type": "<ALPHANUM>", "position": 2 }, { "token": "test", "start_offset": 10, "end_offset": 14, "type": "<ALPHANUM>", "position": 3 } ]
Elasticsearch Study Notes: the analyzer
Source: https://www.cnblogs.com/wjx-blog/p/12068487.html