I. Analyzers
An analyzer consists of three kinds of components, applied in this order:
1. Character filters (char_filter)
e.g., stripping HTML tags, or converting "&" to "and"
2. A tokenizer (tokenizer)
e.g., splitting on whitespace
3. Token filters (filter)
e.g., lowercasing, removing stop words, adding synonyms
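Because the three stages compose, a single _analyze request can exercise all of them at once. A minimal sketch using only built-in components (each piece is shown in isolation in section III):
POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<b>Quick Foxes</b>"
}
This should return the two tokens quick and foxes: the character filter strips the tags, the tokenizer splits on word boundaries, and the token filter lowercases.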
II. Built-in analyzers
Standard analyzer (standard, the default)
Splits on word boundaries, strips punctuation, lowercases
GET _analyze
{
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
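This should return 2, running, quick, brown, foxes, leap, over, lazy, dogs, in, the, summer, evening: the hyphenated word is split, punctuation is dropped, and everything is lowercased, but no stop words are removed.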
Simple analyzer (simple)
Splits on non-letter characters, discards them, lowercases
GET _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
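This should return the same tokens as standard minus the 2: anything that is not a letter, digits included, is treated as a separator and thrown away.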
Stop analyzer (stop)
Like simple (splits on non-letters, discards them, lowercases), plus removal of English stop words
GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
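This should return running, quick, brown, foxes, leap, over, lazy, dogs, summer, evening: the default English stop-word list removes in and the on top of what simple already drops.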
Whitespace analyzer (whitespace)
Splits on whitespace only; does not lowercase
GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
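This should return 2, running, Quick, brown-foxes, leap, over, lazy, dogs, in, the, summer, evening. as-is: case, the hyphen, and even the trailing period are preserved, since nothing but whitespace separates tokens.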
Pattern analyzer (pattern)
Splits on a regular expression, by default \W+ (runs of non-word characters); lowercases
GET _analyze
{
  "analyzer": "pattern",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
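With the default \W+ pattern this sentence should tokenize exactly like the standard analyzer above: the hyphen and the period both count as non-word separators.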
Keyword analyzer (keyword)
Does not tokenize at all; emits the entire input as a single token
GET _analyze
{
  "analyzer": "keyword",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
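This should return one token containing the whole, unmodified sentence.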
III. Custom analyzers
A custom analyzer is assembled from the same three kinds of building blocks; each block can be tested on its own with _analyze before being combined.
Character filters
# html_strip: remove HTML markup
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>hello world</b>"
}
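The <b> tags are stripped before tokenization, so the keyword tokenizer should emit the single token hello world.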
# mapping: replace characters according to a lookup table
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["-=>_"]
    }
  ],
  "text": "123-456, i-test"
}
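Every - becomes _ before tokenization, so this should yield the tokens 123_456 and i_test (the standard tokenizer does not split on underscores).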
# pattern_replace: regex-based replacement
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.elastic.co"
}
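The regex strips the http:// prefix, so this should emit the single token www.elastic.co (the standard tokenizer keeps dots between letters inside one token).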
Tokenizers
# path_hierarchy: emit every ancestor prefix of a path
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/abc/efg"
}
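This should return /usr, /usr/abc, /usr/abc/efg, which is what makes path_hierarchy convenient for matching documents by ancestor directory.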
# whitespace tokenizer combined with a stop token filter
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"],
  "text": "The rain in Spain falls mainly on the plain."
}
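This should return The, rain, Spain, falls, mainly, plain.: in, on, and the lowercase the are removed, but the capitalized The survives because the stop filter matches case-sensitively. That is why lowercase normally runs before stop, as in the next example.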
Token filters
# lowercase first, then remove stop words
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop"],
  "text": "The rain in Spain falls mainly on the plain."
}
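This should return rain, spain, falls, mainly, plain.: with lowercase applied first, both occurrences of the are caught by the stop filter.
Once each building block behaves as expected, the blocks can be combined into a named analyzer in the index settings. A minimal sketch (the index name my_index and the analyzer name my_custom_analyzer are placeholders):
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}
# test the custom analyzer the same way as the built-in ones
GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "<b>The quick Brown Foxes</b>"
}
This should return quick, brown, foxes.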
Source: https://www.cnblogs.com/lemos/p/12513356.html