Elasticsearch Hello World (Part 2)

2024-09-06 08:34:03

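First, let's test tokenization, especially Chinese word segmentation, which is a real pain point for traditional databases such as MySQL and SQL Server.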

Open a browser and go to http://localhost:5601, click Dev Tools, and enter the following in the Console pane:

POST _analyze
{
  "analyzer": "standard",
  "text":"Hello World ElasticSearch"
}

The response appears in the right-hand pane:

{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "world",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "elasticsearch",
      "start_offset": 12,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
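For readers who prefer a script to the Kibana Console, the same request can be sent over plain HTTP. The following is a minimal sketch, not the author's method: it assumes Elasticsearch is reachable at http://localhost:9200 with no authentication or TLS (recent versions enable security by default, so adjust accordingly), and the helper names build_analyze_request, token_texts, and analyze are made up for illustration.

```python
import json
import urllib.request

# Assumption: Elasticsearch listens on localhost:9200 without authentication.
ES_URL = "http://localhost:9200"


def build_analyze_request(analyzer, text, base_url=ES_URL):
    """Build the HTTP request equivalent to `POST _analyze` in the Console."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(response):
    """Return just the token strings from a parsed _analyze response."""
    return [t["token"] for t in response["tokens"]]


def analyze(analyzer, text, base_url=ES_URL):
    """Send the request and return the list of token strings."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as resp:
        return token_texts(json.load(resp))


# Usage (needs a running node):
#   analyze("standard", "Hello World ElasticSearch")
```

Keeping request construction and response parsing separate from the network call makes the two pure helpers easy to check without a running cluster.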

Everything looks fine so far. Now let's see what happens with Chinese text.

POST _analyze
{
  "analyzer": "standard",
  "text":"ElasticSearch是一个很不错的全文检索软件。"
}

The result is:

{
  "tokens": [
    {
      "token": "elasticsearch",
      "start_offset": 0,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "是",
      "start_offset": 13,
      "end_offset": 14,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "一",
      "start_offset": 14,
      "end_offset": 15,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "个",
      "start_offset": 15,
      "end_offset": 16,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "很",
      "start_offset": 16,
      "end_offset": 17,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "不",
      "start_offset": 17,
      "end_offset": 18,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    },
    {
      "token": "错",
      "start_offset": 18,
      "end_offset": 19,
      "type": "<IDEOGRAPHIC>",
      "position": 6
    },
    {
      "token": "的",
      "start_offset": 19,
      "end_offset": 20,
      "type": "<IDEOGRAPHIC>",
      "position": 7
    },
    {
      "token": "全",
      "start_offset": 20,
      "end_offset": 21,
      "type": "<IDEOGRAPHIC>",
      "position": 8
    },
    {
      "token": "文",
      "start_offset": 21,
      "end_offset": 22,
      "type": "<IDEOGRAPHIC>",
      "position": 9
    },
    {
      "token": "检",
      "start_offset": 22,
      "end_offset": 23,
      "type": "<IDEOGRAPHIC>",
      "position": 10
    },
    {
      "token": "索",
      "start_offset": 23,
      "end_offset": 24,
      "type": "<IDEOGRAPHIC>",
      "position": 11
    },
    {
      "token": "软",
      "start_offset": 24,
      "end_offset": 25,
      "type": "<IDEOGRAPHIC>",
      "position": 12
    },
    {
      "token": "件",
      "start_offset": 25,
      "end_offset": 26,
      "type": "<IDEOGRAPHIC>",
      "position": 13
    }
  ]
}

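That is clearly unacceptable: every Chinese character is split into its own token, which makes the result practically unusable for search. A quick Google search suggested that there is also an analyzer option named chinese, but testing showed that it produces the same result. Further searching showed that people generally use either the smartcn or the IK Analyzer plugin here. Some articles and books recommend IK Analyzer, but they are all based on older versions of ES, and when I looked at IK Analyzer's GitHub repository it appeared to have been abandoned. So I went with the officially recommended smartcn. Downloading and installing it works the same way as for other plugins, and I again recommend installing from the offline package. After installation, the ES service most likely has to be restarted for the plugin to take effect. Now let's try again.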
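Before re-running the request, it can help to confirm that the plugin was actually loaded after the restart. The sketch below reads the plugin list from the _cat/plugins API and checks for the component name analysis-smartcn. It assumes an unsecured node at http://localhost:9200; the helper names installed_components and fetch_plugins are hypothetical.

```python
import json
import urllib.request

# Assumption: Elasticsearch listens on localhost:9200 without authentication.
ES_URL = "http://localhost:9200"


def installed_components(plugins):
    """Collect plugin component names from a parsed `_cat/plugins?format=json` response."""
    return {p["component"] for p in plugins}


def fetch_plugins(base_url=ES_URL):
    """Fetch the plugin list; each row describes one plugin on one node."""
    with urllib.request.urlopen(f"{base_url}/_cat/plugins?format=json") as resp:
        return json.load(resp)


# Usage (needs a running node):
#   "analysis-smartcn" in installed_components(fetch_plugins())
```

If the check comes back false on a multi-node cluster, keep in mind that plugins have to be installed on every node.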

POST _analyze
{
  "analyzer": "smartcn",
  "text":"ElasticSearch是一个很不错的全文检索软件。"
}

{
  "tokens": [
    {
      "token": "elasticsearch",
      "start_offset": 0,
      "end_offset": 13,
      "type": "word",
      "position": 0
    },
    {
      "token": "是",
      "start_offset": 13,
      "end_offset": 14,
      "type": "word",
      "position": 1
    },
    {
      "token": "一个",
      "start_offset": 14,
      "end_offset": 16,
      "type": "word",
      "position": 2
    },
    {
      "token": "很",
      "start_offset": 16,
      "end_offset": 17,
      "type": "word",
      "position": 3
    },
    {
      "token": "不错",
      "start_offset": 17,
      "end_offset": 19,
      "type": "word",
      "position": 4
    },
    {
      "token": "的",
      "start_offset": 19,
      "end_offset": 20,
      "type": "word",
      "position": 5
    },
    {
      "token": "全文",
      "start_offset": 20,
      "end_offset": 22,
      "type": "word",
      "position": 6
    },
    {
      "token": "检索",
      "start_offset": 22,
      "end_offset": 24,
      "type": "word",
      "position": 7
    },
    {
      "token": "软件",
      "start_offset": 24,
      "end_offset": 26,
      "type": "word",
      "position": 8
    }
  ]
}

That looks much more sensible now. :)
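The difference between the two responses can be summarized in a few lines. This sketch copies the token strings from the two outputs above and counts how many multi-character Chinese words each analyzer produced; the names multi_char_cjk and summarize are illustrative only.

```python
# Token strings copied from the two _analyze responses shown above.
STANDARD = ["elasticsearch", "是", "一", "个", "很", "不", "错", "的",
            "全", "文", "检", "索", "软", "件"]
SMARTCN = ["elasticsearch", "是", "一个", "很", "不错", "的",
           "全文", "检索", "软件"]


def multi_char_cjk(tokens):
    """Return tokens longer than one character that include non-ASCII text."""
    return [t for t in tokens if len(t) > 1 and not t.isascii()]


def summarize(name, tokens):
    """Describe a token list as a short one-line summary."""
    words = multi_char_cjk(tokens)
    return f"{name}: {len(tokens)} tokens, {len(words)} multi-character words"


print(summarize("standard", STANDARD))
print(summarize("smartcn", SMARTCN))
```

The standard analyzer yields 14 tokens with no multi-character Chinese words, while smartcn yields 9 tokens, 5 of which are real words such as 全文 and 检索.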
