Elasticsearch: Hello World (Part 2)

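    First, let's test tokenization, especially Chinese tokenization, which is a real pain point for traditional databases such as MySQL and SQL Server.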

    Open a browser and go to http://localhost:5601, click Dev Tools, and enter the following in the Console:

POST _analyze
{
  "analyzer": "standard",
  "text":"Hello World ElasticSearch"
}

    The result appears in the right-hand pane:

{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "world",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "elasticsearch",
      "start_offset": 12,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
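
    The same request can also be sent from a script instead of the Console. The following is a minimal sketch using only the Python standard library. It assumes Elasticsearch is listening on its default HTTP port at http://localhost:9200 (the post itself only mentions the Kibana address), and the helper names build_analyze_request, extract_tokens and analyze are made up for this illustration.

```python
import json
import urllib.request

# Assumption: Elasticsearch's default HTTP address; adjust for your setup.
ES_URL = "http://localhost:9200"


def build_analyze_request(analyzer, text, base_url=ES_URL):
    """Build (but do not send) a POST /_analyze request."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(response):
    """Return only the token strings from a parsed _analyze response."""
    return [entry["token"] for entry in response["tokens"]]


def analyze(analyzer, text, base_url=ES_URL):
    """Send the request and return the token strings (needs a running node)."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as reply:
        return extract_tokens(json.load(reply))
```

    With a node running, analyze("standard", "Hello World ElasticSearch") should return the three lowercase tokens shown in the response above.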

    Everything looks fine so far. Now let's add some Chinese.

POST _analyze
{
  "analyzer": "standard",
  "text":"ElasticSearch是一个很不错的全文检索软件。"
}

    The result is:

{
  "tokens": [
    {
      "token": "elasticsearch",
      "start_offset": 0,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "是",
      "start_offset": 13,
      "end_offset": 14,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "一",
      "start_offset": 14,
      "end_offset": 15,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "个",
      "start_offset": 15,
      "end_offset": 16,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "很",
      "start_offset": 16,
      "end_offset": 17,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "不",
      "start_offset": 17,
      "end_offset": 18,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    },
    {
      "token": "错",
      "start_offset": 18,
      "end_offset": 19,
      "type": "<IDEOGRAPHIC>",
      "position": 6
    },
    {
      "token": "的",
      "start_offset": 19,
      "end_offset": 20,
      "type": "<IDEOGRAPHIC>",
      "position": 7
    },
    {
      "token": "全",
      "start_offset": 20,
      "end_offset": 21,
      "type": "<IDEOGRAPHIC>",
      "position": 8
    },
    {
      "token": "文",
      "start_offset": 21,
      "end_offset": 22,
      "type": "<IDEOGRAPHIC>",
      "position": 9
    },
    {
      "token": "检",
      "start_offset": 22,
      "end_offset": 23,
      "type": "<IDEOGRAPHIC>",
      "position": 10
    },
    {
      "token": "索",
      "start_offset": 23,
      "end_offset": 24,
      "type": "<IDEOGRAPHIC>",
      "position": 11
    },
    {
      "token": "软",
      "start_offset": 24,
      "end_offset": 25,
      "type": "<IDEOGRAPHIC>",
      "position": 12
    },
    {
      "token": "件",
      "start_offset": 25,
      "end_offset": 26,
      "type": "<IDEOGRAPHIC>",
      "position": 13
    }
  ]
}
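
    The start_offset and end_offset fields index into the text that was sent, so the piece of input behind each token can be recovered by slicing. Below is a small sketch of that idea. The function name surface_forms is made up for this illustration, and it assumes the offsets line up with Python string indices, which holds for this sentence but, as far as I know, may not for characters outside the Basic Multilingual Plane.

```python
def surface_forms(text, tokens):
    """Slice the original text with each token's start and end offsets."""
    return [text[t["start_offset"]:t["end_offset"]] for t in tokens]


text = "ElasticSearch是一个很不错的全文检索软件。"

# The first four offset pairs from the standard analyzer response above.
tokens = [
    {"start_offset": 0, "end_offset": 13},
    {"start_offset": 13, "end_offset": 14},
    {"start_offset": 14, "end_offset": 15},
    {"start_offset": 15, "end_offset": 16},
]

print(surface_forms(text, tokens))
# prints ['ElasticSearch', '是', '一', '个']
```

    Note that slicing returns the original spelling ("ElasticSearch"), while the token itself is lowercased by the analyzer.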

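    This is clearly unacceptable: every Chinese character is split into its own token, which makes the result practically unusable. A quick Google search suggested there is also an analyzer named chinese, but testing showed it gives the same result. Further searching shows that the usual choices here are the smartcn or IKAnalyzer plugins. Some articles and books recommend IKAnalyzer, but they are all based on older versions of ES, and when I looked at IKAnalyzer's GitHub repository it appeared to be abandoned. So I went with the officially recommended smartcn. Downloading and installing it works the same way as for other plugins, and I again recommend installing from the offline package. After installation, the ES service should need a restart for the plugin to take effect. Now let's try again: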

POST _analyze
{
  "analyzer": "smartcn",
  "text":"ElasticSearch是一个很不错的全文检索软件。"
}

    The result is:

{
  "tokens": [
    {
      "token": "elasticsearch",
      "start_offset": 0,
      "end_offset": 13,
      "type": "word",
      "position": 0
    },
    {
      "token": "是",
      "start_offset": 13,
      "end_offset": 14,
      "type": "word",
      "position": 1
    },
    {
      "token": "一个",
      "start_offset": 14,
      "end_offset": 16,
      "type": "word",
      "position": 2
    },
    {
      "token": "很",
      "start_offset": 16,
      "end_offset": 17,
      "type": "word",
      "position": 3
    },
    {
      "token": "不错",
      "start_offset": 17,
      "end_offset": 19,
      "type": "word",
      "position": 4
    },
    {
      "token": "的",
      "start_offset": 19,
      "end_offset": 20,
      "type": "word",
      "position": 5
    },
    {
      "token": "全文",
      "start_offset": 20,
      "end_offset": 22,
      "type": "word",
      "position": 6
    },
    {
      "token": "检索",
      "start_offset": 22,
      "end_offset": 24,
      "type": "word",
      "position": 7
    },
    {
      "token": "软件",
      "start_offset": 24,
      "end_offset": 26,
      "type": "word",
      "position": 8
    }
  ]
}

    This looks much better now. :)
