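Search Engines -- Example: Segmenting Mixed Chinese-English Text -- Principle and Implementation of the Forward Maximum Matching Algorithm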

2024-07-10 17:35:45, 221 views

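The only difference between segmenting pure Chinese text and mixed Chinese-English text is how you decide, while segmenting, whether a given character is an English character or a Chinese character.

The human eye tells them apart easily, but for a computer it takes a little more work. Once Chinese and English characters can be told apart reliably, the segmentation itself is not a hard problem.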

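1: Text encoding:

　　utf8: on Windows, text saved as UTF-8 often starts with a 3-byte BOM (EF BB BF in hexadecimal). Most Chinese characters take 3 bytes, but UTF-8 is variable-length, so you cannot assume that for every character. English characters are encoded exactly as in ASCII.

　　ascii (strictly speaking the ANSI code page, i.e. GBK/cp936 on Chinese Windows): English takes one byte and Chinese takes two, and the first byte of a Chinese character is always greater than 128, so it cannot be confused with English. Recommended.

　　unicode: Chinese characters are mostly two bytes each (in UTF-16), which suits segmentation of pure-Chinese web text.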
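The byte counts described above can be checked directly. The following is a small sketch in Python 3; the helper names `byte_lengths` and `is_gbk_lead_byte` and the sample string are illustrative, not part of the original script.

```python
import codecs


def byte_lengths(text, encodings=("utf-8", "gbk", "utf-16-le")):
    """Return {encoding: [number of bytes used by each character of text]}."""
    return {enc: [len(ch.encode(enc)) for ch in text] for enc in encodings}


def is_gbk_lead_byte(b):
    """True if byte value b is in the GBK lead-byte range 0x81..0xFE."""
    return 0x81 <= b <= 0xFE


sample = "搜索a"                      # two Chinese characters and one English letter
lengths = byte_lengths(sample)
print(lengths["utf-8"])              # [3, 3, 1]
print(lengths["gbk"])                # [2, 2, 1]
print(lengths["utf-16-le"])          # [2, 2, 2]

# The 3-byte BOM that some Windows editors write at the start of a UTF-8 file.
bom = codecs.BOM_UTF8
print(bom.hex())                     # efbbbf

# In GBK, the first byte alone is enough to tell Chinese from English.
print(is_gbk_lead_byte("搜".encode("gbk")[0]), is_gbk_lead_byte(ord("a")))
```

This is why the two-byte encoding is convenient for a byte-by-byte scanner: a byte above 128 means "read one more byte, this is a Chinese character", and anything else is a single English character.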

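2: Choosing a text encoding

　　For offline segmentation I use ascii (GBK/cp936) encoded text: English and Chinese bytes are easy to tell apart, so it is the easiest to implement.

　　For online segmentation the incoming parameter is already unicode, which makes pure-Chinese segmentation even simpler.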
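For the online case, where the input is already a Unicode string, characters can be classified by code point instead of by byte value. A minimal sketch, assuming that "Chinese character" means the basic CJK Unified Ideographs block (U+4E00 to U+9FFF); the helper names `is_cjk` and `split_runs` are hypothetical.

```python
from itertools import groupby


def is_cjk(ch):
    """True if ch lies in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def split_runs(text):
    """Split text into consecutive runs of Chinese and non-Chinese characters.

    Returns a list of (is_chinese, run) pairs in their original order.
    """
    return [(flag, "".join(group)) for flag, group in groupby(text, key=is_cjk)]


runs = split_runs("我用python写代码")
print(runs)   # [(True, '我用'), (False, 'python'), (True, '写代码')]
```

Only the Chinese runs then need dictionary matching; the other runs can be passed through unchanged. Rare characters outside the basic block (extension blocks, full-width punctuation) would need extra ranges.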

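3: How Chinese segmentation works (dictionary-based or semantic; the latter needs a large amount of data). Dictionary-based segmentation is used here.

　　The Chinese dictionary I use: click here to download. It originally had 300,000 entries; after removing words longer than 5 characters, a little over 250,000 remain, which is basically enough. PS: I extracted it from the jieba (结巴) segmentation dictionary, so you can extract the full 300,000-entry dictionary yourself ^_^

　　Chinese stop-word list: click here to download

　　Test data: click here to download 10,000 Sina Weibo posts

　　If a link does not download when clicked, copy it into a download manager such as Xunlei (Thunder) or QQ Xuanfeng and download it from there.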
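The dictionary preparation described above (keep only words of at most 5 characters) might look like the sketch below. It assumes each source line is whitespace-separated with the word in the first column, as in jieba's `dict.txt`; the function name `load_dictionary` and the sample lines are made up for illustration.

```python
def load_dictionary(lines, max_chars=5):
    """Collect the first column of each line, dropping words longer than max_chars."""
    words = set()
    for line in lines:
        parts = line.split()
        if not parts:
            continue                  # skip blank lines
        word = parts[0]
        if len(word) <= max_chars:
            words.add(word)
    return words


# Hypothetical sample in "word frequency tag" form.
sample_lines = [
    "搜索 100 v",
    "搜索引擎 50 n",
    "中华人民共和国 30 ns",           # 7 characters, filtered out
    "",
]
dictionary = load_dictionary(sample_lines)
print(sorted(dictionary))
```

With a real file, the same function can be fed an open file object, for example `load_dictionary(open(path, encoding="utf-8"))`, where `path` points at whichever dictionary you extracted.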

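4: How forward maximum matching segmentation works

Define a maximum match length, max_length.

Scan the document from left to right, one character at a time:

　　If the character is a Chinese character (ord > 128):
　　　　If the buffered text is shorter than max_length, keep reading.
　　　　If the buffered text has length == max_length:
　　　　　　Try to match words of length max_length down to 1, longest first.
　　　　　　　　If a word matches:
　　　　　　　　　　Check whether it is a stop word.
　　　　　　　　　　　　If it is not, record it.
　　　　　　　　　　　　If it is a stop word, skip it and cut again from the text that follows.
　　If the character is not a Chinese character, process the Chinese text buffered so far:
　　　　If the buffer is empty, continue.
　　　　If the buffer is not empty, segment it.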
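The steps above can be sketched compactly on Unicode strings. This is not the byte-oriented script from this article but a simplified illustration under assumptions: Chinese means U+4E00 to U+9FFF, a character with no dictionary match becomes a single-character token, stop words are dropped from the output, and whitespace is skipped. The names `is_cjk` and `forward_max_match` and the tiny dictionary are hypothetical.

```python
def is_cjk(ch):
    """True if ch lies in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def forward_max_match(text, dictionary, stopwords=frozenset(), max_len=5):
    """Segment text left to right, always taking the longest dictionary word."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        if is_cjk(text[i]):
            # Window: up to max_len consecutive Chinese characters starting at i.
            end = i
            while end < n and end - i < max_len and is_cjk(text[end]):
                end += 1
            # Shrink the candidate from the right until it is a dictionary word;
            # a single character is always accepted as a fallback.
            for j in range(end, i, -1):
                if text[i:j] in dictionary or j == i + 1:
                    break
            word = text[i:j]
            if word not in stopwords:
                tokens.append(word)
            i = j
        elif text[i].isspace():
            i += 1
        else:
            # Non-Chinese run (English word, number, ...) is kept as one token.
            j = i
            while j < n and not is_cjk(text[j]) and not text[j].isspace():
                j += 1
            tokens.append(text[i:j])
            i = j
    return tokens


words = {"搜索", "搜索引擎", "引擎", "中文", "分词"}
tokens = forward_max_match("搜索引擎的中文分词very easy", words, stopwords={"的"})
print(tokens)   # ['搜索引擎', '中文', '分词', 'very', 'easy']
```

Note that "搜索引擎" wins over the shorter "搜索" because the longest candidate is tried first, which is the defining property of maximum matching.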

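5: The Python implementation of forward maximum matching for mixed Chinese-English text follows. weibo is the text being segmented and result is the segmented output; print it to check that it looks right.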

# -*- coding: utf-8 -*-
# Python 3 port of the original Python 2 script (which used file() and string.atoi).
# All text is handled as GBK/cp936 bytes: an English character is 1 byte, a Chinese
# character is 2 bytes, and the first byte of a Chinese character is > 128.

MAX_LENGTH = 10   # buffer size in bytes, i.e. at most 5 Chinese characters


def load_terms(path):
    """Read one term per line (GBK bytes) into a dict used as a set."""
    terms = {}
    with open(path, "rb") as f:
        for line in f:
            term = line.strip()
            if term:
                terms[term] = 1
    return terms


dist = load_terms("bdist.txt")      # word dictionary
stop = load_terms("stopword.txt")   # stop words
re = {}                             # term -> "<weibo id>[pos.pos." (name kept from the original)


def record(t, i, w_id):
    """Append position i to the posting string of term t."""
    if re.get(t) is None:
        re[t] = str(w_id) + '[' + str(i) + '.'
    else:
        re[t] = re[t] + str(i) + '.'


def longest_word(buf):
    """Longest dictionary word that is a prefix of buf, or b'' if none matches."""
    cand = buf
    while cand and dist.get(cand) != 1:
        cand = cand[:-2]            # drop one Chinese character (2 bytes)
    return cand


def cut(result, temp, index, w_id, drain):
    """Cut words off the front of temp and append them to result.

    With drain=False only one word is cut and the remainder stays buffered;
    with drain=True the buffer is consumed completely.
    """
    while temp:
        word = longest_word(temp)
        if word:
            if stop.get(word) is None:   # stop words are segmented but not recorded
                record(word, index - 2, w_id)
        else:
            # No dictionary word starts here. The original discarded the rest of the
            # buffer in this case; as a fix, the first character is passed through.
            word = temp[:2]
        result += word + b'/'
        temp = temp[len(word):]
        if not drain:
            break
    return result, temp


with open("weibo.txt", "rb") as wf:
    for line in wf:
        re = {}
        # Only rows whose last character before the newline is '1' are processed.
        if line[-2:-1] != b'1':
            continue
        # Row layout: id,userid,username,content,pt,count
        fields = line.split(b',', 5)
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        w_id = int(fields[0])
        w_userid, w_username, w_content, w_pt = fields[1:5]
        w_count = int(fields[5])

        weibo = w_content
        weibo_length = len(weibo)
        index = 0
        temp = b''      # buffered Chinese bytes that are not segmented yet
        result = b''    # segmented text, words separated by '/'

        while index < weibo_length:
            if weibo[index] > 128:
                # First byte of a 2-byte Chinese character: buffer the character.
                temp += weibo[index:index + 2]
                index += 2
                if len(temp) < MAX_LENGTH and index + 1 < weibo_length:
                    continue
                # Buffer is full or the text is ending: cut one word from the front,
                # and keep cutting if there is no more input to read.
                at_end = index + 1 > weibo_length
                result, temp = cut(result, temp, index, w_id, drain=at_end)
            else:
                # English/ASCII byte: segment the buffered Chinese, then copy the byte.
                s = weibo[index:index + 1]
                index += 1
                result, temp = cut(result, temp, index, w_id, drain=True)
                result += s
        print(result.decode("cp936", errors="replace"))

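6: Once segmentation is done, you can build an inverted index from the result, compute TF-IDF, apply the vector space model, and draw on the rest of that toolkit to build a search engine.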
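As a rough illustration of that next step, the sketch below builds a positional inverted index from already-segmented documents and computes one common TF-IDF variant (term frequency normalised by document length, natural-log IDF without smoothing). The function names and the three tiny documents are invented for the example; a real search engine would typically add smoothing and a ranking model on top.

```python
import math
from collections import defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: [tokens]} -> {term: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def tf_idf(index, docs):
    """Return {term: {doc_id: tf * idf}} using tf = count / doc length, idf = ln(N / df)."""
    n_docs = len(docs)
    scores = defaultdict(dict)
    for term, postings in index.items():
        idf = math.log(n_docs / len(postings))
        for doc_id, positions in postings.items():
            tf = len(positions) / len(docs[doc_id])
            scores[term][doc_id] = tf * idf
    return scores


# Hypothetical segmented documents (output of the segmenter, stop words removed).
docs = {
    1: ["搜索", "引擎", "搜索"],
    2: ["中文", "分词"],
    3: ["搜索", "分词"],
}
index = build_inverted_index(docs)
scores = tf_idf(index, docs)
print(dict(index["搜索"]))   # {1: [0, 2], 3: [0]}
```

A term that appears in every document gets an IDF of zero under this variant, which is the intended effect: it carries no information for ranking.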
