首页 > 代码库 > java文本相似度计算(Levenshtein Distance算法(中文翻译:编辑距离算法))----代码和详解
java文本相似度计算(Levenshtein Distance算法(中文翻译:编辑距离算法))----代码和详解
算法代码实现:
package com.util; public class SimFeatureUtil { private static int min(int one, int two, int three) { int min = one; if (two < min) { min = two; } if (three < min) { min = three; } return min; } public static int ld(String str1, String str2) { int d[][]; // 矩阵 int n = str1.length(); int m = str2.length(); int i; // 遍历str1的 int j; // 遍历str2的 char ch1; // str1的 char ch2; // str2的 int temp; // 记录相同字符,在某个矩阵位置值的增量,不是0就是1 if (n == 0) { return m; } if (m == 0) { return n; } d = new int[n + 1][m + 1]; for (i = 0; i <= n; i++) { // 初始化第一列 d[i][0] = i; } for (j = 0; j <= m; j++) { // 初始化第一行 d[0][j] = j; } for (i = 1; i <= n; i++) { // 遍历str1 ch1 = str1.charAt(i - 1); // 去匹配str2 for (j = 1; j <= m; j++) { ch2 = str2.charAt(j - 1); if (ch1 == ch2) { temp = 0; } else { temp = 1; } // 左边+1,上边+1, 左上角+temp取最小 d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1]+ temp); } } return d[n][m]; } public static double sim(String str1, String str2) { try { double ld = (double)ld(str1, str2); return (1-ld/(double)Math.max(str1.length(), str2.length())); } catch (Exception e) { return 0.1; } } public static void main(String[] args) { String str1 = "测试12"; String str2 = "测试123"; System.out.println("ld=" + ld(str1, str2)); System.out.println("sim=" + sim(str1, str2)); } }
算法介绍:
算法原理:
设我们可以使用d[ i , j ]个步骤(可以使用一个二维数组保存这个值),表示将串s[ 1…i ] 转换为 串t [ 1…j ]所需要的最少步骤个数,那么,在最基本的情况下,即在i等于0时,也就是说串s为空,那么对应的d[0,j] 就是 增加j个字符,使得s转化为t,在j等于0时,也就是说串t为空,那么对应的d[i,0] 就是 减少 i个字符,使得s转化为t。
然后我们考虑一般情况,加一点动态规划的想法,我们要想得到将s[1..i]经过最少次数的增加,删除,或者替换操作就转变为t[1..j],那么我们就必须在之前可以以最少次数的增加,删除,或者替换操作,使得现在串s和串t只需要再做一次操作或者不做就可以完成s[1..i]到t[1..j]的转换。所谓的“之前”分为下面三种情况:
1)我们可以在k个操作内将 s[1…i] 转换为 t[1…j-1]
2)我们可以在k个操作里面将s[1..i-1]转换为t[1..j]
3)我们可以在k个步骤里面将 s[1…i-1] 转换为 t [1…j-1]
针对第1种情况,我们只需要在最后将 t[j] 加上s[1..i]就完成了匹配,这样总共就需要k+1个操作。
针对第2种情况,我们只需要在最后将s[i]移除,然后再做这k个操作,所以总共需要k+1个操作。
针对第3种情况,我们只需要在最后将s[i]替换为 t[j],使得满足s[1..i] == t[1..j],这样总共也需要k+1个操作。而如果在第3种情况下,s[i]刚好等于t[j],那我们就可以仅仅使用k个操作就完成这个过程。
最后,为了保证得到的操作次数总是最少的,我们可以从上面三种情况中选择消耗最少的一种最为将s[1..i]转换为t[1..j]所需要的最小操作次数。
算法实现步骤:
步骤 | 说明 |
---|---|
1 | 设置n为字符串s的长度。("GUMBO") 设置m为字符串t的长度。("GAMBOL") 如果n等于0,返回m并退出。 如果m等于0,返回n并退出。 构造两个向量v0[m+1] 和v1[m+1],串联0..m之间所有的元素。 |
2 | 初始化 v0 to 0..m。 |
3 | 检查 s (i from 1 to n) 中的每个字符。 |
4 | 检查 t (j from 1 to m) 中的每个字符 |
5 | 如果 s[i] 等于 t[j],则编辑代价cost为 0; 如果 s[i] 不等于 t[j],则编辑代价cost为1。 |
6 | 设置单元v1[j]为下面的最小值之一: a、紧邻该单元上方+1:v1[j-1] + 1 b、紧邻该单元左侧+1:v0[j] + 1 c、该单元对角线上方和左侧+cost:v0[j-1] + cost |
7 | 在完成迭代 (3, 4, 5, 6) 之后,v1[m]便是编辑距离的值。 |
算法步骤详解:
本小节将演示如何计算"GUMBO"和"GAMBOL"两个字符串的Levenshtein距离。
步骤1、2
v0 | v1 | |||||
G | U | M | B | O | ||
0 | 1 | 2 | 3 | 4 | 5 | |
G | 1 | |||||
A | 2 | |||||
M | 3 | |||||
B | 4 | |||||
O | 5 | |||||
L | 6 |
初始化完了之后重点是理解步骤6.
步骤3-6,当 i = 1
v0 | v1 | |||||
G | U | M | B | O | ||
0 | 1 | 2 | 3 | 4 | 5 | |
G | 1 | 0 | ||||
A | 2 | 1 | ||||
M | 3 | 2 | ||||
B | 4 | 3 | ||||
O | 5 | 4 | ||||
L | 6 | 5 |
我们算V1中的值:以红色的0所在的格子为例
如果 s[i] 不等于 t[j],则编辑代价cost为1。
a、紧邻该单元上方+1:v1[j-1] + 1
b、紧邻该单元左侧+1:v0[j] + 1
c、该单元对角线上方和左侧+cost:v0[j-1] + cost
步骤3-6,当 i = 2
v0 | v1 | |||||
G | U | M | B | O | ||
0 | 1 | 2 | 3 | 4 | 5 | |
G | 1 | 0 | 1 | |||
A | 2 | 1 | 1 | |||
M | 3 | 2 | 2 | |||
B | 4 | 3 | 3 | |||
O | 5 | 4 | 4 | |||
L | 6 | 5 | 5 |
步骤3-6,当 i = 3
v0 | v1 | |||||
G | U | M | B | O | ||
0 | 1 | 2 | 3 | 4 | 5 | |
G | 1 | 0 | 1 | 2 | ||
A | 2 | 1 | 1 | 2 | ||
M | 3 | 2 | 2 | 1 | ||
B | 4 | 3 | 3 | 2 | ||
O | 5 | 4 | 4 | 3 | ||
L | 6 | 5 | 5 | 4 |
步骤3-6,当 i = 4
v0 | v1 | |||||
G | U | M | B | O | ||
0 | 1 | 2 | 3 | 4 | 5 | |
G | 1 | 0 | 1 | 2 | 3 | |
A | 2 | 1 | 1 | 2 | 3 | |
M | 3 | 2 | 2 | 1 | 2 | |
B | 4 | 3 | 3 | 2 | 1 | |
O | 5 | 4 | 4 | 3 | 2 | |
L | 6 | 5 | 5 | 4 | 3 |
步骤3-6,当 i = 5
v0 | v1 | |||||
G | U | M | B | O | ||
0 | 1 | 2 | 3 | 4 | 5 | |
G | 1 | 0 | 1 | 2 | 3 | 4 |
A | 2 | 1 | 1 | 2 | 3 | 4 |
M | 3 | 2 | 2 | 1 | 2 | 3 |
B | 4 | 3 | 3 | 2 | 1 | 2 |
O | 5 | 4 | 4 | 3 | 2 | 1 |
L | 6 | 5 | 5 | 4 | 3 | 2 |
步骤7
编辑距离就是矩阵右下角的值,v1[m] == 2。由"GUMBO"变换为"GAMBOL"的过程对于我来说是很只管的,即通过将"A"替换为"U",并在末尾追加"L"这样子(实际上替换的过程是由移除和插入两个操作组合而成的)。
我们得到最小编辑距离为2
那么它们的相似度为 (1-ld/(double)Math.max(str1.length(), str2.length()));
1 - 2/6=0.6666666666666667
参考链接:
http://www.cnblogs.com/ymind/archive/2012/03/27/fast-memory-efficient-Levenshtein-algorithm.html
http://teiraisan.blog.163.com/blog/static/12278141420098685835372/
http://teiraisan.blog.163.com/blog/static/12278141420098685835372/
其他语言的代码实现:
c++
In C++, the size of an array must be a constant, and this code fragment causes an error at compile time: int sz = 5; int arr[sz]; This limitation makes the following C++ code slightly more complicated than it would be if the matrix could simply be declared as a two-dimensional array, with a size determined at run-time. In C++ it's more idiomatic to use the System Template Library's vector class, as Anders Sewerin Johansen has done in an alternative C++ implementation. Here is the definition of the class (distance.h): class Distance { public: int LD (char const *s, char const *t); private: int Minimum (int a, int b, int c); int *GetCellPointer (int *pOrigin, int col, int row, int nCols); int GetAt (int *pOrigin, int col, int row, int nCols); void PutAt (int *pOrigin, int col, int row, int nCols, int x); }; Here is the implementation of the class (distance.cpp): #include "distance.h" #include <string.h> #include <malloc.h> //**************************** // Get minimum of three values //**************************** int Distance::Minimum (int a, int b, int c) { int mi; mi = a; if (b < mi) { mi = b; } if (c < mi) { mi = c; } return mi; } //************************************************** // Get a pointer to the specified cell of the matrix //************************************************** int *Distance::GetCellPointer (int *pOrigin, int col, int row, int nCols) { return pOrigin + col + (row * (nCols + 1)); } //***************************************************** // Get the contents of the specified cell in the matrix //***************************************************** int Distance::GetAt (int *pOrigin, int col, int row, int nCols) { int *pCell; pCell = GetCellPointer (pOrigin, col, row, nCols); return *pCell; } //******************************************************* // Fill the specified cell in the matrix with the value x //******************************************************* void Distance::PutAt (int *pOrigin, int col, int row, int nCols, int x) { int *pCell; pCell = GetCellPointer (pOrigin, col, row, nCols); *pCell = x; } //***************************** // Compute Levenshtein distance //***************************** int Distance::LD (char const *s, char const *t) { int *d; // pointer to matrix int n; // length of s int m; // length of t int i; // iterates through s int j; // iterates through t char s_i; // ith character of s char t_j; // jth character of t int cost; // cost int result; // result int cell; // contents of target cell int above; // contents of cell immediately above int left; // contents of cell immediately to left int diag; // contents of cell immediately above and to left int sz; // number of cells in matrix // Step 1 n = strlen (s); m = strlen (t); if (n == 0) { return m; } if (m == 0) { return n; } sz = (n+1) * (m+1) * sizeof (int); d = (int *) malloc (sz); // Step 2 for (i = 0; i <= n; i++) { PutAt (d, i, 0, n, i); } for (j = 0; j <= m; j++) { PutAt (d, 0, j, n, j); } // Step 3 for (i = 1; i <= n; i++) { s_i = s[i-1]; // Step 4 for (j = 1; j <= m; j++) { t_j = t[j-1]; // Step 5 if (s_i == t_j) { cost = 0; } else { cost = 1; } // Step 6 above = GetAt (d,i-1,j, n); left = GetAt (d,i, j-1, n); diag = GetAt (d, i-1,j-1, n); cell = Minimum (above + 1, left + 1, diag + cost); PutAt (d, i, j, n, cell); } } // Step 7 result = GetAt (d, n, m, n); free (d); return result; }
Visual Basic
'******************************* '*** Get minimum of three values '******************************* Private Function Minimum(ByVal a As Integer, _ ByVal b As Integer, _ ByVal c As Integer) As Integer Dim mi As Integer mi = a If b < mi Then mi = b End If If c < mi Then mi = c End If Minimum = mi End Function '******************************** '*** Compute Levenshtein Distance '******************************** Public Function LD(ByVal s As String, ByVal t As String) As Integer Dim d() As Integer ' matrix Dim m As Integer ' length of t Dim n As Integer ' length of s Dim i As Integer ' iterates through s Dim j As Integer ' iterates through t Dim s_i As String ' ith character of s Dim t_j As String ' jth character of t Dim cost As Integer ' cost ' Step 1 n = Len(s) m = Len(t) If n = 0 Then LD = m Exit Function End If If m = 0 Then LD = n Exit Function End If ReDim d(0 To n, 0 To m) As Integer ' Step 2 For i = 0 To n d(i, 0) = i Next i For j = 0 To m d(0, j) = j Next j ' Step 3 For i = 1 To n s_i = Mid$(s, i, 1) ' Step 4 For j = 1 To m t_j = Mid$(t, j, 1) ' Step 5 If s_i = t_j Then cost = 0 Else cost = 1 End If ' Step 6 d(i, j) = Minimum(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost) Next j Next i ' Step 7 LD = d(n, m) Erase d End Function
Python代码
#!/user/bin/env python # -*- coding: utf-8 -*- class arithmetic(): def __init__(self): pass ''''' 【编辑距离算法】 【levenshtein distance】 【字符串相似度算法】 ''' def levenshtein(self,first,second): if len(first) > len(second): first,second = second,first if len(first) == 0: return len(second) if len(second) == 0: return len(first) first_length = len(first) + 1 second_length = len(second) + 1 distance_matrix = [range(second_length) for x in range(first_length)] #print distance_matrix for i in range(1,first_length): for j in range(1,second_length): deletion = distance_matrix[i-1][j] + 1 insertion = distance_matrix[i][j-1] + 1 substitution = distance_matrix[i-1][j-1] if first[i-1] != second[j-1]: substitution += 1 distance_matrix[i][j] = min(insertion,deletion,substitution) print distance_matrix return distance_matrix[first_length-1][second_length-1] if __name__ == "__main__": arith = arithmetic() print arith.levenshtein('GUMBOsdafsadfdsafsafsadfasfadsfasdfasdfs','GAMBOL00000000000dfasfasfdafsafasfasdfdsa'