首页 > 代码库 > 后缀树

后缀树

A suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

The construction of such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc. Suffix trees also provide one of the first linear-time solutions for the longest common substring problem. These speedups come at a cost: storing a string‘s suffix tree typically requires significantly more space than storing the string itself.

1. 在文本T里查询T是否包含子串P(复杂度同流行的KMP相当)。
2. 文本T里找出最长重复子串。比如abcdabcefda里abc同da都重复出现,而最长重复子串是abc。
3. 找出字符串S1同S2的最长公共子串。注意不是常用作动态规划例子的LCS哈。比如字符串acdfg同akdfc的最长公共子串为df,而他们的LCS是adf。
4. Ziv-Lampel无损压缩算法。
5. 求字符串里的最长回文。

更多的应用:http://en.wikipedia.org/wiki/Suffix_tree#cite_note-FOOTNOTEFarach1997-3

The concept was first introduced by Weiner (1973), which Donald Knuth subsequently characterized as "Algorithm of the Year 1973". The construction was greatly simplified by McCreight (1976) , and also by Ukkonen (1995). Ukkonen provided the first online-construction of suffix trees, now known as Ukkonen‘s algorithm, with running time that matched the then fastest algorithms. These algorithms are all linear-time for a constant-size alphabet, and have worst-case running time of $O(n\log n)$ in general. The naive implementation for generating a suffix tree requires O($n^2$) or even O($n^3$) time, where n is the length of the string. By exploiting a number of algorithmic techniques, Ukkonen reduced this to O(n) (linear) time, for constant-size alphabets, and O(n log n) in general.

Farach (1997) gave the first suffix tree construction algorithm that is optimal for all alphabets. In particular, this is the first linear-time algorithm for strings drawn from an alphabet of integers in a polynomial range. Farach‘s algorithm has become the basis for new algorithms for constructing both suffix trees and suffix arrays, for example, in external memory, compressed, succinct, etc.

The suffix tree for the string S of length n is defined as a tree such that:

  • The tree has exactly n leaves numbered from 1 to n.
  • Except for the root, every internal node has at least two children.
  • Each edge is labeled with a non-empty substring of S.
  • No two edges starting out of a node can have string-labels beginning with the same character.
  • The string obtained by concatenating all the string-labels found on the path from the root to leaf i spells out suffix S[i..n], for i from 1 to n.

Since such a tree does not exist for all strings, S is padded with a terminal symbol not seen in the string (usually denoted $). This ensures that no suffix is a prefix of another, and that there will be n leaf nodes, one for each of the n suffixes of S. Since all internal non-root nodes are branching, there can be at most n − 1 such nodes, and n + (n − 1) + 1 = 2n nodes in total (n leaves, n − 1 internal non-root nodes, 1 root).

In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. If the path from the root to a node spells the string $\chi\alpha$, where $\chi$ is a single character and $\alpha$ is a string (possibly empty), it has a suffix link to the internal node representing $\alpha$. See for example the suffix link from the node for ANA to the node for NA in the figure above. Suffix links are also used in some algorithms running on the tree.

A suffix tree for a string S of length n can be built in $\Theta(n)$ time, if the letters come from an alphabet of integers in a polynomial range (in particular, this is true for constant-sized alphabets)

The suffix tree can be prepared for constant time lowest common ancestor retrieval between nodes in $\Theta(n)$ time.

Applications

Suffix trees can be used to solve a large number of string problems that occur in text-editing, free-text search, computational biology and other application areas. Primary applications include:

  • String search, in O(m) complexity, where m is the length of the sub-string (but with initial O(n) time required to build the suffix tree for the string) 对于data string建后缀树,然后用Trie搜索同样的方式去查找,代价就是O(m)。因为如果query string在data string中,那么它必然是data string某个后缀的前缀。
  • Finding the longest repeated substring。只需要搜索包含最多个叶子结点的串。因为每个出现的串,都要某一个后缀的前缀。以它为前缀的后缀个数,就等于它出现的次数。相当于要遍历一次树,树最多有2n个点,所以整个复杂度是O(n);
  • Finding the longest common substring 以两条字符串(分别以\$和#结尾)构建后缀树。然后查找最长的从root到非叶子结点的路径,并且该非叶子结点的孩子,一个包含\$,一个包含#。同样是搜索整棵树的代价。
  • Finding the longest palindrome in a string。O(n)逆转串,O(2n)建树,因为是要求对称中心,所以是求LCA(从叶子结点往root求),根据O(n)的预处理之后,求LCA的代价为O(1),每个位置都要求一遍,所以是O(n),所有总的代价就是(n+2n+n+1*n)=O(n)。(可参考http://www.cnblogs.com/luosongchao/p/3247478.html)

Suffix trees are often used in bioinformatics applications, searching for patterns in DNA or protein sequences (which can be viewed as long strings of characters). The ability to search efficiently with mismatches might be considered their greatest strength. Suffix trees are also used in data compression; they can be used to find repeated data, and can be used for the sorting stage of the Burrows–Wheeler transform. Variants of the LZW compression schemes use suffix trees (LZSS). A suffix tree is also used in suffix tree clustering, a data clustering algorithm used in some search engines.

Implementation

If each node and edge can be represented in $\Theta(1)$ space, the entire tree can be represented in $\Theta(n)$ space. The total length of all the strings on all of the edges in the tree is O($n^2$), but each edge can be stored as the position and length of a substring of S, giving a total space usage of $\Theta(n)$ computer words. The worst-case space usage of a suffix tree is seen with a fibonacci word, giving the full 2n nodes.

 

后缀树