Rabin Karp 算法实战

首页 > 代码库 > Rabin Karp 算法实战

2024-07-03 19:58:48 218人阅读

关键字

Rabin karp 算法, C++, ubuntu 14.04, linux, big integer, gmp

为了计算冗余度，我写出了如下算法

void HandleAMission(const char *srcFileName, FILE *output, int blockSize, int samplingRate, int cacheSize) {

const int MAGICNUM = 33; // hash for more, refer to

const int MAXINTEGER = 4294967295; // biggest unsigned int

struct IP_hdr *ip_hdr;

struct TCP_hdr *tcp_hdr;

struct UDP_hdr *udp_hdr;

char *payLoad;

char src_ip_char[STRSIZE], dst_ip_char[STRSIZE];

int proto; // udp for 0, tcp for 1

int payLoadLen;

PcapParser pcaparser(srcFileName); // init pcapparser

// init FIFOCache

FIFOCache fifocache(cacheSize);

// statistic

unsigned long totalByte = 0;

unsigned long savingByte = 0;

unsigned int localMax;

unsigned int candiMax;

// init big integer

mpz_class product(0); // current product of char array of blocksize

mpz_class part(1); // used as temp

mpz_class head(0); // leftmost

for(int i = 1; i < blockSize; i ++) {

head *= MAGICNUM;

}

while(pcaparser.NextPacket(&ip_hdr, &tcp_hdr, &udp_hdr, proto, &payLoad, payLoadLen) == 0) {

if(payLoadLen < 128) continue;

// init product for new packet

product = 0;

totalByte += payLoadLen;

// init a key

for(int i = 0; i < blockSize; i ++) {

product = product * MAGICNUM + (unsigned int)(payLoad[i]);

}

// Rabin Karp algorithm

for(int cursor = 1; cursor+samplingRate < payLoadLen; cursor += samplingRate) {

for(int offset = cursor; offset < cursor + samplingRate; offset ++) {

product = product - head * (unsigned int)(payLoad[offset-1]);

product = product * MAGICNUM;

product = product + (unsigned int)(payLoad[offset+blockSize-1]);

part = product % MAXINTEGER;

candiMax = part.get_ui();

if(candiMax > localMax) {

localMax = (unsigned int)candiMax;

}

if(fifocache.HasKey(localMax)) {

savingByte += blockSize;

} else {

fifocache.AddKey(localMax);

}

printf("Saving byte is %ld byte\n", savingByte);

printf("Total byte is %ld\n", totalByte);

printf("Redundancy rate is %lf\n", 1.0*savingByte/totalByte);

}

注意, 上面的代码中, Rabin Karp 部分我没有设计成接口, 实际上我写了一个 RabinKarp 类, 但经验告诉我, 处理计算密集型任务时, 不要搞什么花哨的接口, 继承, 重用, 越 dirty（代码都写到一个函数内）效率越高, 当然即便如此，我还是设计了个 PcapParse 类用于处理 pcap 文件, 经不住诱惑.

然后我运行了十个 testcase, 结果却不令我满意, 平均 1MB 一秒，相当于 1GB 要处理 1000s, 对于 T 级别的计算任务来看, 这显然是不可接受的.

我有两条路可以走，一个是自己设计大整数类，弃用 gmp big integer，二是弃用 Rabin Karp, 对输入字符串暴力哈希。

回想自己使用 Rabin Karp 的原因，是前人的 paper 说 rabin karp 能有效的减少计算时间，或许 rabin karp 本身足够快，但因使用 rk 算法引入的 gmp 却太慢，导致 rk 算法的优势尽失。我本来就怀疑 gmp 的性能，而事实的确验证了我的怀疑， gmp 太慢了。

呵，还是用 murmurhash 暴力哈希吧

声明：以上内容来自用户投稿及互联网公开渠道收集整理发布，本网站不拥有所有权，未作人工编辑处理，也不承担相关法律责任，若内容有误或涉及侵权可进行投诉：投诉/举报工作人员会在5个工作日内联系你，一经查实，本站将立刻删除涉嫌侵权内容。

联系
我们

首页 > 代码库 > Rabin Karp 算法实战

Rabin Karp 算法实战

看完仍有疑问？有类似问题直接问程序猿