larbin之哈希之谈

首页 > 代码库 > larbin之哈希之谈

2024-07-15 20:13:07 218人阅读

由于工作原因，打算对larbin的源码进行分析一番

用的事2.6.3版本的larbin源码，由于这是业余，会断断续续的分析上传，已做记录笔记

今天我们分析一下larbin的哈希表

这个哈希表结构比较简单，因为它的主要用处是排重，因此只给出了用于排重的简单函数，

我们来看一下头文件怎么定义的：

// Larbin// Sebastien Ailleret// 23-11-99 -> 14-01-00/* class hashTable * This class is in charge of making sure we don‘t crawl twice the same url */#ifndef HASHTABLE_H#define HASHTABLE_H#include "types.h"#include "utils/url.h"class hashTable { private:  ssize_t size;  char *table; public:  /* constructor */  hashTable (bool create);  /* destructor */  ~hashTable ();  /* save the hashTable in a file */  void save();  /* test if this url is allready in the hashtable   * return true if it has been added   * return false if it has allready been seen   */  bool test (url *U);  /* set a url as present in the hashtable   */  void set (url *U);  /* add a new url in the hashtable   * return true if it has been added   * return false if it has allready been seen   */  bool testSet (url *U);};#endif // HASHTABLE_H

由头文件我们可以看出，这个哈希表仅仅有四个成员函数（除了构造和析构）

save 函数是用于保存哈希表内部的数据，用于防止程序异常退出而造成数据丢失，因此把哈希内数据保存到一个文件中

test 函数用于测试参数指定的URL是否在哈希表内存在，只要是排重

set 函数就是判断出需要设置哈希表内值得时候设置该位置的URL对应的值，表示该URL从此开始存在于哈希表中

testset 是一个辅助函数，先判断，然后设置该位置的值，并且返回设置前的判断结果

下面我们就仔细来看一看各个函数的实现，比较简单，我就在程序中做了简单注释，就不再多余的文字解释了

构造函数：

hashTable::hashTable (bool create) {  //构造函数，申请哈希需求的空间并初始化  ssize_t total = hashSize/8;   //因为是位集合判断，所以每个字节8位，对于哈希的总成都除以8  table = new char[total];      //申请哈希空间，其实这个地方主要是以数组巧妙勾勒哈希功能  if (create) {                 //是一个标志，也就是说哈希内部的数据是从文件内读取还是初始化位0    for (ssize_t i=0; i<hashSize/8; i++) {      table[i] = 0;    }  } else {    int fds = open("hashtable.bak", O_RDONLY);   //从bak备份文件读取数据    if (fds < 0) {      cerr << "Cannot find hashtable.bak, restart from scratch\n";      for (ssize_t i=0; i<hashSize/8; i++) {     //如果打开备份文件失败，就重新赋值位0，当做第一次看待        table[i] = 0;      }    } else {      ssize_t sr = 0;      while (sr < total) {        ssize_t tmp = read(fds, table+sr, total-sr); //然后循环读取文件，直到成功读取所有数据        if (tmp <= 0) {          cerr << "Cannot read hashtable.bak : "               << strerror(errno) << endl;          exit(1);        } else {          sr += tmp;        //增加每次读取的数据        }      }      close(fds);          //关闭文件描述符    }  }}

析构函数：

hashTable::~hashTable () {    //析构函数，释放哈希申请的空间  delete [] table;}

测试函数test:

bool hashTable::test (url *U) {    //判断该url对应的是否存在哈希中，如果存在返回true，否则false  unsigned int code = U->hashCode();  //根据hashCode函数求散列值  unsigned int pos = code / 8;          unsigned int bits = 1 << (code % 8);  return table[pos] & bits;         }

设置函数：

void hashTable::set (url *U) {   //设置url对应哈希值  unsigned int code = U->hashCode();  unsigned int pos = code / 8;  unsigned int bits = 1 << (code % 8);  table[pos] |= bits;}

测试设置函数：

bool hashTable::testSet (url *U) { //返回测试结果，并且设置url对应的值  unsigned int code = U->hashCode();  unsigned int pos = code / 8;  unsigned int bits = 1 << (code % 8);  int res = table[pos] & bits;  table[pos] |= bits;  return !res;}

保存文件函数：

void hashTable::save() {     //把哈希内部数据存到文件，该过程是循序渐进的，防止时间间隔过程造成数据大量失真  rename("hashtable.bak", "hashtable.old");   //先把先前备份文件保存，直到最后本次成功备份后删除  int fds = creat("hashtable.bak", 00600);  if (fds >= 0) {    ecrireBuff(fds, table, hashSize/8);       //辅助函数，把哈希数据存到备份文件    close(fds);  }  unlink("hashtable.old");}

该哈希的处理部分就这么多，下面重点来看看我们两个知识点

1，散列函数 hashCode：

uint url::hashCode () {  unsigned int h=port;  unsigned int i=0;  while (host[i] != 0) {    h = 31*h + host[i];    i++;  }  i=0;  while (file[i] != 0) {    h = 31*h + file[i];    i++;  }  return h % hashSize;}

说起来散列函数，要求很有艺术的，而且散列函数也不可能有百分百的通用性。

一般都要自己根据哈希设置自己的散列函数。最起码要设置某些数值，用同一个散列方法和框架

该散列函数比较简单，就是把host和file（URL类中的两个字段，表示主机和文件路径）依次乘以31

然后对哈希最大值求余数，最大值这样定义的：

#define hashSize 64000000

另外对于host和file的先后哈希顺序也是设计的，先host而后file是为了让同一host对应的file的差异更大，减缓相似冲突

2，下面我们就来谈谈URL这个类，上面那个哈希散列函数就是这个类中的一个成员函数，之所以单独摘出去说，是因为散列函数也是重大的一块

我们先来看一下URL类的定义：

class url { private:  char *host;  char *file;  uint16_t port; // the order of variables is important for physical size  int8_t depth;  /* parse the url */  void parse (char *s);  /** parse a file with base */  void parseWithBase (char *u, url *base);  /* normalize file name */  bool normalize (char *file);  /* Does this url starts with a protocol name */  bool isProtocol (char *s);  /* constructor used by giveBase */  url (char *host, uint port, char *file); public:  /* Constructor : Parses an url (u is deleted) */  url (char *u, int8_t depth, url *base);  /* constructor used by input */  url (char *line, int8_t depth);  /* Constructor : read the url from a file (cf serialize) */  url (char *line);  /* Destructor */  ~url ();  /* inet addr (once calculated) */  struct in_addr addr;  /* Is it a valid url ? */  bool isValid ();  /* print an URL */  void print ();  /* return the host */  inline char *getHost () { return host; }  /* return the port */  inline uint getPort () { return port; }  /* return the file */  inline char *getFile () { return file; }  /** Depth in the Site */  inline int8_t getDepth () { return depth; }  /* Set depth to max if we are at an entry point in the site   * try to find the ip addr   * answer false if forbidden by robots.txt, true otherwise */  bool initOK (url *from);  /** return the base of the url   * give means that you have to delete the string yourself   */  url *giveBase ();  /** return a char * representation of the url   * give means that you have to delete the string yourself   */  char *giveUrl ();  /** write the url in a buffer   * buf must be at least of size maxUrlSize   * returns the size of what has been written (not including ‘\0‘)   */  int writeUrl (char *buf);  /* serialize the url for the Persistent Fifo */  char *serialize ();  /* very thread unsafe serialisation in a static buffer */  char *getUrl();  /* return a hashcode for the host of this url */  uint hostHashCode ();  /* return a hashcode for this url */  uint hashCode ();#ifdef URL_TAGS  /* tag associated to this url */  uint tag;#endif // URL_TAGS#ifdef COOKIES  /* cookies associated with this page */  char *cookie;  void addCookie(char *header);#else // COOKIES  inline void addCookie(char *header) {}#endif // COOKIES};

稍后待续。。。

声明：以上内容来自用户投稿及互联网公开渠道收集整理发布，本网站不拥有所有权，未作人工编辑处理，也不承担相关法律责任，若内容有误或涉及侵权可进行投诉：投诉/举报工作人员会在5个工作日内联系你，一经查实，本站将立刻删除涉嫌侵权内容。

联系
我们

首页 > 代码库 > larbin之哈希之谈

larbin之哈希之谈

看完仍有疑问？有类似问题直接问程序猿