tokyo cabinet源码分析--Hash算法

2009-12-20 10:56 pm

其实hash算法并不属于tokyo cabinet源码分析的一部分,不过鉴于hash算法在tokyo中的重要性
以及我对hash算法也没深入的研究过,就当学习发上来了。
这里主要针对hash database中的hash算法,第一次对key的hash算法采用的其实是大名鼎鼎的
Times 33(DJB),核心算法很简单:

  1. 1
  1. nHash = nHash * 33 + *key++;

在tokyo cabinet中是times33的小变种:

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  1. static uint64_t tchdbbidx(TCHDB *hdb, const char *kbuf, int ksiz, uint8_t *hp){
  2. assert(hdb && kbuf && ksiz >= 0 && hp);
  3. uint64_t idx = 19780211;
  4. uint32_t hash = 751;
  5. const char *rp = kbuf + ksiz;
  6. while(ksiz--){
  7. idx = idx * 37 + *(uint8_t *)kbuf++;
  8. hash = (hash * 31) ^ *(uint8_t *)--rp;
  9. }
  10. *hp = hash;
  11. return idx % hdb->bnum;
  12. }

可以看到cabinet并没有使用33,而是用了37,关于为何要用这些数字,网上找到一段:

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  1. # DJBX33A (Daniel J. Bernstein, Times 33 with Addition)
  2. #
  3. # This is Daniel J. Bernstein's popular `times 33' hash function as
  4. # posted by him years ago on comp.lang.c. It basically uses a function
  5. # like ``hash(i) = hash(i-1) * 33 + str[i]''. This is one of the best
  6. # known hash functions for strings. Because it is both computed very
  7. # fast and distributes very well.
  8. #
  9. # The magic of number 33, i.e. why it works better than many other
  10. # constants, prime or not, has never been adequately explained by
  11. # anyone. So I try an explanation: if one experimentally tests all
  12. # multipliers between 1 and 256 (as RSE did now) one detects that even
  13. # numbers are not useable at all. The remaining 128 odd numbers
  14. # (except for the number 1) work more or less all equally well. They
  15. # all distribute in an acceptable way and this way fill a hash table
  16. # with an average percent of approx. 86%.
  17. #
  18. # If one compares the Chi^2 values of the variants, the number 33 not
  19. # even has the best value. But the number 33 and a few other equally
  20. # good numbers like 17, 31, 63, 127 and 129 have nevertheless a great
  21. # advantage to the remaining numbers in the large set of possible
  22. # multipliers: their multiply operation can be replaced by a faster
  23. # operation based on just one shift plus either a single addition
  24. # or subtraction operation. And because a hash function has to both
  25. # distribute good _and_ has to be very fast to compute, those few
  26. # numbers should be preferred and seems to be the reason why Daniel J.
  27. # Bernstein also preferred it.
  28. #
  29. # -- Ralf S. Engelschall

找了一下发现用37的也挺多的,我们能知道Mikio Hirabayashi是1878年2月11日出生的,
看是Mikio还是挺喜欢在代码中展示一下自己的。作者在此其实进行了两次Hash。
一次是对bucket的index的hash,能得到bucket里面保存的文件数据offset指针,还有一个就是
用来比对key的hash,这样是为了解决hash的冲突问题,作者用的是拉链法算法来解决冲突问题。

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  1. while(off > 0){
  2. rec.off = off;
  3. if(!tchdbreadrec(hdb, &rec, rbuf)) return false;
  4. if(hash > rec.hash){
  5. off = rec.left;
  6. entoff = rec.off + (sizeof(uint8_t) + sizeof(uint8_t));
  7. } else if(hash ba64 ? sizeof(uint64_t) : sizeof(uint32_t));
  8. } else {
  9. if(!rec.kbuf && !tchdbreadrecbody(hdb, &rec)) return false;
  10. /*比较key为了解决hash的冲突*/
  11. int kcmp = tcreckeycmp(kbuf, ksiz, rec.kbuf, rec.ksiz);
  12. if(kcmp > 0){
  13. off = rec.left;
  14. TCFREE(rec.bbuf);
  15. rec.kbuf = NULL;
  16. rec.bbuf = NULL;
  17. entoff = rec.off + (sizeof(uint8_t) + sizeof(uint8_t));
  18. } else if(kcmp ba64 ? sizeof(uint64_t) : sizeof(uint32_t));
  19. } else {
  20. bool rv;
  21. int nvsiz;
  22. char *nvbuf;
  23. HDBPDPROCOP *procptr;
  24. .........................
  25. 存储..........
  26. .........................

上面是put存储时候的拉链法,可以看到通过两次查找确定最后存储的位置,也可以推论,如能
分配更多的bucket,hash算法将分布的更均匀,这样也减少了二次查找的时间复杂度,能提高
性能的,一般为记录数的0.5-4倍之间。

推荐(0)
收藏

Distributed persistent key-value

2009-11-24 12:32 pm

最近要用到分布式持久化存储的东西,研究了一下memcachedb和tokyo cabinet & tokyo tyrant,把做到ppt共享出来,
欢迎拍砖。。。。

分布式持久化key-value存储.pdf

推荐(0)
收藏