其实hash算法并不属于tokyo cabinet源码分析的一部分,不过鉴于hash算法在tokyo中的重要性
以及我对hash算法也没深入的研究过,就当学习发上来了。
这里主要针对hash database中的hash算法,第一次对key的hash算法采用的其实是大名鼎鼎的
Times 33(DJB),核心算法很简单:
1
| nHash = nHash * 33 + *key++;
|
在tokyo cabinet中是times33的小变种:
1 2 3 4 5 6 7 8 9 10 11 12
| static uint64_t tchdbbidx(TCHDB *hdb, const char *kbuf, int ksiz, uint8_t *hp){ assert(hdb && kbuf && ksiz >= 0 && hp); uint64_t idx = 19780211; uint32_t hash = 751; const char *rp = kbuf + ksiz; while(ksiz--){ idx = idx * 37 + *(uint8_t *)kbuf++; hash = (hash * 31) ^ *(uint8_t *)--rp; } *hp = hash; return idx % hdb->bnum; }
|
可以看到cabinet并没有使用33,而是用了37,关于为何要用这些数字,网上找到一段:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
| # DJBX33A (Daniel J. Bernstein, Times 33 with Addition) # # This is Daniel J. Bernstein's popular `times 33' hash function as # posted by him years ago on comp.lang.c. It basically uses a function # like ``hash(i) = hash(i-1) * 33 + str[i]''. This is one of the best # known hash functions for strings. Because it is both computed very # fast and distributes very well. # # The magic of number 33, i.e. why it works better than many other # constants, prime or not, has never been adequately explained by # anyone. So I try an explanation: if one experimentally tests all # multipliers between 1 and 256 (as RSE did now) one detects that even # numbers are not useable at all. The remaining 128 odd numbers # (except for the number 1) work more or less all equally well. They # all distribute in an acceptable way and this way fill a hash table # with an average percent of approx. 86%. # # If one compares the Chi^2 values of the variants, the number 33 not # even has the best value. But the number 33 and a few other equally # good numbers like 17, 31, 63, 127 and 129 have nevertheless a great # advantage to the remaining numbers in the large set of possible # multipliers: their multiply operation can be replaced by a faster # operation based on just one shift plus either a single addition # or subtraction operation. And because a hash function has to both # distribute good _and_ has to be very fast to compute, those few # numbers should be preferred and seems to be the reason why Daniel J. # Bernstein also preferred it. # # -- Ralf S. Engelschall
|
找了一下发现用37的也挺多的,我们能知道Mikio Hirabayashi是1878年2月11日出生的,
看是Mikio还是挺喜欢在代码中展示一下自己的。作者在此其实进行了两次Hash。
一次是对bucket的index的hash,能得到bucket里面保存的文件数据offset指针,还有一个就是
用来比对key的hash,这样是为了解决hash的冲突问题,作者用的是拉链法算法来解决冲突问题。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
| while(off > 0){ rec.off = off; if(!tchdbreadrec(hdb, &rec, rbuf)) return false; if(hash > rec.hash){ off = rec.left; entoff = rec.off + (sizeof(uint8_t) + sizeof(uint8_t)); } else if(hash ba64 ? sizeof(uint64_t) : sizeof(uint32_t)); } else { if(!rec.kbuf && !tchdbreadrecbody(hdb, &rec)) return false; /*比较key为了解决hash的冲突*/ int kcmp = tcreckeycmp(kbuf, ksiz, rec.kbuf, rec.ksiz); if(kcmp > 0){ off = rec.left; TCFREE(rec.bbuf); rec.kbuf = NULL; rec.bbuf = NULL; entoff = rec.off + (sizeof(uint8_t) + sizeof(uint8_t)); } else if(kcmp ba64 ? sizeof(uint64_t) : sizeof(uint32_t)); } else { bool rv; int nvsiz; char *nvbuf; HDBPDPROCOP *procptr; ......................... 存储.......... .........................
|
上面是put存储时候的拉链法,可以看到通过两次查找确定最后存储的位置,也可以推论,如能
分配更多的bucket,hash算法将分布的更均匀,这样也减少了二次查找的时间复杂度,能提高
性能的,一般为记录数的0.5-4倍之间。