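Code Review: Some Differences Between the Internal Mechanisms of Hash Tables in C# and Java
While reading the C# and Java source code I noticed that the two hash table implementations differ slightly, so I am sharing my notes here.
I think the mechanics of a hash table are best analyzed in two parts, split at the point where a collision occurs.
1. Before a collision
Before any collision occurs, the only factor that determines the speed of get and put is how quickly the hash function computes the position for a key. The memory footprint depends on the number of buckets required (which, for a known number of elements, follows from the load factor) and on the size of each bucket.
C#'s default load factor = 0.72: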
// Based on perf work, .72 is the optimal load factor for this table.
this.loadFactor = 0.72f * loadFactor;
Java's default load factor = 0.75:
Constructs an empty HashMap with the default initial capacity (16) and the default load factor (0.75).
public HashMap() {
    this.loadFactor = DEFAULT_LOAD_FACTOR; // all other fields defaulted
}
The Java source comments explain the load factor as follows:
This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets. Iteration over collection views requires time proportional to the “capacity” of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings). Thus, it’s very important not to set the initial capacity too high (or the load factor too low) if iteration performance is important.
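In other words: provided the hash function spreads elements sensibly across the buckets, get and put run in constant time. Iterating over the collection views takes time proportional to the number of buckets plus the number of key-value mappings, so if iteration performance matters, do not set the initial capacity too high (or the load factor too low).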
An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
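In other words: two parameters affect the performance of a HashMap instance, the initial capacity and the load factor. The capacity is the number of buckets, and the initial capacity is the number of buckets at creation time. The load factor decides when the table grows: once the number of entries exceeds the load factor multiplied by the current capacity, the table is rebuilt with roughly twice as many buckets.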
As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.
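In other words: the default load factor (0.75) is a good trade-off between space and time. A larger load factor reduces the space overhead but increases lookup time (which shows up in most HashMap operations, including get and put). When choosing the initial capacity, take the expected number of entries and the load factor into account so as to minimize the number of resizes. If the initial capacity is greater than the maximum number of entries divided by the load factor, the table is never rebuilt.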
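The sizing rule quoted above can be checked with a little arithmetic. The following is only a sketch: the class and method names (LoadFactorDemo, threshold, capacityFor) are invented for illustration and are not part of java.util.HashMap, and the upper bound of 1 << 30 is assumed to mirror HashMap's maximum capacity.

```java
// Hypothetical helper, not part of the JDK: illustrates the resize threshold
// and the "capacity > entries / loadFactor" sizing rule from the Javadoc.
final class LoadFactorDemo {
    // Assumed upper bound on the capacity, so the loop below cannot overflow.
    static final int MAX_CAPACITY = 1 << 30;

    // Number of entries the table may hold before it is resized.
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    // Smallest power-of-two capacity whose threshold is at least expectedEntries,
    // so that inserting expectedEntries entries does not trigger a resize.
    static int capacityFor(int expectedEntries, float loadFactor) {
        int capacity = 1;
        while (capacity < MAX_CAPACITY && threshold(capacity, loadFactor) < expectedEntries) {
            capacity <<= 1;
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(threshold(16, 0.75f));    // 12
        System.out.println(capacityFor(100, 0.75f)); // 256
    }
}
```

With the defaults (16 buckets, load factor 0.75) the table grows when the 13th entry is added; a map expected to hold 100 entries would need 256 buckets to avoid any resize under this rule.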
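Load factor: verdict
The load factors look essentially the same. Moreover, whether a load factor is reasonable depends closely on the initial capacity, the amount of data, and how many collisions the hash function produces, so comparing the two numbers in isolation is not very meaningful. Both camps evidently consider 0.7x a good value.
Java's bucket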
static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;
    final K key;
    V value;
    Node<K,V> next;

    Node(int hash, K key, V value, Node<K,V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }

    public final K getKey()        { return key; }
    public final V getValue()      { return value; }
    public final String toString() { return key + "=" + value; }

    public final int hashCode() {
        return Objects.hashCode(key) ^ Objects.hashCode(value);
    }

    public final V setValue(V newValue) {
        V oldValue = value;
        value = newValue;
        return oldValue;
    }

    public final boolean equals(Object o) {
        if (o == this)
            return true;
        if (o instanceof Map.Entry) {
            Map.Entry<?,?> e = (Map.Entry<?,?>)o;
            if (Objects.equals(key, e.getKey()) &&
                Objects.equals(value, e.getValue()))
                return true;
        }
        return false;
    }
}
C#'s bucket
private struct bucket {
public Object key;
public Object val;
public int hash_coll; // Store hash code; sign bit means there was a collision.
}
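Bucket: verdict
At first glance they look very different, but in terms of space they are close: Java stores a key-value pair plus an int and a reference (32 or 64 bits), while C# stores a key-value pair plus an int, so C# is slightly ahead. The Java node class has several methods, but a class's methods occupy only one copy of code space regardless of how many instances exist, so they can be ignored. As for performance, in C# an array of structs is somewhat faster than an array of objects, but that is hard to compare directly with Java's array of objects.
Java's hash function: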
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
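Quite simple: XOR the high 16 bits into the low 16 bits.
The value returned by the hash function is then converted into a bucket position by ANDing it with the number of buckets minus 1: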
index = (n - 1) & hash
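This is a mask operation that truncates the computed hash to a value smaller than the current number of buckets. For example, if the bucket count n is 16, then n - 1 = 15, whose bit vector is 00000000 00000000 00000000 00001111.
ANDing any value with this bit vector zeroes every bit except the last four.
So only the last few bits decide the outcome, and however complex the higher bits are, they are irrelevant. It would be unfortunate if a set of elements to be stored happened to share the same low-order bits. XORing the high and low 16 bits reduces the probability that the low-order bits of the keys' hash values coincide.
C#'s hash function is covered below, because it involves what happens after a collision.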
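The effect of the XOR step can be seen with two hash codes that differ only in their upper 16 bits. This is a small sketch; SpreadDemo, spread and indexFor are names made up for the illustration, and spread applies the same expression as the hash() method quoted above to a raw int.

```java
// Illustrative only: shows why the high bits are folded into the low bits
// before the mask (n - 1) & hash is applied.
final class SpreadDemo {
    // Same spreading step as HashMap.hash(), applied to a raw hash code.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Bucket index for a table with n buckets (n must be a power of two).
    static int indexFor(int hash, int n) {
        return (n - 1) & hash;
    }

    public static void main(String[] args) {
        int n = 16;
        int a = 0x00010000; // a and b differ only in the upper 16 bits
        int b = 0x00020000;
        // Without spreading, both land in bucket 0.
        System.out.println(indexFor(a, n) + " " + indexFor(b, n));                 // 0 0
        // With spreading, the upper bits influence the index.
        System.out.println(indexFor(spread(a), n) + " " + indexFor(spread(b), n)); // 1 2
    }
}
```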
2. After a collision
C#'s hash function:
private uint InitHash(Object key, int hashsize, out uint seed, out uint incr) {
uint hashcode = (uint) GetHash(key) & 0x7FFFFFFF;
seed = (uint) hashcode;
incr = (uint)(1 + ((seed * HashPrime) % ((uint)hashsize - 1)));
return hashcode;
}
Converting the return value into a bucket position; here the modulo plays the role of the mask operation:
int bucketNumber = (int) (seed % (uint)lbuckets.Length);
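Hash function: verdict
seed is the value returned by the hash function. The mask 0x7FFFFFFF clears the highest bit (the sign bit) of seed, because that bit is reserved in hash_coll to record whether a collision has occurred. Like Java, this costs one bitwise operation, but C# additionally computes incr. Comparing the speed of the hash functions alone, Java beats C#.
C#'s handling after a collision
Why does C# compute incr? Because it uses a mechanism known as "double hashing": every call to the hash function also produces an incr value, which is effectively the value of a second hash function. It comes into play when the bucket position derived from the primary hash value collides.
After a collision, the position of the next bucket is found with: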
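The way one int can carry both a 31-bit hash code and a collision marker in its sign bit can be sketched as follows. This is a Java illustration of the idea behind the hash_coll field, not the .NET code itself; HashCollDemo and its method names are hypothetical.

```java
// Illustrative only: packs a 31-bit hash code and a collision flag into one int,
// in the spirit of the hash_coll field of the C# bucket struct.
final class HashCollDemo {
    static final int HASH_MASK = 0x7FFFFFFF;      // lower 31 bits: the hash code
    static final int COLLISION_FLAG = 0x80000000; // top (sign) bit: collision marker

    // Drops the sign bit so the stored hash code is never negative.
    static int store(int rawHash) {
        return rawHash & HASH_MASK;
    }

    // Sets the sign bit to record that a probe passed through this bucket.
    static int markCollision(int hashColl) {
        return hashColl | COLLISION_FLAG;
    }

    // A negative value means the sign bit is set.
    static boolean hasCollision(int hashColl) {
        return hashColl < 0;
    }

    // Recovers the hash code regardless of the flag.
    static int hashOf(int hashColl) {
        return hashColl & HASH_MASK;
    }

    public static void main(String[] args) {
        int stored = store(-123456789);
        int marked = markCollision(stored);
        System.out.println(hasCollision(stored));     // false
        System.out.println(hasCollision(marked));     // true
        System.out.println(hashOf(marked) == stored); // true
    }
}
```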
bucketNumber = (int) (((long)bucketNumber + incr)% (uint)lbuckets.Length);
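This line sits inside a loop, so incr can also be viewed as an increment. Searching for the next empty bucket by stepping through the table in this way is called "open addressing". Take "linear probing", one form of open addressing: the hash function first yields bucket 5; bucket 5 is found to be occupied, so bucket 6 is tried next, and so on until an empty bucket is found. Double hashing replaces the constant increment of linear probing with the value of a second hash function: incr = (uint)(1 + ((seed * HashPrime) % ((uint)hashsize - 1))); After each detected collision this increment is added to the previous bucket position, until an empty bucket is found.
Java's handling after a collision
Let's look at how Java handles collisions through the put method.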
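The probe sequence described above can be reproduced in a few lines. This is a sketch in Java rather than the C# original: DoubleHashDemo and probeSequence are invented names, the value 101 for HashPrime is an assumption that should be checked against the .NET source, and the table size is assumed to be a prime of at least 3.

```java
// Illustrative only: lists the buckets visited by a double-hashing probe,
// following the seed/incr formulas quoted from the C# Hashtable.
final class DoubleHashDemo {
    static final int HASH_PRIME = 101; // assumed value of HashPrime

    // Returns the first `count` bucket indices probed for the given hash code.
    static int[] probeSequence(int hashCode, int bucketCount, int count) {
        long seed = hashCode & 0x7FFFFFFFL;                    // non-negative 31-bit seed
        long product = (seed * HASH_PRIME) & 0xFFFFFFFFL;      // mimic 32-bit uint wrap-around
        long incr = 1 + product % (bucketCount - 1);           // step in [1, bucketCount - 1]
        int[] probes = new int[count];
        long bucket = seed % bucketCount;                      // first bucket tried
        for (int i = 0; i < count; i++) {
            probes[i] = (int) bucket;
            bucket = (bucket + incr) % bucketCount;            // next bucket after a collision
        }
        return probes;
    }

    public static void main(String[] args) {
        // Hash code 5 in a table of 11 buckets: step is 1 + (505 % 10) = 6.
        System.out.println(java.util.Arrays.toString(probeSequence(5, 11, 5))); // [5, 0, 6, 1, 7]
    }
}
```

With a prime bucket count, any step between 1 and bucketCount - 1 is coprime to the table size, so the sequence reaches every bucket before it repeats.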
624 final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
625 boolean evict) {
626 Node<K,V>[] tab; Node<K,V> p; int n, i;
627 if ((tab = table) == null || (n = tab.length) == 0)
628 n = (tab = resize()).length;
629 if ((p = tab[i = (n - 1) & hash]) == null)
630 tab[i] = newNode(hash, key, value, null);
631 else {
632 Node<K,V> e; K k;
633 if (p.hash == hash &&
634 ((k = p.key) == key || (key != null && key.equals(k))))
635 e = p;
636 else if (p instanceof TreeNode)
637 e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
638 else {
639 for (int binCount = 0; ; ++binCount) {
640 if ((e = p.next) == null) {
641 p.next = newNode(hash, key, value, null);
642 if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
643 treeifyBin(tab, hash);
644 break;
645 }
646 if (e.hash == hash &&
647 ((k = e.key) == key || (key != null && key.equals(k))))
648 break;
649 p = e;
650 }
651 }
652 if (e != null) { // existing mapping for key
653 V oldValue = e.value;
654 if (!onlyIfAbsent || oldValue == null)
655 e.value = value;
656 afterNodeAccess(e);
657 return oldValue;
658 }
659 }
660 ++modCount;
661 if (++size > threshold)
662 resize();
663 afterNodeInsertion(evict);
664 return null;
665 }
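After a collision, the node already in the bucket becomes the head of a linked list, and the new key-value pair is linked after the existing nodes (line 641). Once the list reaches a certain depth, that bucket's list is converted into a tree (line 642); after the conversion, further collisions in the bucket insert the new key-value pair into the tree (line 637).
The tree in Java's hash table: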
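The list-walking loop at lines 639 to 650 can be isolated into a small sketch. ChainDemo and putInChain are hypothetical names, keys are simplified to non-null strings, and the return value merely signals the point at which HashMap would call treeifyBin; nothing is actually converted into a tree here.

```java
// Illustrative only: appends to a bucket's chain the way putVal does and reports
// when the chain is long enough that HashMap would try to treeify the bucket.
final class ChainDemo {
    static final int TREEIFY_THRESHOLD = 8; // same value as HashMap's constant

    static final class Node {
        final int hash;
        final String key;
        String value;
        Node next;

        Node(int hash, String key, String value) {
            this.hash = hash;
            this.key = key;
            this.value = value;
        }
    }

    // Inserts or updates key in the chain starting at head (must be non-null).
    // Returns true when the append makes the chain reach the treeify threshold.
    static boolean putInChain(Node head, int hash, String key, String value) {
        Node p = head;
        for (int binCount = 0; ; ++binCount) {
            if (p.hash == hash && p.key.equals(key)) {
                p.value = value;                 // existing mapping: overwrite
                return false;
            }
            if (p.next == null) {
                p.next = new Node(hash, key, value); // append at the tail
                return binCount >= TREEIFY_THRESHOLD - 1;
            }
            p = p.next;
        }
    }

    public static void main(String[] args) {
        Node head = new Node(1, "k1", "v1");
        for (int i = 2; i <= 9; i++) {
            boolean treeify = putInChain(head, 1, "k" + i, "v" + i);
            System.out.println("k" + i + " -> treeify: " + treeify);
        }
    }
}
```

In this sketch the signal first turns true when the ninth node is appended, which corresponds to the binCount >= TREEIFY_THRESHOLD - 1 check at line 642.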
static final class TreeNode<K,V> extends LinkedHashMap.Entry<K,V> {
    TreeNode<K,V> parent;  // red-black tree links
    TreeNode<K,V> left;
    TreeNode<K,V> right;
    TreeNode<K,V> prev;    // needed to unlink next upon deletion
    boolean red;
    TreeNode(int hash, K key, V val, Node<K,V> next) {
        super(hash, key, val, next);
    }
    // ... (remaining members omitted in this excerpt)
}
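Judging from the comments, it is a red-black tree.
Collision handling: verdict
Time:
After a collision Java operates on a linked list. Linked lists are fast to delete from but slow to insert into and slow to traverse, so at the level of a hash table their efficiency is mediocre, although this can be mitigated by limiting the depth of the list. On the C# side it depends on the efficiency of the double hashing algorithm, that is, on how quickly the right bucket is found once the secondary incr computation is added. My understanding of this algorithm is limited, so I cannot draw a conclusion yet.
Space:
C# needs no linked lists. All the structs live in a single array at contiguous addresses, which makes traversal fast, wastes no memory, and keeps the data better ordered, so it beats Java here.
C#'s resize method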
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail)]
private void rehash( int newsize, bool forceNewHashCode ) {
// reset occupancy
occupancy=0;
// Don't replace any internal state until we've finished adding to the
// new bucket[]. This serves two purposes:
// 1) Allow concurrent readers to see valid hashtable contents
// at all times
// 2) Protect against an OutOfMemoryException while allocating this
// new bucket[].
bucket[] newBuckets = new bucket[newsize];
// rehash table into new buckets
int nb;
for (nb = 0; nb < buckets.Length; nb++){
bucket oldb = buckets[nb];
if ((oldb.key != null) && (oldb.key != buckets)) {
int hashcode = ((forceNewHashCode ? GetHash(oldb.key) : oldb.hash_coll) & 0x7FFFFFFF);
putEntry(newBuckets, oldb.key, oldb.val, hashcode);
}
}
// New bucket[] is good to go - replace buckets and other internal state.
#if !FEATURE_CORECLR
Thread.BeginCriticalRegion();
#endif
isWriterInProgress = true;
buckets = newBuckets;
loadsize = (int)(loadFactor * newsize);
UpdateVersion();
isWriterInProgress = false;
#if !FEATURE_CORECLR
Thread.EndCriticalRegion();
#endif
// minimun size of hashtable is 3 now and maximum loadFactor is 0.72 now.
Contract.Assert(loadsize < newsize, "Our current implementaion means this is not possible.");
return;
}
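Nothing complicated: create a new, larger table, walk every entry of the old table, recompute its position through the hash function, and place it in the new table.
Java's resize method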
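The same rebuild-and-reinsert idea can be sketched for an open-addressing table. This is a simplified Java illustration rather than a port of the C# rehash: RehashDemo, putEntry and rehash are hypothetical, keys are plain integers stored in an Integer[] with null meaning an empty bucket, the value 101 for HashPrime is assumed, and the table is assumed to have a prime size and never to be full.

```java
// Illustrative only: rebuilds an open-addressing table by re-inserting every key
// into a larger array, using a double-hashing probe for placement.
final class RehashDemo {
    static final int HASH_PRIME = 101; // assumed value of HashPrime

    // Places key into the first free slot along its probe sequence.
    // Assumes buckets has at least one free slot and a prime length >= 3.
    static void putEntry(Integer[] buckets, int key) {
        long seed = Integer.hashCode(key) & 0x7FFFFFFFL;
        long incr = 1 + ((seed * HASH_PRIME) & 0xFFFFFFFFL) % (buckets.length - 1);
        int i = (int) (seed % buckets.length);
        while (buckets[i] != null) {
            i = (int) ((i + incr) % buckets.length); // collision: step to the next bucket
        }
        buckets[i] = key;
    }

    // Builds the new table first and returns it, leaving the old table untouched.
    static Integer[] rehash(Integer[] old, int newSize) {
        Integer[] fresh = new Integer[newSize];
        for (Integer key : old) {
            if (key != null) {
                putEntry(fresh, key);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        Integer[] old = new Integer[7];
        putEntry(old, 1);
        putEntry(old, 8);  // collides with 1 in a table of 7
        putEntry(old, 15); // collides with 1 in a table of 7
        Integer[] fresh = rehash(old, 17);
        System.out.println(java.util.Arrays.toString(fresh));
    }
}
```

In a table of 7 buckets the keys 1, 8 and 15 all start at bucket 1 and have to probe; in the table of 17 buckets each key finds its first-choice bucket free.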
676 final Node<K,V>[] resize() {
677 Node<K,V>[] oldTab = table;
678 int oldCap = (oldTab == null) ? 0 : oldTab.length;
679 int oldThr = threshold;
680 int newCap, newThr = 0;
681 if (oldCap > 0) {
682 if (oldCap >= MAXIMUM_CAPACITY) {
683 threshold = Integer.MAX_VALUE;
684 return oldTab;
685 }
686 else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
687 oldCap >= DEFAULT_INITIAL_CAPACITY)
688 newThr = oldThr << 1; // double threshold
689 }
690 else if (oldThr > 0) // initial capacity was placed in threshold
691 newCap = oldThr;
692 else { // zero initial threshold signifies using defaults
693 newCap = DEFAULT_INITIAL_CAPACITY;
694 newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
695 }
696 if (newThr == 0) {
697 float ft = (float)newCap * loadFactor;
698 newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
699 (int)ft : Integer.MAX_VALUE);
700 }
701 threshold = newThr;
702 @SuppressWarnings({"rawtypes","unchecked"})
703 Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
704 table = newTab;
705 if (oldTab != null) {
706 for (int j = 0; j < oldCap; ++j) {
707 Node<K,V> e;
708 if ((e = oldTab[j]) != null) {
709 oldTab[j] = null;
710 if (e.next == null)
711 newTab[e.hash & (newCap - 1)] = e;
712 else if (e instanceof TreeNode)
713 ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
714 else { // preserve order
715 Node<K,V> loHead = null, loTail = null;
716 Node<K,V> hiHead = null, hiTail = null;
717 Node<K,V> next;
718 do {
719 next = e.next;
720 if ((e.hash & oldCap) == 0) {
721 if (loTail == null)
722 loHead = e;
723 else
724 loTail.next = e;
725 loTail = e;
726 }
727 else {
728 if (hiTail == null)
729 hiHead = e;
730 else
731 hiTail.next = e;
732 hiTail = e;
733 }
734 } while ((e = next) != null);
735 if (loTail != null) {
736 loTail.next = null;
737 newTab[j] = loHead;
738 }
739 if (hiTail != null) {
740 hiTail.next = null;
741 newTab[j + oldCap] = hiHead;
742 }
743 }
744 }
745 }
746 }
747 return newTab;
748 }
As the source comment notes, after resizing each node either stays where it is or moves to its old index plus a power-of-two offset:
Initializes or doubles table size. If null, allocates in accord with initial capacity target held in field threshold. Otherwise, because we are using power-of-two expansion, the elements from each bin must either stay at same index, or move with a power of two offset in the new table.
How is this achieved?
711 newTab[e.hash & (newCap - 1)] = e;
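Look at line 711. Suppose oldCap = 16 and newCap = 32, and the bit vector of e1.hash is ...01010 (the higher bits do not matter here). Then:
  ...01010   (e1.hash)
& ...11111   (newCap - 1)
= ...01010
e1's old index was 1010 (decimal 10), and its new index is unchanged.
Now suppose the bit vector of e2.hash is ...11010. Then:
  ...11010   (e2.hash)
& ...11111   (newCap - 1)
= ...11010
e2's old index was 1010 (decimal 10), and its new index is 11010 (decimal 26), an increase of 16.
Now consider the two together: e1 and e2 mapped to the same index in the old table, so they collided and e2 was chained behind e1. After redistribution e1 stays put, while e2 moves to oldIndex + 16 (see line 720, if ((e.hash & oldCap) == 0), which achieves the same effect as e.hash & (newCap - 1) on line 711). Through this mechanism the nodes of each bucket are split in two and spread evenly again across the new buckets.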
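The equivalence between the single-bit test on line 720 and the full mask on line 711 can be verified directly. The sketch below uses invented names (SplitDemo, newIndex) and works on raw hash values rather than on HashMap nodes.

```java
// Illustrative only: computes an entry's index after the table doubles,
// using the (hash & oldCap) test instead of recomputing the full mask.
final class SplitDemo {
    // New index of an entry after the table grows from oldCap to 2 * oldCap.
    static int newIndex(int hash, int oldCap) {
        int oldIndex = hash & (oldCap - 1);
        // The bit selected by oldCap decides: stay ("lo" list) or move by oldCap ("hi" list).
        return (hash & oldCap) == 0 ? oldIndex : oldIndex + oldCap;
    }

    public static void main(String[] args) {
        int oldCap = 16;
        System.out.println(newIndex(0b01010, oldCap)); // 10: e1 stays at its old index
        System.out.println(newIndex(0b11010, oldCap)); // 26: e2 moves by oldCap
        // The shortcut agrees with masking against the new capacity.
        System.out.println(newIndex(0b11010, oldCap) == (0b11010 & (2 * oldCap - 1))); // true
    }
}
```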
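Resize method: verdict
Java wins outright over C#: with a single loop and some bit operations it redistributes the entries cleverly, whereas C# has to recompute the bucket position of every entry (putEntry repeats the modulo and the probing, although the stored hash code is normally reused rather than recomputed). When trees or linked lists are present Java has to relink the lists, which makes the code look complicated, but it also avoids array assignments, so it still comes out slightly ahead of C#. In terms of space, both roughly double the capacity.
3. Summary
Looking only at the details, each side wins some and loses some, but that means little in practice. Overall performance is what matters, and that has to be judged with experiments and data. Since a hash table is only a tool, and the data volume and the mix of operations cannot be fixed in advance, experiments cannot yield an absolute conclusion either; quite possibly each wins in different environments. This article mainly aims to show how two languages implement the same tool differently. People say C# is a knock-off of Java; perhaps that holds at the architectural level, but at the level of source code, double hashing and the list-to-red-black-tree conversion are each clever in their own way and are in fact clearly distinct.
Appendix:
1. C# Hashtable source: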
https://referencesource.microsoft.com/#mscorlib/system/collections/hashtable.cs,adc3a33902ee081a
2. Java HashMap source:
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/HashMap.java#HashMap.resize%28%29
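3. A simple homemade hash table (initial capacity 16, load factor 0.75, collisions handled with linked lists, hash function XORs the high and low 16 bits; no resize implementation and no remove method yet, and null is not supported):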
/* Script name: LiuHashMap.cs
* Created on: 16-2-2017
* Author: Liu
* Purpose: This is a simple version of HashMap. Demonstrates some basic functions of the collection type.
* History: 21-2-2017, improved the code.
*/
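using System.Collections.Generic;
using System;
public class LiuHashMap<K,V> {
    // Nested node class that stores one key-value pair (reuses the outer K and V)
    class Liu_node {
        public K key;
        public V value;
        // Link to the next node in the same bucket
        public Liu_node next;
        // Node constructor
        public Liu_node(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }
    // Capacity (number of buckets); must be a power of two for the index mask
    private int initial_capacity = 16;
    // Load factor
    private float load_factor = 0.75f;
    // Node array, i.e. the buckets
    private Liu_node[] liu_nodes;
    // Maximum number of entries allowed before resizing
    private int threshold;
    // Current number of entries
    private int entries_number = 0;
    // Upper limit on entries: the maximum value of a 32-bit int
    private const int MAXIMUM_ENTRIES = 2147483647;
    // Constructors
    public LiuHashMap() : this(16, 0.75f) {
    }
    public LiuHashMap(int capacity, float loadFactor) {
        // The index mask only works for a positive power-of-two capacity
        if (capacity <= 0 || (capacity & (capacity - 1)) != 0)
            throw new ArgumentOutOfRangeException();
        if (!(loadFactor >= 0.1f && loadFactor <= 1.0f))
            throw new ArgumentOutOfRangeException();
        initial_capacity = capacity;
        load_factor = loadFactor;
        threshold = (int)(capacity * loadFactor);
        liu_nodes = new Liu_node[capacity];
    }
    // Adds a key-value pair, or overwrites the value if the key already exists
    public void Add(K key, V value) {
        if (key == null)
            throw new ArgumentNullException("key");
        // Use the hash function to compute the index of the bucket for this key
        int index = hash(key) & (initial_capacity - 1);
        // Walk the bucket's chain; an occupied bucket means a collision
        Liu_node current_node = liu_nodes[index];
        Liu_node last_node = null;
        while (current_node != null) {
            // The key is already stored: just replace its value
            if (EqualityComparer<K>.Default.Equals(current_node.key, key)) {
                current_node.value = value;
                return;
            }
            last_node = current_node;
            current_node = current_node.next;
        }
        Liu_node new_node = new Liu_node(key, value);
        if (last_node == null) {
            // Empty bucket: the new node becomes its head
            liu_nodes[index] = new_node;
        } else {
            // Collision: append the new node to the tail of the chain
            last_node.next = new_node;
        }
        // One more entry
        entries_number += 1;
        // Resize once the entry count exceeds the threshold
        if (entries_number > threshold) {
            Resize();
        }
    }
    // Finds the node for a key, or returns null if the key is absent
    private Liu_node FindNode(K key) {
        // Compute the bucket position from the key being searched
        int index = hash(key) & (initial_capacity - 1);
        Liu_node current_node = liu_nodes[index];
        // A bucket may hold several nodes (collisions), so walk the chain
        while (current_node != null) {
            // Use an equality comparison to find the right key
            if (EqualityComparer<K>.Default.Equals(current_node.key, key)) {
                return current_node;
            }
            current_node = current_node.next;
        }
        return null;
    }
    // Returns the value for a key, or default(V) if the key is absent
    public V GetValue(K key) {
        Liu_node node = FindNode(key);
        return node == null ? default(V) : node.value;
    }
    // Checks whether the key exists
    public bool Contains(K key) {
        return FindNode(key) != null;
    }
    // Resize method (not implemented yet)
    private void Resize() {
        // Give up if doubling would exceed the limit (compared without overflowing int)
        if (initial_capacity > MAXIMUM_ENTRIES / 2) {
            return;
        }
        // Resizing logic omitted for now
    }
    // Hash function
    private int hash(K key) {
        // Java-style XOR of the high and low 16 bits; the uint cast turns >> into
        // a logical shift, matching Java's >>>
        if (key == null) return 0;
        int h = key.GetHashCode();
        return h ^ (int)((uint)h >> 16);
    }
}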