五分钟了解 HashMap 的散列函数

最编程 2024-06-15 15:44:07

...

/**
 * Computes key.hashCode() and spreads (XORs) higher bits of hash
 * to lower.  Because the table uses power-of-two masking, sets of
 * hashes that vary only in bits above the current mask will
 * always collide. (Among known examples are sets of Float keys
 * holding consecutive whole numbers in small tables.)  So we
 * apply a transform that spreads the impact of higher bits
 * downward. There is a tradeoff between speed, utility, and
 * quality of bit-spreading. Because many common sets of hashes
 * are already reasonably distributed (so don't benefit from
 * spreading), and because we use trees to handle large sets of
 * collisions in bins, we just XOR some shifted bits in the
 * cheapest possible way to reduce systematic lossage, as well as
 * to incorporate impact of the highest bits that would otherwise
 * never be used in index calculations because of table bounds.
 *
 * [这是用软件翻译成的中文，相信很多人看的半知半解，看完整片文章，回来在看注释就很好理解了]
 * 计算 key.hashCode() 并将散列的高位（XORs）扩展到低位。
 * 因为该表使用了2的幂掩码，所以在当前掩码上方位变化的哈希集总是会发生冲突。
 * （众所周知的例子是在小tables中保存连续整数的浮点键集）
 * 因此，我们应用了一种向下传播高位影响的变换。在比特传播的速度、效用和质量之间存在一种折衷。
 * 由于许多常见的散列集已经合理分布（因此无法从传播中获益），并且由于我们使用树来处理容器中的大型冲突集，
 * 因此我们只需以最便宜的方式对一些移位位进行异或，以减少系统损失，以及合并最高位的影响，
 * 否则，由于表边界，最高位将永远不会用于索引计算。
 *
 *  注意这个注释里面有细节！
 * 
 *  key 为空时，hash值 = 0
 *  key 不为空，hash值 = hashCode 异或 (hashCoded 无符号右移 16 位)
 */
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

右移和无符号右移

1111 1111 1111 1111  1111 1111 1111 1011     -5 
1111 1111 1111 1111  1111 1111 1111 1111     -5 >> 16   负数右移高位以原有标志位补充
0000 0000 0000 0000  1111 1111 1111 1111     -5 >>> 16  负数无符号右移高位补 0

0000 0000 0000 0000  0000 0000 0000 0101      5 
0000 0000 0000 0000  0000 0000 0000 0000      5 >> 16   正数右移高位补 0
0000 0000 0000 0000  0000 0000 0000 0000      5 >>> 16  正数无符号右移高位补 0

先说这个 hash() 函数的计算逻辑，为了更好的展示，将输出的二进制字符串格式化打印出来看下

结合下图看，"zhangsan" 的 hashCode = -1432604556，以二进制展示就是图片左边的第一行（重点来了）

 1、将 key 原有的 hashCode 往右移 16 位。
    第一行高位的 16 位，在第二行变成了低位的 16 位，第二行高位的 16 位全部补0
   (仔细想一下，将一张纸从中间对折，）
   
 2、现在有一个原始的 hashCode（第一行） 和一个右移 16 位的 hashCode（第二行）
   将它俩进行异或(^)操作，因为第二行高位都是 0，所以高位异或不会变，参与异或的只有两个低位。
   得到第三行的结果
   
   再仔细看一下第一行和第二行的异或过程，其实低位的运算，不就是拿原始 hashCode 的高位和低位异或么。
   得到结果是高位不变，低位 = (原始的高位 ^ 原始的低位)
   （低位的计算就像将一张纸对折，想想还有点代码之美）

/**
 * 将 hashCode 转为二进制字符串并 4 位一格式化
 */
public static String formatBinaryString(int hashCode) {
    String binaryString = Integer.toBinaryString(hashCode);
    StringBuilder sb = new StringBuilder(binaryString);
    // 高位补 0
    while (sb.length() < 32) {
        sb.insert(0, "0");
    }
    int space = 3;
    binaryString = sb.toString();
    sb.setLength(0);
    for (int i = 0; i < binaryString.length(); i++) {
        sb.append(binaryString.charAt(i));
        if (i - space == 0) {
            sb.append(" ");
            space = space + 4;
            if (i == 15) {
                sb.append(" ");
            }
        }
    }
    return sb.toString();
}


public static void main(String[] args) {
    Object key = "zhangsan";
    int h = key.hashCode();
    
    String hashCode = formatBinaryString(h);
    System.out.println(hashCode + "   原始 hashCode");

    //右移 16 位
    String rightHashCode = formatBinaryString(h >>> 16);
    System.out.println(rightHashCode + "   右移 16 位后的 hashCode");

    //异或(^)
    String hash = formatBinaryString(h ^ (h >>> 16));
    System.out.println(hash + "   (原始 hashCode) ^ (右移 16 位后的 hashCode)");
    
    // 注意这里是 与(&) 运算
    String hashAnd = formatBinaryString(h & (h >>> 16));
    System.out.println(hashAnd + "   (原始 hashCode) & (右移 16 位后的 hashCode)");
}

输出结果：
    1010 1010 1001 1100   0011 0000 0111 0100    原始 hashCode
    0000 0000 0000 0000   1010 1010 1001 1100    右移 16 位后的 hashCode
    1010 1010 1001 1100   1001 1010 1110 1000    (原始 hashCode) ^ (右移 16 位后的 hashCode)
    
    0000 0000 0000 0000   0010 0000 0001 0100    (原始 hashCode) & (右移 16 位后的 hashCode)

看到这里相信你已经懂了 HashMap 的 hash 算法是如何计算的了。

（扩展一下）上面代码的最后一行突发奇想尝试使用与(&) 运算试一下会怎样，结果可以发现与(&) 运算结果会导致大量的 hash 冲突。是不是感觉不够明显？那直接模仿 HashMap 的 put() 看一下，看下图这段源码，判断 (长度 - 1) & hash 位置是否为空，为空就创建一个新 Node，不为空就链表追加或者转红黑树

public static void main(String[] args) {
    // 注意这里是 与(&) 运算
    int hashAndA = "A".hashCode() & ("A".hashCode() >>> 16);
    // 模拟 put 时寻位操作，按默认 长度 16 算
    System.out.println((16-1) & hashAndA);

    int hashAndB = "B".hashCode() & ("B".hashCode() >>> 16);
    System.out.println((16-1) & hashAndB);

    int hashAndC = "C".hashCode() & ("C".hashCode() >>> 16);
    System.out.println((16-1) & hashAndC);

    int hashAndD = "D".hashCode() & ("D".hashCode() >>> 16);
    System.out.println((16-1) & hashAndD);
}

结果：
0000 0000 0000 0000  0000 0000 0000 0000   //二进制输出
0                                          //hash 冲突
0000 0000 0000 0000  0000 0000 0000 0000 
0
0000 0000 0000 0000  0000 0000 0000 0000 
0
0000 0000 0000 0000  0000 0000 0000 0000 
0

其实仔细想一下 与(&) 运算的逻辑，就知道是不可行的。
想必当年作者 道格·李 应该不会像我这样笨拙的还要计算一下

现在可以思考几个问题了：

1、HashMap 为什么要自己写一个 hash 函数呢？

Because many common sets of hashes are already reasonably distributed 
(so don't benefit from spreading)

想象一下如果 HashMap 直接使用传入对象的 hashCode，而你传入key自定义对象的 hashCode 算法很垃圾，
10 个里面有 7 个 hashCode 相同，会导致什么问题？导致严重 hash 冲突，
可能仅仅几十个元素每个节点下都是链表或红黑树，影响效率，违背了设计初衷

2、hash 函数是怎样做到尽量 hash 散列的？

先将高位右移 16 位，然后与原 hashCode 异或(^)。
相对于其他计算方式，位运算的速度、效用和质量是减少系统损失的最优选

1639035748(1).jpg

上一篇： mysql 哈希函数 - 掘金

下一篇： 2023 跟我学算法：数据结构和算法 - 哈希算法

五分钟了解 HashMap 的散列函数

在 Go 中使用常见的散列函数

散列函数是如何工作的？

五分钟了解 HashMap 的散列函数

各种字符串散列函数的比较

python2 和 python3 中散列函数的区别

几种常见的散列函数

对 HashMap 中散列函数的思考

深入了解 HashMap（以及哈希函数的真正独创性）

散列函数是如何工作的？