JAVA中为什么Map桶中个数超过8才转为树

为什么要转换?

因为Map中桶的元素初始化是链表保存的,其查找性能是O(n),而树结构能将查找性能提升到O(log(n))。当链表长度很小的时候,即使遍历,速度也非常快,但是当链表长度不断变长,肯定会对查询性能有一定的影响,所以才需要转成树。

为什么阈值是8?

转换后存储的数据结构TreeNodes占用空间是普通Nodes的两倍,只有当bin包含足够多的节点时才会转成TreeNodes,而是否足够多是由TREEIFY_THRESHOLD的值决定的。

在hashCode离散性很好的情况下,树型bin(桶,即bucket,HashMap中hashCode值一样的元素保存的地方)用到的概率非常小,因为数据均匀分布在每个bin中,几乎不会有bin中链表长度会达到阈值。事实上,在随机hashCode的情况下,在bin中节点的分布频率遵循如下的泊松分布(http://en.wikipedia.org/wiki/Poisson_distribution)。

在扩容阈值为0.75的情况下,(即使因为扩容而方差很大)遵循着参数平均为0.5的泊松分布。忽略方差,按公式
在这里插入图片描述
计算,概率如下:

长度 概率
0 0.60653066
1 0.30326533
2 0.07581633
3 0.01263606
4 0.00157952
5 0.00015795
6 0.00001316
7 0.00000094
8 0.00000006

如上,一个bin中链表长度达到8个元素的概率为0.00000006,几乎是不可能事件。

大部分情况下,链表存储能节约存储空间同时有着良好的查找性能;极个别情况下,节点数达到8个,转为红黑树,能获得更好的查找性能,同时因为是个别情况,不需要大量的存储空间。

所以,阈值8是时间和空间的权衡,是根据概率统计决定的。不得不感叹,发展30年的Java每一项改动和优化都是非常严谨和科学的。

附. JDK(1.8.0_45)中的相关注释

HashMap类第174~197行

     * Because TreeNodes are about twice the size of regular nodes, we
     * use them only when bins contain enough nodes to warrant use
     * (see TREEIFY_THRESHOLD). And when they become too small (due to
     * removal or resizing) they are converted back to plain bins.  In
     * usages with well-distributed user hashCodes, tree bins are
     * rarely used.  Ideally, under random hashCodes, the frequency of
     * nodes in bins follows a Poisson distribution
     * (http://en.wikipedia.org/wiki/Poisson_distribution) with a
     * parameter of about 0.5 on average for the default resizing
     * threshold of 0.75, although with a large variance because of
     * resizing granularity. Ignoring variance, the expected
     * occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
     * factorial(k)). The first values are:
     *
     * 0:    0.60653066
     * 1:    0.30326533
     * 2:    0.07581633
     * 3:    0.01263606
     * 4:    0.00157952
     * 5:    0.00015795
     * 6:    0.00001316
     * 7:    0.00000094
     * 8:    0.00000006
     * more: less than 1 in ten million

ConcurrentHashMap中第327~349行也有关于此的说法,大同小异。

     * The main disadvantage of per-bin locks is that other update
     * operations on other nodes in a bin list protected by the same
     * lock can stall, for example when user equals() or mapping
     * functions take a long time.  However, statistically, under
     * random hash codes, this is not a common problem.  Ideally, the
     * frequency of nodes in bins follows a Poisson distribution
     * (http://en.wikipedia.org/wiki/Poisson_distribution) with a
     * parameter of about 0.5 on average, given the resizing threshold
     * of 0.75, although with a large variance because of resizing
     * granularity. Ignoring variance, the expected occurrences of
     * list size k are (exp(-0.5) * pow(0.5, k) / factorial(k)). The
     * first values are:
     *
     * 0:    0.60653066
     * 1:    0.30326533
     * 2:    0.07581633
     * 3:    0.01263606
     * 4:    0.00157952
     * 5:    0.00015795
     * 6:    0.00001316
     * 7:    0.00000094
     * 8:    0.00000006
     * more: less than 1 in ten million
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章