HashMap: Documentation Walkthrough and Source Code Analysis

Documentation Walkthrough and Questions

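I have not pasted the original English text; you can follow along with the comments in the source code. What follows is a rendering of those source comments, together with the questions that came to mind while I was reading. The source analysis later in the article comes back to these questions.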

HashMap Overview

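HashMap is a hash-table-based implementation of the Map interface. It provides all of the optional map operations and permits null values and the null key. (HashMap is roughly equivalent to Hashtable, except that it is unsynchronized, and therefore not thread-safe, and permits nulls.) This class makes no guarantees about the order of the map; in particular, it does not guarantee that the order will remain constant over time.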
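The difference in null handling between HashMap and Hashtable can be checked directly. The sketch below is only an illustration; the class NullKeyDemo and the helper acceptsNullKey are names made up for this example.

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

final class NullKeyDemo {
    /** Returns true if the given map accepts a null key (hypothetical helper). */
    static boolean acceptsNullKey(Map<String, String> map) {
        try {
            map.put(null, "value");
            return true;
        } catch (NullPointerException e) {
            // Hashtable rejects null keys (and null values) this way.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("HashMap:   " + acceptsNullKey(new HashMap<>()));
        System.out.println("Hashtable: " + acceptsNullKey(new Hashtable<>()));
    }
}
```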

Q: How does the order change?

HashMap Performance

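Assuming the hash function disperses the elements properly among the buckets, this implementation provides constant-time performance for the basic operations (get and put).
Iteration over the collection views requires time proportional to the "capacity" of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings). So if iteration performance matters, do not set the initial capacity too high (or the load factor too low).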

Q: Why are get and put constant-time, and how is that guaranteed? How do the initial capacity and the load factor affect performance?

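A HashMap instance has two parameters that affect its performance: the initial capacity and the load factor.
The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased; in other words, how much data can go in before the table must grow. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, its internal data structures are rebuilt) so that it has approximately twice the number of buckets.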

Q: How are entries, capacity, and load factor related?

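As a general rule, the default load factor (0.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). When setting the initial capacity, take the expected number of entries in the map and the load factor into account, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operation will ever occur.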
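The sizing rule in the paragraph above is plain arithmetic, so a small sketch may help. CapacityPlanning, initialCapacityFor and thresholdFor are hypothetical names invented for this illustration and are not part of HashMap; HashMap will additionally round the requested capacity up to a power of two.

```java
import java.util.HashMap;
import java.util.Map;

final class CapacityPlanning {
    /** Smallest capacity c such that c * loadFactor >= expectedEntries. */
    static int initialCapacityFor(int expectedEntries, float loadFactor) {
        return (int) Math.ceil(expectedEntries / (double) loadFactor);
    }

    /** Number of entries a table of the given capacity holds before it must grow. */
    static int thresholdFor(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        int capacity = initialCapacityFor(100, 0.75f);
        // Passing the computed capacity up front avoids rehashing while filling the map.
        Map<Integer, String> map = new HashMap<>(capacity, 0.75f);
        for (int i = 0; i < 100; i++) {
            map.put(i, "v" + i);
        }
        System.out.println("requested capacity = " + capacity + ", size = " + map.size());
        System.out.println("default table threshold = " + thresholdFor(16, 0.75f));
    }
}
```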

Q: The load factor comes up again?
Q: What is a rehash operation, and what triggers it?

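If many mappings are to be stored, creating the HashMap with a sufficiently large capacity lets the mappings be stored more efficiently than letting the table rehash automatically as it grows. Note that using many keys with the same hashCode is a sure way to slow down any hash table. To soften the impact, when the keys implement Comparable, this class may use the comparison order among keys to help break ties.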

Q: How are keys with the same hashCode handled? What happens when the key implements Comparable, and what happens when it does not?

HashMap Is Not Thread-Safe

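Note that this implementation is not synchronized. If multiple threads access a hash map concurrently, and at least one of them modifies the map structurally, it must be synchronized externally. (A structural modification is any operation that adds or deletes one or more mappings; merely changing the value associated with a key that the instance already contains is not a structural modification.) This is typically accomplished by synchronizing on some object that naturally encapsulates the map.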

Q: I don't understand what the last sentence means: "This is typically accomplished by synchronizing on some object that naturally encapsulates the map."

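If no such object exists, the map should be "wrapped" using the Collections.synchronizedMap method. This is best done at creation time, to prevent accidental unsynchronized access to the map.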

Example:

Map m = Collections.synchronizedMap(new HashMap(...));
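As for "synchronizing on some object that naturally encapsulates the map", one plausible reading is a class that owns the HashMap privately and guards every access with its own monitor. The sketch below follows that reading; UserRegistry and all of its methods are hypothetical names for this illustration only.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical object that encapsulates a HashMap and synchronizes all access on itself. */
final class UserRegistry {
    private final Map<String, Integer> ages = new HashMap<>(); // never exposed directly

    synchronized void register(String name, int age) { ages.put(name, age); }

    synchronized Integer ageOf(String name) { return ages.get(name); }

    synchronized int size() { return ages.size(); }

    /** Fills one registry from several threads and returns its final size. */
    static int registerConcurrently(int threadCount, int perThread) {
        UserRegistry registry = new UserRegistry();
        Thread[] workers = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    registry.register("user-" + id + "-" + i, i);
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return registry.size();
    }

    public static void main(String[] args) {
        System.out.println(registerConcurrently(4, 1000));
    }
}
```

Because every method locks the same UserRegistry instance, the structural modifications made by put never overlap, and no separate wrapper is needed.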

Fail-fast Behavior

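The iterators returned by all of this class's collection view methods are fail-fast: if the map is structurally modified at any time after the iterator is created, in any way except through the iterator's own remove method, the iterator throws a ConcurrentModificationException. So in the face of concurrent modification, the iterator fails quickly and cleanly rather than risking arbitrary, non-deterministic behavior at an undetermined time in the future. (Translator's note: the original wording is "quickly and cleanly"; I take it to mean that the iterator stops at once by throwing, instead of carrying on in an inconsistent state.)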

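Note that the fail-fast behavior of an iterator cannot be guaranteed; generally speaking, it is impossible to make any hard guarantees in the presence of unsynchronized concurrent modification.
Fail-fast iterators throw ConcurrentModificationException on a best-effort basis.
Therefore it would be wrong to write a program that depends on this exception for its correctness: the fail-fast behavior should be used only to detect bugs.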
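The fail-fast rule can be observed even in a single thread, where the outcome is predictable. The sketch below is an illustration only; FailFastDemo and its helpers are made-up names. It contrasts removing through the map during iteration with removing through the iterator itself.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

final class FailFastDemo {
    private static Map<String, Integer> sample() {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);
        return map;
    }

    /** Removes through the map while iterating; returns true if the iterator fails fast. */
    static boolean removeViaMapThrows() {
        Map<String, Integer> map = sample();
        try {
            for (String key : map.keySet()) {
                map.remove(key); // structural modification behind the iterator's back
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    /** Removes through the iterator's own remove(); returns the remaining size. */
    static int removeViaIterator() {
        Map<String, Integer> map = sample();
        Iterator<String> it = map.keySet().iterator();
        while (it.hasNext()) {
            it.next();
            it.remove(); // the one modification the iterator tolerates
        }
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println("remove via map throws: " + removeViaMapThrows());
        System.out.println("size after iterator removal: " + removeViaIterator());
    }
}
```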

This class is a member of the Java Collections Framework.

Java collection interfaces: inheritance hierarchy and implementations

Source Code Walkthrough

Now let's look at the code.

public class HashMap<K,V> extends AbstractMap<K,V>  
    implements Map<K,V>, Cloneable, Serializable

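As the figure above shows, HashMap extends AbstractMap and implements the Map interface. It also implements Cloneable and Serializable. These two marker interfaces concern object cloning and serialization and are outside the scope of this article.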

Constants

serialVersionUID

private static final long serialVersionUID = 362498820763181265L;

The class defines a static constant, serialVersionUID, which exists for the Serializable interface and is used during serialization.

Default initial capacity

/**
 * The default initial capacity - MUST be a power of two. 
 */
 static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

The default initial capacity, defined here as 16. It must be a power of two.

Maximum capacity

/**
 * The maximum capacity, used if a higher value is implicitly specified 
 * by either of the constructors with arguments. 
 * MUST be a power of two <= 1<<30. 
 */
 static final int MAXIMUM_CAPACITY = 1 << 30;

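The maximum capacity. If a constructor argument implicitly specifies a higher value, this value is used instead to cap the capacity.
Rule: the capacity must be a power of two and no greater than 2^30.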

Default load factor

/**
 * The load factor used when none specified in constructor. 
 */
 static final float DEFAULT_LOAD_FACTOR = 0.75f;

The load factor used when none is specified in the constructor.

Treeify threshold

/**
 * The bin count threshold for using a tree rather than list for a
 * bin.  Bins are converted to trees when adding an element to a
 * bin with at least this many nodes. The value must be greater
 * than 2 and should be at least 8 to mesh with assumptions in
 * tree removal about conversion back to plain bins upon
 * shrinkage. 
 */
 static final int TREEIFY_THRESHOLD = 8;

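The threshold at which the linked list stored in a bin is converted to a tree. A bin is converted when an element is added to a bin that already holds at least 8 nodes. The value must be greater than 2, and should be at least 8 to mesh with the assumptions that tree removal makes about converting back to a plain bin on shrinkage.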

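The threshold for converting a tree back to a linked list when a bin is split during a resize. It should be less than TREEIFY_THRESHOLD.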

/**
 * The bin count threshold for untreeifying a (split) bin during a 
 * resize operation. Should be less than TREEIFY_THRESHOLD, and at 
 * most 6 to mesh with shrinkage detection under removal. 
 */
 static final int UNTREEIFY_THRESHOLD = 6;

Minimum table capacity for treeifying bins

/**
 * The smallest table capacity for which bins may be treeified.
 * (Otherwise the table is resized if too many nodes in a bin.)
 * Should be at least 4 * TREEIFY_THRESHOLD to avoid conflicts
 * between resizing and treeification thresholds. 
 */
 static final int MIN_TREEIFY_CAPACITY = 64;

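The smallest table capacity at which bins may be treeified; below it, a bin with too many nodes causes the table to be resized instead. It should be at least 4 * TREEIFY_THRESHOLD to avoid conflicts between the resizing and treeification thresholds.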
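Taken together, TREEIFY_THRESHOLD and MIN_TREEIFY_CAPACITY decide what happens when a bin gets long. The sketch below is a simplified model of that decision and is not the JDK code; TreeifyDecisionDemo, Action and afterAppend are hypothetical names, and the constants are copied from the source above.

```java
final class TreeifyDecisionDemo {
    static final int TREEIFY_THRESHOLD = 8;     // copied from HashMap
    static final int MIN_TREEIFY_CAPACITY = 64; // copied from HashMap

    enum Action { NONE, RESIZE, TREEIFY }

    /**
     * Models what happens when a node is appended to a bin.
     *
     * @param nodesBeforeAppend number of nodes in the bin before the new one is added
     * @param tableCapacity     current number of buckets
     */
    static Action afterAppend(int nodesBeforeAppend, int tableCapacity) {
        if (nodesBeforeAppend < TREEIFY_THRESHOLD) {
            return Action.NONE;    // short list: leave it alone
        }
        if (tableCapacity < MIN_TREEIFY_CAPACITY) {
            return Action.RESIZE;  // small table: grow it rather than build a tree
        }
        return Action.TREEIFY;     // large table, long bin: convert to a tree
    }

    public static void main(String[] args) {
        System.out.println(afterAppend(7, 64));
        System.out.println(afterAppend(8, 16));
        System.out.println(afterAppend(8, 64));
    }
}
```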

Next in the source comes a long block of implementation notes. Its translation and explanation are attached at the end.

Node<K, V>

Node is a static nested class of HashMap.

static class Node<K,V> implements Map.Entry<K,V> {  
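    final int hash;  // caches the hash computed from the key, so it is not recomputed
    final K key;
    V value;
    Node<K,V> next;  // reference to the next Node; a bin is a singly linked list

  // ... some code omitted ...

    public final V setValue(V newValue) {  // setValue returns the previous value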
         V oldValue = value;  
         value = newValue;  
         return oldValue;  
    }  
  
    public final boolean equals(Object o) {  
        if (o == this)  // == compares references: the very same object is trivially equal
            return true;
        if (o instanceof Map.Entry) {  // for any Map.Entry, compare both key and value
            Map.Entry<?,?> e = (Map.Entry<?,?>)o;  
            if (Objects.equals(key, e.getKey()) &&  
                Objects.equals(value, e.getValue()))  
                return true;  
        }  
        return false;  
     }  
}
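Since Node implements Map.Entry, the equals and setValue behavior above is observable through entrySet(). A small sketch, with EntryContractDemo and its helpers being names invented for this illustration:

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.Map;

final class EntryContractDemo {
    /** A HashMap entry equals any other Map.Entry with an equal key and value. */
    static boolean nodeEqualsSimpleEntry() {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        Map.Entry<String, Integer> node = map.entrySet().iterator().next();
        return node.equals(new AbstractMap.SimpleEntry<>("a", 1));
    }

    /** Returns {value returned by setValue, value seen through the map afterwards}. */
    static int[] setValueOldAndNew() {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        Map.Entry<String, Integer> node = map.entrySet().iterator().next();
        int old = node.setValue(2); // writes through to the map
        return new int[] { old, map.get("a") };
    }

    public static void main(String[] args) {
        System.out.println(nodeEqualsSimpleEntry());
        int[] result = setValueOldAndNew();
        System.out.println(result[0] + " -> " + result[1]);
    }
}
```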

Static utilities

The hash method

static final int hash(Object key) {  
    int h;  
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);  
}

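Javadoc translation: computes key.hashCode() and spreads (XORs) the higher bits of the hash down to the lower bits. Because the table uses power-of-two masking, sets of hashes that vary only in bits above the current mask will always collide. (A known example is a set of Float keys holding consecutive whole numbers in a small table.) So a transform is applied that spreads the impact of the higher bits downward. There is a tradeoff between speed, utility, and quality of bit-spreading. Because many common sets of hashes are already reasonably distributed (and so gain nothing from spreading), and because trees are used to handle large sets of collisions in bins, the code just XORs some shifted bits in the cheapest possible way to reduce systematic loss, and to incorporate the impact of the highest bits, which would otherwise never be used in index calculations because of the table bounds.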

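In short:

XOR the original hashCode with itself shifted right by 16 bits, so that the high bits also influence the low bits.
Because the table length is always a power of two, only the low bits select a bucket; without this step, hashes that differ only in their high bits would always collide.
The transform is deliberately cheap: many hash sets are already well distributed, and trees handle bins with many collisions, so one shift and one XOR are considered enough.
It also lets the highest bits take part in index calculation, which they otherwise never would while the table is small.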
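A worked example may make the effect concrete. The sketch below takes two hash codes that differ only above bit 15 and compares their bucket index in a 16-slot table with and without the spreading step. HashSpreadDemo, spread and index are names made up for this illustration; spread repeats the expression used in hash(), and the sample values are chosen by hand.

```java
final class HashSpreadDemo {
    /** Same transform as HashMap.hash(), applied to a raw hash code. */
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    /** Bucket index for a table whose length is a power of two. */
    static int index(int hash, int tableLength) {
        return (tableLength - 1) & hash;
    }

    public static void main(String[] args) {
        int a = 0x10000; // differs from b only in bits above the 16-slot mask
        int b = 0x20000;
        System.out.println("raw:    " + index(a, 16) + " vs " + index(b, 16));
        System.out.println("spread: " + index(spread(a), 16) + " vs " + index(spread(b), 16));
    }
}
```

Without spreading, both values land in bucket 0; after spreading they land in buckets 1 and 2.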

Comparison helpers: comparableClassFor & compareComparables

    /**
     * Returns x's Class if it is of the form "class C implements
     * Comparable<C>", else null.
     */
    static Class<?> comparableClassFor(Object x) {
        if (x instanceof Comparable) {
            Class<?> c; Type[] ts, as; Type t; ParameterizedType p;
            if ((c = x.getClass()) == String.class) // bypass checks
                return c;
            if ((ts = c.getGenericInterfaces()) != null) {
                for (int i = 0; i < ts.length; ++i) {
                    if (((t = ts[i]) instanceof ParameterizedType) &&
                        ((p = (ParameterizedType)t).getRawType() ==
                         Comparable.class) &&
                        (as = p.getActualTypeArguments()) != null &&
                        as.length == 1 && as[0] == c) // type arg is c
                        return c;
                }
            }
        }
        return null;
    }

    /**
     * Returns k.compareTo(x) if x matches kc (k's screened comparable
     * class), else 0.
     */
    @SuppressWarnings({"rawtypes","unchecked"}) // for cast to Comparable
    static int compareComparables(Class<?> kc, Object k, Object x) {
        return (x == null || x.getClass() != kc ? 0 :
                ((Comparable)k).compareTo(x));
    }

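comparableClassFor: returns x's class if that class has the form "class C implements Comparable<C>" (String is short-circuited); otherwise returns null. Implementing Comparable with some other type argument is not enough.
compareComparables: compares k with x. If x is non-null and its class is exactly kc (k's screened comparable class), it returns k.compareTo(x); otherwise it returns 0.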
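Both helpers are package-private, so they cannot be called from user code. The sketch below re-implements the same reflective check in a standalone class in order to show which keys qualify. ComparableClassDemo, Version and Loose are hypothetical names for this illustration.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

final class ComparableClassDemo {
    /** Standalone re-implementation of the check described above. */
    static Class<?> comparableClassFor(Object x) {
        if (!(x instanceof Comparable)) {
            return null;
        }
        Class<?> c = x.getClass();
        if (c == String.class) {
            return c; // shortcut, as in the original
        }
        for (Type t : c.getGenericInterfaces()) {
            if (t instanceof ParameterizedType) {
                ParameterizedType p = (ParameterizedType) t;
                Type[] args = p.getActualTypeArguments();
                if (p.getRawType() == Comparable.class
                        && args.length == 1 && args[0] == c) {
                    return c; // the type argument is the class itself
                }
            }
        }
        return null;
    }

    /** Comparable to itself: qualifies. */
    static final class Version implements Comparable<Version> {
        final int number;
        Version(int number) { this.number = number; }
        @Override public int compareTo(Version other) {
            return Integer.compare(number, other.number);
        }
    }

    /** Comparable to Object, not to itself: does not qualify. */
    static final class Loose implements Comparable<Object> {
        @Override public int compareTo(Object other) { return 0; }
    }

    public static void main(String[] args) {
        System.out.println(comparableClassFor("text"));
        System.out.println(comparableClassFor(new Version(1)));
        System.out.println(comparableClassFor(new Loose()));
        System.out.println(comparableClassFor(new Object()));
    }
}
```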

Sizing the table: tableSizeFor

 /**
   * Returns a power of two size for the given target capacity.
   */
  static final int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

This method is called when the capacity is set. It guarantees that the capacity is a power of two and does not exceed the maximum capacity.
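To see the rounding in action, the method can be copied into a standalone class and called with a few inputs. TableSizeDemo is a name made up for this illustration; the method body is the one shown above.

```java
final class TableSizeDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    /** Copy of the tableSizeFor shown above. */
    static int tableSizeFor(int cap) {
        int n = cap - 1;   // so that an exact power of two maps to itself
        n |= n >>> 1;      // smear the highest set bit into every lower position
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        int[] inputs = { 0, 1, 10, 16, 17, (1 << 30) + 1 };
        for (int cap : inputs) {
            System.out.println(cap + " -> " + tableSizeFor(cap));
        }
    }
}
```

For example, 10 rounds up to 16, 16 stays 16, and 17 rounds up to 32.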

Fields

/**
 * The table, initialized on first use, and resized as
 * necessary. When allocated, length is always a power of two.
 * (We also tolerate length zero in some operations to allow
 * bootstrapping mechanics that are currently not needed.)
 */
 transient Node<K,V>[] table;

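table: initialized on first use and resized as necessary. When allocated, its length is always a power of two. (Some operations also tolerate a length of zero, to allow bootstrapping mechanics that are currently not needed.)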

   /**
     * Holds cached entrySet(). Note that AbstractMap fields are used
     * for keySet() and values().
     */
    transient Set<Map.Entry<K,V>> entrySet;

Holds the cached entrySet(). Note that the caches for keySet() and values() live in fields inherited from AbstractMap.

   /**
     * The number of key-value mappings contained in this map.
     */
    transient int size;

The number of key-value mappings contained in the map.

   /**
     * The number of times this HashMap has been structurally modified
     * Structural modifications are those that change the number of mappings in
     * the HashMap or otherwise modify its internal structure (e.g.,
     * rehash).  This field is used to make iterators on Collection-views of
     * the HashMap fail-fast.  (See ConcurrentModificationException).
     */
    transient int modCount;

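modCount: the number of times this HashMap has been structurally modified.

Structural modifications are those that change the number of mappings in the HashMap or otherwise modify its internal structure, for example a rehash.
This field is what makes iterators over the collection views of the HashMap fail-fast. (See ConcurrentModificationException.)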

   /**
     * The next size value at which to resize (capacity * load factor).
     *
     * @serial
     */
    // (The javadoc description is true upon serialization.
    // Additionally, if the table array has not been allocated, this
    // field holds the initial array capacity, or zero signifying
    // DEFAULT_INITIAL_CAPACITY.)
    int threshold;

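threshold: the next size at which the table is resized (capacity * load factor). If the table array has not been allocated yet, this field instead holds the initial array capacity, or zero to signify DEFAULT_INITIAL_CAPACITY.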

   /**
     * The load factor for the hash table.
     *
     * @serial
     */
    final float loadFactor;

loadFactor: the load factor of the hash table.

Public Operations

First come four constructors. They are fairly simple and only do some initialization.

public HashMap(Map<? extends K, ? extends V> m) {
        this.loadFactor = DEFAULT_LOAD_FACTOR;
        putMapEntries(m, false);
    }

This constructor builds a HashMap from another map; it calls a put helper to populate the data.

putMapEntries, putVal

putMapEntries parameters

m: the source map
evict: false when called while initially constructing the map, true otherwise.

final void putMapEntries(Map<? extends K, ? extends V> m, boolean evict) {
        int s = m.size();
        if (s > 0) {
            if (table == null) { // pre-size
                float ft = ((float)s / loadFactor) + 1.0F;
                int t = ((ft < (float)MAXIMUM_CAPACITY) ?
                         (int)ft : MAXIMUM_CAPACITY);
                if (t > threshold)
                    threshold = tableSizeFor(t);
            }
            else if (s > threshold)
                resize();
            for (Map.Entry<? extends K, ? extends V> e : m.entrySet()) {
                K key = e.getKey();
                V value = e.getValue();
                putVal(hash(key), key, value, false, evict);
            }
        }
    }

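If the source map is empty, do nothing.
If the table (Node<K,V>[]) of the current HashMap is null, decide whether the resize threshold needs to be adjusted.
- Let t be the smaller of (source map size / load factor) + 1.0 and the maximum capacity. If t > threshold, set threshold to tableSizeFor(t).
Otherwise, if the source map size > the current threshold, call resize().
Loop over the entries and put each key-value pair into the current HashMap by calling putVal.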
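The pre-sizing step can be traced with concrete numbers. The sketch below models only the "table == null" branch for a freshly constructed map whose threshold is still 0; PreSizeDemo and preSizedThreshold are hypothetical names, and tableSizeFor is copied from the source shown earlier.

```java
final class PreSizeDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    /** Copy of HashMap.tableSizeFor. */
    static int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    /** Threshold chosen when copying a map of sourceSize entries into a new, empty HashMap. */
    static int preSizedThreshold(int sourceSize, float loadFactor) {
        float ft = ((float) sourceSize / loadFactor) + 1.0F;
        int t = (ft < (float) MAXIMUM_CAPACITY) ? (int) ft : MAXIMUM_CAPACITY;
        // Before the table exists, threshold temporarily stores the initial capacity.
        return tableSizeFor(t);
    }

    public static void main(String[] args) {
        System.out.println(preSizedThreshold(6, 0.75f));
        System.out.println(preSizedThreshold(12, 0.75f));
        System.out.println(preSizedThreshold(100, 0.75f));
    }
}
```

With 12 source entries, 12 / 0.75 + 1 = 17, which rounds up to 32. The + 1.0 therefore makes the copy start with 32 buckets instead of 16, leaving headroom for later insertions.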

putVal

Read this together with the comments and the mind map.

final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
                   boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        if ((tab = table) == null || (n = tab.length) == 0) // the table is null or empty
            n = (tab = resize()).length; // initialize the table and set n to its length
        if ((p = tab[i = (n - 1) & hash]) == null) // locate the bucket; (n - 1) & hash equals hash mod n when n is a power of two
            tab[i] = newNode(hash, key, value, null);
        else {
            Node<K,V> e; K k;
            if (p.hash == hash &&
                ((k = p.key) == key || (key != null && key.equals(k))))
                e = p;
            else if (p instanceof TreeNode)
                e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
            else {
                for (int binCount = 0; ; ++binCount) { // walk the list starting at p
                    if ((e = p.next) == null) {
                        p.next = newNode(hash, key, value, null); // append the new Node at the tail
                        if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                            // binCount does not count the head node, hence the -1:
                            // the bin held at least TREEIFY_THRESHOLD nodes before
                            // this append, so convert it to a tree
                            treeifyBin(tab, hash);
                        break;
                    }
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        break; // key already present: leave the loop with e pointing at that node
                    p = e;
                }
            }
            if (e != null) { // existing mapping for key
                V oldValue = e.value;
                if (!onlyIfAbsent || oldValue == null)
                    e.value = value; // update the value
                afterNodeAccess(e);
                return oldValue; // return the previous value
            }
        }
        ++modCount; // reaching here means a new node was added; bump modCount for fail-fast
        if (++size > threshold) // check whether the table needs to grow
            resize(); // grow the table; this method is analyzed below
        afterNodeInsertion(evict);
        return null;
    }

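The important methods here are:
resize: grows the table
putTreeVal: puts the element into the red-black tree
treeifyBin: converts the elements of a bin into a red-black tree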
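The return value and the onlyIfAbsent flag of putVal are visible through the public put and putIfAbsent methods. The sketch below is an illustration only; PutSemanticsDemo and its helpers are made-up names.

```java
import java.util.HashMap;
import java.util.Map;

final class PutSemanticsDemo {
    /** Returns {result of first put, result of second put, final value}. */
    static Integer[] putTwice() {
        Map<String, Integer> map = new HashMap<>();
        Integer first = map.put("a", 1);  // new node: no previous value
        Integer second = map.put("a", 2); // existing node: previous value is returned
        return new Integer[] { first, second, map.get("a") };
    }

    /** Returns {result for a present key, its value, result for a null-valued key, its value}. */
    static Integer[] putIfAbsentCases() {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", null);
        Integer r1 = map.putIfAbsent("a", 9); // onlyIfAbsent: existing non-null value is kept
        Integer r2 = map.putIfAbsent("b", 9); // oldValue == null: the value is replaced
        return new Integer[] { r1, map.get("a"), r2, map.get("b") };
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(putTwice()));
        System.out.println(java.util.Arrays.toString(putIfAbsentCases()));
    }
}
```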

resize

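resize is called to initialize the table or to double its size. It really does just two things:

Work out the new capacity of the table
Redistribute the linked lists / red-black trees in the bins

Besides the capacity, the threshold for the next resize must be handled too, so two values matter: threshold and capacity. threshold is stored in a field, while capacity is implied by the length of the table array. So resize (1) computes the new capacity and threshold, then (2) keeps a reference to the old table, replaces the table with a new array of the new capacity, and redistributes the old entries into the new array according to the new capacity.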

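Let's look at each part in detail.

Computing the capacity

Because threshold = capacity * loadFactor, any change to capacity requires threshold to be updated with it.

In summary:

If the table has not been initialized, initialize it.
If the capacity has already reached the maximum, set threshold to its maximum value and return the existing table without doing anything else.
If the table can still grow, double both the capacity and the threshold.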
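The three cases above can be sketched as a small function. This is a simplified model, not the JDK code: it always recomputes the threshold from the new capacity, whereas the real resize sometimes doubles the old threshold directly, and it assumes the maximum threshold is Integer.MAX_VALUE. ResizeSizingDemo and nextSizing are hypothetical names.

```java
final class ResizeSizingDemo {
    static final int DEFAULT_INITIAL_CAPACITY = 1 << 4;
    static final int MAXIMUM_CAPACITY = 1 << 30;

    /**
     * Returns {newCapacity, newThreshold} for a table that currently has
     * oldCap buckets (0 if not allocated) and the given stored threshold.
     */
    static int[] nextSizing(int oldCap, int oldThr, float loadFactor) {
        if (oldCap == 0) {
            // Not initialized: an initial capacity may have been parked in threshold.
            int newCap = (oldThr > 0) ? oldThr : DEFAULT_INITIAL_CAPACITY;
            return new int[] { newCap, (int) (newCap * loadFactor) };
        }
        if (oldCap >= MAXIMUM_CAPACITY) {
            // Cannot grow any further: keep the table and never resize again.
            return new int[] { oldCap, Integer.MAX_VALUE };
        }
        int newCap = oldCap << 1; // double the capacity
        return new int[] { newCap, (int) (newCap * loadFactor) };
    }

    public static void main(String[] args) {
        int[] sizing = nextSizing(0, 0, 0.75f);
        for (int i = 0; i < 3; i++) {
            System.out.println("capacity=" + sizing[0] + " threshold=" + sizing[1]);
            sizing = nextSizing(sizing[0], sizing[1], 0.75f);
        }
    }
}
```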

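Redistributing the data in the bins

Walk over the bins of the old table. If a bin holds a single element e, compute e's index from the new capacity and store e there. If a bin holds a tree, its elements are split into two groups: a bit test against the old capacity decides whether each element stays at the low position (the same index) or moves to the high position (index + old capacity). A bin holding a linked list is split in the same way. This spreads the elements more evenly across the bins; without redistributing them, growing the table would be pointless.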
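The low/high split works because doubling a power-of-two capacity adds exactly one bit to the index mask. The sketch below checks that the bit test gives the same index as recomputing the mask from scratch. ResizeSplitDemo and its methods are names made up for this illustration.

```java
final class ResizeSplitDemo {
    /** New index after doubling, derived from the old index and one extra hash bit. */
    static int indexAfterResize(int hash, int oldCap) {
        int oldIndex = hash & (oldCap - 1);
        // The bit at position oldCap decides between the low and the high half.
        return ((hash & oldCap) == 0) ? oldIndex : oldIndex + oldCap;
    }

    /** Index computed directly with the mask of the given capacity. */
    static int indexByMask(int hash, int capacity) {
        return hash & (capacity - 1);
    }

    public static void main(String[] args) {
        // 5 and 21 share bucket 5 in a 16-slot table and part ways after doubling.
        for (int hash : new int[] { 5, 21 }) {
            System.out.println(hash + ": " + indexByMask(hash, 16)
                    + " -> " + indexAfterResize(hash, 16));
        }
    }
}
```

Hash 5 stays in bucket 5, while hash 21 moves to bucket 5 + 16 = 21, so the two former neighbors no longer collide.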

The resize code is rather long, and a large part of it is red-black tree manipulation, so I won't paste it here.