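Why Netty's FastThreadLocal Is Faster Than ThreadLocal

Background

While reading the Netty source code recently, I came across a class called FastThreadLocal. Its Javadoc says it serves the same purpose as ThreadLocal, holding variables that are private to each thread, but with faster access. That raised the question: why is FastThreadLocal faster than ThreadLocal, and where does the speedup come from? To find out, I ran some performance tests and then read the source code of both classes to look for the cause.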

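Performance tests

In the JDK, ThreadLocal gives each thread its own copy of a variable, so callers can use it from many threads without worrying about synchronization. The tests below cover two scenarios:

Multiple threads access the same ThreadLocal or FastThreadLocal variable.
A single thread accesses many ThreadLocal or FastThreadLocal variables.

The two scenarios are benchmarked and compared in turn.

1. Multiple threads accessing the same ThreadLocal or FastThreadLocal

The code is as follows: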

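/**
 * Multiple threads access the same ThreadLocal instance.
 * Note: count is a field defined elsewhere in the test class. The timer stops
 * right after the threads are started (they are not joined), so the result
 * includes thread startup cost and may not cover all of the get() calls.
 */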
public static void testThreadLocalWithMultipleThreads() {
    ThreadLocal<String> threadLocal = new ThreadLocal<>();
    long start = System.currentTimeMillis();
    Thread[] threads = new Thread[count];
    for (int i = 0; i < count; i++) {
        threads[i] = new Thread(new Runnable() {
            @Override
            public void run() {
                threadLocal.set(Thread.currentThread().getName());
                for (int j = 0; j < count; j++) {
                    threadLocal.get();
                }
            }
        }, "Thread" + i);
        threads[i].start();
    }
    long end = System.currentTimeMillis();
    System.out.println("testThreadLocalWithMultipleThreads get:" + (end - start));
}

/**
 * Multiple threads access the same FastThreadLocal instance.
 */
public static void testFastThreadLocalWithMultipleThreads() {
    FastThreadLocal<String> threadLocal = new FastThreadLocal<>();
    long start = System.currentTimeMillis();
    for (int i = 0; i < count; i++) {
        new FastThreadLocalThread(new Runnable() {
            @Override
            public void run() {
                threadLocal.set(Thread.currentThread().getName());
                for (int j = 0; j < count; j++) {
                    threadLocal.get();
                }
            }
        }, "Thread" + i).start();
    }
    long end = System.currentTimeMillis();
    System.out.println("testFastThreadLocalWithMultipleThreads get:" + (end - start));
}

Output (elapsed time in milliseconds):

testThreadLocalWithMultipleThreads get:13800
testFastThreadLocalWithMultipleThreads get:11335

In this scenario, ThreadLocal and FastThreadLocal perform about the same.
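
The two methods above stop the timer as soon as the threads have been started. A variant that also waits for the workers to finish could look like the sketch below. It uses only the JDK ThreadLocal; the class name JoinedThreadLocalBenchmark and the thread and read counts are hypothetical, and no timings are claimed for it.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of the original benchmark.
final class JoinedThreadLocalBenchmark {

    /** Starts `threads` workers, each doing `reads` get() calls, and waits for all of them. */
    static long run(int threads, int reads, AtomicInteger finished) {
        ThreadLocal<String> threadLocal = new ThreadLocal<>();
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                threadLocal.set(Thread.currentThread().getName());
                for (int j = 0; j < reads; j++) {
                    threadLocal.get();
                }
                finished.incrementAndGet();
            }, "Thread" + i);
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join(); // stop the clock only after every worker is done
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return (System.nanoTime() - start) / 1_000_000; // elapsed milliseconds
    }

    public static void main(String[] args) {
        AtomicInteger finished = new AtomicInteger();
        long millis = run(8, 100_000, finished);
        System.out.println("threads finished: " + finished.get() + ", elapsed ms: " + millis);
    }
}
```

The same structure should carry over to FastThreadLocal by swapping in FastThreadLocalThread and FastThreadLocal.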

2. A single thread accessing many ThreadLocal or FastThreadLocal instances

The test code is as follows:

/**
 * A single thread accesses many ThreadLocal instances.
 */
public static void testThreadLocalWithMultipleThreadLocal() {
    ThreadLocal<String> threadLocal[] = new ThreadLocal[count];
    for (int i = 0; i < count; i++) {
        threadLocal[i] = new ThreadLocal<String>();
    }
    new Thread(new Runnable() {
        @Override
        public void run() {
            long start = System.currentTimeMillis();
            for (int i = 0; i < count; i++) {
                threadLocal[i].set("value" + i);
            }
            long middle = System.currentTimeMillis();
            for (int i = 0; i < count; i++) {
                for (int j = 0; j < count; j++) {
                    threadLocal[i].get();
                }
            }
            long end = System.currentTimeMillis();
            System.out.println("testThreadLocalWithMultipleThreadLocal set:" + (middle - start) + ",get:" + (end - middle));
        }
    }).start();
}

/**
 * A single thread accesses many FastThreadLocal instances.
 */
public static void testFastThreadLocalWithMultipleFastThreadLocal() {
    FastThreadLocal<String> threadLocal[] = new FastThreadLocal[count];
    for (int i = 0; i < count; i++) {
        threadLocal[i] = new FastThreadLocal<String>();
    }
    new FastThreadLocalThread(new Runnable() {
        @Override
        public void run() {
            long start = System.currentTimeMillis();
            for (int i = 0; i < count; i++) {
                threadLocal[i].set("value" + i);
            }
            long middle = System.currentTimeMillis();
            for (int i = 0; i < count; i++) {
                for (int j = 0; j < count; j++) {
                    threadLocal[i].get();
                }
            }
            long end = System.currentTimeMillis();
            System.out.println("testFastThreadLocalWithMultipleFastThreadLocal set:" + (middle - start) + ",get:" + (end - middle));
        }
    }).start();
}

Output (elapsed time in milliseconds):

testThreadLocalWithMultipleThreadLocal set:68,get:21492
testFastThreadLocalWithMultipleFastThreadLocal set:61,get:8

In this scenario, FastThreadLocal is far faster than ThreadLocal.

How it works

The ThreadLocal mechanism

ThreadLocal is used mainly through its get and set methods, so the analysis starts with these two.

public void set(T value) {
    Thread t = Thread.currentThread();
    // Look up the ThreadLocalMap of the current thread; ThreadLocalMap is a field of Thread
    ThreadLocalMap map = getMap(t);
    if (map != null)
        map.set(this, value);
    else
        createMap(t, value);
}

As this method shows, ThreadLocal.set simply delegates to ThreadLocalMap.set, so the real work is in ThreadLocalMap. Here is how ThreadLocalMap implements set:

private void set(ThreadLocal<?> key, Object value) {
    // ThreadLocalMap stores its entries in an Entry array
    Entry[] tab = table;
    int len = tab.length;
    // Mask ThreadLocal.threadLocalHashCode with (len - 1) to get the starting array index
    int i = key.threadLocalHashCode & (len-1);
    
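    // Start at slot i. If the slot is occupied, check whether its key is this ThreadLocal; if so, overwrite the value in place.
    // Otherwise move to the next slot (nextIndex wraps around at len) until an empty slot (e == null) is found.
    // size is the number of entries actually stored and len is the table length, so size is always less than len.
    // When size reaches the threshold (2/3 of len), rehash() first expunges stale entries and then,
    // if the table is still roughly half full, doubles its length; see the rehash method.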
    for (Entry e = tab[i];
            e != null;
            e = tab[i = nextIndex(i, len)]) {
        ThreadLocal<?> k = e.get();

        if (k == key) {
            e.value = value;
            return;
        }

        if (k == null) {
            replaceStaleEntry(key, value, i);
            return;
        }
    }
    // Reached an empty slot without finding this ThreadLocal's entry or a stale one: store a new entry here
    tab[i] = new Entry(key, value);
    int sz = ++size;
    if (!cleanSomeSlots(i, sz) && sz >= threshold)
        rehash();
}

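In short, ThreadLocalMap stores its data in an array, much like HashMap. Each ThreadLocal is assigned a threadLocalHashCode when it is created; masking that hash with the table length gives the array slot, so hash collisions can occur. HashMap resolves collisions by chaining entries within a bucket (array plus linked list), whereas ThreadLocalMap moves on to the following slots until it finds a free one, i.e. open addressing with linear probing.

Now let's look at the get method: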

public T get() {
    Thread t = Thread.currentThread();
    ThreadLocalMap map = getMap(t);
    if (map != null) {
        ThreadLocalMap.Entry e = map.getEntry(this);
        if (e != null) {
            @SuppressWarnings("unchecked")
            T result = (T)e.value;
            return result;
        }
    }
    return setInitialValue();
}

The get method likewise fetches the ThreadLocalMap of the current thread first and then looks up the value through ThreadLocalMap.getEntry:

private Entry getEntry(ThreadLocal<?> key) {
    int i = key.threadLocalHashCode & (table.length - 1);
    Entry e = table[i];
    if (e != null && e.get() == key)
        return e;
    else
        return getEntryAfterMiss(key, i, e);
}

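If there is no hash collision, a single computation on threadLocalHashCode leads straight to the wanted entry. If slot i holds a different key because of a collision, getEntryAfterMiss is called to probe the following slots until the entry is found or an empty slot is reached: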

private Entry getEntryAfterMiss(ThreadLocal<?> key, int i, Entry e) {
    Entry[] tab = table;
    int len = tab.length;

    while (e != null) {
        ThreadLocal<?> k = e.get();
        if (k == key)
            return e;
        if (k == null)
            expungeStaleEntry(i);
        else
            i = nextIndex(i, len);
        e = tab[i];
    }
    return null;
}
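
To make the probing behaviour concrete, here is a minimal sketch of an open-addressing table with linear probing. It is a simplified illustration, not the JDK's ThreadLocalMap: the class ProbingTable, its fields, and the explicit hash parameter are hypothetical, and stale-entry cleanup and resizing are left out.

```java
// Hypothetical, simplified open-addressing table; not the JDK's ThreadLocalMap.
final class ProbingTable {
    private final Object[] keys;
    private final Object[] values;
    private int size;
    int probes; // number of slots inspected by get(), to make collisions visible

    ProbingTable(int capacity) { // capacity is assumed to be a power of two
        keys = new Object[capacity];
        values = new Object[capacity];
    }

    private int nextIndex(int i) {
        return (i + 1) & (keys.length - 1); // wrap around at the end of the array
    }

    void put(Object key, int hash, Object value) {
        int i = hash & (keys.length - 1);
        while (keys[i] != null && keys[i] != key) {
            i = nextIndex(i); // slot taken by another key: try the next one
        }
        if (keys[i] == null) {
            if (size + 1 == keys.length) {
                throw new IllegalStateException("table full"); // always keep one empty slot
            }
            size++;
        }
        keys[i] = key;
        values[i] = value;
    }

    Object get(Object key, int hash) {
        int i = hash & (keys.length - 1);
        while (keys[i] != null) {
            probes++;
            if (keys[i] == key) {
                return values[i];
            }
            i = nextIndex(i);
        }
        return null; // hit an empty slot: the key is not present
    }

    public static void main(String[] args) {
        ProbingTable table = new ProbingTable(8);
        String a = "a", b = "b", c = "c";
        table.put(a, 5, "A"); // lands in slot 5
        table.put(b, 5, "B"); // collides, lands in slot 6
        table.put(c, 5, "C"); // collides twice, lands in slot 7
        System.out.println(table.get(c, 5) + " found after " + table.probes + " probes");
    }
}
```

In this sketch, every key that shares a starting slot with an earlier key costs one extra probe on lookup, which is the cost the collision discussion above refers to.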

The FastThreadLocal mechanism

The get and set methods of FastThreadLocal are analyzed in the same way:

public final V get() {
    return get(InternalThreadLocalMap.get());
}

@SuppressWarnings("unchecked")
public final V get(InternalThreadLocalMap threadLocalMap) {
    Object v = threadLocalMap.indexedVariable(index);
    if (v != InternalThreadLocalMap.UNSET) {
        return (V) v;
    }

    return initialize(threadLocalMap);
}

The no-argument get method first obtains an InternalThreadLocalMap. For a FastThreadLocalThread, this map is stored in a field of the thread itself:

public static InternalThreadLocalMap get() {
    Thread thread = Thread.currentThread();
    if (thread instanceof FastThreadLocalThread) {
        return fastGet((FastThreadLocalThread) thread);
    } else {
        return slowGet();
    }
}

private static InternalThreadLocalMap fastGet(FastThreadLocalThread thread) {
    InternalThreadLocalMap threadLocalMap = thread.threadLocalMap();
    if (threadLocalMap == null) {
        thread.setThreadLocalMap(threadLocalMap = new InternalThreadLocalMap());
    }
    return threadLocalMap;
}

As shown above, FastThreadLocalThread holds an InternalThreadLocalMap field, which plays the same role as the ThreadLocalMap in Thread. Together with get(InternalThreadLocalMap threadLocalMap), this shows that FastThreadLocal values are stored in the InternalThreadLocalMap. Next, let's see how InternalThreadLocalMap implements indexedVariable:

    public Object indexedVariable(int index) {
        Object[] lookup = indexedVariables;
        return index < lookup.length? lookup[index] : UNSET;
    }

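This method shows that InternalThreadLocalMap also keeps its data in an array, but it reads the slot directly at the given index, with no hashing or probing. So how does this index uniquely identify a variable? First, look at how index is defined: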

    private final int index;

    public FastThreadLocal() {
        index = InternalThreadLocalMap.nextVariableIndex();
    }

The index is assigned once, in the constructor, and never changes afterwards.

    public static int nextVariableIndex() {
        int index = nextIndex.getAndIncrement();
        if (index < 0) {
            nextIndex.decrementAndGet();
            throw new IllegalStateException("too many thread-local indexed variables");
        }
        return index;
    }

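nextVariableIndex is a static method that returns the next value of the incrementing nextIndex counter, so every FastThreadLocal instance that is created gets its own index. Now let's see how nextIndex is kept thread-safe: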

static final AtomicInteger nextIndex = new AtomicInteger();

As the definition shows, the index is produced with the atomic AtomicInteger.getAndIncrement() operation, which relies on CAS to stay thread-safe.
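
The index-based design can be imitated with the JDK alone. The sketch below is an illustration of the idea under simplifying assumptions, not Netty's code: SlotThread, IndexedLocal, and all their members are hypothetical names, unset slots are represented by null instead of an UNSET marker, and plain threads fall back to an ordinary ThreadLocal, similar in spirit to the slowGet() path.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical names throughout; a JDK-only imitation of the idea, not Netty's code.
final class SlotThread extends Thread {
    Object[] slots; // per-thread storage, touched only by the owning thread

    SlotThread(Runnable task) {
        super(task);
    }
}

final class IndexedLocal<T> {
    private static final AtomicInteger NEXT_INDEX = new AtomicInteger();
    // Fallback storage for threads that are not SlotThread instances.
    private static final ThreadLocal<Object[]> FALLBACK = new ThreadLocal<>();

    // Assigned once per instance; this is the slot used in every thread's array.
    private final int index = NEXT_INDEX.getAndIncrement();

    int index() {
        return index;
    }

    private static Object[] currentSlots() {
        Thread t = Thread.currentThread();
        return (t instanceof SlotThread) ? ((SlotThread) t).slots : FALLBACK.get();
    }

    private static void storeSlots(Object[] slots) {
        Thread t = Thread.currentThread();
        if (t instanceof SlotThread) {
            ((SlotThread) t).slots = slots;
        } else {
            FALLBACK.set(slots);
        }
    }

    @SuppressWarnings("unchecked")
    T get() {
        Object[] slots = currentSlots();
        // Direct array read: no hashing, no probing.
        return (slots != null && index < slots.length) ? (T) slots[index] : null;
    }

    void set(T value) {
        Object[] slots = currentSlots();
        if (slots == null) {
            slots = new Object[index + 1];
            storeSlots(slots);
        } else if (index >= slots.length) {
            slots = Arrays.copyOf(slots, index + 1); // grow so that this index fits
            storeSlots(slots);
        }
        slots[index] = value;
    }
}

final class IndexedLocalDemo {
    public static void main(String[] args) throws InterruptedException {
        IndexedLocal<String> name = new IndexedLocal<>();
        SlotThread worker = new SlotThread(() -> {
            name.set("worker-value");
            System.out.println("worker sees: " + name.get());
        });
        worker.start();
        worker.join();
        System.out.println("main sees: " + name.get()); // main never set a value
    }
}
```

Because each instance owns a fixed slot number, a lookup in this sketch is a bounds check plus one array read, regardless of how many instances exist. The trade-off is that every thread's array must be long enough to cover the largest index it uses.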

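Summary

When multiple threads access the same variable, ThreadLocal and FastThreadLocal perform about the same.
When a single thread accesses many variables, the performance gap is large.
With only one variable, ThreadLocal rarely runs into hash collisions; compared with FastThreadLocal it only does one extra hash computation, and that computation is cheap, so the impact on performance is small.
When a single thread accesses many ThreadLocal instances, hash collisions become frequent in both get and set, which significantly hurts ThreadLocalMap performance.
FastThreadLocal avoids this problem: each FastThreadLocal instance is given a unique index that points directly at its slot in the InternalThreadLocalMap, and that index is generated in a thread-safe way.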
