A JMH-Based Benchmarking Solution

The naive benchmarking approach

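When designing a new framework, we often need to evaluate the performance of the components we plan to integrate. A common first attempt is a unit test: write a method, run it in a loop, and time it with System.currentTimeMillis(). However, this runs only on the main thread, so it cannot measure the component's full capacity. Once single-threaded throughput has reached its limit, no increase in the loop count will noticeably raise the OPS.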

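Once that approach proved unreliable, we moved on to multi-threaded measurement: start a few threads locally, loop in each of them, and again use the difference between two System.currentTimeMillis() readings to estimate performance. This is somewhat more reliable, but it still suffers from small sample sizes, statistics that are laborious to collect, and an incomplete statistical breakdown. Measuring anything more finely takes a good while before any useful result comes out.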
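For reference, the hand-rolled multi-threaded measurement described above might look roughly like the following sketch. The class name NaiveThroughputTest, the helper measureOpsPerSecond, and the ConcurrentHashMap standing in for the cache are all hypothetical; the comments point out where this style of measurement goes wrong.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class NaiveThroughputTest {

    /** Runs op from several threads and returns the observed operations per second. */
    static double measureOpsPerSecond(int threads, int opsPerThread, Runnable op) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch done = new CountDownLatch(threads);
        // Weakness: no warm-up phase, so JIT compilation time is part of the measurement.
        long start = System.currentTimeMillis();
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try {
                    for (int i = 0; i < opsPerThread; i++) {
                        op.run(); // Weakness: results are discarded, so the JIT may optimize work away.
                    }
                } finally {
                    done.countDown();
                }
            });
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } finally {
            pool.shutdown();
        }
        // Weakness: a single sample with millisecond resolution, and no error estimate.
        long elapsedMillis = Math.max(1, System.currentTimeMillis() - start);
        return (double) threads * opsPerThread * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        // A plain map stands in for the cache component under test.
        ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
        cache.put("my_benchmark_key", "this is my first benchmark!!!!!");
        double ops = measureOpsPerSecond(4, 1_000_000, () -> cache.get("my_benchmark_key"));
        System.out.println("ops/s = " + ops);
    }
}
```

Every run of this program yields exactly one number, which is why collecting meaningful statistics this way is so tedious.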

Clearly, these are low-productivity approaches. Is there a way to solve the problem once and for all?

The JMH benchmark approach

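Today we look at benchmarking with JMH, the benchmark harness built by the OpenJDK project. It is popular on GitHub: many open-source projects include their benchmark results in the README, which is easy to read and gives a rough idea of performance. So how is it done?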

First, add the Maven dependencies:

        <!--bench mark-->
        <dependency>
            <groupId>org.openjdk.jmh</groupId>
            <artifactId>jmh-core</artifactId>
            <version>1.19</version>
            <!--<scope>test</scope>-->
        </dependency>
        <dependency>
            <groupId>org.openjdk.jmh</groupId>
            <artifactId>jmh-generator-annprocess</artifactId>
            <version>1.19</version>
            <!--<scope>test</scope>-->
        </dependency>

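jmh-core and jmh-generator-annprocess are what actually run the benchmark. Note that the scope must be commented out when running the benchmark, as in the snippet above. Then run clean, then package, and finally install. The install step is required; otherwise the run fails with an error saying a configuration file cannot be found (typically the generated /META-INF/BenchmarkList resource).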

Next, write the benchmark code. Here I use a local cache as the example:

package com.jd.limitbuy.common.cache.offheap.local;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import java.util.UUID;
import java.util.concurrent.TimeUnit;

/**
 * @author shichaoyang
 * @Description: benchmark for the local cache component
 * @date 2018-08-16 11:04
 */

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Thread)
public class localCacheBenchmark {

    private LocalCacheBuilder localCacheBuilder;

    private LocalCacheWrapper localCacheWrapper;

    @Setup
    public void init() {

        localCacheBuilder = new LocalCacheBuilder();
        localCacheBuilder.Init();

        localCacheWrapper = new LocalCacheWrapper(localCacheBuilder);

        localCacheWrapper.set("my_benchmark_key", "this is my first benchmark!!!!!", "localMap1");

        localCacheWrapper.hset("my_benchmark_key_hash", "my_hash_key", "this is my first benchmark!!!!!", "localMap1");

        localCacheWrapper.sadd("my_benchmark_key_set", "this is my hash benchmark!!!!!", "localMap1");

        localCacheWrapper.zadd("my_benchmark_key_zset", "start_val", 100, "localMap1");

        localCacheWrapper.zadd("my_benchmark_key_zset", "end_val", 101, "localMap1");
    }

    /**
     * GroupThreads: the number of concurrent threads is set to 4, which drives the interface to its maximum ops
     */

    @Benchmark
    @GroupThreads(4)
    public String testLocalCacheSet() {
        localCacheWrapper.set(UUID.randomUUID().toString(), "this is my first benchmark!!!!!", "localMap1");
        return "ok";
    }

    /**
     * GroupThreads: the number of concurrent threads is set to 4, which drives the interface to its maximum ops
     */
    @Benchmark
    @GroupThreads(4)
    public String testLocalCacheGet() {
        return localCacheWrapper.get("my_benchmark_key", "localMap1");
    }

    @Benchmark
    @GroupThreads(4)
    public String testLocalCacheHSet() {
        localCacheWrapper.hset(UUID.randomUUID().toString(), UUID.randomUUID().toString(), "this is my hash benchmark!!!!!", "localMap1");
        return "ok";
    }

    @Benchmark
    @GroupThreads(4)
    public String testLocalCacheHGet() {
        return localCacheWrapper.hget("my_benchmark_key_hash", "my_hash_key", "localMap1");
    }

    @Benchmark
    @GroupThreads(4)
    public String testLocalCacheHGetAll() {
        localCacheWrapper.hgetAll("my_benchmark_key_hash", "localMap1");
        return "ok";
    }

    @Benchmark
    @GroupThreads(4)
    public String testLocalCacheSAdd() {
        localCacheWrapper.sadd("slkfjskldfjsdklf", UUID.randomUUID().toString(), "localMap1");
        return "ok";
    }

    @Benchmark
    @GroupThreads(4)
    public String testLocalCacheSmember() {
        localCacheWrapper.smembers("my_benchmark_key_set", "localMap1");
        return "ok";
    }

    @Benchmark
    @GroupThreads(4)
    public String testLocalCacheZAdd() {
        localCacheWrapper.zadd(UUID.randomUUID().toString(), UUID.randomUUID().toString(), 100, "localMap1");
        return "ok";
    }

    @Benchmark
    @GroupThreads(4)
    public String testLocalCacheZRange() {
        localCacheWrapper.zrange("my_benchmark_key_zset", "start_val", "end_val", "localMap1");
        return "ok";
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(localCacheBenchmark.class.getSimpleName())
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}

@BenchmarkMode(Mode.Throughput) tells JMH to measure the throughput (ops) of each method.

@OutputTimeUnit(TimeUnit.SECONDS) reports the results per second, which is exactly ops (operations per second).

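@Setup marks the method that initializes the class under benchmark. Note that the class being benchmarked must receive its dependencies through a constructor with parameters; it cannot rely on @Resource or @Autowired injection. JMH creates the benchmark objects itself, outside any dependency-injection container, so fields that expect injection stay null and the run throws a NullPointerException. Constructor injection with parameters looks like this: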

public class OffheapCacheWrapper implements OffheapCacheStrategy {

    /**
     * Constructor injection
     * @param offheapCacheBuilder
     */
    public OffheapCacheWrapper(OffheapCacheBuilder offheapCacheBuilder) {
        this.offheapCacheBuilder = offheapCacheBuilder;
    }

    /**
     * Cache builder
     */
    private OffheapCacheBuilder offheapCacheBuilder;
}

The instance can then be initialized the same way as in the benchmark code.
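The difference between the two wiring styles can be sketched without any framework. All class names below (CacheBuilder, FieldInjectedWrapper, ConstructorInjectedWrapper) are hypothetical stand-ins, not the classes from the project; the field-injected variant simply models a field that nobody fills in when no container is running.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ConstructorInjectionSketch {

    /** Hypothetical stand-in for a builder such as OffheapCacheBuilder. */
    static final class CacheBuilder {
        private final Map<String, String> store = new ConcurrentHashMap<>();

        Map<String, String> store() {
            return store;
        }
    }

    /** Field-style wiring: the field is only populated if a container injects it. */
    static final class FieldInjectedWrapper {
        CacheBuilder builder; // would carry @Resource or @Autowired in a container-managed class

        String get(String key) {
            return builder.store().get(key);
        }
    }

    /** Constructor-style wiring: the dependency is supplied when the object is created. */
    static final class ConstructorInjectedWrapper {
        private final CacheBuilder builder;

        ConstructorInjectedWrapper(CacheBuilder builder) {
            this.builder = builder;
        }

        void set(String key, String value) {
            builder.store().put(key, value);
        }

        String get(String key) {
            return builder.store().get(key);
        }
    }

    /** Returns true if using the field-injected wrapper without a container throws an NPE. */
    static boolean fieldInjectionFailsWithoutContainer() {
        try {
            new FieldInjectedWrapper().get("my_benchmark_key");
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("field injection fails: " + fieldInjectionFailsWithoutContainer());
        ConstructorInjectedWrapper wrapper = new ConstructorInjectedWrapper(new CacheBuilder());
        wrapper.set("my_benchmark_key", "this is my first benchmark!!!!!");
        System.out.println(wrapper.get("my_benchmark_key"));
    }
}
```

A setup method in a benchmark class can build the constructor-injected variant by hand, which is what the init() method in the benchmark code does.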

@Benchmark marks a method as one to be measured.

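@GroupThreads sets the number of concurrent threads used for the method. On a 4-core machine, 4 is generally the most suitable value. The principle is the same as starting 4 threads locally by hand.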

Finally, the main method. Each benchmark class should include a main entry point, which can be written as shown above. Run it and the benchmark starts, logging output like this:

# JMH version: 1.19
# VM version: JDK 1.8.0_162, VM 25.162-b12
# VM invoker: C:\Program Files\Java\jdk1.8.0_162\jre\bin\java.exe
# VM options: -Dvisualvm.id=95286622200089 -javaagent:D:\soft\IntelliJ IDEA 2017.2.4\lib\idea_rt.jar=51200:D:\soft\IntelliJ IDEA 2017.2.4\bin -Dfile.encoding=UTF-8
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 4 threads, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.jd.limitbuy.common.cache.offheap.local.localCacheBenchmark.testLocalCacheGet

# Run progress: 0.00% complete, ETA 00:06:00
# Fork: 1 of 1
# Warmup Iteration   1: 4626030.786 ops/s
# Warmup Iteration   2: 5915177.466 ops/s
# Warmup Iteration   3: 5250390.707 ops/s
# Warmup Iteration   4: 5821984.889 ops/s
# Warmup Iteration   5: 5878264.192 ops/s
# Warmup Iteration   6: 5958235.775 ops/s
# Warmup Iteration   7: 5872995.249 ops/s
# Warmup Iteration   8: 5776545.647 ops/s
# Warmup Iteration   9: 5698557.365 ops/s
# Warmup Iteration  10: 5408015.908 ops/s
# Warmup Iteration  11: 5369501.297 ops/s
# Warmup Iteration  12: 5656644.350 ops/s
# Warmup Iteration  13: 5927929.754 ops/s
# Warmup Iteration  14: 4925956.931 ops/s
# Warmup Iteration  15: 5073723.984 ops/s
# Warmup Iteration  16: 5562728.644 ops/s
# Warmup Iteration  17: 5404073.901 ops/s
# Warmup Iteration  18: 5710289.068 ops/s
# Warmup Iteration  19: 5279941.519 ops/s
# Warmup Iteration  20: 5313558.528 ops/s
Iteration   1: 5479700.075 ops/s
Iteration   2: 5435900.429 ops/s
Iteration   3: 5644384.753 ops/s
Iteration   4: 5439492.270 ops/s
Iteration   5: 4821232.721 ops/s
Iteration   6: 5255550.541 ops/s
Iteration   7: 5328415.572 ops/s
Iteration   8: 5303100.251 ops/s
Iteration   9: 5608949.378 ops/s
Iteration  10: 5493709.321 ops/s
Iteration  11: 5656755.883 ops/s
Iteration  12: 5342198.063 ops/s
Iteration  13: 5356092.929 ops/s
Iteration  14: 5448346.884 ops/s
Iteration  15: 5594615.720 ops/s
Iteration  16: 5263648.663 ops/s
Iteration  17: 5820217.743 ops/s
Iteration  18: 3766476.832 ops/s
Iteration  19: 5430792.407 ops/s
Iteration  20: 5607185.081 ops/s


Result "com.jd.limitbuy.common.cache.offheap.local.localCacheBenchmark.testLocalCacheGet":
  5354838.276 ±(99.9%) 371083.594 ops/s [Average]
  (min, avg, max) = (3766476.832, 5354838.276, 5820217.743), stdev = 427340.418
  CI (99.9%): [4983754.682, 5725921.870] (assumes normal distribution)

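This is the complete benchmark output for testLocalCacheGet. It ran 4 threads, with 20 warm-up iterations followed by 20 measurement iterations of 1 s each, and every iteration reports an ops figure. The final statistics show the average ops together with its error margin, the min and max, and the standard deviation. It is very detailed.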
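The summary figures can be reproduced from the 20 measurement iterations in the log. The sketch below assumes the reported error is the half-width of a Student's t confidence interval, that is, t × stdev / √n with the sample standard deviation; the critical value 3.8834 for 99.9% confidence and 19 degrees of freedom is an approximate table value, and the class name IterationStats is hypothetical.

```java
final class IterationStats {

    /** The 20 measurement iterations of testLocalCacheGet, copied from the log (ops/s). */
    static final double[] SCORES = {
        5479700.075, 5435900.429, 5644384.753, 5439492.270, 4821232.721,
        5255550.541, 5328415.572, 5303100.251, 5608949.378, 5493709.321,
        5656755.883, 5342198.063, 5356092.929, 5448346.884, 5594615.720,
        5263648.663, 5820217.743, 3766476.832, 5430792.407, 5607185.081
    };

    /** Approximate two-sided 99.9% Student's t critical value for 19 degrees of freedom. */
    static final double T_CRITICAL_999_DF19 = 3.8834;

    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Sample standard deviation (divides by n - 1). */
    static double sampleStdev(double[] xs) {
        double m = mean(xs);
        double squares = 0.0;
        for (double x : xs) {
            squares += (x - m) * (x - m);
        }
        return Math.sqrt(squares / (xs.length - 1));
    }

    /** Half-width of the confidence interval around the mean. */
    static double errorMargin(double[] xs, double tCritical) {
        return tCritical * sampleStdev(xs) / Math.sqrt(xs.length);
    }

    public static void main(String[] args) {
        double m = mean(SCORES);
        double e = errorMargin(SCORES, T_CRITICAL_999_DF19);
        System.out.printf("avg = %.3f, stdev = %.3f%n", m, sampleStdev(SCORES));
        System.out.printf("CI (99.9%%) = [%.3f, %.3f]%n", m - e, m + e);
    }
}
```

Iteration 18 (3766476.832 ops/s) is a clear outlier, and it alone accounts for most of the standard deviation, which shows why a single timed run is not enough to trust.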

Once all methods have been measured, a summary is printed:

Benchmark                                   Mode  Cnt        Score        Error  Units
localCacheBenchmark.testLocalCacheGet      thrpt   20  5367451.445 ± 299325.857  ops/s
localCacheBenchmark.testLocalCacheHGet     thrpt   20  1878476.142 ±  44977.163  ops/s
localCacheBenchmark.testLocalCacheHGetAll  thrpt   20  2597442.245 ± 148661.259  ops/s
localCacheBenchmark.testLocalCacheHSet     thrpt   20    39059.991 ±  60782.779  ops/s
localCacheBenchmark.testLocalCacheSAdd     thrpt   20   231858.138 ±  80494.757  ops/s
localCacheBenchmark.testLocalCacheSet      thrpt   20   168179.683 ± 145495.126  ops/s
localCacheBenchmark.testLocalCacheSmember  thrpt   20  2650997.831 ±  97273.369  ops/s
localCacheBenchmark.testLocalCacheZAdd     thrpt   20    56061.015 ±  73744.916  ops/s
localCacheBenchmark.testLocalCacheZRange   thrpt   20  1682366.032 ±  75684.334  ops/s

This lets us assess the ops of each method.

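Likewise, to evaluate a method's tp50, tp99, and tp999 latency, change BenchmarkMode to Mode.SampleTime, which samples execution times and reports percentiles. Mode.AverageTime reports only the average time per operation. Very convenient.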
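To make the tp50/tp99/tp999 terminology concrete, here is a minimal sketch that computes such percentiles from a set of latency samples. It uses the nearest-rank definition, which is an assumption made for illustration; JMH computes its own percentiles and may use a different method. The class name LatencyPercentiles and the synthetic samples are hypothetical.

```java
import java.util.Arrays;

final class LatencyPercentiles {

    /**
     * Nearest-rank percentile, expressed in per mille to keep the arithmetic exact:
     * 500 is tp50, 990 is tp99, 999 is tp999.
     */
    static long percentile(long[] samples, int perMille) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (perMille < 1 || perMille > 1000) {
            throw new IllegalArgumentException("perMille must be in 1..1000");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        // Smallest rank such that at least perMille/1000 of the samples are at or below it.
        long rank = ((long) perMille * sorted.length + 999) / 1000;
        return sorted[(int) rank - 1];
    }

    public static void main(String[] args) {
        // Synthetic latencies of 1..1000 microseconds, one sample each.
        long[] latenciesMicros = new long[1000];
        for (int i = 0; i < latenciesMicros.length; i++) {
            latenciesMicros[i] = i + 1;
        }
        System.out.println("tp50  = " + percentile(latenciesMicros, 500) + " us");
        System.out.println("tp99  = " + percentile(latenciesMicros, 990) + " us");
        System.out.println("tp999 = " + percentile(latenciesMicros, 999) + " us");
    }
}
```

A tp99 of X means that 99% of the sampled calls finished within X, which an average alone cannot tell you.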

Note: if you use IntelliJ IDEA, you need to install the JMH plugin.
