Intel's PAUSE Instruction Change Hurts MySQL Performance: How Do We Fix It?

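Thanks to its open-source nature, mature commercial backing, a well-run community, and steady iteration on features, MySQL has become the standard relational database of the internet industry. x86 servers and Linux, as the infrastructure, together with MySQL form the foundation of internet data storage services, and the three complement each other. This article shares a case from our daily work: a change in the cycle count of the Intel PAUSE instruction caused a MySQL performance bottleneck, and the Meituan MySQL DBA team analyzed, located, and optimized the problem step by step across all three layers. We hope the approach is useful to others.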

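1. Background

In 2017, Intel released its new-generation server platform, Purley, and reorganized the Intel Xeon Scalable Processor family into four tiers: Platinum, Gold, Silver, and Bronze. Product positioning and the overall lineup became clearer as a result.

Meituan's online backend services, such as high-volume transaction processing and storage, depend on a large fleet of high-performance servers. As some of the Grantley-platform E-series servers in production approached end of life, and as the product line itself moved on, the RDS (relational database service) backend storage (MySQL) began bringing large numbers of Purley-platform Skylake servers online in 2019, including the Silver 4110.

Compared with the previous-generation E5-2620 v4, the Silver 4110 supports a higher memory frequency, more memory channels, a larger L2 cache, and a faster bus transfer rate. According to Intel's official figures, the Silver 4110 performs 10% better than the E5-2620 v4.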

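However, as the number of Skylake servers in production grew and more services moved onto them, the Meituan MySQL DBA team found that some MySQL instances fell short of the expected performance, sometimes by a wide margin. After sustained analysis, we traced this to a performance bottleneck on the Skylake servers:

CPU load was relatively high.
Throughput metrics such as TPS dropped.

Next, we dig into the problem from three angles (the Intel CPU, the ut_delay function, and the PAUSE instruction) and explore ways to optimize it.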

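2. Performance Problem Analysis

2.1 CPU Performance Differences Between Grantley and Purley

First, we benchmarked the CPUs of the two platform generations (Grantley and Purley) and compared their performance side by side under different operating systems.

The benchmark data can be summarized as follows:

1. In the oltp_write_only (write-only) scenario, the Purley 4110 shows a clear performance drop.
2. On the same Purley 4110, CentOS 7 outperforms CentOS 6 in oltp_write_only (write-only).

We plotted the differences as two-dimensional line charts:

In the chart, the same Purley 4110 performs better on CentOS 7 than on CentOS 6. The reason is outside the main topic of this article, so we only outline it here.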

New MCS-based Locking Mechanism

Red Hat Enterprise Linux 7.1 introduces a new locking mechanism, MCS locks. This new locking mechanism significantly reduces spinlock overhead in large systems, which makes spinlocks generally more efficient in Red Hat Enterprise Linux 7.1.

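Red Hat's release notes show that, starting with kernel 3.10.0-229, a new locking mechanism called MCS locks was introduced. It lowers spinlock overhead so that spinlocks run more efficiently. With a plain spinlock on a multi-core system, every waiting CPU spins on the same shared variable while only one CPU can hold the lock; to keep the data correct, the cache coherence protocol has to synchronize and invalidate the corresponding cache line across all CPUs, which degrades performance. An MCS lock instead gives each CPU its own local "spinlock" variable and lets it spin only on that, avoiding the cache-line synchronization and improving performance. That said, spinlock optimization has been heavily debated in the community; qspinlock was later built on top of MCS and merged in the 4.x kernels. For implementation details, see: MCS locks and qspinlocks.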
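As a rough illustration of the idea, here is a user-space sketch of an MCS-style lock using C11 atomics. It is not the kernel's implementation, and the type and function names (mcs_node, mcs_lock, mcs_init, mcs_acquire, mcs_release) are made up for this example. The point it shows is that each waiter spins on a flag in its own node rather than on one shared variable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One queue node per waiting thread; each waiter spins only on its own node. */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
} mcs_node;

/* The lock itself is just a pointer to the tail of the waiter queue. */
typedef struct {
    _Atomic(mcs_node *) tail;
} mcs_lock;

static inline void mcs_init(mcs_lock *lock)
{
    assert(lock != NULL);
    atomic_init(&lock->tail, (mcs_node *)NULL);
}

static inline void mcs_acquire(mcs_lock *lock, mcs_node *me)
{
    assert(lock != NULL && me != NULL);
    atomic_store(&me->next, (mcs_node *)NULL);
    atomic_store(&me->locked, true);

    /* Append our node to the queue; the previous tail is our predecessor. */
    mcs_node *prev = atomic_exchange(&lock->tail, me);
    if (prev != NULL) {
        atomic_store(&prev->next, me);
        /* Spin on our own flag only. A real implementation would issue a
         * CPU relax hint (such as PAUSE on x86) inside this loop. */
        while (atomic_load(&me->locked)) {
        }
    }
}

static inline void mcs_release(mcs_lock *lock, mcs_node *me)
{
    assert(lock != NULL && me != NULL);
    mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* No known successor: try to swing the tail back to empty. */
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong(&lock->tail, &expected,
                                           (mcs_node *)NULL)) {
            return;
        }
        /* A successor is in the middle of enqueueing; wait for its link. */
        while ((succ = atomic_load(&me->next)) == NULL) {
        }
    }
    /* Hand the lock over by clearing the successor's private flag. */
    atomic_store(&succ->locked, false);
}
```

Each thread passes its own mcs_node (typically on its stack) to mcs_acquire and the same node to mcs_release, so contended waiters never write to a cache line that other waiters are polling.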

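With that rough picture of the CentOS 7 improvements, we now look closely at why the Skylake Silver 4110 causes the performance drop.

3. CPU Performance Tracing

3.1 Locating the Hot Function

We narrowed down the 4110 bottleneck in the following steps:

First, use perf top to watch where CPU time goes on Linux.
Then, use perf record to capture the share of CPU cycles spent in each function.
Finally, use a flame graph to confirm the hot function.

The function with the largest share of CPU time is ut_delay.

Let us dig further into the call chain: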

# Children      Self  Command  Shared Object        Symbol                                                                                                                                                                            
# ........  ........  .......  ...................  ..................................................................................................................................................................................
#
    93.54%     0.00%  mysqld   libpthread-2.17.so   [.] start_thread
            |
            ---start_thread
               |          
               |--77.07%--pfs_spawn_thread
               |          |          
               |           --77.05%--handle_connection
               |                     |          
               |                      --76.97%--do_command
               |                                |          
               |                                |--74.30%--dispatch_command
               |                                |          |          
               |                                |          |--71.16%--mysqld_stmt_execute
               |                                |          |          |          
               |                                |          |           --70.74%--Prepared_statement::execute_loop
               |                                |          |                     |          
               |                                |          |                     |--69.53%--Prepared_statement::execute
               |                                |          |                     |          |          
               |                                |          |                     |          |--67.90%--mysql_execute_command
               |                                |          |                     |          |          |          
               |                                |          |                     |          |          |--23.43%--trans_commit_stmt
               |                                |          |                     |          |          |          |          
               |                                |          |                     |          |          |           --23.30%--ha_commit_trans
               |                                |          |                     |          |          |                     |          
               |                                |          |                     |          |          |                     |--18.86%--MYSQL_BIN_LOG::commit
               |                                |          |                     |          |          |                     |          |          
               |                                |          |                     |          |          |                     |           --18.18%--MYSQL_BIN_LOG::ordered_commit
               |                                |          |                     |          |          |                     |                     |          
               |                                |          |                     |          |          |                     |                     |--8.02%--MYSQL_BIN_LOG::change_stage
               |                                |          |                     |          |          |                     |                     |          |          
               |                                |          |                     |          |          |                     |                     |          |--2.35%--__lll_unlock_wake
               |                                |          |                     |          |          |                     |                     |          |          |          
               |                                |          |                     |          |          |                     |                     |          |           --2.24%--system_call_fastpath
               |                                |          |                     |          |          |                     |                     |          |                     |          
               |                                |          |                     |          |          |                     |                     |          |                      --2.24%--sys_futex
               |                                |          |                     |          |          |                     |                     |          |                                |          
               |                                |          |                     |          |          |                     |                     |          |                                 --2.23%--do_futex
               |                                |          |                     |          |          |                     |                     |          |                                           |          
               |                                |          |                     |          |          |                     |                     |          |                                            --2.14%--futex_wake
               |                                |          |                     |          |          |                     |                     |          |                                                      |          
               |                                |          |                     |          |          |                     |                     |          |                                                       --1.38%--wake_up_q
               |                                |          |                     |          |          |                     |                     |          |                                                                 |          
               |                                |          |                     |          |          |                     |                     |          |                                                                  --1.33%--try_to_wake_up
               ...

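The same call chain shown as a flame graph:

At this point we can be fairly sure that, across all call paths, most of the time ends up in ut_delay.

3.2 How ut_delay Relates to PAUSE, and the Performance Impact

3.2.1 MySQL's ut_delay Implementation

Next, let us look at what the ut_delay function does in the MySQL source: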

/*************************************************************//**
Runs an idle loop on CPU. The argument gives the desired delay
in microseconds on 100 MHz Pentium + Visual C++.
@return dummy value */
ulint
ut_delay(
/*=====*/
  ulint delay)  /*!< in: delay in microseconds on 100 MHz Pentium */
{
  ulint i, j;

  UT_LOW_PRIORITY_CPU();

  j = 0;

  for (i = 0; i < delay * 50; i++) {
    j += i;
    UT_RELAX_CPU();
  }

  UT_RESUME_PRIORITY_CPU();

  return(j);
}
...

#   define UT_RELAX_CPU() asm ("pause" )
#   define UT_RELAX_CPU() __asm__ __volatile__ ("pause")

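So MySQL's spin loop executes the PAUSE instruction, whose purpose is to improve the performance of spin-wait loops.

3.2.2 Evolution of the PAUSE Instruction Latency

Intel's own documentation describes the change to PAUSE on the new platform architecture: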

Pause Latency in Skylake Microarchitecture

The PAUSE instruction is typically used with software threads executing on two logical processors located in the same processor core, waiting for a lock to be released. Such short wait loops tend to last between tens and a few hundreds of cycles, so performance-wise it is better to wait while occupying the CPU than yielding to the OS. When the wait loop is expected to last for thousands of cycles or more, it is preferable to yield to the operating system by calling an OS synchronization API function, such as WaitForSingleObject on Windows* OS or futex on Linux.

…

The latency of the PAUSE instruction in prior generation microarchitectures is about 10 cycles, whereas in Skylake microarchitecture it has been extended to as many as 140 cycles.

The increased latency (allowing more effective utilization of competitively-shared microarchitectural resources to the logical processor ready to make forward progress) has a small positive performance impact of 1-2% on highly threaded applications. It is expected to have negligible impact on less threaded applications if forward progress is not blocked executing a fixed number of looped PAUSE instructions. There’s also a small power benefit in 2-core and 4-core systems.

As the PAUSE latency has been increased significantly, workloads that are sensitive to PAUSE latency will suffer some performance loss.

…

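In the previous-generation architecture (the Grantley-platform E series), PAUSE takes about 10 cycles; in the new Skylake architecture it can take as many as 140 cycles.
If a program uses a fixed number of looped PAUSE instructions to delay for a period of time and thereby block execution, the delay may become unexpectedly long.
Because the PAUSE latency has increased, applications that are sensitive to PAUSE latency will lose some performance.

A simplified formula for program execution time:

ExecutionTime(T) = InstructionCount × TimePerCycle × CPI

That is: execution time = total instruction count × time per CPU clock cycle × average clock cycles per instruction.

MySQL's internal spin is implemented as exactly such a fixed-count PAUSE loop. As the cycle count of PAUSE goes up, the CPI of the spin loop goes up, each spin takes longer, the program's execution time grows accordingly, and overall system throughput suffers.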
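To make the formula concrete, the sketch below plugs in the numbers for a single ut_delay(6) call. It is a back-of-the-envelope model, not MySQL code: the function name execution_time_ns is made up, the 300 instructions are the 6 × 50 PAUSE iterations from the ut_delay source shown earlier (loop overhead is ignored), the 0.5 ns cycle time assumes a 2 GHz clock purely for illustration, and the CPI values 10 and 140 are the approximate PAUSE latencies Intel quotes.

```c
#include <assert.h>

/* Hypothetical helper implementing
 * ExecutionTime = InstructionCount * TimePerCycle * CPI.
 * Time per cycle is given in nanoseconds, so the result is in nanoseconds. */
static inline double execution_time_ns(double instruction_count,
                                       double ns_per_cycle,
                                       double cpi)
{
    assert(instruction_count >= 0.0 && ns_per_cycle > 0.0 && cpi > 0.0);
    return instruction_count * ns_per_cycle * cpi;
}
```

Under these assumptions, execution_time_ns(300, 0.5, 10) gives 1,500 ns and execution_time_ns(300, 0.5, 140) gives 21,000 ns: the instruction count and the clock are unchanged, and the 14-fold higher CPI alone makes the same spin take 14 times as long.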

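Intel's documentation makes it clear that the PAUSE latency differs across platforms and CPU architectures.

Below, a small test program roughly verifies and compares the number of cycles PAUSE takes on the old and new CPU architectures: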

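#include <stdio.h>

#define TIMES 5             /* loop iterations */
#define PAUSES_PER_LOOP 20  /* PAUSE instructions per iteration (see the asm block) */

/* Read the x86 time-stamp counter. On modern Intel CPUs the TSC ticks at a
   constant rate, so the result only approximates core clock cycles. */
static inline unsigned long long rdtsc(void)
{
    unsigned int low, high;
    asm volatile("rdtsc" : "=a" (low), "=d" (high));
    return (unsigned long long)low | ((unsigned long long)high << 32);
}

/* Execute TIMES * PAUSES_PER_LOOP PAUSE instructions. */
static void pause_test(void)
{
    int i;
    for (i = 0; i < TIMES; i++) {
        asm volatile(
                "pause\n"
                "pause\n"
                "pause\n"
                "pause\n"
                "pause\n"
                "pause\n"
                "pause\n"
                "pause\n"
                "pause\n"
                "pause\n"
                "pause\n"
                "pause\n"
                "pause\n"
                "pause\n"
                "pause\n"
                "pause\n"
                "pause\n"
                "pause\n"
                "pause\n"
                "pause\n");
    }
}

/* Return the approximate number of TSC ticks per PAUSE instruction. */
static unsigned long long pause_cycle(void)
{
    unsigned long long start, finish, elapsed;
    start = rdtsc();
    pause_test();
    finish = rdtsc();
    elapsed = finish - start;
    return elapsed / (TIMES * PAUSES_PER_LOOP);
}

int main(void)
{
    printf("Approximate cycles per PAUSE: %llu\n", pause_cycle());
    return 0;
}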

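The results can be summarized as follows:

The 4110 and 5118 have a long PAUSE latency, more than 100 cycles each; both belong to Skylake, the first Purley generation.
The 4210 and 5218 have a shorter PAUSE latency than the previous generation because both belong to Cascade Lake, the second Purley generation, in which the PAUSE instruction was optimized.

3.2.3 A Guess at Why Intel Lengthened PAUSE

We speculate that Intel lengthened PAUSE to reduce the probability of spinlock contention and to lower power consumption; the side effect is that each PAUSE takes longer to execute, which lowers overall throughput for code that spins a fixed number of times.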

The increased latency (allowing more effective utilization of competitively-shared microarchitectural resources to the logical processor ready to make forward progress) has a small positive performance impact of 1-2% on highly threaded applications. It is expected to have negligible impact on less threaded applications if forward progress is not blocked executing a fixed number of looped PAUSE instructions.

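3.3 How PAUSE Causes the Write Bottleneck

Next, we analyze why the PAUSE instruction turns into a write bottleneck in MySQL.

First, we look at the InnoDB semaphore data in MySQL's internal status output: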

SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 153720
--Thread 139868617205504 has waited at row0row.cc line 1075 for 0.00 seconds the semaphore:
X-lock on RW-latch at 0x7f4298084250 created in file buf0buf.cc line 1425
a writer (thread id 139869284108032) has reserved it in mode  SX
number of readers 0, waiters flag 1, lock_word: 10000000
Last time read locked in file not yet reserved line 0
Last time write locked in file /mnt/workspace/percona-server-5.7-redhat-binary-rocks-new/label_exp/min-centos-7-x64/test/rpmbuild/BUILD/percona-server-5.7.26-29/percona-server-5.7.26-29/storage/innobase/buf/buf0flu.cc line 1216
OS WAIT ARRAY INFO: signal count 441329
RW-shared spins 0, rounds 1498677, OS waits 111991
RW-excl spins 0, rounds 717200, OS waits 9012
RW-sx spins 47596, rounds 366136, OS waits 4100
Spin rounds per wait: 1498677.00 RW-shared, 717200.00 RW-excl, 7.69 RW-sx

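The output shows writers blocked on a latch that was last write-locked at line 1216 of storage/innobase/buf/buf0flu.cc.

Here is the source around buf0flu.cc line 1216, where the lock is taken: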

    if (flush_type == BUF_FLUSH_LIST
        && is_uncompressed
        && !rw_lock_sx_lock_nowait(rw_lock, BUF_IO_WRITE)) {    // check for a lock conflict before locking
        
        if (!fsp_is_system_temporary(bpage->id.space())) {
        /* avoiding deadlock possibility involves
        doublewrite buffer, should flush it, because
        it might hold the another block->lock. */
        buf_dblwr_flush_buffered_writes(
          buf_parallel_dblwr_partition(bpage,
                flush_type));
      } else {
        buf_dblwr_sync_datafiles();
      }
      rw_lock_sx_lock_gen(rw_lock, BUF_IO_WRITE);        // acquire the SX lock
    }
... 
 #define rw_lock_sx_lock_nowait(M, P)       \
  rw_lock_sx_lock_low((M), (P), __FILE__, __LINE__)
...

rw_lock_sx_lock_func(                                       // function that acquires the SX lock
/*=================*/
  rw_lock_t*  lock, /*!< in: pointer to rw-lock */
  ulint   pass, /*!< in: pass value; != 0, if the lock will
        be passed to another thread to unlock */
  const char* file_name,/*!< in: file name where lock requested */
  ulint   line) /*!< in: line where requested */

{
  ulint   i = 0;
  sync_array_t* sync_arr;
  ulint   spin_count = 0;
  uint64_t  count_os_wait = 0;
  ulint   spin_wait_count = 0;

  ut_ad(rw_lock_validate(lock));
  ut_ad(!rw_lock_own(lock, RW_LOCK_S));

lock_loop:

  if (rw_lock_sx_lock_low(lock, pass, file_name, line)) {

    if (count_os_wait > 0) {
      lock->count_os_wait +=
        static_cast<uint32_t>(count_os_wait);
      rw_lock_stats.rw_sx_os_wait_count.add(count_os_wait);
    }

    rw_lock_stats.rw_sx_spin_round_count.add(spin_count);
    rw_lock_stats.rw_sx_spin_wait_count.add(spin_wait_count);

    /* Locking succeeded */
    return;

  } else {

    ++spin_wait_count;

    /* Spin waiting for the lock_word to become free */
    os_rmb;
    while (i < srv_n_spin_wait_rounds
           && lock->lock_word <= X_LOCK_HALF_DECR) {

      if (srv_spin_wait_delay) {
        ut_delay(ut_rnd_interval(
            0, srv_spin_wait_delay));                         // lock attempt failed: spin via ut_delay
      }

      i++;
    }                             

    spin_count += i;

    if (i >= srv_n_spin_wait_rounds) {

      os_thread_yield();

    } else {

      goto lock_loop;
    }
...
ulong srv_n_spin_wait_rounds  = 30;
ulong srv_spin_wait_delay = 6;

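The source above shows that MySQL implements lock waiting by calling ut_delay, which runs an idle loop.

InnoDB has three rw-lock modes: S (shared), X (exclusive), and SX (shared-exclusive). SX conflicts with both SX and X. Taking an SX lock does not block readers; it only blocks writers. Under a heavy write workload this produces a large number of lock waits, which means a large number of PAUSE instructions.

At this point, we can name the two factors that affect throughput:

The spin duration, which the source code of MySQL 5.7 and earlier defines as spin_wait_delay * 50.
The cycle count of the PAUSE instruction on the Intel CPU.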
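Combining the two factors gives a rough upper bound on the PAUSE cycles one thread can burn in a single spin phase before it yields. This is a back-of-the-envelope sketch, not MySQL code: the function name spin_phase_max_cycles is hypothetical, the defaults (30 rounds, delay up to 6, factor 50) come from the source shown above, and the cycle counts are Intel's approximate figures. Because ut_delay is called with a random value no larger than srv_spin_wait_delay, and the loop exits early once the lock is free, the real cost is lower than this bound.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: upper bound on the cycles spent in PAUSE during one
 * spin phase, i.e. `rounds` calls to ut_delay, each executing at most
 * max_delay * multiplier PAUSE instructions of `pause_cycles` cycles each. */
static inline uint64_t spin_phase_max_cycles(uint64_t rounds,
                                             uint64_t max_delay,
                                             uint64_t multiplier,
                                             uint64_t pause_cycles)
{
    assert(pause_cycles > 0);
    uint64_t max_pauses = rounds * max_delay * multiplier;
    return max_pauses * pause_cycles;
}
```

With the defaults, the bound is 30 × 6 × 50 = 9,000 PAUSE instructions. At about 10 cycles per PAUSE that is at most 90,000 cycles; at 140 cycles per PAUSE the same unchanged configuration allows up to 1,260,000 cycles.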

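Next, we approach the problem from these two sides and evaluate how much room for optimization there is and what effect it has.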

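4. Optimizing and Exploring the PAUSE Instruction and the Spin Parameters

4.1 MySQL Spin Parameter Tuning

4.1.1 MySQL 5.7 Spin Parameter Tuning

We can look for optimization points within the existing MySQL version and hardware.

MySQL exposes two parameters that control spinning, and we tuned them according to their characteristics:

innodb_spin_wait_delay

The unit of innodb_spin_wait_delay is the time a 100 MHz Pentium processor takes to run for 1 microsecond (per the ut_delay source comment). The default value is 6, meaning a spin of at most 6 microseconds on a 100 MHz Pentium.

innodb_sync_spin_loops

When an InnoDB thread tries to acquire a mutex and fails, it retries at most innodb_sync_spin_loops times.

Of the two, innodb_spin_wait_delay affects how long the PAUSE loop runs, so we ran tuning tests on this parameter.

As before, we used benchmarks to compare performance and effect under different values:

The results can be summarized as:

Adjusting innodb_spin_wait_delay has some effect on TPS and QPS: the smaller the value, the better MySQL performs, and the larger the value, the worse.
Tuning innodb_spin_wait_delay alone has limited effect; the improvement still falls short of what our production workloads need.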

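4.2 Porting the MySQL 8.0 Spin Feature

4.2.1 Porting spin_wait_pause_multiplier

For the throughput drop caused by PAUSE on Skylake CPUs, tuning the MySQL 5.7 spin parameter innodb_spin_wait_delay did not produce a significant improvement.

So we turned to a new feature in MySQL 8.0: to deal with PAUSE, the 8.0 source adds the spin_wait_pause_multiplier parameter, which replaces the previously hard-coded loop count.

4.2.2 How spin_wait_pause_multiplier Is Implemented

In the MySQL 8.0 source, the hard-coded factor of 50 loop iterations becomes an adjustable parameter, spin_wait_pause_multiplier.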

ulint ut_delay(ulint delay) {
  ulint i, j;
  /* We don't expect overflow here, as ut::spin_wait_pause_multiplier is limited
  to 100, and values of delay are not larger than @@innodb_spin_wait_delay
  which is limited by 1 000. Anyway, in case an overflow happened, the program
  would still work (as iterations is unsigned). */
  const ulint iterations = delay * ut::spin_wait_pause_multiplier;
  UT_LOW_PRIORITY_CPU();

  j = 0;

  for (i = 0; i < iterations; i++) {
    j += i;
    UT_RELAX_CPU();
  }

  UT_RESUME_PRIORITY_CPU();

  return (j);
}
...
namespace ut {
ulong spin_wait_pause_multiplier = 50;
}

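4.2.3 Tuning with the Ported spin_wait_pause_multiplier Patch

Since the MySQL 8.0 parameter spin_wait_pause_multiplier controls how long the PAUSE loop runs, lowering it reduces the overall impact of PAUSE.

After studying the relevant MySQL 8.0 code, we ported the patch to the stable version we run in production: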

MySQL>select version();
+------------------+
| version()        |
+------------------+
| 5.7.26-29-mt-log |
+------------------+
1 row in set (0.00 sec)

MySQL>show global variables like '%spin%';  
+-----------------------------------+-------+
| Variable_name                     | Value |
+-----------------------------------+-------+
| innodb_spin_wait_delay            | 6     |
| innodb_spin_wait_pause_multiplier | 5     |
| innodb_sync_spin_loops            | 30    |
+-----------------------------------+-------+
3 rows in set (0.00 sec)

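As shown earlier, PAUSE on the Silver 4110 takes roughly 14 times as many cycles as on the E5-2620 v4 (about 140 versus about 10). On that basis we scaled innodb_spin_wait_pause_multiplier to 1/14 of its default (50 / 14 ≈ 3.6) and chose a slightly larger value, changing the parameter from the default 50 to 5.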
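The scaling rule behind this choice can be sketched as follows. This is only an illustration of the arithmetic, not part of the patch: the function name scale_pause_multiplier is made up, and the rule it encodes (keep multiplier × PAUSE cycles roughly constant, rounding up) is an assumption about the reasoning rather than something the MySQL source prescribes.

```c
#include <assert.h>

/* Hypothetical helper: scale a PAUSE multiplier so that
 * multiplier * pause_cycles stays roughly constant when moving from a CPU
 * with `old_pause_cycles` per PAUSE to one with `new_pause_cycles`.
 * The division rounds up, so the result is at least 1 for non-zero input. */
static inline unsigned long scale_pause_multiplier(unsigned long old_multiplier,
                                                   unsigned long old_pause_cycles,
                                                   unsigned long new_pause_cycles)
{
    assert(new_pause_cycles > 0);
    unsigned long budget = old_multiplier * old_pause_cycles;
    return (budget + new_pause_cycles - 1) / new_pause_cycles; /* ceiling */
}
```

For a default of 50 and a move from about 10 to about 140 cycles, scale_pause_multiplier(50, 10, 140) returns 4, the rounded-up form of 50 / 14 ≈ 3.6. The value 5 used here sits just above that, which keeps the spin slightly longer than a strict 1/14 scaling would.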

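Finally, we again used two-dimensional line charts to compare the benchmark data after tuning with the patch:

After porting the spin_wait_pause_multiplier patch to the Silver 4110 and tuning it, the 4110 (patch) performs substantially better.
The Silver 4110 (patch) outperforms the configuration that only tunes innodb_spin_wait_delay.
In the write-only scenario with more than 64 concurrent threads, the Silver 4110 (patch) is slightly slower than the E5-2620 v4; in all other cases it is faster.
With a realistic production read/write mix, the 4110 (patch) brings throughput back to the original level.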

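4.3 PAUSE Instruction Latency Improvements

In the earlier section we measured a lower PAUSE latency on Cascade Lake CPUs. Intel's technical experts confirmed that, starting with Cascade Lake, the second Purley generation, Intel reduced the PAUSE latency to 44 cycles. (Presumably Intel also noticed the performance bottleneck caused by the longer PAUSE in the first generation.)

We ran further benchmarks on the second-generation CPUs to see how they perform:

Then we used perf diff to compare the ut_delay overhead on the 4110 and the 4210:

The share of ut_delay on the 4210 is 8% lower than on the 4110.
Because the PAUSE latency is still several times that of the E5-series CPUs, PAUSE overhead still has a sizable impact on MySQL throughput on the 4210 under high load. Below 128 concurrent threads, however, performance is much better than on the 4110 and should be enough for production workloads (the curve matches the benchmark data for the ported spin_wait_pause_multiplier patch).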

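5. Summary

To wrap up, here is a brief summary of this article:

Intel increased the latency of the PAUSE instruction in its new-platform CPUs. Under high concurrency with heavy spinlock contention, this can cost a program a lot of performance, especially a program that executes a fixed number of PAUSE instructions.
For Skylake CPUs (such as the 4110), the performance problems caused by the longer PAUSE latency can be addressed as follows:

Port the MySQL 8.0 innodb_spin_wait_pause_multiplier patch to your stable production version (or upgrade to MySQL 8.0) and raise throughput by shortening the time spent executing PAUSE.
If the OS is CentOS 6, upgrade to CentOS 7; its spinlock optimizations also give MySQL some performance gain.
The simplest and most direct fix is to switch to Cascade Lake CPUs.

For Cascade Lake CPUs, Intel has already shortened the PAUSE latency, so the regression is largely addressed. The optimizations above can still be applied to push performance a step further.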

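6. About the Author

Chunlin joined Meituan in 2017 and works mainly on MySQL operations development and optimization.

Recruitment

The Meituan DBA team is hiring for a range of roles, based in Beijing or Shanghai. We are committed to providing the company with stable, reliable, and efficient online storage services and to building an industry-leading database team. We run tens of thousands of MySQL instances of various architectures that serve trillions of OLTP requests per day: a truly massive, distributed, high-concurrency environment. If you are interested, send your résumé to: [email protected] (email subject: Meituan DBA team).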
