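1 Background
Our ops colleagues have brought up, time and again, that we should write a circuit-breaker module for our financial gateway system as soon as possible. Without one they never feel at ease: if some business system goes down one day, it could drag the gateway down with it. The financial industry is in a slump right now, retail investors are still getting fleeced and are in no mood to trade, system traffic is low, and time is fairly plentiful, so I went ahead and built it.
2 Approach
I skimmed the user guides and source code of several common circuit-breaker frameworks. Their approaches are much the same: nearly all of them are built around three states.
These three states determine the breaker's current behavior: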
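- Closed: monitor whether the failure metrics within the current time window stay inside the configured range; once they exceed it, open the breaker;
- Open: reject all requests until the open period times out, at which point the breaker switches to half-open;
- Half-open: tentatively let a small number of requests through and monitor whether the results exceed the failure threshold. If they do not, consider the system recovered and close the breaker; otherwise go back to open.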
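The three states and their transitions can be sketched as a small pure function. This is only an illustrative sketch: the names `BreakerTransitions`, `State`, and `next` are hypothetical and do not appear in the demo code later in this post, which stores the state in two bits of a long instead.

```java
/** Hypothetical sketch of the three-state transitions; not the demo's actual representation. */
final class BreakerTransitions {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private BreakerTransitions() {}

    /**
     * Computes the next state.
     *
     * @param failureRateExceeded whether the observed failure rate is over the threshold
     * @param timedOut            whether the open (or half-open) period has elapsed
     */
    static State next(State current, boolean failureRateExceeded, boolean timedOut) {
        switch (current) {
            case CLOSED:
                // Closed: open as soon as the failure rate is over the threshold.
                return failureRateExceeded ? State.OPEN : State.CLOSED;
            case OPEN:
                // Open: reject everything until the open period times out.
                return timedOut ? State.HALF_OPEN : State.OPEN;
            case HALF_OPEN:
                // Half-open: wait for the probe period to end, then decide.
                if (!timedOut) return State.HALF_OPEN;
                return failureRateExceeded ? State.OPEN : State.CLOSED;
            default:
                throw new IllegalStateException("unknown state: " + current);
        }
    }
}
```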
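I originally wanted to build directly on an existing circuit-breaker framework, but these frameworks carry too many features and handle multithreaded concurrency with locks. That works well for business systems but is still too heavy for a gateway, so I had to design my own.
First, the closed state. This state needs a window that collects samples, which are the basis for computing metrics and judging whether the system is healthy. For the window definition I looked at resilience4j, which offers two window types: time-based and count-based. A circuit breaker essentially uses the current failure metrics to judge whether the system is working normally, so the metrics have to be fresh. With a count-based window, stale samples can linger for a long time when traffic is low, which makes it hard to judge the current state accurately, so I only implemented the time-based window. Next comes a minimum sample count for the window: with too few samples the judgment is unreliable and the breaker may trip by mistake. Computing the failure rate only requires recording the number of failed responses and the total number of responses.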
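As a minimal sketch of the closed-state check, assuming an integer percentage threshold like the one used in the demo code: the helper below (the name `FailureRateCheck.shouldOpen` is hypothetical) refuses to judge when there are fewer samples than the minimum, which also avoids dividing by zero on an empty window.

```java
/** Hypothetical helper: decides whether a closed breaker should open. */
final class FailureRateCheck {
    private FailureRateCheck() {}

    /**
     * @param failures         failed responses in the window
     * @param total            all responses in the window
     * @param minSamples       minimum sample count before any judgment is made
     * @param thresholdPercent failure-rate threshold in percent
     */
    static boolean shouldOpen(long failures, long total, long minSamples, long thresholdPercent) {
        // Too few samples (including an empty window): do not judge, and do not divide by zero.
        if (total <= 0 || total < minSamples) {
            return false;
        }
        long failureRate = failures * 100 / total; // integer percent, rounded down
        return failureRate > thresholdPercent;
    }
}
```

For example, 3 failures out of 4 responses is 75%, which trips a 50% threshold only if the minimum sample count is 4 or less.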
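Next, the open state. All requests are rejected, but responses to requests that were admitted earlier are still counted. On entering the open state the breaker sets an open timeout; once it elapses the breaker moves to half-open automatically.
Finally, the half-open state. Here I found that different circuit breakers behave differently. Some go straight to closed, some let a single request through as a probe, and some allow a certain number of requests through and compute the failure rate to decide whether to open or close. My view is that judging whether the system has recovered still has to be based on samples from a recent period, but the number of samples must be limited: the system may not have recovered yet, or may still be warming up, and a flood of requests would put it under heavy pressure. So I added a traffic limit here. In short, at the end of the half-open window, if the failure rate is below the threshold that opens the breaker, the breaker closes; otherwise it opens again.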
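The half-open policy described above has two parts: a request cap while probing, and a decision at the end of the probe window. A rough sketch follows; the class and method names (`HalfOpenPolicy`, `admit`, `decide`) are hypothetical, and the comparisons mirror the ones used in the demo code (admit while the count is below the cap, reopen when the rate is at or above the threshold).

```java
/** Hypothetical sketch of the half-open admission cap and end-of-window decision. */
final class HalfOpenPolicy {
    enum Decision { STAY_HALF_OPEN, REOPEN, CLOSE }

    private HalfOpenPolicy() {}

    /** Admit a probe request only while fewer than maxRequests have been let through. */
    static boolean admit(long requestsSoFar, long maxRequests) {
        return requestsSoFar < maxRequests;
    }

    /** Decide what to do once the half-open window has elapsed. */
    static Decision decide(long failures, long total, long thresholdPercent) {
        if (total == 0) {
            // No responses observed: nothing to judge, keep probing.
            return Decision.STAY_HALF_OPEN;
        }
        long failureRate = failures * 100 / total;
        return failureRate >= thresholdPercent ? Decision.REOPEN : Decision.CLOSE;
    }
}
```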
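3 Design
To aggregate data over the window, a circular array records the statistics for each time unit, and a cursor field points at the array element for the current time. Every time a sample is recorded, the cursor locates the element that corresponds to the current time. I use a scheduled thread pool that advances the cursor once per time unit and updates the metrics for the whole window. In other words, sliding the window is done by the timer thread rather than by the business threads. This takes load off the business threads and also paves the way for a lock-free circuit breaker.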
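The circular array plus cursor can be sketched as follows. This is a simplified illustration with a single counter per bucket; the class `RingWindow` and its methods are hypothetical names, and the real demo further down packs several counters into each long. It assumes `slide()` is only ever called from one timer thread, for example via `ScheduledExecutorService.scheduleAtFixedRate`.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

/** Sketch only: a ring of per-time-unit counters plus a running window total. */
final class RingWindow {
    private final AtomicLongArray buckets;                 // one counter per time unit
    private final AtomicLong windowTotal = new AtomicLong();
    private volatile int cursor;                           // index of the current time unit

    RingWindow(int size) {
        buckets = new AtomicLongArray(size);
    }

    /** Called by business threads: count one event in the current time unit. */
    void record() {
        windowTotal.incrementAndGet();
        buckets.incrementAndGet(cursor);
    }

    /** Called by the single timer thread once per time unit. */
    void slide() {
        int next = (cursor + 1) % buckets.length();
        long expired = buckets.getAndSet(next, 0);         // the oldest bucket leaves the window
        windowTotal.addAndGet(-expired);
        cursor = next;
    }

    long total() {
        return windowTotal.get();
    }
}
```

Business threads never move the cursor; they only read it, so they do no window maintenance on the request path.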
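In addition, to improve performance, all statistics are packed into a single AtomicLong, which also makes multithreaded updates efficient. (An AtomicReference would work here too, but in terms of memory footprint and GC pressure AtomicLong is clearly lighter.) With a long, overflow has to be considered, but a 20-bit field (counts up to 1,048,575) is enough for most scenarios. The total request count covers requests within the half-open window and is used to detect over-limit traffic in the half-open state.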
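To make the packing concrete, here is a small worked sketch using the same field widths as the demo code below (20 bits for requests, 21 for responses, 21 for failed responses, 2 for status, 64 in total). The class name `PackedStats` and its method names are hypothetical; the demo's own helper is `StatisticHelper`.

```java
/** Sketch of the 64-bit layout: [status:2][failures:21][responses:21][requests:20]. */
final class PackedStats {
    static final int REQ_BITS = 20;
    static final int RESP_BITS = 21;
    static final int FAIL_BITS = 21;
    static final int STATUS_BITS = 2;

    static final long REQ_MASK = (1L << REQ_BITS) - 1;
    static final long RESP_MASK = (1L << RESP_BITS) - 1;
    static final long FAIL_MASK = (1L << FAIL_BITS) - 1;

    static final int RESP_SHIFT = REQ_BITS;
    static final int FAIL_SHIFT = REQ_BITS + RESP_BITS;
    static final int STATUS_SHIFT = Long.SIZE - STATUS_BITS;

    private PackedStats() {}

    /** Packs the four fields into one long; values wider than a field are truncated. */
    static long pack(long requests, long responses, long failures, long status) {
        return (requests & REQ_MASK)
                | ((responses & RESP_MASK) << RESP_SHIFT)
                | ((failures & FAIL_MASK) << FAIL_SHIFT)
                | (status << STATUS_SHIFT);
    }

    static long requests(long packed)  { return packed & REQ_MASK; }
    static long responses(long packed) { return (packed >>> RESP_SHIFT) & RESP_MASK; }
    static long failures(long packed)  { return (packed >>> FAIL_SHIFT) & FAIL_MASK; }
    static long status(long packed)    { return packed >>> STATUS_SHIFT; }
}
```

Because each field is masked on write, a request count of 2^20 wraps to 0 rather than spilling into the neighboring field, which is the overflow case mentioned above.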
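Cache-line false sharing also needs attention. Most operations use the cursor to locate the statistic for the current time unit, so this variable is read very frequently. If a change to neighboring memory invalidates the CPU cache line that holds the cursor, performance suffers, so I added some byte padding around the cursor variable to keep reads of it fast. I checked, and the cache lines on our machines are all 64 bytes, so for now the padding is 64 bytes.
The design can be illustrated by the figure below: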
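The timer thread does the following:
- If the breaker is closed, check whether the current window statistics exceed the failure-rate threshold; if so, update the breaker state to open;
- If the breaker is open, update the countdown; once it has timed out, update the breaker state to half-open;
- If the breaker is half-open, update the countdown; once it has timed out, update the breaker state according to the failure rate. If there are no samples yet, stay half-open;
- Update the window statistic by subtracting the statistic of the unit just after the cursor (WindowStatistic-CircularBuffer(cursor+1));
- Move the cursor to the next unit.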
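The business threads simply report request and response events and act on the breaker's feedback. For a request event the logic is:
- Read the breaker state from the window statistic. If it is closed, accept the request; if it is open, reject it outright; if it is half-open, first check whether traffic is over the limit, rejecting if so and accepting otherwise;
- If the breaker accepts the request, update the window statistic and the statistic for the current time unit.
For a response event, just update the window statistic and the statistic for the current time unit.
From the duties of these two kinds of threads it follows that updates to the current-time-unit statistic and to the aggregated window statistic can conflict concurrently. Since the two values are stored in an AtomicLongArray and an AtomicLong respectively, they can be updated efficiently with CAS.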
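The CAS update pattern can be sketched in isolation as below. This is a simplified illustration, not the demo's code: the class `CasStats` is hypothetical, it reuses the demo's bit positions (responses at bit 20, failed responses at bit 41), and it assumes the counters never overflow their fields, so it adds to the packed value directly instead of masking each field.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of a lock-free update of two counters packed into one AtomicLong. */
final class CasStats {
    static final int RESP_SHIFT = 20;   // total responses start at bit 20
    static final int FAIL_SHIFT = 41;   // failed responses start at bit 41
    static final long FIELD_MASK = (1L << 21) - 1;

    private CasStats() {}

    /** Atomically bumps the response counter and, on failure, the failure counter. */
    static void recordResponse(AtomicLong stat, boolean success) {
        long old;
        long updated;
        do {
            old = stat.get();
            updated = old + (1L << RESP_SHIFT);
            if (!success) {
                updated += 1L << FAIL_SHIFT;
            }
            // If another thread changed the value in between, compareAndSet fails and we retry.
        } while (!stat.compareAndSet(old, updated));
    }

    static long responses(long packed) { return (packed >>> RESP_SHIFT) & FIELD_MASK; }
    static long failures(long packed)  { return (packed >>> FAIL_SHIFT) & FIELD_MASK; }
}
```

Both counters change in a single compare-and-set, so a reader never sees a failure counted without its matching response.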
4 Demo code:
CircuitBreaker.java
package foo;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;
class Log {
    public static void print(String str) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
        String date = format.format(new Date());
        synchronized (Log.class) {
            System.out.println("###" + Thread.currentThread().getId() + " " + date + ": " + str);
        }
    }
}
class StatisticHelper {
    /* Bit layout of the packed long, from low to high:
     * total requests (20) | total responses (21) | failure responses (21) | status (2) */
    private final static long TOTAL_REQUEST_FIELD_BITS = 20;
    private final static long TOTAL_RESPONSE_FIELD_BITS = 21;
    private final static long FAILURE_RESPONSE_FIELD_BITS = 21;
    private final static long STATUS_FIELD_BITS = 2;
    private final static long TOTAL_REQUEST_FIELD_MASK = (1L << TOTAL_REQUEST_FIELD_BITS) - 1;
    private final static long TOTAL_RESPONSE_FIELD_MASK = (1L << TOTAL_RESPONSE_FIELD_BITS) - 1;
    private final static long FAILURE_RESPONSE_FIELD_MASK = (1L << FAILURE_RESPONSE_FIELD_BITS) - 1;
    private final static long TOTAL_REQUEST_FIELD_UNMASK = ~TOTAL_REQUEST_FIELD_MASK;
    private final static long TOTAL_RESPONSE_FIELD_UNMASK = ~(TOTAL_RESPONSE_FIELD_MASK << TOTAL_REQUEST_FIELD_BITS);
    private final static long FAILURE_RESPONSE_FIELD_UNMASK = ~(FAILURE_RESPONSE_FIELD_MASK << (TOTAL_REQUEST_FIELD_BITS + TOTAL_RESPONSE_FIELD_BITS));

    static long getTotalRequest(long statistic) {
        return TOTAL_REQUEST_FIELD_MASK & statistic;
    }

    static long setTotalRequest(long statistic, long totalRequest) {
        return (TOTAL_REQUEST_FIELD_UNMASK & statistic) | (TOTAL_REQUEST_FIELD_MASK & totalRequest);
    }

    static long getTotalResponse(long statistic) {
        return TOTAL_RESPONSE_FIELD_MASK & (statistic >>> TOTAL_REQUEST_FIELD_BITS);
    }

    static long setTotalResponse(long statistic, long totalResponse) {
        return (TOTAL_RESPONSE_FIELD_UNMASK & statistic) | ((totalResponse & TOTAL_RESPONSE_FIELD_MASK) << TOTAL_REQUEST_FIELD_BITS);
    }

    static long getFailureResponse(long statistic) {
        return FAILURE_RESPONSE_FIELD_MASK & (statistic >>> (TOTAL_REQUEST_FIELD_BITS + TOTAL_RESPONSE_FIELD_BITS));
    }

    static long setFailureResponse(long statistic, long failureResponse) {
        return (FAILURE_RESPONSE_FIELD_UNMASK & statistic) | ((failureResponse & FAILURE_RESPONSE_FIELD_MASK) << (TOTAL_REQUEST_FIELD_BITS + TOTAL_RESPONSE_FIELD_BITS));
    }

    static long getStatus(long statistic) {
        return statistic >>> (Long.SIZE - STATUS_FIELD_BITS);
    }

    static long setStatus(long statistic, long status) {
        return ((statistic << STATUS_FIELD_BITS) >>> STATUS_FIELD_BITS) | (status << (Long.SIZE - STATUS_FIELD_BITS));
    }

    /* Retry with CAS until the status bits are written without losing a concurrent counter update */
    static void casUpdateStatus(AtomicLong statistic, long status) {
        boolean isOk;
        long value;
        do {
            value = statistic.get();
            long actureValue = setStatus(value, status);
            isOk = statistic.compareAndSet(value, actureValue);
        }
        while (!isOk);
    }

    /* Formats one packed value as [status-]failures-responses-requests */
    static String toString(boolean showStatus, long statistic) {
        String statusStr = "";
        if (showStatus) {
            int status = (int) getStatus(statistic);
            switch (status) {
                case (int) CircuitBreaker.STATUS_CLOSE:
                    statusStr = "CLOSE";
                    break;
                case (int) CircuitBreaker.STATUS_OPEN:
                    statusStr = "OPEN";
                    break;
                case (int) CircuitBreaker.STATUS_HALF_OPEN:
                    statusStr = "HAOP";
                    break;
            }
        }
        long failureResponse = getFailureResponse(statistic);
        long totalResponse = getTotalResponse(statistic);
        long totalRequest = getTotalRequest(statistic);
        StringBuilder sb = new StringBuilder();
        if (showStatus) sb.append(statusStr).append("-");
        sb.append(failureResponse).append("-");
        sb.append(totalResponse).append("-");
        sb.append(totalRequest);
        return sb.toString();
    }

    /* Formats the whole window, newest bucket first */
    static String toString(AtomicLongArray slidingWindow, int cursor) {
        StringBuilder sb = new StringBuilder();
        int len = slidingWindow.length();
        String prefix = "";
        for (int i = 0; i < len; i++) {
            long value = slidingWindow.get(cursor >= i ? (cursor - i) : (cursor - i + len));
            sb.append(prefix).append(i).append(")").append(toString(false, value));
            prefix = "\n";
        }
        return sb.toString();
    }
}
class SlidingWindowTask implements Runnable {
    private CircuitBreaker circuitBreaker;
    private int openDuration = 0;
    private int halfOpenDuration = 0;

    protected void slide() {
        int cursor = circuitBreaker.cursor;
        int windowSize = circuitBreaker.slidingWindow.length();
        int nextCursor = (cursor + 1) % windowSize;
        long nextValue = circuitBreaker.slidingWindow.get(nextCursor);
        long nextTotalResponse = StatisticHelper.getTotalResponse(nextValue);
        long nextFailureResponse = StatisticHelper.getFailureResponse(nextValue);
        /* The request count only covers the half-open window, so it expires from that window's tail */
        int tailHalfOpenCursor = cursor + 1 - circuitBreaker.halfOpenDuration;
        if (tailHalfOpenCursor < 0) tailHalfOpenCursor = tailHalfOpenCursor + windowSize;
        long tailHalfOpenValue = circuitBreaker.slidingWindow.get(tailHalfOpenCursor);
        long nextTotalRequest = StatisticHelper.getTotalRequest(tailHalfOpenValue);
        boolean isOk;
        do {
            long value = circuitBreaker.statistic.get();
            long totalRequest = StatisticHelper.getTotalRequest(value);
            long totalResponse = StatisticHelper.getTotalResponse(value);
            long failureResponse = StatisticHelper.getFailureResponse(value);
            long actureValue = StatisticHelper.setTotalRequest(value, totalRequest - nextTotalRequest);
            actureValue = StatisticHelper.setTotalResponse(actureValue, totalResponse - nextTotalResponse);
            actureValue = StatisticHelper.setFailureResponse(actureValue, failureResponse - nextFailureResponse);
            isOk = circuitBreaker.statistic.compareAndSet(value, actureValue);
        }
        while (!isOk);
        /* Clear the next bucket and point the cursor at it; this is the window sliding */
        circuitBreaker.slidingWindow.set(nextCursor, 0);
        circuitBreaker.cursor = nextCursor;
    }

    public SlidingWindowTask(final CircuitBreaker circuitBreaker) {
        this.circuitBreaker = circuitBreaker;
    }

    @Override
    public void run() {
        long value = circuitBreaker.statistic.get();
        long totalResponse = StatisticHelper.getTotalResponse(value);
        long failureResponse = StatisticHelper.getFailureResponse(value);
        long status = StatisticHelper.getStatus(value);
        /* Closed: check whether the failure threshold is exceeded; if so, open the breaker */
        if (status == CircuitBreaker.STATUS_CLOSE) {
            /* Compute the rate only when there are responses: dividing by zero would throw,
             * and an exception thrown from a scheduleAtFixedRate task cancels all later runs */
            if (totalResponse > 0 && totalResponse >= circuitBreaker.minNumber) {
                long failureRate = failureResponse * 100 / totalResponse;
                if (failureRate > circuitBreaker.failureRateThreshold) {
                    Log.print("(CLOSE) exceed failure rate " + failureRate + "/" + circuitBreaker.failureRateThreshold);
                    StatisticHelper.casUpdateStatus(circuitBreaker.statistic, CircuitBreaker.STATUS_OPEN);
                    openDuration = circuitBreaker.openDuration;
                }
            }
        }
        /* Open: check whether the open period has timed out; if so, go half-open */
        else if (status == CircuitBreaker.STATUS_OPEN) {
            if (--openDuration <= 0) {
                Log.print("(OPEN) timeout 0/" + circuitBreaker.openDuration);
                StatisticHelper.casUpdateStatus(circuitBreaker.statistic, CircuitBreaker.STATUS_HALF_OPEN);
                halfOpenDuration = circuitBreaker.halfOpenDuration;
            }
        }
        /* Half-open: check whether the failure threshold is exceeded; reopen if so, otherwise close.
         * If no responses arrived during the half-open period, the rate cannot be judged,
         * so do nothing and stay half-open.
         */
        else {
            if (--halfOpenDuration <= 0) {
                long totalResponseSum = 0;
                long failureResponseSum = 0;
                int cursor = circuitBreaker.cursor;
                int slidingWindowSize = circuitBreaker.slidingWindow.length();
                for (int i = 0; i < circuitBreaker.halfOpenDuration; i++) {
                    long bucketValue = circuitBreaker.slidingWindow.get(cursor >= i ? (cursor - i) : (cursor - i + slidingWindowSize));
                    totalResponseSum += StatisticHelper.getTotalResponse(bucketValue);
                    failureResponseSum += StatisticHelper.getFailureResponse(bucketValue);
                }
                /* The threshold can only be checked when the half-open period saw at least one response */
                if (totalResponseSum > 0) {
                    long failureRate = failureResponseSum * 100 / totalResponseSum;
                    /* At or above the threshold: open the breaker and reset the open duration */
                    if (failureRate >= circuitBreaker.failureRateThreshold) {
                        Log.print("(HAOP) exceed failure rate " + failureRate + "/" + circuitBreaker.failureRateThreshold);
                        StatisticHelper.casUpdateStatus(circuitBreaker.statistic, CircuitBreaker.STATUS_OPEN);
                        openDuration = circuitBreaker.openDuration;
                    }
                    /* Below the threshold: close the breaker */
                    else {
                        Log.print("(HAOP) below failure rate " + failureRate + "/" + circuitBreaker.failureRateThreshold);
                        StatisticHelper.casUpdateStatus(circuitBreaker.statistic, CircuitBreaker.STATUS_CLOSE);
                    }
                }
            }
        }
        slide();
        Log.print(StatisticHelper.toString(true, circuitBreaker.statistic.get()) + "\n" + StatisticHelper.toString(circuitBreaker.slidingWindow, circuitBreaker.cursor));
    }
}
class CircuitBreakerScheduler {
    private static int count = 0;
    private ScheduledExecutorService scheduledExecutorService;
    private int bucketDuration;

    public CircuitBreakerScheduler(int poolSize, int bucketDuration) {
        scheduledExecutorService = Executors.newScheduledThreadPool(poolSize, (runnable) -> {
            Thread t = new Thread(runnable, "scheduler-" + count++);
            t.setDaemon(true);
            return t;
        });
        this.bucketDuration = bucketDuration;
    }

    /* Runs the sliding-window task for this breaker once per bucketDuration seconds */
    public void registry(final CircuitBreaker circuitBreaker) {
        scheduledExecutorService.scheduleAtFixedRate(new SlidingWindowTask(circuitBreaker), bucketDuration, bucketDuration, TimeUnit.SECONDS);
    }
}
public class CircuitBreaker {
    /* Breaker state constants */
    final static long STATUS_CLOSE = 0b00;
    final static long STATUS_OPEN = 0b10;
    final static long STATUS_HALF_OPEN = 0b11;
    /* Request results */
    final static int SUCCESS = 0; /* request admitted */
    final static int FAILURE_CIRCUIT_BREAKER_OPENED = 1; /* breaker open, request rejected */
    final static int FAILURE_CIRCUIT_BREAKER_HALF_OPENED = 2; /* breaker half-open, request throttled and rejected */
    /* Configuration */
    int slidingWindowSize; /* sliding window size */
    int minNumber; /* minimum sample count; the breaker is only checked for opening once the window holds at least this many samples */
    int failureRateThreshold; /* failure-rate threshold (%) */
    int openDuration; /* how long the breaker stays open (seconds) */
    int halfOpenDuration; /* how long the breaker stays half-open (seconds) */
    int haflOpenMaxNumber; /* maximum number of requests during the half-open period */
    /* Core data */
    AtomicLong statistic; /* aggregated statistics for the window */
    AtomicLongArray slidingWindow; /* sliding window, one statistic per time unit */
    /* 64-byte cache line padding. Note: the JVM may reorder fields, so this layout is best-effort only */
    private long p1, p2, p3, p4, p5, p6, p7;
    private int p0;
    volatile int cursor; /* index of the window bucket for the current time */
    private long p8, p9, p10, p11, p12, p13, p14; /* 64-byte cache line padding */

    public CircuitBreaker(int slidingWindowSize, int minNumber, int failureRateThreshold, int openDuration, int halfOpenDuration, int haflOpenMaxNumber) {
        this.slidingWindowSize = slidingWindowSize;
        this.minNumber = minNumber;
        this.failureRateThreshold = failureRateThreshold;
        this.openDuration = openDuration;
        this.haflOpenMaxNumber = haflOpenMaxNumber;
        /* The half-open duration must not exceed the window size */
        this.halfOpenDuration = halfOpenDuration < slidingWindowSize ? halfOpenDuration : slidingWindowSize;
        this.statistic = new AtomicLong();
        this.slidingWindow = new AtomicLongArray(slidingWindowSize);
        this.cursor = 0;
    }

    public int onRequest() {
        boolean isOk;
        /* Update the aggregated window statistics */
        do {
            long value = statistic.get();
            long status = StatisticHelper.getStatus(value);
            /* Closed: update the aggregated statistics */
            if (status == STATUS_CLOSE) {
                long actureTotalRequest = StatisticHelper.getTotalRequest(value) + 1;
                long actureValue = StatisticHelper.setTotalRequest(value, actureTotalRequest);
                isOk = statistic.compareAndSet(value, actureValue);
            }
            /* Open: reject the request */
            else if (status == STATUS_OPEN) {
                return FAILURE_CIRCUIT_BREAKER_OPENED;
            }
            /* Half-open: check the traffic limit; if not exceeded, update the aggregated statistics */
            else {
                long actureTotalRequest = StatisticHelper.getTotalRequest(value) + 1;
                /* Traffic over the limit */
                if (actureTotalRequest > haflOpenMaxNumber) return FAILURE_CIRCUIT_BREAKER_HALF_OPENED;
                else {
                    long actureValue = StatisticHelper.setTotalRequest(value, actureTotalRequest);
                    isOk = statistic.compareAndSet(value, actureValue);
                }
            }
        }
        while (!isOk);
        /* Update the sliding-window bucket */
        do {
            long value = slidingWindow.get(cursor);
            long totalRequest = StatisticHelper.getTotalRequest(value);
            long actureValue = StatisticHelper.setTotalRequest(value, totalRequest + 1);
            isOk = slidingWindow.compareAndSet(cursor, value, actureValue);
        }
        while (!isOk);
        return SUCCESS;
    }

    public void onResponse(boolean isSuccess) {
        Log.print("RESPONSE: " + isSuccess);
        boolean isOk;
        /* Update the aggregated window statistics */
        do {
            long value = statistic.get();
            long actureTotalResponse = StatisticHelper.getTotalResponse(value);
            long actureValue = StatisticHelper.setTotalResponse(value, actureTotalResponse + 1);
            if (!isSuccess) {
                long actureFailureResponse = StatisticHelper.getFailureResponse(value);
                actureValue = StatisticHelper.setFailureResponse(actureValue, actureFailureResponse + 1);
            }
            isOk = statistic.compareAndSet(value, actureValue);
        }
        while (!isOk);
        /* Update the sliding-window bucket */
        do {
            long value = slidingWindow.get(cursor);
            long totalResponse = StatisticHelper.getTotalResponse(value);
            long actureValue = StatisticHelper.setTotalResponse(value, totalResponse + 1);
            if (!isSuccess) {
                long failureResponse = StatisticHelper.getFailureResponse(value);
                actureValue = StatisticHelper.setFailureResponse(actureValue, failureResponse + 1);
            }
            isOk = slidingWindow.compareAndSet(cursor, value, actureValue);
        }
        while (!isOk);
    }
}
App.java
package foo;

public class App {
    public static void main(String[] args) throws Exception {
        /* One scheduler thread, one-second buckets */
        CircuitBreakerScheduler scheduler = new CircuitBreakerScheduler(1, 1);
        /* Window of 3, min 1 sample, 50% threshold, open 3s, half-open 3s, at most 1 half-open request */
        CircuitBreaker circuitBreaker = new CircuitBreaker(3, 1, 50, 3, 3, 1);
        scheduler.registry(circuitBreaker);
        /* t1 issues one request per second */
        Thread t1 = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    int result = circuitBreaker.onRequest();
                    Log.print("REQUEST: " + result);
                    Thread.sleep(1000);
                }
            } catch (Exception ignore) {
            }
        });
        /* t2 reports one response per second, alternating success and failure */
        Thread t2 = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    circuitBreaker.onResponse(i % 2 == 0);
                    Thread.sleep(1000);
                }
            } catch (Exception ignore) {
            }
        });
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
Finally, I hope this module goes live smoothly next month. ٩(●̮̃•)۶