Using shuffle sharding to improve fault tolerance

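I have recently been reading about Kubernetes API Priority and Fairness (APF). It uses shuffle sharding to choose the queue that handles each request, which keeps high-throughput flows from crowding out low-throughput flows and driving up their request latency.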

Introduction

Let's first look at what shuffle sharding is. The following material comes from the AWS article Workload isolation using shuffle-sharding.

We start with how ordinary sharding gives a system scalability and resilience.

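Suppose we have a horizontally scalable system or service with 8 workers. The red lines in the figure below represent requests reaching those workers; a worker can be a service, a queue, a database, and so on.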

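Without any sharding, every worker must be able to handle every request. This is efficient and provides some redundancy: if one worker fails, its work can be redistributed to the remaining 7 workers, although some extra system capacity may be needed. However, a sudden surge of requests, such as a DDoS attack, can cause a cascading failure. The following two figures show how the failure escalates.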

The failure first hits one worker, then cascades to the other workers, and finally the whole service becomes unavailable.

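To keep a failure from spreading, a common approach is sharding, for example splitting the workers into 4 shards, which trades efficiency for a smaller scope of impact. The following two figures show how sharding limits the impact of a DDoS attack.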

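In this example each shard contains 2 workers, and resources (such as domain names) are partitioned across the shards. The system still has redundancy, but because each shard has only 2 workers, extra capacity may be needed to avoid failures.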

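This reduces the scope of impact of a failure. With 4 shards, a failing shard only affects the services on that shard and the other shards are unaffected, so the scope of impact is 25%. Shuffle sharding can do better.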
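As a minimal sketch of the ordinary sharding described above, the following Go program maps a resource key (a domain name here) onto one of 4 shards of 2 workers each. The FNV-1a hash, the names `shardFor` and `workersOf`, and the contiguous worker layout are assumptions made for illustration; the AWS article does not prescribe them.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	numWorkers      = 8
	workersPerShard = 2
	numShards       = numWorkers / workersPerShard
)

// shardFor maps a resource key to a shard index in [0, numShards).
func shardFor(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % numShards)
}

// workersOf lists the 0-based worker indices that belong to a shard.
func workersOf(shard int) []int {
	ws := make([]int, 0, workersPerShard)
	for i := 0; i < workersPerShard; i++ {
		ws = append(ws, shard*workersPerShard+i)
	}
	return ws
}

func main() {
	for _, domain := range []string{"example.com", "example.org", "example.net"} {
		s := shardFor(domain)
		fmt.Printf("%s -> shard %d, workers %v\n", domain, s, workersOf(s))
	}
}
```

Every key that lands on the same shard shares the same 2 workers, which is why a single bad key can take out 25% of the keys in this layout.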

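Shuffle sharding introduces the concept of a virtual shard (shuffle shard). Instead of partitioning the workers directly, we shard per "customer", aiming to spread customers across different workers as much as possible.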

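The shuffle sharding layout in the figure below contains 8 workers and 8 customers, with 2 workers assigned to each customer. Take the customers represented by the rainbow and the rose as examples.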

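Here the rainbow customer is assigned the 1st and the 4th worker, and these two workers form that customer's shuffle shard. Other customers use different virtual shards (each with 2 workers); the rose customer, for example, is assigned the 1st and the last worker.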

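If worker 1 and worker 4, the workers assigned to the rainbow customer, run into a problem (such as a poisonous request or a flood of requests), the problem affects this virtual shard but does not fully affect other shuffle shards. In fact, another shuffle shard is fully affected only if it happens to be placed on both worker 1 and worker 4; otherwise it loses at most one of its workers. If the callers are fault tolerant, they can keep being served by the remaining worker of their shard.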

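In other words, when the workers serving the rainbow customer cannot provide service because of a problem or an attack, the other workers are not affected. For customers this means that although the rose customer and the sunflower customer each share a worker with the rainbow customer, their service is not interrupted: the rose customer can still be served by worker 8, and the sunflower customer can still be served by worker 6.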
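The scenario above can be replayed with a small Go sketch. The rainbow and rose assignments come from the text; the sunflower pair (4, 6) is inferred from the statement that sunflower shares a worker with rainbow and keeps worker 6, so treat it as an assumption. The helper name `healthyWorkers` is hypothetical.

```go
package main

import "fmt"

// healthyWorkers returns the workers of a shuffle shard that have not failed.
func healthyWorkers(shard []int, failed map[int]bool) []int {
	var ok []int
	for _, w := range shard {
		if !failed[w] {
			ok = append(ok, w)
		}
	}
	return ok
}

func main() {
	// Shuffle shards from the example, with workers numbered 1..8.
	shards := map[string][]int{
		"rainbow":   {1, 4},
		"rose":      {1, 8},
		"sunflower": {4, 6}, // inferred, see the note above
	}
	// Rainbow's traffic takes down both of its workers.
	failed := map[int]bool{1: true, 4: true}

	for _, name := range []string{"rainbow", "rose", "sunflower"} {
		fmt.Printf("%-9s healthy workers: %v\n", name, healthyWorkers(shards[name], failed))
	}
}
```

Only rainbow ends up with no healthy worker; rose and sunflower each keep one, which is enough as long as their clients retry against the surviving worker.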

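When this happens we lose a quarter of the workers, yet shuffle sharding greatly reduces the scope of impact. In this scenario there are 28 ways to choose 2 workers out of 8, that is, 28 possible shuffle shards. With hundreds of customers or more, each assigned one shuffle shard, the scope of impact of a problem shrinks to 1/28 of the customers, which is 7 times better than ordinary sharding (1/4).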
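The figures 28 and 7 follow from counting the 2-worker combinations of 8 workers. A short Go sketch of that arithmetic (the function name `binomial` is just an illustrative helper):

```go
package main

import "fmt"

// binomial returns C(n, k), the number of ways to choose k workers out of n.
func binomial(n, k int) int {
	if k < 0 || k > n {
		return 0
	}
	result := 1
	for i := 1; i <= k; i++ {
		// After step i the value equals C(n-k+i, i), so the division is exact.
		result = result * (n - k + i) / i
	}
	return result
}

func main() {
	shards := binomial(8, 2)
	fmt.Println("possible shuffle shards:", shards)                  // 28
	fmt.Printf("customers sharing a full shard: 1/%d\n", shards)     // 1/28
	fmt.Println("improvement over 4 ordinary shards:", shards/4)     // 7
}
```

The same function shows how quickly the number of shuffle shards grows with more workers or larger hands, for example `binomial(100, 5)`.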

Shuffle sharding in Kubernetes

Using shuffle sharding to assign queues to flows

The flow control feature in Kubernetes uses shuffle sharding. The implementation looks like this:

func NewDealer(deckSize, handSize int) (*Dealer, error) {
	if deckSize <= 0 || handSize <= 0 {
		return nil, fmt.Errorf("deckSize %d or handSize %d is not positive", deckSize, handSize)
	}
	if handSize > deckSize {
		return nil, fmt.Errorf("handSize %d is greater than deckSize %d", handSize, deckSize)
	}
	if deckSize > 1<<26 {
		return nil, fmt.Errorf("deckSize %d is impractically large", deckSize)
	}
	if RequiredEntropyBits(deckSize, handSize) > MaxHashBits {
		return nil, fmt.Errorf("required entropy bits of deckSize %d and handSize %d is greater than %d", deckSize, handSize, MaxHashBits)
	}

	return &Dealer{
		deckSize: deckSize,
		handSize: handSize,
	}, nil
}

func (d *Dealer) Deal(hashValue uint64, pick func(int)) {
	// 15 is the largest possible value of handSize
	var remainders [15]int

	// This loop derives handSize pseudo-random values from hashValue;
	// remainders[i] lies in [0, deckSize-i).
	for i := 0; i < d.handSize; i++ {
		hashValueNext := hashValue / uint64(d.deckSize-i)
		remainders[i] = int(hashValue - uint64(d.deckSize-i)*hashValueNext)
		hashValue = hashValueNext
	}

	for i := 0; i < d.handSize; i++ {
		card := remainders[i]
		for j := i; j > 0; j-- {
			if card >= remainders[j-1] {
				card++
			}
		}
		pick(card)
	}
}

func (d *Dealer) DealIntoHand(hashValue uint64, hand []int) []int {
	h := hand[:0]
	d.Deal(hashValue, func(card int) { h = append(h, card) })
	return h
}

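First call func NewDealer(deckSize, handSize int) to create an instance. In Kubernetes APF, deckSize is the number of queues and handSize is the number of queues dealt to each flow.
Then func (d *Dealer) DealIntoHand(hashValue uint64, hand []int) returns the IDs of the queues chosen for a flow. hashValue can be seen as the unique identifier of the flow, and hand is the slice that backs the result.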
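To see what the two loops of Deal do, here is a standalone sketch that repeats the same arithmetic in a plain function. The name `deal` and the slice-returning signature are assumptions for illustration; the validation done by NewDealer is left out.

```go
package main

import "fmt"

// deal mirrors the two loops of Dealer.Deal as a standalone function.
// It returns handSize distinct cards in [0, deckSize) derived from hashValue.
func deal(hashValue uint64, deckSize, handSize int) []int {
	remainders := make([]int, handSize)
	// Step 1: peel one remainder per card off the hash value.
	// remainders[i] is an index into the deck that still has deckSize-i cards.
	for i := 0; i < handSize; i++ {
		next := hashValue / uint64(deckSize-i)
		remainders[i] = int(hashValue - uint64(deckSize-i)*next)
		hashValue = next
	}
	// Step 2: translate each index back into the full deck by undoing the
	// earlier removals, latest first, which skips cards already dealt.
	hand := make([]int, 0, handSize)
	for i := 0; i < handSize; i++ {
		card := remainders[i]
		for j := i; j > 0; j-- {
			if card >= remainders[j-1] {
				card++
			}
		}
		hand = append(hand, card)
	}
	return hand
}

func main() {
	fmt.Println(deal(8238791057607451177, 128, 5)) // [41 119 0 49 67]
}
```

For this hash value the remainders are 41, 118, 0, 47 and 64; the second loop shifts them to 41, 119, 0, 49 and 67 so that no queue is picked twice.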

hashValue is computed as follows. fsName is the name of the FlowSchema, and fDistinguisher can be a user name or a namespace name:
```
func hashFlowID(fsName, fDistinguisher string) uint64 {
	hash := sha256.New()
	var sep = [1]byte{0}
	hash.Write([]byte(fsName))
	hash.Write(sep[:])
	hash.Write([]byte(fDistinguisher))
	var sum [32]byte
	hash.Sum(sum[:0])
	return binary.LittleEndian.Uint64(sum[:8])
}
```
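The following runnable sketch wraps the function above in a small program to show two properties: the hash is deterministic per flow, and the zero-byte separator keeps different (fsName, fDistinguisher) splits of the same characters apart. The FlowSchema and user names are made up for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// hashFlowID is copied from the snippet above.
func hashFlowID(fsName, fDistinguisher string) uint64 {
	hash := sha256.New()
	var sep = [1]byte{0}
	hash.Write([]byte(fsName))
	hash.Write(sep[:])
	hash.Write([]byte(fDistinguisher))
	var sum [32]byte
	hash.Sum(sum[:0])
	return binary.LittleEndian.Uint64(sum[:8])
}

func main() {
	// Hypothetical FlowSchema name and flow distinguishers.
	fmt.Println(hashFlowID("workload-low", "alice"))
	fmt.Println(hashFlowID("workload-low", "bob"))

	// Without the separator both calls would hash the bytes "abc".
	fmt.Println(hashFlowID("ab", "c") != hashFlowID("a", "bc"))
}
```

Only the first 8 bytes of the SHA-256 digest are used, which is plenty for the hash bits that the dealer consumes.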

Usage:

	var backHand [8]int
	deal, err := NewDealer(128, 5)
	if err != nil {
		panic(err)
	}
	fmt.Println(deal.DealIntoHand(8238791057607451177, backHand[:]))
	// Output: [41 119 0 49 67]

Assigning requests to queues

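The previous step assigned queues to each flow, which balances queues across flows. A single flow may have been dealt several queues, so the next step is to spread that flow's requests across the queues it was dealt. The core code is: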

func (qs *queueSet) shuffleShardLocked(hashValue uint64, descr1, descr2 interface{}) int {
	var backHand [8]int
	// Deal into a data structure, so that the order of visit below is not necessarily the order of the deal.
	// This removes bias in the case of flows with overlapping hands.
	// Get the list of queues dealt to this flow.
	hand := qs.dealer.DealIntoHand(hashValue, backHand[:])
	handSize := len(hand)
	// qs.enqueues counts the requests enqueued so far; taking it modulo handSize gives a rotating starting offset into the hand.
	offset := qs.enqueues % handSize
	qs.enqueues++
	bestQueueIdx := -1
	minQueueSeatSeconds := fqrequest.MaxSeatSeconds
	// Starting from that offset, scan the hand and pick the queue with the least pending work (in seat-seconds) as the target queue.
	for i := 0; i < handSize; i++ {
		queueIdx := hand[(offset+i)%handSize]
		queue := qs.queues[queueIdx]
		queueSum := queue.requests.QueueSum()

		// this is the total amount of work in seat-seconds for requests
		// waiting in this queue, we will select the queue with the minimum.
		thisQueueSeatSeconds := queueSum.TotalWorkSum
		klog.V(7).Infof("QS(%s): For request %#+v %#+v considering queue %d with sum: %#v and %d seats in use, nextDispatchR=%v", qs.qCfg.Name, descr1, descr2, queueIdx, queueSum, queue.seatsInUse, queue.nextDispatchR)
		if thisQueueSeatSeconds < minQueueSeatSeconds {
			minQueueSeatSeconds = thisQueueSeatSeconds
			bestQueueIdx = queueIdx
		}
	}
	...
	return bestQueueIdx
}
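The selection logic can be isolated in a simplified sketch. This is not the Kubernetes code: `pickQueue`, the `pendingWork` slice of float64 values and the sample numbers are assumptions that stand in for the queueSet internals and the seat-seconds type.

```go
package main

import "fmt"

// pickQueue returns the queue index from hand with the least pending work.
// The scan starts at offset so that ties are not always won by hand[0].
func pickQueue(hand []int, pendingWork []float64, offset int) int {
	best := -1
	minWork := 0.0
	for i := 0; i < len(hand); i++ {
		idx := hand[(offset+i)%len(hand)]
		if best == -1 || pendingWork[idx] < minWork {
			minWork = pendingWork[idx]
			best = idx
		}
	}
	return best
}

func main() {
	// Hypothetical pending work (in seat-seconds) for 6 queues.
	pendingWork := []float64{3.0, 0.5, 2.0, 0.5, 4.0, 1.0}
	hand := []int{4, 1, 3} // queues dealt to this flow

	// The work values are left unchanged here to highlight the tie-breaking.
	for enqueues := 0; enqueues < 3; enqueues++ {
		q := pickQueue(hand, pendingWork, enqueues%len(hand))
		fmt.Printf("request %d -> queue %d\n", enqueues, q)
	}
}
```

Queues 1 and 3 are tied at 0.5, so the first two requests go to queue 1 and the third, whose scan starts at queue 3, goes to queue 3. The rotating offset is what spreads tied choices across the hand.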
