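Consistent Hashing: Principles and Applications

Principles and Implementation of Consistent Hashing
My graduation project uses a Memcache service to reduce the request load on the database. Since I am the only user, a cache may look unnecessary; but from a design standpoint, a robust service is hard to imagine without one. After some searching, I found that consistent hashing is currently a popular scheme for choosing which cache server handles a key. This post summarizes what I learned so that I can apply it myself.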

All the code in this post is in a gitee repository; feel free to try it out.
https://gitee.com/marksinoberg/consistent_hash

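Concept

Encyclopedia definition

Put simply, consistent hashing is an algorithm for implementing a distributed hash table (DHT). It was designed to solve the hot spot problem on the Internet, with a motivation very similar to that of CARP. Consistent hashing fixes the problems caused by the simple hashing algorithm that CARP uses, which makes DHTs truly usable in P2P environments.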

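Criteria for evaluating a hashing algorithm

In a dynamic cache environment, the following criteria can be used to judge how good a hashing algorithm is. The descriptions are borrowed from an article found online.

  • Balance: the hash results should be spread across all caches as evenly as possible, so that all of the cache space is used. Many hashing algorithms satisfy this condition.
  • Monotonicity: suppose some content has already been assigned to caches by hashing, and new caches then join the system. The hash results should guarantee that previously assigned content is mapped either to its original cache or to one of the new caches, and never to a different cache in the old set.
  • Spread: in a distributed environment, a client may not see all of the caches, only some of them. When clients map content to caches by hashing, different clients may see different sets of caches, so their hash results disagree and the same content ends up mapped to different caches by different clients. This should clearly be avoided, because storing the same content in several caches lowers the storage efficiency of the system. Spread is defined as the severity of this situation. A good hashing algorithm should avoid such inconsistency as far as possible, that is, keep spread low.
  • Load: load looks at the spread problem from the other side. Since different clients may map the same content to different caches, a particular cache may also be assigned different content by different users. Like spread, this should be avoided, so a good hashing algorithm should keep the load on each cache as low as possible.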
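
The monotonicity criterion can be checked mechanically. The sketch below is only an illustration and is not part of the original repository; the helper name `violates_monotonicity` and the sample mappings are made up for this example. Given a key-to-cache mapping before and after new caches join, it reports every key that moved to a cache that already existed.

```python
def violates_monotonicity(before, after, new_caches):
    """Return the keys that moved between two pre-existing caches.

    before / after: dict mapping key -> cache name (hypothetical sample data).
    new_caches: set of cache names added between the two snapshots.
    """
    bad = []
    for key, old_cache in before.items():
        new_cache = after[key]
        # A move is acceptable only if the destination is a newly added cache.
        if new_cache != old_cache and new_cache not in new_caches:
            bad.append(key)
    return sorted(bad)


before = {"a": "c1", "b": "c2", "c": "c1"}
# "a" moves to the new cache c3 (allowed); "b" moves from c2 to c1 (a violation).
after = {"a": "c3", "b": "c1", "c": "c1"}
print(violates_monotonicity(before, after, {"c3"}))
```

An empty result means the change was monotone for the sampled keys.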

Shared code

Before explaining consistent hashing properly, here is the shared code used by the examples that follow.

constraints.py

#coding: utf8
servers = [
    "192.168.0.1:11211",
    "192.168.0.2:11211",
    "192.168.0.3:11211",
    "192.168.0.4:11211",
]

entries = [
    "a",
    "b",
    "c",
    "d",
    "e",
    "f",
    "g",
    "h",
    "i",
    "j",
    "k",
    "l",
    "m",
    "n",
    "o",
    "p",
    "q",
    "r",
    "s",
    "t",
    "u",
    "v",
    "w",
    "x",
    "y",
    "z",
]

func.py

#coding: utf8

from hashlib import md5

def hashcode(key=""):
    # An empty or missing key hashes to 0.
    if key is None or key == "":
        return 0
    # Interpret the MD5 digest as a 128-bit integer.
    return int(md5(str(key).encode('utf8')).hexdigest(), 16)


def print_pretty_list(ls=()):
    for item in ls:
        print(item)

def print_pretty_dict(dc=None):
    for key, value in (dc or {}).items():
        print(f'{key}: {value}')

def compute_cache_percentage(oldcache, newcache):
    # Despite the name, this returns a count per server: how many entries
    # are on the same server in both oldcache and newcache.
    result = {key: 0 for key in oldcache.keys()}
    for key, value in oldcache.items():
        if key in newcache.keys():
            result[key] = len(set(value).intersection(newcache[key]))
    return result

def get_map_list_key(maplist):
    # Flatten a list of single-entry dicts into a list of their keys.
    result = []
    for mapping in maplist:
        result.extend(mapping.keys())
    return result

def compute_cache_percentage_ring(oldcache, newcache):
    result = {key: 0 for key in oldcache.keys()}
    # Each value here is a list of single-entry dicts.
    for node, maplist in oldcache.items():
        if node in newcache.keys():
            oldkeys = get_map_list_key(maplist)
            newkeys = get_map_list_key(newcache[node])
            result[node] = len(set(oldkeys).intersection(newkeys))
    return result

def compute_cache_percentage_virtual_ring(oldcache, newcache):
    # Keys look like "server#index"; results are summed per real server.
    result = {key.split("#")[0]: 0 for key in oldcache.keys()}
    # Each value here is a list of single-entry dicts.
    for node, maplist in oldcache.items():
        if node in newcache.keys():
            oldkeys = get_map_list_key(maplist)
            newkeys = get_map_list_key(newcache[node])
            temp = str(node).split("#")[0]
            result[temp] += len(set(oldkeys).intersection(newkeys))
    return result

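Hard hashing

Hard hashing is also commonly called ordinary hashing. The principle is simple and there are many ways to implement it, for example taking the hash modulo the number of servers. It satisfies balance well, but not monotonicity: changing the number of servers remaps most keys.

Hard hashing suits cache deployments that rarely change, and it demands high stability from the cache cluster. Once a server in the cluster fails, most keys map to a different server, which can trigger a sudden flood of cache misses and bring the cache service down almost instantly. That may be hard to picture, so here is a down-to-earth example.

Zhang Dapang maintains the company's cache cluster. He currently has 10 machines, each at 95% memory usage, and the boss is too stingy to buy new ones. One day an intern, Xiao Li, wrote a log-analysis script and ran it directly, without considering how tight memory already was. The third server hit a severe Out Of Memory problem and died, losing all of its cached data. The original 10 servers became 9, and with modulo hashing most keys now mapped to a different server, so content that used to hit the cache suddenly missed. The caches had to be repopulated and the load spiked at once. Not only did the cache servers go down; the backend MySQL server could not withstand that volume of requests either, and the company's entire line of business was paralyzed.

A simple hard-hash implementation, hard-hash.py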
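
To put a number on the story above, here is a small sketch, not taken from the original repository. It assumes integer keys that are used directly as their own hash values and hypothetical server names `cache-0` to `cache-9`, then counts how many keys still land on the same server after the third machine is removed under modulo placement.

```python
def modulo_assign(keys, servers):
    """Place each integer key on servers[key % len(servers)]."""
    return {k: servers[k % len(servers)] for k in keys}


# Hypothetical cluster of 10 cache servers.
servers = [f"cache-{i}" for i in range(10)]
keys = range(9000)

before = modulo_assign(keys, servers)
# The third server dies; the remaining nine keep their relative order.
survivors = servers[:2] + servers[3:]
after = modulo_assign(keys, survivors)

unchanged = sum(1 for k in keys if before[k] == after[k])
print(unchanged, "of", len(keys), "keys stay on the same server")
```

With these particular keys only one key in ten keeps its server, even though only one server in ten was lost.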

# coding: utf8
from constraints import servers, entries
from func import hashcode, print_pretty_dict, compute_cache_percentage


def distribute(servers, entries):
    """Assign each entry to servers[hashcode(entry) % len(servers)]."""
    details = {server: [] for server in servers}
    for entry in entries:
        server = servers[hashcode(entry) % len(servers)]
        details[server].append(entry)
    return details


# Baseline: the original four servers.
details1 = distribute(servers, entries)
print_pretty_dict(details1)

print("---"*28)
# Remove the first server.
details2 = distribute(servers[1:], entries)
print_pretty_dict(details2)

print("---"*28)
# Restore all four servers and add a fifth.
details3 = distribute(servers + ["192.168.0.5:11211"], entries)
print_pretty_dict(details3)

# Count the entries that stay on the same server.
print("==="*7)
print_pretty_dict(compute_cache_percentage(details1, details2))
print("---"*20)
print_pretty_dict(compute_cache_percentage(details1, details3))

Hard hashing cache results

python hard-hash.py
192.168.0.1:11211: ['o', 's', 'u', 'w']
192.168.0.2:11211: ['a', 'd', 'g', 'h', 'i', 'j', 'n', 'q', 'r', 'y']
192.168.0.3:11211: ['e', 'p', 't', 'v', 'x']
192.168.0.4:11211: ['b', 'c', 'f', 'k', 'l', 'm', 'z']
------------------------------------------------------------------------------------
192.168.0.2:11211: ['d', 'k', 'l', 'm', 's', 't', 'v', 'x']
192.168.0.3:11211: ['a', 'b', 'c', 'e', 'h', 'i', 'n', 'o', 'p', 'r', 'u', 'w', 'y']
192.168.0.4:11211: ['f', 'g', 'j', 'q', 'z']
------------------------------------------------------------------------------------
192.168.0.1:11211: ['j', 'n', 't']
192.168.0.2:11211: ['e', 'm', 'p', 'q', 'w']
192.168.0.3:11211: ['a', 'i', 'l']
192.168.0.4:11211: ['b', 'c', 'd', 'g', 'h', 'k', 'o', 's', 'v', 'x', 'z']
192.168.0.5:11211: ['f', 'r', 'u', 'y']
=====================
192.168.0.1:11211: 0
192.168.0.2:11211: 1
192.168.0.3:11211: 2
192.168.0.4:11211: 2
------------------------------------------------------------
192.168.0.1:11211: 0
192.168.0.2:11211: 1
192.168.0.3:11211: 0
192.168.0.4:11211: 4

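As the counts show, hard hashing has a low hit rate in a dynamic cache environment: only 5 of the 26 keys stay on their original server, both after removing a server and after adding one.

Consistent hashing

The basic idea of consistent hashing is to map both the machine nodes and the keys, using the same hash function, onto a ring covering 0 to 2^32 (it does not have to be 2^32; in principle any "ring" that lets the nodes spread out evenly will do). When a cache write arrives, compute the hash Hash(k) of its key k. If that value coincides with the hash of some machine node, write to that node directly; otherwise search clockwise for the next node and write there. If the search passes 2^32 without finding a node, continue from 0 (because the structure is a ring). See the figure below, borrowed from elsewhere.
Figure: how simple consistent hashing works

A simple implementation, consisthash.py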
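
The clockwise search described above can be sketched with a binary search over the sorted node positions. This is a minimal illustration under assumptions, not the repository's code: the function name `successor` and the three sample positions are hypothetical, and positions are plain integers rather than real hash values.

```python
from bisect import bisect_left

RING_SIZE = 2 ** 32  # the ring size used in the description above


def successor(ring, point):
    """Return the first node position at or after `point`, wrapping to the start.

    ring: sorted list of node positions in [0, RING_SIZE).
    """
    idx = bisect_left(ring, point % RING_SIZE)
    if idx == len(ring):
        # Past the last node: continue from 0, because the structure is a ring.
        idx = 0
    return ring[idx]


ring = [10, 50, 200]          # hypothetical node positions
print(successor(ring, 10))    # the key lands exactly on a node
print(successor(ring, 11))    # the next node clockwise
print(successor(ring, 201))   # beyond the last node, so the search wraps around
```

Note that `getNode` in the consisthash.py listing scans the ring from the high end for the nearest node hash at or below the key, which is the mirror image of this walk.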

# coding: utf8
# Simple consistent hashing implementation

from constraints import servers, entries
from func import print_pretty_dict, hashcode, compute_cache_percentage, compute_cache_percentage_ring


class ConsistHash(object):
    """
    Simple consistent hashing implementation.
    """
    def __init__(self, servers=None):
        self.servers = servers or []
        # Sorted list of the server nodes' hash codes.
        self.ring = []
        # hashcode -> node
        self.hashnodemap = {}
        for server in self.servers:
            self.addNode(server)

    def addNode(self, node):
        code = hashcode(node)
        self.hashnodemap[code] = node
        self.ring.append(code)
        self.ring.sort()


    def removeNode(self, node):
        del self.hashnodemap[hashcode(node)]
        self.ring.remove(hashcode(node))

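    def getNode(self, key):
        code = hashcode(key)
        # Scan from the largest node hash downwards and take the first one at
        # or below the key's hash; if there is none, fall back to the smallest node.
        for ringitem in self.ring[::-1]:
            if ringitem <= code:
                return self.hashnodemap[ringitem]
        return self.hashnodemap[self.ring[0]]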


class Cacher(object):
    """
    A cache built on the simple consistent hashing ring.
    """
    def __init__(self, servers):
        self.c = ConsistHash(servers=servers)
        self.container = {key:[] for key in servers}

    def addServer(self, server):
        self.c.addNode(server)
        self.container[server] = []

    def removeServer(self, server):
        self.c.removeNode(server)
        del self.container[server]

    def cache(self, key, value):
        server = self.c.getNode(key)
        self.container[server].append({key: value})


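    def get(self, key):
        server = self.c.getNode(key)
        # Each server holds a list of single-entry dicts; find the matching one.
        for item in self.container[server]:
            if key in item:
                return item[key]
        return None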


if __name__ == "__main__":
    # c = ConsistHash(servers=servers)
    # print_pretty_list(c.ring)
    # print_pretty_dict(c.hashnodemap)
    cacher1 = Cacher(servers)
    for entry in entries:
        cacher1.cache(entry, entry)
    print_pretty_dict(cacher1.container)
    # Remove a server
    cacher3 = Cacher(servers)
    cacher3.removeServer("192.168.0.1:11211")
    for entry in entries:
        cacher3.cache(entry, entry)
    print_pretty_dict(cacher3.container)
    # Add a server
    cacher2 = Cacher(servers)
    cacher2.addServer("192.168.0.5:11211")
    for entry in entries:
        cacher2.cache(entry, entry)
    print_pretty_dict(cacher2.container)
    # Count the entries that stay on the same server (after adding, then after removing)
    print_pretty_dict(compute_cache_percentage_ring(cacher1.container, cacher2.container))
    print_pretty_dict(compute_cache_percentage_ring(cacher1.container, cacher3.container))

Cache results

python consisthash.py
192.168.0.1:11211: [{'a': 'a'}, {'c': 'c'}, {'h': 'h'}, {'j': 'j'}, {'l': 'l'}, {'r': 'r'}, {'s': 's'}, {'y': 'y'}]
192.168.0.2:11211: [{'b': 'b'}, {'d': 'd'}, {'f': 'f'}, {'g': 'g'}, {'i': 'i'}, {'k': 'k'}, {'m': 'm'}, {'n': 'n'}, {'p': 'p'}, {'q': 'q'}, {'u': 'u'}, {'v': 'v'}, {'x': 'x'}]
192.168.0.3:11211: []
192.168.0.4:11211: [{'e': 'e'}, {'o': 'o'}, {'t': 't'}, {'w': 'w'}, {'z': 'z'}]
192.168.0.2:11211: [{'a': 'a'}, {'b': 'b'}, {'c': 'c'}, {'d': 'd'}, {'f': 'f'}, {'g': 'g'}, {'h': 'h'}, {'i': 'i'}, {'j': 'j'}, {'k': 'k'}, {'l': 'l'}, {'m': 'm'}, {'n': 'n'}, {'p': 'p'}, {'q': 'q'}, {'r': 'r'}, {'s': 's'}, {'u': 'u'}, {'v': 'v'}, {'x': 'x'}, {'y': 'y'}]
192.168.0.3:11211: []
192.168.0.4:11211: [{'e': 'e'}, {'o': 'o'}, {'t': 't'}, {'w': 'w'}, {'z': 'z'}]
192.168.0.1:11211: []
192.168.0.2:11211: [{'b': 'b'}, {'d': 'd'}, {'f': 'f'}, {'g': 'g'}, {'i': 'i'}, {'k': 'k'}, {'m': 'm'}, {'n': 'n'}, {'p': 'p'}, {'q': 'q'}, {'u': 'u'}, {'v': 'v'}, {'x': 'x'}]
192.168.0.3:11211: []
192.168.0.4:11211: [{'e': 'e'}, {'o': 'o'}, {'t': 't'}, {'w': 'w'}, {'z': 'z'}]
192.168.0.5:11211: [{'a': 'a'}, {'c': 'c'}, {'h': 'h'}, {'j': 'j'}, {'l': 'l'}, {'r': 'r'}, {'s': 's'}, {'y': 'y'}]
192.168.0.1:11211: 0
192.168.0.2:11211: 13
192.168.0.3:11211: 0
192.168.0.4:11211: 5
192.168.0.1:11211: 0
192.168.0.2:11211: 13
192.168.0.3:11211: 0
192.168.0.4:11211: 5

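The results show that, compared with hard hashing, consistent hashing improves the hit rate considerably: 18 of the 26 keys stay on their original server in both cases, versus 5 with hard hashing.

"Virtual" consistent hashing

Although the hit rate has improved, this alone is not yet suitable for a real production environment, because the consistent hashing so far lacks balance: in the run above, 192.168.0.3 holds no keys at all while 192.168.0.2 holds 13.

To address this, the concept of virtual nodes was introduced.

A "virtual node" is a replica of a real node (machine) in the hash space. One real node corresponds to several virtual nodes, and that number is called the "replica count". Virtual nodes are placed in the hash space according to their hash values.

Figure: how consistent hashing with virtual nodes works

Consistent hashing with virtual nodes, virtualconsisthash.py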
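
As a rough sketch of the idea, the code below places several `server#i` labels per machine on the ring and records which physical server owns each position. It is an illustration under assumptions rather than the repository's code: the names `position`, `build_ring`, and `lookup` are hypothetical, and unlike the full listing it maps each position straight to the physical server instead of to the virtual node name.

```python
from bisect import bisect_left
from hashlib import md5


def position(label):
    """Hash a label to an integer ring position (MD5, as in func.py)."""
    return int(md5(label.encode("utf8")).hexdigest(), 16)


def build_ring(servers, replicas=3):
    """Return (sorted positions, dict of position -> physical server)."""
    owner = {}
    for server in servers:
        for i in range(replicas):
            # Each virtual node is the hash of a "server#i" label.
            owner[position(f"{server}#{i}")] = server
    return sorted(owner), owner


def lookup(positions, owner, key):
    """Walk clockwise from the key's position to the next virtual node."""
    idx = bisect_left(positions, position(key))
    # idx == len(positions) means we ran off the end, so wrap to index 0.
    return owner[positions[idx % len(positions)]]


servers = ["192.168.0.1:11211", "192.168.0.2:11211"]
positions, owner = build_ring(servers, replicas=3)
print(len(positions))              # virtual nodes on the ring
print(lookup(positions, owner, "a"))
```

Mapping positions directly to the physical server keeps the cache container keyed by real machines, which is one possible design; the listing keeps one bucket per virtual node instead.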

# coding: utf8
# Consistent hashing with virtual nodes

from constraints import servers, entries
from func import print_pretty_dict, hashcode, print_pretty_list
from func import compute_cache_percentage
from func import compute_cache_percentage_ring
from func import compute_cache_percentage_virtual_ring

class VirtualConsistHash(object):
    """
    Consistent hashing with virtual nodes.
    """
    def __init__(self, servers=None, replicas=3):
        self.servers = servers or []
        # Sorted list of the virtual nodes' hash codes.
        self.ring = []
        # Number of virtual nodes created for each real node.
        self.replicas = replicas
        # hashcode -> virtual node name ("server#index")
        self.hashnodemap = dict()
        for server in self.servers:
            self.addNode(server)

    def addNode(self, node):
        for i in range(0, self.replicas):
            temp = "{}#{}".format(node, i)
            code = hashcode(temp)
            self.hashnodemap[code] = temp
            self.ring.append(code)
            self.ring.sort()

    def removeNode(self, node):
        for i in range(0, self.replicas):
            temp = "{}#{}".format(node, i)
            code = hashcode(temp)
            self.ring.remove(code)
            del self.hashnodemap[code]

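    def getNode(self, key):
        code = hashcode(key)
        # Scan from the largest virtual-node hash downwards and take the first one
        # at or below the key's hash; if there is none, fall back to the smallest.
        for ringitem in self.ring[::-1]:
            if ringitem <= code:
                return self.hashnodemap[ringitem]
        return self.hashnodemap[self.ring[0]]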


class Cacher(object):
    """
    A cache built on the consistent hashing ring with virtual nodes.
    """
    def __init__(self, servers):
        self.c = VirtualConsistHash(servers=servers)
        self.container = {"{}#{}".format(server, index): [] for index in range(0, self.c.replicas) for server in self.c.servers}

    def addServer(self, server):
        self.c.addNode(server)
        for i in range(0, self.c.replicas):
            temp = "{}#{}".format(server, i)
            self.container[temp] = []

    def removeServer(self, server):
        self.c.removeNode(server)
        for i in range(0, self.c.replicas):
            temp = "{}#{}".format(server, i)
            del self.container[temp]

    def cache(self, key, value):
        server = self.c.getNode(key)
        self.container[server].append({key: value})

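    def get(self, key):
        server = self.c.getNode(key)
        # Each virtual node holds a list of single-entry dicts; find the matching one.
        for item in self.container[server]:
            if key in item:
                return item[key]
        return None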


if __name__ == "__main__":
    c = VirtualConsistHash(servers=servers)
    print_pretty_list(c.ring)
    print_pretty_dict(c.hashnodemap)
    cacher1 = Cacher(servers)
    for entry in entries:
        cacher1.cache(entry, entry)
    print_pretty_dict(cacher1.container)
    # Remove a server
    cacher3 = Cacher(servers)
    cacher3.removeServer("192.168.0.1:11211")
    for entry in entries:
        cacher3.cache(entry, entry)
    print_pretty_dict(cacher3.container)
    # Add a server
    cacher2 = Cacher(servers)
    cacher2.addServer("192.168.0.5:11211")
    for entry in entries:
        cacher2.cache(entry, entry)
    print_pretty_dict(cacher2.container)
    # Count the entries that stay on the same server.
    # cacher2 is the run with a server added, cacher3 the run with one removed.
    print("==="*19, "After adding a cache server")
    print_pretty_dict(
        compute_cache_percentage_virtual_ring(cacher1.container, cacher2.container))
    print("==="*19, "After removing a cache server")
    print_pretty_dict(
        compute_cache_percentage_virtual_ring(cacher1.container, cacher3.container))

Results

python virtualconsisthash.py
25929580212780940911456562527067013
12101104964982711566768785763136289074
74170601562041857353724622613970855161
77290231086376083997830514397772133017
108956197245253243279835718906668306846
119181851294818588345880953329601308254
148120148621525998622527044630882426909
156975434986591250703568213828815453515
166356565783230552968534833801964089480
200325646817984951237589036984080642913
314164546590207529500398448833042413158
322409963387938480044046299781174104628
74170601562041857353724622613970855161: 192.168.0.1:11211#0
148120148621525998622527044630882426909: 192.168.0.1:11211#1
77290231086376083997830514397772133017: 192.168.0.1:11211#2
12101104964982711566768785763136289074: 192.168.0.2:11211#0
166356565783230552968534833801964089480: 192.168.0.2:11211#1
25929580212780940911456562527067013: 192.168.0.2:11211#2
119181851294818588345880953329601308254: 192.168.0.3:11211#0
156975434986591250703568213828815453515: 192.168.0.3:11211#1
108956197245253243279835718906668306846: 192.168.0.3:11211#2
200325646817984951237589036984080642913: 192.168.0.4:11211#0
314164546590207529500398448833042413158: 192.168.0.4:11211#1
322409963387938480044046299781174104628: 192.168.0.4:11211#2
192.168.0.1:11211#0: []
192.168.0.2:11211#0: [{'a': 'a'}, {'h': 'h'}, {'j': 'j'}, {'l': 'l'}]
192.168.0.3:11211#0: []
192.168.0.4:11211#0: [{'e': 'e'}, {'g': 'g'}, {'o': 'o'}, {'t': 't'}, {'v': 'v'}, {'x': 'x'}]
192.168.0.1:11211#1: [{'m': 'm'}]
192.168.0.2:11211#1: [{'b': 'b'}, {'d': 'd'}, {'f': 'f'}, {'i': 'i'}, {'k': 'k'}, {'p': 'p'}]
192.168.0.3:11211#1: [{'n': 'n'}, {'q': 'q'}, {'u': 'u'}]
192.168.0.4:11211#1: [{'w': 'w'}]
192.168.0.1:11211#2: [{'c': 'c'}, {'r': 'r'}, {'y': 'y'}]
192.168.0.2:11211#2: [{'s': 's'}]
192.168.0.3:11211#2: []
192.168.0.4:11211#2: [{'z': 'z'}]
192.168.0.2:11211#0: [{'a': 'a'}, {'c': 'c'}, {'h': 'h'}, {'j': 'j'}, {'l': 'l'}, {'r': 'r'}, {'y': 'y'}]
192.168.0.3:11211#0: [{'m': 'm'}]
192.168.0.4:11211#0: [{'e': 'e'}, {'g': 'g'}, {'o': 'o'}, {'t': 't'}, {'v': 'v'}, {'x': 'x'}]
192.168.0.2:11211#1: [{'b': 'b'}, {'d': 'd'}, {'f': 'f'}, {'i': 'i'}, {'k': 'k'}, {'p': 'p'}]
192.168.0.3:11211#1: [{'n': 'n'}, {'q': 'q'}, {'u': 'u'}]
192.168.0.4:11211#1: [{'w': 'w'}]
192.168.0.2:11211#2: [{'s': 's'}]
192.168.0.3:11211#2: []
192.168.0.4:11211#2: [{'z': 'z'}]
192.168.0.1:11211#0: []
192.168.0.2:11211#0: [{'a': 'a'}]
192.168.0.3:11211#0: []
192.168.0.4:11211#0: [{'g': 'g'}, {'v': 'v'}, {'x': 'x'}]
192.168.0.1:11211#1: [{'m': 'm'}]
192.168.0.2:11211#1: [{'b': 'b'}, {'d': 'd'}, {'f': 'f'}, {'i': 'i'}, {'k': 'k'}, {'p': 'p'}]
192.168.0.3:11211#1: [{'n': 'n'}, {'q': 'q'}, {'u': 'u'}]
192.168.0.4:11211#1: [{'w': 'w'}]
192.168.0.1:11211#2: [{'c': 'c'}, {'r': 'r'}, {'y': 'y'}]
192.168.0.2:11211#2: [{'s': 's'}]
192.168.0.3:11211#2: []
192.168.0.4:11211#2: [{'z': 'z'}]
192.168.0.5:11211#0: [{'h': 'h'}, {'j': 'j'}, {'l': 'l'}]
192.168.0.5:11211#1: [{'e': 'e'}, {'o': 'o'}, {'t': 't'}]
192.168.0.5:11211#2: []
========================================================= After adding a cache server
192.168.0.1:11211: 4
192.168.0.2:11211: 8
192.168.0.3:11211: 3
192.168.0.4:11211: 5
========================================================= After removing a cache server
192.168.0.1:11211: 0
192.168.0.2:11211: 11
192.168.0.3:11211: 3
192.168.0.4:11211: 8

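The results show that consistent hashing with virtual nodes improves the hit rate further: 20 of the 26 keys stay in place after adding a server and 22 after removing one, versus 18 for the simple ring. The load is also spread more evenly, and no server is left empty.

Summary

From the analysis above, a few things stand out.

  • Hard hashing suits only a "stable" cache cluster, so it is rarely used in real production environments.

  • Simple consistent hashing has a decent hit rate but lacks balance, which can leave one node overloaded while others sit idle.

  • Consistent hashing with virtual nodes addresses both problems well, but the replicas setting (the number of virtual nodes per server) still needs to be tuned for the actual production environment to get the best result.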
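
One way to explore the replicas setting is to sweep a few values and compare the lightest and heaviest server. The sketch below is a self-contained experiment under assumptions, not a tuning recipe: the helper `load_per_server`, the synthetic `key-N` keys, and the candidate replica counts are all made up for illustration, and it walks the ring clockwise.

```python
from bisect import bisect_left
from collections import Counter
from hashlib import md5


def position(label):
    """Hash a label to an integer ring position (MD5, as in func.py)."""
    return int(md5(label.encode("utf8")).hexdigest(), 16)


def load_per_server(servers, keys, replicas):
    """Count how many keys each physical server receives."""
    owner = {position(f"{s}#{i}"): s for s in servers for i in range(replicas)}
    positions = sorted(owner)
    load = Counter({s: 0 for s in servers})  # keep servers with zero keys visible
    for key in keys:
        # Clockwise successor, wrapping to the first position at the end.
        idx = bisect_left(positions, position(key)) % len(positions)
        load[owner[positions[idx]]] += 1
    return load


servers = [f"192.168.0.{i}:11211" for i in range(1, 5)]
keys = [f"key-{n}" for n in range(10000)]  # synthetic keys, for illustration only
for replicas in (1, 3, 10, 100):
    load = load_per_server(servers, keys, replicas)
    print(replicas, min(load.values()), max(load.values()))
```

One would generally expect the gap between the two numbers to narrow as replicas grows, at the cost of a larger ring, but the exact figures depend on the hash function and the key set.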


[References]

[1]. https://blog.csdn.net/cywosp/article/details/23397179/
[2]. http://blog.huanghao.me/?p=14
