Association Rule Mining: Apriori and Its Optimizations (Python Implementation)

Association Rule Mining

Introduction

The concept of association rules was first proposed by Agrawal et al. in their 1993 paper Mining association rules between sets of items in large databases. Association rule mining (association analysis) is used to discover relationships and regularities hidden in large datasets. As the data industry grows rapidly and the datasets we face become ever larger, there is growing interest in mining the associations hidden in massive data.

Research Directions

The main research directions in association rule mining currently include:

  1. The classic method: the Apriori algorithm
  2. Serial algorithms
    · The hash-based frequent-itemset algorithm proposed by Park et al.
    · Partition-based algorithms
    · Toivonen's sampling-based association rule algorithm
    · The FP-Growth algorithm by Han et al., which generates no candidate sets
  3. Parallel and distributed algorithms
    · The CD, DD, and CaD parallel algorithms proposed by Agrawal et al.
    · The PDM algorithm proposed by Park et al.
    · The APM parallel algorithm by Cheung et al., based on the DIC idea
    · The IDD and HD algorithms, introduced as optimizations of DD
  4. Data streams
    · The FP-Stream algorithm proposed by Giannella et al.
    · The Moment algorithm by Chi et al. (based on sliding windows)
    · The Sticky Sampling and Lossy Counting algorithms proposed by Manku et al.
  5. Graphs
    · AGM and FSG (breadth-first based)
    · gSpan, FFSM, and closeGraph (FP-Growth based)
    · EDFS, an uncertain frequent-subgraph mining technique (partition-based, mixing depth-first and breadth-first search)
  6. Sequences
    · SPADE, proposed by Zaki et al.
    · PrefixSpan, based on projection
    · MEMISP, proposed by Lin et al.

The above lists some known association rule mining algorithms; it is by no means complete, just what a quick search turned up. Next, I will focus on how to implement the two classic algorithms: Apriori and FP-Growth.

The Apriori Algorithm

Theory

Core idea: every subset of a frequent itemset must itself be frequent. Conversely, if an itemset is infrequent, then all of its supersets must be infrequent.
Algorithm details: see the post 關聯規則—Apriori算法—FPTree.
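
This downward-closure property is what enables candidate pruning: before counting a k-candidate against the data, we can discard it if any of its (k-1)-subsets is missing from the previous frequent level. A minimal sketch of that pruning step (the function name and toy data are made up for illustration):

from itertools import combinations

def prune_candidates(candidates, prev_frequent):
    """Apriori pruning: keep a k-candidate only if every one of its
    (k-1)-subsets is already known to be frequent."""
    return [c for c in candidates
            if all(frozenset(s) in prev_frequent
                   for s in combinations(c, len(c) - 1))]

# Toy example: {1,3} is not frequent, so {1,2,3} can be pruned without counting it
prev_frequent = {frozenset({1, 2}), frozenset({2, 3})}
candidates = [frozenset({1, 2, 3})]
print(prune_candidates(candidates, prev_frequent))  # [] -- pruned: {1,3} is missing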

Implementation

A hand-written Apriori (ultra-condensed version)

import pandas as pd
import numpy as np
from itertools import combinations
from operator import itemgetter
from time import time
import warnings
warnings.filterwarnings("ignore")
# Load the market-basket data
dataset = pd.read_csv('retail.csv', usecols=['items'])
# Define our own Apriori algorithm
def my_aprior(data, support_count):
    """
    Apriori association rule mining
    @data: the dataset
    @support_count: minimum support count threshold for an itemset
    """
    start = time()
    # Clean the data: strip extra whitespace
    for index, row in data.iterrows():
        data.loc[index, 'items'] = row['items'].strip()
    # Find all frequent 1-itemsets
    single_items = (data['items'].str.split(" ", expand = True)).apply(pd.value_counts) \
    .sum(axis = 1).where(lambda value: value > support_count).dropna()
    print("Found all frequent 1-itemsets")
    # Create the frequent-itemset lookup table
    apriori_data = pd.DataFrame({'items': single_items.index.astype(int), 'support_count': single_items.values, 'set_size': 1})
    # Reshape the dataset
    data['set_size'] = data['items'].str.count(" ") + 1
    data['items'] = data['items'].apply(lambda row: set(map(int, row.split(" "))))
    single_items_set = set(single_items.index.astype(int))
    # Iterate over itemset sizes to find all frequent itemsets
    for length in range(2, len(single_items_set) + 1):
        data = data[data['set_size'] >= length]
        d = data['items'] \
            .apply(lambda st: pd.Series(s if set(s).issubset(st) else None for s in combinations(single_items_set, length))) \
            .apply(lambda col: [col.dropna().unique()[0], col.count()] if col.count() >= support_count else None).dropna()
        if d.empty:
            break
        # DataFrame.append was removed in pandas 2.0; use pd.concat instead
        apriori_data = pd.concat([apriori_data, pd.DataFrame(
            {'items': list(map(itemgetter(0), d.values)), 'support_count': list(map(itemgetter(1), d.values)),
             'set_size': length})], ignore_index=True)
    print("Search finished, total time: %ss" % (time() - start))
    return apriori_data

Run

my_aprior(dataset, 5000)

Result

Found all frequent 1-itemsets
Search finished, total time: 94.51256704330444s
	items			support_count	set_size
0	32				15167.0			1
1	38				15596.0			1
2	39				50675.0			1
3	41				14945.0			1
4	48				42135.0			1
5	(32, 39)		8455.0			2
6	(32, 48)		8034.0			2
7	(38, 39)		10345.0			2
8	(38, 48)		7944.0			2
9	(39, 41)		11414.0			2
10	(39, 48)		29142.0			2
11	(41, 48)		9018.0			2
12	(32, 39, 48)	5402.0			3
13	(38, 39, 48)	6102.0			3
14	(39, 41, 48)	7366.0			3

Using the apriori method from the apyori package

# Analyze with the apyori package
from apyori import apriori
dataset = pd.read_csv('retail.csv', usecols=['items'])
def create_dataset(data):
    for index, row in data.iterrows():
        data.loc[index, 'items'] = row['items'].strip()
    data = data['items'].str.split(" ", expand = True)
    # Store each transaction as a list of strings
    output = []
    for i in range(data.shape[0]):
        output.append([str(data.values[i, j]) for j in range(data.shape[1])])
    return output

dataset = create_dataset(dataset)
association_rules = apriori(dataset, min_support = 0.05, min_confidence = 0.7, min_lift = 1.2, min_length = 2)
association_result = list(association_rules)
association_result

Result (the 'None' entries below are an artifact of create_dataset: split with expand = True pads shorter transactions with None, which str() turns into the literal string 'None'):

[RelationRecord(items=frozenset({'41', '39'}), support=0.12946620993171662, ordered_statistics=[OrderedStatistic(items_base=frozenset({'41'}), items_add=frozenset({'39'}), confidence=0.7637336901973905, lift=1.3287082307880087)]),
 RelationRecord(items=frozenset({'38', '39', '48'}), support=0.06921349334180259, ordered_statistics=[OrderedStatistic(items_base=frozenset({'38', '48'}), items_add=frozenset({'39'}), confidence=0.7681268882175226, lift=1.336351311673078)]),
 RelationRecord(items=frozenset({'41', '39', '48'}), support=0.0835507361448243, ordered_statistics=[OrderedStatistic(items_base=frozenset({'41', '48'}), items_add=frozenset({'39'}), confidence=0.8168108227988469, lift=1.4210493489806006)]),
 RelationRecord(items=frozenset({'None', '41', '39'}), support=0.12946620993171662, ordered_statistics=[OrderedStatistic(items_base=frozenset({'41'}), items_add=frozenset({'None', '39'}), confidence=0.7637336901973905, lift=1.3287082307880087), OrderedStatistic(items_base=frozenset({'41', 'None'}), items_add=frozenset({'39'}), confidence=0.7637336901973905, lift=1.3287082307880087)]),
 RelationRecord(items=frozenset({'38', 'None', '39', '48'}), support=0.06921349334180259, ordered_statistics=[OrderedStatistic(items_base=frozenset({'38', '48'}), items_add=frozenset({'None', '39'}), confidence=0.7681268882175226, lift=1.336351311673078), OrderedStatistic(items_base=frozenset({'38', 'None', '48'}), items_add=frozenset({'39'}), confidence=0.7681268882175226, lift=1.336351311673078)]),
 RelationRecord(items=frozenset({'None', '41', '39', '48'}), support=0.0835507361448243, ordered_statistics=[OrderedStatistic(items_base=frozenset({'41', '48'}), items_add=frozenset({'None', '39'}), confidence=0.8168108227988469, lift=1.4210493489806006), OrderedStatistic(items_base=frozenset({'41', 'None', '48'}), items_add=frozenset({'39'}), confidence=0.8168108227988469, lift=1.4210493489806006)])]
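
The figures apyori reports can be cross-checked against the support counts from the hand-written run above. For the rule {41} -> {39} (the transaction total of 88162 is inferred from the report itself: count / support = 11414 / 0.12946620993171662):

# Cross-check apyori's confidence and lift for {41} -> {39}
# using the support counts from the hand-written Apriori run.
n_transactions = 88162                    # inferred: 11414 / 0.12946620993171662
support_39   = 50675 / n_transactions     # support of the consequent {39}
confidence = 11414 / 14945                # P(39 | 41) = count({39,41}) / count({41})
lift = confidence / support_39            # > 1 means 39 and 41 co-occur more than by chance
print(confidence)  # 0.76373..., matches apyori's 0.7637336901973905
print(lift)        # 1.32870..., matches apyori's 1.3287082307880087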

The FP-Growth Algorithm

Apriori incurs a heavy I/O load on large datasets. FP-Growth improves on it: it scans the dataset only twice and compresses it into an FP-Tree, with no candidate generation at all, which greatly reduces the computational cost. For the algorithm details, see the post 關聯規則—Apriori算法—FPTree.
Implementation:

# FP-Growth implementation, adapted from https://blog.csdn.net/songbinxu/article/details/80411388
class treeNode:
    def __init__(self, nameValue, numOccur, parentNode):
        self.name = nameValue  # node name (the item)
        self.count = numOccur  # occurrence counter
        self.nodeLink = None  # link to the next node holding the same item
        self.parent = parentNode  # parent node, used for backtracking
        self.children = {}  # child nodes

    def inc(self, numOccur):
        self.count += numOccur

    def disp(self, ind=1):
        # print the tree, for debugging
        print('  '*ind, self.name, ' ', self.count)
        for child in self.children.values():
            child.disp(ind+1)

def updateHeader(nodeToTest, targetNode):
    """
    Append targetNode to the end of the node-link chain
    @nodeToTest: the node to start traversing from
    @targetNode: the node to append
    """
    while nodeToTest.nodeLink != None:
        nodeToTest = nodeToTest.nodeLink
    nodeToTest.nodeLink = targetNode

def updateFPtree(items, inTree, headerTable, count):
    """
    Update the FP-Tree with one (filtered, sorted) transaction
    @items: the itemset read from the data
    @inTree: the tree built so far
    @headerTable: the header table indexing the node-link chains
    @count: the transaction's count
    """
    if items[0] in inTree.children:
        # the first item already exists as a child node
        inTree.children[items[0]].inc(count)
    else:
        # create a new branch
        inTree.children[items[0]] = treeNode(items[0], count, inTree)
        if headerTable[items[0]][1] == None:
            headerTable[items[0]][1] = inTree.children[items[0]]
        else:
            updateHeader(headerTable[items[0]][1], inTree.children[items[0]])
    # recurse on the remaining items
    if len(items) > 1:
        updateFPtree(items[1::], inTree.children[items[0]], headerTable, count)

def createFPtree(dataSet, minSup=1):
    """
    Build the FP-Tree
    @dataSet: the dataset, as a dict {frozenset(transaction): count}
    @minSup: minimum support count
    """
    headerTable = {}
    for trans in dataSet:
        for item in trans:
            headerTable[item] = headerTable.get(item, 0) + dataSet[trans]
    for k in list(headerTable.keys()):
        if headerTable[k] < minSup:
            del(headerTable[k]) # drop items below the minimum support count
    freqItemSet = set(headerTable.keys()) # the frequent single items
    if len(freqItemSet) == 0:
        return None, None
    for k in headerTable:
        headerTable[k] = [headerTable[k], None] # element: [count, node]

    retTree = treeNode('Null Set', 1, None)
    for tranSet, count in dataSet.items():
        # dataSet: [element, count]
        localD = {}
        for item in tranSet:
            if item in freqItemSet: # keep only this transaction's items that meet the minimum support
                localD[item] = headerTable[item][0] # element : count
        if len(localD) > 0:
            # sort this transaction's items by global frequency, descending
            orderedItem = [v[0] for v in sorted(localD.items(), key=lambda p:(p[1], int(p[0])), reverse=True)]
            # update the tree with the filtered, sorted transaction
            updateFPtree(orderedItem, retTree, headerTable, count)
    return retTree, headerTable

def ascendFPtree(leafNode, prefixPath):
    """
    Backtrack from a node towards the root
    @leafNode: the node to start from
    @prefixPath: the accumulated prefix path
    """
    if leafNode.parent != None:
        prefixPath.append(leafNode.name)
        ascendFPtree(leafNode.parent, prefixPath)

def findPrefixPath(basePat, myHeaderTab):
    """
    Collect the conditional pattern bases
    @basePat: the base pattern (item)
    @myHeaderTab: the header table indexing the node-link chains
    """
    node = myHeaderTab[basePat][1] # first node for basePat in the FP-Tree (renamed to avoid shadowing the treeNode class)
    condPats = {}
    while node != None:
        prefixPath = []
        ascendFPtree(node, prefixPath) # prefixPath is reversed: from node up to the root
        if len(prefixPath) > 1:
            condPats[frozenset(prefixPath[1:])] = node.count # associate this node's count
        node = node.nodeLink # next node holding basePat
    return condPats

def mineFPtree(inTree, headerTable, minSup, preFix, freqItemList):
    """
    Mine frequent itemsets from the FP-Tree
    @inTree: the FP-Tree
    @headerTable: the header table
    @minSup: minimum support count
    @preFix: the current frequent-item prefix
    @freqItemList: collects all frequent itemsets found
    """
    # start from the items in headerTable, least frequent first
    # (sort on the count p[1][0]; sorting on p[1] would compare treeNode objects and fail on ties)
    bigL = [v[0] for v in sorted(headerTable.items(), key=lambda p:p[1][0])]
    for basePat in bigL: # for each frequent item
        newFreqSet = preFix.copy()
        newFreqSet.add(basePat)
        freqItemList.append(newFreqSet)
        condPattBases = findPrefixPath(basePat, headerTable) # conditional pattern bases of the current item
        myCondTree, myHead = createFPtree(condPattBases, minSup) # build the conditional FP-Tree
        if myHead != None:
            # print('conditional tree for: ', newFreqSet)
            # myCondTree.disp(1)
            mineFPtree(myCondTree, myHead, minSup, newFreqSet, freqItemList) # recursively mine the conditional FP-Tree

def createInitSet(dataSet):
    """
    Convert the raw transactions into the input format {frozenset: count}
    @dataSet: the dataset, an iterable of transactions
    """
    retDict = {}
    for trans in dataSet:
        key = frozenset(trans)
        retDict[key] = retDict.get(key, 0) + 1
    return retDict

def calSuppData(headerTable, freqItemList, total):
    """
    Compute the support of each frequent itemset
    @headerTable: the global header table
    @freqItemList: the frequent itemsets
    @total: total number of transactions
    """
    suppData = {}
    for Item in freqItemList:
        # put the globally least frequent item first (it sits deepest in the tree)
        Item = sorted(Item, key=lambda x: headerTable[x][0])
        base = findPrefixPath(Item[0], headerTable)
        # sum the counts of the pattern bases that contain the rest of the itemset
        support = 0
        for B in base:
            if frozenset(Item[1:]).issubset(set(B)):
                support += base[B]
        # direct children of the root have no conditional pattern base
        if len(base) == 0 and len(Item) == 1:
            support = headerTable[Item[0]][0]

        suppData[frozenset(Item)] = support / float(total)
    return suppData

def aprioriGen(Lk, k):
    """
    Merge pairs of (k-1)-itemsets that share their first k-2 items
    to produce candidate k-itemsets (classic Apriori candidate generation)
    """
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i+1, lenLk):
            L1 = list(Lk[i])[:k-2]; L2 = list(Lk[j])[:k-2]
            L1.sort(); L2.sort()
            if L1 == L2:
                retList.append(Lk[i] | Lk[j])
    return retList

def calcConf(freqSet, H, supportData, br1, minConf=0.7):
    """
    Compute confidences; the rule evaluation function
    @freqSet: a frequent itemset, superset of each element of H
    @H: candidate consequents
    @supportData: dict mapping itemsets to their support
    @br1: collects the rules that pass the threshold
    """
    prunedH = []
    for conseq in H:
        conf = supportData[freqSet] / supportData[freqSet - conseq]
        if conf >= minConf:
            print("{0} --> {1} conf:{2}".format(freqSet - conseq, conseq, conf))
            br1.append((freqSet - conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH

def rulesFromConseq(freqSet, H, supportData, br1, minConf=0.7):
    """
    H holds candidate consequents drawn from freqSet; the recursion grows
    the consequent size from len(H[0])+1 up to len(freqSet)-1.
    Other parameters are as in calcConf.
    """
    m = len(H[0])
    if len(freqSet) > m+1:
        Hmp1 = aprioriGen(H, m+1)
        Hmp1 = calcConf(freqSet, Hmp1, supportData, br1, minConf)
        if len(Hmp1) > 1:
            rulesFromConseq(freqSet, Hmp1, supportData, br1, minConf)

def generateRules(freqItemList, supportData, minConf=0.7):
    """
    Main entry point for rule generation
    @freqItemList: list of frequent itemsets
    @supportData: dict mapping frequent itemsets to their support
    @minConf: minimum confidence threshold
    A rule requires an itemset with at least two elements.
    """
    bigRuleList = []
    for freqSet in freqItemList:
        freqSet = frozenset(freqSet)  # mineFPtree yields plain sets; supportData keys are frozensets
        if len(freqSet) < 2:
            continue  # a 1-itemset cannot be split into antecedent and consequent
        H1 = [frozenset([item]) for item in freqSet]
        # evaluate single-item consequents first, then recurse on larger ones
        calcConf(freqSet, H1, supportData, bigRuleList, minConf)
        if len(freqSet) > 2:
            rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList
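
Before running on the full dataset, it helps to sanity-check the pipeline on a toy example and visualize the tree. A minimal sketch (the transactions are made up; items must be numeric strings here because createFPtree breaks sorting ties with int(p[0])):

# Toy sanity check: four made-up transactions, minimum support count 2
toy = [['1', '2'], ['2', '3', '4'], ['1', '2', '3'], ['1', '2', '4']]
initSet = createInitSet(toy)                  # {frozenset: count}
myTree, myHeader = createFPtree(initSet, 2)   # two passes: count items, then build the tree
myTree.disp()                                 # print the tree, indented by depth
freq = []
mineFPtree(myTree, myHeader, 2, set([]), freq)
print(freq)                                   # every itemset with support count >= 2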

Usage:

# Load the data
dataset = pd.read_csv('retail.csv', usecols=['items'])
for index, row in dataset.iterrows():
    dataset.loc[index, 'items'] = row['items'].strip()
dataset = dataset['items'].str.split(" ")
start = time()
initSet = createInitSet(dataset.values)
# Build the FP-Tree with a minimum support count of 5000
myFPtree, myHeaderTab = createFPtree(initSet, 5000)
freqItems = []
mineFPtree(myFPtree, myHeaderTab, 5000, set([]), freqItems)
print("Search finished, total time: %ss" % (time() - start))
for x in freqItems:
    print(x)

Output:

Search finished, total time: 3.236400842666626s
{'41'}
{'41', '48'}
{'41', '39', '48'}
{'41', '39'}
{'32'}
{'48', '32'}
{'39', '48', '32'}
{'39', '32'}
{'38'}
{'38', '48'}
{'38', '39', '48'}
{'38', '39'}
{'48'}
{'39', '48'}
{'39'}

The runtime drops dramatically compared with Apriori: about 3.2 seconds versus 94.5 seconds on the same dataset with the same threshold.
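
The rule-generation helpers defined above can then be applied to the mined itemsets. A sketch (note that calSuppData reconstructs supports from the global header table's prefix paths, so treat its output for single items as approximate):

# Turn the mined frequent itemsets into association rules
suppData = calSuppData(myHeaderTab, freqItems, len(dataset))  # per-itemset support
rules = generateRules(freqItems, suppData, minConf=0.7)       # prints and collects the rules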
