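FP-Growth is an algorithm for mining frequent itemsets. Its typical use is discovering associations between items: support measures how strong an association is, and a minimum-support threshold filters out the weakly associated items. Apriori has to scan the data many times, so it is heavily limited by I/O; FP-Growth scans the dataset only twice, which improves running efficiency. The figure below is taken from the paper: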
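The left side shows the initial dataset, i.e. the original relations. First, traverse the dataset on the left, count how many times each element occurs, and sort the elements by count in descending order; this gives the table in the middle. With minSupport = 3, drop every element whose count is below minSupport, then traverse the original relations again and rebuild them: within each row, remove the infrequent elements and reorder the remaining ones by descending total count. This gives the dataset on the right. For example, f in the first row on the left occurs 2 times, and 2 < 3, so f is removed and the new relation is [d,a]. The dataset is traversed twice, hence the two scans.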
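The two passes described above can be sketched on their own as follows. This is only an illustration under assumptions: the class name TwoPassDemo, the toy transactions and the alphabetical tie-break are made up for the example and are not the data from the paper's figure. It also assumes an item appears at most once per transaction and needs Java 9+ for List.of.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TwoPassDemo {
    // Pass 1: count how many transactions contain each item
    // (assumes an item appears at most once per transaction).
    static Map<String, Integer> countItems(List<List<String>> transactions) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> t : transactions) {
            for (String item : t) {
                counts.merge(item, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Pass 2: drop infrequent items and order the rest by descending count.
    static List<List<String>> rewrite(List<List<String>> transactions,
                                      Map<String, Integer> counts, int minSupport) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> t : transactions) {
            List<String> kept = new ArrayList<>();
            for (String item : t) {
                if (counts.get(item) >= minSupport) {
                    kept.add(item);
                }
            }
            // Ties are broken alphabetically so the order is deterministic.
            kept.sort((a, b) -> {
                int byCount = Integer.compare(counts.get(b), counts.get(a));
                return byCount != 0 ? byCount : a.compareTo(b);
            });
            result.add(kept);
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy data, not the dataset from the figure.
        List<List<String>> data = List.of(
            List.of("a", "b", "c"),
            List.of("b", "c"),
            List.of("c", "d"));
        Map<String, Integer> counts = countItems(data);
        System.out.println(rewrite(data, counts, 2)); // [[c, b], [c, b], [c]]
    }
}
```

Here a and d occur once each and are dropped at minSupport = 2, while c (3 occurrences) is placed before b (2 occurrences) in every row.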
Setting minSupport (stored in the field minStep in this code):
private int minStep = 1;

public int getMinStep() {
    return minStep;
}

public void setMinStep(int minStep) {
    this.minStep = minStep;
}
Next, build an FP-tree. This requires a node structure; since the implementation is in Java, it is an FPTreeNode class:
private class FPTreeNode {
    int count = 0; // number of occurrences (accesses)
    int blockAddress = 0; // the item itself; named blockAddress because I mainly use this to find relations between block addresses
    FPTreeNode parent = null; // parent node
    FPTreeNode nextSimilarNode = null; // next node that holds the same item
    List<FPTreeNode> childSet = new ArrayList<>();

    public FPTreeNode(int blockAddress) {
        this.blockAddress = blockAddress;
        count = 1;
    }
}
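The resulting structure is shown in the figure below (blue lines are nextSimilarNode links, red lines are parent links).
Building the FP-tree shown above takes three main steps:
1. First, traverse the dataset and count the total occurrences of each element. The init method reads the file and groups its rows into relation sequences; the counting itself is done in buildTable in step 2.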
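// Read the trace file and group block addresses into relation sequences by time window.
// Each row is: time (s), start block, access count.
// Needs java.io.BufferedReader, File, FileReader, IOException and java.util.ArrayList, List.
// Note: listsCount is not declared in this snippet; it is assumed to be a
// List<List<Integer>> field of the enclosing class.
public List<List<Integer>> init(File file) {
    List<List<Integer>> lists = new ArrayList<>();
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader bufferedReader = new BufferedReader(new FileReader(file))) {
        String string;
        double start = -1.0; // end of the current window; -1.0 means "not set yet"
        List<Integer> list = new ArrayList<>();
        List<Integer> listCount = new ArrayList<>();
        // window length in seconds; originally 60.0, but the description and
        // the sample output in this post correspond to 180 s windows
        double timePeriod = 180.0;
        while ((string = bufferedReader.readLine()) != null) {
            String[] s = string.split(",");
            double time = Double.parseDouble(s[0]);
            int block = Integer.parseInt(s[1]);
            int accessCount = Integer.parseInt(s[2]);
            if (start == -1.0) {
                start = time + timePeriod;
            }
            if (start > time) {
                // still inside the current window: record each block once
                if (!list.contains(block)) {
                    list.add(block);
                    listCount.add(accessCount);
                }
            } else {
                // window ended: store it and start a new one with this row
                lists.add(new ArrayList<>(list));
                listsCount.add(new ArrayList<>(listCount));
                start = time + timePeriod;
                list.clear();
                listCount.clear();
                list.add(block);
                listCount.add(accessCount);
            }
        }
        if (list.size() > 0) {
            lists.add(new ArrayList<>(list));
            listsCount.add(new ArrayList<>(listCount));
        }
    } catch (IOException e) {
        // also covers FileNotFoundException, which is a subclass of IOException
        e.printStackTrace();
    }
    return lists;
}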
2. Filter by minSupport to build the header table, sorted by occurrence count in descending order.
// Needs java.util.ArrayList, Collections, Comparator, HashMap, List, Map.
public List<FPTreeNode> buildTable(List<List<Integer>> lists) {
    List<FPTreeNode> trees = new ArrayList<>(); // header table of frequent items; items below minSupport are dropped
    if (lists.size() == 0) {
        return trees; // empty table instead of null, so callers need no null check
    }
    // count, for every item, the number of relation sequences that contain it
    Map<Integer, FPTreeNode> hashMap = new HashMap<>();
    for (List<Integer> tmp : lists) {
        for (int val : tmp) {
            FPTreeNode node = hashMap.get(val);
            if (node != null) {
                node.count++;
            } else {
                hashMap.put(val, new FPTreeNode(val)); // the constructor sets count to 1
            }
        }
    }
    // keep only the items that reach minSupport (minStep)
    for (FPTreeNode node : hashMap.values()) {
        if (node.count >= minStep) {
            trees.add(node);
        }
    }
    // sort the frequent items by count, descending
    Collections.sort(trees, new Comparator<FPTreeNode>() {
        @Override
        public int compare(FPTreeNode o1, FPTreeNode o2) {
            return Integer.compare(o2.count, o1.count);
        }
    });
    return trees;
}
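3. Build the FP-tree recursively, bottom-up. The process is shown in the figure below (red arrows show the processing order; the blue "shaded" FP-trees are the projections that get created).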
/**
 * Returns the items of list that are frequent, in descending order of count.
 * @param list  one relation sequence
 * @param trees header table, already sorted by descending count
 * @return the filtered and reordered sequence
 */
public List<Integer> sortbyTrees(List<Integer> list, List<FPTreeNode> trees) {
    List<Integer> tmp = new ArrayList<>();
    for (int i = 0; i < trees.size(); i++) {
        int block = trees.get(i).blockAddress;
        if (list.contains(block)) {
            tmp.add(block);
        }
    }
    return tmp;
}
// Returns the child of root that holds the given item, or null if there is none.
public FPTreeNode findChild(FPTreeNode root, int node) {
    for (FPTreeNode treeNode : root.childSet) {
        if (treeNode.blockAddress == node) {
            return treeNode;
        }
    }
    return null;
}
// Recursively appends the remaining items of list as a new chain of nodes below node.
public void addNode(FPTreeNode node, List<Integer> list, List<FPTreeNode> trees) {
    if (list.size() > 0) {
        int val = list.remove(0);
        FPTreeNode node1 = new FPTreeNode(val);
        node1.parent = node;
        node.childSet.add(node1);
        // append the new node to the end of the header table's chain for this item
        for (FPTreeNode treeNode : trees) {
            if (treeNode.blockAddress == val) {
                while (treeNode.nextSimilarNode != null) {
                    treeNode = treeNode.nextSimilarNode;
                }
                treeNode.nextSimilarNode = node1;
                break;
            }
        }
        addNode(node1, list, trees);
    }
}
public FPTreeNode buildFPTree(List<List<Integer>> lists, List<FPTreeNode> trees) {
    // the root uses blockAddress 0 as a sentinel, so 0 must not be a real item
    FPTreeNode root = new FPTreeNode(0);
    for (int i = 0; i < lists.size(); i++) {
        List<Integer> tmp = sortbyTrees(lists.get(i), trees); // frequent items only, in descending order of count
        FPTreeNode subRoot = root;
        FPTreeNode tmpRoot = root;
        if (root.childSet.size() > 0) {
            // follow the existing prefix as far as it matches, incrementing counts
            while (tmp.size() > 0 && (tmpRoot = findChild(subRoot, tmp.get(0))) != null) {
                tmpRoot.count++;
                subRoot = tmpRoot;
                tmp.remove(0);
            }
        }
        // whatever is left becomes a new branch
        addNode(subRoot, tmp, trees);
    }
    return root;
}
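The prefix sharing that buildFPTree relies on can be seen in isolation in the following minimal sketch. It is a simplified stand-in, not the code above: the class PrefixTreeDemo, its Node type and the render helper are hypothetical, the insertion is iterative rather than recursive, and the parent and nextSimilarNode links are left out. It needs Java 9+ for List.of.

```java
import java.util.ArrayList;
import java.util.List;

class PrefixTreeDemo {
    static class Node {
        final String item;
        int count;
        final List<Node> children = new ArrayList<>();

        Node(String item, int count) {
            this.item = item;
            this.count = count;
        }
    }

    // Insert one already-sorted transaction, reusing existing prefix nodes.
    static void insert(Node root, List<String> transaction) {
        Node current = root;
        for (String item : transaction) {
            Node next = null;
            for (Node child : current.children) {
                if (child.item.equals(item)) {
                    next = child;
                    break;
                }
            }
            if (next == null) {
                next = new Node(item, 0);
                current.children.add(next);
            }
            next.count++;
            current = next;
        }
    }

    // Render the tree as nested "item:count(children)" text.
    static String render(Node node) {
        StringBuilder sb = new StringBuilder(node.item + ":" + node.count);
        if (!node.children.isEmpty()) {
            sb.append("(");
            for (int i = 0; i < node.children.size(); i++) {
                if (i > 0) {
                    sb.append(" ");
                }
                sb.append(render(node.children.get(i)));
            }
            sb.append(")");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Node root = new Node("root", 0);
        insert(root, List.of("c", "b"));
        insert(root, List.of("c", "b"));
        insert(root, List.of("c"));
        System.out.println(render(root)); // root:0(c:3(b:2))
    }
}
```

All three transactions share the node for c, so the tree ends up with two nodes instead of five, and the counts record how many transactions pass through each node.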
4. Mine the frequent itemsets with the FP-Growth algorithm, bottom-up.
public void FPGrowth(List<List<Integer>> transRecords, List<Integer> postPattern) {
    List<FPTreeNode> trees = buildTable(transRecords); // build the header table, which is also the set of frequent 1-itemsets
    // build the FP-tree
    FPTreeNode root1 = buildFPTree(transRecords, trees);
    if (root1.childSet.size() == 0) {
        return;
    }
    // print each frequent item together with the suffix pattern, as "count:item suffix..."
    if (postPattern.size() > 0) {
        for (FPTreeNode node : trees) {
            System.out.print(node.count + ":" + node.blockAddress);
            for (int val : postPattern) {
                System.out.print(" " + val);
            }
            System.out.println();
        }
    }
    // process the header table from the least frequent item upwards
    for (int i = trees.size() - 1; i >= 0; i--) {
        FPTreeNode node = trees.get(i);
        List<Integer> tmp = new ArrayList<>();
        tmp.add(node.blockAddress);
        if (postPattern.size() > 0) {
            tmp.addAll(postPattern);
        }
        // find the conditional pattern base of this header item and collect it in records
        List<List<Integer>> records = new ArrayList<>();
        FPTreeNode nextNode = node.nextSimilarNode;
        while (nextNode != null) {
            int cnt = nextNode.count;
            List<Integer> prenodes = new ArrayList<>();
            FPTreeNode parent = nextNode;
            // walk up the parent links, stopping at the root (blockAddress 0)
            while ((parent = parent.parent) != null && parent.blockAddress != 0) {
                prenodes.add(parent.blockAddress);
            }
            // repeat the prefix path once per occurrence so the counts carry over
            while (cnt > 0) {
                cnt--;
                records.add(prenodes);
            }
            nextNode = nextNode.nextSimilarNode;
        }
        FPGrowth(records, tmp);
    }
}
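The core of the mining step is collecting the conditional pattern base for one item. The sketch below isolates that part on a hand-built tree. It is an illustration under assumptions: ConditionalBaseDemo, its Node type and the example tree are hypothetical, and the root is recognised by its null parent instead of the blockAddress 0 sentinel used above.

```java
import java.util.ArrayList;
import java.util.List;

class ConditionalBaseDemo {
    static class Node {
        final String item;
        final int count;
        final Node parent;
        Node next; // next node holding the same item

        Node(String item, int count, Node parent) {
            this.item = item;
            this.count = count;
            this.parent = parent;
        }
    }

    // Follow the same-item chain; for every occurrence, collect its ancestors
    // (excluding the root) and repeat that prefix path count times.
    static List<List<String>> conditionalPatternBase(Node firstOccurrence) {
        List<List<String>> base = new ArrayList<>();
        for (Node n = firstOccurrence; n != null; n = n.next) {
            List<String> prefix = new ArrayList<>();
            for (Node p = n.parent; p != null && p.parent != null; p = p.parent) {
                prefix.add(p.item);
            }
            for (int i = 0; i < n.count; i++) {
                base.add(prefix);
            }
        }
        return base;
    }

    public static void main(String[] args) {
        // Tree for the transactions [c, b, a], [c, b], [c, a].
        Node root = new Node("root", 0, null);
        Node c = new Node("c", 3, root);
        Node b = new Node("b", 2, c);
        Node a1 = new Node("a", 1, b); // path root -> c -> b -> a
        Node a2 = new Node("a", 1, c); // path root -> c -> a
        a1.next = a2;
        System.out.println(conditionalPatternBase(a1)); // [[b, c], [c]]
    }
}
```

Feeding such a list back into the same procedure, as FPGrowth does with records, yields the conditional FP-tree for the item. Detecting the root through its null parent avoids reserving 0 as a value that no real item may use.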
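I have tested this and it works. The dataset format is shown below. The goal is to find associations between the values in the second column: the first column is the time in seconds, and the second-column elements that fall within each 180 s window form one relation sequence.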
60,2104409088,1486
120,2104409088,667
120,2104410112,783
180,2104410112,1467
240,2104410112,1152
240,2104411136,301
300,2104411136,1447
360,2104411136,1429
420,2104411136,225
420,2104412160,1209
480,2104412160,1470
540,2104412160,722
540,2104413184,715
600,2104413184,1455
The generated relation sequences are:
[[2104409088, 2104410112], [2104410112, 2104411136], [2104411136, 2104412160, 2104413184], [2104413184]]
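The grouping above can be reproduced with a small standalone sketch of the windowing rule, assuming the 180 s window stated in the text. The class WindowDemo and its method are hypothetical, and the third column (access count) is ignored because it does not affect the grouping.

```java
import java.util.ArrayList;
import java.util.List;

class WindowDemo {
    // First two columns of the sample dataset.
    static final double[] TIMES = {60, 120, 120, 180, 240, 240, 300, 360, 420, 420, 480, 540, 540, 600};
    static final int[] BLOCKS = {
        2104409088, 2104409088, 2104410112, 2104410112, 2104410112, 2104411136, 2104411136,
        2104411136, 2104411136, 2104412160, 2104412160, 2104412160, 2104413184, 2104413184};

    // Group rows into relation sequences. A row whose time reaches the end of the
    // current window closes that window and becomes the first row of the next one.
    static List<List<Integer>> group(double[] times, int[] blocks, double window) {
        List<List<Integer>> result = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        double end = -1.0; // negative means "not set yet"
        for (int i = 0; i < times.length; i++) {
            if (end < 0) {
                end = times[i] + window;
            }
            if (times[i] >= end) {
                result.add(new ArrayList<>(current));
                current.clear();
                end = times[i] + window;
            }
            if (!current.contains(blocks[i])) {
                current.add(blocks[i]); // each block appears once per window
            }
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(group(TIMES, BLOCKS, 180.0));
    }
}
```

With the first window ending at 240, the rows at 240, 420 and 600 each open a new window, which gives the four sequences listed above.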
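The resulting frequent itemsets are listed below, each line in the form count:item followed by the rest of the itemset (the dataset is small, so the frequent patterns are not very pronounced):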
1:2104413184 2104412160
1:2104411136 2104412160
1:2104413184 2104411136 2104412160
1:2104410112 2104409088
1:2104411136 2104413184
1:2104410112 2104411136
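One way to cross-check this output is a brute-force enumeration over the four relation sequences. The sketch below is not FP-Growth; it is a hypothetical checker (class BruteForceCheck) that tries every combination of items and counts the sequences containing it, assuming minSupport = 1 (the default value of minStep) and skipping single items, which the program above does not print either. It needs Java 9+ for List.of.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class BruteForceCheck {
    // Lists every itemset with at least two items whose support reaches minSupport.
    // Exponential in the number of distinct items, so only usable on tiny inputs.
    static List<String> frequentItemsets(List<List<Integer>> transactions, int minSupport) {
        TreeSet<Integer> distinct = new TreeSet<>();
        for (List<Integer> t : transactions) {
            distinct.addAll(t);
        }
        List<Integer> items = new ArrayList<>(distinct);
        List<String> found = new ArrayList<>();
        for (int mask = 1; mask < (1 << items.size()); mask++) {
            if (Integer.bitCount(mask) < 2) {
                continue; // skip single items
            }
            List<Integer> candidate = new ArrayList<>();
            for (int i = 0; i < items.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    candidate.add(items.get(i));
                }
            }
            int support = 0;
            for (List<Integer> t : transactions) {
                if (t.containsAll(candidate)) {
                    support++;
                }
            }
            if (support >= minSupport) {
                found.add(support + ":" + candidate);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<List<Integer>> transactions = List.of(
            List.of(2104409088, 2104410112),
            List.of(2104410112, 2104411136),
            List.of(2104411136, 2104412160, 2104413184),
            List.of(2104413184));
        for (String line : frequentItemsets(transactions, 1)) {
            System.out.println(line);
        }
    }
}
```

The checker finds six itemsets, each with support 1: five pairs and one triple. They are the same itemsets as in the listing above, printed in a different format and order.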