The most classic algorithm in association rule mining is Apriori, whose fatal drawback is that it must scan the transaction database many times. Various methods of pruning the dataset have therefore been proposed to reduce I/O cost, and Jiawei Han's FP-Tree algorithm is a particularly efficient one.
Support and Confidence
Strictly speaking, Apriori and FP-Tree are both algorithms for finding frequent itemsets, where a frequent itemset is one whose "support" is sufficiently high. Let us first explain the notions of support and confidence.
Suppose the transaction database is:
A E F G
A F G
A B E F G
E F G
Then the support count of {A,F,G} is 3, and its support is 3/4.
The support count of {F,G} is 4, and its support is 4/4.
The support count of {A} is 3, and its support is 3/4.
The confidence of {F,G} => {A} is the support count of {A,F,G} divided by the support count of {F,G}, i.e. 3/4.
The confidence of {A} => {F,G} is the support count of {A,F,G} divided by the support count of {A}, i.e. 3/3.
Strong association rule mining looks for all patterns whose confidence reaches a threshold, among those that already satisfy a minimum support.
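To make these definitions concrete, here is a minimal sketch that computes support counts and confidence for the four-transaction database above (the class and helper names are mine, for illustration only):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SupportConfidence {
    // Count how many transactions contain every item of the given itemset.
    static int supportCount(List<Set<String>> db, Set<String> itemset) {
        int count = 0;
        for (Set<String> t : db)
            if (t.containsAll(itemset))
                count++;
        return count;
    }

    public static void main(String[] args) {
        List<Set<String>> db = Arrays.asList(
                new HashSet<String>(Arrays.asList("A", "E", "F", "G")),
                new HashSet<String>(Arrays.asList("A", "F", "G")),
                new HashSet<String>(Arrays.asList("A", "B", "E", "F", "G")),
                new HashSet<String>(Arrays.asList("E", "F", "G")));
        Set<String> afg = new HashSet<String>(Arrays.asList("A", "F", "G"));
        Set<String> fg = new HashSet<String>(Arrays.asList("F", "G"));
        System.out.println(supportCount(db, afg)); // 3, so support is 3/4
        // Confidence of {F,G} => {A} = supportCount({A,F,G}) / supportCount({F,G})
        System.out.println((double) supportCount(db, afg) / supportCount(db, fg)); // 0.75
    }
}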
The FP-Tree Algorithm
Let us walk through a complete implementation of the FP-Tree algorithm with an example.
The transaction database is as follows; each row is one shopping record:
milk,eggs,bread,chips
eggs,popcorn,chips,beer
eggs,bread,chips
milk,eggs,bread,popcorn,chips,beer
milk,bread,beer
eggs,bread,beer
milk,bread,chips
milk,eggs,bread,butter,chips
milk,eggs,butter,chips
Our goal is to find which products tend to appear together. For example, people who buy chips usually also buy eggs, so [chips, eggs] is a frequent pattern.
Step 1 of the FP-Tree algorithm: scan the transaction database, sort the items by descending frequency, and delete items whose frequency is below the minimum support MinSup. (This is the first scan of the database.)
chips: 7, eggs: 7, bread: 7, milk: 6, beer: 4 (here we set MinSup = 3)
This result is the set of frequent 1-itemsets, denoted F1.
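For concreteness, here is a minimal standalone sketch of this first scan (the class and method names are mine for illustration; the buildHeaderTable method in the full implementation below does the same job against the header table):

import java.util.*;

public class FirstScan {
    // First scan: count each item's frequency, drop items below minSup,
    // and sort the survivors by descending count.
    public static List<String> frequentItems(List<List<String>> db, int minSup) {
        final Map<String, Integer> counts = new HashMap<String, Integer>();
        for (List<String> record : db)
            for (String item : record) {
                Integer c = counts.get(item);
                counts.put(item, c == null ? 1 : c + 1);
            }
        List<String> f1 = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() >= minSup)
                f1.add(e.getKey());
        Collections.sort(f1, new Comparator<String>() {
            public int compare(String a, String b) {
                return counts.get(b) - counts.get(a); // descending frequency
            }
        });
        return f1;
    }

    public static void main(String[] args) {
        List<List<String>> db = Arrays.asList(
                Arrays.asList("milk", "eggs", "bread", "chips"),
                Arrays.asList("eggs", "popcorn", "chips", "beer"),
                Arrays.asList("eggs", "bread", "chips"));
        // e.g. [chips, eggs, bread] (items tied on count may swap positions)
        System.out.println(frequentItems(db, 2));
    }
}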
Step 2: reorder the items of every shopping record according to their order in F1. (This is the second and final scan of the database; a standalone sketch of the reordering follows the list below.)
chips,eggs,bread,milk
chips,eggs,beer
chips,eggs,bread
chips,eggs,bread,milk,beer
bread,milk,beer
eggs,bread,beer
chips,bread,milk
chips,eggs,bread,milk
chips,eggs,milk
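A minimal standalone sketch of this reordering (the class and method names are mine; the sortByF1 method in the full implementation plays this role):

import java.util.*;

public class ReorderByF1 {
    // Keep only items present in f1 and sort them by their position in f1
    // (f1 is already in descending-frequency order).
    static LinkedList<String> reorder(List<String> record, final List<String> f1) {
        LinkedList<String> kept = new LinkedList<String>();
        for (String item : record)
            if (f1.contains(item))
                kept.add(item);
        Collections.sort(kept, new Comparator<String>() {
            public int compare(String a, String b) {
                return f1.indexOf(a) - f1.indexOf(b);
            }
        });
        return kept;
    }

    public static void main(String[] args) {
        List<String> f1 = Arrays.asList("chips", "eggs", "bread", "milk", "beer");
        // The fourth record of the example; popcorn is infrequent and dropped.
        System.out.println(reorder(
                Arrays.asList("milk", "eggs", "bread", "popcorn", "chips", "beer"), f1));
        // prints [chips, eggs, bread, milk, beer]
    }
}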
Step 3: insert each of the records from step 2 into the FP-Tree. Initially the suffix pattern is empty.
After inserting the first record (chips, eggs, bread, milk):
After inserting the second record (chips, eggs, beer):
After inserting the record (bread, milk, beer), which opens a new branch off the root:
By now you can probably see how the remaining records are inserted; the final FP-Tree is:
The column on the left of the figure is the header table. Nodes with the same name in the tree are linked together, and the head of each linked list is the corresponding header-table entry.
If the FP-Tree is empty (it contains only the virtual root node), the FP-Growth function returns.
Otherwise, each header-table item is output together with the suffix pattern postPattern, with support equal to that item's count in the header table.
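Before moving on to step 4, here is a pared-down sketch of the step-3 insertion logic, using a simplified node type (the names are mine for illustration; the full version with parent pointers and header-table links is buildFPTree/addNodes below). Shared prefixes only get their counts bumped; new nodes are created once the prefix diverges:

import java.util.*;

public class FPTreeInsert {
    // A pared-down FP-Tree node: name, count, and name -> child map.
    static class Node {
        final String name;
        int count;
        final Map<String, Node> children = new HashMap<String, Node>();
        Node(String name) { this.name = name; }
    }

    // Insert one reordered record: follow existing children, bumping their
    // counts, and create new nodes once the shared prefix ends.
    static void insert(Node root, List<String> record) {
        Node cur = root;
        for (String item : record) {
            Node child = cur.children.get(item);
            if (child == null) {
                child = new Node(item);
                cur.children.put(item, child);
            }
            child.count++;
            cur = child;
        }
    }

    public static void main(String[] args) {
        Node root = new Node(null); // virtual root
        insert(root, Arrays.asList("chips", "eggs", "bread", "milk"));
        insert(root, Arrays.asList("chips", "eggs", "beer"));
        // The two records share the prefix chips -> eggs, now with count 2.
        System.out.println(root.children.get("chips").count);                      // 2
        System.out.println(root.children.get("chips").children.get("eggs").count); // 2
    }
}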
Step 4: extract the frequent patterns from the FP-Tree.
Iterate over the items of the header table (we take "milk: 6" as the example) and perform the following operations on each:
(1) Find all the "milk" nodes in the FP-Tree and walk up through their ancestors, which yields 4 paths:
chips:7, eggs:6, milk:1
chips:7, eggs:6, bread:4, milk:3
chips:7, bread:1, milk:1
bread:1, milk:1
Set the count of every node on each path to the count of that path's milk node:
chips:1, eggs:1, milk:1
chips:3, eggs:3, bread:3, milk:3
chips:1, bread:1, milk:1
bread:1, milk:1
Because every path ends in milk, we can drop the milk nodes, which gives the conditional pattern base (CPB). The suffix pattern at this point is (milk):
chips:1, eggs:1
chips:3, eggs:3, bread:3
chips:1, bread:1
bread:1
(2) Treat the result above as a new transaction database, go back to step 3, and recurse.
If that is not clear enough, you can refer to the blog post; otherwise, let's look at the core code directly:
public void FPGrowth(List<List<String>> transRecords,
        List<String> postPattern, Context context) throws IOException, InterruptedException {
    // Build the header table, which is also the frequent 1-itemset
    ArrayList<TreeNode> HeaderTable = buildHeaderTable(transRecords);
    // Build the FP-Tree
    TreeNode treeRoot = buildFPTree(transRecords, HeaderTable);
    // Return if the FP-Tree is empty
    if (treeRoot.getChildren() == null || treeRoot.getChildren().size() == 0)
        return;
    // Output each header-table item plus postPattern
    if (postPattern != null) {
        for (TreeNode header : HeaderTable) {
            String outStr = header.getName();
            int count = header.getCount();
            for (String ele : postPattern)
                outStr += "\t" + ele;
            context.write(new IntWritable(count), new Text(outStr));
        }
    }
    // Find the conditional pattern base of each header-table item, then recurse
    for (TreeNode header : HeaderTable) {
        // Extend the suffix pattern by one item
        List<String> newPostPattern = new LinkedList<String>();
        newPostPattern.add(header.getName());
        if (postPattern != null)
            newPostPattern.addAll(postPattern);
        // Collect header's conditional pattern base (CPB) into newTransRecords
        List<List<String>> newTransRecords = new LinkedList<List<String>>();
        TreeNode backnode = header.getNextHomonym();
        while (backnode != null) {
            int counter = backnode.getCount();
            List<String> prenodes = new ArrayList<String>();
            TreeNode parent = backnode;
            // Walk backnode's ancestors and collect them into prenodes
            while ((parent = parent.getParent()).getName() != null) {
                prenodes.add(parent.getName());
            }
            while (counter-- > 0) {
                newTransRecords.add(prenodes);
            }
            backnode = backnode.getNextHomonym();
        }
        // Recurse with the CPB as the new transaction database
        FPGrowth(newTransRecords, newPostPattern, context);
    }
}
When the FP-Tree has degenerated to a single path, there is no need to call FPGrowth recursively any more; we can simply output every combination of the nodes on the path, each followed by postPattern. For example, when the FP-Tree is:
we directly output:
3 A+postPattern
3 B+postPattern
3 A+B+postPattern
and we are done.
If instead we follow the approach in the code above, it first outputs:
3 A+postPattern
3 B+postPattern
then prepends B to postPattern and builds a new FP-Tree, which now contains only A, and outputs
3 A+(B+postPattern)
The two approaches give the same result, but rebuilding the FP-Tree costs more computation.
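This single-path shortcut is not implemented in the code above; a minimal sketch of it might look like the following (the class and method names are mine). It enumerates every non-empty subset of the path's nodes; the support of a subset is the count of its deepest chosen node, since counts never increase going down a path:

import java.util.*;

public class SinglePathCombos {
    // names/counts describe the single path from the root down;
    // postPattern is the current suffix pattern.
    static void outputCombos(List<String> names, List<Integer> counts,
                             List<String> postPattern) {
        int n = names.size();
        for (int mask = 1; mask < (1 << n); mask++) { // every non-empty subset
            int support = Integer.MAX_VALUE;
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    support = Math.min(support, counts.get(i));
                    sb.append(names.get(i)).append('\t');
                }
            }
            for (String ele : postPattern)
                sb.append(ele).append('\t');
            System.out.println(support + "\t" + sb.toString().trim());
        }
    }

    public static void main(String[] args) {
        // The A:3 -> B:3 example from the text, with an empty suffix pattern.
        outputCombos(Arrays.asList("A", "B"), Arrays.asList(3, 3),
                Collections.<String>emptyList());
        // prints: 3 A / 3 B / 3 A B
    }
}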
Java Implementation
FP-Tree node definition
package fptree;

import java.util.ArrayList;
import java.util.List;

public class TreeNode implements Comparable<TreeNode> {

    private String name; // item name
    private int count; // count
    private TreeNode parent; // parent node
    private List<TreeNode> children; // child nodes
    private TreeNode nextHomonym; // next node with the same name (header-table link)

    public TreeNode() {
    }

    public TreeNode(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public int getCount() {
        return count;
    }

    public void setCount(int count) {
        this.count = count;
    }

    public TreeNode getParent() {
        return parent;
    }

    public void setParent(TreeNode parent) {
        this.parent = parent;
    }

    public List<TreeNode> getChildren() {
        return children;
    }

    public void addChild(TreeNode child) {
        if (this.getChildren() == null) {
            List<TreeNode> list = new ArrayList<TreeNode>();
            list.add(child);
            this.setChildren(list);
        } else {
            this.getChildren().add(child);
        }
    }

    public TreeNode findChild(String name) {
        List<TreeNode> children = this.getChildren();
        if (children != null) {
            for (TreeNode child : children) {
                if (child.getName().equals(name)) {
                    return child;
                }
            }
        }
        return null;
    }

    public void setChildren(List<TreeNode> children) {
        this.children = children;
    }

    public void printChildrenName() {
        List<TreeNode> children = this.getChildren();
        if (children != null) {
            for (TreeNode child : children) {
                System.out.print(child.getName() + " ");
            }
        } else {
            System.out.print("null");
        }
    }

    public TreeNode getNextHomonym() {
        return nextHomonym;
    }

    public void setNextHomonym(TreeNode nextHomonym) {
        this.nextHomonym = nextHomonym;
    }

    public void countIncrement(int n) {
        this.count += n;
    }

    @Override
    public int compareTo(TreeNode arg0) {
        int count0 = arg0.getCount();
        // Reversed relative to the natural order, so sorting yields descending counts
        return count0 - this.count;
    }
}
Mining frequent patterns
package fptree;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

public class FPTree {

    private int minSuport;

    public int getMinSuport() {
        return minSuport;
    }

    public void setMinSuport(int minSuport) {
        this.minSuport = minSuport;
    }

    // Read transaction records from one or more files
    public List<List<String>> readTransRocords(String... filenames) {
        List<List<String>> transaction = null;
        if (filenames.length > 0) {
            transaction = new LinkedList<List<String>>();
            for (String filename : filenames) {
                try {
                    FileReader fr = new FileReader(filename);
                    BufferedReader br = new BufferedReader(fr);
                    try {
                        String line;
                        List<String> record;
                        while ((line = br.readLine()) != null) {
                            if (line.trim().length() > 0) {
                                String[] str = line.split(",");
                                record = new LinkedList<String>();
                                for (String w : str)
                                    record.add(w);
                                transaction.add(record);
                            }
                        }
                    } finally {
                        br.close();
                    }
                } catch (IOException ex) {
                    System.out.println("Read transaction records failed."
                            + ex.getMessage());
                    System.exit(1);
                }
            }
        }
        return transaction;
    }

    // The FP-Growth algorithm
    public void FPGrowth(List<List<String>> transRecords,
            List<String> postPattern) {
        // Build the header table, which is also the frequent 1-itemset
        ArrayList<TreeNode> HeaderTable = buildHeaderTable(transRecords);
        // Build the FP-Tree
        TreeNode treeRoot = buildFPTree(transRecords, HeaderTable);
        // Return if the FP-Tree is empty
        if (treeRoot.getChildren() == null || treeRoot.getChildren().size() == 0)
            return;
        // Output each header-table item plus postPattern
        if (postPattern != null) {
            for (TreeNode header : HeaderTable) {
                System.out.print(header.getCount() + "\t" + header.getName());
                for (String ele : postPattern)
                    System.out.print("\t" + ele);
                System.out.println();
            }
        }
        // Find the conditional pattern base of each header-table item, then recurse
        for (TreeNode header : HeaderTable) {
            // Extend the suffix pattern by one item
            List<String> newPostPattern = new LinkedList<String>();
            newPostPattern.add(header.getName());
            if (postPattern != null)
                newPostPattern.addAll(postPattern);
            // Collect header's conditional pattern base (CPB) into newTransRecords
            List<List<String>> newTransRecords = new LinkedList<List<String>>();
            TreeNode backnode = header.getNextHomonym();
            while (backnode != null) {
                int counter = backnode.getCount();
                List<String> prenodes = new ArrayList<String>();
                TreeNode parent = backnode;
                // Walk backnode's ancestors and collect them into prenodes
                while ((parent = parent.getParent()).getName() != null) {
                    prenodes.add(parent.getName());
                }
                while (counter-- > 0) {
                    newTransRecords.add(prenodes);
                }
                backnode = backnode.getNextHomonym();
            }
            // Recurse with the CPB as the new transaction database
            FPGrowth(newTransRecords, newPostPattern);
        }
    }

    // Build the header table, which is also the frequent 1-itemset
    public ArrayList<TreeNode> buildHeaderTable(List<List<String>> transRecords) {
        ArrayList<TreeNode> F1 = null;
        if (transRecords.size() > 0) {
            F1 = new ArrayList<TreeNode>();
            Map<String, TreeNode> map = new HashMap<String, TreeNode>();
            // Count the support of each item in the transaction database
            for (List<String> record : transRecords) {
                for (String item : record) {
                    if (!map.keySet().contains(item)) {
                        TreeNode node = new TreeNode(item);
                        node.setCount(1);
                        map.put(item, node);
                    } else {
                        map.get(item).countIncrement(1);
                    }
                }
            }
            // Add items whose support count is at least minSup to F1
            Set<String> names = map.keySet();
            for (String name : names) {
                TreeNode tnode = map.get(name);
                if (tnode.getCount() >= minSuport) {
                    F1.add(tnode);
                }
            }
            Collections.sort(F1);
            return F1;
        } else {
            return null;
        }
    }

    // Build the FP-Tree
    public TreeNode buildFPTree(List<List<String>> transRecords,
            ArrayList<TreeNode> F1) {
        TreeNode root = new TreeNode(); // create the root of the tree
        for (List<String> transRecord : transRecords) {
            LinkedList<String> record = sortByF1(transRecord, F1);
            TreeNode subTreeRoot = root;
            TreeNode tmpRoot = null;
            if (root.getChildren() != null) {
                // Follow the longest prefix that already exists in the tree
                while (!record.isEmpty()
                        && (tmpRoot = subTreeRoot.findChild(record.peek())) != null) {
                    tmpRoot.countIncrement(1);
                    subTreeRoot = tmpRoot;
                    record.poll();
                }
            }
            addNodes(subTreeRoot, record, F1);
        }
        return root;
    }

    // Sort the items of a transaction record in descending order of frequency
    public LinkedList<String> sortByF1(List<String> transRecord,
            ArrayList<TreeNode> F1) {
        Map<String, Integer> map = new HashMap<String, Integer>();
        for (String item : transRecord) {
            // F1 is already in descending order, so an item's index in F1 is its rank
            for (int i = 0; i < F1.size(); i++) {
                TreeNode tnode = F1.get(i);
                if (tnode.getName().equals(item)) {
                    map.put(item, i);
                }
            }
        }
        ArrayList<Entry<String, Integer>> al = new ArrayList<Entry<String, Integer>>(
                map.entrySet());
        Collections.sort(al, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Entry<String, Integer> arg0,
                    Entry<String, Integer> arg1) {
                // Ascending by F1 index, i.e. descending by frequency
                return arg0.getValue() - arg1.getValue();
            }
        });
        LinkedList<String> rest = new LinkedList<String>();
        for (Entry<String, Integer> entry : al) {
            rest.add(entry.getKey());
        }
        return rest;
    }

    // Insert record into the tree as descendants of ancestor
    public void addNodes(TreeNode ancestor, LinkedList<String> record,
            ArrayList<TreeNode> F1) {
        if (record.size() > 0) {
            while (record.size() > 0) {
                String item = record.poll();
                TreeNode leafnode = new TreeNode(item);
                leafnode.setCount(1);
                leafnode.setParent(ancestor);
                ancestor.addChild(leafnode);
                // Append leafnode to the end of the header-table link for this item
                for (TreeNode f1 : F1) {
                    if (f1.getName().equals(item)) {
                        while (f1.getNextHomonym() != null) {
                            f1 = f1.getNextHomonym();
                        }
                        f1.setNextHomonym(leafnode);
                        break;
                    }
                }
                addNodes(leafnode, record, F1);
            }
        }
    }

    public static void main(String[] args) {
        FPTree fptree = new FPTree();
        fptree.setMinSuport(3);
        List<List<String>> transRecords = fptree
                .readTransRocords("/home/orisun/test/market");
        fptree.FPGrowth(transRecords, null);
    }
}
Input file
milk,eggs,bread,chips
eggs,popcorn,chips,beer
eggs,bread,chips
milk,eggs,bread,popcorn,chips,beer
milk,bread,beer
eggs,bread,beer
milk,bread,chips
milk,eggs,bread,butter,chips
milk,eggs,butter,chips
Output
6 chips eggs
5 chips bread
5 eggs bread
4 chips eggs bread
5 chips milk
5 bread milk
4 eggs milk
4 chips bread milk
4 chips eggs milk
3 bread eggs milk
3 chips bread eggs milk
3 eggs beer
3 bread beer
Implementing it with Hadoop
In the code above we pass the entire transaction database to FPGrowth as a List<List<String>>. In practice this is not feasible, since memory cannot hold the whole transaction database; we may instead need to read records one at a time from a relational database to build the FP-Tree. In any case the FP-Tree itself must live in memory. What if even that does not fit? And FPGrowth is still very time-consuming; how do we speed it up? The answer: divide and conquer, in parallel.
We split the original transaction database into N parts, run FPGrowth on N nodes in parallel, and finally merge the resulting association rules. The key question is how to "partition" the data without missing a single rule; see the blog post on this. To enable parallel computation, a "redundant" partitioning is used here, meaning the union of the parts is larger than the original set. The rules mined this way are also redundant: for example, node 1 may produce the rule (6: beer, diapers) while node 2 produces (3: diapers, beer); the rule from node 2 is clearly redundant, and a follow-up step is needed to remove such duplicates.
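The MapReduce code below removes these duplicates with an InverseMapper/MaxReducer pair. As a plain-Java illustration of the same idea (the names here are mine, for illustration only), you can key every mined pattern by its sorted item list and keep only the maximum support seen for that key:

import java.util.*;

public class DedupPatterns {
    // Merge patterns mined on different nodes: key each pattern by its sorted
    // item list and keep only the maximum support observed for that key.
    static Map<String, Integer> dedup(List<Map.Entry<Integer, List<String>>> mined) {
        Map<String, Integer> best = new HashMap<String, Integer>();
        for (Map.Entry<Integer, List<String>> e : mined) {
            List<String> items = new ArrayList<String>(e.getValue());
            Collections.sort(items); // item order must not matter
            String key = items.toString();
            Integer cur = best.get(key);
            if (cur == null || e.getKey() > cur)
                best.put(key, e.getKey());
        }
        return best;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, List<String>>> mined =
                new ArrayList<Map.Entry<Integer, List<String>>>();
        // (6: beer, diapers) from node 1 and the redundant (3: diapers, beer) from node 2
        mined.add(new AbstractMap.SimpleEntry<Integer, List<String>>(
                6, Arrays.asList("beer", "diapers")));
        mined.add(new AbstractMap.SimpleEntry<Integer, List<String>>(
                3, Arrays.asList("diapers", "beer")));
        System.out.println(dedup(mined)); // {[beer, diapers]=6}
    }
}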
Code:
Record.java
package fptree;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Collections;
import java.util.LinkedList;

import org.apache.hadoop.io.WritableComparable;

public class Record implements WritableComparable<Record> {

    LinkedList<String> list;

    public Record() {
        list = new LinkedList<String>();
    }

    public Record(String[] arr) {
        list = new LinkedList<String>();
        for (int i = 0; i < arr.length; i++)
            list.add(arr[i]);
    }

    @Override
    public String toString() {
        String str = list.get(0);
        for (int i = 1; i < list.size(); i++)
            str += "\t" + list.get(i);
        return str;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        list.clear();
        String line = in.readUTF();
        String[] arr = line.split("\\s+");
        for (int i = 0; i < arr.length; i++)
            list.add(arr[i]);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(this.toString());
    }

    @Override
    public int compareTo(Record obj) {
        // Sort both item lists first, so that records containing the same
        // items in a different order compare as equal
        Collections.sort(list);
        Collections.sort(obj.list);
        return this.toString().compareTo(obj.toString());
    }
}
DC_FPTree.java
package fptree;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DC_FPTree extends Configured implements Tool {

    private static final int GroupNum = 10;
    private static final int minSuport = 6;

    public static class GroupMapper extends
            Mapper<LongWritable, Text, IntWritable, Record> {
        List<String> freq = new LinkedList<String>(); // frequent 1-itemset
        List<List<String>> freq_group = new LinkedList<List<String>>(); // frequent 1-itemset after grouping

        @Override
        public void setup(Context context) throws IOException {
            // Read the frequent 1-itemset from a file
            FileSystem fs = FileSystem.get(context.getConfiguration());
            Path freqFile = new Path("/user/orisun/input/F1");
            FSDataInputStream in = fs.open(freqFile);
            InputStreamReader isr = new InputStreamReader(in);
            BufferedReader br = new BufferedReader(isr);
            try {
                String line;
                while ((line = br.readLine()) != null) {
                    String[] str = line.split("\\s+");
                    String word = str[0];
                    freq.add(word);
                }
            } finally {
                br.close();
            }
            // Partition the frequent 1-itemset into groups
            Collections.shuffle(freq); // randomize the order
            int cap = freq.size() / GroupNum; // base number of items per group
            for (int i = 0; i < GroupNum; i++) {
                List<String> list = new LinkedList<String>();
                for (int j = 0; j < cap; j++) {
                    list.add(freq.get(i * cap + j));
                }
                freq_group.add(list);
            }
            // Spread the remaining items over the first groups
            int remainder = freq.size() % GroupNum;
            int base = GroupNum * cap;
            for (int i = 0; i < remainder; i++) {
                freq_group.get(i).add(freq.get(base + i));
            }
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] arr = value.toString().split("\\s+");
            Record record = new Record(arr);
            LinkedList<String> list = record.list;
            BitSet bs = new BitSet(freq_group.size());
            bs.clear();
            while (record.list.size() > 0) {
                String item = list.peekLast(); // take the last item of the record
                int i = 0;
                for (; i < freq_group.size(); i++) {
                    if (bs.get(i))
                        continue;
                    if (freq_group.get(i).contains(item)) {
                        bs.set(i);
                        break;
                    }
                }
                if (i < freq_group.size()) { // found: emit the current prefix to group i
                    context.write(new IntWritable(i), record);
                }
                record.list.pollLast();
            }
        }
    }

    public static class FPReducer extends Reducer<IntWritable, Record, IntWritable, Text> {
        public void reduce(IntWritable key, Iterable<Record> values, Context context)
                throws IOException, InterruptedException {
            List<List<String>> trans = new LinkedList<List<String>>();
            // Hadoop reuses the value object, so copy each record's items out
            for (Record record : values) {
                LinkedList<String> list = new LinkedList<String>();
                for (String ele : record.list)
                    list.add(ele);
                trans.add(list);
            }
            FPGrowth(trans, null, context);
        }

        // The FP-Growth algorithm
        public void FPGrowth(List<List<String>> transRecords,
                List<String> postPattern, Context context) throws IOException, InterruptedException {
            // Build the header table, which is also the frequent 1-itemset
            ArrayList<TreeNode> HeaderTable = buildHeaderTable(transRecords);
            // Build the FP-Tree
            TreeNode treeRoot = buildFPTree(transRecords, HeaderTable);
            // Return if the FP-Tree is empty
            if (treeRoot.getChildren() == null || treeRoot.getChildren().size() == 0)
                return;
            // Output each header-table item plus postPattern
            if (postPattern != null) {
                for (TreeNode header : HeaderTable) {
                    String outStr = header.getName();
                    int count = header.getCount();
                    for (String ele : postPattern)
                        outStr += "\t" + ele;
                    context.write(new IntWritable(count), new Text(outStr));
                }
            }
            // Find the conditional pattern base of each header-table item, then recurse
            for (TreeNode header : HeaderTable) {
                // Extend the suffix pattern by one item
                List<String> newPostPattern = new LinkedList<String>();
                newPostPattern.add(header.getName());
                if (postPattern != null)
                    newPostPattern.addAll(postPattern);
                // Collect header's conditional pattern base (CPB) into newTransRecords
                List<List<String>> newTransRecords = new LinkedList<List<String>>();
                TreeNode backnode = header.getNextHomonym();
                while (backnode != null) {
                    int counter = backnode.getCount();
                    List<String> prenodes = new ArrayList<String>();
                    TreeNode parent = backnode;
                    // Walk backnode's ancestors and collect them into prenodes
                    while ((parent = parent.getParent()).getName() != null) {
                        prenodes.add(parent.getName());
                    }
                    while (counter-- > 0) {
                        newTransRecords.add(prenodes);
                    }
                    backnode = backnode.getNextHomonym();
                }
                // Recurse with the CPB as the new transaction database
                FPGrowth(newTransRecords, newPostPattern, context);
            }
        }

        // Build the header table, which is also the frequent 1-itemset
        public ArrayList<TreeNode> buildHeaderTable(List<List<String>> transRecords) {
            ArrayList<TreeNode> F1 = null;
            if (transRecords.size() > 0) {
                F1 = new ArrayList<TreeNode>();
                Map<String, TreeNode> map = new HashMap<String, TreeNode>();
                // Count the support of each item in the transaction database
                for (List<String> record : transRecords) {
                    for (String item : record) {
                        if (!map.keySet().contains(item)) {
                            TreeNode node = new TreeNode(item);
                            node.setCount(1);
                            map.put(item, node);
                        } else {
                            map.get(item).countIncrement(1);
                        }
                    }
                }
                // Add items whose support count is at least minSup to F1
                Set<String> names = map.keySet();
                for (String name : names) {
                    TreeNode tnode = map.get(name);
                    if (tnode.getCount() >= minSuport) {
                        F1.add(tnode);
                    }
                }
                Collections.sort(F1);
                return F1;
            } else {
                return null;
            }
        }

        // Build the FP-Tree
        public TreeNode buildFPTree(List<List<String>> transRecords,
                ArrayList<TreeNode> F1) {
            TreeNode root = new TreeNode(); // create the root of the tree
            for (List<String> transRecord : transRecords) {
                LinkedList<String> record = sortByF1(transRecord, F1);
                TreeNode subTreeRoot = root;
                TreeNode tmpRoot = null;
                if (root.getChildren() != null) {
                    // Follow the longest prefix that already exists in the tree
                    while (!record.isEmpty()
                            && (tmpRoot = subTreeRoot.findChild(record.peek())) != null) {
                        tmpRoot.countIncrement(1);
                        subTreeRoot = tmpRoot;
                        record.poll();
                    }
                }
                addNodes(subTreeRoot, record, F1);
            }
            return root;
        }

        // Sort the items of a transaction record in descending order of frequency
        public LinkedList<String> sortByF1(List<String> transRecord,
                ArrayList<TreeNode> F1) {
            Map<String, Integer> map = new HashMap<String, Integer>();
            for (String item : transRecord) {
                // F1 is already in descending order, so an item's index in F1 is its rank
                for (int i = 0; i < F1.size(); i++) {
                    TreeNode tnode = F1.get(i);
                    if (tnode.getName().equals(item)) {
                        map.put(item, i);
                    }
                }
            }
            ArrayList<Entry<String, Integer>> al = new ArrayList<Entry<String, Integer>>(
                    map.entrySet());
            Collections.sort(al, new Comparator<Map.Entry<String, Integer>>() {
                @Override
                public int compare(Entry<String, Integer> arg0,
                        Entry<String, Integer> arg1) {
                    // Ascending by F1 index, i.e. descending by frequency
                    return arg0.getValue() - arg1.getValue();
                }
            });
            LinkedList<String> rest = new LinkedList<String>();
            for (Entry<String, Integer> entry : al) {
                rest.add(entry.getKey());
            }
            return rest;
        }

        // Insert record into the tree as descendants of ancestor
        public void addNodes(TreeNode ancestor, LinkedList<String> record,
                ArrayList<TreeNode> F1) {
            if (record.size() > 0) {
                while (record.size() > 0) {
                    String item = record.poll();
                    TreeNode leafnode = new TreeNode(item);
                    leafnode.setCount(1);
                    leafnode.setParent(ancestor);
                    ancestor.addChild(leafnode);
                    // Append leafnode to the end of the header-table link for this item
                    for (TreeNode f1 : F1) {
                        if (f1.getName().equals(item)) {
                            while (f1.getNextHomonym() != null) {
                                f1 = f1.getNextHomonym();
                            }
                            f1.setNextHomonym(leafnode);
                            break;
                        }
                    }
                    addNodes(leafnode, record, F1);
                }
            }
        }
    }

    public static class InverseMapper extends
            Mapper<LongWritable, Text, Record, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] arr = value.toString().split("\\s+");
            int count = Integer.parseInt(arr[0]);
            Record record = new Record();
            for (int i = 1; i < arr.length; i++) {
                record.list.add(arr[i]);
            }
            context.write(record, new IntWritable(count));
        }
    }

    public static class MaxReducer extends Reducer<Record, IntWritable, IntWritable, Record> {
        public void reduce(Record key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Keep only the maximum support seen for this itemset
            int max = -1;
            for (IntWritable value : values) {
                int i = value.get();
                if (i > max)
                    max = i;
            }
            context.write(new IntWritable(max), key);
        }
    }

    @Override
    public int run(String[] arg0) throws Exception {
        Configuration conf = getConf();
        conf.set("mapred.task.timeout", "6000000");
        // Job 1: group the transactions and mine each group with FP-Growth
        Job job = new Job(conf);
        job.setJarByClass(DC_FPTree.class);
        FileSystem fs = FileSystem.get(getConf());
        FileInputFormat.setInputPaths(job, "/user/orisun/input/data");
        Path outDir = new Path("/user/orisun/output");
        fs.delete(outDir, true);
        FileOutputFormat.setOutputPath(job, outDir);
        job.setMapperClass(GroupMapper.class);
        job.setReducerClass(FPReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Record.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        boolean success = job.waitForCompletion(true);

        // Job 2: remove redundant patterns, keeping the maximum support per itemset
        job = new Job(conf);
        job.setJarByClass(DC_FPTree.class);
        FileInputFormat.setInputPaths(job, "/user/orisun/output/part-r-*");
        Path outDir2 = new Path("/user/orisun/output2");
        fs.delete(outDir2, true);
        FileOutputFormat.setOutputPath(job, outDir2);
        job.setMapperClass(InverseMapper.class);
        job.setReducerClass(MaxReducer.class);
        //job.setNumReduceTasks(0);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapOutputKeyClass(Record.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Record.class);
        success = success && job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new DC_FPTree(), args);
        System.exit(res);
    }
}
Closing Remarks
In practice, association rule mining may not be as useful as people expect. On the one hand, the support-confidence framework produces far too many rules, and not every rule is useful. On the other hand, most association rules are not as universal as the classic "beer and diapers" story suggests. Association analysis takes skill, and sometimes more rigorous statistical methods are needed to control the proliferation of rules.