FP-tree的hadoop实现

FP 树是关联规则算法的一种，主要是用于分析数据项之间的关联性，将关联性大的数据项找出来，具体的一些概念见书《数据挖掘概念与技术》上介绍的频繁项集，支持度等。

算法执行过程：

1. 扫描数据，计算一项集的计数。

2. 根据计数与支持度计算出频繁一项集，对于频繁一项集按照计数从大到小进行排序，并且对它们标上相应的序号后，把它们存放在 DFS 上，后面在做 MAP 或者 REDUCE 之前到 dfs 上读取相应的项集和序号。

3. 根据划分集合的数目将频繁一项集划分成 G 份，并且对每一份有个标号 GID ，把一项集映射到相应的 GID 上，同样把这个 G-List 存放到 dfs 上，以后需要读取。

4. 再次扫描事务数据，将事务项集转换成项集的序号集合，并且对其进行排序，再生成相应的条件事务序号集合。将其根据 GID 收集，再对每个 GID 构造 FP 树，然后得出条件模式基和条件 FP 树，再得出最大的 K 个频繁模式。

5. 将所有项集的频繁模式收集起来，对于每个项生成最大的 K 个频繁模式。

步骤代码如下：

startParallelCounting (params);

startGroupingItems (params);

startTransactionSorting (params);

startParallelFPGrowth (params);

startAggregating (params);

对于每个项，计算它的计数。并且排序后保存到相应的分布式文件中。

在构选 FP 树时首先添加表头的项以及相应的计数。

FPTree tree = new FPTree(featureSetSize);

for ( int i = 0; i < featureSetSize; i++) {

tree.addHeaderCount(i, attributeFrequency[i]);

}

然后对于每个事务项集，将其添加到 FP 树上，即初始化 FP 树。

while (transactions.hasNext()) {

int [] transaction = transactions.next();

Arrays.sort (transaction);

//attribcount += transaction.length;

nodecount += treeAddCount (tree, transaction, 1, minSupportMutable,

attributeFrequency );

i++;

if (i % 10000 == 0) {

log .info( "FPTree Building: Read {} Transactions" , i);

}

事务项添加过程：

然后从根结点开始，判断相应的每一个 attribute 是否在 Temp 结点中的子结点中，如果在，则将其计数加 1 。如果不在，则创建一个新的子结点。

for ( int attribute : myList) {

if (attributeFrequency[attribute] < minSupport.intValue())

return ret;

int child;

if (addCountMode) {

child = tree.childWithAttribute(temp, attribute);

if (child == -1) {

addCountMode = false ;

} else {

tree.addCount(child, addCount );

temp = child;

}

if (!addCountMode) {

child = tree.createNode(temp, attribute, addCount);

temp = child;

ret++;

}

最后，递归对 FP 树进行挖掘，从表头的最后一项开始，得到相应的前缀路径，形成相应的条件模式基，计算出相应的条件 FP 树，得到最后的频繁模式，这里只输出计数最大的 K 个频繁模式。参见书《数据挖掘的概念与技术》中 158 页。

for ( int i = tree .getHeaderTableCount() - 1; i >= 0; i--) {

int attribute = tree .getAttributeAtIndex(i);

if (requiredFeatures.contains(attribute) == false )

continue ;

log .info( "Mining FTree Tree for all patterns with {}" , attribute);

MutableLong minSupport = new MutableLong(minSupportValue);

FrequentPatternMaxHeap frequentPatterns = growth(tree , minSupport, K,

treeCache, 0, attribute);

Patterns.put(attribute, frequentPatterns);

outputCollector.collect(attribute, frequentPatterns);

minSupportValue = Math.max (minSupportValue, minSupport.intValue() / 2);

log .info( "Found {} Patterns with Least Support {}" , Patterns.get(

attribute).count(), Patterns.get(attribute).leastSupport());

}

对于每个项，总结它的所有的频繁项集以及相应的计数，并且输出计数最大的 K 个频繁模式。

例子：

实例测试：

利用 HADOOP 的命令先将数据集从本地放入 HDFS 上。

如：

bin/hadoop fs -put E:/cygwin/home/lianhui/datasets datasets

然后执行相应的数据处理类

bin/hadoop fs -copyFromLocal E:/cygwin/home/lianhui/datasets/mushroom.dat datasets

bin/hadoop jar mahout-examples-0.3.job org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver -i datasets/mushroom.dat -o patterns -k 50 -method mapreduce -g 10 -regex [/ ] -s 2

查看最终的关联结果：

即读取分布式文件 : patterns / frequentPatterns 文件中的内容。

try {

Configuration conf = new Configuration();

Path path = new Path( "hdfs://localhost:9999/user/lianhui/patterns/frequentPatterns/part-r-00000" );

FileSystem fs = FileSystem.get (path.toUri(), conf);

//fs = path.getFileSystem(conf);

SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);

// Text key=new Text();

// TopKStringPatterns patterns = new TopKStringPatterns();

Writable key=(Writable) ReflectionUtils.newInstance (reader.getKeyClass(),conf);

Writable value=(Writable) ReflectionUtils.newInstance (reader.getValueClass(),conf);

while (reader.next(key,value)){

System. out .println( "key=" +key+ ",value=" +value);

long position=reader.getPosition();

//System.out.println("position="+position);

}

reader.close();

FP-tree的hadoop实现

剛剛寫了一些關於找工作的事情，突然間有BUG。

FP-tree的hadoop實現

網易面試歸來

mapreduce的總結

花了幾天時間，把hadoop的基本平臺在機器上運行起來了。

Mac下配置sublime實現LaTeX

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結