Running k-means in Mahout and Viewing the Results

2020-06-28 02:37

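First, download the synthetic_control.data dataset and upload it to the distributed file system.
Use the hadoop command to create a testdata directory:
$HADOOP_HOME/bin/hadoop fs -mkdir testdata
Then put the file into that directory: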

 $HADOOP_HOME/bin/hadoop fs -put  synthetic_control.data testdata

Run the k-means example that ships with Mahout
($HADOOP_HOME is the Hadoop installation directory):

$HADOOP_HOME/bin/hadoop jar /home/hadoop/mahout-distribution-0.4/mahout-examples-0.4-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

View the results
The k-means output is written to the output directory:

hadoop fs -du -h output
194      582      output/_policy
354.8 K  1.0 M    output/clusteredPoints
11.3 K   33.8 K   output/clusters-0
8.8 K    26.3 K   output/clusters-1
8.8 K    26.3 K   output/clusters-10-final
8.8 K    26.3 K   output/clusters-2
8.8 K    26.3 K   output/clusters-3
8.8 K    26.3 K   output/clusters-4
8.8 K    26.3 K   output/clusters-5
8.8 K    26.3 K   output/clusters-6
8.8 K    26.3 K   output/clusters-7
8.8 K    26.3 K   output/clusters-8
8.8 K    26.3 K   output/clusters-9
327.6 K  982.8 K  output/data
7.6 K    22.8 K   output/random-seeds
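
Because the listing is sorted as text, clusters-10-final sits between clusters-1 and clusters-2 and is easy to miss. As a minimal sketch, the final iteration directory can be picked out of this kind of listing with awk. The function name final_clusters_dir is made up for illustration, and the sample lines are copied from the listing above rather than read from a live cluster.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop or Mahout): read
# `hadoop fs -du -h <dir>` style lines on stdin and print the path of the
# directory whose name ends in "-final".
final_clusters_dir() {
    # The path is the last whitespace-separated field on each line.
    awk '$NF ~ /clusters-[0-9]+-final$/ { print $NF }'
}

# Sample lines taken from the listing above. On a real cluster one would
# pipe the output of `hadoop fs -du -h output` into the function instead.
printf '%s\n' \
    '8.8 K    26.3 K   output/clusters-1' \
    '8.8 K    26.3 K   output/clusters-10-final' \
    '8.8 K    26.3 K   output/clusters-2' | final_clusters_dir
```

The sketch only matches on the directory name, so it makes no assumption about how many iterations were run.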

clusteredPoints: holds the final clustering result, showing both the cluster ID and the document ID. When read with mahout seqdumper, the key-value type is (IntWritable, WeightedVectorWritable).

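clusters-N: holds the result of the N-th clustering iteration, where n is the number of samples in a cluster, c is the cluster center (one value per attribute), and r is the cluster radius (one value per attribute). The key-value type of clusters-N is (Text, Cluster).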
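
To make n, c and r concrete, here is a small sketch that computes them for a handful of points with awk. It assumes that the radius of an attribute is the population standard deviation of that attribute over the cluster's points; this is my reading of how Mahout reports r, not something stated in this post, so treat it as an assumption. The function name cluster_stats and the sample points are made up for illustration.

```shell
#!/bin/sh
# Illustrative only, not Mahout code. Reads one point per line
# (whitespace-separated numbers) and prints:
#   n = number of points
#   c = per-attribute mean (the center)
#   r = per-attribute population standard deviation (ASSUMED meaning of radius)
cluster_stats() {
    awk '
    {
        n++
        dims = NF
        for (i = 1; i <= NF; i++) { s1[i] += $i; s2[i] += $i * $i }
    }
    END {
        if (n == 0) exit 1
        printf "n=%d c=[", n
        for (i = 1; i <= dims; i++)
            printf "%s%.3f", (i > 1 ? ", " : ""), s1[i] / n
        printf "] r=["
        for (i = 1; i <= dims; i++) {
            mean = s1[i] / n
            var = s2[i] / n - mean * mean
            if (var < 0) var = 0   # guard against rounding error
            printf "%s%.3f", (i > 1 ? ", " : ""), sqrt(var)
        }
        print "]"
    }'
}

# Two made-up 2-attribute points belonging to one cluster.
printf '%s\n' '1 4' '3 4' | cluster_stats
# prints: n=2 c=[2.000, 4.000] r=[1.000, 0.000]
```

The second attribute has radius 0 because both points share the same value there, while the first attribute spreads one unit either side of its center.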

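data: holds the original input data, stored as vectors. Files in this directory can be read with mahout vectordump; everything else can only be read with mahout seqdumper. The vector files can also be read with mahout seqdumper. The difference is that vectordump prints the numeric values without the corresponding key, whereas seqdumper shows the key (the corresponding URL) but prints the value as a class description rather than as an array of numbers.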

Write the clustering result to a file

 $HADOOP_HOME/bin/mahout seqdumper -i output/clusteredPoints -o mahout_kmeans.txt

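Note: -i is the input data and -o is the output; here the result is written to mahout_kmeans.txt, which can be opened and read directly.
The key is the row number of the cluster center, so the file can be used to inspect the clustered data.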
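
Once the dump is a text file, a quick way to check the clustering is to count how many points landed under each key. The sketch below assumes each record is printed on a single line of the form "Key: <cluster id>: Value: ..."; the exact layout depends on the Mahout version, so check your own mahout_kmeans.txt first. The function name count_points_per_cluster and the sample records are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count records per key in seqdumper-style text.
# ASSUMED line format: "Key: <cluster id>: Value: <weight>: [<vector>]"
count_points_per_cluster() {
    awk '
    /^Key: / {
        id = $2
        sub(/:$/, "", id)   # strip the colon that follows the key
        count[id]++
    }
    END { for (id in count) print id, count[id] }' | sort -n
}

# Made-up sample records; with a real dump one would run
#   count_points_per_cluster < mahout_kmeans.txt
printf '%s\n' \
    'Key: 48: Value: 1.0: [1.0, 2.0]' \
    'Key: 48: Value: 1.0: [1.5, 2.5]' \
    'Key: 95: Value: 1.0: [9.0, 8.0]' | count_points_per_cluster
```

Each output line is a cluster key followed by the number of points assigned to it, sorted by key.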
