Reduce with non-equal keys

Scenario
Suppose we need to compute the similarity of article titles; the algorithm is described at http://www.lanceyan.com/tech/arch/simhash_hamming_distance_similarity.html. The computed results look like this:

Title	Similarity value
A	13
B	14
C	15
D	16
......	......
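Now we need to deduplicate. The rule: titles whose similarity values differ by at most 3 in absolute value are considered the same title and can be merged into one group. The input is the table above, and the output looks like this: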

{'A_13': ['B_14', 'D_16', 'C_15', 'E_16']}
{'B_14': ['F_17', 'D_16', 'C_15', 'A_13', 'E_16']}
{'C_15': ['G_18', 'B_14', 'D_16', 'A_13', 'E_16', 'F_17']}
{'E_16': ['G_18', 'B_14', 'A_13', 'F_17', 'C_15', 'H_19']}
{'D_16': ['G_18', 'B_14', 'A_13', 'F_17', 'C_15', 'H_19']}
{'F_17': ['G_18', 'B_14', 'D_16', 'I_20', 'E_16', 'C_15', 'H_19']}
{'G_18': ['E_16', 'J_21', 'D_16', 'I_20', 'F_17', 'C_15', 'H_19']}
{'H_19': ['G_18', 'J_21', 'K_22', 'D_16', 'I_20', 'E_16', 'F_17']}
{'I_20': ['L_23', 'G_18', 'J_21', 'K_22', 'F_17', 'H_19']}
{'J_21': ['L_23', 'G_18', 'M_24', 'K_22', 'I_20', 'H_19']}
{'K_22': ['N_25', 'L_23', 'M_24', 'J_21', 'I_20', 'H_19']}
{'L_23': ['N_25', 'M_24', 'J_21', 'O_26', 'K_22', 'I_20']}
{'M_24': ['J_21', 'K_22', 'N_25', 'O_26', 'L_23']}
{'N_25': ['K_22', 'L_23', 'M_24', 'O_26']}
{'O_26': ['L_23', 'M_24', 'N_25']}
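For small inputs, the grouping rule can be written as a brute-force pairwise comparison, which is handy as a reference when checking a distributed implementation. This is only a sketch: the function name `group_similar` and the parameter `max_diff` are made up for illustration. It also assumes, as the listing above suggests (D_16 and E_16 do not list each other), that titles with exactly the same value are not reported as neighbours.

```python
def group_similar(records, max_diff=3):
    """Map each 'title_value' to the other records within max_diff of it.

    records: list of (title, value) pairs. O(n^2), so only for small inputs.
    """
    groups = {}
    for title, value in records:
        neighbours = [
            "%s_%d" % (other_title, other_value)
            for other_title, other_value in records
            # Same-valued titles are skipped to mirror the listing above.
            if 0 < abs(value - other_value) <= max_diff
        ]
        groups["%s_%d" % (title, value)] = sorted(neighbours)
    return groups


sample = [("A", 13), ("B", 14), ("C", 15), ("D", 16), ("E", 16), ("F", 17)]
for key, neighbours in group_similar(sample).items():
    print(key, neighbours)
```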
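Solution
The input is large, so the computation has to run on a distributed platform; we chose MapReduce. Between map and reduce, however, MapReduce only collects records with identical keys into one list, which is not what we need. We can work around this by expanding each input record in the map phase: one record becomes seven, whose keys are the record's hash value plus -3, -2, -1, 0, 1, 2 and 3. This guarantees that any two records whose values differ by at most 3 share a key and end up in the same list. It also introduces duplicates, which the reduce side has to handle. The algorithm is fairly simple, so here is the streaming MapReduce code.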

map:

     #!/usr/bin/env python
     # vim: set fileencoding=utf-8
     from __future__ import print_function  # same output on Python 2 and 3
     import sys

     def main(separator='\t'):
         for data in sys.stdin:
             _title, _hash = data.strip().split(separator)
             # Emit the record under 7 keys: hash-3 ... hash+3.
             for i in range(-3, 4):
                 # Flag '0' marks the record's own hash, '1' a neighbouring key.
                 flag = '0' if i == 0 else '1'
                 print('%d\t%s' % (int(_hash) + i, flag + '_' + _title + '_' + _hash))

     if __name__ == '__main__':
         main()

reduce:

     #!/usr/bin/env python
     # vim: set fileencoding=utf-8
     from __future__ import print_function  # same output on Python 2 and 3
     import sys
     from itertools import groupby
     from operator import itemgetter

     def read_from_mapper(file, separator):
         # Each mapper line is "key<TAB>flag_title_hash".
         for line in file:
             yield line.strip().split(separator, 2)

     def main(separator='\t'):
         data = read_from_mapper(sys.stdin, separator)
         # Streaming input arrives sorted by key, so groupby sees each key once.
         for key, group in groupby(data, itemgetter(0)):
             try:
                 _left = []   # flag '0': records whose own hash equals this key
                 _right = []  # flag '1': records whose hash is within 3 of this key
                 for k, value in group:
                     if value.startswith('0'):
                         _left.append(value)
                     else:
                         _right.append(value)

                 # Drop duplicates introduced by the expansion.
                 left_rs = list(set(_left))
                 right_rs = list(set(_right))

                 for l in left_rs:
                     _flag, _title, _hash = l.strip().split('_')
                     titles = []
                     for r in right_rs:
                         _flag_, _title_, _hash_ = r.strip().split('_')
                         titles.append(_title_ + '_' + _hash_)
                     print({_title + '_' + _hash: titles})

             except ValueError:
                 # Skip malformed groups (e.g. a title that itself contains '_').
                 pass

     if __name__ == '__main__':
         main()
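
Before submitting the job, the logic of the two scripts can be checked without Hadoop by imitating map, shuffle and reduce in memory. The sketch below is not the streaming code itself: `map_record`, `reduce_group` and `run_local` are hypothetical helper names, and a dictionary stands in for the shuffle.

```python
from collections import defaultdict


def map_record(title, value, max_diff=3):
    """Yield (key, (is_owner, title, value)) for value-max_diff .. value+max_diff."""
    for offset in range(-max_diff, max_diff + 1):
        yield value + offset, (offset == 0, title, value)


def reduce_group(entries):
    """For each owner of a key, list the non-owner records that share that key."""
    owners = {(t, v) for is_owner, t, v in entries if is_owner}
    others = {(t, v) for is_owner, t, v in entries if not is_owner}
    neighbours = sorted("%s_%d" % tv for tv in others)
    return {"%s_%d" % tv: neighbours for tv in owners}


def run_local(records, max_diff=3):
    shuffled = defaultdict(list)  # stands in for the shuffle: key -> entries
    for title, value in records:
        for key, entry in map_record(title, value, max_diff):
            shuffled[key].append(entry)
    result = {}
    for key in sorted(shuffled):
        result.update(reduce_group(shuffled[key]))
    return result


sample = [("A", 13), ("B", 14), ("C", 15), ("D", 16), ("E", 16), ("F", 17)]
for title, neighbours in sorted(run_local(sample).items()):
    print(title, neighbours)
```

For these sample rows, A_13 gets the neighbours B_14, C_15, D_16 and E_16, the same set as the first line of the expected output.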

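The code is simple enough that it needs little explanation; the key idea is expanding each record across neighbouring keys. A maximum difference of 3 requires 7 times the data. If the requirement changed to a difference of 100, the same scheme would need 201 times the data. When the dataset is already large, that puts a heavy load on the shuffle and reduce phases, and the job may never finish. We need a different approach.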
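
The growth is easy to quantify: each record is emitted once per offset from -d to d, that is 2d + 1 times. A small sketch (the function name `shuffle_records` and the record count of one million are illustrative, not taken from the job above):

```python
def shuffle_records(num_records, max_diff):
    """Number of map output records when every input is emitted under 2*d + 1 keys."""
    return num_records * (2 * max_diff + 1)


for d in (3, 100):
    print(d, 2 * d + 1, shuffle_records(1000000, d))
```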

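Solving the problem
The solution in the previous section is not optimal. It works when the expansion factor is small, but becomes impractical once the factor grows too large. Thinking about how MapReduce works at its core suggests another option: customise the map output key and override its compareTo method so that it returns 0 whenever abs(a - b) <= 3. Here is the code for the custom map key: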

package xxx.xxx.xx.xx.model.output.map;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class DistinctMapKey implements WritableComparable<DistinctMapKey> {
    private int hashV; // the computed similarity value


    @Override
    public int compareTo(DistinctMapKey o) {
        if (o == null) {
            return 1;
        }
        if (this == o) {
            return 0;
        }
        // Widen to long so the subtraction cannot overflow.
        long diff = (long) this.hashV - o.hashV;
        if (diff > 3 || diff < -3) {
            return diff > 0 ? 1 : -1;
        }
        return 0;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(this.hashV);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.hashV = in.readInt();
    }

    public int getHashV() {
        return hashV;
    }

    public void setHashV(int hashV) {
        this.hashV = hashV;
    }
}

This way no expansion is needed; only the compareTo method has to change.
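
One property of this comparison is worth checking before relying on it: "equal within 3" is not transitive, so 13 matches 16 and 16 matches 19 while 13 and 19 do not match. The Python sketch below mirrors the compareTo logic (`compare` and `group_adjacent` are hypothetical names). It also assumes that the framework forms reduce groups by comparing each sorted key with the one before it; that is an assumption about the runtime, not something shown in the code above.

```python
def compare(a, b, max_diff=3):
    """Mirror of compareTo: 0 when |a - b| <= max_diff, otherwise the sign of a - b."""
    diff = a - b
    if abs(diff) <= max_diff:
        return 0
    return 1 if diff > 0 else -1


def group_adjacent(values, max_diff=3):
    """Group sorted values, starting a new group when a value differs from its predecessor.

    Assumption: this imitates grouping by comparing neighbouring keys in sorted order.
    """
    groups = []
    for value in sorted(values):
        if groups and compare(groups[-1][-1], value, max_diff) == 0:
            groups[-1].append(value)
        else:
            groups.append([value])
    return groups


print(compare(13, 16), compare(16, 19), compare(13, 19))
print(group_adjacent([13, 14, 15, 16, 16, 17, 18, 19, 30]))
```

Under that assumption, a run of closely spaced values merges into a single chain: here 13 and 19 land in the same group even though they differ by 6, which is different from the one-window-per-title output of the expansion approach.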

Update
Updated 2015-12-15 21:03
The algorithm from the second section (the key-expansion solution), implemented in Spark:

package com.xxx.xxx.spark.learning.help

import org.apache.spark.{SparkContext, SparkConf}

import scala.collection.mutable.{ArrayBuffer, HashSet, HashMap}

/**
 * Created by xxx on 12/14/15.
 */
object RangeKeyReduce {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("hebei")
      .set("spark.executor.memory", "4g")
    val sc = new SparkContext(conf)

    val data = sc.textFile("/user/xxx/xxx/range_key/input/data")

    val groups = data.flatMap { line =>
      val detail = line.split("\t")
      val title = detail(0)
      val hash = detail(1)
      for (i <- -3 to 3) yield {
        val flag = if (i == 0) 0 else 1
        (hash.toLong + i, flag + "_" + title + "_" + hash)
      }
    }.groupByKey()

    val res = groups.map { line =>
      val left: HashSet[String] = new HashSet[String]()
      val right: HashSet[String] = new HashSet[String]()

      line._2.foreach { value =>
        if (value.startsWith("0")) left += value
        else right += value
      }
      val rs: HashMap[String, ArrayBuffer[String]] = new HashMap[String, ArrayBuffer[String]]()
      if (left.size > 0) {
        left.foreach { l =>
          val Array(flag, title, hash) = l.split("_")
          val titles: ArrayBuffer[String] = new ArrayBuffer[String]()
          right.foreach { r =>
            val Array(_flag, _title, _hash) = r.split("_")
            titles += (_title + "_" + _hash)
          }
          rs.put(title + "_" + hash, titles)
        }
        rs
      } else {
        new HashMap[String, ArrayBuffer[String]]()
      }
    }.filter(_.size > 0).flatMap { l =>
      for (k <- l.keys) yield {
        // l(k) is the ArrayBuffer itself; l.get(k) would wrap it in an Option and print "ArrayBuffer(...)".
        k + "\t" + l(k).mkString(";")
      }
    }

    res.saveAsTextFile("/user/xxx/xxx/range_key/output")
  }
}
