Reduce with Non-Equal Keys

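Scenario
Suppose we need to compute the similarity of article titles; the algorithm is described at http://www.lanceyan.com/tech/arch/simhash_hamming_distance_similarity.html. Suppose the computation gives the following results: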

Title	Similarity value
A	13
B	14
C	15
D	16
......	......
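Now we need to deduplicate. The rule is that titles whose similarity values differ by at most 3 in absolute value (3 included) are considered the same title and can be merged into one group. The input to the code is the table above, and the output looks like this: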

{'A_13': ['B_14', 'D_16', 'C_15', 'E_16']}
{'B_14': ['F_17', 'D_16', 'C_15', 'A_13', 'E_16']}
{'C_15': ['G_18', 'B_14', 'D_16', 'A_13', 'E_16', 'F_17']}
{'E_16': ['G_18', 'B_14', 'A_13', 'F_17', 'C_15', 'H_19']}
{'D_16': ['G_18', 'B_14', 'A_13', 'F_17', 'C_15', 'H_19']}
{'F_17': ['G_18', 'B_14', 'D_16', 'I_20', 'E_16', 'C_15', 'H_19']}
{'G_18': ['E_16', 'J_21', 'D_16', 'I_20', 'F_17', 'C_15', 'H_19']}
{'H_19': ['G_18', 'J_21', 'K_22', 'D_16', 'I_20', 'E_16', 'F_17']}
{'I_20': ['L_23', 'G_18', 'J_21', 'K_22', 'F_17', 'H_19']}
{'J_21': ['L_23', 'G_18', 'M_24', 'K_22', 'I_20', 'H_19']}
{'K_22': ['N_25', 'L_23', 'M_24', 'J_21', 'I_20', 'H_19']}
{'L_23': ['N_25', 'M_24', 'J_21', 'O_26', 'K_22', 'I_20']}
{'M_24': ['J_21', 'K_22', 'N_25', 'O_26', 'L_23']}
{'N_25': ['K_22', 'L_23', 'M_24', 'O_26']}
{'O_26': ['L_23', 'M_24', 'N_25']}
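As a reference point for small inputs, the rule can be written as a brute-force pairwise check. This is only a sketch: `group_similar` and `sample` are hypothetical names, and the data mirrors the first few rows of the listing above. Titles with equal values (such as D_16 and E_16) also satisfy the rule as stated, so this sketch lists them as neighbours of each other, which the listing above does not.

```python
from collections import defaultdict


def group_similar(records, max_diff=3):
    """Brute-force O(n^2) reference: map each 'title_value' to its neighbours."""
    groups = defaultdict(list)
    for title, value in records:
        for other_title, other_value in records:
            if title == other_title:
                continue  # a title is not listed as its own neighbour
            if abs(value - other_value) <= max_diff:
                groups['%s_%d' % (title, value)].append(
                    '%s_%d' % (other_title, other_value))
    return dict(groups)


# Hypothetical sample mirroring the first rows of the listing above.
sample = [('A', 13), ('B', 14), ('C', 15), ('D', 16), ('E', 16), ('F', 17)]
result = group_similar(sample)
print(result['A_13'])  # ['B_14', 'C_15', 'D_16', 'E_16']
```

The quadratic cost is what makes this unusable on large inputs, which is why the rest of the post turns to a distributed approach.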
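Solution
Because the input is large, we have to use a distributed computing platform, and we choose MapReduce. Between map and reduce, however, MapReduce collects the values that share the same key into one list, which does not match our requirement. We can work around this by expanding each input record in the map step: one record becomes seven, keyed by the record's hash value plus -3, -2, -1, 0, 1, 2 and 3. This guarantees that records whose values differ by at most 3 share a key and end up in the same list. It also introduces duplicate data, which the reduce step has to handle. The algorithm is fairly simple, so here is the streaming MR code directly.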

map:

     #!/usr/bin/env python
     # vim: set fileencoding=utf-8
     import sys


     def main(separator='\t'):
         for data in sys.stdin:
             # Each input line is "<title>\t<similarity value>".
             _title, _hash = data.strip().split(separator)
             # Emit one copy per key in [hash - 3, hash + 3].
             for i in range(-3, 4):
                 # Flag 0 marks the record's own key; flag 1 marks a neighbouring key.
                 flag = '0' if i == 0 else '1'
                 print('%d\t%s' % (int(_hash) + i, flag + '_' + _title + '_' + _hash))


     if __name__ == '__main__':
         main()

reduce:

     #!/usr/bin/env python
     # vim: set fileencoding=utf-8
     import sys
     from itertools import groupby
     from operator import itemgetter


     def read_from_mapper(file, separator):
         # Yield [key, value] pairs parsed from the mapper output.
         for line in file:
             yield line.strip().split(separator, 2)


     def main(separator='\t'):
         data = read_from_mapper(sys.stdin, separator)
         # Reducer input arrives sorted by key, so groupby sees each key once.
         for key, group in groupby(data, itemgetter(0)):
             try:
                 _left = []   # flag 0: records whose own value equals this key
                 _right = []  # flag 1: records whose value is within 3 of this key
                 for k, value in group:
                     if value.startswith('0'):
                         _left.append(value)
                     else:
                         _right.append(value)

                 # Drop duplicate records.
                 left_rs = list(set(_left))
                 right_rs = list(set(_right))

                 for l in left_rs:
                     _flag, _title, _hash = l.strip().split('_')
                     titles = []
                     for r in right_rs:
                         _flag_, _title_, _hash_ = r.strip().split('_')
                         titles.append(_title_ + '_' + _hash_)
                     # One dict per title, mapping it to its neighbours.
                     print({_title + '_' + _hash: titles})
             except ValueError:
                 # Skip groups with malformed lines (e.g. a title containing '_').
                 pass


     if __name__ == '__main__':
         main()

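The code is simple and needs little explanation; the main idea is the expansion. A tolerance of 3 means expanding the data 7-fold. If the requirement changed to a tolerance of 100, the data would have to be expanded 201-fold. If the data set is already large, this makes the reduce step much heavier, and the job may never finish. We need a different approach.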
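To make the cost concrete, the map, shuffle and reduce steps can be simulated in a single process. This is a minimal sketch, not the streaming job itself: `map_record`, `reduce_group`, `run_local` and the `sample` data are hypothetical names introduced here, and an in-memory sort stands in for the shuffle. It follows the reducer above in that records with the same value (D_16 and E_16) are not listed as neighbours of each other.

```python
from itertools import groupby
from operator import itemgetter


def map_record(title, value, max_diff=3):
    """Emit 2 * max_diff + 1 keyed copies; flag 0 marks the record's own key."""
    for offset in range(-max_diff, max_diff + 1):
        flag = 0 if offset == 0 else 1
        yield value + offset, (flag, title, value)


def reduce_group(values):
    """Pair every flag-0 record with the flag-1 records that share its key."""
    owners = sorted({v for v in values if v[0] == 0})
    neighbours = sorted({v for v in values if v[0] == 1})
    for _, title, value in owners:
        yield ('%s_%d' % (title, value),
               ['%s_%d' % (t, h) for _, t, h in neighbours])


def run_local(records, max_diff=3):
    """Simulate map -> sort (shuffle) -> reduce; return (mapped count, output)."""
    mapped = [kv for title, value in records
              for kv in map_record(title, value, max_diff)]
    mapped.sort(key=itemgetter(0))  # stands in for the shuffle's sort by key
    output = {}
    for _, group in groupby(mapped, key=itemgetter(0)):
        for name, titles in reduce_group([value for _, value in group]):
            output[name] = titles
    return len(mapped), output


# Hypothetical sample data.
sample = [('A', 13), ('B', 14), ('C', 15), ('D', 16), ('E', 16), ('F', 17)]
count_3, output_3 = run_local(sample, max_diff=3)
count_100, _ = run_local(sample, max_diff=100)
print(count_3, count_100)  # 42 1206
print(output_3['A_13'])    # ['B_14', 'C_15', 'D_16', 'E_16']
```

Six input records become 42 intermediate records at a tolerance of 3 and 1206 at a tolerance of 100, which is the 7-fold and 201-fold growth described above.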

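Resolving the Problem
The solution in the previous section is not optimal: it works when the expansion is small, but it breaks down once the expansion factor gets too large. Thinking about how MapReduce works at its core suggests an alternative: customize the map output key and override its compareTo method so that it returns 0 when abs(a - b) <= 3. Here is the code for the custom map key: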

package xxx.xxx.xx.xx.model.output.map;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class DistinctMapKey implements WritableComparable<DistinctMapKey> {
    private int hashV; // the computed similarity value


    @Override
    public int compareTo(DistinctMapKey o) {
        if (o == null) {
            return 1;
        }
        if (this == o) {
            return 0;
        }
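        // Subtract as long so the difference cannot overflow int.
        long diff = (long) this.hashV - o.hashV;
        if (diff > 3 || diff < -3) {
            return diff > 0 ? 1 : -1;
        }
        // Values within 3 of each other are treated as the same key.
        return 0;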
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(this.hashV);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.hashV = in.readInt();
    }

    public int getHashV() {
        return hashV;
    }

    public void setHashV(int hashV) {
        this.hashV = hashV;
    }
}

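This way there is no need to expand the data; only the compareTo method has to change. Keys that compare as equal still have to reach the same reducer, so with more than one reducer the job also needs a partitioner that keeps nearby values together.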
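One property of this comparison is worth seeing in isolation: it is not transitive. 13 and 16 compare as equal, 16 and 19 compare as equal, but 13 and 19 do not. The sketch below mirrors the compareTo logic in Python and shows what grouping looks like under one assumption, labeled in the code: that group boundaries are decided by comparing each sorted key with the one before it. How a given Hadoop version actually draws group boundaries should be checked against its documentation. `range_compare` and `group_adjacent` are hypothetical names.

```python
def range_compare(a, b, max_diff=3):
    """Python analogue of the compareTo above: values within max_diff compare equal."""
    diff = a - b
    if abs(diff) > max_diff:
        return 1 if diff > 0 else -1
    return 0


def group_adjacent(values, max_diff=3):
    """Group sorted values, starting a new group when a value differs from its predecessor.

    Assumption: group boundaries come from comparing each key with the key
    just before it. Values are sorted numerically here to keep the sketch
    deterministic.
    """
    groups = []
    for value in sorted(values):
        if groups and range_compare(groups[-1][-1], value, max_diff) == 0:
            groups[-1].append(value)
        else:
            groups.append([value])
    return groups


print(range_compare(13, 16), range_compare(16, 19), range_compare(13, 19))  # 0 0 -1
print(group_adjacent([13, 16, 19, 30]))  # [[13, 16, 19], [30]]
```

Under that assumption, 13 and 19 end up in the same group even though they differ by 6, so a chain of close values merges into one group rather than producing one neighbour list per title as in the expansion approach.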

Update
Updated 2015-12-15 21:03
The algorithm from the second section (the expansion approach) implemented in Spark:

package com.xxx.xxx.spark.learning.help

import org.apache.spark.{SparkContext, SparkConf}

import scala.collection.mutable.{ArrayBuffer, HashSet, HashMap}

/**
 * Created by xxx on 12/14/15.
 */
object RangeKeyReduce {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("hebei")
      .set("spark.executor.memory", "4g")
    val sc = new SparkContext(conf)

    val data = sc.textFile("/user/xxx/xxx/range_key/input/data")

    val groups = data.flatMap { line =>
      val detail = line.split("\t")
      val title = detail(0)
      val hash = detail(1)
      for (i <- -3 to 3) yield {
        val flag = if (i == 0) 0 else 1
        (hash.toLong + i, flag + "_" + title + "_" + hash)
      }
    }.groupByKey()

    val res = groups.map { line =>
      val left: HashSet[String] = new HashSet[String]()
      val right: HashSet[String] = new HashSet[String]()

      line._2.foreach { value =>
        if (value.startsWith("0")) left += value
        else right += value
      }
      val rs: HashMap[String, ArrayBuffer[String]] = new HashMap[String, ArrayBuffer[String]]()
      if (left.size > 0) {
        left.foreach { l =>
          val Array(flag, title, hash) = l.split("_")
          val titles: ArrayBuffer[String] = new ArrayBuffer[String]()
          right.foreach { r =>
            val Array(_flag, _title, _hash) = r.split("_")
            titles += (_title + "_" + _hash)
          }
          rs.put(title + "_" + hash, titles)
        }
        rs
      } else {
        new HashMap[String, ArrayBuffer[String]]()
      }
    }.filter(_.nonEmpty).flatMap { l =>
      for (k <- l.keys) yield {
        // l(k) is the ArrayBuffer of neighbours; join its elements with ';'.
        k + "\t" + l(k).mkString(";")
      }
    }

    res.saveAsTextFile("/user/xxx/xxx/range_key/output")
  }
}
