Removing the Extra Trailing TAB from Hadoop Streaming Output Lines

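One group of jobs at my workplace has long used Streaming to compress text logs. They simply set the job output to BZ2 format and write out exactly what they read in, with no processing at all. The catch is that every output line ends with an extra TAB. Eventually a job came along that needed the last field before that TAB, so the TAB had to go.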

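It is a small problem, but a round of searching online turned up no good solution. Plenty of people have run into it, but our case is unusual in that the job has only a map phase and no reduce phase. The answer at http://stackoverflow.com/questions/20137618/hadoop-streaming-api-how-to-remove-unwanted-delimiters says outright: “As I discussed with friends, there's no easy way to achieve the goal,...”.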

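One characteristic of Streaming is that it separates key and value on TAB by default. Unless the number of key fields is configured, everything before the first TAB in a line is the key and everything after it is the value. If the line contains no TAB, the whole line is the key and the value is empty. The extra TAB at the end of each line is exactly the separator written between the key and the (empty) value.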
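To make the rule above concrete, here is a minimal sketch in plain Java. It does not use any Hadoop class; the class name KeyValueSplitDemo and the methods split and join are made up for illustration, and join is only a naive stand-in for a writer that always emits the separator.

```java
public class KeyValueSplitDemo {
    // Split at the first TAB: text before it is the key, the rest is the value.
    // A line without any TAB becomes the key, and the value is empty.
    public static String[] split(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    // Naive re-join that always writes the separator, even for an empty value.
    public static String join(String key, String value) {
        return key + "\t" + value;
    }

    public static void main(String[] args) {
        String[] kv = split("GET /index.html 200");
        String out = join(kv[0], kv[1]);
        // Make the TAB visible when printing.
        System.out.println("[" + out.replace("\t", "<TAB>") + "]");
        // prints: [GET /index.html 200<TAB>]
    }
}
```

A log line with no TAB goes in as a key with an empty value and comes out with a TAB appended, which matches the symptom described above.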

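I started with the map side of Streaming, in PipeMapper.java. The InputWriter is what writes records out to the user's process, so I tried implementing a custom one. In the MapReduce job configuration, stream.map.input.writer.class names the InputWriter to use; the default is TextInputWriter. Streaming is rather unfriendly here: adding the option -Dstream.map.input.writer.class=XXX does not make Streaming use the custom class. You have to implement your own IdentifierResolver, assign an InputWriter to each input type inside it, and pass the input type itself through the stream.map.input option. To check whether the setting took effect, go by the job's configuration table shown in the JobTracker while the job is running.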

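Unfortunately, after replacing TextInputWriter with a custom InputWriter, the trailing TAB was gone but a number appeared at the start of every line. My guess is that the key Hadoop passes to the Mapper was being printed. Orz... No more guessing; time to read the code.

Luckily, the code is fairly short.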

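Streaming packs itself, together with the files named by the user's -file, -cacheFile, -cacheArchive and similar options, into one JAR and submits it to the cluster as an MR job. It feeds the cluster's output to the user's Mapper as input, and it reads the output of the user's Mapper as the output of the whole map task. Input and Output are named from the point of view of the user's program, while Writer and Reader describe what Streaming itself does; hence the names InputWriter and OutputReader. In short:

(K,V) from Hadoop ---streaming---> user-defined Mapper ---streaming---> Hadoop Mapper output

Streaming launches the job through PipeMapRunner, collects the output of the user's process asynchronously, and reports job progress to Hadoop accordingly. The basic job setup and the job submission are all handled by the StreamJob class.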

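The job itself is executed by the classes PipeMapRed, PipeMapper, PipeReducer and PipeCombiner, and that is where the fix goes. In the run method of MROutputThread, add the following snippet just before the statement outCollector.collect(key, value);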

          if (value instanceof Text) {
            if (value.toString().isEmpty())
              value = NullWritable.get();
          }

Simple, isn't it?

Why does this work? The answer lies in org.apache.hadoop.mapred.TextOutputFormat. Here is the code.

package org.apache.hadoop.mapred;

import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataOutputStream;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.*;

/** An {@link OutputFormat} that writes plain text files. 
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {

  protected static class LineRecordWriter<K, V>
    implements RecordWriter<K, V> {
    private static final String utf8 = "UTF-8";
    private static final byte[] newline;
    static {
      try {
        newline = "\n".getBytes(utf8);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + utf8 + " encoding");
      }
    }

    protected DataOutputStream out;
    private final byte[] keyValueSeparator;

    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
      this.out = out;
      try {
        this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + utf8 + " encoding");
      }
    }

    public LineRecordWriter(DataOutputStream out) {
      this(out, "\t");
    }

    /**
     * Write the object to the byte stream, handling Text as a special
     * case.
     * @param o the object to print
     * @throws IOException if the write throws, we pass it on
     */
    private void writeObject(Object o) throws IOException {
      if (o instanceof Text) {
        Text to = (Text) o;
        out.write(to.getBytes(), 0, to.getLength());
      } else {
        out.write(o.toString().getBytes(utf8));
      }
    }

    public synchronized void write(K key, V value)
      throws IOException {
      boolean nullKey = key == null || key instanceof NullWritable;
      boolean nullValue = value == null || value instanceof NullWritable;
      if (nullKey && nullValue) {
        return; 
      } 
      if (!nullKey) {
        writeObject(key);
      }
      if (!(nullKey || nullValue)) {
        out.write(keyValueSeparator);
      }
      if (!nullValue) {
        writeObject(value);
      }
      out.write(newline);
    }   
      
    public synchronized void close(Reporter reporter) throws IOException {
      out.close();
    }
  }
    
  public RecordWriter<K, V> getRecordWriter(FileSystem ignored,
                                                  JobConf job,
                                                  String name,
                                                  Progressable progress)
    throws IOException {
    boolean isCompressed = getCompressOutput(job);
    String keyValueSeparator = job.get("mapreduce.output.textoutputformat.separator",
                                       "\t");
    if (!isCompressed) {
      Path file = FileOutputFormat.getTaskOutputPath(job, name);
      FileSystem fs = file.getFileSystem(job);
      FSDataOutputStream fileOut = fs.create(file, progress);
      return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
    } else { 
      Class<? extends CompressionCodec> codecClass =
        getOutputCompressorClass(job, GzipCodec.class);
      // create the named codec
      CompressionCodec codec = ReflectionUtils.newInstance(codecClass, job);
      // build the filename including the extension
      Path file = 
        FileOutputFormat.getTaskOutputPath(job, 
                                           name + codec.getDefaultExtension());
      FileSystem fs = file.getFileSystem(job);
      FSDataOutputStream fileOut = fs.create(file, progress);
      return new LineRecordWriter<K, V>(new DataOutputStream
                                        (codec.createOutputStream(fileOut)),
                                        keyValueSeparator);
    }
  }
}

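Did you notice LineRecordWriter.write? The separator is written only when neither the key nor the value is null or a NullWritable. An empty Text value is not null, so the separator still gets written; replacing it with NullWritable suppresses it.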
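The branch structure of write can be laid out as a small truth table. The sketch below is plain Java with no Hadoop dependency; the class SeparatorTruthTable and the method describe are invented for illustration, and the two booleans stand in for the nullKey and nullValue flags computed in the real method.

```java
public class SeparatorTruthTable {
    // Mirrors the if-chain of LineRecordWriter.write, but returns a description
    // of what would be written instead of writing bytes to a stream.
    public static String describe(boolean nullKey, boolean nullValue) {
        if (nullKey && nullValue) {
            return "(nothing)";
        }
        StringBuilder sb = new StringBuilder();
        if (!nullKey) {
            sb.append("key");
        }
        if (!(nullKey || nullValue)) {
            sb.append("<TAB>");
        }
        if (!nullValue) {
            sb.append("value");
        }
        sb.append("<NL>");
        return sb.toString();
    }

    public static void main(String[] args) {
        boolean[] flags = { false, true };
        for (boolean nullKey : flags) {
            for (boolean nullValue : flags) {
                System.out.println("nullKey=" + nullKey + " nullValue=" + nullValue
                        + " -> " + describe(nullKey, nullValue));
            }
        }
    }
}
```

Only the row with both flags false contains the separator, so the fix amounts to moving the empty-value case out of that row.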

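Postscript:

A. Many suggestions online try to change the separator, replacing TAB with an empty string. That is a very crude approach and basically plants a trap. Why?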

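Log text can contain almost anything. The problem showed up this time because our lines contain no TAB. If the text did contain TABs, changing the separator to an empty string would remove a TAB that belongs to the log itself, namely the one Streaming took as the key/value boundary.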
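A quick way to see the trap is to push a line through the split-then-rejoin path with different separators. This is a sketch in plain Java under the assumption that the key/value boundary is the first TAB, as described earlier; the class EmptySeparatorDemo and the method roundTrip are hypothetical names, not Hadoop APIs.

```java
public class EmptySeparatorDemo {
    // Split at the first TAB, then join key and value with the given separator.
    public static String roundTrip(String line, String separator) {
        int tab = line.indexOf('\t');
        String key = tab < 0 ? line : line.substring(0, tab);
        String value = tab < 0 ? "" : line.substring(tab + 1);
        return key + separator + value;
    }

    public static void main(String[] args) {
        String plain = "no tab in this line";
        String tabbed = "user42\tlogin\tok";
        // With an empty separator the TAB-free line looks fine...
        System.out.println(roundTrip(plain, "").equals(plain));
        // ...but the line with TABs silently loses its first TAB.
        System.out.println(roundTrip(tabbed, "").replace("\t", "<TAB>"));
        // prints: user42login<TAB>ok
    }
}
```

The first and second fields are fused together, and nothing in the job reports it.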

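B. This approach was itself inspired by a Stack Overflow Q&A (http://stackoverflow.com/questions/18133290/hadoop-streaming-remove-trailing-tab-from-reducer-output). That Q&A likewise changes the separator, which is not advisable. On closer inspection, though, you can write your own TextOutputFormat<K,V> and modify the LineRecordWriter.write method in it.

Rewriting TextOutputFormat is a very elegant solution. It looks like changing something in Hadoop itself, but until the latest Streaming ships with this fix, it saves you from patching, recompiling and repackaging every version of Streaming. Besides, Streaming is not a standalone project: to build it you have to build Hadoop as well!

Add the following snippet: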

    if (!nullValue) {
      if (value instanceof Text) {
        if (value.toString().isEmpty())
          nullValue = true;
      }   
    }
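Here is how the patched write behaves, again as a plain Java sketch without Hadoop. The class PatchedWriteSketch is a made-up name, String stands in for Text, a null reference stands in for NullWritable, and placing the extra check right after nullValue is computed is my assumption about where the snippet goes.

```java
public class PatchedWriteSketch {
    // Returns what would be written for one record, using "\t" as the separator.
    public static String write(String key, String value) {
        boolean nullKey = key == null;
        boolean nullValue = value == null;
        // The added check: an empty value is treated the same as a null value.
        if (!nullValue && value.isEmpty()) {
            nullValue = true;
        }
        if (nullKey && nullValue) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        if (!nullKey) {
            out.append(key);
        }
        if (!(nullKey || nullValue)) {
            out.append('\t');
        }
        if (!nullValue) {
            out.append(value);
        }
        out.append('\n');
        return out.toString();
    }

    public static void main(String[] args) {
        // Empty value: no trailing TAB any more.
        System.out.print(write("a line without tabs", ""));
        // Non-empty value: the original TAB between key and value is kept.
        System.out.print(write("user42", "login\tok").replace("\t", "<TAB>"));
    }
}
```

Unlike the empty-separator trick, lines that do contain TABs pass through unchanged.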

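C. Even though this modifies Streaming code, there is no need to worry about affecting every user on the same machine, and no need to touch the Streaming JAR under $HADOOP_HOME. Streaming provides the parameter stream.shipped.hadoopstreaming for this purpose.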

D. Some settings seem to take effect only for the Reducer and do nothing for a map-only job like this one. For example:

mapred.textoutputformat.ignoreseparator
mapred.textoutputformat.separator

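I set them and saw no effect.

One more thing: writing something like -DXXX= \ among the command-line options does not seem to set the parameter to an empty string, and neither does -DXXX= "".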
