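Peeking at the GZIP format through the JDK's GZIP utilities
A joke in Flume's HDFSCompressedDataStream
Moving the battlefield to the Hadoop GzipCodec source
Back to the problem: how to deal with it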

Peeking at the GZIP format through the JDK's GZIP utilities

class GZIPOutputStream extends DeflaterOutputStream {
    public GZIPOutputStream(OutputStream out, int size, boolean syncFlush) {
        super(out, new Deflater(Deflater.DEFAULT_COMPRESSION, true), size, syncFlush);
        writeHeader();
    }

    public void finish() throws IOException {
        if (!def.finished()) {
            def.finish();
            while (!def.finished()) {
                int len = def.deflate(buf, 0, buf.length);
                if (def.finished() && len <= buf.length - TRAILER_SIZE) {
                    // last deflater buffer. Fit trailer at the end
                    writeTrailer(buf, len);
                    len = len + TRAILER_SIZE;
                    out.write(buf, 0, len);
                    return;
                }
                if (len > 0)
                    out.write(buf, 0, len);
            }
            // if we can't fit the trailer at the end of the last
            // deflater buffer, we write it separately
            byte[] trailer = new byte[TRAILER_SIZE];
            writeTrailer(trailer, 0);
            out.write(trailer);
        }
    }
}

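The source above tells us the following:

GZIPOutputStream extends DeflaterOutputStream
Constructor: creates the underlying DeflaterOutputStream and writes the header
finish(): once all data has been processed, writes the trailer

In summary:

Gzip vs. Deflater: gzip is a file format, while Deflate (implemented in the JDK by the Deflater class) is a compression algorithm
The gzip format: the data is compressed with the Deflate algorithm, then wrapped with a header and a trailer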
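The header/trailer layout can be checked directly with the JDK classes. The following is a minimal sketch, not part of the original post: the class name GzipLayoutDemo and its helper methods are made up for illustration. It assumes the standard gzip layout from RFC 1952, where the stream starts with the magic bytes 0x1f 0x8b and ends with an 8-byte trailer holding the CRC32 and the uncompressed length, both little-endian.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
public class GzipLayoutDemo {

    /** Compresses the input into a single gzip member (header + deflate data + trailer). */
    public static byte[] gzip(byte[] input) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed here, so finish() has already written the trailer.
        return bytes.toByteArray();
    }

    /** True if the data starts with the gzip magic number 0x1f 0x8b. */
    public static boolean hasGzipMagic(byte[] gz) {
        return gz.length >= 2 && (gz[0] & 0xff) == 0x1f && (gz[1] & 0xff) == 0x8b;
    }

    /** CRC32 field of the trailer: the first 4 of the last 8 bytes, little-endian. */
    public static long trailerCrc(byte[] gz) {
        return ByteBuffer.wrap(gz, gz.length - 8, 4)
                .order(ByteOrder.LITTLE_ENDIAN).getInt() & 0xffffffffL;
    }

    /** ISIZE field of the trailer: uncompressed length modulo 2^32, the last 4 bytes. */
    public static long trailerSize(byte[] gz) {
        return ByteBuffer.wrap(gz, gz.length - 4, 4)
                .order(ByteOrder.LITTLE_ENDIAN).getInt() & 0xffffffffL;
    }

    /** CRC32 of the uncompressed data, for comparison with the trailer. */
    public static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] data = "hello gzip".getBytes(StandardCharsets.UTF_8);
        byte[] gz = gzip(data);
        System.out.println("magic ok:   " + hasGzipMagic(gz));
        System.out.println("crc match:  " + (trailerCrc(gz) == crc32(data)));
        System.out.println("size match: " + (trailerSize(gz) == data.length));
    }
}
```

Everything between the header and the last 8 bytes is the raw Deflate stream, which is why the constructor above passes nowrap = true to Deflater.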

A joke in Flume's HDFSCompressedDataStream

@Override
public void sync() throws IOException {
  // We must use finish() and resetState() here -- flush() is apparently not
  // supported by the compressed output streams (it's a no-op).
  // Also, since resetState() writes headers, avoid calling it without an
  // additional write/append operation.
  // Note: There are bugs in Hadoop & JDK w/ pure-java gzip; see HADOOP-8522.
  serializer.flush();
  if (!isFinished) {
    cmpOut.finish(); // triggers the finish() logic
    isFinished = true;
  }
  fsOut.flush();
  hflushOrSync(this.fsOut);
}

When is sync() called?

When the batchCounter in write() reaches batchSize
After HDFSEventSink finishes processing a transaction

Notice this rather curious comment:
Note: There are bugs in Hadoop & JDK w/ pure-java gzip; see HADOOP-8522.

Excerpt from: https://issues.apache.org/jira/browse/HADOOP-8522
* ResetableGzipOutputStream creates invalid gzip files when finish() and resetState() are used *
Description
	ResetableGzipOutputStream creates invalid gzip files when finish() and resetState() are used. The issue is that finish() flushes the compressor buffer and writes the gzip CRC32 + data length trailer. After that, resetState() does not repeat the gzip header, but simply starts writing more deflate-compressed data. The resultant files are not readable by the Linux "gunzip" tool. ResetableGzipOutputStream should write valid multi-member gzip files.

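To restate the description (for ResetableGzipOutputStream):

At the start, the header is written
Then the compressed data is written
On finish(), the trailer is written
After that, resetState() is called and more compressed data is written without a new header, and the cycle repeats

For a .gz file produced this way, gunzip can only parse the first chunk of data; none of the later chunks can be read. That is the pure-java gzip bug.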
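For contrast, here is a minimal sketch of the valid multi-member layout that HADOOP-8522 asks for, where every member carries its own header and trailer. It uses only JDK classes and does not reproduce the Hadoop bug itself; the class name MultiMemberGzipDemo and its methods are hypothetical. It relies on the JDK's GZIPInputStream reading concatenated members as a single stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
public class MultiMemberGzipDemo {

    /** Builds one complete gzip member: header + deflate data + trailer. */
    public static byte[] gzipMember(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Concatenates two byte arrays, as appending to the same file would. */
    public static byte[] concat(byte[] a, byte[] b) {
        byte[] joined = new byte[a.length + b.length];
        System.arraycopy(a, 0, joined, 0, a.length);
        System.arraycopy(b, 0, joined, a.length, b.length);
        return joined;
    }

    /** Decompresses every member in the input and returns the combined text. */
    public static String gunzipAll(byte[] gz) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two independent members, each with its own header and trailer.
        byte[] file = concat(gzipMember("first batch\n"), gzipMember("second batch\n"));
        System.out.print(gunzipAll(file));
    }
}
```

The buggy sequence differs in exactly one place: the second chunk has no header in front of it, so a reader that expects a new member after the first trailer has nothing valid to continue with.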

So, by this logic, the Flume HDFS Sink clearly "cannot" use gzip for compression. And yet, curiously, the official Flume documentation says it is supported (facepalm).

Moving the battlefield to the Hadoop GzipCodec source

public class GzipCodec extends DefaultCodec {
	
    public CompressionOutputStream createOutputStream(OutputStream out) {
        return (CompressionOutputStream)(ZlibFactory.isNativeZlibLoaded(this.conf) ? 
        new CompressorStream(out, this.createCompressor(), this.conf.getInt("io.file.buffer.size", 4096)) 
        : new GzipCodec.GzipOutputStream(out));
    }

    public CompressionOutputStream createOutputStream(OutputStream out, Compressor compressor) {
        return (CompressionOutputStream)(compressor != null ? 
        new CompressorStream(out, compressor, this.conf.getInt("io.file.buffer.size", 4096)) 
        : this.createOutputStream(out));
    }

}

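Clearly, GzipCodec offers two gzip implementations:

GzipOutputStream: the very "joke" mentioned above (note that GzipCodec does not provide a matching InputStream)
CompressorStream: whether this one is used depends on the result of ZlibFactory.isNativeZlibLoaded(this.conf)

So what exactly is NativeZlib? Tracing through the code, we find: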

public class ZlibCompressor implements Compressor {
  static {
    if (NativeCodeLoader.isNativeCodeLoaded()) {
      try {
        // Initialize the native library
        initIDs();
        nativeZlibLoaded = true;
      } catch (Throwable t) {
        // Ignore failure to load/initialize native-zlib
      }
    }
  }
}

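Combined with some other material, we can conclude the following:
Hadoop loads its native library, and when GzipCodec is used, whether native zlib was loaded decides whether the buggy GzipOutputStream is used. That may not sound very concrete, but you have almost certainly seen this WARN before:

The common WARN: Unable to load native-hadoop library for your platform

This WARN is what Hadoop logs when it cannot find the native library and therefore fails to load it.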

public final class NativeCodeLoader {

  static {
    try {
      System.loadLibrary("hadoop");
      LOG.debug("Loaded the native-hadoop library");
      nativeCodeLoaded = true;
    } catch (Throwable t) {
      // Ignore failure to load
      LOG.debug("Failed to load native-hadoop with error: " + t);
      LOG.debug("java.library.path=" + System.getProperty("java.library.path"));
    }
    
    if (!nativeCodeLoaded) {
      LOG.warn("Unable to load native-hadoop library for your platform... using builtin-java classes where applicable");
    }
  }
}

Look up System.loadLibrary() for yourself here, and you will run into this property: java.library.path
There is too much to cover, so we will not go any further down this path.
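As a starting point for that investigation, the sketch below prints the directories the JVM searches and checks whether a library can be loaded by name, in the same way NativeCodeLoader calls System.loadLibrary("hadoop"). The class name NativeLibraryProbe and its methods are hypothetical, and the probe only approximates Hadoop's check: a successful load here really does load the library into the current JVM.

```java
import java.io.File;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
public class NativeLibraryProbe {

    /** Returns the directories listed in java.library.path (possibly none). */
    public static String[] searchPath() {
        String path = System.getProperty("java.library.path", "");
        if (path.isEmpty()) {
            return new String[0];
        }
        return path.split(Pattern.quote(File.pathSeparator));
    }

    /**
     * Tries to load a native library by its short name, e.g. "hadoop" for
     * libhadoop.so on Linux. Returns false instead of throwing on failure.
     */
    public static boolean canLoad(String name) {
        try {
            System.loadLibrary(name);
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String dir : searchPath()) {
            System.out.println("search dir: " + dir);
        }
        System.out.println("native-hadoop loadable: " + canLoad("hadoop"));
    }
}
```

If the last line prints false when run with the same JVM options as the Flume agent, the directory containing the native Hadoop library is most likely missing from java.library.path.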

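Back to the problem: how to deal with it

The sections above have described the problem to a fair extent; what remains is how to deal with it.
The most fundamental fix is to adjust the environment so that Hadoop can load its native library. GzipCodec will then stop using the buggy class and compress through zlib instead.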
