- Peeking at the GZIP format through the JDK's GZIP utilities
- java.util.zip.GZIPOutputStream
- java.util.zip.DeflaterOutputStream
- A practical joke in Flume's HDFSCompressedDataStream
- Moving the battlefield to Hadoop's GzipCodec source
- Back to the problem: how to fix it
Peeking at the GZIP format through the JDK's GZIP utilities
java.util.zip.GZIPOutputStream
public class GZIPOutputStream extends DeflaterOutputStream {

    public GZIPOutputStream(OutputStream out, int size, boolean syncFlush) throws IOException {
        super(out, new Deflater(Deflater.DEFAULT_COMPRESSION, true), size, syncFlush);
        writeHeader();
    }

    /**
     * Writes array of bytes to the compressed output stream. This method
     * will block until all the bytes are written.
     * @param buf the data to be written
     * @param off the start offset of the data
     * @param len the length of the data
     * @exception IOException If an I/O error has occurred.
     */
    public synchronized void write(byte[] buf, int off, int len) throws IOException {
        super.write(buf, off, len);
        crc.update(buf, off, len);
    }

    public void finish() throws IOException {
        if (!def.finished()) {
            def.finish();
            while (!def.finished()) {
                int len = def.deflate(buf, 0, buf.length);
                if (def.finished() && len <= buf.length - TRAILER_SIZE) {
                    // last deflater buffer. Fit trailer at the end
                    writeTrailer(buf, len);
                    len = len + TRAILER_SIZE;
                    out.write(buf, 0, len);
                    return;
                }
                if (len > 0)
                    out.write(buf, 0, len);
            }
            // if we can't fit the trailer at the end of the last
            // deflater buffer, we write it separately
            byte[] trailer = new byte[TRAILER_SIZE];
            writeTrailer(trailer, 0);
            out.write(trailer);
        }
    }

    /*
     * Writes GZIP member header.
     */
    private void writeHeader() throws IOException {
        out.write(new byte[] {
            (byte) GZIP_MAGIC,        // Magic number (short)
            (byte)(GZIP_MAGIC >> 8),  // Magic number (short)
            Deflater.DEFLATED,        // Compression method (CM)
            0,                        // Flags (FLG)
            0,                        // Modification time MTIME (int)
            0,                        // Modification time MTIME (int)
            0,                        // Modification time MTIME (int)
            0,                        // Modification time MTIME (int)
            0,                        // Extra flags (XFLG)
            0                         // Operating system (OS)
        });
    }

    /*
     * Writes GZIP member trailer to a byte array, starting at a given
     * offset.
     */
    private void writeTrailer(byte[] buf, int offset) throws IOException {
        writeInt((int)crc.getValue(), buf, offset);  // CRC-32 of uncompr. data
        writeInt(def.getTotalIn(), buf, offset + 4); // Number of uncompr. bytes
    }
}
From the source above we can learn the following:
- GZIPOutputStream extends DeflaterOutputStream
- constructor: creates the underlying DeflaterOutputStream and writes the header
- write: delegates the write to DeflaterOutputStream and updates the CRC
- finish: once all data has been compressed, writes the trailer

In short: GZIPOutputStream wraps a Header & Trailer around DeflaterOutputStream.
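As a quick sanity check, here is a minimal sketch (the class name is mine; JDK classes only) that compresses a few bytes and inspects the output against the header bytes written by writeHeader() above:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipHeaderPeek {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write("hello gzip".getBytes(StandardCharsets.UTF_8));
        }
        byte[] data = bos.toByteArray();
        // Bytes 0-1 are GZIP_MAGIC (0x1f 0x8b), byte 2 is CM = 8 (Deflater.DEFLATED);
        // the last 8 bytes are the trailer: CRC-32 + uncompressed length.
        System.out.printf("magic: %02x %02x, CM: %d, total: %d bytes%n",
                data[0], data[1], data[2], data.length);
    }
}

Running it should print magic: 1f 8b, CM: 8, matching writeHeader() byte for byte.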
java.util.zip.DeflaterOutputStream
public class DeflaterOutputStream extends FilterOutputStream {

    /**
     * Writes an array of bytes to the compressed output stream. This
     * method will block until all the bytes are written.
     * @param b the data to be written
     * @param off the start offset of the data
     * @param len the length of the data
     * @exception IOException if an I/O error has occurred
     */
    public void write(byte[] b, int off, int len) throws IOException {
        if (def.finished()) {
            throw new IOException("write beyond end of stream");
        }
        if ((off | len | (off + len) | (b.length - (off + len))) < 0) {
            throw new IndexOutOfBoundsException();
        } else if (len == 0) {
            return;
        }
        if (!def.finished()) {
            // If you are interested, study the Deflater API on its own
            def.setInput(b, off, len);
            while (!def.needsInput()) {
                deflate();
            }
        }
    }

    /**
     * Writes next block of compressed data to the output stream.
     * @throws IOException if an I/O error has occurred
     */
    protected void deflate() throws IOException {
        int len = def.deflate(buf, 0, buf.length);
        if (len > 0) {
            out.write(buf, 0, len);
        }
    }
}
NOTE: As the write method shows, every call to write pushes data into the Deflater. To get a decent compression ratio, avoid feeding it tiny byte arrays; otherwise the compression gain is limited. The effect is easiest to see when each small write is also flushed, as in the sketch below.
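To make that concrete, here is a minimal sketch (class name and sizes are mine) in which each small write is followed by a flush on a syncFlush-enabled stream, roughly what a per-event flushing sink would do; the difference in output size is dramatic:

import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;

public class SmallWriteDemo {
    // Compress `data` in chunks of `chunk` bytes, flushing after each write.
    static int gzippedSize(byte[] data, int chunk) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos, 512, /*syncFlush=*/true)) {
            for (int off = 0; off < data.length; off += chunk) {
                gz.write(data, off, Math.min(chunk, data.length - off));
                gz.flush(); // with syncFlush, flush() forces the deflater to emit a block
            }
        }
        return bos.size();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = new byte[64 * 1024]; // highly compressible: all zeros
        System.out.println("1-byte writes: " + gzippedSize(data, 1));
        System.out.println("8KiB writes  : " + gzippedSize(data, 8192));
    }
}

With one-byte writes, every flush emits its own (mostly empty) deflate block, so the "compressed" output can end up larger than the input; the 8 KiB version compresses the same data down to a handful of bytes.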
Interim summary
- The relationship between Gzip and Deflater: Gzip is a file format; DEFLATE is the compression algorithm (java.util.zip.Deflater is the JDK class that implements it)
- The Gzip format: data compressed with the DEFLATE algorithm, wrapped with a Header and a Trailer
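That relationship can be verified with a small sketch (JDK classes only; it assumes the simple 10-byte header with FLG == 0 that the JDK writes, as shown above): strip the header and the 8-byte trailer, and a raw Inflater can decode what remains:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;

public class GzipIsDeflate {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write("gzip is just framed deflate".getBytes(StandardCharsets.UTF_8));
        }
        byte[] gzBytes = bos.toByteArray();
        // Drop the 10-byte header and 8-byte trailer, leaving the raw DEFLATE body
        byte[] body = Arrays.copyOfRange(gzBytes, 10, gzBytes.length - 8);
        Inflater raw = new Inflater(true); // true = raw deflate, no zlib wrapper
        raw.setInput(body);
        byte[] out = new byte[256];
        int n = raw.inflate(out);
        raw.end();
        System.out.println(new String(out, 0, n, StandardCharsets.UTF_8));
    }
}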
A practical joke in Flume's HDFSCompressedDataStream
@Override
public void sync() throws IOException {
    // We must use finish() and resetState() here -- flush() is apparently not
    // supported by the compressed output streams (it's a no-op).
    // Also, since resetState() writes headers, avoid calling it without an
    // additional write/append operation.
    // Note: There are bugs in Hadoop & JDK w/ pure-java gzip; see HADOOP-8522.
    serializer.flush();
    if (!isFinished) {
        cmpOut.finish(); // triggers the finish logic shown above
        isFinished = true;
    }
    fsOut.flush();
    hflushOrSync(this.fsOut);
}
When does sync get called?
- when batchCounter (the event counter) in write reaches batchSize
- after HDFSEventSink finishes processing a transaction
Notice this curious line in the comment:
Note: There are bugs in Hadoop & JDK w/ pure-java gzip; see HADOOP-8522.
Excerpted from https://issues.apache.org/jira/browse/HADOOP-8522:
ResetableGzipOutputStream creates invalid gzip files when finish() and resetState() are used
Description
ResetableGzipOutputStream creates invalid gzip files when finish() and resetState() are used. The issue is that finish() flushes the compressor buffer and writes the gzip CRC32 + data length trailer. After that, resetState() does not repeat the gzip header, but simply starts writing more deflate-compressed data. The resultant files are not readable by the Linux "gunzip" tool. ResetableGzipOutputStream should write valid multi-member gzip files.
To paraphrase the description (for ResetableGzipOutputStream):
- at the start, the gzip header is written
- then compressed data is written
- on finish, the trailer is written
- a later resetState() then writes more compressed data directly, and the cycle repeats

For a .gz file produced by this logic, gunzip can only decode the first chunk of data. That is the pure-java gzip bug.
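To make the failure pattern concrete, here is a minimal sketch against the Hadoop API (it assumes native zlib is not loaded, so the pure-Java GzipCodec.GzipOutputStream is selected; the file and variable names are mine):

import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;

public class Hadoop8522Repro {
    public static void main(String[] args) throws Exception {
        GzipCodec codec = new GzipCodec();
        codec.setConf(new Configuration());
        try (FileOutputStream fsOut = new FileOutputStream("repro.gz")) {
            CompressionOutputStream cmpOut = codec.createOutputStream(fsOut);
            cmpOut.write("first batch\n".getBytes(StandardCharsets.UTF_8));
            cmpOut.finish();      // writes the gzip trailer
            cmpOut.resetState();  // bug: does NOT write a new gzip header
            cmpOut.write("second batch\n".getBytes(StandardCharsets.UTF_8));
            cmpOut.finish();
        }
        // Per HADOOP-8522, `gunzip -c repro.gz` decodes only "first batch"
        // and treats the rest as trailing garbage.
    }
}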
So clearly, by this logic, Flume's HDFS Sink should not be usable with Gzip compression. And yet, amusingly, the Flume documentation says it is supported (facepalm).
Moving the battlefield to Hadoop's GzipCodec source
public class GzipCodec extends DefaultCodec {

    public CompressionOutputStream createOutputStream(OutputStream out) throws IOException {
        return (CompressionOutputStream)(ZlibFactory.isNativeZlibLoaded(this.conf)
            ? new CompressorStream(out, this.createCompressor(),
                  this.conf.getInt("io.file.buffer.size", 4096))
            : new GzipCodec.GzipOutputStream(out));
    }

    public CompressionOutputStream createOutputStream(OutputStream out, Compressor compressor) throws IOException {
        return (CompressionOutputStream)(compressor != null
            ? new CompressorStream(out, compressor,
                  this.conf.getInt("io.file.buffer.size", 4096))
            : this.createOutputStream(out));
    }
}
Clearly, GzipCodec provides two Gzip implementations:
- GzipOutputStream, which is exactly the "joke" discussed above
- CompressorStream, which is chosen only when ZlibFactory.isNativeZlibLoaded(this.conf) returns true (a quick way to check this on your cluster follows below)
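A tiny check of that condition (assuming a Hadoop client classpath; the class name is mine):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.zlib.ZlibFactory;

public class NativeZlibCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // true  -> GzipCodec uses CompressorStream + native zlib (the safe path)
        // false -> GzipCodec falls back to GzipCodec.GzipOutputStream (the buggy path)
        System.out.println("native zlib loaded: " + ZlibFactory.isNativeZlibLoaded(conf));
    }
}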
So what on earth is this native zlib? Tracing the code, we find:
public class ZlibCompressor implements Compressor {

    static {
        if (NativeCodeLoader.isNativeCodeLoaded()) {
            try {
                // Initialize the native library
                initIDs();
                nativeZlibLoaded = true;
            } catch (Throwable t) {
                // Ignore failure to load/initialize native-zlib
            }
        }
    }
}
Combined with some other references, we can draw the following conclusion:

Hadoop tries to load its native library, and when GzipCodec is used it decides, based on whether native zlib was successfully loaded, whether to fall back to GzipOutputStream (the buggy class). Put that way it may still sound abstract, but everyone has surely seen this WARN before:

The familiar WARN: Unable to load native-hadoop library for your platform

That WARN is exactly what Hadoop logs when it cannot find the native library and the load fails:
public final class NativeCodeLoader {

    static {
        try {
            System.loadLibrary("hadoop");
            LOG.debug("Loaded the native-hadoop library");
            nativeCodeLoaded = true;
        } catch (Throwable t) {
            // Ignore failure to load
            LOG.debug("Failed to load native-hadoop with error: " + t);
            LOG.debug("java.library.path=" + System.getProperty("java.library.path"));
        }
        if (!nativeCodeLoaded) {
            LOG.warn("Unable to load native-hadoop library for your platform... " +
                     "using builtin-java classes where applicable");
        }
    }
}
Look up System.loadLibrary() yourself and you will quickly run into java.library.path, the list of directories the JVM searches for native libraries. There is too much material to cover here, so we will stop at this point; a quick way to inspect the setting is sketched below.
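A trivial sketch (class name is mine) for checking where the JVM is looking; the usual fix is to point this at the Hadoop native directory, e.g. via -Djava.library.path=$HADOOP_HOME/lib/native or LD_LIBRARY_PATH:

public class LibraryPathPeek {
    public static void main(String[] args) {
        // The directories searched by System.loadLibrary("hadoop");
        // libhadoop.so must live in one of them (commonly $HADOOP_HOME/lib/native).
        System.out.println(System.getProperty("java.library.path"));
    }
}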
Back to the problem: how to fix it
The discussion above has laid out the problem; now, how do we deal with it?

The most fundamental fix is to adjust the environment so that Hadoop can actually load the native library; GzipCodec will then no longer use the buggy class.

But there is also a workaround: tune the Flume sink parameters so that each finish produces one complete file, as sketched below.
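One possible parameterization (a sketch only; the agent/sink names and the counts are illustrative and must be tuned to your traffic) is to align the roll boundary with the flush boundary, so each .gz file is finished exactly once and contains a single complete member:

# Illustrative Flume HDFS sink settings
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip
# roll the file on the same event count at which the sink flushes
a1.sinks.k1.hdfs.batchSize = 10000
a1.sinks.k1.hdfs.rollCount = 10000
# disable size- and time-based rolling so only rollCount applies
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollInterval = 0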