- Peeking into the GZIP format through the JDK's GZIP utilities
- A joke in Flume's HDFSCompressedDataStream
- Moving the battlefield to the Hadoop GzipCodec source
- Back to the problem: how to handle it
Peeking into the GZIP format through the JDK's GZIP utilities
```java
class GZIPOutputStream extends DeflaterOutputStream {
    public GZIPOutputStream(OutputStream out, int size, boolean syncFlush)
            throws IOException {
        // nowrap = true: raw deflate, since gzip writes its own header/trailer
        super(out, new Deflater(Deflater.DEFAULT_COMPRESSION, true), size, syncFlush);
        writeHeader();
    }

    public void finish() throws IOException {
        if (!def.finished()) {
            def.finish();
            while (!def.finished()) {
                int len = def.deflate(buf, 0, buf.length);
                if (def.finished() && len <= buf.length - TRAILER_SIZE) {
                    // last deflater buffer. Fit trailer at the end
                    writeTrailer(buf, len);
                    len = len + TRAILER_SIZE;
                    out.write(buf, 0, len);
                    return;
                }
                if (len > 0)
                    out.write(buf, 0, len);
            }
            // if we can't fit the trailer at the end of the last
            // deflater buffer, we write it separately
            byte[] trailer = new byte[TRAILER_SIZE];
            writeTrailer(trailer, 0);
            out.write(trailer);
        }
    }
}
```
From the source above we can learn the following:

- GZIPOutputStream extends DeflaterOutputStream
- the constructor creates the underlying DeflaterOutputStream and writes the header
- finish() writes the trailer once all data has been compressed

In short:

- The relationship between Gzip and Deflater: Gzip is a file format, while Deflate is the compression algorithm
- The Gzip format: data compressed with the Deflate algorithm, wrapped between a Header and a Trailer
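The header/trailer framing described above can be checked directly. A minimal sketch (the class name and helper are my own) that compresses a byte array with the JDK's GZIPOutputStream and then inspects the magic bytes in the header and the CRC32/ISIZE fields in the trailer:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.GZIPOutputStream;

public class GzipFormatDemo {
    // Compress data with the JDK's GZIPOutputStream and return the raw bytes.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    // Read a 4-byte little-endian unsigned int at the given offset.
    static long leUInt(byte[] b, int off) {
        return (b[off] & 0xFFL)
             | (b[off + 1] & 0xFFL) << 8
             | (b[off + 2] & 0xFFL) << 16
             | (b[off + 3] & 0xFFL) << 24;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello gzip".getBytes(StandardCharsets.UTF_8);
        byte[] out = gzip(data);

        // Header written by writeHeader(): magic 0x1f 0x8b, then CM = 8 (deflate)
        System.out.printf("magic=%02x%02x cm=%d%n", out[0] & 0xFF, out[1] & 0xFF, out[2]);

        // Trailer written by writeTrailer(): CRC32 of the uncompressed data, then ISIZE
        CRC32 crc = new CRC32();
        crc.update(data);
        System.out.println("crc matches:   " + (leUInt(out, out.length - 8) == crc.getValue()));
        System.out.println("isize matches: " + (leUInt(out, out.length - 4) == data.length));
    }
}
```

Between those 10 header bytes and 8 trailer bytes sits the raw deflate stream produced by `new Deflater(..., true)`.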
A joke in Flume's HDFSCompressedDataStream
```java
@Override
public void sync() throws IOException {
    // We must use finish() and resetState() here -- flush() is apparently not
    // supported by the compressed output streams (it's a no-op).
    // Also, since resetState() writes headers, avoid calling it without an
    // additional write/append operation.
    // Note: There are bugs in Hadoop & JDK w/ pure-java gzip; see HADOOP-8522.
    serializer.flush();
    if (!isFinished) {
        cmpOut.finish(); // triggers the finish() logic shown above
        isFinished = true;
    }
    fsOut.flush();
    hflushOrSync(this.fsOut);
}
```
When is sync() called?

- when the batchCounter in write() reaches batchSize
- after HDFSEventSink finishes processing a transaction
Notice this remarkable comment:

> Note: There are bugs in Hadoop & JDK w/ pure-java gzip; see HADOOP-8522.

Excerpted from https://issues.apache.org/jira/browse/HADOOP-8522:
**ResetableGzipOutputStream creates invalid gzip files when finish() and resetState() are used**

Description

> ResetableGzipOutputStream creates invalid gzip files when finish() and resetState() are used. The issue is that finish() flushes the compressor buffer and writes the gzip CRC32 + data length trailer. After that, resetState() does not repeat the gzip header, but simply starts writing more deflate-compressed data. The resultant files are not readable by the Linux "gunzip" tool. ResetableGzipOutputStream should write valid multi-member gzip files.
To restate the description (for ResetableGzipOutputStream):

- at the start, the Header is written
- then, compressed data is written
- on finish(), the Trailer is written
- when resetState() is called afterwards, more compressed data is written and the cycle repeats, but no new Header is emitted

For a .gz file produced this way, gunzip can only decode the first block of data; every block after it is unreadable. That is the pure-java gzip bug.
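By contrast, what the JIRA asks for is a valid *multi-member* gzip file: each member carries its own header and trailer, and gunzip decodes all of them. A small sketch of that layout (the class name is my own; it assumes a reasonably modern JDK, whose GZIPInputStream handles concatenated members):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class MultiMemberGzipDemo {
    // One complete gzip member: header + deflate data + trailer.
    static byte[] member(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    // Concatenate two complete members, then decompress the whole stream.
    static String roundTrip() throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        file.write(member("hello "));
        file.write(member("world"));

        GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(file.toByteArray()));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip()); // both members are decoded
    }
}
```

The buggy ResetableGzipOutputStream differs from this layout in exactly one way: the second member is missing its header, so decoders stop after the first one.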
So, by this logic, Flume's HDFS Sink obviously "cannot" use Gzip compression. And yet, remarkably, the Flume documentation says it is supported (facepalm).
Moving the battlefield to the Hadoop GzipCodec source
```java
public class GzipCodec extends DefaultCodec {
    public CompressionOutputStream createOutputStream(OutputStream out)
            throws IOException {
        return (CompressionOutputStream)(ZlibFactory.isNativeZlibLoaded(this.conf)
            ? new CompressorStream(out, this.createCompressor(),
                  this.conf.getInt("io.file.buffer.size", 4096))
            : new GzipCodec.GzipOutputStream(out));
    }

    public CompressionOutputStream createOutputStream(OutputStream out,
            Compressor compressor) throws IOException {
        return (CompressionOutputStream)(compressor != null
            ? new CompressorStream(out, compressor,
                  this.conf.getInt("io.file.buffer.size", 4096))
            : this.createOutputStream(out));
    }
}
```
Clearly, GzipCodec provides two Gzip implementations:

- GzipOutputStream — exactly the "joke" described above (and GzipCodec does not even provide a matching InputStream for it)
- CompressorStream — chosen only when ZlibFactory.isNativeZlibLoaded(this.conf) returns true
So what on earth is this native zlib? Tracing the code, we find:
```java
public class ZlibCompressor implements Compressor {
    static {
        if (NativeCodeLoader.isNativeCodeLoaded()) {
            try {
                // Initialize the native library
                initIDs();
                nativeZlibLoaded = true;
            } catch (Throwable t) {
                // Ignore failure to load/initialize native-zlib
            }
        }
    }
}
```
Combined with some other material, we can draw this conclusion:

Hadoop tries to load its native library, and when GzipCodec is used it decides, based on whether the native zlib library was loaded, whether to fall back to GzipOutputStream (the buggy class). If that still sounds abstract, everyone has surely seen this WARN:

The familiar WARN: Unable to load native-hadoop library for your platform

This WARN is the message logged when Hadoop cannot find, and therefore cannot load, the native library.
```java
public final class NativeCodeLoader {
    static {
        try {
            System.loadLibrary("hadoop");
            LOG.debug("Loaded the native-hadoop library");
            nativeCodeLoaded = true;
        } catch (Throwable t) {
            // Ignore failure to load
            LOG.debug("Failed to load native-hadoop with error: " + t);
            LOG.debug("java.library.path=" + System.getProperty("java.library.path"));
        }
        if (!nativeCodeLoaded) {
            LOG.warn("Unable to load native-hadoop library for your platform... " +
                     "using builtin-java classes where applicable");
        }
    }
}
```
Look up System.loadLibrary() yourself and you will quickly run into this property: java.library.path. That is too big a topic to continue here.
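As a quick pointer: System.loadLibrary("hadoop") maps the bare name to a platform-specific file name and then searches the directories listed in java.library.path for it. A tiny sketch (class name is my own):

```java
public class LibraryPathDemo {
    public static void main(String[] args) {
        // "hadoop" becomes libhadoop.so on Linux, libhadoop.dylib on macOS,
        // hadoop.dll on Windows
        System.out.println(System.mapLibraryName("hadoop"));
        // the directories System.loadLibrary() searches
        System.out.println(System.getProperty("java.library.path"));
    }
}
```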
Back to the problem: how to handle it
The discussion above has largely described the problem; what remains is how to deal with it:

The most fundamental fix is to adjust the environment so that Hadoop can actually load the native library. GzipCodec will then skip the buggy class and compress with native zlib instead.