- Peeking at the GZIP format through the JDK's GZIP utilities
- java.util.zip.GZIPOutputStream
- java.util.zip.DeflaterOutputStream
- A "joke" in Flume's HDFSCompressedDataStream
- Moving the battlefield to the Hadoop GzipCodec source
- Back to the problem: how to handle it
Peeking at the GZIP format through the JDK's GZIP utilities
java.util.zip.GZIPOutputStream
public class GZIPOutputStream extends DeflaterOutputStream {

    public GZIPOutputStream(OutputStream out, int size, boolean syncFlush) throws IOException {
        super(out, new Deflater(Deflater.DEFAULT_COMPRESSION, true), size, syncFlush);
        writeHeader();
    }

    /**
     * Writes array of bytes to the compressed output stream. This method
     * will block until all the bytes are written.
     * @param buf the data to be written
     * @param off the start offset of the data
     * @param len the length of the data
     * @exception IOException If an I/O error has occurred.
     */
    public synchronized void write(byte[] buf, int off, int len) throws IOException {
        super.write(buf, off, len);
        crc.update(buf, off, len);
    }

    public void finish() throws IOException {
        if (!def.finished()) {
            def.finish();
            while (!def.finished()) {
                int len = def.deflate(buf, 0, buf.length);
                if (def.finished() && len <= buf.length - TRAILER_SIZE) {
                    // last deflater buffer. Fit trailer at the end
                    writeTrailer(buf, len);
                    len = len + TRAILER_SIZE;
                    out.write(buf, 0, len);
                    return;
                }
                if (len > 0)
                    out.write(buf, 0, len);
            }
            // if we can't fit the trailer at the end of the last
            // deflater buffer, we write it separately
            byte[] trailer = new byte[TRAILER_SIZE];
            writeTrailer(trailer, 0);
            out.write(trailer);
        }
    }

    /*
     * Writes GZIP member header.
     */
    private void writeHeader() throws IOException {
        out.write(new byte[] {
            (byte) GZIP_MAGIC,        // Magic number (short)
            (byte)(GZIP_MAGIC >> 8),  // Magic number (short)
            Deflater.DEFLATED,        // Compression method (CM)
            0,                        // Flags (FLG)
            0,                        // Modification time MTIME (int)
            0,                        // Modification time MTIME (int)
            0,                        // Modification time MTIME (int)
            0,                        // Modification time MTIME (int)
            0,                        // Extra flags (XFLG)
            0                         // Operating system (OS)
        });
    }

    /*
     * Writes GZIP member trailer to a byte array, starting at a given
     * offset.
     */
    private void writeTrailer(byte[] buf, int offset) throws IOException {
        writeInt((int)crc.getValue(), buf, offset);  // CRC-32 of uncompr. data
        writeInt(def.getTotalIn(), buf, offset + 4); // Number of uncompr. bytes
    }
}
From the source above, we can take away the following:
- GZIPOutputStream extends DeflaterOutputStream
- constructor: creates the DeflaterOutputStream and writes the gzip header
- write: delegates the write to DeflaterOutputStream and updates the CRC
- finish: once all data has been deflated, writes the trailer
In short: GZIPOutputStream wraps a header and a trailer around DeflaterOutputStream.
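To see those three parts in action, here is a minimal sketch (the class name and sample string are mine) that compresses a few bytes in memory and inspects the result:

import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;

public class GzipLayoutDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(bytes);
        gz.write("hello gzip".getBytes("UTF-8")); // deflates and updates the CRC
        gz.finish();                              // appends the 8-byte trailer
        byte[] data = bytes.toByteArray();
        // writeHeader() put GZIP_MAGIC (0x8b1f) at the very front
        System.out.printf("magic: %02x %02x%n", data[0] & 0xff, data[1] & 0xff); // prints: 1f 8b
        // layout: 10-byte header + deflate data + 8-byte trailer
        System.out.println("total length: " + data.length);
    }
}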
java.util.zip.DeflaterOutputStream
public class DeflaterOutputStream extends FilterOutputStream {

    /**
     * Writes an array of bytes to the compressed output stream. This
     * method will block until all the bytes are written.
     * @param b the data to be written
     * @param off the start offset of the data
     * @param len the length of the data
     * @exception IOException if an I/O error has occurred
     */
    public void write(byte[] b, int off, int len) throws IOException {
        if (def.finished()) {
            throw new IOException("write beyond end of stream");
        }
        if ((off | len | (off + len) | (b.length - (off + len))) < 0) {
            throw new IndexOutOfBoundsException();
        } else if (len == 0) {
            return;
        }
        if (!def.finished()) {
            // the Deflater API is worth studying on its own if you are interested
            def.setInput(b, off, len);
            while (!def.needsInput()) {
                deflate();
            }
        }
    }

    /**
     * Writes next block of compressed data to the output stream.
     * @throws IOException if an I/O error has occurred
     */
    protected void deflate() throws IOException {
        int len = def.deflate(buf, 0, buf.length);
        if (len > 0) {
            out.write(buf, 0, len);
        }
    }
}
NOTE: As the write method shows, every call to write pushes its input straight into the deflater. So to get a decent compression ratio, it is best not to feed in lots of tiny byte arrays, or the compression effect will be marginal; a sketch of one mitigation follows below.
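If the producer can only hand over small chunks, one mitigation (my suggestion, not something the JDK mandates) is to put a BufferedOutputStream in front of the GZIPOutputStream, so that def.setInput receives reasonably sized arrays:

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class BufferedGzipWriter {
    public static void main(String[] args) throws Exception {
        // Small writes are coalesced into 8 KB chunks before reaching the deflater
        OutputStream out = new BufferedOutputStream(
                new GZIPOutputStream(new FileOutputStream("records.gz")), 8192);
        for (int i = 0; i < 10000; i++) {
            out.write("one small record\n".getBytes("UTF-8")); // tiny writes, buffered
        }
        out.close(); // flushes the buffer and finishes the gzip stream
    }
}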
Interim summary
- The relationship between Gzip and Deflate: Gzip is a file format, while DEFLATE is a compression algorithm (Deflater is the JDK's implementation of it); a verifying sketch follows below
- The Gzip format: Gzip compresses the payload with the DEFLATE algorithm and wraps it in a header and a trailer
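That relationship is easy to observe directly: a raw Deflater constructed with nowrap=true, exactly the flag GZIPOutputStream passes in its constructor above, produces the kind of bytes that sit between the gzip header and trailer. A quick sketch (class name and input are arbitrary):

import java.util.zip.Deflater;

public class RawDeflateDemo {
    public static void main(String[] args) {
        byte[] input = "gzip = header + deflate data + trailer".getBytes();
        // nowrap=true: raw deflate with no zlib wrapper, as in GZIPOutputStream
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        def.setInput(input);
        def.finish();
        byte[] buf = new byte[256];
        int n = def.deflate(buf);
        def.end();
        // A gzip member is then: 10-byte header + these n bytes + 8-byte trailer
        System.out.println("raw deflate bytes: " + n);
    }
}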
A "joke" in Flume's HDFSCompressedDataStream
@Override
public void sync() throws IOException {
    // We must use finish() and resetState() here -- flush() is apparently not
    // supported by the compressed output streams (it's a no-op).
    // Also, since resetState() writes headers, avoid calling it without an
    // additional write/append operation.
    // Note: There are bugs in Hadoop & JDK w/ pure-java gzip; see HADOOP-8522.
    serializer.flush();
    if (!isFinished) {
        cmpOut.finish(); // triggers the finish logic shown above
        isFinished = true;
    }
    fsOut.flush();
    hflushOrSync(this.fsOut);
}
When is sync called?
- When batchCounter (a counter) in write reaches batchSize
- After HDFSEventSink finishes processing a transaction
Notice the curious comment in there:
Note: There are bugs in Hadoop & JDK w/ pure-java gzip; see HADOOP-8522.
Excerpted from https://issues.apache.org/jira/browse/HADOOP-8522:
ResetableGzipOutputStream creates invalid gzip files when finish() and resetState() are used
Description
ResetableGzipOutputStream creates invalid gzip files when finish() and resetState() are used. The issue is that finish() flushes the compressor buffer and writes the gzip CRC32 + data length trailer. After that, resetState() does not repeat the gzip header, but simply starts writing more deflate-compressed data. The resultant files are not readable by the Linux "gunzip" tool. ResetableGzipOutputStream should write valid multi-member gzip files.
Paraphrasing the description (for ResetableGzipOutputStream):
- At the start, the header is written
- Then, compressed data is written
- On finish, the trailer is written
- A later call to resetState does not write a new header; it just appends more compressed data, and the cycle repeats
For a .gz file produced by this sequence, gunzip can only decode the first chunk of data. That is the pure-java gzip bug.
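What HADOOP-8522 asks for, a valid multi-member file, can be produced with plain JDK classes by writing one complete member (header, data, trailer) per reset. The sketch below (relying on GZIPInputStream's multi-member support in modern JDKs; gunzip handles such files too) demonstrates the idea:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class MultiMemberGzip {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Each iteration writes one complete member: header + deflate data + trailer
        for (String s : new String[] {"first member\n", "second member\n"}) {
            GZIPOutputStream gz = new GZIPOutputStream(bytes);
            gz.write(s.getBytes("UTF-8"));
            gz.finish(); // trailer written; the next iteration starts with a fresh header
        }
        // GZIPInputStream (like gunzip) decodes concatenated members transparently
        GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        byte[] buf = new byte[1024];
        int n;
        while ((n = in.read(buf)) > 0) {
            System.out.print(new String(buf, 0, n, "UTF-8")); // prints both lines
        }
    }
}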
So, clearly, by this logic the Flume HDFS Sink cannot use Gzip for compression. Oddly enough, though, the Flume documentation still claims it is supported (facepalm.png).
Moving the battlefield to the Hadoop GzipCodec source
public class GzipCodec extends DefaultCodec {

    public CompressionOutputStream createOutputStream(OutputStream out) throws IOException {
        return (CompressionOutputStream)(ZlibFactory.isNativeZlibLoaded(this.conf)
            ? new CompressorStream(out, this.createCompressor(),
                                   this.conf.getInt("io.file.buffer.size", 4096))
            : new GzipCodec.GzipOutputStream(out));
    }

    public CompressionOutputStream createOutputStream(OutputStream out, Compressor compressor)
            throws IOException {
        return (CompressionOutputStream)(compressor != null
            ? new CompressorStream(out, compressor,
                                   this.conf.getInt("io.file.buffer.size", 4096))
            : this.createOutputStream(out));
    }
}
Clearly, GzipCodec provides two Gzip implementations:
- GzipOutputStream, which is exactly the "joke" described above
- CompressorStream, whose use depends on the result of ZlibFactory.isNativeZlibLoaded(this.conf); a sketch of how to exercise this choice follows below
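To exercise that choice in practice, one can instantiate the codec and write through it; which implementation backs the stream depends entirely on whether native zlib was loaded. A sketch (output file name arbitrary):

import java.io.FileOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class GzipCodecDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        // Native zlib loaded -> CompressorStream; otherwise -> the buggy GzipOutputStream
        CompressionOutputStream out = codec.createOutputStream(new FileOutputStream("demo.gz"));
        out.write("hello codec".getBytes("UTF-8"));
        out.finish();
        out.close();
    }
}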
So what on earth is this native zlib? Tracing the code, we find:
public class ZlibCompressor implements Compressor {
    static {
        if (NativeCodeLoader.isNativeCodeLoaded()) {
            try {
                // Initialize the native library
                initIDs();
                nativeZlibLoaded = true;
            } catch (Throwable t) {
                // Ignore failure to load/initialize native-zlib
            }
        }
    }
}
Combining this with some other material, we can draw the following conclusion:
Hadoop tries to load its native library, and whenever GzipCodec is used it decides, based on whether native zlib has been loaded, whether to fall back to GzipOutputStream (the buggy class). If that still sounds abstract, you have almost certainly seen this WARN before:
The familiar WARN: Unable to load native-hadoop library for your platform
This WARN is exactly what Hadoop logs when it cannot find the native library and the load fails:
public final class NativeCodeLoader {
    static {
        try {
            System.loadLibrary("hadoop");
            LOG.debug("Loaded the native-hadoop library");
            nativeCodeLoaded = true;
        } catch (Throwable t) {
            // Ignore failure to load
            LOG.debug("Failed to load native-hadoop with error: " + t);
            LOG.debug("java.library.path=" + System.getProperty("java.library.path"));
        }
        if (!nativeCodeLoaded) {
            LOG.warn("Unable to load native-hadoop library for your platform... " +
                     "using builtin-java classes where applicable");
        }
    }
}
Look up System.loadLibrary() for yourself and you will quickly run into java.library.path, the path the JVM searches for native libraries; a small diagnostic sketch follows.
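The sketch below simply prints the same things the code above checks, which is often enough to diagnose a missing native library:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.zlib.ZlibFactory;
import org.apache.hadoop.util.NativeCodeLoader;

public class NativeLibCheck {
    public static void main(String[] args) {
        // Where System.loadLibrary("hadoop") searches for libhadoop
        System.out.println("java.library.path = " + System.getProperty("java.library.path"));
        // The same checks GzipCodec relies on when picking an implementation
        System.out.println("native hadoop loaded: " + NativeCodeLoader.isNativeCodeLoaded());
        System.out.println("native zlib loaded:   " + ZlibFactory.isNativeZlibLoaded(new Configuration()));
    }
}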
That is too large a topic to go any further into here.
Back to the problem: how to handle it
The discussion above has largely characterized the problem; what remains is how to deal with it:
The most fundamental fix is to adjust the environment so that Hadoop actually loads the native library (i.e., libhadoop and native zlib can be found via java.library.path); GzipCodec then never falls back to the buggy class.
There is also a workaround: tune the Flume parameters so that each finish in the sink produces one complete file, as sketched below.
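A hedged sketch of the relevant HDFS sink properties (agent/sink names a1/k1 are placeholders and the numbers are purely illustrative, not a verified recipe; the idea is to align the roll settings with batchSize so a file is rolled right after its single finish):

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip
# align batch and roll counts so each file sees exactly one finish()
a1.sinks.k1.hdfs.batchSize = 10000
a1.sinks.k1.hdfs.rollCount = 10000
# disable the other roll triggers so they cannot split a batch mid-file
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 0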