Flume Source Code Analysis: HDFS Sink

Reposted from: https://blog.csdn.net/zjerryj/article/details/82937232

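The last stage of an Apache Flume data flow is the Sink, which delivers the data that upstream components have extracted and transformed to external storage such as local files, HDFS, or ElasticSearch. This article walks through the source code to show how the HDFS Sink works.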

Lifecycle of the Sink Component

Classes in the HDFS Sink Module

The source code of the HDFS Sink module lives in the flume-hdfs-sink subdirectory and consists mainly of the following classes:

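The HDFSEventSink class implements the lifecycle methods, including configure, start, process, and stop. Once started, it maintains a set of BucketWriter instances, each corresponding to one HDFS output file path. Events from upstream are handed to the matching BucketWriter, which writes them to HDFS. Depending on the HDFSWriter implementation in use, the data can be written as a text file, a compressed file, or a SequenceFile.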

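Configuration and Startup

When Flume loads its configuration file, it instantiates each component and calls its configure method, and Sink components are no exception. In HDFSEventSink#configure, the program reads the configuration entries prefixed with hdfs., supplies default values, and performs basic validation. For example, batchSize must be greater than zero, and codeC must be specified when fileType is CompressedStream. The method also initializes a SinkCounter, which collects metrics at runtime.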

public void configure(Context context) {
  filePath = Preconditions.checkNotNull(
      context.getString("hdfs.path"), "hdfs.path is required");
  rollInterval = context.getLong("hdfs.rollInterval", defaultRollInterval);

  if (sinkCounter == null) {
    sinkCounter = new SinkCounter(getName());
  }
}
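
The validation rules described above can be sketched with nothing but the JDK. This is a minimal illustration, not Flume's code: the class name HdfsSinkConfigSketch is hypothetical, a plain Map stands in for Flume's Context, and the default values are chosen only for the example.

```java
import java.util.Map;

/** Hypothetical settings holder; it only mimics the checks described above. */
final class HdfsSinkConfigSketch {
  final String path;
  final long batchSize;
  final String fileType;
  final String codec; // null when hdfs.codeC is not set

  HdfsSinkConfigSketch(Map<String, String> props) {
    // Required key: fail fast when it is missing.
    path = props.get("hdfs.path");
    if (path == null) {
      throw new IllegalArgumentException("hdfs.path is required");
    }
    // Optional keys fall back to defaults (values chosen for illustration).
    batchSize = Long.parseLong(props.getOrDefault("hdfs.batchSize", "100"));
    fileType = props.getOrDefault("hdfs.fileType", "SequenceFile");
    codec = props.get("hdfs.codeC");

    // Basic validation, mirroring the two rules mentioned in the text.
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be greater than 0");
    }
    if ("CompressedStream".equals(fileType) && codec == null) {
      throw new IllegalArgumentException("codeC is required for CompressedStream");
    }
  }
}
```

Failing in the constructor keeps a misconfigured sink from ever reaching the start phase, which is the same reason the real checks sit in configure.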

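HDFSEventSink#start creates two thread pools. callTimeoutPool is used in BucketWriter#callWithTimeout to bound the duration of remote HDFS calls, since calls such as FileSystem#create or FSDataOutputStream#hflush may time out. timedRollerPool is used to roll files, provided that the user has configured the rollInterval option. This is described in detail later in this article.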

public void start() {
  callTimeoutPool = Executors.newFixedThreadPool(threadsPoolSize,
      new ThreadFactoryBuilder().setNameFormat(timeoutName).build());
  timedRollerPool = Executors.newScheduledThreadPool(rollTimerPoolSize,
      new ThreadFactoryBuilder().setNameFormat(rollerName).build());
}
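
To show how a thread pool can bound the duration of a blocking call, here is a minimal sketch of the timeout pattern. It is not the body of BucketWriter#callWithTimeout: the class name CallTimeoutSketch, the millisecond parameter, and the choice to wrap failures in IOException are assumptions made for the example.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class CallTimeoutSketch {
  /** Runs the call on the pool and waits at most timeoutMs for its result. */
  static <T> T callWithTimeout(ExecutorService pool, Callable<T> call, long timeoutMs)
      throws IOException, InterruptedException {
    Future<T> future = pool.submit(call);
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Give up on the call and interrupt the worker thread.
      future.cancel(true);
      throw new IOException("call timed out after " + timeoutMs + " ms", e);
    } catch (ExecutionException e) {
      // The call itself failed; surface its cause.
      throw new IOException("call failed", e.getCause());
    }
  }
}
```

The caller's thread never blocks longer than the timeout, even if the remote call hangs, because the blocking work happens on a pool thread.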

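Processing Data

The process method contains the main logic of the HDFS Sink: it takes events from the upstream Channel and writes them to the designated HDFS file. The flow is as follows: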

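Channel Transaction

The outermost layer of the processing logic is a Channel transaction with exception handling. Take the Kafka Channel as an example: when the transaction begins, the program reads data from Kafka but does not immediately commit the advanced offsets. Only after the messages have been written to the HDFS file successfully are the offsets committed to Kafka, so the next loop starts consuming from the new offsets.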

Channel channel = getChannel();
Transaction transaction = channel.getTransaction();
transaction.begin();
try {
  Event event = channel.take();
  bucketWriter.append(event);
  // Commit only after the write has succeeded
  transaction.commit();
} catch (Throwable th) {
  // Roll back so that the events are delivered again
  transaction.rollback();
  throw new EventDeliveryException(th);
} finally {
  transaction.close();
}
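
The effect of commit and rollback can be seen in a small in-memory model. This is a sketch under stated assumptions: TxChannelSketch is a hypothetical class, events are plain strings, and the failWrite flag simulates a failed HDFS write. A real Channel implementation is considerably more involved.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Hypothetical in-memory channel; it models only take, commit, and rollback. */
final class TxChannelSketch {
  private final Deque<String> queue = new ArrayDeque<>();
  private final Deque<String> taken = new ArrayDeque<>();

  void put(String event) { queue.addLast(event); }

  String take() {
    String e = queue.pollFirst();
    if (e != null) taken.addLast(e); // remember it in case of rollback
    return e;
  }

  void commit() { taken.clear(); }

  void rollback() {
    // Put the taken events back at the head, preserving their order.
    while (!taken.isEmpty()) queue.addFirst(taken.pollLast());
  }

  int size() { return queue.size(); }

  /** Delivers one event to the sink list; returns true when it was committed. */
  static boolean deliverOne(TxChannelSketch ch, List<String> sink, boolean failWrite) {
    try {
      String e = ch.take();
      if (failWrite) throw new IllegalStateException("simulated HDFS failure");
      if (e != null) sink.add(e);
      ch.commit();
      return true;
    } catch (RuntimeException ex) {
      ch.rollback();
      return false;
    }
  }
}
```

After a failed delivery the event is back at the head of the queue, so the next attempt sees it again. This is the at-least-once behaviour that the Kafka offset example describes.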

Finding or Creating a BucketWriter

BucketWriter instances map one-to-one to HDFS files. The file path is generated from the configuration, for example:

a1.sinks.access_log.hdfs.path = /user/flume/access_log/dt=%Y%m%d
a1.sinks.access_log.hdfs.filePrefix = events.%[localhost]
a1.sinks.access_log.hdfs.inUsePrefix = .
a1.sinks.access_log.hdfs.inUseSuffix = .tmp
a1.sinks.access_log.hdfs.rollInterval = 300
a1.sinks.access_log.hdfs.fileType = CompressedStream
a1.sinks.access_log.hdfs.codeC = lzop

With the configuration above, the temporary file and the target file are:

/user/flume/access_log/dt=20180925/.events.hostname1.1537848761307.lzo.tmp
/user/flume/access_log/dt=20180925/events.hostname1.1537848761307.lzo

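The placeholders in the configuration are replaced by BucketPath#escapeString. Flume supports three kinds of placeholders:

%{…}: replaced with values from the event headers;
%[…]: currently only %[localhost], %[ip], and %[fqdn] are supported;
%x: date placeholders, generated from the timestamp header, or from local time when the useLocalTimeStamp option is set.

The prefixes and suffixes of the file name are appended in BucketWriter#open. In the code below, counter is the timestamp at which the current file was created, and lzo is the default file extension of the current compression format.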

String fullFileName = fileName + "." + counter;
fullFileName += fileSuffix;
fullFileName += codeC.getDefaultExtension();
bucketPath = filePath + "/" + inUsePrefix + fullFileName + inUseSuffix;
targetPath = filePath + "/" + fullFileName;
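
Placeholder substitution and file name assembly can be tried out together in a small sketch. It is not BucketPath#escapeString: the class BucketPathSketch is hypothetical, it handles only %{header}, %[localhost], %Y, %m, and %d, and it formats dates in UTC, which is an assumption made to keep the example deterministic.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BucketPathSketch {
  private static final Pattern HEADER = Pattern.compile("%\\{([^}]+)\\}");

  /** Expands a small subset of the placeholders described above. */
  static String escape(String in, Map<String, String> headers, String host, long tsMillis) {
    // %{name}: header lookup; missing headers become empty strings here.
    Matcher m = HEADER.matcher(in);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
      String value = headers.getOrDefault(m.group(1), "");
      m.appendReplacement(sb, Matcher.quoteReplacement(value));
    }
    m.appendTail(sb);
    // Date parts come from the event timestamp (UTC in this sketch).
    ZonedDateTime t = Instant.ofEpochMilli(tsMillis).atZone(ZoneOffset.UTC);
    return sb.toString()
        .replace("%[localhost]", host)
        .replace("%Y", String.format("%04d", t.getYear()))
        .replace("%m", String.format("%02d", t.getMonthValue()))
        .replace("%d", String.format("%02d", t.getDayOfMonth()));
  }

  /** Returns { in-use path, target path }, composed as in the excerpt above. */
  static String[] paths(String dir, String fileName, long counter, String extension,
                        String inUsePrefix, String inUseSuffix) {
    String fullFileName = fileName + "." + counter + extension;
    return new String[] {
        dir + "/" + inUsePrefix + fullFileName + inUseSuffix,
        dir + "/" + fullFileName
    };
  }
}
```

Calling escape on "/user/flume/access_log/dt=%Y%m%d" with the timestamp 1537848761307 yields the dt=20180925 directory from the example, and paths then produces the hidden .tmp name alongside the final name.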

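If no BucketWriter instance exists for the given path, the program creates one and, based on the fileType option, builds the matching HDFSWriter instance. Flume supports three types: HDFSSequenceFile, HDFSDataStream, and HDFSCompressedDataStream. The code in these classes performs the actual writes to HDFS.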

bucketWriter = sfWriters.get(lookupPath);
if (bucketWriter == null) {
  hdfsWriter = writerFactory.getWriter(fileType);
  bucketWriter = new BucketWriter(hdfsWriter);
  sfWriters.put(lookupPath, bucketWriter);
}

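Writing and Flushing Data

Before writing data, BucketWriter first checks whether the file is already open. If it is not, BucketWriter asks the associated HDFSWriter to open a new file. Taking HDFSCompressedDataStream as an example: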

public void open(String filePath, CompressionCodec codec) throws IOException {
  Path dstPath = new Path(filePath);
  FileSystem hdfs = dstPath.getFileSystem(conf);
  fsOut = hdfs.append(dstPath);
  // Wrap the HDFS stream in a compressing stream
  compressor = CodecPool.getCompressor(codec, conf);
  cmpOut = codec.createOutputStream(fsOut, compressor);
  serializer = EventSerializerFactory.getInstance(serializerType, cmpOut);
}

public void append(Event e) throws IOException {
  serializer.write(e);
}

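Flume's default serializerType is TEXT, which means BodyTextEventSerializer is used to serialize the data. The event body is written to the output stream as is, without any processing: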

public void write(Event e) throws IOException {
  out.write(e.getBody());
  if (appendNewline) {
    out.write('\n');
  }
}

When a BucketWriter needs to close or reopen a file, it calls HDFSWriter#sync, which in turn calls flush on the serializer and on the output streams:

public void sync() throws IOException {
  serializer.flush();
  cmpOut.finish();
  fsOut.flush();
  hflushOrSync(fsOut);
}

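Starting with Hadoop 0.21.0, Syncable#sync was split into two methods, hflush and hsync. The former only flushes data out of the client's buffer, while the latter guarantees that the data has been written to local disk on the HDFS side. To stay compatible with both the old and the new API, Flume uses Java reflection to determine whether hflush exists, and falls back to sync if it does not. The hflushOrSync call in the code above performs exactly this check.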
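
The reflection fallback can be demonstrated without Hadoop on the classpath. In this sketch, OldStream and NewStream are hypothetical stand-ins for streams from old and new Hadoop versions, and the method is looked up on every call for brevity. It illustrates the technique and is not a copy of Flume's implementation.

```java
import java.lang.reflect.Method;

final class FlushReflectionSketch {
  /** Hypothetical old-style stream that only offers sync(). */
  public static final class OldStream {
    public boolean synced;
    public void sync() { synced = true; }
  }

  /** Hypothetical new-style stream that offers hflush(). */
  public static final class NewStream {
    public boolean flushed;
    public void hflush() { flushed = true; }
  }

  /** Calls hflush() when the stream has it, otherwise sync(); returns the method name used. */
  static String hflushOrSync(Object stream) throws ReflectiveOperationException {
    Method method;
    try {
      method = stream.getClass().getMethod("hflush");
    } catch (NoSuchMethodException e) {
      // Older API: fall back to sync().
      method = stream.getClass().getMethod("sync");
    }
    method.invoke(stream);
    return method.getName();
  }
}
```

Because the method is resolved by name at runtime, the same compiled code runs against either API version. Production code would typically cache the resolved Method rather than look it up on each flush.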

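File Rolling

The HDFS Sink supports three ways of rolling files: by file size, by event count, and by time interval. Rolling by size and by count is decided in BucketWriter#shouldRotate, which is called on every append: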

private boolean shouldRotate() {
  boolean doRotate = false;
  if ((rollCount > 0) && (rollCount <= eventCounter)) {
    doRotate = true;
  }
  if ((rollSize > 0) && (rollSize <= processSize)) {
    doRotate = true;
  }
  return doRotate;
}
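
To see when the check fires, the following sketch wraps the same two conditions in a tiny class that counts how many files would have been closed. RollPolicySketch is a hypothetical name, and checking before each write and then resetting the counters is an assumption about ordering made for this illustration.

```java
final class RollPolicySketch {
  private final long rollCount; // 0 disables rolling by event count
  private final long rollSize;  // 0 disables rolling by size in bytes
  private long eventCounter;
  private long processSize;
  private int filesClosed;

  RollPolicySketch(long rollCount, long rollSize) {
    this.rollCount = rollCount;
    this.rollSize = rollSize;
  }

  private boolean shouldRotate() {
    return (rollCount > 0 && rollCount <= eventCounter)
        || (rollSize > 0 && rollSize <= processSize);
  }

  /** Checks the thresholds first, then accounts for the new event. */
  void append(byte[] body) {
    if (shouldRotate()) {
      // Here the real writer would close the file and open a new one.
      filesClosed++;
      eventCounter = 0;
      processSize = 0;
    }
    eventCounter++;
    processSize += body.length;
  }

  int filesClosed() { return filesClosed; }
}
```

With rollCount set to 3, the fourth append triggers the first roll, so each file holds exactly three events. With size-based rolling a file can exceed rollSize by up to one event, because the check only sees bytes that were already written.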

Rolling by time uses the timedRollerPool thread pool mentioned above, by scheduling a timed task:

private void open() throws IOException, InterruptedException {
  if (rollInterval > 0) {
    Callable<Void> action = new Callable<Void>() {
      public Void call() throws Exception {
        close(true);
        return null;
      }
    };
    // rollInterval is configured in seconds
    timedRollFuture = timedRollerPool.schedule(action, rollInterval, TimeUnit.SECONDS);
  }
}
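
The scheduling pattern can be exercised in isolation with a ScheduledExecutorService. This is a minimal sketch: TimedRollSketch is a hypothetical class, the interval is given in milliseconds so the example finishes quickly, and a CountDownLatch stands in for the actual file close.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

final class TimedRollSketch {
  private final ScheduledExecutorService pool;
  private final CountDownLatch closed = new CountDownLatch(1);
  private ScheduledFuture<?> timedRollFuture;

  TimedRollSketch(ScheduledExecutorService pool) {
    this.pool = pool;
  }

  /** Opens the "file" and, if an interval is set, schedules its close. */
  synchronized void open(long rollIntervalMs) {
    if (rollIntervalMs > 0) {
      Runnable roll = this::close;
      timedRollFuture = pool.schedule(roll, rollIntervalMs, TimeUnit.MILLISECONDS);
    }
  }

  /** Closes the "file" and cancels a pending timed roll, if any. */
  synchronized void close() {
    if (timedRollFuture != null) {
      timedRollFuture.cancel(false);
      timedRollFuture = null;
    }
    closed.countDown();
  }

  boolean awaitClosed(long timeoutMs) throws InterruptedException {
    return closed.await(timeoutMs, TimeUnit.MILLISECONDS);
  }
}
```

Cancelling the future inside close matters when a file is rolled early by size or count: without it, the stale timer would fire later and close a file it no longer belongs to.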

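Closing and Stopping

When HDFSEventSink#stop is triggered, it iterates over all BucketWriter instances and calls their close method, which in turn closes the underlying HDFSWriter. The process is similar to flush, with some additional steps. For example, a closed BucketWriter removes itself from the sfWriters hash map: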

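public synchronized void close(boolean callCloseCallback) {
  writer.close();
  // No timer exists when rollInterval is not configured
  if (timedRollFuture != null) {
    timedRollFuture.cancel(false);
  }
  if (callCloseCallback) {
    onCloseCallback.run(onCloseCallbackPath);
  }
}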

The onCloseCallback callback is passed in when HDFSEventSink initializes the BucketWriter:

WriterCallback closeCallback = new WriterCallback() {
  public void run(String bucketPath) {
    synchronized (sfWritersLock) {
      sfWriters.remove(bucketPath);
    }
  }
};
bucketWriter = new BucketWriter(lookupPath, closeCallback);
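
The interplay between the writer map and the close callback can be sketched as follows. WriterRegistrySketch and its nested Writer are hypothetical classes, and java.util.function.Consumer stands in for the WriterCallback interface. The point is only to show a writer deregistering itself on close.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

final class WriterRegistrySketch {
  /** Hypothetical writer that reports its own path when it is closed. */
  static final class Writer {
    private final String path;
    private final Consumer<String> onClose;

    Writer(String path, Consumer<String> onClose) {
      this.path = path;
      this.onClose = onClose;
    }

    void close() {
      onClose.accept(path);
    }
  }

  private final Object lock = new Object();
  private final Map<String, Writer> writers = new HashMap<>();

  /** Returns the writer for the path, creating and registering it if needed. */
  Writer getOrCreate(String path) {
    synchronized (lock) {
      return writers.computeIfAbsent(path, p -> new Writer(p, this::remove));
    }
  }

  private void remove(String path) {
    synchronized (lock) {
      writers.remove(path);
    }
  }

  int openCount() {
    synchronized (lock) {
      return writers.size();
    }
  }
}
```

Because the callback removes the entry, the next event for the same path finds no writer and creates a fresh one, which is how a rolled file gets replaced by a new one.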

Finally, HDFSEventSink shuts down the callTimeoutPool and timedRollerPool thread pools, and the whole component stops.

ExecutorService[] toShutdown = { callTimeoutPool, timedRollerPool };
for (ExecutorService execService : toShutdown) {
  execService.shutdown();
}
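
Calling shutdown only stops a pool from accepting new tasks. A common follow-up is to wait for running tasks to finish. The sketch below shows one way to do that: PoolShutdownSketch and its wait-then-force policy are assumptions for illustration, not a description of what HDFSEventSink does after the loop above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

final class PoolShutdownSketch {
  /** Shuts down each pool in turn; returns true if all terminated within the wait. */
  static boolean shutdownAll(long waitMs, ExecutorService... pools)
      throws InterruptedException {
    boolean clean = true;
    for (ExecutorService pool : pools) {
      pool.shutdown(); // stop accepting tasks; already submitted tasks still run
      if (!pool.awaitTermination(waitMs, TimeUnit.MILLISECONDS)) {
        pool.shutdownNow(); // interrupt whatever is still running
        clean = false;
      }
    }
    return clean;
  }
}
```

Waiting before returning gives in-flight HDFS calls a chance to complete, so fewer files are left in their in-use state.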
