Preface
I am building a DataX sync plugin that consumes data from Kafka and writes it into HIVE. While testing, I used TxtFileWriter as the writer so I could inspect the intermediate results.
The problem
My reader consumes in a while(true) loop, as shown below. The logs showed that data was being read and handed to sendToWriter, yet the output file stayed at 0 bytes.
public void startRead(RecordSender recordSender) {
    LOG.info("[RowInKafkaTask] start to read here.");
    Record record = recordSender.createRecord();
    while (true) {
        // poll Kafka, then loop forever: this reader never returns
        ConsumerRecords<String, String> messages = consumer.poll(Constant.TIME_OUT);
        for (ConsumerRecord<String, String> message : messages) {
            byte[] row = parseRowKeyFromKafkaMsg(message.value(), this.kafkaColumn);
            try {
                boolean result = putDataToRecord(record, row);
                if (result) {
                    LOG.info("[RowInKafkaTask] result is {}", result);
                    recordSender.sendToWriter(record);
                    recordSender.flush();
                } else {
                    LOG.error("[RowInKafkaTask] putDataToRecord false");
                }
            } catch (Exception e) {
                LOG.error("[RowInKafkaTask] exception found.", e);
            }
            record = recordSender.createRecord();
        }
        recordSender.flush();
    }
}
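Worth noting: the recordSender.flush() calls above are reader-side only. The sender is the exchange between the reader thread and the writer thread, and flushing it just pushes locally buffered records into the shared channel; it has no influence on when the writer's own output buffer reaches the disk. A rough sketch of the reader side, loosely modeled on DataX's BufferedRecordExchanger (names and details here are approximate, not the real core code):

import java.util.ArrayList;
import java.util.List;
// Record and Channel stand in for the DataX core types of the same name.

class SketchRecordSender {
    private final List<Record> buffer = new ArrayList<>();
    private final Channel channel;   // in-memory queue shared with the writer thread

    SketchRecordSender(Channel channel) {
        this.channel = channel;
    }

    void sendToWriter(Record record) {
        buffer.add(record);          // records are batched locally first
    }

    void flush() {
        channel.pushAll(buffer);     // hand the batch to the writer thread...
        buffer.clear();              // ...which still buffers again on its own side
    }
}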
Tracking it down:
1. First I suspected the writer never received the data, so I walked through TxtFileWriter and found that the file is written through a FileOutputStream:
@Override
public void startWrite(RecordReceiver lineReceiver) {
    LOG.info("begin do write...");
    String fileFullPath = this.buildFilePath();
    LOG.info(String.format("write to file : [%s]", fileFullPath));
    OutputStream outputStream = null;
    try {
        // the target file is written through a FileOutputStream
        File newFile = new File(fileFullPath);
        newFile.createNewFile();
        outputStream = new FileOutputStream(newFile);
        UnstructuredStorageWriterUtil.writeToStream(lineReceiver,
                outputStream, this.writerSliceConfig, this.fileName,
                this.getTaskPluginCollector());
    } catch (SecurityException se) {
        ...
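For context, writeToStream is where the buffering enters: it wraps the raw FileOutputStream in a character writer before looping over records. A minimal sketch of that wrapping, with the encoding and compression handling simplified away (the real DataX code reads both from config):

// Simplified sketch of the wrapping inside writeToStream: the raw
// FileOutputStream ends up behind a BufferedWriter, so bytes reach the
// file only when that buffer fills, is flushed, or is closed.
BufferedWriter writer = new BufferedWriter(
        new OutputStreamWriter(outputStream, "UTF-8"));
doWriteToStream(lineReceiver, writer, context, config, taskPluginCollector);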
Next I suspected the records weren't making it across, so I added a log line to the doWriteToStream method in UnstructuredStorageWriterUtil.java:
private static void doWriteToStream(RecordReceiver lineReceiver,
        BufferedWriter writer, String contex, Configuration config,
        TaskPluginCollector taskPluginCollector) throws IOException {
    // ... omitted
    Record record = null;
    while ((record = lineReceiver.getFromReader()) != null) {
        // added log line to confirm records actually arrive here
        LOG.info("[Unstructured..Util] write one record.");
        UnstructuredStorageWriterUtil.transportOneRecord(record,
                nullFormat, dateParse, taskPluginCollector,
                unstructuredWriter);
    }
    // warn: closing the stream is the caller's responsibility
    // IOUtils.closeQuietly(unstructuredWriter);
}
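That commented-out close is a hint at why short-lived jobs never notice the problem: the caller owns the stream, and presumably closes it once the reader is done, something like the following (an assumption about the caller's shape, not code from this walkthrough):

// Assumed caller shape: BufferedWriter.close() flushes before closing,
// so a job whose reader terminates gets an implicit flush at shutdown.
// A while(true) reader never reaches this point.
try {
    doWriteToStream(lineReceiver, writer, context, config, taskPluginCollector);
} finally {
    IOUtils.closeQuietly(writer);   // close() = flush() + close the underlying stream
}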
2. I recompiled the plugin and replaced the old plugin-unstructured-storage-util-0.0.1-SNAPSHOT.jar under /opt/datax/plugin/writer/txtfilewriter/libs. The new log line showed up, so the records were in fact reaching the writer.
3. I kept pushing more data into Kafka, and the file finally started to grow, but only in roughly 4K increments. After talking it through with a colleague, the suspicion became: is the output buffer never being flushed?
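The stepped growth is exactly how a buffered writer behaves: nothing reaches the file until the internal buffer fills (java.io.BufferedWriter defaults to 8192 chars; the exact step size also depends on the encoder's byte buffer) or until flush()/close() is called. A standalone check, with an arbitrary file path:

import java.io.BufferedWriter;
import java.io.FileWriter;

public class BufferDemo {
    public static void main(String[] args) throws Exception {
        BufferedWriter writer = new BufferedWriter(new FileWriter("/tmp/buffer-demo.txt"));
        writer.write("one short record\n");
        // At this point /tmp/buffer-demo.txt is still 0 bytes: the line
        // sits in BufferedWriter's in-memory buffer.
        Thread.sleep(10_000);   // check the file size from another shell: still 0
        writer.flush();         // only now do the bytes hit the file
        writer.close();
    }
}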
4. To verify, I added a flush() call inside transportOneRecord, as follows:
public static void transportOneRecord(Record record, String nullFormat,
        DateFormat dateParse, TaskPluginCollector taskPluginCollector,
        UnstructuredWriter unstructuredWriter) {
    // warn: default is null
    if (null == nullFormat) {
        nullFormat = "null";
    }
    try {
        // ... omitted: convert the record's columns into splitedRows
        unstructuredWriter.writeOneRecord(splitedRows);
        // newly added line
        unstructuredWriter.flush();
    } catch (Exception e) {
        // warn: dirty data
        taskPluginCollector.collectDirtyRecord(record, e);
    }
}
Recompiled, replaced the jar, and ran it again.
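Flushing after every single record works, but it defeats the purpose of the buffer and costs a flush per record. A gentler variant (my own sketch, not anything in DataX; FLUSH_INTERVAL is an invented constant) would flush every N records in the doWriteToStream loop instead, in which case the per-record flush above would come out again:

// Hypothetical batched flush for the doWriteToStream loop (sketch only):
// bounds how long records can sit in the buffer without flushing per record.
private static final int FLUSH_INTERVAL = 100;   // assumption, not a DataX setting

long written = 0;
Record record = null;
while ((record = lineReceiver.getFromReader()) != null) {
    UnstructuredStorageWriterUtil.transportOneRecord(record, nullFormat,
            dateParse, taskPluginCollector, unstructuredWriter);
    if (++written % FLUSH_INTERVAL == 0) {
        unstructuredWriter.flush();
    }
}
unstructuredWriter.flush();   // drain the tail before the caller closes the stream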
Closing thoughts
The flush clearly did the trick.
This is a DataX gotcha that only shows up when the reader thread stays resident: in a normal job the reader finishes, the writer's stream gets closed, and the close flushes the buffer as a side effect; a while(true) reader never reaches that close, so records can sit in the BufferedWriter indefinitely.
It also conveniently gave me a post to write.