Background
We use the Parquet merge interface provided by the Apache parquet-mr project to merge Parquet files on HDFS. This reduces the number of small files on HDFS and, in turn, the amount of NameNode memory consumed by file metadata.
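For context, a merge operator built on this interface typically looks something like the sketch below. The class and method names (ParquetMerger, merge) are illustrative only, and the deprecated Configuration/Path-based APIs are assumed to be the ones available in the parquet-mr version used on site.

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.schema.MessageType;

public class ParquetMerger {

  // Merge several small Parquet files on HDFS into a single target file
  // by copying their row groups; all inputs are assumed to share one schema.
  public static void merge(Configuration conf, List<Path> inputs, Path output)
      throws IOException {
    // Take the schema and key/value metadata from the first input's footer.
    ParquetMetadata footer = ParquetFileReader.readFooter(
        conf, inputs.get(0), ParquetMetadataConverter.NO_FILTER);
    MessageType schema = footer.getFileMetaData().getSchema();

    ParquetFileWriter writer = new ParquetFileWriter(conf, schema, output);
    writer.start();
    for (Path input : inputs) {
      // appendFile() copies the input's row groups into the output file;
      // this is the call whose handle leak is analyzed below.
      writer.appendFile(conf, input);
    }
    writer.end(footer.getFileMetaData().getKeyValueMetaData());
  }
}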
Problem
After the Parquet file-merge operator went live on site and had been running for a while, "too many open files" errors started to appear in the logs. Counting the file handles held by the process with lsof -p <pid> | wc -l showed the count was approaching the system limit of 65535.
Troubleshooting
Examining the relevant source in org.apache.parquet.hadoop.ParquetFileWriter (appendFile, appendRowGroup, end) and org.apache.parquet.hadoop.ParquetFileReader (appendTo):
// org.apache.parquet.hadoop.ParquetFileWriter#appendFile
public void appendFile(Configuration conf, Path file) throws IOException {
  // the reader returned by open() is never assigned to a variable, so it is never closed
  ParquetFileReader.open(conf, file).appendTo(this);
}

// org.apache.parquet.hadoop.ParquetFileReader#appendTo
public void appendTo(ParquetFileWriter writer) throws IOException {
  // f is the reader's SeekableInputStream; appendRowGroups() only reads from it and does not close it
  writer.appendRowGroups(f, blocks, true);
}

// org.apache.parquet.hadoop.ParquetFileWriter#appendRowGroup
// (appendRowGroups() simply calls this method once per row group)
public void appendRowGroup(SeekableInputStream from, BlockMetaData rowGroup,
                           boolean dropColumns) throws IOException {
  startBlock(rowGroup.getRowCount());

  Map<String, ColumnChunkMetaData> columnsToCopy =
      new HashMap<String, ColumnChunkMetaData>();
  for (ColumnChunkMetaData chunk : rowGroup.getColumns()) {
    columnsToCopy.put(chunk.getPath().toDotString(), chunk);
  }

  List<ColumnChunkMetaData> columnsInOrder =
      new ArrayList<ColumnChunkMetaData>();
  for (ColumnDescriptor descriptor : schema.getColumns()) {
    String path = ColumnPath.get(descriptor.getPath()).toDotString();
    ColumnChunkMetaData chunk = columnsToCopy.remove(path);
    if (chunk != null) {
      columnsInOrder.add(chunk);
    } else {
      throw new IllegalArgumentException(String.format(
          "Missing column '%s', cannot copy row group: %s", path, rowGroup));
    }
  }

  // complain if some columns would be dropped and that's not okay
  if (!dropColumns && !columnsToCopy.isEmpty()) {
    throw new IllegalArgumentException(String.format(
        "Columns cannot be copied (missing from target schema): %s",
        Strings.join(columnsToCopy.keySet(), ", ")));
  }

  // copy the data for all chunks
  long start = -1;
  long length = 0;
  long blockCompressedSize = 0;
  for (int i = 0; i < columnsInOrder.size(); i += 1) {
    ColumnChunkMetaData chunk = columnsInOrder.get(i);

    // get this chunk's start position in the new file
    long newChunkStart = out.getPos() + length;

    // add this chunk to be copied with any previous chunks
    if (start < 0) {
      // no previous chunk included, start at this chunk's starting pos
      start = chunk.getStartingPos();
    }
    length += chunk.getTotalSize();

    if ((i + 1) == columnsInOrder.size() ||
        columnsInOrder.get(i + 1).getStartingPos() != (start + length)) {
      // not contiguous. do the copy now.
      copy(from, out, start, length);
      // reset to start at the next column chunk
      start = -1;
      length = 0;
    }

    currentBlock.addColumn(ColumnChunkMetaData.get(
        chunk.getPath(),
        chunk.getType(),
        chunk.getCodec(),
        chunk.getEncodingStats(),
        chunk.getEncodings(),
        chunk.getStatistics(),
        newChunkStart,
        newChunkStart,
        chunk.getValueCount(),
        chunk.getTotalSize(),
        chunk.getTotalUncompressedSize()));

    blockCompressedSize += chunk.getTotalSize();
  }

  currentBlock.setTotalByteSize(blockCompressedSize);
  endBlock();
}

// org.apache.parquet.hadoop.ParquetFileWriter#end
public void end(Map<String, String> extraMetaData) throws IOException {
  state = state.end();
  LOG.debug("{}: end", out.getPos());
  ParquetMetadata footer = new ParquetMetadata(new FileMetaData(schema, extraMetaData, Version.FULL_VERSION), blocks);
  serializeFooter(footer, out);
  // end() only closes the writer's own output stream; the readers' input streams are untouched
  out.close();
}
The source above shows that the ParquetFileReader constructed by ParquetFileReader.open(conf, file) is never assigned to a variable, so nothing ever closes it. Its f field, the SeekableInputStream opened on the input file, therefore remains open after every append, and the leaked file handles eventually trigger the "too many open files" error. Note that ParquetFileReader implements Closeable, and its close() method closes that underlying stream, so closing the reader is all that is needed to release the handle.
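For reference, ParquetFileReader.close() is roughly the following (simplified; the exact body differs between parquet-mr versions), which is why closing the reader is sufficient:

// Simplified view of org.apache.parquet.hadoop.ParquetFileReader#close:
// whatever else a given version does here, it closes the underlying
// SeekableInputStream f that was opened on the input file.
@Override
public void close() throws IOException {
  f.close();
}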
Solution
Option 1: modify org.apache.parquet.hadoop.ParquetFileWriter.appendFile so that the code holds a reference to the SeekableInputStream (via the reader) and the method's last action closes that stream.
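A rough sketch of what the patched method could look like (the actual patch applied on site may differ slightly):

// Patched org.apache.parquet.hadoop.ParquetFileWriter#appendFile (sketch):
// keep a reference to the reader so that its underlying SeekableInputStream
// can be closed once the append is finished.
public void appendFile(Configuration conf, Path file) throws IOException {
  ParquetFileReader reader = ParquetFileReader.open(conf, file);
  try {
    reader.appendTo(this);
  } finally {
    // closing the reader closes the SeekableInputStream it holds
    reader.close();
  }
}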
Option 2: extend org.apache.parquet.hadoop.ParquetFileWriter, override appendFile, and use the subclass in the operator code, as sketched below.
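A minimal sketch of such a subclass. The class name ClosingParquetFileWriter is illustrative, and the deprecated (Configuration, MessageType, Path) superclass constructor is assumed to be available in the parquet-mr version in use.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.schema.MessageType;

// Subclass that closes the reader (and its SeekableInputStream) after every append.
public class ClosingParquetFileWriter extends ParquetFileWriter {

  public ClosingParquetFileWriter(Configuration conf, MessageType schema, Path file)
      throws IOException {
    super(conf, schema, file);
  }

  @Override
  public void appendFile(Configuration conf, Path file) throws IOException {
    // try-with-resources guarantees the reader and its input stream are closed
    // even if the append fails.
    try (ParquetFileReader reader = ParquetFileReader.open(conf, file)) {
      reader.appendTo(this);
    }
  }
}

The merge operator then constructs ClosingParquetFileWriter instead of ParquetFileWriter; everything else stays unchanged.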
Option 1 requires modifying the parquet-hadoop source jar, which violates the open-closed principle, so option 2 was adopted in the project.
Conclusion
Verification in the test environment confirmed that once the IO stream is closed, the corresponding file handle is released shortly afterwards, so the process no longer accumulates open files.
If you got something out of this, please follow the official account to show your support~