Background
We use the merge interface provided by the Apache parquet-mr project to merge Parquet files on HDFS. This reduces the number of small files on HDFS and, in turn, the NameNode memory consumed by file metadata.
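The merge flow can be sketched as follows. This is a hedged sketch, not the exact production code: ParquetMerger, mergedPath, and smallFiles are illustrative names of ours, the ParquetFileWriter constructor signature matches older parquet-mr releases and may differ in newer ones, and all input files must share the same schema.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.schema.MessageType;

// Sketch: merge several small Parquet files (same schema) into one output file.
public class ParquetMerger {
  public static void merge(Configuration conf, MessageType schema,
                           Path mergedPath, List<Path> smallFiles) throws IOException {
    ParquetFileWriter writer = new ParquetFileWriter(conf, schema, mergedPath);
    writer.start();                         // writes the Parquet magic header
    for (Path p : smallFiles) {
      writer.appendFile(conf, p);           // copies each input's row groups verbatim
    }
    writer.end(Collections.<String, String>emptyMap()); // writes footer, closes output
  }
}
```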
Problem
After the Parquet merge operator went live in the field environment and had run for a while, "too many open files" errors appeared in the logs. Counting the process's open file handles with lsof -p <pid> | wc -l showed the number approaching the configured system maximum of 65535.
Troubleshooting
Inspecting the source of org.apache.parquet.hadoop.ParquetFileWriter (and the appendTo method it triggers on ParquetFileReader):
// ParquetFileWriter
public void appendFile(Configuration conf, Path file) throws IOException {
  // the reader is never assigned to a variable, so nothing can ever close it
  ParquetFileReader.open(conf, file).appendTo(this);
}

// ParquetFileReader
public void appendTo(ParquetFileWriter writer) throws IOException {
  writer.appendRowGroups(f, blocks, true); // f is the reader's SeekableInputStream
}

// ParquetFileWriter; invoked once per row group via appendRowGroups
public void appendRowGroup(SeekableInputStream from, BlockMetaData rowGroup,
                           boolean dropColumns) throws IOException {
  startBlock(rowGroup.getRowCount());

  Map<String, ColumnChunkMetaData> columnsToCopy =
      new HashMap<String, ColumnChunkMetaData>();
  for (ColumnChunkMetaData chunk : rowGroup.getColumns()) {
    columnsToCopy.put(chunk.getPath().toDotString(), chunk);
  }

  List<ColumnChunkMetaData> columnsInOrder =
      new ArrayList<ColumnChunkMetaData>();

  for (ColumnDescriptor descriptor : schema.getColumns()) {
    String path = ColumnPath.get(descriptor.getPath()).toDotString();
    ColumnChunkMetaData chunk = columnsToCopy.remove(path);
    if (chunk != null) {
      columnsInOrder.add(chunk);
    } else {
      throw new IllegalArgumentException(String.format(
          "Missing column '%s', cannot copy row group: %s", path, rowGroup));
    }
  }

  // complain if some columns would be dropped and that's not okay
  if (!dropColumns && !columnsToCopy.isEmpty()) {
    throw new IllegalArgumentException(String.format(
        "Columns cannot be copied (missing from target schema): %s",
        Strings.join(columnsToCopy.keySet(), ", ")));
  }

  // copy the data for all chunks
  long start = -1;
  long length = 0;
  long blockCompressedSize = 0;
  for (int i = 0; i < columnsInOrder.size(); i += 1) {
    ColumnChunkMetaData chunk = columnsInOrder.get(i);

    // get this chunk's start position in the new file
    long newChunkStart = out.getPos() + length;

    // add this chunk to be copied with any previous chunks
    if (start < 0) {
      // no previous chunk included, start at this chunk's starting pos
      start = chunk.getStartingPos();
    }
    length += chunk.getTotalSize();

    if ((i + 1) == columnsInOrder.size() ||
        columnsInOrder.get(i + 1).getStartingPos() != (start + length)) {
      // not contiguous. do the copy now.
      copy(from, out, start, length);
      // reset to start at the next column chunk
      start = -1;
      length = 0;
    }

    currentBlock.addColumn(ColumnChunkMetaData.get(
        chunk.getPath(),
        chunk.getType(),
        chunk.getCodec(),
        chunk.getEncodingStats(),
        chunk.getEncodings(),
        chunk.getStatistics(),
        newChunkStart,
        newChunkStart,
        chunk.getValueCount(),
        chunk.getTotalSize(),
        chunk.getTotalUncompressedSize()));

    blockCompressedSize += chunk.getTotalSize();
  }

  currentBlock.setTotalByteSize(blockCompressedSize);
  endBlock();
}

public void end(Map<String, String> extraMetaData) throws IOException {
  state = state.end();
  LOG.debug("{}: end", out.getPos());
  ParquetMetadata footer = new ParquetMetadata(
      new FileMetaData(schema, extraMetaData, Version.FULL_VERSION), blocks);
  serializeFooter(footer, out);
  out.close(); // closes only the writer's own output stream, not the readers' inputs
}
The source above shows that the ParquetFileReader built by ParquetFileReader.open(conf, file) is never assigned to a variable, so its f field (a SeekableInputStream) can never be closed. Every appendFile call therefore leaks one input stream, and the accumulated handles eventually trigger the "too many open files" error.
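The root cause generalizes: in Java, an unreferenced Closeable holds its file descriptor until the object happens to be garbage-collected, which is neither prompt nor guaranteed, whereas try-with-resources closes it deterministically. A minimal, self-contained sketch of the anti-pattern and the fix (TrackedStream is our stand-in for the reader's SeekableInputStream):

```java
import java.io.ByteArrayInputStream;

public class StreamLeakDemo {
    // An in-memory stream that records whether close() was ever called.
    static class TrackedStream extends ByteArrayInputStream {
        boolean closed = false;
        TrackedStream() { super(new byte[] {1, 2, 3}); }
        @Override public void close() { closed = true; }
    }

    // Anti-pattern (like appendFile): the stream is drained but never closed.
    static boolean leakyRead() {
        TrackedStream s = new TrackedStream();
        while (s.read() != -1) { /* drain */ }
        return s.closed; // false: the handle is still open
    }

    // Fix: hold the reference and close it via try-with-resources.
    static boolean safeRead() {
        TrackedStream s = new TrackedStream();
        try (TrackedStream in = s) {
            while (in.read() != -1) { /* drain */ }
        }
        return s.closed; // true: closed deterministically
    }

    public static void main(String[] args) {
        System.out.println("leaky closed: " + leakyRead()); // prints false
        System.out.println("safe closed: " + safeRead());   // prints true
    }
}
```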
Solution
Option 1: modify org.apache.parquet.hadoop.ParquetFileWriter.appendFile so that it holds a reference to the SeekableInputStream and closes that stream as the method's final statement.
Option 2: subclass org.apache.parquet.hadoop.ParquetFileWriter, override appendFile, and use the subclass in the operator code.
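Option 2 can be sketched as below. The class name ClosingParquetFileWriter is ours, and the constructor signature matches older parquet-mr releases; the key point is that ParquetFileReader implements Closeable, so try-with-resources guarantees its underlying SeekableInputStream is released even if appendTo throws.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.schema.MessageType;

// Sketch of Option 2: same append behavior, but the reader is held and closed.
public class ClosingParquetFileWriter extends ParquetFileWriter {

  public ClosingParquetFileWriter(Configuration conf, MessageType schema, Path file)
      throws IOException {
    super(conf, schema, file);
  }

  @Override
  public void appendFile(Configuration conf, Path file) throws IOException {
    // Hold the reader in a variable so its stream can be closed deterministically.
    try (ParquetFileReader reader = ParquetFileReader.open(conf, file)) {
      reader.appendTo(this);
    }
  }
}
```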
Option 1 requires patching the parquet-hadoop source and rebuilding the jar, which violates the open-closed principle, so the project adopted Option 2.
Conclusion
Verification in the test environment confirmed that once the IO stream is closed explicitly, the corresponding file handle is released shortly afterwards and the open-handle count no longer climbs.
If you found this useful, follow the official account to show your support~