Symptom
After creating a Parquet table in Hive and importing data, the table directory contains files like this:
_success 0M
_commited_ 10M
_started_xxx 0M
part-xxx.parquet 40M
...
Querying the table returns the correct amount of data, but there are three extra files whose names start with an underscore. Since Hive follows the schema-on-read model, why doesn't it read these files?
Cause
Several people have discussed this behavior online, but they either don't explain the reason or explain it poorly:
https://blog.csdn.net/weixin_34357267/article/details/92599750
https://stackoverflow.com/questions/31466884/hive-not-recognizing-underscore-in-path
Digging further into the source of FileInputFormat.class (the Hadoop input format base class that Hive relies on to list input files):
private static final PathFilter hiddenFileFilter = new PathFilter() {
    public boolean accept(Path p) {
        String name = p.getName();
        return !name.startsWith("_") && !name.startsWith(".");
    }
};
So any file whose name starts with an underscore (_) or a dot (.) is treated as a hidden file.
In listStatus, hiddenFileFilter is added to the list of filters:
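The effect of this filter can be shown with a minimal standalone sketch (plain Java for illustration only, not Hive/Hadoop code; the file names are the ones from the listing above):

```java
import java.util.List;
import java.util.stream.Collectors;

public class HiddenFileFilterDemo {
    // Same predicate as hiddenFileFilter: a name beginning with
    // '_' or '.' is treated as hidden and skipped
    static boolean accept(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    public static void main(String[] args) {
        List<String> files = List.of(
                "_success", "_commited_", "_started_xxx", "part-xxx.parquet");
        List<String> visible = files.stream()
                .filter(HiddenFileFilterDemo::accept)
                .collect(Collectors.toList());
        System.out.println(visible); // prints [part-xxx.parquet]
    }
}
```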
protected FileStatus[] listStatus(JobConf job) throws IOException {
    Path[] dirs = getInputPaths(job);
    if (dirs.length == 0) {
        throw new IOException("No input paths specified in job");
    }
    TokenCache.obtainTokensForNamenodes(job.getCredentials(), dirs, job);

    // Whether to look into the directory structure recursively
    boolean recursive = job.getBoolean("mapreduce.input.fileinputformat.input.dir.recursive", false);

    List<FileStatus> result = new ArrayList<FileStatus>();
    List<IOException> errors = new ArrayList<IOException>();

    // Combine hiddenFileFilter with the user-provided filter (if any)
    List<PathFilter> filters = new ArrayList<PathFilter>();
    filters.add(hiddenFileFilter);
    PathFilter jobFilter = getInputPathFilter(job);
    if (jobFilter != null) {
        filters.add(jobFilter);
    }
    PathFilter inputFilter = new FileInputFormat.MultiPathFilter(filters);

    for (Path p : dirs) {
        FileSystem fs = p.getFileSystem(job);
        // The filter is applied when globbing the input path...
        FileStatus[] matches = fs.globStatus(p, inputFilter);
        if (matches == null) {
            errors.add(new IOException("Input path does not exist: " + p));
        } else if (matches.length == 0) {
            errors.add(new IOException("Input Pattern " + p + " matches 0 files"));
        } else {
            for (FileStatus globStat : matches) {
                if (!globStat.isDirectory()) {
                    result.add(globStat);
                } else {
                    // ...and again to every file inside a matched directory
                    RemoteIterator<LocatedFileStatus> iter = fs.listLocatedStatus(globStat.getPath());
                    while (iter.hasNext()) {
                        LocatedFileStatus stat = iter.next();
                        if (inputFilter.accept(stat.getPath())) {
                            if (recursive && stat.isDirectory()) {
                                this.addInputPathRecursively(result, fs, stat.getPath(), inputFilter);
                            } else {
                                result.add(stat);
                            }
                        }
                    }
                }
            }
        }
    }

    if (!errors.isEmpty()) {
        throw new InvalidInputException(errors);
    }
    LOG.info("Total input paths to process : " + result.size());
    return result.toArray(new FileStatus[result.size()]);
}
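So every file whose name starts with an underscore or a dot is dropped before input splits are even computed, which is why _success, _commited_ and _started_xxx never contribute to query results. The listing logic can be sketched against a local filesystem like this (a simplified illustration using java.nio instead of the HDFS API; the directory and file names are just stand-ins for the table layout above):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class ListStatusSketch {
    // Same rule as hiddenFileFilter
    static boolean hidden(Path p) {
        String name = p.getFileName().toString();
        return name.startsWith("_") || name.startsWith(".");
    }

    // Simplified listStatus: skip hidden entries, recurse into
    // directories, collect everything else
    static List<Path> listStatus(Path dir) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (hidden(p)) continue;          // the hiddenFileFilter step
                if (Files.isDirectory(p)) {
                    result.addAll(listStatus(p)); // the recursive branch
                } else {
                    result.add(p);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // A throwaway directory mimicking the table layout above
        Path table = Files.createTempDirectory("parquet_table");
        for (String name : List.of("_success", "_commited_", "_started_xxx", "part-xxx.parquet")) {
            Files.createFile(table.resolve(name));
        }
        System.out.println(listStatus(table)); // only .../part-xxx.parquet remains
    }
}
```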