Hive Does Not Read Files Starting with an Underscore


Phenomenon

I created a Parquet table in Hive; after loading data, the files in the table directory looked like this:

_success     0M
_commited_   10M
_started_xxx  0M
part-xxx.parquet 40M
...

Querying the table returns the correct row count, but there are three extra files whose names start with an underscore. Given that Hive follows a schema-on-read model, why doesn't it read these files?

Cause

This behavior has been discussed online, but the existing posts either give no explanation or explain it poorly:
https://blog.csdn.net/weixin_34357267/article/details/92599750
https://stackoverflow.com/questions/31466884/hive-not-recognizing-underscore-in-path

The answer is in FileInputFormat.class, the input-listing class Hive relies on (it actually lives in Hadoop's MapReduce code, which Hive uses to enumerate input files). Decompiling it shows:

    private static final PathFilter hiddenFileFilter = new PathFilter() {
        public boolean accept(Path p) {
            String name = p.getName();
            return !name.startsWith("_") && !name.startsWith(".");
        }
    };

So any file whose name starts with an underscore `_` or a dot `.` is treated as hidden.
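The rule is easy to reproduce in isolation. A minimal sketch (plain Java, no Hadoop dependency, using the filenames from the listing above):

```java
import java.util.List;

public class HiddenFileRule {
    // Mirrors the accept() logic of FileInputFormat.hiddenFileFilter:
    // a file is hidden if its name starts with '_' or '.'
    static boolean accept(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    public static void main(String[] args) {
        for (String name : List.of("_success", "_commited_", "_started_xxx",
                                   "part-xxx.parquet")) {
            System.out.println(name + " -> " + (accept(name) ? "read" : "skipped"));
        }
    }
}
```

Running it shows that only `part-xxx.parquet` survives the filter; the three underscore files are all skipped.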

In listStatus, hiddenFileFilter is unconditionally added to the filter list:

    protected FileStatus[] listStatus(JobConf job) throws IOException {
        Path[] dirs = getInputPaths(job);
        if (dirs.length == 0) {
            throw new IOException("No input paths specified in job");
        }
        TokenCache.obtainTokensForNamenodes(job.getCredentials(), dirs, job);
        boolean recursive = job.getBoolean("mapreduce.input.fileinputformat.input.dir.recursive", false);
        List<FileStatus> result = new ArrayList<>();
        List<IOException> errors = new ArrayList<>();

        // hiddenFileFilter is ALWAYS applied, plus any user-supplied filter
        List<PathFilter> filters = new ArrayList<>();
        filters.add(hiddenFileFilter);
        PathFilter jobFilter = getInputPathFilter(job);
        if (jobFilter != null) {
            filters.add(jobFilter);
        }
        PathFilter inputFilter = new FileInputFormat.MultiPathFilter(filters);

        for (Path p : dirs) {
            FileSystem fs = p.getFileSystem(job);
            FileStatus[] matches = fs.globStatus(p, inputFilter);
            if (matches == null) {
                errors.add(new IOException("Input path does not exist: " + p));
            } else if (matches.length == 0) {
                errors.add(new IOException("Input Pattern " + p + " matches 0 files"));
            } else {
                for (FileStatus globStat : matches) {
                    if (!globStat.isDirectory()) {
                        result.add(globStat);
                    } else {
                        RemoteIterator<LocatedFileStatus> iter =
                                fs.listLocatedStatus(globStat.getPath());
                        while (iter.hasNext()) {
                            LocatedFileStatus stat = iter.next();
                            // paths rejected by the filter (hidden files) are dropped here
                            if (!inputFilter.accept(stat.getPath())) {
                                continue;
                            }
                            if (recursive && stat.isDirectory()) {
                                this.addInputPathRecursively(result, fs, stat.getPath(), inputFilter);
                            } else {
                                result.add(stat);
                            }
                        }
                    }
                }
            }
        }

        if (!errors.isEmpty()) {
            throw new InvalidInputException(errors);
        }
        LOG.info("Total input paths to process : " + result.size());
        return result.toArray(new FileStatus[result.size()]);
    }
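Note that MultiPathFilter accepts a path only if every filter in the list accepts it, so even a custom filter installed via getInputPathFilter is ANDed with hiddenFileFilter and cannot re-include the underscore files. A plain-Java sketch of that combining logic (assumed to match the Hadoop inner class; `acceptAll` is a hypothetical name):

```java
import java.util.List;
import java.util.function.Predicate;

public class MultiFilterDemo {
    // Sketch of MultiPathFilter semantics: accept only if ALL filters accept.
    static boolean acceptAll(List<Predicate<String>> filters, String name) {
        for (Predicate<String> f : filters) {
            if (!f.test(name)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Predicate<String> hidden = n -> !n.startsWith("_") && !n.startsWith(".");
        // Hypothetical user filter that tries to allow everything
        Predicate<String> allowAll = n -> true;
        List<Predicate<String>> filters = List.of(hidden, allowAll);

        System.out.println(acceptAll(filters, "part-xxx.parquet")); // true
        System.out.println(acceptAll(filters, "_success"));         // false: hiddenFileFilter still vetoes
    }
}
```

This is why the `_success`, `_commited_`, and `_started_xxx` marker files never reach the query: they are filtered out during input listing, before any Parquet reading happens.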