Hive Does Not Read Files Starting with an Underscore


Phenomenon

I created a Parquet table in Hive; after loading data, the files in the table directory looked like this:

_success     0M
_commited_   10M
_started_xxx  0M
part-xxx.parquet 40M
...

Querying the table returns the correct row count, but there are three extra files whose names start with an underscore. Given that Hive follows a schema-on-read model, why doesn't it read these files?

Cause

This behavior has been discussed online, but the existing posts either give no explanation or explain it poorly:
https://blog.csdn.net/weixin_34357267/article/details/92599750
https://stackoverflow.com/questions/31466884/hive-not-recognizing-underscore-in-path

The answer is in FileInputFormat.class, the input-listing class Hive relies on (it actually lives in Hadoop's MapReduce code, which Hive uses to enumerate input files). Decompiling it shows:

    private static final PathFilter hiddenFileFilter = new PathFilter() {
        public boolean accept(Path p) {
            String name = p.getName();
            return !name.startsWith("_") && !name.startsWith(".");
        }
    };

So any file whose name starts with an underscore `_` or a dot `.` is treated as hidden.
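The rule is easy to reproduce in isolation. A minimal sketch (plain Java, no Hadoop dependency, using the filenames from the listing above):

```java
import java.util.List;

public class HiddenFileRule {
    // Mirrors the accept() logic of FileInputFormat.hiddenFileFilter:
    // a file is hidden if its name starts with '_' or '.'
    static boolean accept(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    public static void main(String[] args) {
        for (String name : List.of("_success", "_commited_", "_started_xxx",
                                   "part-xxx.parquet")) {
            System.out.println(name + " -> " + (accept(name) ? "read" : "skipped"));
        }
    }
}
```

Running it shows that only `part-xxx.parquet` survives the filter; the three underscore files are all skipped.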

In listStatus, hiddenFileFilter is unconditionally added to the filter list:

    protected FileStatus[] listStatus(JobConf job) throws IOException {
        Path[] dirs = getInputPaths(job);
        if (dirs.length == 0) {
            throw new IOException("No input paths specified in job");
        }
        TokenCache.obtainTokensForNamenodes(job.getCredentials(), dirs, job);
        boolean recursive = job.getBoolean("mapreduce.input.fileinputformat.input.dir.recursive", false);
        List<FileStatus> result = new ArrayList<>();
        List<IOException> errors = new ArrayList<>();

        // hiddenFileFilter is ALWAYS applied, plus any user-supplied filter
        List<PathFilter> filters = new ArrayList<>();
        filters.add(hiddenFileFilter);
        PathFilter jobFilter = getInputPathFilter(job);
        if (jobFilter != null) {
            filters.add(jobFilter);
        }
        PathFilter inputFilter = new FileInputFormat.MultiPathFilter(filters);

        for (Path p : dirs) {
            FileSystem fs = p.getFileSystem(job);
            FileStatus[] matches = fs.globStatus(p, inputFilter);
            if (matches == null) {
                errors.add(new IOException("Input path does not exist: " + p));
            } else if (matches.length == 0) {
                errors.add(new IOException("Input Pattern " + p + " matches 0 files"));
            } else {
                for (FileStatus globStat : matches) {
                    if (!globStat.isDirectory()) {
                        result.add(globStat);
                    } else {
                        RemoteIterator<LocatedFileStatus> iter =
                                fs.listLocatedStatus(globStat.getPath());
                        while (iter.hasNext()) {
                            LocatedFileStatus stat = iter.next();
                            // paths rejected by the filter (hidden files) are dropped here
                            if (!inputFilter.accept(stat.getPath())) {
                                continue;
                            }
                            if (recursive && stat.isDirectory()) {
                                this.addInputPathRecursively(result, fs, stat.getPath(), inputFilter);
                            } else {
                                result.add(stat);
                            }
                        }
                    }
                }
            }
        }

        if (!errors.isEmpty()) {
            throw new InvalidInputException(errors);
        }
        LOG.info("Total input paths to process : " + result.size());
        return result.toArray(new FileStatus[result.size()]);
    }
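Note that MultiPathFilter accepts a path only if every filter in the list accepts it, so even a custom filter installed via getInputPathFilter is ANDed with hiddenFileFilter and cannot re-include the underscore files. A plain-Java sketch of that combining logic (assumed to match the Hadoop inner class; `acceptAll` is a hypothetical name):

```java
import java.util.List;
import java.util.function.Predicate;

public class MultiFilterDemo {
    // Sketch of MultiPathFilter semantics: accept only if ALL filters accept.
    static boolean acceptAll(List<Predicate<String>> filters, String name) {
        for (Predicate<String> f : filters) {
            if (!f.test(name)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Predicate<String> hidden = n -> !n.startsWith("_") && !n.startsWith(".");
        // Hypothetical user filter that tries to allow everything
        Predicate<String> allowAll = n -> true;
        List<Predicate<String>> filters = List.of(hidden, allowAll);

        System.out.println(acceptAll(filters, "part-xxx.parquet")); // true
        System.out.println(acceptAll(filters, "_success"));         // false: hiddenFileFilter still vetoes
    }
}
```

This is why the `_success`, `_commited_`, and `_started_xxx` marker files never reach the query: they are filtered out during input listing, before any Parquet reading happens.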