Background
When Delta Lake deletes or updates data, it only records a remove marker for the affected data files in the transaction log; the files are not physically deleted until a vacuum runs. So any operation that lists parquet files directly under the table path, for example fetching metadata or previewing rows on a web page, may read data that has already been marked as deleted.
You can obtain the real parquet paths of a table or partition through its snapshot, but Delta Lake currently depends heavily on Spark and requires a SparkSession, e.g.
val snapshot = DeltaLog.forTable(spark, location).snapshot
If all you need is the snapshot, cold-starting this way is slow, so I wrote a simple Delta Lake utility class to fetch the same information; it typically returns results in milliseconds.
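The utility works by replaying the transaction log in `_delta_log/` directly: each commit is a JSON file of actions, and periodic parquet checkpoints compact the accumulated state. Simplified, abridged examples of the two actions the code cares about (real entries carry more fields such as `size`, `modificationTime`, and `partitionValues`; only `path` matters here):

```json
{"add":{"path":"ds=20200101/a1.parquet","dataChange":true}}
{"remove":{"path":"ds=20200101/a0.parquet","dataChange":true}}
```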
Code
The Java version is below; I also implemented a Scala version, which I find more readable.
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
/**
 * Utility for reading Delta table data
 */
public class DeltaHelper {
private static final Logger LOGGER = LoggerFactory.getLogger(DeltaHelper.class);
public static List<FileStatus> loadFileStatus(String rawPath,
FileSystem fs) throws IOException {
List<Path> pathList = load(rawPath, fs);
List<FileStatus> input = new ArrayList<>();
for (Path p : pathList) {
input.add(fs.getFileStatus(p));
}
return input;
}
/**
 * Get the real (live) parquet paths of a Delta table
 */
public static List<Path> load(String rawPath,
FileSystem fs) throws IOException {
String tablePath = cutPartitionPath(rawPath);
String partitionPath = tablePath.length() != rawPath.length() ? rawPath.substring(tablePath.length() + 1) : "";
Path deltaLogPath = fs.makeQualified(new Path(tablePath, "_delta_log"));
ArrayList<Path> result = new ArrayList<>();
ArrayList<Path> parquetPathList = new ArrayList<>();
LinkedList<String> checkPointPath = new LinkedList<>();
LinkedList<String> afterCheckPointPath = new LinkedList<>();
long lastCheckPointIndex = 0L;
for (FileStatus fileStatus : fs.listStatus(deltaLogPath)) {
Path path = fileStatus.getPath();
if (path.toString().contains("parquet")) {
parquetPathList.add(path);
}
}
if (parquetPathList.size() != 0) {
String lastCheckPointPath = parquetPathList.get(parquetPathList.size() - 1).toString();
lastCheckPointIndex = getDeltaLogIndex(lastCheckPointPath, "parquet");
checkPointPath = getCheckPointPath(lastCheckPointPath, fs.getConf(), partitionPath);
}
for (FileStatus fileStatus : fs.listStatus(deltaLogPath)) {
Path path = fileStatus.getPath();
if (path.toString().contains("json")) {
// with no checkpoint, read every json commit; with a checkpoint, read only json commits whose index is greater than lastCheckPointIndex
if (lastCheckPointIndex == 0 || getDeltaLogIndex(path.toString(), "json") > lastCheckPointIndex) {
try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)))) {
    String line = br.readLine();
    while (line != null) {
        JSONObject obj = JSON.parseObject(line);
        JSONObject addObj = obj.getJSONObject("add");
        JSONObject removeObj = obj.getJSONObject("remove");
        if (addObj != null) {
            String addPath = addObj.getString("path");
            if (StringUtils.isNotEmpty(addPath) && partitionCond(addPath, partitionPath)) {
                afterCheckPointPath.add(addPath);
            }
        } else if (removeObj != null) {
            String removePath = removeObj.getString("path");
            if (StringUtils.isNotEmpty(removePath) && partitionCond(removePath, partitionPath)) {
                checkPointPath.remove(removePath);
                afterCheckPointPath.remove(removePath);
            }
        }
        line = br.readLine();
    }
}
}
}
}
checkPointPath.addAll(afterCheckPointPath);
for (String path : checkPointPath) {
result.add(new Path(tablePath + "/" + path));
}
return result;
}
/**
 * Check whether _delta_log exists under the table directory, e.g.
 * /user/hive/warehouse/db_name/table_name/_delta_log
 */
public static boolean isDeltaTable(String path,
FileSystem fs) throws IOException {
Path deltaLogPath = fs.makeQualified(new Path(cutPartitionPath(path), "_delta_log"));
return fs.exists(deltaLogPath);
}
/**
 * /a/b/c=1/d=2 => /a/b
 */
private static String cutPartitionPath(String path) {
    String lastStr = path.substring(path.lastIndexOf("/") + 1);
    if (lastStr.contains("=")) {
        // use lastIndexOf("/") so an earlier occurrence of the same segment cannot truncate the wrong position
        return cutPartitionPath(path.substring(0, path.lastIndexOf("/")));
    } else {
        return path;
    }
}
/**
 * Get the version index of a delta log file
 */
private static Long getDeltaLogIndex(String path,
String format) {
String index;
if (format.equals("parquet")) {
index = path.substring(path.indexOf("_delta_log/") + 11, path.indexOf(".checkpoint.parquet"));
} else {
index = path.substring(path.indexOf("_delta_log/") + 11, path.indexOf(".json"));
}
return Long.parseLong(index);
}
/**
 * Partition path predicate: match when no partition is given, or when the path contains the partition segment
 */
private static boolean partitionCond(String path,
                                     String partition) {
    return StringUtils.isBlank(partition) || path.contains(partition);
}
/**
 * Read the live file paths from the checkpoint (parquet) file
 */
private static LinkedList<String> getCheckPointPath(String path,
                                                    Configuration conf,
                                                    String partitionPath) {
    LinkedList<String> parquetList = new LinkedList<>();
    if (StringUtils.isNotEmpty(path)) {
        try (ParquetReader<Group> reader = ParquetReader.builder(new GroupReadSupport(), new Path(path)).withConf(conf).build()) {
            Group recordData = reader.read();
            while (recordData != null) {
                try {
                    String addPath = recordData.getGroup("add", 0).getString("path", 0);
                    if (partitionCond(addPath, partitionPath)) {
                        parquetList.add(addPath);
                    }
                } catch (RuntimeException ignored) {
                    // the record carries no "add" action
                }
                try {
                    String removePath = recordData.getGroup("remove", 0).getString("path", 0);
                    if (partitionCond(removePath, partitionPath)) {
                        parquetList.remove(removePath);
                    }
                } catch (RuntimeException ignored) {
                    // the record carries no "remove" action
                }
                recordData = reader.read();
            }
        } catch (IOException e) {
            LOGGER.error("Failed to read delta parquet checkpoint", e);
        }
    }
    return parquetList;
}
}
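The core reconciliation in `load` can be illustrated standalone: start from the file list in the latest checkpoint, then apply the add/remove actions of every later JSON commit. A minimal sketch with in-memory data (the `ReplaySketch` class, the `+`/`-` action encoding, and the file names are all made up for illustration):

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

public class ReplaySketch {
    // Replay add ("+path") and remove ("-path") actions over the checkpoint file list.
    static List<String> replay(List<String> checkpointFiles, List<String> actions) {
        LinkedList<String> live = new LinkedList<>(checkpointFiles);
        LinkedList<String> afterCheckpoint = new LinkedList<>();
        for (String action : actions) {
            String path = action.substring(1);
            if (action.startsWith("+")) {
                afterCheckpoint.add(path);
            } else {
                // a remove may target a checkpoint file or a file added in a later commit
                live.remove(path);
                afterCheckpoint.remove(path);
            }
        }
        live.addAll(afterCheckpoint);
        return live;
    }

    public static void main(String[] args) {
        // checkpoint lists a1 and a2; later commits add a3 and remove a1
        System.out.println(replay(
                Arrays.asList("a1.parquet", "a2.parquet"),
                Arrays.asList("+a3.parquet", "-a1.parquet")));
        // prints [a2.parquet, a3.parquet]
    }
}
```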
Usage
input1:
Get all live parquet paths of a table
load("/user/hive/warehouse/db_name/table_name", fs)
output1:
/user/hive/warehouse/db_name/table_name/a1.parquet
...
input2:
Get all live parquet paths of a given partition
load("/user/hive/warehouse/db_name/table_name/ds=20200101", fs)
output2:
/user/hive/warehouse/db_name/table_name/ds=20200101/a1.parquet
...
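The table/partition split behind these two call forms can also be exercised on its own. A standalone re-implementation of the `cutPartitionPath` recursion (the `PathSplitSketch` class name is made up; the logic mirrors the helper above):

```java
public class PathSplitSketch {
    // Strip trailing k=v partition segments until a non-partition segment remains.
    static String cutPartitionPath(String path) {
        String last = path.substring(path.lastIndexOf("/") + 1);
        if (last.contains("=")) {
            return cutPartitionPath(path.substring(0, path.lastIndexOf("/")));
        }
        return path;
    }

    public static void main(String[] args) {
        String raw = "/user/hive/warehouse/db_name/table_name/ds=20200101";
        String table = cutPartitionPath(raw);
        // relative partition path, computed the same way as in load
        String partition = table.length() != raw.length() ? raw.substring(table.length() + 1) : "";
        System.out.println(table);      // /user/hive/warehouse/db_name/table_name
        System.out.println(partition);  // ds=20200101
    }
}
```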
Postscript
I understand the Databricks team's positioning of Delta Lake and the advantage of reading through Spark at massive data scale, but the current heavy dependence on the Spark ecosystem makes some scenarios quite awkward to work with.