Background
When Delta Lake deletes or updates data, it only records a remove marker for the affected data files in the transaction log; the files are not physically deleted until a vacuum runs. So any operation that lists parquet files directly under the table path, for example fetching metadata or previewing rows on a web page, may read data that has already been marked as deleted.
You can obtain the real parquet paths of a table or partition through its snapshot, but Delta Lake currently depends heavily on Spark and requires a SparkSession, e.g.
val snapshot = DeltaLog.forTable(spark, location).snapshot
If all you need is the snapshot, cold-starting this way is slow, so I wrote a simple Delta Lake utility class to fetch the same information; it typically returns results in milliseconds.
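The utility works by replaying the transaction log in `_delta_log/` directly: each commit is a JSON file of actions, and periodic parquet checkpoints compact the accumulated state. Simplified, abridged examples of the two actions the code cares about (real entries carry more fields such as `size`, `modificationTime`, and `partitionValues`; only `path` matters here):

```json
{"add":{"path":"ds=20200101/a1.parquet","dataChange":true}}
{"remove":{"path":"ds=20200101/a0.parquet","dataChange":true}}
```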
Code
The Java version is below; I also implemented a Scala version, which I find more readable.
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
/**
 * Utility for reading Delta table data
 */
public class DeltaHelper {
private static final Logger LOGGER = LoggerFactory.getLogger(DeltaHelper.class);
public static List<FileStatus> loadFileStatus(String rawPath,
FileSystem fs) throws IOException {
List<Path> pathList = load(rawPath, fs);
List<FileStatus> input = new ArrayList<>();
for (Path p : pathList) {
input.add(fs.getFileStatus(p));
}
return input;
}
/**
 * Get the real (live) parquet paths of a Delta table
 */
public static List<Path> load(String rawPath,
FileSystem fs) throws IOException {
String tablePath = cutPartitionPath(rawPath);
String partitionPath = tablePath.length() != rawPath.length() ? rawPath.substring(tablePath.length() + 1) : "";
Path deltaLogPath = fs.makeQualified(new Path(tablePath, "_delta_log"));
ArrayList<Path> result = new ArrayList<>();
ArrayList<Path> parquetPathList = new ArrayList<>();
LinkedList<String> checkPointPath = new LinkedList<>();
LinkedList<String> afterCheckPointPath = new LinkedList<>();
long lastCheckPointIndex = 0L;
for (FileStatus fileStatus : fs.listStatus(deltaLogPath)) {
Path path = fileStatus.getPath();
if (path.toString().contains("parquet")) {
parquetPathList.add(path);
}
}
if (parquetPathList.size() != 0) {
String lastCheckPointPath = parquetPathList.get(parquetPathList.size() - 1).toString();
lastCheckPointIndex = getDeltaLogIndex(lastCheckPointPath, "parquet");
checkPointPath = getCheckPointPath(lastCheckPointPath, fs.getConf(), partitionPath);
}
for (FileStatus fileStatus : fs.listStatus(deltaLogPath)) {
Path path = fileStatus.getPath();
if (path.toString().contains("json")) {
// with no checkpoint, read every json commit; with a checkpoint, read only json commits whose index is greater than lastCheckPointIndex
if (lastCheckPointIndex == 0 || getDeltaLogIndex(path.toString(), "json") > lastCheckPointIndex) {
try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)))) {
    String line = br.readLine();
    while (line != null) {
        JSONObject obj = JSON.parseObject(line);
        JSONObject addObj = obj.getJSONObject("add");
        JSONObject removeObj = obj.getJSONObject("remove");
        if (addObj != null) {
            String addPath = addObj.getString("path");
            if (StringUtils.isNotEmpty(addPath) && partitionCond(addPath, partitionPath)) {
                afterCheckPointPath.add(addPath);
            }
        } else if (removeObj != null) {
            String removePath = removeObj.getString("path");
            if (StringUtils.isNotEmpty(removePath) && partitionCond(removePath, partitionPath)) {
                checkPointPath.remove(removePath);
                afterCheckPointPath.remove(removePath);
            }
        }
        line = br.readLine();
    }
}
}
}
}
checkPointPath.addAll(afterCheckPointPath);
for (String path : checkPointPath) {
result.add(new Path(tablePath + "/" + path));
}
return result;
}
/**
 * Check whether _delta_log exists under the table directory, e.g.
 * /user/hive/warehouse/db_name/table_name/_delta_log
 */
public static boolean isDeltaTable(String path,
FileSystem fs) throws IOException {
Path deltaLogPath = fs.makeQualified(new Path(cutPartitionPath(path), "_delta_log"));
return fs.exists(deltaLogPath);
}
/**
 * /a/b/c=1/d=2 => /a/b
 */
private static String cutPartitionPath(String path) {
    String lastStr = path.substring(path.lastIndexOf("/") + 1);
    if (lastStr.contains("=")) {
        // use lastIndexOf("/") so an earlier occurrence of the same segment cannot truncate the wrong position
        return cutPartitionPath(path.substring(0, path.lastIndexOf("/")));
    } else {
        return path;
    }
}
/**
 * Get the version index of a delta log file
 */
private static Long getDeltaLogIndex(String path,
String format) {
String index;
if (format.equals("parquet")) {
index = path.substring(path.indexOf("_delta_log/") + 11, path.indexOf(".checkpoint.parquet"));
} else {
index = path.substring(path.indexOf("_delta_log/") + 11, path.indexOf(".json"));
}
return Long.parseLong(index);
}
/**
 * Partition path predicate: match when no partition is given, or when the path contains the partition segment
 */
private static boolean partitionCond(String path,
                                     String partition) {
    return StringUtils.isBlank(partition) || path.contains(partition);
}
/**
 * Read the live file paths from the checkpoint (parquet) file
 */
private static LinkedList<String> getCheckPointPath(String path,
                                                    Configuration conf,
                                                    String partitionPath) {
    LinkedList<String> parquetList = new LinkedList<>();
    if (StringUtils.isNotEmpty(path)) {
        try (ParquetReader<Group> reader = ParquetReader.builder(new GroupReadSupport(), new Path(path)).withConf(conf).build()) {
            Group recordData = reader.read();
            while (recordData != null) {
                try {
                    String addPath = recordData.getGroup("add", 0).getString("path", 0);
                    if (partitionCond(addPath, partitionPath)) {
                        parquetList.add(addPath);
                    }
                } catch (RuntimeException ignored) {
                    // the record carries no "add" action
                }
                try {
                    String removePath = recordData.getGroup("remove", 0).getString("path", 0);
                    if (partitionCond(removePath, partitionPath)) {
                        parquetList.remove(removePath);
                    }
                } catch (RuntimeException ignored) {
                    // the record carries no "remove" action
                }
                recordData = reader.read();
            }
        } catch (IOException e) {
            LOGGER.error("Failed to read delta parquet checkpoint", e);
        }
    }
    return parquetList;
}
}
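The core reconciliation in `load` can be illustrated standalone: start from the file list in the latest checkpoint, then apply the add/remove actions of every later JSON commit. A minimal sketch with in-memory data (the `ReplaySketch` class, the `+`/`-` action encoding, and the file names are all made up for illustration):

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

public class ReplaySketch {
    // Replay add ("+path") and remove ("-path") actions over the checkpoint file list.
    static List<String> replay(List<String> checkpointFiles, List<String> actions) {
        LinkedList<String> live = new LinkedList<>(checkpointFiles);
        LinkedList<String> afterCheckpoint = new LinkedList<>();
        for (String action : actions) {
            String path = action.substring(1);
            if (action.startsWith("+")) {
                afterCheckpoint.add(path);
            } else {
                // a remove may target a checkpoint file or a file added in a later commit
                live.remove(path);
                afterCheckpoint.remove(path);
            }
        }
        live.addAll(afterCheckpoint);
        return live;
    }

    public static void main(String[] args) {
        // checkpoint lists a1 and a2; later commits add a3 and remove a1
        System.out.println(replay(
                Arrays.asList("a1.parquet", "a2.parquet"),
                Arrays.asList("+a3.parquet", "-a1.parquet")));
        // prints [a2.parquet, a3.parquet]
    }
}
```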
Usage
input1:
Get all live parquet paths of a table
load("/user/hive/warehouse/db_name/table_name", fs)
output1:
/user/hive/warehouse/db_name/table_name/a1.parquet
...
input2:
Get all live parquet paths of a given partition
load("/user/hive/warehouse/db_name/table_name/ds=20200101", fs)
output2:
/user/hive/warehouse/db_name/table_name/ds=20200101/a1.parquet
...
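The table/partition split behind these two call forms can also be exercised on its own. A standalone re-implementation of the `cutPartitionPath` recursion (the `PathSplitSketch` class name is made up; the logic mirrors the helper above):

```java
public class PathSplitSketch {
    // Strip trailing k=v partition segments until a non-partition segment remains.
    static String cutPartitionPath(String path) {
        String last = path.substring(path.lastIndexOf("/") + 1);
        if (last.contains("=")) {
            return cutPartitionPath(path.substring(0, path.lastIndexOf("/")));
        }
        return path;
    }

    public static void main(String[] args) {
        String raw = "/user/hive/warehouse/db_name/table_name/ds=20200101";
        String table = cutPartitionPath(raw);
        // relative partition path, computed the same way as in load
        String partition = table.length() != raw.length() ? raw.substring(table.length() + 1) : "";
        System.out.println(table);      // /user/hive/warehouse/db_name/table_name
        System.out.println(partition);  // ds=20200101
    }
}
```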
Postscript
I understand the Databricks team's positioning of Delta Lake and the advantage of reading through Spark at massive data scale, but the current heavy dependence on the Spark ecosystem makes some scenarios quite awkward to work with.