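When I first started working, a senior developer told me that when reading source code, the first thing to understand is the data structures each part is built on; once the data structure design is clear, the code becomes much easier to follow. I have kept that advice in mind ever since. A while ago I needed to modify part of kudu-client to implement data encryption and decryption, so I studied how the Kudu Java client represents a row and how rows are written and read. This post walks through the design of the row as a set of notes.
Writing data – PartialRow
In the Kudu client, a single row of data (Row) is represented by the following data structure: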
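The structure as a whole is fairly simple:
schema: holds the column information
varLengthData: stores variable-length data such as string and binary values
rowAlloc: stores fixed-length data such as numeric types
Two bitsets track per-column state: columnsBitSet records which columns have been set and nullsBitSet records which are null; frozen controls whether the row can still be modified
These structures are initialized as follows: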
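As a rough illustration of the fields listed above, here is a minimal sketch in plain Java. `RowFieldsSketch` and its methods are hypothetical names, not Kudu classes; the schema is reduced to a column count and a precomputed row size, and the null bitset is always allocated for simplicity.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Hypothetical, simplified stand-in for the fields described above; not Kudu's PartialRow. */
final class RowFieldsSketch {
  final byte[] rowAlloc;                 // fixed-width cells, laid out back to back
  final List<ByteBuffer> varLengthData;  // one slot per column, null until a varlen value is set
  final BitSet columnsBitSet;            // bit i set => column i has been assigned
  final BitSet nullsBitSet;              // bit i set => column i is null
  boolean frozen;                        // true => further writes are rejected

  RowFieldsSketch(int columnCount, int rowSize) {
    this.rowAlloc = new byte[rowSize];
    // Arrays.asList over an array gives a fixed-size list pre-filled with nulls.
    this.varLengthData = Arrays.asList(new ByteBuffer[columnCount]);
    this.columnsBitSet = new BitSet(columnCount);
    this.nullsBitSet = new BitSet(columnCount);
  }

  /** Guard that every mutating method would call first. */
  void checkNotFrozen() {
    if (frozen) {
      throw new IllegalStateException("row is frozen");
    }
  }

  boolean isSet(int columnIndex) {
    return columnsBitSet.get(columnIndex);
  }
}
```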
/**
 * This is not a stable API, prefer using {@link Schema#newPartialRow()}
 * to create a new partial row.
 * @param schema the schema to use for this row
 */
public PartialRow(Schema schema) {
  this.schema = schema;
  this.columnsBitSet = new BitSet(this.schema.getColumnCount());
  this.nullsBitSet = schema.hasNullableColumns() ?
      new BitSet(this.schema.getColumnCount()) : null;
  this.rowAlloc = new byte[schema.getRowSize()];
  // Pre-fill the array with nulls. We'll only replace cells that have varlen values.
  this.varLengthData = Arrays.asList(new ByteBuffer[this.schema.getColumnCount()]);
}
The constructor initializes varLengthData and sizes rowAlloc from schema.getRowSize(). Let's look closely at how the size of rowAlloc is computed:
/**
 * Gives the size in bytes for a single row given the specified schema
 * @param columns the row's columns
 * @return row size in bytes
 */
private int getRowSize(List<ColumnSchema> columns) {
  int totalSize = 0;
  boolean hasNullables = false;
  Set<Integer> encryptedColIndices = getEncryptedColumnIdsToAlgortithms().keySet();
  int size = columns.size();
  for (int i = 0; i < size; i++) {
    totalSize += columns.get(i).getTypeSize(encryptedColIndices.contains(i));
    hasNullables |= columns.get(i).isNullable();
  }
  if (hasNullables) {
    totalSize += Bytes.getBitSetSize(columns.size());
  }
  return totalSize;
}
The code above iterates over every column and sums up the sizes. Next, let's see how many bytes each data type actually takes:
/**
 * Gives the size in bytes for a given DataType, as per the pb specification
 * @param type pb type
 * @return size in bytes
 */
private static int getTypeSize(DataType type) {
  switch (type) {
    case STRING:
    case BINARY:
      return 8 + 8; // offset then string length
    case BOOL:
    case INT8:
    case IS_DELETED:
      return 1;
    case INT16:
      return Shorts.BYTES;
    case INT32:
    case FLOAT:
      return Ints.BYTES;
    case INT64:
    case DOUBLE:
    case UNIXTIME_MICROS:
      return Longs.BYTES;
    default: throw new IllegalArgumentException("The provided data type doesn't map" +
        " to know any known one.");
  }
}
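Here string and binary columns are allotted 16 bytes to record the data's offset and its actual length (8 + 8); every other type gets a fixed width determined by the type itself.
Writing values into a PartialRow
Values are written by calling the row's addXX methods. Below is one example each for fixed-length and variable-length data.
Fixed-length data, using int as the example: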
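To make the size arithmetic concrete, here is a small worked sketch for a three-column schema (INT32, STRING, INT64). `RowSizeSketch` and `ColType` are hypothetical names. Two things are assumed rather than taken from the listings above: that the null bitset needs one bit per column rounded up to whole bytes, and that each column's offset is the running sum of the widths before it.

```java
import java.util.Arrays;
import java.util.List;

final class RowSizeSketch {
  /** Toy column types carrying the widths listed above. */
  enum ColType {
    BOOL(1), INT8(1), INT16(2), INT32(4), FLOAT(4),
    INT64(8), DOUBLE(8), STRING(16), BINARY(16);

    final int size;

    ColType(int size) {
      this.size = size;
    }
  }

  /** Assumption: bytes needed to hold one bit per column, rounded up. */
  static int bitSetSize(int bits) {
    return (bits + 7) / 8;
  }

  /** Assumption: a column's offset is the sum of the widths of the columns before it. */
  static int[] columnOffsets(List<ColType> cols) {
    int[] offsets = new int[cols.size()];
    int pos = 0;
    for (int i = 0; i < cols.size(); i++) {
      offsets[i] = pos;
      pos += cols.get(i).size;
    }
    return offsets;
  }

  /** Sum of the column widths, plus the null bitset when any column is nullable. */
  static int rowSize(List<ColType> cols, boolean hasNullables) {
    int total = 0;
    for (ColType c : cols) {
      total += c.size;
    }
    if (hasNullables) {
      total += bitSetSize(cols.size());
    }
    return total;
  }

  public static void main(String[] args) {
    List<ColType> cols = Arrays.asList(ColType.INT32, ColType.STRING, ColType.INT64);
    System.out.println(Arrays.toString(columnOffsets(cols))); // [0, 4, 20]
    System.out.println(rowSize(cols, true));                  // 4 + 16 + 8 + 1 = 29
  }
}
```

Note that the STRING column costs 16 bytes in the fixed-width area no matter how long the actual string is.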
/**
 * Add an int for the specified column.
 * @param columnIndex the column's index in the schema
 * @param val value to add
 * @throws IllegalArgumentException if the value doesn't match the column's type
 * @throws IllegalStateException if the row was already applied
 * @throws IndexOutOfBoundsException if the column doesn't exist
 */
public void addInt(int columnIndex, int val) {
  checkNotFrozen();
  checkColumn(schema.getColumnByIndex(columnIndex), Type.INT32);
  Bytes.setInt(rowAlloc, val, getPositionInRowAllocAndSetBitSet(columnIndex));
}
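This is very simple: the value is written directly into the rowAlloc byte array and the column is marked as set. The important part is how the position where the value belongs is obtained, namely getPositionInRowAllocAndSetBitSet: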
/**
 * Sets the column bit set for the column index, and returns the column's offset.
 * @param columnIndex the index of the column to get the position for and mark as set
 * @return the offset in rowAlloc for the column
 */
private int getPositionInRowAllocAndSetBitSet(int columnIndex) {
  columnsBitSet.set(columnIndex);
  return schema.getColumnOffset(columnIndex);
}
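The fixed-width write path can be sketched in isolation as below. `FixedWidthWriteSketch` is a hypothetical class: column offsets are passed in as a plain array instead of coming from a Schema, and `java.nio.ByteBuffer` in little-endian order stands in for Kudu's `Bytes.setInt`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.BitSet;

final class FixedWidthWriteSketch {
  final byte[] rowAlloc;
  final int[] columnOffsets;   // hypothetical replacement for schema.getColumnOffset(i)
  final BitSet columnsBitSet;

  FixedWidthWriteSketch(int rowSize, int[] columnOffsets) {
    this.rowAlloc = new byte[rowSize];
    this.columnOffsets = columnOffsets;
    this.columnsBitSet = new BitSet(columnOffsets.length);
  }

  /** Marks the column as set, then writes the 4-byte value at the column's offset. */
  void addInt(int columnIndex, int val) {
    columnsBitSet.set(columnIndex);
    int pos = columnOffsets[columnIndex];
    ByteBuffer.wrap(rowAlloc).order(ByteOrder.LITTLE_ENDIAN).putInt(pos, val);
  }

  /** Reads the value back from the same offset. */
  int getInt(int columnIndex) {
    return ByteBuffer.wrap(rowAlloc)
        .order(ByteOrder.LITTLE_ENDIAN)
        .getInt(columnOffsets[columnIndex]);
  }
}
```

Because every column has a fixed slot, writing a value never shifts other cells; setting the same column twice simply overwrites the same bytes.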
Variable-length data, using string as the example, is just as simple: the value is handed to addVarLengthData, which stores it in varLengthData.
/**
 * Add a String for the specified value, encoded as UTF8.
 * Note that the provided value must not be mutated after this.
 * @param columnIndex the column's index in the schema
 * @param val value to add
 * @throws IllegalArgumentException if the value doesn't match the column's type
 * @throws IllegalStateException if the row was already applied
 * @throws IndexOutOfBoundsException if the column doesn't exist
 */
public void addStringUtf8(int columnIndex, byte[] val) {
  // TODO: use Utf8.isWellFormed from Guava 16 to verify that.
  // the user isn't putting in any garbage data.
  checkNotFrozen();
  checkColumn(schema.getColumnByIndex(columnIndex), Type.STRING);
  addVarLengthData(columnIndex, val);
}
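The body of addVarLengthData is not shown above, so the following is only a sketch of what storing a variable-length value per column might look like. `VarLengthWriteSketch` is a hypothetical class, and the assumption is that the value is kept in the per-column list while only the "set" bit is recorded, with nothing written to the fixed-width area at this point.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

final class VarLengthWriteSketch {
  final List<ByteBuffer> varLengthData;
  final BitSet columnsBitSet;

  VarLengthWriteSketch(int columnCount) {
    // Arrays.asList over an array is fixed-size but still supports set(index, value).
    this.varLengthData = Arrays.asList(new ByteBuffer[columnCount]);
    this.columnsBitSet = new BitSet(columnCount);
  }

  /** Encodes the string as UTF-8 and stores it in the slot for this column. */
  void addString(int columnIndex, String val) {
    byte[] utf8 = val.getBytes(StandardCharsets.UTF_8);
    varLengthData.set(columnIndex, ByteBuffer.wrap(utf8));
    columnsBitSet.set(columnIndex);
  }
}
```

Slots for columns that never receive a variable-length value stay null, which matches the "pre-fill the array with nulls" comment in the constructor.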
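Reading data – RowResult
The Javadoc says: "RowResult represents one row from a scanner." In other words, it represents a row returned when reading data from Kudu. Its structure is as follows:
Here rowData stores the fixed-length data and indirectData stores the variable-length string and binary data. Let's use long and string as examples to see how data is read:
The reason for using long as an example is that there is something special here. Recall that PartialRow also reserves space in rowAlloc for string and binary columns; that space holds the offset and length of the variable-length data, 16 bytes in total: 8 bytes for the offset and 8 bytes for the length. The same method that reads a long value is therefore also used to read a string's offset.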
/**
 * Get the specified column's long
 *
 * If this is a UNIXTIME_MICROS column, the long value corresponds to a number of microseconds
 * since midnight, January 1, 1970 UTC.
 *
 * @param columnIndex Column index in the schema
 * @return a positive long
 * @throws IllegalArgumentException if the column is null
 * @throws IndexOutOfBoundsException if the column doesn't exist
 */
public long getLong(int columnIndex) {
  checkValidColumn(columnIndex);
  checkNull(columnIndex);
  checkType(columnIndex, Type.INT64, Type.UNIXTIME_MICROS);
  return getLongOrOffset(columnIndex);
}

/**
 * Returns the long column value if the column type is INT64 or UNIXTIME_MICROS.
 * Returns the column's offset into the indirectData if the column type is BINARY or STRING.
 * @param columnIndex Column index in the schema
 * @return a long value for the column
 */
long getLongOrOffset(int columnIndex) {
  return Bytes.getLong(this.rowData.getRawArray(),
      this.rowData.getRawOffset() +
      getCurrentRowDataOffsetForColumn(columnIndex));
}
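For string data, the code first reads the offset, then reads the 8 bytes immediately after it as the data length, and finally uses the offset and length together with the base offset of indirectData to read the value. Both the offset and the length are asserted to be smaller than Integer.MAX_VALUE, which caps the size of a single value.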
/**
 * Get the specified column's string.
 * @param columnIndex Column index in the schema
 * @return a string
 * @throws IllegalArgumentException if the column is null
 * or if the type doesn't match the column's type
 * @throws IndexOutOfBoundsException if the column doesn't exist
 */
public String getString(int columnIndex) {
  checkValidColumn(columnIndex);
  checkNull(columnIndex);
  checkType(columnIndex, Type.STRING);
  // C++ puts a Slice in rowData which is 16 bytes long for simplicity, but we only support ints.
  long offset = getLongOrOffset(columnIndex);
  long length = rowData.getLong(getCurrentRowDataOffsetForColumn(columnIndex) + 8);
  assert offset < Integer.MAX_VALUE;
  assert length < Integer.MAX_VALUE;
  return Bytes.getString(indirectData.getRawArray(),
      indirectData.getRawOffset() + (int)offset,
      (int)length);
}
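The two-step lookup in getString (read an offset and length from the fixed-width cell, then slice the indirect buffer) can be reproduced with plain byte arrays. `IndirectReadSketch` and its helpers are hypothetical; plain arrays replace Kudu's Slice type, and the little-endian layout of the two 8-byte fields is an assumption of this sketch.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class IndirectReadSketch {
  /** Reads the (offset, length) pair at cellOffset in rowData, then decodes that slice. */
  static String readString(byte[] rowData, int cellOffset, byte[] indirectData) {
    ByteBuffer row = ByteBuffer.wrap(rowData).order(ByteOrder.LITTLE_ENDIAN);
    long offset = row.getLong(cellOffset);       // first 8 bytes: offset into indirectData
    long length = row.getLong(cellOffset + 8);   // next 8 bytes: length of the value
    // Explicit bounds check instead of asserts, which are off unless the JVM runs with -ea.
    if (offset < 0 || length < 0 || length > indirectData.length
        || offset > indirectData.length - length) {
      throw new IllegalArgumentException("cell points outside indirectData");
    }
    return new String(indirectData, (int) offset, (int) length, StandardCharsets.UTF_8);
  }

  /** Builds a single 16-byte cell for demonstration purposes. */
  static byte[] encodeCell(long offset, long length) {
    return ByteBuffer.allocate(16)
        .order(ByteOrder.LITTLE_ENDIAN)
        .putLong(offset)
        .putLong(length)
        .array();
  }

  public static void main(String[] args) {
    byte[] indirect = "hellokudu".getBytes(StandardCharsets.UTF_8);
    byte[] cell = encodeCell(5, 4);
    System.out.println(readString(cell, 0, indirect)); // kudu
  }
}
```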
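Summary
These are my notes on the data structures and the read/write logic the Kudu client uses for writing and reading data. If anything here is wrong, corrections are welcome.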