Data Structure Design of Row Reads and Writes in the Kudu Java Client

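When I first started working, a senior developer told me that when reading source code, the first thing to do is understand the data structures of each component, because a clear picture of the data structures makes the code much easier to follow. I have kept that advice in mind ever since. Some time ago I had to modify part of kudu-client to support data encryption and decryption, so I took a close look at how the Kudu Java client models a Kudu row and how row data is written and read. This post walks through the design of the row and serves as my notes.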

Writing data: PartialRow

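The Kudu client models a single row of data (Row) with the following structure:
[Figure: PartialRow class diagram]
The overall structure is fairly simple:
schema: holds the column (field) information
varLengthData: stores variable-length data, such as values of type string and binary
rowAlloc: stores fixed-length data, such as numeric types
columnsBitSet and nullsBitSet: two bit sets that record which columns have had a value set (and which were set to null); frozen controls whether the row can still be modified
These structures are initialized as follows: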

/**
   * This is not a stable API, prefer using {@link Schema#newPartialRow()}
   * to create a new partial row.
   * @param schema the schema to use for this row
   */
  public PartialRow(Schema schema) {
    this.schema = schema;
    this.columnsBitSet = new BitSet(this.schema.getColumnCount());
    this.nullsBitSet = schema.hasNullableColumns() ?
        new BitSet(this.schema.getColumnCount()) : null;
    this.rowAlloc = new byte[schema.getRowSize()];
    // Pre-fill the array with nulls. We'll only replace cells that have varlen values.
    this.varLengthData = Arrays.asList(new ByteBuffer[this.schema.getColumnCount()]);
  }

The constructor above initializes varLengthData and sizes rowAlloc from the schema. Let's look at how the size of rowAlloc is computed:

/**
   * Gives the size in bytes for a single row given the specified schema
   * @param columns the row's columns
   * @return row size in bytes
   */
  private  int getRowSize(List<ColumnSchema> columns) {
    int totalSize = 0;
    boolean hasNullables = false;
    Set<Integer> encryptedColIndices = getEncryptedColumnIdsToAlgortithms().keySet();
    int size = columns.size();
    for (int i = 0; i < size; i++) {
      totalSize += columns.get(i).getTypeSize(encryptedColIndices.contains(i));
      hasNullables |= columns.get(i).isNullable();
    }
    if (hasNullables) {
      totalSize += Bytes.getBitSetSize(columns.size());
    }
    return totalSize;
  }

The code above iterates over every column and adds up the per-column sizes. Next, let's see how many bytes each data type actually occupies:

/**
   * Gives the size in bytes for a given DataType, as per the pb specification
   * @param type pb type
   * @return size in bytes
   */
  private static int getTypeSize(DataType type) {
    switch (type) {
      case STRING:
      case BINARY:
        return 8 + 8; // offset then string length
      case BOOL:
      case INT8:
      case IS_DELETED:
        return 1;
      case INT16:
        return Shorts.BYTES;
      case INT32:
      case FLOAT:
        return Ints.BYTES;
      case INT64:
      case DOUBLE:
      case UNIXTIME_MICROS:
        return Longs.BYTES;
      default: throw new IllegalArgumentException("The provided data type doesn't map" +
          " to know any known one.");
    }
  }

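Here string and binary columns are each given 16 bytes to record the offset of the data and its actual length (8 + 8). Every other type is given a fixed width determined by its type.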
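To make the arithmetic concrete, here is a minimal sketch of the same calculation outside of Kudu. The names `RowSizeSketch`, `ColType`, `Col`, `typeSize`, and `bitSetSize` are hypothetical stand-ins rather than Kudu classes, and the `(n + 7) / 8` bit set size is my assumption about what `Bytes.getBitSetSize` computes (one bit per column, rounded up to whole bytes).

```java
import java.util.Arrays;
import java.util.List;

public class RowSizeSketch {
    // Simplified stand-in for Kudu's column types (hypothetical, illustration only).
    public enum ColType { BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY }

    // Hypothetical column descriptor: a type plus a nullable flag.
    public static final class Col {
        final ColType type;
        final boolean nullable;

        public Col(ColType type, boolean nullable) {
            this.type = type;
            this.nullable = nullable;
        }
    }

    // Per-type width in the fixed-length row area, mirroring the switch shown above.
    public static int typeSize(ColType type) {
        switch (type) {
            case STRING:
            case BINARY:
                return 8 + 8; // offset, then length
            case BOOL:
            case INT8:
                return 1;
            case INT16:
                return Short.BYTES;
            case INT32:
            case FLOAT:
                return Integer.BYTES;
            case INT64:
            case DOUBLE:
                return Long.BYTES;
            default:
                throw new IllegalArgumentException("Unknown type: " + type);
        }
    }

    // Assumed formula: one bit per column, rounded up to whole bytes.
    public static int bitSetSize(int columnCount) {
        return (columnCount + 7) / 8;
    }

    // Sum of the column widths, plus a null bit set if any column is nullable.
    public static int rowSize(List<Col> columns) {
        int total = 0;
        boolean hasNullables = false;
        for (Col c : columns) {
            total += typeSize(c.type);
            hasNullables |= c.nullable;
        }
        if (hasNullables) {
            total += bitSetSize(columns.size());
        }
        return total;
    }

    public static void main(String[] args) {
        List<Col> columns = Arrays.asList(
            new Col(ColType.INT32, false),
            new Col(ColType.STRING, true),
            new Col(ColType.INT64, false));
        // 4 (INT32) + 16 (STRING) + 8 (INT64) + 1 (null bit set for 3 columns)
        System.out.println(rowSize(columns)); // prints 29
    }
}
```

For a schema of (INT32, nullable STRING, INT64), the row area is 4 + 16 + 8 bytes for the cells plus 1 byte for the null bit set, 29 bytes in total.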

Writing data into a PartialRow

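Writing is done by calling the row's addXX methods. Below is one example of setting a value for fixed-length data and one for variable-length data.
For fixed-length data, take int as the example: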

/**
   * Add an int for the specified column.
   * @param columnIndex the column's index in the schema
   * @param val value to add
   * @throws IllegalArgumentException if the value doesn't match the column's type
   * @throws IllegalStateException if the row was already applied
   * @throws IndexOutOfBoundsException if the column doesn't exist
   */
  public void addInt(int columnIndex, int val) {
    checkNotFrozen();
    checkColumn(schema.getColumnByIndex(columnIndex), Type.INT32);
    Bytes.setInt(rowAlloc, val, getPositionInRowAllocAndSetBitSet(columnIndex));
  }

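This is very simple: the value is written directly into the rowAlloc byte array, and the column is marked as set. The important part is how the position for the value is obtained, which is the job of getPositionInRowAllocAndSetBitSet: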

 /**
   * Sets the column bit set for the column index, and returns the column's offset.
   * @param columnIndex the index of the column to get the position for and mark as set
   * @return the offset in rowAlloc for the column
   */
  private int getPositionInRowAllocAndSetBitSet(int columnIndex) {
    columnsBitSet.set(columnIndex);
    return schema.getColumnOffset(columnIndex);
  }
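The fixed-length write path can be sketched without Kudu as follows. This is only an illustration under assumptions: `FixedWriteSketch` is a hypothetical class, column offsets are assumed to be the running sum of the preceding column widths, and values are assumed to be stored little-endian.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.BitSet;

public class FixedWriteSketch {
    final int[] offsets;        // byte offset of each column inside rowAlloc
    final byte[] rowAlloc;      // fixed-length row area
    final BitSet columnsBitSet; // which columns have been set

    FixedWriteSketch(int[] columnSizes) {
        // Assumption: each offset is the running sum of the preceding column widths.
        offsets = new int[columnSizes.length];
        int pos = 0;
        for (int i = 0; i < columnSizes.length; i++) {
            offsets[i] = pos;
            pos += columnSizes[i];
        }
        rowAlloc = new byte[pos];
        columnsBitSet = new BitSet(columnSizes.length);
    }

    // Marks the column as set and returns where its value lives in rowAlloc.
    int positionAndMarkSet(int columnIndex) {
        columnsBitSet.set(columnIndex);
        return offsets[columnIndex];
    }

    // Writes a 4-byte int at the column's offset (little-endian is an assumption).
    void addInt(int columnIndex, int val) {
        ByteBuffer.wrap(rowAlloc)
            .order(ByteOrder.LITTLE_ENDIAN)
            .putInt(positionAndMarkSet(columnIndex), val);
    }

    // Reads the int back from the same position.
    int getInt(int columnIndex) {
        return ByteBuffer.wrap(rowAlloc)
            .order(ByteOrder.LITTLE_ENDIAN)
            .getInt(offsets[columnIndex]);
    }

    public static void main(String[] args) {
        // Hypothetical schema: INT32, INT64, INT32.
        FixedWriteSketch row = new FixedWriteSketch(new int[] {4, 8, 4});
        row.addInt(2, 7);
        System.out.println(row.offsets[2]);           // prints 12
        System.out.println(row.getInt(2));            // prints 7
        System.out.println(row.columnsBitSet.get(0)); // prints false
    }
}
```

The bit set is what lets a partial row distinguish "column never set" from "column set to zero", since both look identical in the zero-filled byte array.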

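For variable-length data, take string as the example. It is just as simple: the value is handed straight to addVarLengthData, which keeps it in varLengthData.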

 /**
   * Add a String for the specified value, encoded as UTF8.
   * Note that the provided value must not be mutated after this.
   * @param columnIndex the column's index in the schema
   * @param val value to add
   * @throws IllegalArgumentException if the value doesn't match the column's type
   * @throws IllegalStateException if the row was already applied
   * @throws IndexOutOfBoundsException if the column doesn't exist
   */
  public void addStringUtf8(int columnIndex, byte[] val) {
    // TODO: use Utf8.isWellFormed from Guava 16 to verify that.
    // the user isn't putting in any garbage data.
    checkNotFrozen();
    checkColumn(schema.getColumnByIndex(columnIndex), Type.STRING);
    addVarLengthData(columnIndex, val);
  }
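Since addVarLengthData itself is not shown above, here is a simplified sketch of the idea, not the actual Kudu implementation. `VarLenWriteSketch` and its methods are hypothetical; the only parts taken from the source are the `Arrays.asList(new ByteBuffer[n])` list pre-filled with nulls and the column bit set.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

public class VarLenWriteSketch {
    final List<ByteBuffer> varLengthData; // one slot per column, null until set
    final BitSet columnsBitSet;

    VarLenWriteSketch(int columnCount) {
        // Arrays.asList over an array yields a fixed-size list pre-filled with nulls:
        // set(index, value) works, while add() would throw UnsupportedOperationException.
        varLengthData = Arrays.asList(new ByteBuffer[columnCount]);
        columnsBitSet = new BitSet(columnCount);
    }

    // Stores the UTF-8 bytes in the column's slot and marks the column as set.
    void addString(int columnIndex, String val) {
        byte[] bytes = val.getBytes(StandardCharsets.UTF_8);
        varLengthData.set(columnIndex, ByteBuffer.wrap(bytes));
        columnsBitSet.set(columnIndex);
    }

    // Reads the value back without disturbing the stored buffer's position.
    String getString(int columnIndex) {
        ByteBuffer buf = varLengthData.get(columnIndex);
        if (buf == null) {
            return null;
        }
        ByteBuffer dup = buf.duplicate();
        byte[] out = new byte[dup.remaining()];
        dup.get(out);
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        VarLenWriteSketch row = new VarLenWriteSketch(3);
        row.addString(1, "kudu");
        System.out.println(row.getString(1));         // prints kudu
        System.out.println(row.getString(0));         // prints null
        System.out.println(row.columnsBitSet.get(1)); // prints true
    }
}
```

Keeping variable-length values in a per-column list means the fixed-length byte array never has to grow; only the 16-byte offset and length slot is reserved for each such column.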

Reading data: RowResult

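The class comment says: "RowResult represents one row from a scanner." In other words, it represents a row returned after reading data from Kudu. Its structure is as follows:
[Figure: RowResult class diagram]
Here rowData stores the fixed-length data, and indirectData stores the variable-length data of string and binary columns. Let's use long and string as examples of how data is read.
The reason for using long as an example is that there is a special case here. Recall that PartialRow also reserves space in rowAlloc for string and binary columns. That space holds the offset and length of the variable-length data: 16 bytes in total, 8 bytes for the offset and 8 bytes for the length. The same method that reads a long value is therefore also used to read that offset.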

 /**
   * Get the specified column's long
   *
   * If this is a UNIXTIME_MICROS column, the long value corresponds to a number of microseconds
   * since midnight, January 1, 1970 UTC.
   *
   * @param columnIndex Column index in the schema
   * @return a positive long
   * @throws IllegalArgumentException if the column is null
   * @throws IndexOutOfBoundsException if the column doesn't exist
   */
  public long getLong(int columnIndex) {
    checkValidColumn(columnIndex);
    checkNull(columnIndex);
    checkType(columnIndex, Type.INT64, Type.UNIXTIME_MICROS);
    return getLongOrOffset(columnIndex);
  }
  /**
   * Returns the long column value if the column type is INT64 or UNIXTIME_MICROS.
   * Returns the column's offset into the indirectData if the column type is BINARY or STRING.
   * @param columnIndex Column index in the schema
   * @return a long value for the column
   */
  long getLongOrOffset(int columnIndex) {
    return Bytes.getLong(this.rowData.getRawArray(),
        this.rowData.getRawOffset() +
            getCurrentRowDataOffsetForColumn(columnIndex));
  }

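For string data, the code first reads the offset, then reads the 8 bytes immediately after it as the data length, and finally uses the offset and length, together with the base offset of indirectData, to read the actual bytes. Both the offset and the length are asserted to be less than Integer.MAX_VALUE, which bounds the maximum data length: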

/**
   * Get the specified column's string.
   * @param columnIndex Column index in the schema
   * @return a string
   * @throws IllegalArgumentException if the column is null
   * or if the type doesn't match the column's type
   * @throws IndexOutOfBoundsException if the column doesn't exist
   */
  public String getString(int columnIndex) {
    checkValidColumn(columnIndex);
    checkNull(columnIndex);
    checkType(columnIndex, Type.STRING);
    // C++ puts a Slice in rowData which is 16 bytes long for simplicity, but we only support ints.
    long offset = getLongOrOffset(columnIndex);
    long length = rowData.getLong(getCurrentRowDataOffsetForColumn(columnIndex) + 8);
    assert offset < Integer.MAX_VALUE;
    assert length < Integer.MAX_VALUE;
    return Bytes.getString(indirectData.getRawArray(),
                           indirectData.getRawOffset() + (int)offset,
                           (int)length);
  }
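The offset-plus-length lookup can be reproduced with plain byte arrays. The sketch below is an illustration under assumptions, not Kudu code: `IndirectReadSketch`, `buildCell`, and `readString` are hypothetical names, and little-endian encoding of the two 8-byte fields is assumed.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class IndirectReadSketch {
    // Builds a 16-byte STRING cell: 8-byte offset followed by 8-byte length.
    static byte[] buildCell(long offset, long length) {
        return ByteBuffer.allocate(16)
            .order(ByteOrder.LITTLE_ENDIAN)
            .putLong(offset)
            .putLong(length)
            .array();
    }

    // Reads the cell at cellOffset in rowData, then slices the bytes out of indirectData.
    static String readString(byte[] rowData, int cellOffset, byte[] indirectData) {
        ByteBuffer row = ByteBuffer.wrap(rowData).order(ByteOrder.LITTLE_ENDIAN);
        long offset = row.getLong(cellOffset);     // where the value starts in indirectData
        long length = row.getLong(cellOffset + 8); // how many bytes the value occupies
        if (offset >= Integer.MAX_VALUE || length >= Integer.MAX_VALUE) {
            throw new IllegalStateException("offset or length does not fit in an int");
        }
        return new String(indirectData, (int) offset, (int) length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two values packed back to back in the indirect area: "hello" and "kudu".
        byte[] indirectData = "hellokudu".getBytes(StandardCharsets.UTF_8);
        byte[] first = buildCell(0, 5);
        byte[] second = buildCell(5, 4);
        System.out.println(readString(first, 0, indirectData));  // prints hello
        System.out.println(readString(second, 0, indirectData)); // prints kudu
    }
}
```

The fixed-length area stays a flat array of equal-size rows, so any cell can be located with simple arithmetic, while the actual string bytes live in a separate buffer addressed through the stored offset and length.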

Summary

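These are my notes on the data structures the Kudu client uses for writing and reading rows, along with the read and write logic. If anything here is wrong, corrections are welcome.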
