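When I first started working, a senior developer told me that when reading source code, you should first understand the data structure design of each part, because getting the data structures clear helps a great deal in understanding the code. I have kept that advice in mind ever since. Recently, for work, I needed to modify part of the kudu-client code to implement data encryption and decryption. So I studied how the Kudu Java client designs its data structure for a Kudu row and how data is read and written. Here I walk through the design of the row and take some notes.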

Writing data: PartialRow

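The Kudu client models a single row of data (Row) with a handful of fields.

The overall structure is fairly simple:
schema: holds the column information
varLengthData: stores variable-length data, such as string and binary values
rowAlloc: stores fixed-length data, such as numeric types
columnsBitSet and nullsBitSet: two bitsets that record whether each column has been set and whether it is null; frozen controls whether the row can still be modified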
These structures are initialized as follows:

/**
   * This is not a stable API, prefer using {@link Schema#newPartialRow()}
   * to create a new partial row.
   * @param schema the schema to use for this row
   */
  public PartialRow(Schema schema) {
    this.schema = schema;
    this.columnsBitSet = new BitSet(this.schema.getColumnCount());
    this.nullsBitSet = schema.hasNullableColumns() ?
        new BitSet(this.schema.getColumnCount()) : null;
    this.rowAlloc = new byte[schema.getRowSize()];
    // Pre-fill the array with nulls. We'll only replace cells that have varlen values.
    this.varLengthData = Arrays.asList(new ByteBuffer[this.schema.getColumnCount()]);
  }

The constructor above initializes varLengthData and sizes rowAlloc from the schema's row size. Let's look closely at how the size of rowAlloc is computed:

/**
   * Gives the size in bytes for a single row given the specified schema
   * @param columns the row's columns
   * @return row size in bytes
   */
  private int getRowSize(List<ColumnSchema> columns) {
    int totalSize = 0;
    boolean hasNullables = false;
    Set<Integer> encryptedColIndices = getEncryptedColumnIdsToAlgortithms().keySet();
    int size = columns.size();
    for (int i = 0; i < size; i++) {
      totalSize += columns.get(i).getTypeSize(encryptedColIndices.contains(i));
      hasNullables |= columns.get(i).isNullable();
    }
    if (hasNullables) {
      totalSize += Bytes.getBitSetSize(columns.size());
    }
    return totalSize;
  }

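The code above iterates over every column and adds up the data lengths, plus a bitset for the null flags if any column is nullable. Note that the encryptedColIndices lookup in this snippet comes from the encryption changes mentioned at the start, so it presumably does not appear in the stock client. Now let's see how many bytes each data type actually occupies: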

/**
   * Gives the size in bytes for a given DataType, as per the pb specification
   * @param type pb type
   * @return size in bytes
   */
  private static int getTypeSize(DataType type) {
    switch (type) {
      case STRING:
      case BINARY:
        return 8 + 8; // offset then string length
      case BOOL:
      case INT8:
      case IS_DELETED:
        return 1;
      case INT16:
        return Shorts.BYTES;
      case INT32:
      case FLOAT:
        return Ints.BYTES;
      case INT64:
      case DOUBLE:
      case UNIXTIME_MICROS:
        return Longs.BYTES;
      default: throw new IllegalArgumentException("The provided data type doesn't map" +
          " to know any known one.");
    }
  }

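Here string and binary columns are each given 16 bytes to record the data's offset and actual length (8 + 8); every other type gets a fixed width determined by the type itself.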
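As a worked example of the size calculation, here is a minimal sketch that adds up the per-type widths for a small, made-up schema. The RowSizeSketch class, its ColType enum and its Col class are hypothetical stand-ins, not Kudu classes, and the bitset size is assumed to be one bit per column rounded up to whole bytes.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the row-size arithmetic; not part of kudu-client. */
final class RowSizeSketch {
  /** Stand-in for the column types; only the widths from getTypeSize matter here. */
  enum ColType {
    STRING(16), BINARY(16), BOOL(1), INT8(1), INT16(2), INT32(4), FLOAT(4),
    INT64(8), DOUBLE(8), UNIXTIME_MICROS(8);

    final int size;
    ColType(int size) { this.size = size; }
  }

  /** Stand-in column description: a type plus a nullable flag. */
  static final class Col {
    final ColType type;
    final boolean nullable;
    Col(ColType type, boolean nullable) {
      this.type = type;
      this.nullable = nullable;
    }
  }

  /** Assumed bitset size: one bit per column, rounded up to whole bytes. */
  static int bitSetSize(int columns) {
    return (columns + 7) / 8;
  }

  /** Sums the fixed-width slots, then adds a null bitmap if any column is nullable. */
  static int rowSize(List<Col> columns) {
    int total = 0;
    boolean hasNullables = false;
    for (Col c : columns) {
      total += c.type.size;
      hasNullables |= c.nullable;
    }
    if (hasNullables) {
      total += bitSetSize(columns.size());
    }
    return total;
  }

  public static void main(String[] args) {
    List<Col> schema = Arrays.asList(
        new Col(ColType.INT32, false),   // 4 bytes
        new Col(ColType.STRING, false),  // 16 bytes (offset + length)
        new Col(ColType.INT64, true));   // 8 bytes, nullable
    System.out.println(rowSize(schema)); // 4 + 16 + 8 + 1 = 29
  }
}
```

Note that the string column contributes 16 bytes no matter how long the string is, because only its offset and length live in the fixed-width area.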

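Writing data into a PartialRow

Writing is done by calling the row's addXX methods. Let's take one example each for fixed-length and variable-length data.
For fixed-length data, take int as the example: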

/**
   * Add an int for the specified column.
   * @param columnIndex the column's index in the schema
   * @param val value to add
   * @throws IllegalArgumentException if the value doesn't match the column's type
   * @throws IllegalStateException if the row was already applied
   * @throws IndexOutOfBoundsException if the column doesn't exist
   */
  public void addInt(int columnIndex, int val) {
    checkNotFrozen();
    checkColumn(schema.getColumnByIndex(columnIndex), Type.INT32);
    Bytes.setInt(rowAlloc, val, getPositionInRowAllocAndSetBitSet(columnIndex));
  }

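This is very simple: the value is written directly into the rowAlloc byte array and the column is marked as set. The important part is how the position for the value is obtained, namely getPositionInRowAllocAndSetBitSet: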

 /**
   * Sets the column bit set for the column index, and returns the column's offset.
   * @param columnIndex the index of the column to get the position for and mark as set
   * @return the offset in rowAlloc for the column
   */
  private int getPositionInRowAllocAndSetBitSet(int columnIndex) {
    columnsBitSet.set(columnIndex);
    return schema.getColumnOffset(columnIndex);
  }
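To make the offset lookup concrete, here is a minimal sketch of a fixed-width write. It assumes that each column's offset is the running sum of the widths of the columns before it, and that values are written little-endian; both are assumptions for illustration, and FixedWidthWriteSketch with its helpers is a hypothetical class, not the Kudu implementation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.BitSet;

/** Hypothetical mini row: fixed-width slots only, no type or frozen checks. */
final class FixedWidthWriteSketch {
  final int[] offsets;
  final byte[] rowAlloc;
  final BitSet columnsBitSet;

  FixedWidthWriteSketch(int[] columnSizes) {
    this.offsets = new int[columnSizes.length];
    int next = 0;
    for (int i = 0; i < columnSizes.length; i++) {
      offsets[i] = next;        // each column starts where the previous one ends
      next += columnSizes[i];
    }
    this.rowAlloc = new byte[next];
    this.columnsBitSet = new BitSet(columnSizes.length);
  }

  /** Marks the column as set and returns its offset in rowAlloc. */
  private int positionAndSetBit(int columnIndex) {
    columnsBitSet.set(columnIndex);
    return offsets[columnIndex];
  }

  /** Writes a 4-byte value at the column's offset (little-endian is an assumption). */
  void addInt(int columnIndex, int val) {
    ByteBuffer.wrap(rowAlloc).order(ByteOrder.LITTLE_ENDIAN)
        .putInt(positionAndSetBit(columnIndex), val);
  }

  public static void main(String[] args) {
    // Made-up schema: INT32, STRING slot (offset + length), INT32.
    FixedWidthWriteSketch row = new FixedWidthWriteSketch(new int[] {4, 16, 4});
    row.addInt(2, 7);
    System.out.println(row.offsets[2]);     // prints 20
    System.out.println(row.columnsBitSet);  // prints {2}
  }
}
```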

For variable-length data, take string as the example. It is equally simple: the value is passed straight to addVarLengthData.

 /**
   * Add a String for the specified value, encoded as UTF8.
   * Note that the provided value must not be mutated after this.
   * @param columnIndex the column's index in the schema
   * @param val value to add
   * @throws IllegalArgumentException if the value doesn't match the column's type
   * @throws IllegalStateException if the row was already applied
   * @throws IndexOutOfBoundsException if the column doesn't exist
   */
  public void addStringUtf8(int columnIndex, byte[] val) {
    // TODO: use Utf8.isWellFormed from Guava 16 to verify that.
    // the user isn't putting in any garbage data.
    checkNotFrozen();
    checkColumn(schema.getColumnByIndex(columnIndex), Type.STRING);
    addVarLengthData(columnIndex, val);
  }
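The body of addVarLengthData is not shown in this post. Going by the constructor comment ("We'll only replace cells that have varlen values"), a plausible minimal sketch is to keep one slot per column and replace the slot when a value is added. VarLengthWriteSketch below is a hypothetical class written under that assumption, not the actual Kudu code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Hypothetical sketch of per-column storage for variable-length values. */
final class VarLengthWriteSketch {
  final List<ByteBuffer> varLengthData;
  final BitSet columnsBitSet;

  VarLengthWriteSketch(int columnCount) {
    // Fixed-size list pre-filled with nulls; set(i, ...) works, add(...) would throw.
    this.varLengthData = Arrays.asList(new ByteBuffer[columnCount]);
    this.columnsBitSet = new BitSet(columnCount);
  }

  /** Stores the UTF-8 bytes in the slot for this column and marks the column as set. */
  void addString(int columnIndex, String val) {
    byte[] utf8 = val.getBytes(StandardCharsets.UTF_8);
    varLengthData.set(columnIndex, ByteBuffer.wrap(utf8));
    columnsBitSet.set(columnIndex);
  }

  public static void main(String[] args) {
    VarLengthWriteSketch row = new VarLengthWriteSketch(3);
    row.addString(1, "kudu");
    System.out.println(row.varLengthData.get(0));             // prints null
    System.out.println(row.varLengthData.get(1).remaining()); // prints 4
  }
}
```

Because the byte array is wrapped rather than copied in this sketch, mutating it afterwards would change the stored value, which matches the Javadoc warning that the provided value must not be mutated.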

Reading data: RowResult

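The Javadoc says: RowResult represents one row from a scanner. In other words, it represents a row returned after reading data from Kudu.

In its structure, rowData stores the fixed-length data and indirectData stores the variable-length string and binary data. Let's use long and string as examples to see how data is read.
I picked long because of one special point. Recall that PartialRow also reserves space in rowAlloc for string and binary columns: 16 bytes that hold the variable-length data's offset (8 bytes) and length (8 bytes). The same 8-byte read that returns a long column's value is therefore also used to read a string's offset.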

 /**
   * Get the specified column's long
   *
   * If this is a UNIXTIME_MICROS column, the long value corresponds to a number of microseconds
   * since midnight, January 1, 1970 UTC.
   *
   * @param columnIndex Column index in the schema
   * @return a positive long
   * @throws IllegalArgumentException if the column is null
   * @throws IndexOutOfBoundsException if the column doesn't exist
   */
  public long getLong(int columnIndex) {
    checkValidColumn(columnIndex);
    checkNull(columnIndex);
    checkType(columnIndex, Type.INT64, Type.UNIXTIME_MICROS);
    return getLongOrOffset(columnIndex);
  }

  /**
   * Returns the long column value if the column type is INT64 or UNIXTIME_MICROS.
   * Returns the column's offset into the indirectData if the column type is BINARY or STRING.
   * @param columnIndex Column index in the schema
   * @return a long value for the column
   */
  long getLongOrOffset(int columnIndex) {
    return Bytes.getLong(this.rowData.getRawArray(),
        this.rowData.getRawOffset() +
            getCurrentRowDataOffsetForColumn(columnIndex));
  }
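The read above can be sketched in isolation. This minimal example assumes that rows are packed back to back in rowData, each rowSize bytes wide, so that a column's position is the row's start plus the column's offset, and that values are little-endian. Both are assumptions, and LongReadSketch is a hypothetical class rather than Kudu's own code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical sketch of reading an 8-byte slot from a packed row buffer. */
final class LongReadSketch {
  /**
   * Reads the 8-byte slot of one column in one row.
   * Assumes rows are packed back to back, each rowSize bytes wide.
   */
  static long getLongOrOffset(byte[] rowData, int rowSize, int rowIndex, int columnOffset) {
    int position = rowIndex * rowSize + columnOffset;
    return ByteBuffer.wrap(rowData).order(ByteOrder.LITTLE_ENDIAN).getLong(position);
  }

  public static void main(String[] args) {
    int rowSize = 12; // made-up schema: INT32 (4 bytes) + INT64 (8 bytes)
    ByteBuffer rows = ByteBuffer.allocate(2 * rowSize).order(ByteOrder.LITTLE_ENDIAN);
    rows.putInt(0, 1).putLong(4, 100L);   // row 0
    rows.putInt(12, 2).putLong(16, 200L); // row 1
    System.out.println(getLongOrOffset(rows.array(), rowSize, 1, 4)); // prints 200
  }
}
```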

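For string data, getString first reads the offset, then reads the length from the 8 bytes immediately following it, and finally uses the offset and length together with the base offset of indirectData to read the value. Both offset and length are asserted to be less than Integer.MAX_VALUE, since they are cast to int.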

/**
   * Get the specified column's string.
   * @param columnIndex Column index in the schema
   * @return a string
   * @throws IllegalArgumentException if the column is null
   * or if the type doesn't match the column's type
   * @throws IndexOutOfBoundsException if the column doesn't exist
   */
  public String getString(int columnIndex) {
    checkValidColumn(columnIndex);
    checkNull(columnIndex);
    checkType(columnIndex, Type.STRING);
    // C++ puts a Slice in rowData which is 16 bytes long for simplicity, but we only support ints.
    long offset = getLongOrOffset(columnIndex);
    long length = rowData.getLong(getCurrentRowDataOffsetForColumn(columnIndex) + 8);
    assert offset < Integer.MAX_VALUE;
    assert length < Integer.MAX_VALUE;
    return Bytes.getString(indirectData.getRawArray(),
                           indirectData.getRawOffset() + (int)offset,
                           (int)length);
  }
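The whole string read path can be reproduced with plain byte arrays. The sketch below assumes little-endian 8-byte values for the offset and length, and uses a made-up two-column row (an INT64 followed by a STRING slot). StringReadSketch, buildRow and readString are hypothetical names for illustration only.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of decoding a string from rowData plus indirectData. */
final class StringReadSketch {
  /** Builds one 24-byte row: an INT64 value, then a 16-byte (offset, length) slot. */
  static byte[] buildRow(long int64Value, long stringOffset, long stringLength) {
    ByteBuffer row = ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN);
    row.putLong(0, int64Value);
    row.putLong(8, stringOffset);
    row.putLong(16, stringLength);
    return row.array();
  }

  /** Reads the (offset, length) slot from rowData, then slices indirectData. */
  static String readString(byte[] rowData, int columnOffset, byte[] indirectData) {
    ByteBuffer row = ByteBuffer.wrap(rowData).order(ByteOrder.LITTLE_ENDIAN);
    long offset = row.getLong(columnOffset);
    long length = row.getLong(columnOffset + 8);
    if (offset >= Integer.MAX_VALUE || length >= Integer.MAX_VALUE) {
      throw new IllegalStateException("offset or length does not fit in an int");
    }
    return new String(indirectData, (int) offset, (int) length, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] indirectData = "hellokudu".getBytes(StandardCharsets.UTF_8);
    byte[] rowData = buildRow(42L, 5L, 4L); // string starts at byte 5, 4 bytes long
    System.out.println(readString(rowData, 8, indirectData)); // prints kudu
  }
}
```

The sketch throws instead of using assert so that the range check also runs when assertions are disabled; that is a choice made for this example, not something taken from the client.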

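Summary

That concludes my notes on the data structures and the read/write logic that the Kudu client uses for writing and reading rows. If anything here is wrong, corrections are welcome.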
