ORC: Moving from Stream Reads to Column Reads

ORC Stream

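When ORC reads a stripe, it reads it stream by stream. A column in a stripe may consist of a single stream or of several streams of different kinds. The reader does not treat a stream as a sub-unit of a column. The stream kinds are defined as follows: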

enum Kind {
   // boolean stream of whether the next value is non-null
   PRESENT = 0;
   // the primary data stream
   DATA = 1;
   // the length of each value for variable length data
   LENGTH = 2;
   // the dictionary blob
   DICTIONARY_DATA = 3;
   // deprecated prior to Hive 0.11
   // It was used to store the number of instances of each value in the
   // dictionary
   DICTIONARY_COUNT = 4;
   // a secondary data stream
   SECONDARY = 5;
   // the index for seeking to particular row groups
   ROW_INDEX = 6;
   // original bloom filters used before ORC-101
   BLOOM_FILTER = 7;
   // bloom filters that consistently use utf8
   BLOOM_FILTER_UTF8 = 8;

   // Virtual stream kinds to allocate space for encrypted index and data.
   ENCRYPTED_INDEX = 9;
   ENCRYPTED_DATA = 10;

   // stripe statistics streams
   STRIPE_STATISTICS = 100;
   // A virtual stream kind that is used for setting the encryption IV.
   FILE_STATISTICS = 101;
 }

  public static Area getArea(OrcProto.Stream.Kind kind) {
    switch (kind) {
      case ROW_INDEX:
      case DICTIONARY_COUNT:
      case BLOOM_FILTER:
      case BLOOM_FILTER_UTF8:
        return Area.INDEX;
      default:
        return Area.DATA;
    }
  }
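
To make the INDEX/DATA split concrete, here is a minimal standalone sketch, not ORC code. The `Kind` and `Area` enums are trimmed-down stand-ins for `OrcProto.Stream.Kind` and `StreamName.Area`, and `StreamInfo` and `layout` are hypothetical names introduced only for illustration. It assumes the streams are stored back to back in the order they are listed, so each stream's offset is the running sum of the lengths before it.

```java
import java.util.ArrayList;
import java.util.List;

class StreamAreaSketch {
  // Stand-ins for OrcProto.Stream.Kind and StreamName.Area (illustration only).
  enum Kind { PRESENT, DATA, LENGTH, DICTIONARY_DATA, SECONDARY, ROW_INDEX, BLOOM_FILTER }
  enum Area { INDEX, DATA }

  // Hypothetical description of one stream: owning column, kind, byte length.
  static final class StreamInfo {
    final int column;
    final Kind kind;
    final long length;

    StreamInfo(int column, Kind kind, long length) {
      this.column = column;
      this.kind = kind;
      this.length = length;
    }
  }

  // Same rule as getArea above, restricted to the kinds in this sketch.
  static Area getArea(Kind kind) {
    switch (kind) {
      case ROW_INDEX:
      case BLOOM_FILTER:
        return Area.INDEX;
      default:
        return Area.DATA;
    }
  }

  // Returns one line per stream: "column kind area offset length".
  static List<String> layout(List<StreamInfo> streams) {
    List<String> lines = new ArrayList<>();
    long offset = 0; // assumed: streams are laid out back to back
    for (StreamInfo s : streams) {
      lines.add(s.column + " " + s.kind + " " + getArea(s.kind) + " " + offset + " " + s.length);
      offset += s.length;
    }
    return lines;
  }

  public static void main(String[] args) {
    List<StreamInfo> streams = new ArrayList<>();
    streams.add(new StreamInfo(1, Kind.ROW_INDEX, 20));
    streams.add(new StreamInfo(2, Kind.ROW_INDEX, 30));
    streams.add(new StreamInfo(1, Kind.DATA, 100));
    // Column 2 is made of three DATA-area streams.
    streams.add(new StreamInfo(2, Kind.PRESENT, 10));
    streams.add(new StreamInfo(2, Kind.DATA, 200));
    streams.add(new StreamInfo(2, Kind.LENGTH, 40));
    for (String line : layout(streams)) {
      System.out.println(line);
    }
  }
}
```

In this made-up stripe, column 2 owns three separate DATA-area streams, which is why a stream-based reader plans its I/O per stream rather than per column.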

From Stream to Column Reading

To meet a development requirement, stream reading is now changed to column reading.

Rewriting the relevant RecordReaderImpl interfaces

Because data is now read one column at a time, includedRowGroups and readAllDataStreams (which reads every stream in a stripe) are commented out:

protected void readStripe() throws IOException {
  StripeInformation stripe = beginReadStripe();

  // includedRowGroups = pickRowGroups();
  ...
  // readAllDataStreams(stripe);
  ...

  if (rowInStripe < rowCountInStripe) {
    readPartialDataStreams(stripe);
    reader.startStripe(streams, stripeFooter);
    // if we skipped the first row group, move the pointers forward
    if (rowInStripe != 0) {
      seekToRowEntry(reader, (int) (rowInStripe / rowIndexStride));
    }
  }
}

private void readPartialDataStreams(StripeInformation stripe) throws IOException {
  // compute the disk ranges covered by the included columns
  DiskRangeList toRead = planReadColumnData(streamList, fileIncluded);
  ...
  // read the column data
  bufferChunks = ((RecordReaderBinaryCacheUtils.DefaultDataReader) dataReader)
          .readFileColumnData(toRead, stripe.getOffset(), false);
  ....
  createStreams(streamList, bufferChunks, fileIncluded,
          dataReader.getCompressionCodec(), bufferSize, streams);
}

// planReadColumnData

/**
   * Plan the ranges of the file that we need to read, given the list of
   * columns in one stripe.
   *
   * @param streamList        the list of streams available
   * @param includedColumns   which columns are needed
   * @return the list of disk ranges that will be loaded
   */
  private DiskRangeList planReadColumnData(
          List<OrcProto.Stream> streamList,
          boolean[] includedColumns) {
    // running offset of the current stream
    long offset = 0;
    // one merged range per included column
    Map<Integer, RecordReaderBinaryCacheImpl.ColumnDiskRange> columnDiskRangeMap = new HashMap<>();
    ColumnDiskRangeList.CreateColumnRangeHelper list =
            new ColumnDiskRangeList.CreateColumnRangeHelper();
    for (OrcProto.Stream stream : streamList) {
      long length = stream.getLength();
      int column = stream.getColumn();
      OrcProto.Stream.Kind streamKind = stream.getKind();
      // since stream kind is optional, first check if it exists
      if (stream.hasKind() &&
              (StreamName.getArea(streamKind) == StreamName.Area.DATA) &&
              (column < includedColumns.length && includedColumns[column])) {
        RecordReaderBinaryCacheImpl.ColumnDiskRange range = columnDiskRangeMap.get(column);
        if (range != null) {
          // a later stream of the same column: extend the existing range
          range.length += length;
        } else {
          columnDiskRangeMap.put(column, new RecordReaderBinaryCacheImpl.ColumnDiskRange(offset, length));
        }
      }
      offset += length;
    }
    for (int columnId = 1; columnId < includedColumns.length; ++columnId) {
      RecordReaderBinaryCacheImpl.ColumnDiskRange range = columnDiskRangeMap.get(columnId);
      // skip included columns that have no DATA-area stream, which would
      // otherwise cause a NullPointerException here
      if (includedColumns[columnId] && range != null) {
        list.add(columnId, currentStripe, range.offset, range.offset + range.length);
      }
    }
    return list.extract();
  }
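
The range-merging idea in planReadColumnData can be tried out in isolation with the following sketch. It is not the code above: `StreamInfo`, `planColumnRanges`, and the `long[]` pair used for a range are hypothetical stand-ins for `OrcProto.Stream`, the planner, and `ColumnDiskRange`. Like the original, it assumes that all DATA-area streams of one column sit next to each other, so adding up their lengths yields one contiguous range per column.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ColumnRangeSketch {
  // Hypothetical stream descriptor: owning column, whether the stream
  // belongs to the DATA area, and its length in bytes.
  static final class StreamInfo {
    final int column;
    final boolean dataArea;
    final long length;

    StreamInfo(int column, boolean dataArea, long length) {
      this.column = column;
      this.dataArea = dataArea;
      this.length = length;
    }
  }

  // Maps each included column to {start, end}, end exclusive.
  // Assumes the DATA-area streams of a column are contiguous.
  static Map<Integer, long[]> planColumnRanges(List<StreamInfo> streams, boolean[] included) {
    Map<Integer, long[]> ranges = new LinkedHashMap<>();
    long offset = 0;
    for (StreamInfo s : streams) {
      if (s.dataArea && s.column < included.length && included[s.column]) {
        long[] range = ranges.get(s.column);
        if (range == null) {
          // first data stream of this column: open a new range
          ranges.put(s.column, new long[] {offset, offset + s.length});
        } else {
          // later data stream of the same column: extend the range
          range[1] += s.length;
        }
      }
      // every stream, included or not, advances the offset
      offset += s.length;
    }
    return ranges;
  }

  public static void main(String[] args) {
    List<StreamInfo> streams = new ArrayList<>();
    streams.add(new StreamInfo(1, false, 20));  // index stream of column 1
    streams.add(new StreamInfo(2, false, 30));  // index stream of column 2
    streams.add(new StreamInfo(1, true, 100));  // data stream of column 1
    streams.add(new StreamInfo(2, true, 10));   // three data streams of column 2
    streams.add(new StreamInfo(2, true, 200));
    streams.add(new StreamInfo(2, true, 40));
    boolean[] included = {true, false, true};   // only column 2 is wanted
    for (Map.Entry<Integer, long[]> e : planColumnRanges(streams, included).entrySet()) {
      System.out.println("column " + e.getKey() + ": [" + e.getValue()[0] + ", " + e.getValue()[1] + ")");
    }
  }
}
```

With these made-up lengths the three data streams of column 2 collapse into the single range [150, 400), so the column can be fetched with one read instead of three.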
