ORC Storage Format

尘缘未了-

2020-06-30 05:52

1. ORC indexes

Index types: index, row group index, bloom filter index.

set hive.optimize.index.filter=true; enables index-based filtering (the default is false).

Lightweight index: Row Group Index

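An ORC file contains one or more stripes (groups of row data). Each stripe holds index data, row data, and a stripe footer. Each stripe records the minimum and maximum value of every column, so for predicates using >, <, or = the reader can compare against min and max and skip stripes that cannot contain matching rows.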

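The index of min/max values built for each stripe is called the Row Group Index, also known as the min-max Index or Storage Index. Strictly speaking, ORC keeps these statistics at three levels: per file, per stripe, and per row group inside a stripe (every orc.row.index.stride rows, 10,000 by default); the Row Group Index is the finest of the three. It is built when an ORC table is created with the table property 'orc.create.index'='true'. Note that for the Row Group Index to be effective, the data must be sorted on the indexed column when it is loaded into the table; otherwise the min/max ranges overlap and lose their meaning. This kind of index is typically used to optimize filters on numeric columns.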
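The min/max check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not ORC's reader code: the `ColumnStats` class, the `may_contain` function, and the sample stripe ranges are all hypothetical names and values made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    """Min/max statistics for one column in one stripe (illustrative only)."""
    min: int
    max: int


def may_contain(stats: ColumnStats, op: str, value: int) -> bool:
    """Return True if rows satisfying `column <op> value` could exist in the stripe."""
    if op == "=":
        return stats.min <= value <= stats.max
    if op == ">":
        return stats.max > value
    if op == "<":
        return stats.min < value
    raise ValueError(f"unsupported operator: {op}")


# Three hypothetical stripes with non-overlapping id ranges (data loaded sorted).
stripes = [ColumnStats(1, 7), ColumnStats(8, 20), ColumnStats(21, 50)]

# WHERE id > 15: only stripes whose max exceeds 15 need to be read.
to_read = [i for i, s in enumerate(stripes) if may_contain(s, ">", 15)]
print(to_read)  # [1, 2]
```

A stripe is skipped only when its statistics prove that no row can match; a stripe that passes the check may still contain no matching rows.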

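ORC query optimization
An ORC file is split into multiple stripes, and the file metadata holds statistics for every column (min/max, hasNull, and so on). This lays the groundwork for ORC query optimization. Suppose the filter is WHERE id = 0. When a Map Task reads an ORC file, it first checks the min/max of the id column in the file statistics; if 0 lies outside that range, the whole file can be skipped. Building on this, a more effective optimization is to sort by id when loading the data, so that rows with id=0 end up in the same file, or even the same stripe, as far as possible. At query time only the Map Task responsible for that file has to scan it; the other Map Tasks skip scanning, which greatly reduces Map Task execution time.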
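To see why sorting before loading matters, here is a small Python simulation. It is a sketch under simplifying assumptions: each "file" is just a list of ids, the file size of 1,000 rows is arbitrary, and the helper names `split_into_files` and `files_to_scan` are hypothetical.

```python
def split_into_files(ids, rows_per_file):
    """Cut a list of ids into consecutive chunks, one chunk per simulated ORC file."""
    return [ids[i:i + rows_per_file] for i in range(0, len(ids), rows_per_file)]


def files_to_scan(files, target):
    """Count files whose [min, max] range could contain `target`."""
    return sum(1 for f in files if min(f) <= target <= max(f))


# 10,000 rows with ids 0..99 arriving interleaved (0, 1, ..., 99, 0, 1, ...).
ids = [i % 100 for i in range(10_000)]

unsorted_files = split_into_files(ids, 1_000)
sorted_files = split_into_files(sorted(ids), 1_000)

# WHERE id = 0
print(files_to_scan(unsorted_files, 0))  # 10: every file spans ids 0..99
print(files_to_scan(sorted_files, 0))    # 1: only the first file covers id 0
```

With interleaved data every file has the same wide min/max range, so the statistics rule nothing out; after sorting, the ranges are disjoint and nine of the ten files can be skipped.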

With very large data sets, ORDER BY may be impractical; another effective approach is DISTRIBUTE BY id SORT BY id.

The following HQL builds a fairly large ORC table:

CREATE TABLE test_orc3 STORED AS ORC
AS
SELECT CAST(siteid AS INT) AS id,
       pcid
FROM lxw1234_text
DISTRIBUTE BY id SORT BY id;
This statement guarantees that rows with the same id land in the same ORC file, and that each file is sorted by id.

Source: http://lxw1234.com/archives/2016/04/632.htm

2. ORDER BY, DISTRIBUTE BY, SORT BY, and CLUSTER BY

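DISTRIBUTE BY and SORT BY on the same column (in ascending order) are equivalent to CLUSTER BY.

ORDER BY sorts everything in a single reducer. DISTRIBUTE BY plus SORT BY can be used in its place: DISTRIBUTE BY hashes rows on the given column and spreads them across multiple reducers. The result is ordered within each reducer's output, not globally.

SORT BY sorts the rows within each reducer.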

Details: https://blog.csdn.net/bitcarmanlee/article/details/51694616
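The difference between the clauses can be mimicked in Python. This is a rough model, not Hive's actual partitioner: the functions `distribute_by` and `sort_by` are hypothetical, and `id % num_reducers` stands in for whatever hash Hive applies to the column.

```python
def distribute_by(rows, key, num_reducers):
    """Route each row to a reducer based on its key (id modulo reducer count
    is an assumed stand-in for Hive's hash partitioning)."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[row[key] % num_reducers].append(row)
    return buckets


def sort_by(buckets, key):
    """Sort the rows inside each reducer independently."""
    return [sorted(b, key=lambda r: r[key]) for b in buckets]


rows = [{"id": i, "pcid": f"p{i}"} for i in (5, 3, 8, 3, 1, 8, 2, 5)]
reducers = sort_by(distribute_by(rows, "id", 3), "id")

for n, out in enumerate(reducers):
    print(n, [r["id"] for r in out])
# 0 [3, 3]
# 1 [1]
# 2 [2, 5, 5, 8, 8]
```

Every id ends up in exactly one reducer and each reducer's output is sorted, but concatenating the outputs gives 3, 3, 1, 2, 5, 5, 8, 8, which is not a global order. That per-reducer ordering is enough for the min/max statistics of each output file to be tight.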

3. Creating tables

CREATE TABLE test_orc(
id INT,
name STRING
) stored AS ORC;
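For ORC tables, and for partitioned or bucketed tables, load the data with an INSERT from a temporary (staging) table. Otherwise errors occur, for example the rows are not actually divided into buckets.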

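CREATE TABLE test_orc1 (
  id INT,      -- columns assumed to match test_orc; the original statement omitted them
  name STRING
) STORED AS ORC
TBLPROPERTIES (
  'orc.compress'='SNAPPY',
  'orc.create.index'='true',
  'orc.bloom.filter.fpp'='0.05',     -- applies only to columns listed in 'orc.bloom.filter.columns'
  'orc.stripe.size'='10485760',      -- 10 MiB
  'orc.row.index.stride'='10000'
);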
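To get a feel for what 'orc.bloom.filter.fpp'='0.05' costs, the textbook Bloom filter sizing formulas can be evaluated in Python. This is a back-of-the-envelope sketch: it assumes one filter per row group of 10,000 distinct values (matching the stride above), and the function name `bloom_filter_size` is hypothetical. ORC's real filters may be sized somewhat differently.

```python
import math


def bloom_filter_size(expected_entries: int, fpp: float):
    """Standard Bloom filter sizing:
    bits   m = -n * ln(p) / (ln 2)^2
    hashes k = (m / n) * ln 2
    """
    num_bits = math.ceil(-expected_entries * math.log(fpp) / (math.log(2) ** 2))
    num_hashes = max(1, round(num_bits / expected_entries * math.log(2)))
    return num_bits, num_hashes


bits, hashes = bloom_filter_size(10_000, 0.05)
print(bits, hashes)             # roughly 62,000 bits and 4 hash functions
print(bits / 8 / 1024)          # size in KiB, under 8 KiB

# The stripe size from the table properties, converted to MiB.
print(10485760 / (1024 * 1024))  # 10.0
```

Lowering fpp makes the filter more selective but larger; at 0.05 the per-row-group overhead is only a few kilobytes under these assumptions.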

4. Inspecting the metadata of an ORC file

./hive --orcfiledump -j -p hdfs:/user/hive/warehouse/test.db/test_orc/000000_1

{
  "fileName": "\/user\/hive\/warehouse\/test.db\/test_orc\/000000_1",
  "fileVersion": "0.12",
  "writerVersion": "HIVE_13083",
  "numberOfRows": 90,
  "compression": "ZLIB",
  "compressionBufferSize": 262144,
  "schemaString": "struct<id:int,name:string>",
  "schema": [
    {
      "columnId": 0,
      "columnType": "STRUCT",
      "childColumnNames": [
        "id",
        "name"
      ],
      "childColumnIds": [
        1,
        2
      ]
    },
    {
      "columnId": 1,
      "columnType": "INT"
    },
    {
      "columnId": 2,
      "columnType": "STRING"
    }
  ],
  "stripeStatistics": [{
    "stripeNumber": 1,
    "columnStatistics": [
      {
        "columnId": 0,
        "count": 90,
        "hasNull": false
      },
      {
        "columnId": 1,
        "count": 90,
        "hasNull": false,
        "min": 1,
        "max": 7,
        "sum": 345,
        "type": "LONG"
      },
      {
        "columnId": 2,
        "count": 90,
        "hasNull": false,
        "min": "呂布",
        "max": "馬超",
        "totalLength": 540,
        "type": "STRING"
      }
    ]
  }],
  "fileStatistics": [
    {
      "columnId": 0,
      "count": 90,
      "hasNull": false
    },
    {
      "columnId": 1,
      "count": 90,
      "hasNull": false,
      "min": 1,
      "max": 7,
      "sum": 345,
      "type": "LONG"
    },
    {
      "columnId": 2,
      "count": 90,
      "hasNull": false,
      "min": "呂布",
      "max": "馬超",
      "totalLength": 540,
      "type": "STRING"
    }
  ],
  "stripes": [{
    "stripeNumber": 1,
    "stripeInformation": {
      "offset": 3,
      "indexLength": 73,
      "dataLength": 68,
      "footerLength": 53,
      "rowCount": 90
    },
    "streams": [
      {
        "columnId": 0,
        "section": "ROW_INDEX",
        "startOffset": 3,
        "length": 11
      },
      {
        "columnId": 1,
        "section": "ROW_INDEX",
        "startOffset": 14,
        "length": 25
      },
      {
        "columnId": 2,
        "section": "ROW_INDEX",
        "startOffset": 39,
        "length": 37
      },
      {
        "columnId": 1,
        "section": "DATA",
        "startOffset": 76,
        "length": 12
      },
      {
        "columnId": 2,
        "section": "DATA",
        "startOffset": 88,
        "length": 12
      },
      {
        "columnId": 2,
        "section": "LENGTH",
        "startOffset": 100,
        "length": 5
      },
      {
        "columnId": 2,
        "section": "DICTIONARY_DATA",
        "startOffset": 105,
        "length": 39
      }
    ],
    "encodings": [
      {
        "columnId": 0,
        "kind": "DIRECT"
      },
      {
        "columnId": 1,
        "kind": "DIRECT_V2"
      },
      {
        "columnId": 2,
        "kind": "DICTIONARY_V2",
        "dictionarySize": 6
      }
    ]
  }],
  "fileLength": 373,
  "paddingLength": 0,
  "paddingRatio": 0,
  "status": "OK"
}
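The JSON produced by the dump can be used to check by hand whether a filter would skip a stripe. The sketch below embeds a trimmed copy of the stripeStatistics from the output above; the function `stripes_to_read` is a hypothetical helper written for this example and is not part of Hive.

```python
import json

# Trimmed excerpt of the orcfiledump output shown above (column 1 is `id`).
dump = json.loads("""
{
  "stripeStatistics": [{
    "stripeNumber": 1,
    "columnStatistics": [
      {"columnId": 0, "count": 90, "hasNull": false},
      {"columnId": 1, "count": 90, "hasNull": false,
       "min": 1, "max": 7, "sum": 345, "type": "LONG"}
    ]
  }]
}
""")


def stripes_to_read(dump, column_id, value):
    """Return the stripe numbers whose min/max range could contain `value`."""
    keep = []
    for stripe in dump["stripeStatistics"]:
        for col in stripe["columnStatistics"]:
            if col["columnId"] == column_id and col["min"] <= value <= col["max"]:
                keep.append(stripe["stripeNumber"])
    return keep


print(stripes_to_read(dump, 1, 0))  # []  -> WHERE id = 0 can skip the stripe
print(stripes_to_read(dump, 1, 5))  # [1] -> WHERE id = 5 must read stripe 1
```

Since id ranges from 1 to 7 in this file, a query for id = 0 needs no data from it, which is the skipping behaviour described in the query optimization section.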
