1. ORC indexes
ORC index types: the Row Group Index and the Bloom Filter Index.
set hive.optimize.index.filter=true; enables index-based filtering (the default is false).
Lightweight index: Row Group Index
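An ORC file contains one or more stripes (groups of row data). Each stripe stores index data, row data, and a stripe footer. The minimum and maximum value of every column are recorded for each stripe, so a query with a >, <, or = predicate can use the min and max to skip stripes that cannot contain matching rows.
The min/max index built inside each stripe is called the Row Group Index, also known as the min-max Index or Storage Index. Within a stripe it holds one entry per row group, that is, one entry every orc.row.index.stride rows (10,000 by default). When creating an ORC table, setting the table property 'orc.create.index'='true' (which is also the default) causes the Row Group Index to be built. Note that for the Row Group Index to be useful, the data must be sorted on the column you want to filter by when it is loaded into the table; otherwise the min/max ranges overlap and lose their meaning. This kind of index is typically used to optimize filters on numeric columns.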
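The min/max skipping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the ORC reader is implemented; RowGroupStats and may_contain are hypothetical names, and the row-group ranges are made up.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Min/max of one column in one row group (hypothetical structure)."""
    min_value: int
    max_value: int


def may_contain(stats: RowGroupStats, op: str, literal: int) -> bool:
    """Return False only when the statistics prove that no row can match."""
    if op == "=":
        return stats.min_value <= literal <= stats.max_value
    if op == "<":
        return stats.min_value < literal
    if op == ">":
        return stats.max_value > literal
    raise ValueError(f"unsupported operator: {op}")


# Three row groups of a column that was sorted before loading:
# the ranges do not overlap, so a point lookup touches one group.
groups = [RowGroupStats(1, 100), RowGroupStats(101, 200), RowGroupStats(201, 300)]
to_read = [i for i, g in enumerate(groups) if may_contain(g, "=", 150)]
print(to_read)  # indexes of the row groups that still have to be read
```

Only the row groups left in to_read need to be decompressed and scanned; the rest are skipped on the strength of their statistics alone.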
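ORC query optimization
An ORC file is split into multiple stripes, and the file metadata holds statistics for every column (min/max, hasNull, and so on), which lays the groundwork for ORC query optimization. Suppose the query filter is WHERE id = 0. When a Map Task reads an ORC file, it first checks the min/max of the id column in the file statistics; if 0 falls outside that range, the whole file can be skipped. Building on this, an even more effective optimization is to sort the data by id when loading it, so that rows with id = 0 end up in the same file, or even the same stripe, as far as possible. At query time only the Map Task responsible for that file has to scan it; the other Map Tasks skip their files, which greatly reduces Map Task execution time.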
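To see why sorting matters, the following Python sketch compares how many stripes a WHERE id = 0 filter would have to scan with and without sorting. The stripe size of 100 rows and the helper names stripe_ranges and stripes_to_scan are assumptions chosen for illustration only.

```python
def stripe_ranges(ids, rows_per_stripe):
    """Cut ids into consecutive stripes and return (min, max) for each one."""
    chunks = (ids[i:i + rows_per_stripe] for i in range(0, len(ids), rows_per_stripe))
    return [(min(chunk), max(chunk)) for chunk in chunks]


def stripes_to_scan(ranges, target):
    """Count stripes whose [min, max] range could contain the target value."""
    return sum(1 for lo, hi in ranges if lo <= target <= hi)


# ids 0..9 arriving in round-robin order: every stripe spans the full range.
unsorted_ids = [i % 10 for i in range(1000)]
# The same rows sorted by id before loading.
sorted_ids = sorted(unsorted_ids)

unsorted_hits = stripes_to_scan(stripe_ranges(unsorted_ids, 100), 0)
sorted_hits = stripes_to_scan(stripe_ranges(sorted_ids, 100), 0)
print(unsorted_hits, sorted_hits)
```

With unsorted input every stripe has min 0 and max 9, so none can be skipped; after sorting, only the first stripe covers id = 0.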
With very large data sets a global ORDER BY may not be practical; another effective approach is DISTRIBUTE BY id SORT BY id;
Use the following HQL to build a fairly large ORC table:
CREATE TABLE test_orc3 STORED AS ORC
AS
SELECT CAST(siteid AS INT) AS id,
pcid
FROM lxw1234_text
DISTRIBUTE BY id SORT BY id;
This statement ensures that rows with the same id land in the same ORC file and that each file is sorted by id.
Source: http://lxw1234.com/archives/2016/04/632.htm
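2. ORDER BY, DISTRIBUTE BY, SORT BY, and CLUSTER BY
DISTRIBUTE BY and SORT BY on the same column = CLUSTER BY
ORDER BY runs in a single reducer. DISTRIBUTE BY plus SORT BY can be used in its place: DISTRIBUTE BY hashes each row on the given column and spreads the rows across multiple reducers. The result is sorted within each reducer but is no longer globally ordered.
SORT BY sorts the rows inside each reducer.
Details: https://blog.csdn.net/bitcarmanlee/article/details/51694616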
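The difference between the clauses can be illustrated with a small Python simulation. It is a conceptual sketch only: distribute_by and sort_by are hypothetical helpers, the reducer count of 3 is arbitrary, and a plain modulo stands in for whatever hash function Hive actually applies.

```python
from collections import defaultdict


def distribute_by(rows, key, num_reducers):
    """Send each row to a reducer chosen from its key (modulo stands in for Hive's hash)."""
    reducers = defaultdict(list)
    for row in rows:
        reducers[row[key] % num_reducers].append(row)
    return dict(reducers)


def sort_by(reducers, key):
    """Sort rows inside each reducer independently; nothing is ordered across reducers."""
    return {r: sorted(rows, key=lambda row: row[key]) for r, rows in reducers.items()}


rows = [{"id": i} for i in [5, 2, 7, 2, 4, 5, 1, 4]]
outputs = sort_by(distribute_by(rows, "id", num_reducers=3), "id")
for reducer in sorted(outputs):
    print(reducer, [row["id"] for row in outputs[reducer]])
```

Each id ends up in exactly one reducer output and every output is sorted, which is what CLUSTER BY id gives you. Concatenating the outputs does not yield a globally sorted list; only ORDER BY guarantees that, at the cost of a single reducer.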
3. Creating tables
CREATE TABLE test_orc(
id INT,
name STRING
) STORED AS ORC;
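For ORC tables, and for partitioned or bucketed tables, the data has to be loaded with an INSERT ... SELECT from a temporary (staging) table. LOAD DATA only moves the files as they are, without converting them to ORC or bucketing them, so it leads to errors such as data that is never actually split into buckets.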
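-- The column list is an assumption copied from test_orc above; the original statement omitted it.
CREATE TABLE test_orc1 (
id INT,
name STRING
) STORED AS ORC
TBLPROPERTIES
('orc.compress'='SNAPPY',
'orc.create.index'='true',
'orc.bloom.filter.fpp'='0.05',
'orc.stripe.size'='10485760',
'orc.row.index.stride'='10000');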
4. Viewing the metadata of an ORC file
./hive --orcfiledump -j -p hdfs:/user/hive/warehouse/test.db/test_orc/000000_1
{
"fileName": "\/user\/hive\/warehouse\/test.db\/test_orc\/000000_1",
"fileVersion": "0.12",
"writerVersion": "HIVE_13083",
"numberOfRows": 90,
"compression": "ZLIB",
"compressionBufferSize": 262144,
"schemaString": "struct<id:int,name:string>",
"schema": [
{
"columnId": 0,
"columnType": "STRUCT",
"childColumnNames": [
"id",
"name"
],
"childColumnIds": [
1,
2
]
},
{
"columnId": 1,
"columnType": "INT"
},
{
"columnId": 2,
"columnType": "STRING"
}
],
"stripeStatistics": [{
"stripeNumber": 1,
"columnStatistics": [
{
"columnId": 0,
"count": 90,
"hasNull": false
},
{
"columnId": 1,
"count": 90,
"hasNull": false,
"min": 1,
"max": 7,
"sum": 345,
"type": "LONG"
},
{
"columnId": 2,
"count": 90,
"hasNull": false,
"min": "呂布",
"max": "馬超",
"totalLength": 540,
"type": "STRING"
}
]
}],
"fileStatistics": [
{
"columnId": 0,
"count": 90,
"hasNull": false
},
{
"columnId": 1,
"count": 90,
"hasNull": false,
"min": 1,
"max": 7,
"sum": 345,
"type": "LONG"
},
{
"columnId": 2,
"count": 90,
"hasNull": false,
"min": "呂布",
"max": "馬超",
"totalLength": 540,
"type": "STRING"
}
],
"stripes": [{
"stripeNumber": 1,
"stripeInformation": {
"offset": 3,
"indexLength": 73,
"dataLength": 68,
"footerLength": 53,
"rowCount": 90
},
"streams": [
{
"columnId": 0,
"section": "ROW_INDEX",
"startOffset": 3,
"length": 11
},
{
"columnId": 1,
"section": "ROW_INDEX",
"startOffset": 14,
"length": 25
},
{
"columnId": 2,
"section": "ROW_INDEX",
"startOffset": 39,
"length": 37
},
{
"columnId": 1,
"section": "DATA",
"startOffset": 76,
"length": 12
},
{
"columnId": 2,
"section": "DATA",
"startOffset": 88,
"length": 12
},
{
"columnId": 2,
"section": "LENGTH",
"startOffset": 100,
"length": 5
},
{
"columnId": 2,
"section": "DICTIONARY_DATA",
"startOffset": 105,
"length": 39
}
],
"encodings": [
{
"columnId": 0,
"kind": "DIRECT"
},
{
"columnId": 1,
"kind": "DIRECT_V2"
},
{
"columnId": 2,
"kind": "DICTIONARY_V2",
"dictionarySize": 6
}
]
}],
"fileLength": 373,
"paddingLength": 0,
"paddingRatio": 0,
"status": "OK"
}
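The JSON produced by --orcfiledump -j can be post-processed to list the min/max range of each column per stripe, which is a quick way to check whether sorting on load produced narrow, non-overlapping ranges. The sketch below is a minimal example that assumes the dump has the same structure as the output above; the embedded JSON is a trimmed copy of that output, and column_ranges is a hypothetical helper.

```python
import json

# Trimmed copy of the dump shown above; a real script would read the
# command output from a file or a pipe instead.
dump = json.loads("""
{
  "schema": [
    {"columnId": 0, "columnType": "STRUCT",
     "childColumnNames": ["id", "name"], "childColumnIds": [1, 2]},
    {"columnId": 1, "columnType": "INT"},
    {"columnId": 2, "columnType": "STRING"}
  ],
  "stripeStatistics": [{
    "stripeNumber": 1,
    "columnStatistics": [
      {"columnId": 0, "count": 90, "hasNull": false},
      {"columnId": 1, "count": 90, "hasNull": false,
       "min": 1, "max": 7, "sum": 345, "type": "LONG"},
      {"columnId": 2, "count": 90, "hasNull": false,
       "min": "呂布", "max": "馬超", "totalLength": 540, "type": "STRING"}
    ]
  }]
}
""")


def column_ranges(dump):
    """Return (stripe number, column name, min, max) for every column with min/max stats."""
    root = dump["schema"][0]
    name_by_id = dict(zip(root["childColumnIds"], root["childColumnNames"]))
    ranges = []
    for stripe in dump["stripeStatistics"]:
        for col in stripe["columnStatistics"]:
            if "min" in col and "max" in col:  # the root struct has no min/max
                ranges.append(
                    (stripe["stripeNumber"], name_by_id[col["columnId"]], col["min"], col["max"])
                )
    return ranges


ranges = column_ranges(dump)
for stripe_number, name, lo, hi in ranges:
    print(stripe_number, name, lo, hi)
```

For this file the id column spans 1 to 7 in its only stripe, so a filter such as WHERE id = 0 could skip the stripe entirely.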