1 Druid Data Queries
1.1 Query Component Overview
Before looking at specific query types, let's go over the basic components that every query uses, such as Filter, Aggregator, Post-Aggregator, Query, and Interval. Each component has many details of its own.
1.1.1 Filter
A Filter is expressed as a JSON object in the query and is used to screen rows by dimension values: rows whose dimensions satisfy the Filter are the data we want, similar to the WHERE clause in SQL. The filter types are as follows:
Selector Filter
The Selector Filter works like `where key=value` in SQL. Its JSON template is:
"filter":{"type":"selector","dimension":<dimension_name>,"value":<target_value>}
{
"queryType":"timeseries",//the query type; Druid offers timeseries, groupBy, select, search
"dataSource":"adclicklog",//the datasource to query
"granularity":"day",//the time span to aggregate over; here, per day
"aggregations":[//aggregators
{
"type":"longSum",//sum over a long-typed metric
"name":"click",//the output field name, as in select sum(price) as totalPrice
"fieldName":"click_cnt" //the source metric to aggregate
},{
"type":"longSum",
"name":"pv",
"fieldName":"count"//pv is simply the raw row count
}
],
"filter":{"type":"selector","dimension":"device_type","value":"pc"},//selector filter keeping only pc
"intervals":["2019-05-30/2019-05-31"] //the query time range: inclusive start, exclusive end
}
Regex Filter
The Regex Filter lets you filter dimension values with a regular expression; Druid supports any standard Java regex. Its JSON template is:
"filter":{"type":"regex","dimension":<dimension_name>,"pattern":<regex>}
For example, a regex that tests whether device_type contains pc:
.*pc.*
{
"queryType":"timeseries",
"dataSource":"adclicklog",
"granularity":"day",
"aggregations":[
{
"type":"longSum",
"name":"click",
"fieldName":"click_cnt"
},{
"type":"longSum",
"name":"pv",
"fieldName":"count"
}
],
"filter":{"type":"regex","dimension":"device_type","pattern":".*pc.*"},
"intervals":["2019-05-30/2019-05-31"]
}
[a-z0-9A-Z]+$ : this regex matches strings consisting only of digits and letters.
Logical Expression Filter (and, or, not)
The Logical Expression Filter comes in three flavors: and, or, and not. Each supports nesting, so you can build rich logical expressions, just like AND, OR, and NOT in SQL. JSON templates (note that the not filter wraps a single filter in "field"):
"filter":{"type":"and","fields":[<filter1>,<filter2>]}
"filter":{"type":"or","fields":[<filter1>,<filter2>]}
"filter":{"type":"not","field":<filter>}
{
"queryType":"timeseries",
"dataSource":"adclicklog",
"granularity":"day",
"aggregations":[
{
"type":"longSum",
"name":"click",
"fieldName":"click_cnt"
},{
"type":"longSum",
"name":"pv",
"fieldName":"count"
}
],
"filter":{
"type":"and",
"fields":[
{"type":"selector","dimension":"device_type","value":"pc"},
{"type":"selector","dimension":"host","value":"baidu.com"}
]
},
"intervals":["2019-05-30/2019-05-31"]
}
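Since each logical filter accepts other filters as operands, they nest freely. A minimal sketch of a nested combination (a hypothetical query, reusing the datasource and fields above) that keeps PC or mobile traffic while excluding the host baidu.com:

```json
{
  "queryType":"timeseries",
  "dataSource":"adclicklog",
  "granularity":"day",
  "aggregations":[
    {"type":"longSum","name":"pv","fieldName":"count"}
  ],
  "filter":{
    "type":"and",
    "fields":[
      {"type":"or","fields":[
        {"type":"selector","dimension":"device_type","value":"pc"},
        {"type":"selector","dimension":"device_type","value":"mobile"}
      ]},
      {"type":"not","field":{"type":"selector","dimension":"host","value":"baidu.com"}}
    ]
  },
  "intervals":["2019-05-30/2019-05-31"]
}
```

Note that the not filter takes a single nested filter under "field" rather than a "fields" array.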
In Filter
The In Filter is analogous to SQL's IN, e.g. where username in('zhangsan','lisi','zhaoliu'). Its JSON template is:
{
"type":"in",
"dimension":"username",
"values":["zhangsan","lisi","zhaoliu"]
}
{
"queryType":"timeseries",
"dataSource":"adclicklog",
"granularity":"day",
"aggregations":[
{
"type":"longSum",
"name":"click",
"fieldName":"click_cnt"
},{
"type":"longSum",
"name":"pv",
"fieldName":"count"
}
],
"filter":{
"type":"in",
"dimension":"device_type",
"values":["pc","mobile"]
},
"intervals":["2019-05-30/2019-05-31"]
}
Bound Filter
The Bound Filter is a comparison filter covering greater-than, equal, and less-than. By default it compares values as strings in lexicographic order; to compare numerically, set alphaNumeric to true in the query. Also note that Bound Filter comparisons default to >= and <=, so for strict < or > you must set lowerStrict or upperStrict to true. JSON templates:
21 <= age <= 31:
{
"type":"bound",
"dimension":"age",
"lower":"21", //inclusive by default
"upper":"31", //inclusive by default
"alphaNumeric":true //set to true for numeric comparison
}
21 < age < 31:
{
"type":"bound",
"dimension":"age",
"lower":"21",
"lowerStrict":true, //make the lower bound exclusive
"upper":"31",
"upperStrict":true, //make the upper bound exclusive
"alphaNumeric":true //set to true for numeric comparison
}
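Unlike the filters above, no full query accompanies the Bound Filter, so here is a sketch; the adclicklog datasource in this tutorial has no numeric dimension, so the age dimension below is hypothetical, for illustration only:

```json
{
  "queryType":"timeseries",
  "dataSource":"adclicklog",
  "granularity":"day",
  "aggregations":[
    {"type":"longSum","name":"pv","fieldName":"count"}
  ],
  "filter":{
    "type":"bound",
    "dimension":"age", //hypothetical dimension
    "lower":"21",
    "upper":"31",
    "alphaNumeric":true
  },
  "intervals":["2019-05-30/2019-05-31"]
}
```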
1.1.2 granularity
The granularity property sets the time span to aggregate over; it must be greater than or equal to the index granularity set when the datasource was created. Druid provides three granularity types: Simple, Duration, and Period.
Simple granularity
Simple granularities are fixed time spans built into Druid and are written as plain strings; no type property is needed in the query. Common Simple granularities: all, none, minute, fifteen_minute, thirty_minute, hour, day, month, quarter, year.
all: aggregates everything between the start and end time into a single result set;
none: aggregates at the minimum granularity used at indexing time (milliseconds); not recommended, as performance is poor;
minute: aggregate by minute;
fifteen_minute: aggregate by 15 minutes;
thirty_minute: aggregate by 30 minutes;
hour: aggregate by hour;
day: aggregate by day.
Example: count impressions per device type on 2019-05-30, with the output field named pv. Query:
{
"queryType":"groupBy",
"dataSource":"adclicklog",
"granularity":"day",
"dimensions":["device_type"],
"aggregations":[
{
"type":"longSum",
"name":"pv",
"fieldName":"count"
}
],
"intervals":["2019-05-30/2019-05-31"]
}
Duration granularity
Duration granularity is more flexible than Simple: instead of the fixed spans Simple provides, you define the aggregation span yourself in milliseconds. For example, to aggregate every two hours, set duration to 7200000 (2 × 60 × 60 × 1000 ms). Whenever the fixed Simple granularities don't fit, use Duration. Note: set the type property to duration.
{
"queryType":"groupBy",
"dataSource":"adclicklog",
"dimensions":["device_type"],
"granularity":{
"type":"duration",
"duration":7200000
},
"aggregations":[
{
"type":"longSum",
"name":"pv",
"fieldName":"pv_cnt"
}
],
"intervals":["2019-05-29/2019-05-31"]
}
Period granularity
Period granularity uses ISO-8601 period notation. Common spans: one hour is PT1H, one week P1W, one day P1D, one month P1M. Set the type property to period.
Example:
{
"queryType":"groupBy",
"dataSource":"adclicklog",
"granularity":{
"type":"period",
"period":"P1D"
},
"aggregations":[
{
"type":"longSum",
"name":"pv",
"fieldName":"pv_cnt"
}
],
"intervals":["2019-05-29/2019-05-31"]
}
1.1.3 Aggregator
An Aggregator performs aggregation and can be used both at ingestion time and at query time. Used at ingestion, it pre-aggregates data along the dimensions before the data is ever queried, which speeds up query-time aggregation; used at query time, it computes combinations of different metrics.
Common aggregator properties:
type: the aggregator type to use;
name: the output field name, equivalent to a column alias in SQL;
fieldName: the name of a metric already defined in the datasource; it cannot be arbitrary and must match a metric name exactly.
Count Aggregator
The count aggregator is the counterpart of SQL's count function. It counts rows after Druid's roll-up, not the number of raw ingested rows; to track raw rows, the ingestion spec must define a count-typed metric (named count here).
For example, to ask how many rows exist after roll-up, the JSON is:
{"type":"count","name":<output_name>}
{
"queryType":"timeseries",
"dataSource":"ad_event",
"granularity":{
"type":"period",
"period":"P1D"
},
"aggregations":[
{
"type":"count",
"name":"count"
},
{
"type":"longSum",
"name":"pv",
"fieldName":"count"
}
],
"intervals":["2018-12-01/2018-12-03"]
}
To count how many raw rows were ingested, query with longSum over the ingested count metric instead:
{"type":"longSum","name":<output_name>,"fieldName":"count"}
{
"queryType":"timeseries",
"dataSource":"adclicklog",
"granularity":{
"type":"period",
"period":"P1D"
},
"aggregations":[
{
"type":"longSum",
"name":"pv",
"fieldName":"count"
}],
"intervals":["2019-05-29/2019-05-31"]
}
Sum Aggregator
Sum aggregators are the counterpart of SQL's sum function, used to total a metric. Druid provides two variants, one for longs and one for doubles.
The longSum Aggregator handles integer sums. JSON template:
{"type":"longSum","name":<output_name>,"fieldName":<metric_name>}
The doubleSum Aggregator handles floating-point sums. JSON template:
{"type":"doubleSum","name":<output_name>,"fieldName":<metric_name>}
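As a sketch of the doubleSum variant in a full query (the price metric here is hypothetical; the adclicklog datasource in this tutorial defines only long metrics):

```json
{
  "queryType":"timeseries",
  "dataSource":"adclicklog",
  "granularity":"day",
  "aggregations":[
    {
      "type":"doubleSum",
      "name":"total_price", //like select sum(price) as total_price
      "fieldName":"price"   //hypothetical double metric
    }
  ],
  "intervals":["2019-05-30/2019-05-31"]
}
```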
Min/Max Aggregator
Computes the maximum or minimum of the specified metric, like MIN/MAX in SQL.
doubleMin aggregator
{ "type" : "doubleMin", "name" : <output_name>, "fieldName" : <metric_name> }
doubleMax aggregator
{ "type" : "doubleMax", "name" : <output_name>, "fieldName" : <metric_name> }
longMin aggregator
{ "type" : "longMin", "name" : <output_name>, "fieldName" : <metric_name> }
longMax aggregator
{ "type" : "longMax", "name" : <output_name>, "fieldName" : <metric_name> }
{
"queryType":"timeseries",
"dataSource":"adclicklog",
"granularity":{
"type":"period",
"period":"P1D"
},
"aggregations":[
{
"type":"longMin",
"name":"min",
"fieldName":"is_new"
}
],
"intervals":["2019-05-30/2019-05-31"]
}
DataSketches Aggregator
The DataSketches aggregator computes approximate cardinality. Define the metric at ingestion time so that it can be used at query time. You must load the extension by adding druid-datasketches to druid.extensions.loadList in conf/druid/_common/common.runtime.properties; keep any existing entries (hdfs, mysql, and so on) and just append this one.
Typical use case: deduplicated counts over high-cardinality dimensions, such as unique visitors.
The ingestion-time rule for the DataSketches aggregator is defined as follows:
{"type":"thetaSketch",
"name":<out_name>,
"fieldName":<metric_name>,
"isInputThetaSketch":false,
"size":16384
}
{
"type": "kafka",
"dataSchema": {
"dataSource": "adclicklog",
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "timestamp",
"format": "auto"
},
"dimensionsSpec": {
"dimensions": [],
"dimensionExclusions": [
"timestamp",
"is_new",
"pv_cnt",
"click_cnt"
]
}
}
},
"metricsSpec": [
{
"name": "count",
"type": "count"
},
{
"name": "click_cnt",
"fieldName": "click_cnt",
"type": "longSum"
},
{
"name": "new_cnt",
"fieldName": "is_new",
"type": "longSum"
},
{
"name": "uv",
"fieldName": "user_id",
"type": "thetaSketch",
"isInputThetaSketch":"false",
"size":"16384"
},
{
"name": "click_uv",
"fieldName": "click_user_id",
"type": "thetaSketch",
"isInputThetaSketch":"false",
"size":"16384"
}
],
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "HOUR",
"queryGranularity": "NONE"
}
},
"tuningConfig": {
"type": "kafka",
"maxRowsPerSegment": 5000000
},
"ioConfig": {
"topic": "process_ad_click",
"consumerProperties": {
"bootstrap.servers": "hp101:9092,hp102:9092",
"group.id":"kafka-index-service"
},
"taskCount": 1,
"replicas": 1,
"taskDuration": "PT5m"
}
}
The query-time rule is defined as:
{
"type":"thetaSketch",
"name":<out_name>,
"fieldName":<metric_name>
}
{
"queryType":"groupBy",
"dataSource":"adclicklog",
"granularity":{
"type":"period",
"period":"PT1H",
"timeZone": "Asia/Shanghai"
},
"dimensions":["device_type"],
"aggregations":[
{
"type": "thetaSketch",
"name": "uv",
"fieldName": "uv"
}
],
"intervals":["2019-05-30/2019-05-31"]
}
1.1.4 Post-Aggregator
A Post-Aggregator further processes aggregation results; the final output contains both the Aggregator results and the Post-Aggregator results. The Post-Aggregator types are:
Arithmetic Post-Aggregator
The Arithmetic Post-Aggregator applies addition, subtraction, multiplication, or division to aggregator results. JSON template:
"postAggregation":{
"type":"arithmetic",
"name":out_name,
"fn":function,
"fields":[post_aggregator1,post_aggregator2]
}
Field Accessor Post-Aggregator
The Field Accessor Post-Aggregator returns the value of a specified Aggregator. In most post-aggregations you use fieldAccess to reference an aggregator, setting fieldName to the name defined in that aggregator. To access the result of a HyperUnique aggregator, use hyperUniqueCardinality instead. The Field Accessor Post-Aggregator's JSON template is:
{
"type":"fieldAccess",
"name":out_name,
"fieldName":aggregator_name
}
Say we want the total ad clicks, impressions, and click-through rate for 2019-05-30, where the rate is clicks divided by impressions. The rate can be computed with Druid's post-aggregators.
Equivalent SQL:
select t.click_cnt, t.pv_cnt, (t.click_cnt/t.pv_cnt)*100 click_rate from
(select sum(click_cnt) click_cnt, sum(pv_cnt) pv_cnt from ad_event where dt='20190530') t
In Druid:
{
"queryType": "timeseries",
"dataSource": "adclicklog",
"granularity":{
"type":"period",
"period":"PT1H"
},
"intervals": [
"2019-05-30/2019-05-31"
],
"aggregations": [
{
"type": "longSum",
"name": "pv_cnt",
"fieldName": "count"
},
{
"type": "longSum",
"name": "click_cnt",
"fieldName": "click_cnt"
}
],
"postAggregations": [
{
"type": "arithmetic",
"name": "click_rate",
"fn": "*",
"fields": [
{
"type": "arithmetic",
"name": "div",
"fn": "/",
"fields": [
{
"type": "fieldAccess",
"name": "click_cnt",
"fieldName": "click_cnt"
},
{
"type": "fieldAccess",
"name": "pv_cnt",
"fieldName": "pv_cnt"
}
]
},
{
"type": "constant",
"name": "const",
"value": 100
}
]
}
]
}
1.2 Query Types
Druid is queried over an HTTP RESTful interface. The REST endpoint receives client queries: the client wraps the query conditions in JSON and sends them via HTTP POST to the broker node; on success the results come back as JSON. Let's look at the query types Druid provides.
1.2.1 Timeseries Query
A timeseries query returns aggregated results for the specified interval according to the query rules. You can set the query granularity, the result ordering, and filter conditions; filters may be nested, and post-aggregation is supported.
timeseries query properties:
Example: count impressions and clicks for the Beijing region on 2019-05-30.
Equivalent SQL:
select sum(click_cnt) click, sum(pv_cnt) pv from ad_event
where dt='20190530' and city='beijing'
Druid JSON query:
{
"queryType":"timeseries",
"dataSource":"adclicklog",
"descending":"true",
"granularity":"minute",
"aggregations":[
{
"type":"longSum",
"name":"click",
"fieldName":"click_cnt"
},{
"type":"longSum",
"name":"pv",
"fieldName":"count"
}
],
"filter":{"type":"selector","dimension":"city","value":"beijing"},
"intervals":["2019-05-30/2019-05-31"]
}
Then execute the query with an HTTP POST; note that the request goes to the broker node's address.
1.2.2 TopN Query
A topN query returns a result set over a single display dimension according to the given rules: think of it as a group by on one dimension with an ordering rule, except that topN is faster than groupBy. The metric property is specific to topN and names the metric to sort by.
topN query properties:
Example: for 2019-05-30, count PC impressions and clicks and return the top two cities by clicks.
topN query rule:
{
"queryType":"topN",
"dataSource":"adclicklog",
"dimension":"city",
"threshold":2,
"metric":"click_cnt",
"granularity":"day",
"filter":{
"type":"selector",
"dimension":"device_type",
"value":"pc"
},
"aggregations":[
{
"type":"longSum",
"name":"pv_cnt",
"fieldName":"count"
},
{
"type":"longSum",
"name":"click_cnt",
"fieldName":"click_cnt"
}
],
"intervals":["2019-05-30/2019-05-31"]
}
Sort rules:
"metric" : {
"type" : "numeric", //sort by the metric in descending order
"metric" : "<metric_name>"
}
"metric" : {
"type" : "inverted", //sort by the metric in ascending order
"metric" : "<metric_name>"
}
1.2.3 GroupBy Query
Group-by queries, the counterpart of SQL's GROUP BY, are common in practice. If you are grouping and aggregating on a single dimension and metric, prefer a topN query for better performance; groupBy is suited to multi-dimension, multi-metric aggregation.
GroupBy query properties:
limitSpec
The limitSpec clause sorts the query results and limits the number of rows returned, like ORDER BY and LIMIT in SQL.
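Its default form, consistent with the full groupBy example later in this section, follows this template (placeholders in angle brackets):

```json
{
  "type":"default",
  "limit":<integer>,
  "columns":[
    {"dimension":<dimension_or_metric_name>,"direction":"ascending"|"descending"}
  ]
}
```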
limitSpec properties:
Example: for 2019-05-30, compute impressions, clicks, and click-through rate per city for PC and TV devices; return the top three cities by impressions, with ties broken by city name ascending.
GroupBy query rule:
{
"queryType": "groupBy",
"dataSource": "adclicklog",
"granularity": "day",
"intervals": [
"2019-05-30/2019-05-31"
],
"dimensions": [
"city",
"device_type"
],
"aggregations": [
{
"type": "longSum",
"name": "pv_cnt",
"fieldName": "count"
},
{
"type": "longSum",
"name": "click_cnt",
"fieldName": "click_cnt"
}
],
"postAggregations": [
{
"type": "arithmetic",
"name": "click_rate",
"fn": "*",
"fields": [
{
"type": "arithmetic",
"name": "div",
"fn": "/",
"fields": [
{
"type": "fieldAccess",
"name": "click_cnt",
"fieldName": "click_cnt"
},
{
"type": "fieldAccess",
"name": "pv_cnt",
"fieldName": "pv_cnt"
}
]
},
{
"type": "constant",
"name": "const",
"value": 100
}
]
}
],
"limitSpec": {
"type": "default",
"limit": 3,
"columns": [
{
"dimension": "pv_cnt",
"direction": "descending"
},
{
"dimension": "city",
"direction": "ascending"
}
]
}
}
1.2.4 Search Query
A search query returns the dimension values that match. It filters on dimension values, much like SQL's LIKE. Its properties:
A search rule tests whether values within the searched dimensions match the search value, like LIKE in a SQL WHERE clause. A search filter takes three settings: type (the filter type, here search), dimension (the dimension name), and query (a JSON object defining the match rule). The available match rules are Insensitive Contains, Fragment, and Contains.
(1) Insensitive Contains
Matches when any part of the dimension value contains the search value, ignoring case. Rule definition:
{
"type":"insensitive_contains",
"value":"some_value"
}
The SQL predicate where city like '%jing%' translates to the equivalent query:
{
"queryType": "search",
"dataSource": "adclicklog",
"granularity": "all",
"limit": 2,
"searchDimensions": [
"city"
],
"query": {
"type": "insensitive_contains",
"value": "jing"
},
"sort" : {
"type": "lexicographic"
},
"intervals": [
"2019-05-29/2019-05-31"
]
}
(2) Fragment
Fragment takes a set of search values and matches when the dimension value contains every one of them; matching can optionally respect case. A Fragment filter takes three settings: type: fragment; values: the set of search values as a JSON array; case_sensitive: whether matching is case-sensitive, default false (case-insensitive).
Example: the SQL predicate where city like '%bei%' and city like '%jing%' translates to the equivalent query:
{
"queryType": "search",
"dataSource": "adclicklog",
"granularity": "all",
"limit": 2,
"searchDimensions": [
"city"
],
"query": {
"type": "fragment",
"values": ["jing","bei"],
"case_sensitive":true
},
"sort" : {
"type": "lexicographic"
},
"intervals": [
"2019-05-29/2019-05-31"
]
}
(3) Contains
Matches when any part of the dimension value contains the search value, just like Insensitive Contains, except that the Contains type lets you configure case sensitivity.
Example: the SQL predicate where city like '%bei%' translates to the equivalent query:
{
"queryType": "search",
"dataSource": "adclicklog",
"granularity": "all",
"limit": 2,
"searchDimensions": [
"city"
],
"query": {
"type": "contains",
"value": "bei",
"case_sensitive":true
},
"sort" : {
"type": "lexicographic"
},
"intervals": [
"2019-05-29/2019-05-31"
]
}
6 Query API
6.1 Druid RESTful API examples
Submit a query:
curl -X 'POST' -H'Content-Type: application/json' -d @quickstart/ds.json http://hp103:8082/druid/v2/?pretty
Submit a Kafka indexing task:
curl -X POST -H 'Content-Type: application/json' -d @kafka-index.json http://hp101:8090/druid/indexer/v1/supervisor
Submit a batch (Hadoop) indexing task:
curl -X 'POST' -H 'Content-Type:application/json' -d @hadoop-index.json hp101:8090/druid/indexer/v1/task
Get the status of a Kafka indexing task:
curl -X GET http://hp101:8090/druid/indexer/v1/supervisor/kafkaindex333/status
Shut down a Kafka indexing task:
curl -X POST http://hp101:8090/druid/indexer/v1/supervisor/kafkaindex333/shutdown
Delete a datasource (sent to the coordinator):
curl -XDELETE http://hp101:8081/druid/coordinator/v1/datasources/adclicklog6