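We start with a file of JSON data (shown in the figure below). We first need to load this data into Hive and then reshape it into structured columns, so that we can query whatever a given requirement calls for.
1 Create the table
First, create a table with a single string column to hold the raw JSON: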
create table rating(json string);
2 Load into Hive
Next, load the data file into the Hive table:
load data local inpath '/home/hadoopadmin/rating.json' into table rating;
Querying the table confirms that the data has been loaded into Hive.
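A quick way to check the load is to read back a few raw rows and count them. This is only a sketch: it assumes the table and column created above (`rating`, `json`) and that the file holds one JSON object per line.

```sql
-- Peek at a few raw rows; each row should hold one JSON string.
select json from rating limit 3;

-- Count the rows that were loaded.
select count(*) from rating;
```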
3 Query the data with json_tuple
However, the raw format above is not what we want. The format we want looks like this:
movie | rate | time | userid |
---|---|---|---|
1193 | 5 | 978300760 | 1 |
Hive provides a json_tuple function. Its official syntax is:
json_tuple(string jsonStr,string k1,...,string kn)
-- jsonStr: a JSON string
-- k1...kn: keys in the JSON string
Here is an example:
select json_tuple(
'{"movie":"1193","rate":"5","time":"978300760","userid":"1"}',
'movie','rate','time','userid');
OK
-- Result
c0 c1 c2 c3
1193 5 978300760 1
The default column aliases in the result above (c0 to c3) are not meaningful, so we rename them:
select json_tuple(
'{"movie":"1193","rate":"5","time":"978300760","userid":"1"}',
'movie','rate','time','userid') as (movie, rate, time, user_id);
OK
-- Result
movie rate time user_id
1193 5 978300760 1
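One detail worth knowing before applying json_tuple to real data: as far as we know, a key that is missing from the JSON string comes back as NULL rather than raising an error. The sketch below uses a made-up JSON string without a `userid` key to illustrate this.

```sql
-- The JSON string here has no "userid" key (hypothetical sample),
-- so the user_id column is expected to be NULL.
select json_tuple(
  '{"movie":"1193","rate":"5","time":"978300760"}',
  'movie','rate','time','userid') as (movie, rate, time, user_id);
```

It is therefore worth checking for NULL values after parsing if some records might be incomplete.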
Now we only need to replace the literal JSON string above with the table column json and query the rating table:
select
json_tuple(json,'movie','rate','time','userid') as (movie, rate, time, user_id)
from rating limit 10 ;
As the figure below shows, the json_tuple function turns the JSON data into structured columns.
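Because json_tuple is a table-generating function (UDTF), Hive does not allow other expressions next to it in the same select list. If the parsed fields are needed alongside other columns, the usual alternative is `lateral view`. A minimal sketch using the same table and keys as above (the aliases `r` and `t` are just illustrative names):

```sql
-- Keep the raw json column and the parsed fields side by side.
select r.json, t.movie, t.rate, t.time, t.user_id
from rating r
lateral view json_tuple(r.json, 'movie', 'rate', 'time', 'userid') t
  as movie, rate, time, user_id
limit 10;
```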
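4 Build a wide table
The result above usually will not be enough on its own. Suppose we want to query by time: we then need some additional fields as well, as in the format below, which is commonly called a wide table: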
movie | rate | time | userid | year | month | day | hour | minute | ts |
---|---|---|---|---|---|---|---|---|---|
1193 | 5 | 978300760 | 1 | 2001 | 1 | 1 | 6 | 12 | 2001-01-01 06:12:40 |
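The only time information we have right now is the string field time. We first convert the string to an integer and then convert the integer into a datetime string, using the two functions below.
Function reference: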
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions
-- Convert an expression to the given type
cast(expr as <type>)
-- Convert a bigint Unix timestamp (seconds) to a string in the given format
from_unixtime(bigint unixtime[, string format])
Here is an example of the two functions:
select cast('978300760' as bigint);
select from_unixtime(cast('978300760' as bigint));
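The optional second argument of from_unixtime controls the output format. The sketch below reuses the same sample timestamp; the format patterns are ordinary Java-style date patterns chosen for illustration.

```sql
-- Without a format, the result is rendered as yyyy-MM-dd HH:mm:ss.
select from_unixtime(cast('978300760' as bigint));

-- Date part only.
select from_unixtime(cast('978300760' as bigint), 'yyyy-MM-dd');

-- Hour and minute only.
select from_unixtime(cast('978300760' as bigint), 'HH:mm');
```

Note that the rendered value depends on the time zone Hive uses for the conversion. The value 2001-01-01 06:12:40 shown in the table above corresponds to UTC+8; in UTC the same timestamp is 2000-12-31 22:12:40.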
Next, we just use the query from step 3 as a subquery and apply the two functions above to parse the time field:
select movie,rate,time,user_id,
from_unixtime(cast(time as bigint)) as ts
from
(
select
json_tuple(json,'movie','rate','time','userid') as (movie, rate, time, user_id)
from rating
) t
limit 10;
Then we use the functions below to extract the year, month, day, hour, and minute from ts, which completes the wide table:
select movie,rate,time,user_id,
from_unixtime(cast(time as bigint)) as ts,
year(from_unixtime(cast(time as bigint))) as year,
month(from_unixtime(cast(time as bigint))) as month,
day(from_unixtime(cast(time as bigint))) as day,
hour(from_unixtime(cast(time as bigint))) as hour,
minute(from_unixtime(cast(time as bigint))) as minute
from
(
select
json_tuple(json,'movie','rate','time','userid') as (movie, rate, time, user_id)
from rating
) t
limit 10;
As the figure below shows, the query returns the wide table we want.
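The query above repeats `from_unixtime(cast(time as bigint))` six times. One possible way to tidy it up, sketched here under the same table and column names, is to compute ts once in an intermediate subquery and derive the other fields from it (the alias `s` is just an illustrative name):

```sql
select movie, rate, time, user_id, ts,
       year(ts)   as year,
       month(ts)  as month,
       day(ts)    as day,
       hour(ts)   as hour,
       minute(ts) as minute
from
(
  -- Compute the formatted timestamp a single time.
  select movie, rate, time, user_id,
         from_unixtime(cast(time as bigint)) as ts
  from
  (
    select
      json_tuple(json,'movie','rate','time','userid') as (movie, rate, time, user_id)
    from rating
  ) t
) s
limit 10;
```

This should return the same columns as the longer version; it is only easier to read and to change.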
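But merely querying the result is not enough: we do not want to paste that large block of SQL every time a new requirement comes up. Instead, we can materialize the wide table as a new table and run our business queries against it: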
create table rating_width
as
select movie,rate,time,user_id,
from_unixtime(cast(time as bigint)) as ts,
year(from_unixtime(cast(time as bigint))) as year,
month(from_unixtime(cast(time as bigint))) as month,
day(from_unixtime(cast(time as bigint))) as day,
hour(from_unixtime(cast(time as bigint))) as hour,
minute(from_unixtime(cast(time as bigint))) as minute
from
(
select
json_tuple(json,'movie','rate','time','userid') as (movie, rate, time, user_id)
from rating
) t;
After creating the wide table, query it to check:
select * from rating_width limit 10;
As the figure below shows, the wide table we need has been created.
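With rating_width in place, business queries become short. As a purely hypothetical requirement, the sketch below counts ratings and averages the score per month. Since json_tuple returns strings, rate is cast to a number explicitly; the output names rating_count and avg_rate are made up for this example.

```sql
-- Ratings per month, with the average score (hypothetical requirement).
select year, month,
       count(*) as rating_count,
       avg(cast(rate as double)) as avg_rate
from rating_width
group by year, month
order by year, month;
```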