This is a comprehensive exercise consisting of the following parts:
1. Data preprocessing
2. Loading the data into the database
3. Data analysis
4. Saving the results to the database
5. Querying and displaying the results
The data format tables and sample data are given below. Read the data description first, then work through the corresponding tasks.
Data description:
Table 1-1 Video table
Field  Comment  Detailed description
Table 1-2 User table
Field  Comment  Field type
Raw data:
qR8WRLrO2aQ:mienge:406:People & Blogs:599:2788:5:1:0:4UUEKhr6vfA:zvDPXgPiiWI:TxP1eXHJQ2Q:k5Kb1K0zVxU:hLP_mJIMNFg:tzNRSSTGF4o:BrUGfqJANn8:OVIc-mNxqHc:gdxtKvNiYXc:bHZRZ-1A-qk:GUJdU6uHyzU:eyZOjktUb5M:Dv15_9gnM2A:lMQydgG1N2k:U0gZppW_-2Y:dUVU6xpMc6Y:ApA6VEYI8zQ:a3_boc9Z_Pc:N1z4tYob0hM:2UJkU2neoBs
Data after preprocessing:
qR8WRLrO2aQ:mienge:406:People,Blogs:599:2788:5:1:0:4UUEKhr6vfA,zvDPXgPiiWI,TxP1eXHJQ2Q,k5Kb1K0zVxU,hLP_mJIMNFg,tzNRSSTGF4o,BrUGfqJANn8,OVIc-mNxqHc,gdxtKvNiYXc,bHZRZ-1A-qk,GUJdU6uHyzU,eyZOjktUb5M,Dv15_9gnM2A,lMQydgG1N2k,U0gZppW_-2Y,dUVU6xpMc6Y,ApA6VEYI8zQ,a3_boc9Z_Pc,N1z4tYob0hM,2UJkU2neoBs
1. Preprocess the raw data into the format shown in the preprocessed sample above. Looking at the raw data, note that the fields are separated by ":"; a video can belong to multiple categories, which are separated by "&" with a space on each side; and a video can have multiple related videos, which are also separated by ":". To make the later analysis easier, we first reorganize and clean the data.
That is: within each record, separate the categories with "," (trimming the surrounding spaces), and likewise separate the multiple related-video ids with ",".
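The cleaning rule above can be sketched as a small plain-Java helper (an illustrative sketch, not the original solution; the field positions are assumed from the video-table schema, and the `VideoETL`/`clean` names are hypothetical):

```java
// Sketch of the step-1 cleaning rule: fields 1-9 stay ":"-separated,
// " & " inside the category field (index 3) becomes ",", and the
// related-video ids (index 9 onward) are re-joined with ",".
public class VideoETL {

    public static String clean(String line) {
        String[] fields = line.split(":");
        if (fields.length < 9) {
            return null; // drop malformed records with too few fields
        }
        // "People & Blogs" -> "People,Blogs" (also removes the spaces)
        fields[3] = fields[3].replace(" & ", ",");
        StringBuilder sb = new StringBuilder(fields[0]);
        for (int i = 1; i < fields.length; i++) {
            // first 9 fields keep ":"; related ids after that use ","
            sb.append(i <= 9 ? ":" : ",").append(fields[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "qR8WRLrO2aQ:mienge:406:People & Blogs:599:2788:5:1:0:id1:id2";
        // prints qR8WRLrO2aQ:mienge:406:People,Blogs:599:2788:5:1:0:id1,id2
        System.out.println(clean(raw));
    }
}
```

In the real exercise this logic would typically run inside a MapReduce or Spark job over the whole file; the helper only shows the per-record transformation.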
2. Load the preprocessed data into Hive
2.1 Create the database and tables
Create a database named: video
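The statement creating the database itself is not shown; a minimal version would be:

```sql
-- assumed setup for section 2.1: create and switch to the video database
create database if not exists video;
use video;
```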
Create the raw-data tables:
Video table: video_ori    User table: video_user_ori
Create the ORC-format tables:
Video table: video_orc    User table: video_user_orc
The statements that create the raw tables are given below.
Create the video_ori video table:
create table video_ori(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
row format delimited
fields terminated by ":"
collection items terminated by ","
stored as textfile;
Create the video_user_ori user table:
create table video_user_ori(
uploader string,
videos int,
friends int)
row format delimited
fields terminated by ","
stored as textfile;
Write the ORC-format CREATE TABLE statements:
Create the video_orc table:
create table video_orc(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
stored as orc;
Create the video_user_orc table:
create table video_user_orc(
uploader string,
videos int,
friends int)
stored as orc;
2.2 Load the preprocessed video data into the raw table video_ori, and the raw user data into video_user_ori
Write the LOAD statements:
video_ori:
load data local inpath '/opt/video_new.txt' into table video_ori ;
video_user_ori:
load data local inpath '/opt/user.txt' into table video_user_ori;
2.3 Query the raw tables and insert the results into the corresponding ORC tables
Write the INSERT statements:
video_orc:
INSERT INTO TABLE video_orc SELECT * FROM video_ori;
video_user_orc:
INSERT INTO TABLE video_user_orc SELECT * FROM video_user_ori;
3. Run HiveQL queries on the loaded data
3.1 From the video table, select the videos rated 5 and save the result to /export/rate.txt
Write the SQL statement:
insert overwrite local directory "/export/rate.txt" row format delimited
fields terminated by ":"
collection items terminated by ","
select * from video_ori where rate = 5;
3.2 From the video table, select the videos with more than 100 comments and save the result to /export/comments.txt
Write the SQL statement:
insert overwrite local directory "/export/comments.txt" row format delimited
fields terminated by ":"
collection items terminated by ","
select * from video_ori where comments > 100;
4. Save the results of the Hive analysis into HBase
4.1 Create the corresponding Hive external tables
Write the statement that creates the rate external table:
create external table rate(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
row format delimited
fields terminated by ":"
collection items terminated by ","
stored as textfile;
Write the statement that creates the comments external table:
create external table comments(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
row format delimited
fields terminated by ":"
collection items terminated by ","
stored as textfile;
4.2 Load the result data from step 3 into the external tables
Write the LOAD statement for the rate table:
load data local inpath '/export/rate.txt/000000_0' overwrite into table rate;
Write the LOAD statement for the comments table:
load data local inpath '/export/comments.txt' overwrite into table comments;
4.3 Create Hive-managed tables mapped to HBase
Write the statements for this step. The Hive tables rate and comments map to the HBase tables hbase_rate and hbase_comments respectively. Create the hbase_rate table and its mapping:
create table hbase_rate( key string,videoId string, uploader string, age int, category array<string>, length int, views int, rate float, ratings int, comments int,relatedId array<string>)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping" = ":key,info:videoId,info:uploader,info:age,info:category,info:length,info:views,info:rate,info:ratings,info:comments,info:relatedId")
tblproperties("hbase.table.name" = "hbase_rate");
Create the hbase_comments table and its mapping:
create table hbase_comments( key string,videoId string, uploader string, age int, category array<string>, length int, views int, rate float, ratings int, comments int,relatedId array<string>)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping" = ":key,info:videoId,info:uploader,info:age,info:category,info:length,info:views,info:rate,info:ratings,info:comments,info:relatedId")
tblproperties("hbase.table.name" = "hbase_comments");
4.4 Write the insert overwrite ... select statement that populates the hbase_rate table:
insert overwrite table hbase_rate select cast(row_number() over () as string), rate.* from rate;
Write the insert overwrite ... select statement that populates the hbase_comments table:
insert overwrite table hbase_comments select cast(row_number() over () as string), comments.* from comments;
5. Query the results through the HBase API
5.1 Using the HBase API, scan the hbase_rate table from startRowKey=1 to endRowKey=100 and print the matching rows.
public void rowKeyFilter() throws IOException {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.zookeeper.quorum", "node01:2181,node02:2181,node03:2181");
    Connection connection = ConnectionFactory.createConnection(conf);
    // Get a handle on the table
    Table mytable = connection.getTable(TableName.valueOf("hbase_rate"));
    Scan scan = new Scan();
    // Restrict the scan to the row-key range [1, 100)
    scan.setStartRow(Bytes.toBytes("1"));
    scan.setStopRow(Bytes.toBytes("100"));
    ResultScanner scanner = mytable.getScanner(scan);
    // Each Result holds one row (possibly spanning multiple column families and columns)
    for (Result result : scanner) {
        System.out.println("rowkey --> " + Bytes.toString(result.getRow()));
        System.out.println("age: " + Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("age"))));
    }
    // Release resources
    scanner.close();
    connection.close();
}
5.2 Using the HBase API, query only the values of the comments column of the hbase_comments table.
public void search() throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.zookeeper.quorum", "node01:2181,node02:2181,node03:2181");
    Connection connection = ConnectionFactory.createConnection(conf);
    Table mytable = connection.getTable(TableName.valueOf("hbase_comments"));
    Scan scan = new Scan();
    ResultScanner scanner = mytable.getScanner(scan);
    for (Result result : scanner) {
        // Keep only the cells whose qualifier is "comments"
        for (Cell cell : result.rawCells()) {
            if (Bytes.toString(CellUtil.cloneQualifier(cell)).equals("comments")) {
                System.out.println(Bytes.toString(CellUtil.cloneFamily(cell)) + ":"
                        + Bytes.toString(CellUtil.cloneQualifier(cell)) + "-"
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
    scanner.close();
    connection.close();
}