This is a comprehensive exercise consisting of the following parts:
1. Data preprocessing
2. Loading the data into the database
3. Data analysis
4. Saving the results to the database
5. Querying and displaying the results
The data format tables and sample data are given below. Read the data description first, then work through the corresponding tasks.
Data description:
Table 1-1 Video table
Field  Comment  Detailed description
Table 1-2 User table
Field  Comment  Field type
Raw data:
qR8WRLrO2aQ:mienge:406:People & Blogs:599:2788:5:1:0:4UUEKhr6vfA:zvDPXgPiiWI:TxP1eXHJQ2Q:k5Kb1K0zVxU:hLP_mJIMNFg:tzNRSSTGF4o:BrUGfqJANn8:OVIc-mNxqHc:gdxtKvNiYXc:bHZRZ-1A-qk:GUJdU6uHyzU:eyZOjktUb5M:Dv15_9gnM2A:lMQydgG1N2k:U0gZppW_-2Y:dUVU6xpMc6Y:ApA6VEYI8zQ:a3_boc9Z_Pc:N1z4tYob0hM:2UJkU2neoBs
Data after preprocessing:
qR8WRLrO2aQ:mienge:406:People,Blogs:599:2788:5:1:0:4UUEKhr6vfA,zvDPXgPiiWI,TxP1eXHJQ2Q,k5Kb1K0zVxU,hLP_mJIMNFg,tzNRSSTGF4o,BrUGfqJANn8,OVIc-mNxqHc,gdxtKvNiYXc,bHZRZ-1A-qk,GUJdU6uHyzU,eyZOjktUb5M,Dv15_9gnM2A,lMQydgG1N2k,U0gZppW_-2Y,dUVU6xpMc6Y,ApA6VEYI8zQ,a3_boc9Z_Pc,N1z4tYob0hM,2UJkU2neoBs
1. Preprocess the raw data into the format shown in the preprocessed sample above. Looking at the raw data, note that the fields are separated by ":"; a video can belong to multiple categories, which are separated by "&" with a space on each side; and a video can have multiple related videos, which are also separated by ":". To make the later analysis easier, we first reorganize and clean the data.
That is: within each record, separate the categories with "," (trimming the surrounding spaces), and likewise separate the multiple related-video ids with ",".
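The cleaning rule above can be sketched as a small plain-Java helper (an illustrative sketch, not the original solution; the field positions are assumed from the video-table schema, and the `VideoETL`/`clean` names are hypothetical):

```java
// Sketch of the step-1 cleaning rule: fields 1-9 stay ":"-separated,
// " & " inside the category field (index 3) becomes ",", and the
// related-video ids (index 9 onward) are re-joined with ",".
public class VideoETL {

    public static String clean(String line) {
        String[] fields = line.split(":");
        if (fields.length < 9) {
            return null; // drop malformed records with too few fields
        }
        // "People & Blogs" -> "People,Blogs" (also removes the spaces)
        fields[3] = fields[3].replace(" & ", ",");
        StringBuilder sb = new StringBuilder(fields[0]);
        for (int i = 1; i < fields.length; i++) {
            // first 9 fields keep ":"; related ids after that use ","
            sb.append(i <= 9 ? ":" : ",").append(fields[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "qR8WRLrO2aQ:mienge:406:People & Blogs:599:2788:5:1:0:id1:id2";
        // prints qR8WRLrO2aQ:mienge:406:People,Blogs:599:2788:5:1:0:id1,id2
        System.out.println(clean(raw));
    }
}
```

In the real exercise this logic would typically run inside a MapReduce or Spark job over the whole file; the helper only shows the per-record transformation.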
2. Load the preprocessed data into Hive
2.1 Create the database and tables
Create a database named: video
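The statement creating the database itself is not shown; a minimal version would be:

```sql
-- assumed setup for section 2.1: create and switch to the video database
create database if not exists video;
use video;
```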
Create the raw-data tables:
Video table: video_ori    User table: video_user_ori
Create the ORC-format tables:
Video table: video_orc    User table: video_user_orc
The statements that create the raw tables are given below.
Create the video_ori video table:
create table video_ori(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
row format delimited
fields terminated by ":"
collection items terminated by ","
stored as textfile;
Create the video_user_ori user table:
create table video_user_ori(
uploader string,
videos int,
friends int)
row format delimited
fields terminated by ","
stored as textfile;
Write the ORC-format CREATE TABLE statements:
Create the video_orc table:
create table video_orc(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
stored as orc;
Create the video_user_orc table:
create table video_user_orc(
uploader string,
videos int,
friends int)
stored as orc;
2.2 Load the preprocessed video data into the raw table video_ori, and the raw user data into video_user_ori
Write the LOAD statements:
video_ori:
load data local inpath '/opt/video_new.txt' into table video_ori ;
video_user_ori:
load data local inpath '/opt/user.txt' into table video_user_ori;
2.3 Query the raw tables and insert the results into the corresponding ORC tables
Write the INSERT statements:
video_orc:
INSERT INTO TABLE video_orc SELECT * FROM video_ori;
video_user_orc:
INSERT INTO TABLE video_user_orc SELECT * FROM video_user_ori;
3. Run HiveQL queries on the loaded data
3.1 From the video table, select the videos rated 5 and save the result to /export/rate.txt
Write the SQL statement:
insert overwrite local directory "/export/rate.txt" row format delimited
fields terminated by ":"
collection items terminated by ","
select * from video_ori where rate = 5;
3.2 From the video table, select the videos with more than 100 comments and save the result to /export/comments.txt
Write the SQL statement:
insert overwrite local directory "/export/comments.txt" row format delimited
fields terminated by ":"
collection items terminated by ","
select * from video_ori where comments > 100;
4. Save the results of the Hive analysis into HBase
4.1 Create the corresponding Hive external tables
Write the statement that creates the rate external table:
create external table rate(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
row format delimited
fields terminated by ":"
collection items terminated by ","
stored as textfile;
Write the statement that creates the comments external table:
create external table comments(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
row format delimited
fields terminated by ":"
collection items terminated by ","
stored as textfile;
4.2 Load the result data from step 3 into the external tables
Write the LOAD statement for the rate table:
load data local inpath '/export/rate.txt/000000_0' overwrite into table rate;
Write the LOAD statement for the comments table:
load data local inpath '/export/comments.txt' overwrite into table comments;
4.3 Create Hive-managed tables mapped to HBase
Write the statements for this step. The Hive tables rate and comments map to the HBase tables hbase_rate and hbase_comments respectively. Create the hbase_rate table and its mapping:
create table hbase_rate( key string,videoId string, uploader string, age int, category array<string>, length int, views int, rate float, ratings int, comments int,relatedId array<string>)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping" = ":key,info:videoId,info:uploader,info:age,info:category,info:length,info:views,info:rate,info:ratings,info:comments,info:relatedId")
tblproperties("hbase.table.name" = "hbase_rate");
Create the hbase_comments table and its mapping:
create table hbase_comments( key string,videoId string, uploader string, age int, category array<string>, length int, views int, rate float, ratings int, comments int,relatedId array<string>)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping" = ":key,info:videoId,info:uploader,info:age,info:category,info:length,info:views,info:rate,info:ratings,info:comments,info:relatedId")
tblproperties("hbase.table.name" = "hbase_comments");
4.4 Write the insert overwrite ... select statement that populates the hbase_rate table:
insert overwrite table hbase_rate select cast(row_number() over () as string), rate.* from rate;
Write the insert overwrite ... select statement that populates the hbase_comments table:
insert overwrite table hbase_comments select cast(row_number() over () as string), comments.* from comments;
5. Query the results through the HBase API
5.1 Using the HBase API, scan the hbase_rate table from startRowKey=1 to endRowKey=100 and print the matching rows.
public void rowKeyFilter() throws IOException {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.zookeeper.quorum", "node01:2181,node02:2181,node03:2181");
    Connection connection = ConnectionFactory.createConnection(conf);
    // Get a handle on the table
    Table mytable = connection.getTable(TableName.valueOf("hbase_rate"));
    Scan scan = new Scan();
    // Restrict the scan to the row-key range [1, 100)
    scan.setStartRow(Bytes.toBytes("1"));
    scan.setStopRow(Bytes.toBytes("100"));
    ResultScanner scanner = mytable.getScanner(scan);
    // Each Result holds one row (possibly spanning multiple column families and columns)
    for (Result result : scanner) {
        System.out.println("rowkey --> " + Bytes.toString(result.getRow()));
        System.out.println("age: " + Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("age"))));
    }
    // Release resources
    scanner.close();
    connection.close();
}
5.2 Using the HBase API, query only the values of the comments column of the hbase_comments table.
public void search() throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.zookeeper.quorum", "node01:2181,node02:2181,node03:2181");
    Connection connection = ConnectionFactory.createConnection(conf);
    Table mytable = connection.getTable(TableName.valueOf("hbase_comments"));
    Scan scan = new Scan();
    ResultScanner scanner = mytable.getScanner(scan);
    for (Result result : scanner) {
        // Keep only the cells whose qualifier is "comments"
        for (Cell cell : result.rawCells()) {
            if (Bytes.toString(CellUtil.cloneQualifier(cell)).equals("comments")) {
                System.out.println(Bytes.toString(CellUtil.cloneFamily(cell)) + ":"
                        + Bytes.toString(CellUtil.cloneQualifier(cell)) + "-"
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
    scanner.close();
    connection.close();
}