Requirements
There are two large MySQL tables, at roughly 100 million and 450 million rows respectively (so large that the DBAs are considering sharding them), and the data keeps growing. We need to process them with Spark jobs, but reading directly from MySQL is a struggle, so the plan is to use sqoop to import increments into hive on a schedule and have Spark SQL work against hive instead, which avoids many of MySQL's bottlenecks. After several days of working through sqoop scheduled jobs, this is the setup, using sqoop1, version sqoop-1.4.6-cdh5.7.0.
1. Create and run the sqoop job:

sqoop job --delete torderincrementjob    # drop any previous job of the same name

sqoop job --create torderincrementjob -- import \
--connect jdbc:mysql://172.16.*.*:3306/*?useCursorFetch=true \
--username * \
--password-file /input/sqoop/pwd/109mysql.pwd \
--target-dir /mysqldb/t_order \
--table t_order \
--fields-terminated-by "\t" \
--lines-terminated-by "\n" \
--null-string '\\N' \
--null-non-string '\\N' \
--incremental append \
--check-column id \
--last-value 1281 \
-m 1
Where:
--password-file points to a password file stored on HDFS
--fields-terminated-by "\t" sets the field delimiter (tab)
--lines-terminated-by "\n" sets the line delimiter (newline)
--split-by id sets the column used to split the work between mappers
--null-string '\N' makes NULLs in string columns appear as NULL in hive
--null-non-string '\N' makes NULLs in non-string columns appear as NULL in hive
--incremental append
--check-column id
--last-value 1281
The last three parameters together implement the incremental import.
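A side note on --password-file: sqoop reads the file contents verbatim, so a trailing newline would become part of the password and break authentication. A sketch of creating the file (the password value and local filename here are placeholders, not from the original setup):

```shell
# Write the password with no trailing newline (printf, not echo) and
# lock down permissions before uploading.
printf '%s' 'mysecret' > 109mysql.pwd
chmod 400 109mysql.pwd

# Then upload it to the HDFS path that --password-file references (not run here):
#   hdfs dfs -put 109mysql.pwd /input/sqoop/pwd/109mysql.pwd
cat 109mysql.pwd    # prints mysecret with no newline
```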
Once the job is created, run it with sqoop job --exec torderincrementjob. In the log you will see a select statement like this:

SELECT MIN(id), MAX(id) FROM t_order WHERE ( id >= '1281' AND id < '100701508' )

The lower bound is the id given by --last-value; sqoop tracks this value on its own, so the next time the job runs, the starting id is 100701508. Each run covers the interval from the previous run's id up to the current maximum, which is what makes scheduled incremental imports into HDFS via crontab possible.
Every sqoop run updates last-value, and the next run starts from the new value. These are the bounds from three consecutive runs, taken from the logs:

Lower bound value: 1281
Upper bound value: 100701508
Lower bound value: 100701508
Upper bound value: 100703035
Lower bound value: 100703035
Upper bound value: 100704475
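The bookkeeping behind those log lines can be sketched with a toy simulation. This illustrates what sqoop's saved-job state does; it is not sqoop code, and the state file path is made up:

```shell
# Toy simulation of --incremental append bookkeeping: each run covers the
# id range from the saved last-value up to the table's current MAX(id),
# then persists the upper bound as the next run's last-value.
state=./torder.lastvalue
echo 1281 > "$state"

run_import() {                 # $1 = current MAX(id) in MySQL at run time
  lower=$(cat "$state")
  upper=$1
  echo "Lower bound value: $lower"
  echo "Upper bound value: $upper"
  echo "$upper" > "$state"     # what sqoop stores as the new last-value
}

run_import 100701508
run_import 100703035
run_import 100704475
cat "$state"                   # prints 100704475
```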
After the first run of the job the data is on HDFS; you can inspect it from the command line or through the NameNode web UI on port 50070.
2. Create an external table in hive over the HDFS data:

CREATE EXTERNAL TABLE `t_order`(
  `id` bigint,
  `serial` string,
  `product_id` int,
  `product_type` tinyint,
  `product_name` string,
  `quantity` double,
  `buyer_id` bigint,
  `payer_id` bigint,
  `price` double,
  `vip_price` double,
  `settle_price` double,
  `currency` string,
  `payer_level` tinyint,
  `status` tinyint,
  `pay_mode` tinyint,
  `payment_serial` string,
  `client_type` string,
  `app_type` tinyint,
  `seller_id` string,
  `partner_id` int,
  `reference` string,
  `channel_source` string,
  `note` string,
  `expiration_time` string,
  `operator` string,
  `create_time` string,
  `pay_time` string,
  `update_time` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
LOCATION
  'hdfs://golive-master:8020/mysqldb/t_order'
Now the data on HDFS can be queried through hive:
select * from golivecms20.t_order limit 10;
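Since the DDL's delimiters must match the sqoop flags exactly, it can be worth sanity-checking that the exported files really are tab-delimited with \N standing in for NULL before pointing the table at the directory. A made-up three-field sample (real t_order rows have 28 fields):

```shell
# A hypothetical row as sqoop writes it with the flags above: tab-delimited
# fields, with the literal string \N for SQL NULL (hive's default serde
# maps \N back to NULL on read).
printf '42\torder-001\t\\N\n' > sample_row.txt

# Count fields exactly the way hive will split them.
awk -F'\t' '{print NF}' sample_row.txt   # prints 3
```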
3. Crontab scheduling
Create the following three files:

timermysqltohdfs.cron   # crontab entry
timermysqltohdfs.sh     # shell script
timermysqltohdfs.log    # log file

timermysqltohdfs.sh:

#!/bin/sh
current_time=$(date +%Y%m%d%H%M%S)
echo $current_time >> /data/bigdata/app/sqoopjob/timermysqltohdfs.log
echo "............................" >> /data/bigdata/app/sqoopjob/timermysqltohdfs.log

# sync t_order
/data/bigdata/sqoop-1.4.6-cdh5.7.0/bin/sqoop job --exec torderincrementjob

# sync t_userlogout
/data/bigdata/sqoop-1.4.6-cdh5.7.0/bin/sqoop job --exec tuserlogoutincrementjob

timermysqltohdfs.cron (runs daily at 01:00, 07:00, 13:00 and 19:00):

00 1,7,13,19 * * * /bin/bash /data/bigdata/app/sqoopjob/timermysqltohdfs.sh >> /data/bigdata/app/sqoopjob/timermysqltohdfs.log 2>&1
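One defensive tweak worth considering (not part of the original setup): if an import ever runs longer than the six hours between cron fires, two sqoop jobs could overlap. Wrapping the call in flock(1) serializes them; the lock path and messages below are assumptions for illustration:

```shell
# Skip the run if the previous one is still holding the lock (-n = non-blocking).
LOCK=./timermysqltohdfs.lock
out=$(flock -n "$LOCK" sh -c '
  # the real call would be:
  # /data/bigdata/sqoop-1.4.6-cdh5.7.0/bin/sqoop job --exec torderincrementjob
  echo "import finished"
' || echo "previous run still active, skipping")
echo "$out"
```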
The other table, t_userlogout, works the same way. The relevant commands:
Create the job:

sqoop job --create tuserlogoutincrementjob -- import \
--connect jdbc:mysql://172.16.*.*:3306/*?useCursorFetch=true \
--username * \
--password-file /input/sqoop/pwd/68mysql.pwd \
--target-dir /mysqldb/t_userlogout \
--table t_userlogout \
--fields-terminated-by "\t" \
--lines-terminated-by "\n" \
--null-string '\\N' \
--null-non-string '\\N' \
--incremental append \
--check-column ID \
--last-value 1 \
-m 1

Run it for the first time: sqoop job --exec tuserlogoutincrementjob
Create the external table in hive:

CREATE EXTERNAL TABLE `t_userlogout`(
  `ID` bigint,
  `GoliveId` string,
  `InstalmentCode` string,
  `ManufacturerCode` string,
  `MacAddress` string,
  `AreaCode` string,
  `IpAddress` string,
  `LoginTime` string,
  `LogoutTime` string,
  `DeviceID` string,
  `VersionType` string,
  `Version` string,
  `Platform` string,
  `PartnerID` int,
  `BranchType` int,
  `LicenseProviderCode` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
LOCATION
  'hdfs://golive-master:8020/mysqldb/t_userlogout'
From there it is the same scheduled, incremental import as before.
Appendix
Instead of importing every column of a table with --table, you can supply a SQL statement with --query. The where clause must contain the $CONDITIONS placeholder, and inside double quotes it has to be escaped as \$CONDITIONS so the shell does not expand it:

--query "select ID,GoliveId,InstalmentCode,ManufacturerCode,MacAddress,IpAddress,LoginTime,VersionType,Version,PartnerID from t_userlogout where \$CONDITIONS" \

--query "select id,serial,product_id,product_type,product_name,buyer_id,price,vip_price,settle_price,status,pay_mode,client_type,seller_id,partner_id,channel_source,expiration_time,create_time,pay_time,update_time from t_order where \$CONDITIONS" \
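The quoting pitfall with --query is easy to demonstrate: when the statement sits in double quotes, the shell tries to expand $CONDITIONS before sqoop ever sees it, silently dropping the placeholder. The query strings below are shortened examples, not the full commands:

```shell
# With $CONDITIONS unset in the shell, double quotes drop the placeholder;
# escaping the dollar sign keeps it literal for sqoop to substitute at run time.
unset CONDITIONS
bad="select id from t_order where $CONDITIONS"
good="select id from t_order where \$CONDITIONS"
echo "$bad"     # select id from t_order where
echo "$good"    # select id from t_order where $CONDITIONS
```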