Apache Sqoop (1.4.7)

Overview

Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. It runs embedded MapReduce jobs to import and export data between relational databases and HDFS, HBase, and Hive.


Installation

1. Visit the Sqoop website at http://sqoop.apache.org/ and choose a Sqoop release to download. This article uses 1.4.7, available from https://mirrors.tuna.tsinghua.edu.cn/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz. After downloading the package, extract Sqoop:

[root@CentOS ~]# tar -zxf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /usr/
[root@CentOS ~]# cd /usr/
[root@CentOS usr]# mv sqoop-1.4.7.bin__hadoop-2.6.0 sqoop-1.4.7
[root@CentOS ~]# cd /usr/sqoop-1.4.7/

2. Configure the SQOOP_HOME environment variable

[root@CentOS sqoop-1.4.7]# vi ~/.bashrc 
SQOOP_HOME=/usr/sqoop-1.4.7
HADOOP_HOME=/usr/hadoop-2.9.2
HIVE_HOME=/usr/apache-hive-1.2.2-bin
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$SQOOP_HOME/bin
CLASSPATH=.
export JAVA_HOME
export PATH
export HADOOP_HOME
export CLASSPATH
export HIVE_HOME
export SQOOP_HOME
[root@CentOS sqoop-1.4.7]# source ~/.bashrc

3. Rename conf/sqoop-env-template.sh to sqoop-env.sh and edit it

[root@CentOS sqoop-1.4.7]# mv conf/sqoop-env-template.sh conf/sqoop-env.sh 
[root@CentOS sqoop-1.4.7]# vi conf/sqoop-env.sh 
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/usr/hadoop-2.9.2

#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/usr/hadoop-2.9.2

#set the path to where bin/hbase is available
#export HBASE_HOME=

#Set the path to where bin/hive is available
export HIVE_HOME=/usr/apache-hive-1.2.2-bin

#Set the path for where zookeper config dir is
export ZOOCFGDIR=/usr/zookeeper-3.4.6/conf

4. Copy the MySQL driver jar into Sqoop's lib directory

[root@CentOS ~]# cp /usr/apache-hive-1.2.2-bin/lib/mysql-connector-java-5.1.48.jar /usr/sqoop-1.4.7/lib/
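
If the connector jar is not already present under the Hive lib directory, it can also be fetched from Maven Central and dropped into Sqoop's lib directory directly (the 5.1.48 version below simply matches the one used in this article):

[root@CentOS ~]# wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.48/mysql-connector-java-5.1.48.jar -P /usr/sqoop-1.4.7/lib/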

5. Verify that Sqoop is installed correctly

[root@CentOS sqoop-1.4.7]# sqoop version
Warning: /usr/sqoop-1.4.7/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /usr/sqoop-1.4.7/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /usr/sqoop-1.4.7/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /usr/sqoop-1.4.7/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
19/12/22 08:40:12 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Sqoop 1.4.7
git commit id 2328971411f57f0cb683dfb79d19d4d19d185dd8
Compiled by maugli on Thu Dec 21 15:59:58 STD 2017
[root@CentOS sqoop-1.4.7]# sqoop list-tables --connect jdbc:mysql://192.168.52.1:3306/mysql --username root --password root
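
If the driver and connection settings are correct, this prints the tables of the mysql database. Listing the databases is another quick sanity check (assuming the same MySQL host and credentials as above):

[root@CentOS sqoop-1.4.7]# sqoop list-databases --connect jdbc:mysql://192.168.52.1:3306 --username root --password root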

Import/export reference: http://sqoop.apache.org/docs/1.4.7/SqoopUserGuide.html

sqoop-import

The import tool imports an individual table from an RDBMS into HDFS. Each row of the table is represented as a separate record in HDFS. Records can be stored as text files (one record per line) or in binary representation as Avro or SequenceFiles.

$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)

Full-table import

sqoop import \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://CentOS:3306/test?characterEncoding=UTF-8 \
--username root \
--password root \
--table t_user \
--num-mappers 4 \
--fields-terminated-by '\t' \
--target-dir /mysql/test/t_user \
--delete-target-dir 
Parameter                 Meaning
--connect                 JDBC URL of the database to connect to
--username                User name for the database connection
--password                Password for the database connection
--table                   Source table to import
--target-dir              Target HDFS directory (defaults to /user/<username>/<table name> if not specified)
--delete-target-dir       Delete the target directory first if it already exists in HDFS, then import into it
--num-mappers             Number of map tasks (default 4); determines how many output files are written to HDFS (the table data is split across them)
--fields-terminated-by    Field delimiter
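
Once the job finishes, the imported files can be inspected straight from HDFS (an illustrative check; with 4 map tasks the output normally consists of part-m-00000 through part-m-00003):

[root@CentOS ~]# hdfs dfs -ls /mysql/test/t_user
[root@CentOS ~]# hdfs dfs -cat /mysql/test/t_user/part-m-*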

Importing selected columns

sqoop import \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://CentOS:3306/test?characterEncoding=UTF-8 \
--username root \
--password root \
--table t_user \
--columns "id,name,age" \
--where "id > 2 or name like '%z%'" \
--target-dir /mysql/test/t_user1 \
--delete-target-dir \
--num-mappers 4 \
--fields-terminated-by '\t'
Parameter     Meaning
--columns     Columns to select
--where       Filter condition
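
Conceptually, the command above imports roughly the result of the following MySQL query (a sketch of the selection Sqoop performs, not the exact statements it generates per map task):

select id, name, age from t_user where id > 2 or name like '%z%';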

Free-form query import

sqoop import \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://CentOS:3306/test \
--username root \
--password root \
--num-mappers 3 \
--fields-terminated-by '\t' \
--query 'select id, name,sex, age ,birthDay from t_user where $CONDITIONS LIMIT 100' \
--split-by id \
--target-dir /mysql/test/t_user2 \
--delete-target-dir

To import the results of a query in parallel, each map task needs to execute a copy of the query, with its results partitioned by bounding conditions inferred by Sqoop. The query must contain the token $CONDITIONS, which each Sqoop process replaces with a unique condition expression. You must also select a splitting column with --split-by.

Note that the --split-by column should be numeric; if a text column is used, Sqoop reports the following error:

Import failed: java.io.IOException: Generating splits for a textual index column allowed only in case of "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" property passed as a parameter

To split on a text key column anyway, specify that column with --split-by and additionally pass the "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" property.
Reference: https://www.cnblogs.com/youchi/p/10342875.html
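
A minimal sketch of such a command, assuming the text column name is used for splitting (the -D property is a generic argument and must appear immediately after import; the target directory here is just illustrative):

sqoop import \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://CentOS:3306/test \
--username root \
--password root \
--query 'select id, name, sex, age, birthDay from t_user where $CONDITIONS' \
--split-by name \
--num-mappers 3 \
--fields-terminated-by '\t' \
--target-dir /mysql/test/t_user3 \
--delete-target-dir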

Importing from an RDBMS into Hive

Full import

sqoop import \
--connect jdbc:mysql://CentOS:3306/test \
--username root \
--password root \
--table t_user \
--num-mappers 3 \
--hive-import \
--fields-terminated-by "\t" \
--hive-overwrite \
--hive-table baizhi.t_user

Note: if the Hive import fails because Sqoop cannot find the Hive classes (e.g. HiveConf), copying the Hive jars into Sqoop's lib directory usually resolves it:

[root@CentOS ~]# cp /usr/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar /usr/sqoop-1.4.7/lib/
[root@CentOS ~]# cp /usr/apache-hive-1.2.2-bin/lib/hive-exec-1.2.2.jar /usr/sqoop-1.4.7/lib/

Parameter          Meaning
--hive-import      Import the data into Hive
--hive-overwrite   Overwrite the existing data if the table already exists
--hive-table       Target Hive table to import into
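
Before running the import, make sure the target Hive database exists (Sqoop creates the table as needed, but typically not the database); afterwards the result can be spot-checked from the Hive CLI. A minimal sketch:

hive> create database if not exists baizhi;
hive> select * from baizhi.t_user limit 5;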

Importing into a partition

sqoop import \
--connect jdbc:mysql://CentOS:3306/test \
--username root \
--password root \
--table t_user \
--num-mappers 3 \
--hive-import \
--fields-terminated-by "\t" \
--hive-overwrite \
--hive-table baizhi.t_user \
--hive-partition-key city \
--hive-partition-value 'bj'
Parameter                Meaning
--hive-partition-key     Partition column of the Hive table
--hive-partition-value   Partition value to import into
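
If the import succeeds, the new partition should be visible from the Hive CLI (an illustrative check):

hive> show partitions baizhi.t_user;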

RDBMS -> HBase

sqoop import \
--connect jdbc:mysql://CentOS:3306/test \
--username root \
--password root \
--table t_user \
--num-mappers 3 \
--hbase-table baizhi:t_user \
--column-family cf1 \
--hbase-create-table \
--hbase-row-key id \
--hbase-bulkload 
Parameter               Meaning
--hbase-table           Target HBase table to write into
--column-family         Column family to import into
--hbase-create-table    Create the HBase table if it does not exist
--hbase-row-key         Column to use as the row key
--hbase-bulkload        Enable HBase bulk loading

Start the HBase service and create the baizhi namespace beforehand; the table t_user is created automatically by Sqoop!
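
A sketch of the HBase shell preparation and a post-import check (the namespace has to exist before the import runs, while the table itself is created by Sqoop because of --hbase-create-table):

hbase(main):001:0> create_namespace 'baizhi'
hbase(main):002:0> scan 'baizhi:t_user', {LIMIT => 3}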

sqoop-export

The export tool exports a set of files from HDFS back to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to the delimiters specified by the user.

HDFS -> MySQL

0       zhangsan        true    20      2020-01-11
1       lisi    false   25      2020-01-10
3       wangwu  true    36      2020-01-17
4       zhaoliu        false   50      1990-02-08
5       win7    true    20      1991-02-08
create table t_user(
 id int primary key auto_increment,
 name VARCHAR(32),
 sex boolean,
 age int,
 birthDay date
) CHARACTER SET=utf8;
sqoop export \
--connect jdbc:mysql://CentOS:3306/test \
--username root \
--password root \
--table t_user  \
--update-key id  \
--update-mode allowinsert \
--export-dir /demo/src \
--input-fields-terminated-by '\t'
Parameter                       Meaning
--export-dir                    HDFS directory containing the data to export
--input-fields-terminated-by    Field delimiter of the input files

The --update-mode value can be either updateonly or allowinsert; updateonly only updates records that already exist, while allowinsert also inserts rows whose key is not yet present.
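
For comparison, an update-only variant of the same export might look like this (rows whose id already exists in MySQL are updated in place; rows with new keys are simply skipped):

sqoop export \
--connect jdbc:mysql://CentOS:3306/test \
--username root \
--password root \
--table t_user \
--update-key id \
--update-mode updateonly \
--export-dir /demo/src \
--input-fields-terminated-by '\t'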

HBase -> MySQL

HBase -> Hive

Exporting Hive -> RDBMS is equivalent to HDFS -> RDBMS

① Prepare the test data

Test data for t_employee:

7369,SMITH,CLERK,7902,1980-12-17 00:00:00,800,\N,20
7499,ALLEN,SALESMAN,7698,1981-02-20 00:00:00,1600,300,30
7521,WARD,SALESMAN,7698,1981-02-22 00:00:00,1250,500,30
7566,JONES,MANAGER,7839,1981-04-02 00:00:00,2975,\N,20
7654,MARTIN,SALESMAN,7698,1981-09-28 00:00:00,1250,1400,30
7698,BLAKE,MANAGER,7839,1981-05-01 00:00:00,2850,\N,30
7782,CLARK,MANAGER,7839,1981-06-09 00:00:00,2450,\N,10
7788,SCOTT,ANALYST,7566,1987-04-19 00:00:00,1500,\N,20
7839,KING,PRESIDENT,\N,1981-11-17 00:00:00,5000,\N,10
7844,TURNER,SALESMAN,7698,1981-09-08 00:00:00,1500,0,30
7876,ADAMS,CLERK,7788,1987-05-23 00:00:00,1100,\N,20
7900,JAMES,CLERK,7698,1981-12-03 00:00:00,950,\N,30
7902,FORD,ANALYST,7566,1981-12-03 00:00:00,3000,\N,20
7934,MILLER,CLERK,7782,1982-01-23 00:00:00,1300,\N,10
create database if not exists baizhi;
use baizhi;
drop table if exists t_employee;
CREATE TABLE t_employee(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7,2),
    comm DECIMAL(7,2),
    deptno INT)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n'
stored as textfile;
load data local inpath '/root/t_employee' overwrite into table t_employee;

drop table if exists t_employee_hbase;
create external table t_employee_hbase(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7,2),
    comm DECIMAL(7,2),
    deptno INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES("hbase.columns.mapping" = ":key,cf1:name,cf1:job,cf1:mgr,cf1:hiredate,cf1:sal,cf1:comm,cf1:deptno") 
TBLPROPERTIES("hbase.table.name" = "baizhi:t_employee");

insert overwrite  table t_employee_hbase  select empno,ename,job,mgr,hiredate,sal,comm,deptno from t_employee;

② First export the HBase data to HDFS

INSERT OVERWRITE  DIRECTORY '/demo/src/employee' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE select empno,ename,job,mgr,hiredate,sal,comm,deptno from t_employee_hbase;
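
The directory written by the INSERT OVERWRITE DIRECTORY statement can be inspected before exporting (the exact file names depend on the Hive job, so a wildcard is used here):

[root@CentOS ~]# hdfs dfs -ls /demo/src/employee
[root@CentOS ~]# hdfs dfs -cat /demo/src/employee/*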

③ Export the data from HDFS to the RDBMS

sqoop export \
--connect jdbc:mysql://CentOS:3306/test \
--username root \
--password root \
--table t_employee  \
--update-key empno  \
--update-mode allowinsert \
--export-dir /demo/src/employee \
--input-fields-terminated-by ',' \
--input-null-string '\\N' \
--input-null-non-string '\\N';
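
As with every sqoop export, the target MySQL table must already exist. A DDL sketch that matches the Hive schema above (column types and lengths here are assumptions, not taken from the original article):

create table t_employee(
 empno int primary key,
 ename varchar(32),
 job varchar(32),
 mgr int,
 hiredate datetime,
 sal decimal(7,2),
 comm decimal(7,2),
 deptno int
) CHARACTER SET=utf8;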