The previous article covered full synchronization of data into Hive with Sqoop, and also linked my pseudo-distributed installation guide for hadoop+hive+hbase+sqoop+kylin. Previous article: Sqoop full synchronization of MySQL/Oracle data to Hive.
This article walks through incremental synchronization into Hive with hands-on experiments, and shows how to combine a sqoop job with crontab scheduling and password-free login to automate incremental syncs.
I. Background knowledge
In production, a system typically imports data from business relational databases into Hadoop on a schedule, then runs offline analysis in the warehouse. Re-importing the entire dataset every time is not feasible, so we need the incremental import pattern.
Incremental import comes in two flavors: incremental import based on a monotonically increasing column (Append mode), and incremental import based on a time column (Lastmodified mode). The core parameters are:
--check-column
Specifies the column(s) examined during an incremental import to decide whether a row counts as incremental data, similar to an auto-increment field or timestamp in a relational database.
Note: the checked column must not be a character type; char, varchar, and the like are not allowed. The original note adds that --check-column may name more than one column.
--incremental
Specifies the incremental import mode, one of append or lastmodified.
--last-value
Specifies the maximum value of the check column seen in the previous import.
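Putting the three parameters together, a generic invocation looks like the following sketch. The connection string, credentials, and table name (HOST, SID, USER, PASS, SOME_TABLE) are placeholders, not values from this article; the function only assembles the command string so it can be inspected before running.

```shell
# Sketch of a generic Sqoop incremental import command line.
# All uppercase tokens are placeholders; the function just builds
# the string rather than executing anything.
build_incremental_import() {
  local mode="$1" check_col="$2" last_val="$3"
  printf 'sqoop import --connect jdbc:oracle:thin:@HOST:1521:SID --username USER --password PASS --table SOME_TABLE -m 1 --incremental %s --check-column %s --last-value %s\n' \
    "$mode" "$check_col" "$last_val"
}
build_incremental_import append EMPNO 5
```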
The following experiments illustrate the details.
1. Append-mode incremental import
Key parameters:
--incremental append
Incremental import based on an increasing column (all rows whose value in that column exceeds the threshold are imported into Hadoop).
--check-column
The increasing column (int).
--last-value
The threshold (int).
As a simple example, the scott schema of an Oracle database has an employee table (inr_app) with these columns: auto-increment primary key employee number (empno), employee name (ename), job title (job), and salary (sal):
--Create the table under scott in the Oracle database
create table inr_app as
select rownum as empno, ename, job, sal
from emp a
where job is not null
and rownum<=5;
--Query:
select * from inr_app;
EMPNO ENAME JOB SAL
1 er CLERK 800.00
2 ALLEN SALESMAN 1600.00
3 WARD SALESMAN 1250.00
4 JONES MANAGER 2975.00
5 MARTIN SALESMAN 1250.00
We want new hires available in Hadoop for HR analysis, so we first load this table into Hive; this is the one-off full import that precedes incremental imports:
--Create the table in Hive:
create table INR_APP
(
empno int,
ename string,
job string,
sal float
);
hive> show tables;
OK
inr_app
inr_emp
ora_hive
Time taken: 0.166 seconds, Fetched: 3 row(s)
--Run the initial full import:
[root@hadoop ~]# sqoop import --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl --username scott --password tiger --table INR_APP -m 1 --hive-import --hive-database oracle
--Query the Hive table
hive> select * from inr_app;
OK
1 er CLERK 800.0
2 ALLEN SALESMAN 1600.0
3 WARD SALESMAN 1250.0
4 JONES MANAGER 2975.0
5 MARTIN SALESMAN 1250.0
Time taken: 0.179 seconds, Fetched: 5 row(s)
Some time later a new batch of employees joins, and we need to import them into Hadoop as well. Now we only need to set --incremental to append and --last-value to 5, meaning the import starts from ids greater than 5:
--First insert a few rows into scott.inr_app on the Oracle side:
insert into inr_app values(6,'zhao','DBA',100);
insert into inr_app values(7,'yan','BI',100);
insert into inr_app values(8,'dong','JAVA',100);
commit;
--Run the incremental import:
[root@hadoop ~]# sqoop import --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl --username scott --password tiger --table INR_APP -m 1 --hive-import --hive-database oracle --incremental append --check-column EMPNO --last-value 5
Warning: /hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /hadoop/sqoop/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
19/03/12 19:45:55 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
19/03/12 19:45:56 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/03/12 19:45:56 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
19/03/12 19:45:56 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
19/03/12 19:45:56 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
19/03/12 19:45:56 INFO manager.SqlManager: Using default fetchSize of 1000
19/03/12 19:45:56 INFO tool.CodeGenTool: Beginning code generation
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hbase/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/03/12 19:45:57 INFO manager.OracleManager: Time zone has been set to GMT
19/03/12 19:45:57 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM INR_APP t WHERE 1=0
19/03/12 19:45:57 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /hadoop
Note: /tmp/sqoop-root/compile/9b898359374ea580a390b32da1a37949/INR_APP.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/03/12 19:45:59 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/9b898359374ea580a390b32da1a37949/INR_APP.jar
19/03/12 19:45:59 INFO manager.OracleManager: Time zone has been set to GMT
19/03/12 19:45:59 INFO tool.ImportTool: Maximal id query for free form incremental import: SELECT MAX(EMPNO) FROM INR_APP
19/03/12 19:45:59 INFO tool.ImportTool: Incremental import based on column EMPNO
19/03/12 19:45:59 INFO tool.ImportTool: Lower bound value: 5
19/03/12 19:45:59 INFO tool.ImportTool: Upper bound value: 8
19/03/12 19:45:59 INFO manager.OracleManager: Time zone has been set to GMT
19/03/12 19:45:59 INFO mapreduce.ImportJobBase: Beginning import of INR_APP
19/03/12 19:46:00 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/03/12 19:46:00 INFO manager.OracleManager: Time zone has been set to GMT
19/03/12 19:46:01 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/03/12 19:46:01 INFO client.RMProxy: Connecting to ResourceManager at /192.168.1.66:8032
19/03/12 19:46:04 INFO db.DBInputFormat: Using read commited transaction isolation
19/03/12 19:46:04 INFO mapreduce.JobSubmitter: number of splits:1
19/03/12 19:46:05 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1552371714699_0010
19/03/12 19:46:05 INFO impl.YarnClientImpl: Submitted application application_1552371714699_0010
19/03/12 19:46:05 INFO mapreduce.Job: The url to track the job: http://hadoop:8088/proxy/application_1552371714699_0010/
19/03/12 19:46:05 INFO mapreduce.Job: Running job: job_1552371714699_0010
19/03/12 19:46:13 INFO mapreduce.Job: Job job_1552371714699_0010 running in uber mode : false
19/03/12 19:46:13 INFO mapreduce.Job: map 0% reduce 0%
19/03/12 19:46:21 INFO mapreduce.Job: map 100% reduce 0%
19/03/12 19:46:21 INFO mapreduce.Job: Job job_1552371714699_0010 completed successfully
19/03/12 19:46:21 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=143702
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=44
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=4336
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=4336
Total vcore-milliseconds taken by all map tasks=4336
Total megabyte-milliseconds taken by all map tasks=4440064
Map-Reduce Framework
Map input records=3
Map output records=3
Input split bytes=87
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=92
CPU time spent (ms)=2760
Physical memory (bytes) snapshot=211570688
Virtual memory (bytes) snapshot=2133770240
Total committed heap usage (bytes)=106954752
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=44
19/03/12 19:46:21 INFO mapreduce.ImportJobBase: Transferred 44 bytes in 20.3436 seconds (2.1628 bytes/sec)
19/03/12 19:46:21 INFO mapreduce.ImportJobBase: Retrieved 3 records.
19/03/12 19:46:21 INFO mapreduce.ImportJobBase: Publishing Hive/Hcat import job data to Listeners for table INR_APP
19/03/12 19:46:21 INFO util.AppendUtils: Creating missing output directory - INR_APP
19/03/12 19:46:21 INFO manager.OracleManager: Time zone has been set to GMT
19/03/12 19:46:21 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM INR_APP t WHERE 1=0
19/03/12 19:46:21 WARN hive.TableDefWriter: Column EMPNO had to be cast to a less precise type in Hive
19/03/12 19:46:21 WARN hive.TableDefWriter: Column SAL had to be cast to a less precise type in Hive
19/03/12 19:46:21 INFO hive.HiveImport: Loading uploaded data into Hive
19/03/12 19:46:21 INFO conf.HiveConf: Found configuration file file:/hadoop/hive/conf/hive-site.xml
Logging initialized using configuration in jar:file:/hadoop/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true
19/03/12 19:46:24 INFO SessionState:
Logging initialized using configuration in jar:file:/hadoop/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true
19/03/12 19:46:24 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/2968942b-30b6-49f5-b86c-d71a77963381
19/03/12 19:46:24 INFO session.SessionState: Created local directory: /hadoop/hive/tmp/root/2968942b-30b6-49f5-b86c-d71a77963381
19/03/12 19:46:24 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/2968942b-30b6-49f5-b86c-d71a77963381/_tmp_space.db
19/03/12 19:46:24 INFO conf.HiveConf: Using the default value passed in for log id: 2968942b-30b6-49f5-b86c-d71a77963381
19/03/12 19:46:24 INFO session.SessionState: Updating thread name to 2968942b-30b6-49f5-b86c-d71a77963381 main
19/03/12 19:46:24 INFO conf.HiveConf: Using the default value passed in for log id: 2968942b-30b6-49f5-b86c-d71a77963381
19/03/12 19:46:24 INFO ql.Driver: Compiling command(queryId=root_20190312114624_6679c12a-4224-4bcd-a8be-f7d4ae56a139): CREATE TABLE IF NOT EXISTS `oracle`.`INR_APP` ( `EMPNO` DOUBLE, `ENAME` STRING, `JOB` STRING, `SAL` DOUBLE) COMMENT 'Imported by sqoop on 2019/03/12 11:46:21' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\012' STORED AS TEXTFILE
19/03/12 19:46:27 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.1.66:9083
19/03/12 19:46:27 INFO hive.metastore: Opened a connection to metastore, current connections: 1
19/03/12 19:46:27 INFO hive.metastore: Connected to metastore.
19/03/12 19:46:27 INFO parse.CalcitePlanner: Starting Semantic Analysis
19/03/12 19:46:27 INFO parse.CalcitePlanner: Creating table oracle.INR_APP position=27
19/03/12 19:46:27 INFO ql.Driver: Semantic Analysis Completed
19/03/12 19:46:27 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
19/03/12 19:46:27 INFO ql.Driver: Completed compiling command(queryId=root_20190312114624_6679c12a-4224-4bcd-a8be-f7d4ae56a139); Time taken: 2.876 seconds
19/03/12 19:46:27 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager
19/03/12 19:46:27 INFO ql.Driver: Executing command(queryId=root_20190312114624_6679c12a-4224-4bcd-a8be-f7d4ae56a139): CREATE TABLE IF NOT EXISTS `oracle`.`INR_APP` ( `EMPNO` DOUBLE, `ENAME` STRING, `JOB` STRING, `SAL` DOUBLE) COMMENT 'Imported by sqoop on 2019/03/12 11:46:21' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\012' STORED AS TEXTFILE
19/03/12 19:46:27 INFO sqlstd.SQLStdHiveAccessController: Created SQLStdHiveAccessController for session context : HiveAuthzSessionContext [sessionString=2968942b-30b6-49f5-b86c-d71a77963381, clientType=HIVECLI]
19/03/12 19:46:27 WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
19/03/12 19:46:27 INFO hive.metastore: Mestastore configuration hive.metastore.filter.hook changed from org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl to org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook
19/03/12 19:46:27 INFO hive.metastore: Closed a connection to metastore, current connections: 0
19/03/12 19:46:27 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.1.66:9083
19/03/12 19:46:27 INFO hive.metastore: Opened a connection to metastore, current connections: 1
19/03/12 19:46:27 INFO hive.metastore: Connected to metastore.
19/03/12 19:46:27 INFO ql.Driver: Completed executing command(queryId=root_20190312114624_6679c12a-4224-4bcd-a8be-f7d4ae56a139); Time taken: 0.096 seconds
OK
19/03/12 19:46:27 INFO ql.Driver: OK
Time taken: 2.982 seconds
19/03/12 19:46:27 INFO CliDriver: Time taken: 2.982 seconds
19/03/12 19:46:27 INFO conf.HiveConf: Using the default value passed in for log id: 2968942b-30b6-49f5-b86c-d71a77963381
19/03/12 19:46:27 INFO session.SessionState: Resetting thread name to main
19/03/12 19:46:27 INFO conf.HiveConf: Using the default value passed in for log id: 2968942b-30b6-49f5-b86c-d71a77963381
19/03/12 19:46:27 INFO session.SessionState: Updating thread name to 2968942b-30b6-49f5-b86c-d71a77963381 main
19/03/12 19:46:27 INFO ql.Driver: Compiling command(queryId=root_20190312114627_748c136c-1446-43df-a819-728becae7df2):
LOAD DATA INPATH 'hdfs://192.168.1.66:9000/user/root/INR_APP' INTO TABLE `oracle`.`INR_APP`
19/03/12 19:46:28 INFO ql.Driver: Semantic Analysis Completed
19/03/12 19:46:28 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
19/03/12 19:46:28 INFO ql.Driver: Completed compiling command(queryId=root_20190312114627_748c136c-1446-43df-a819-728becae7df2); Time taken: 0.421 seconds
19/03/12 19:46:28 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager
19/03/12 19:46:28 INFO ql.Driver: Executing command(queryId=root_20190312114627_748c136c-1446-43df-a819-728becae7df2):
LOAD DATA INPATH 'hdfs://192.168.1.66:9000/user/root/INR_APP' INTO TABLE `oracle`.`INR_APP`
19/03/12 19:46:28 INFO ql.Driver: Starting task [Stage-0:MOVE] in serial mode
19/03/12 19:46:28 INFO hive.metastore: Closed a connection to metastore, current connections: 0
Loading data to table oracle.inr_app
19/03/12 19:46:28 INFO exec.Task: Loading data to table oracle.inr_app from hdfs://192.168.1.66:9000/user/root/INR_APP
19/03/12 19:46:28 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.1.66:9083
19/03/12 19:46:28 INFO hive.metastore: Opened a connection to metastore, current connections: 1
19/03/12 19:46:28 INFO hive.metastore: Connected to metastore.
19/03/12 19:46:28 ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
19/03/12 19:46:28 INFO ql.Driver: Starting task [Stage-1:STATS] in serial mode
19/03/12 19:46:28 INFO exec.StatsTask: Executing stats task
19/03/12 19:46:28 INFO hive.metastore: Closed a connection to metastore, current connections: 0
19/03/12 19:46:28 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.1.66:9083
19/03/12 19:46:28 INFO hive.metastore: Opened a connection to metastore, current connections: 1
19/03/12 19:46:28 INFO hive.metastore: Connected to metastore.
19/03/12 19:46:29 INFO hive.metastore: Closed a connection to metastore, current connections: 0
19/03/12 19:46:29 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.1.66:9083
19/03/12 19:46:29 INFO hive.metastore: Opened a connection to metastore, current connections: 1
19/03/12 19:46:29 INFO hive.metastore: Connected to metastore.
19/03/12 19:46:29 INFO exec.StatsTask: Table oracle.inr_app stats: [numFiles=2, numRows=0, totalSize=146, rawDataSize=0]
19/03/12 19:46:29 INFO ql.Driver: Completed executing command(queryId=root_20190312114627_748c136c-1446-43df-a819-728becae7df2); Time taken: 0.992 seconds
OK
19/03/12 19:46:29 INFO ql.Driver: OK
Time taken: 1.415 seconds
19/03/12 19:46:29 INFO CliDriver: Time taken: 1.415 seconds
19/03/12 19:46:29 INFO conf.HiveConf: Using the default value passed in for log id: 2968942b-30b6-49f5-b86c-d71a77963381
19/03/12 19:46:29 INFO session.SessionState: Resetting thread name to main
19/03/12 19:46:29 INFO conf.HiveConf: Using the default value passed in for log id: 2968942b-30b6-49f5-b86c-d71a77963381
19/03/12 19:46:29 INFO session.SessionState: Deleted directory: /tmp/hive/root/2968942b-30b6-49f5-b86c-d71a77963381 on fs with scheme hdfs
19/03/12 19:46:29 INFO session.SessionState: Deleted directory: /hadoop/hive/tmp/root/2968942b-30b6-49f5-b86c-d71a77963381 on fs with scheme file
19/03/12 19:46:29 INFO hive.metastore: Closed a connection to metastore, current connections: 0
19/03/12 19:46:29 INFO hive.HiveImport: Hive import complete.
19/03/12 19:46:29 INFO hive.HiveImport: Export directory is empty, removing it.
19/03/12 19:46:29 INFO tool.ImportTool: Incremental import complete! To run another incremental import of all data following this import, supply the following arguments:
19/03/12 19:46:29 INFO tool.ImportTool: --incremental append
19/03/12 19:46:29 INFO tool.ImportTool: --check-column EMPNO
19/03/12 19:46:29 INFO tool.ImportTool: --last-value 8
19/03/12 19:46:29 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')
Query the Hive table:
hive> select * from inr_app;
OK
1 er CLERK 800.0
2 ALLEN SALESMAN 1600.0
3 WARD SALESMAN 1250.0
4 JONES MANAGER 2975.0
5 MARTIN SALESMAN 1250.0
6 zhao DBA 100.0
7 yan BI 100.0
8 dong JAVA 100.0
Time taken: 0.165 seconds, Fetched: 8 row(s)
The incremental rows have arrived. We can also inspect the generated data files with hdfs dfs -cat; their location was set when Hadoop was configured earlier, or you can browse them in your own environment at IP:50070/explorer.html#/
[root@hadoop ~]# hdfs dfs -cat /user/hive/warehouse/oracle.db/inr_app/part-m-00000_copy_1
6zhaoDBA100
7yanBI100
8dongJAVA100
The data from the earlier full load is also there:
[root@hadoop ~]# hdfs dfs -cat /user/hive/warehouse/oracle.db/inr_app/part-m-00000
1erCLERK800
2ALLENSALESMAN1600
3WARDSALESMAN1250
4JONESMANAGER2975
5MARTINSALESMAN1250
2. Lastmodified incremental import
Lastmodified incremental import has two modes:
a. --incremental lastmodified --append (append mode)
b. --incremental lastmodified --merge-key (merge mode)
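The two variants listed above differ only in the trailing flag. A sketch using the article's table and column names; connection options are omitted for readability, and the timestamp is just an example value:

```shell
# Sketch: the two lastmodified variants. Only the final flag differs;
# --connect/--username/--password options are left out here.
BASE="sqoop import --table INR_LAS --incremental lastmodified --check-column ETLTIME --last-value '2019-03-20 10:42:27'"
echo "$BASE --append"            # updated rows are re-imported as new files (duplicates possible)
echo "$BASE --merge-key EMPNO"   # updated rows replace old ones, merged on the EMPNO key
```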
Let's continue with the experiments:
Experiment 1: append mode
This approach requires a time column in the source table: you give Sqoop a timestamp, and it imports all rows modified after that timestamp into Hadoop (HDFS here). Since an employee's salary may change later, the time column is updated whenever the row changes, and Sqoop will then import the changed row again, so this mode leads to duplicate data.
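Because append mode re-imports every changed row, downstream queries usually need to deduplicate. One common approach (my sketch, not part of the original walkthrough) is a HiveQL window query that keeps only the newest etltime per empno; here it is just printed, to be run in the hive CLI against the article's oracle.inr_las table:

```shell
# Sketch: HiveQL to keep only the latest version of each employee row
# after repeated lastmodified+append imports. The script only prints
# the query; execute it in the hive CLI.
DEDUP_SQL="
SELECT empno, ename, job, sal, etltime
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (PARTITION BY empno ORDER BY etltime DESC) AS rn
  FROM oracle.inr_las t
) d
WHERE rn = 1
"
echo "$DEDUP_SQL"
```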
First, create a table inr_las in the Oracle scott schema based on scott.inr_app, adding a time column etltime initialized to sysdate for the existing rows:
create table inr_las as select a.empno,
a.ename,
a.job,
a.sal,
sysdate as etltime
from inr_app a;
select * from inr_las;
EMPNO ENAME JOB SAL ETLTIME
1 er CLERK 800.00 2019/3/20 10:42:27
2 ALLEN SALESMAN 1600.00 2019/3/20 10:42:27
3 WARD SALESMAN 1250.00 2019/3/20 10:42:27
4 JONES MANAGER 2975.00 2019/3/20 10:42:27
5 MARTIN SALESMAN 1250.00 2019/3/20 10:42:27
6 zhao DBA 100.00 2019/3/20 10:42:27
7 yan BI 100.00 2019/3/20 10:42:27
8 dong JAVA 100.00 2019/3/20 10:42:27
Create the table in Hive, using '\t' as the field delimiter throughout; the import below uses the same delimiter:
create table INR_LAS
(
empno int,
ename string,
job string,
sal float,
etltime string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
Initial full import:
[root@hadoop ~]# sqoop import --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl --username scott --password tiger --table INR_LAS -m 1 --hive-import --hive-database oracle --fields-terminated-by '\t' --lines-terminated-by '\n'
Query the Hive table:
hive> select * from inr_las;
OK
1 er CLERK 800.0 2019-03-20 10:42:27.0
2 ALLEN SALESMAN 1600.0 2019-03-20 10:42:27.0
3 WARD SALESMAN 1250.0 2019-03-20 10:42:27.0
4 JONES MANAGER 2975.0 2019-03-20 10:42:27.0
5 MARTIN SALESMAN 1250.0 2019-03-20 10:42:27.0
6 zhao DBA 100.0 2019-03-20 10:42:27.0
7 yan BI 100.0 2019-03-20 10:42:27.0
8 dong JAVA 100.0 2019-03-20 10:42:27.0
Time taken: 0.181 seconds, Fetched: 8 row(s)
For this incremental import, let's first try --incremental lastmodified with --last-value and --append to see the effect. Start by changing some data in the source inr_las table:
update inr_las set sal=1000,etltime=sysdate where empno=6;
commit;
select * from inr_las;
EMPNO ENAME JOB SAL ETLTIME
1 er CLERK 800.00 2019/3/20 10:42:27
2 ALLEN SALESMAN 1600.00 2019/3/20 10:42:27
3 WARD SALESMAN 1250.00 2019/3/20 10:42:27
4 JONES MANAGER 2975.00 2019/3/20 10:42:27
5 MARTIN SALESMAN 1250.00 2019/3/20 10:42:27
6 zhao DBA 1000.00 2019/3/20 10:52:34
7 yan BI 100.00 2019/3/20 10:42:27
8 dong JAVA 100.00 2019/3/20 10:42:27
Now run the incremental import:
[root@hadoop ~]# sqoop import --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl --username scott --password tiger --table INR_LAS --fields-terminated-by '\t' --lines-terminated-by '\n' --hive-import --hive-database oracle --hive-table INR_LAS --incremental append --check-column ETLTIME --last-value '2019-03-20 10:42:27' -m 1 --null-string '\\N' --null-non-string '\\N'
Warning: /hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /hadoop/sqoop/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
19/03/13 14:46:26 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
19/03/13 14:46:26 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/03/13 14:46:27 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
19/03/13 14:46:27 INFO manager.SqlManager: Using default fetchSize of 1000
19/03/13 14:46:27 INFO tool.CodeGenTool: Beginning code generation
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hbase/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/03/13 14:46:27 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 14:46:27 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM INR_LAS t WHERE 1=0
19/03/13 14:46:28 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /hadoop
Note: /tmp/sqoop-root/compile/37cf0f81337f33bc731bf3d6fd0a3f73/INR_LAS.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/03/13 14:46:30 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/37cf0f81337f33bc731bf3d6fd0a3f73/INR_LAS.jar
19/03/13 14:46:30 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 14:46:30 INFO tool.ImportTool: Maximal id query for free form incremental import: SELECT MAX(ETLTIME) FROM INR_LAS
19/03/13 14:46:30 INFO tool.ImportTool: Incremental import based on column ETLTIME
19/03/13 14:46:30 INFO tool.ImportTool: Lower bound value: TO_TIMESTAMP('2019-03-20 10:42:27', 'YYYY-MM-DD HH24:MI:SS.FF')
19/03/13 14:46:30 INFO tool.ImportTool: Upper bound value: TO_TIMESTAMP('2019-03-20 10:52:34.0', 'YYYY-MM-DD HH24:MI:SS.FF')
19/03/13 14:46:31 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 14:46:31 INFO mapreduce.ImportJobBase: Beginning import of INR_LAS
19/03/13 14:46:31 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/03/13 14:46:31 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 14:46:32 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/03/13 14:46:32 INFO client.RMProxy: Connecting to ResourceManager at /192.168.1.66:8032
19/03/13 14:46:35 INFO db.DBInputFormat: Using read commited transaction isolation
19/03/13 14:46:35 INFO mapreduce.JobSubmitter: number of splits:1
19/03/13 14:46:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1552371714699_0031
19/03/13 14:46:36 INFO impl.YarnClientImpl: Submitted application application_1552371714699_0031
19/03/13 14:46:36 INFO mapreduce.Job: The url to track the job: http://hadoop:8088/proxy/application_1552371714699_0031/
19/03/13 14:46:36 INFO mapreduce.Job: Running job: job_1552371714699_0031
19/03/13 14:46:45 INFO mapreduce.Job: Job job_1552371714699_0031 running in uber mode : false
19/03/13 14:46:45 INFO mapreduce.Job: map 0% reduce 0%
19/03/13 14:46:52 INFO mapreduce.Job: map 100% reduce 0%
19/03/13 14:46:53 INFO mapreduce.Job: Job job_1552371714699_0031 completed successfully
19/03/13 14:46:54 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=143840
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=38
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=4950
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=4950
Total vcore-milliseconds taken by all map tasks=4950
Total megabyte-milliseconds taken by all map tasks=5068800
Map-Reduce Framework
Map input records=1
Map output records=1
Input split bytes=87
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=560
CPU time spent (ms)=2890
Physical memory (bytes) snapshot=189190144
Virtual memory (bytes) snapshot=2141667328
Total committed heap usage (bytes)=116391936
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=38
19/03/13 14:46:54 INFO mapreduce.ImportJobBase: Transferred 38 bytes in 21.7168 seconds (1.7498 bytes/sec)
19/03/13 14:46:54 INFO mapreduce.ImportJobBase: Retrieved 1 records.
19/03/13 14:46:54 INFO mapreduce.ImportJobBase: Publishing Hive/Hcat import job data to Listeners for table INR_LAS
19/03/13 14:46:54 INFO util.AppendUtils: Creating missing output directory - INR_LAS
19/03/13 14:46:54 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 14:46:54 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM INR_LAS t WHERE 1=0
19/03/13 14:46:54 WARN hive.TableDefWriter: Column EMPNO had to be cast to a less precise type in Hive
19/03/13 14:46:54 WARN hive.TableDefWriter: Column SAL had to be cast to a less precise type in Hive
19/03/13 14:46:54 WARN hive.TableDefWriter: Column ETLTIME had to be cast to a less precise type in Hive
19/03/13 14:46:54 INFO hive.HiveImport: Loading uploaded data into Hive
19/03/13 14:46:54 INFO conf.HiveConf: Found configuration file file:/hadoop/hive/conf/hive-site.xml
Logging initialized using configuration in jar:file:/hadoop/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true
19/03/13 14:46:57 INFO SessionState:
Logging initialized using configuration in jar:file:/hadoop/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true
19/03/13 14:46:57 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/dbf3aaff-4a20-426b-bc59-9117e821a2f5
19/03/13 14:46:57 INFO session.SessionState: Created local directory: /hadoop/hive/tmp/root/dbf3aaff-4a20-426b-bc59-9117e821a2f5
19/03/13 14:46:57 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/dbf3aaff-4a20-426b-bc59-9117e821a2f5/_tmp_space.db
19/03/13 14:46:57 INFO conf.HiveConf: Using the default value passed in for log id: dbf3aaff-4a20-426b-bc59-9117e821a2f5
19/03/13 14:46:57 INFO session.SessionState: Updating thread name to dbf3aaff-4a20-426b-bc59-9117e821a2f5 main
19/03/13 14:46:57 INFO conf.HiveConf: Using the default value passed in for log id: dbf3aaff-4a20-426b-bc59-9117e821a2f5
19/03/13 14:46:57 INFO ql.Driver: Compiling command(queryId=root_20190313064657_78359340-8092-4093-a9ed-b5a8e82ea901): CREATE TABLE IF NOT EXISTS `oracle`.`INR_LAS` ( `EMPNO` DOUBLE, `ENAME` STRING, `JOB` STRING, `SAL` DOUBLE, `ETLTIME` STRING) COMMENT 'Imported by sqoop on 2019/03/13 06:46:54' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\011' LINES TERMINATED BY '\012' STORED AS TEXTFILE
19/03/13 14:47:00 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.1.66:9083
19/03/13 14:47:00 INFO hive.metastore: Opened a connection to metastore, current connections: 1
19/03/13 14:47:00 INFO hive.metastore: Connected to metastore.
19/03/13 14:47:00 INFO parse.CalcitePlanner: Starting Semantic Analysis
19/03/13 14:47:00 INFO parse.CalcitePlanner: Creating table oracle.INR_LAS position=27
19/03/13 14:47:00 INFO ql.Driver: Semantic Analysis Completed
19/03/13 14:47:00 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
19/03/13 14:47:00 INFO ql.Driver: Completed compiling command(queryId=root_20190313064657_78359340-8092-4093-a9ed-b5a8e82ea901); Time taken: 3.122 seconds
19/03/13 14:47:00 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager
19/03/13 14:47:00 INFO ql.Driver: Executing command(queryId=root_20190313064657_78359340-8092-4093-a9ed-b5a8e82ea901): CREATE TABLE IF NOT EXISTS `oracle`.`INR_LAS` ( `EMPNO` DOUBLE, `ENAME` STRING, `JOB` STRING, `SAL` DOUBLE, `ETLTIME` STRING) COMMENT 'Imported by sqoop on 2019/03/13 06:46:54' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\011' LINES TERMINATED BY '\012' STORED AS TEXTFILE
19/03/13 14:47:00 INFO sqlstd.SQLStdHiveAccessController: Created SQLStdHiveAccessController for session context : HiveAuthzSessionContext [sessionString=dbf3aaff-4a20-426b-bc59-9117e821a2f5, clientType=HIVECLI]
19/03/13 14:47:00 WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
19/03/13 14:47:00 INFO hive.metastore: Mestastore configuration hive.metastore.filter.hook changed from org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl to org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook
19/03/13 14:47:00 INFO hive.metastore: Closed a connection to metastore, current connections: 0
19/03/13 14:47:00 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.1.66:9083
19/03/13 14:47:00 INFO hive.metastore: Opened a connection to metastore, current connections: 1
19/03/13 14:47:00 INFO hive.metastore: Connected to metastore.
19/03/13 14:47:00 INFO ql.Driver: Completed executing command(queryId=root_20190313064657_78359340-8092-4093-a9ed-b5a8e82ea901); Time taken: 0.099 seconds
OK
19/03/13 14:47:00 INFO ql.Driver: OK
Time taken: 3.234 seconds
19/03/13 14:47:00 INFO CliDriver: Time taken: 3.234 seconds
19/03/13 14:47:00 INFO conf.HiveConf: Using the default value passed in for log id: dbf3aaff-4a20-426b-bc59-9117e821a2f5
19/03/13 14:47:00 INFO session.SessionState: Resetting thread name to main
19/03/13 14:47:00 INFO conf.HiveConf: Using the default value passed in for log id: dbf3aaff-4a20-426b-bc59-9117e821a2f5
19/03/13 14:47:00 INFO session.SessionState: Updating thread name to dbf3aaff-4a20-426b-bc59-9117e821a2f5 main
19/03/13 14:47:00 INFO ql.Driver: Compiling command(queryId=root_20190313064700_5af88364-6217-429d-90a0-1816e54f44d9):
LOAD DATA INPATH 'hdfs://192.168.1.66:9000/user/root/INR_LAS' INTO TABLE `oracle`.`INR_LAS`
19/03/13 14:47:01 INFO ql.Driver: Semantic Analysis Completed
19/03/13 14:47:01 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
19/03/13 14:47:01 INFO ql.Driver: Completed compiling command(queryId=root_20190313064700_5af88364-6217-429d-90a0-1816e54f44d9); Time taken: 0.443 seconds
19/03/13 14:47:01 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager
19/03/13 14:47:01 INFO ql.Driver: Executing command(queryId=root_20190313064700_5af88364-6217-429d-90a0-1816e54f44d9):
LOAD DATA INPATH 'hdfs://192.168.1.66:9000/user/root/INR_LAS' INTO TABLE `oracle`.`INR_LAS`
19/03/13 14:47:01 INFO ql.Driver: Starting task [Stage-0:MOVE] in serial mode
19/03/13 14:47:01 INFO hive.metastore: Closed a connection to metastore, current connections: 0
Loading data to table oracle.inr_las
19/03/13 14:47:01 INFO exec.Task: Loading data to table oracle.inr_las from hdfs://192.168.1.66:9000/user/root/INR_LAS
19/03/13 14:47:01 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.1.66:9083
19/03/13 14:47:01 INFO hive.metastore: Opened a connection to metastore, current connections: 1
19/03/13 14:47:01 INFO hive.metastore: Connected to metastore.
19/03/13 14:47:01 ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
19/03/13 14:47:02 INFO ql.Driver: Starting task [Stage-1:STATS] in serial mode
19/03/13 14:47:02 INFO exec.StatsTask: Executing stats task
19/03/13 14:47:02 INFO hive.metastore: Closed a connection to metastore, current connections: 0
19/03/13 14:47:02 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.1.66:9083
19/03/13 14:47:02 INFO hive.metastore: Opened a connection to metastore, current connections: 1
19/03/13 14:47:02 INFO hive.metastore: Connected to metastore.
19/03/13 14:47:02 INFO hive.metastore: Closed a connection to metastore, current connections: 0
19/03/13 14:47:02 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.1.66:9083
19/03/13 14:47:02 INFO hive.metastore: Opened a connection to metastore, current connections: 1
19/03/13 14:47:02 INFO hive.metastore: Connected to metastore.
19/03/13 14:47:02 INFO exec.StatsTask: Table oracle.inr_las stats: [numFiles=2, numRows=0, totalSize=360, rawDataSize=0]
19/03/13 14:47:02 INFO ql.Driver: Completed executing command(queryId=root_20190313064700_5af88364-6217-429d-90a0-1816e54f44d9); Time taken: 1.211 seconds
OK
19/03/13 14:47:02 INFO ql.Driver: OK
Time taken: 1.654 seconds
19/03/13 14:47:02 INFO CliDriver: Time taken: 1.654 seconds
19/03/13 14:47:02 INFO conf.HiveConf: Using the default value passed in for log id: dbf3aaff-4a20-426b-bc59-9117e821a2f5
19/03/13 14:47:02 INFO session.SessionState: Resetting thread name to main
19/03/13 14:47:02 INFO conf.HiveConf: Using the default value passed in for log id: dbf3aaff-4a20-426b-bc59-9117e821a2f5
19/03/13 14:47:02 INFO session.SessionState: Deleted directory: /tmp/hive/root/dbf3aaff-4a20-426b-bc59-9117e821a2f5 on fs with scheme hdfs
19/03/13 14:47:02 INFO session.SessionState: Deleted directory: /hadoop/hive/tmp/root/dbf3aaff-4a20-426b-bc59-9117e821a2f5 on fs with scheme file
19/03/13 14:47:02 INFO hive.metastore: Closed a connection to metastore, current connections: 0
19/03/13 14:47:02 INFO hive.HiveImport: Hive import complete.
19/03/13 14:47:02 INFO hive.HiveImport: Export directory is empty, removing it.
19/03/13 14:47:02 INFO tool.ImportTool: Incremental import complete! To run another incremental import of all data following this import, supply the following arguments:
19/03/13 14:47:02 INFO tool.ImportTool: --incremental append
19/03/13 14:47:02 INFO tool.ImportTool: --check-column ETLTIME
19/03/13 14:47:02 INFO tool.ImportTool: --last-value 2019-03-20 10:52:34.0
19/03/13 14:47:02 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')
Query the hive table:
hive> select * from inr_las;
OK
1 er CLERK 800.0 2019-03-20 10:42:27.0
2 ALLEN SALESMAN 1600.0 2019-03-20 10:42:27.0
3 WARD SALESMAN 1250.0 2019-03-20 10:42:27.0
4 JONES MANAGER 2975.0 2019-03-20 10:42:27.0
5 MARTIN SALESMAN 1250.0 2019-03-20 10:42:27.0
6 zhao DBA 100.0 2019-03-20 10:42:27.0
7 yan BI 100.0 2019-03-20 10:42:27.0
8 dong JAVA 100.0 2019-03-20 10:42:27.0
6 zhao DBA 1000.0 2019-03-20 10:52:34.0
Time taken: 0.171 seconds, Fetched: 9 row(s)
From the query results above we can see that after empno=6's salary and etltime changed, the incremental import took the maximum timestamp from the initial full load as its starting point, queried the source Oracle table, found the newly changed row, and pulled its latest state into hive. Because the append method was used, hive now holds two records for that employee, i.e. duplicated data; to get the current state you have to keep only the row with the most recent etltime per key.
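The "keep the latest state per key" step just described can be sketched as follows. This is an illustrative Python sketch, not part of the Sqoop run: the sample tuples mirror the duplicated empno=6 rows in the hive output above, and in Hive itself the same dedup would typically be done with a row_number() window query partitioned by empno and ordered by etltime descending.

```python
# Keep only the newest row per empno, using etltime as the version column.
# Sample rows mirror the duplicated hive output shown above (illustrative only).
rows = [
    (6, "zhao", "DBA", 100.0,  "2019-03-20 10:42:27.0"),  # stale state
    (6, "zhao", "DBA", 1000.0, "2019-03-20 10:52:34.0"),  # latest state
    (7, "yan",  "BI",  100.0,  "2019-03-20 10:42:27.0"),
]

latest = {}
for row in rows:
    empno, etltime = row[0], row[4]
    # keep the row with the greatest etltime for each empno
    if empno not in latest or etltime > latest[empno][4]:
        latest[empno] = row

deduped = sorted(latest.values())
print(deduped)
```

The string comparison on etltime works here because the timestamps share a fixed `YYYY-MM-DD HH:MM:SS` format, which sorts lexicographically in time order.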
Experiment 2: merge mode
Continuing in the same environment, this time we use merge mode and see the effect:
--First check the current data in the source Oracle table:
EMPNO ENAME JOB SAL ETLTIME
1 er CLERK 800.00 2019/3/20 10:42:27
2 ALLEN SALESMAN 1600.00 2019/3/20 10:42:27
3 WARD SALESMAN 1250.00 2019/3/20 10:42:27
4 JONES MANAGER 2975.00 2019/3/20 10:42:27
5 MARTIN SALESMAN 1250.00 2019/3/20 10:42:27
6 zhao DBA 1000.00 2019/3/20 10:52:34
7 yan BI 100.00 2019/3/20 10:42:27
8 dong JAVA 200.00 2019/3/21 17:12:46
First drop the previous hive table:
hive> drop table inr_las;
OK
Time taken: 0.195 seconds
Recreate it as an external table (note the EXTERNAL keyword; without it the LOCATION clause would still create a managed table):
hive> create external table INR_LAS
(
empno int,
ename string,
job string,
sal float,
etltime string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
location '/user/hive/warehouse/exter_inr_las';
OK
Time taken: 0.226 seconds
Note: the /user/hive/warehouse/exter_inr_las directory must not exist at the time of the first full load; sqoop will create it itself. If it already exists, the import fails with a directory-already-exists error:
ERROR tool.ImportTool: Import failed: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://192.168.1.66:9000/user/hive/warehouse/exter_inr_las already exists
In that case, delete the directory first:
[root@hadoop ~]# hadoop fs -rmr /user/hive/warehouse/exter_inr_las
rmr: DEPRECATED: Please use 'rm -r' instead.
19/03/13 22:05:33 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hive/warehouse/exter_inr_las
Next, run a full import once:
[root@hadoop ~]# sqoop import --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl --username scott --password tiger --table INR_LAS -m 1 --target-dir /user/hive/warehouse/exter_inr_las --fields-terminated-by '\t'
Warning: /hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/03/13 22:05:48 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
19/03/13 22:05:48 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/03/13 22:05:48 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
19/03/13 22:05:48 INFO manager.SqlManager: Using default fetchSize of 1000
19/03/13 22:05:48 INFO tool.CodeGenTool: Beginning code generation
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hbase/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/03/13 22:05:49 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 22:05:49 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM INR_LAS t WHERE 1=0
19/03/13 22:05:49 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /hadoop
Note: /tmp/sqoop-root/compile/c8b2ed3172295709d819d17ca24aaf50/INR_LAS.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/03/13 22:05:52 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/c8b2ed3172295709d819d17ca24aaf50/INR_LAS.jar
19/03/13 22:05:52 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 22:05:52 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 22:05:52 INFO mapreduce.ImportJobBase: Beginning import of INR_LAS
19/03/13 22:05:52 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/03/13 22:05:52 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 22:05:53 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/03/13 22:05:54 INFO client.RMProxy: Connecting to ResourceManager at /192.168.1.66:8032
19/03/13 22:05:57 INFO db.DBInputFormat: Using read commited transaction isolation
19/03/13 22:05:57 INFO mapreduce.JobSubmitter: number of splits:1
19/03/13 22:05:58 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1552482402053_0006
19/03/13 22:05:58 INFO impl.YarnClientImpl: Submitted application application_1552482402053_0006
19/03/13 22:05:58 INFO mapreduce.Job: The url to track the job: http://hadoop:8088/proxy/application_1552482402053_0006/
19/03/13 22:05:58 INFO mapreduce.Job: Running job: job_1552482402053_0006
19/03/13 22:06:07 INFO mapreduce.Job: Job job_1552482402053_0006 running in uber mode : false
19/03/13 22:06:07 INFO mapreduce.Job: map 0% reduce 0%
19/03/13 22:06:13 INFO mapreduce.Job: map 100% reduce 0%
19/03/13 22:06:15 INFO mapreduce.Job: Job job_1552482402053_0006 completed successfully
19/03/13 22:06:15 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=144058
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=323
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=4115
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=4115
Total vcore-milliseconds taken by all map tasks=4115
Total megabyte-milliseconds taken by all map tasks=4213760
Map-Reduce Framework
Map input records=8
Map output records=8
Input split bytes=87
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=109
CPU time spent (ms)=2220
Physical memory (bytes) snapshot=187392000
Virtual memory (bytes) snapshot=2140803072
Total committed heap usage (bytes)=106430464
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=323
19/03/13 22:06:15 INFO mapreduce.ImportJobBase: Transferred 323 bytes in 21.3756 seconds (15.1107 bytes/sec)
19/03/13 22:06:15 INFO mapreduce.ImportJobBase: Retrieved 8 records.
Check the files under this hdfs directory:
[root@hadoop ~]# hdfs dfs -cat /user/hive/warehouse/exter_inr_las/part-m-00000
1 er CLERK 800 2019-03-20 10:42:27.0
2 ALLEN SALESMAN 1600 2019-03-20 10:42:27.0
3 WARD SALESMAN 1250 2019-03-20 10:42:27.0
4 JONES MANAGER 2975 2019-03-20 10:42:27.0
5 MARTIN SALESMAN 1250 2019-03-20 10:42:27.0
6 zhao DBA 1000 2019-03-20 10:52:34.0
7 yan BI 100 2019-03-20 10:42:27.0
8 dong JAVA 200 2019-03-21 17:12:46.0
Query the hive table:
hive> select * from inr_las;
OK
1 er CLERK 800.0 2019-03-20 10:42:27.0
2 ALLEN SALESMAN 1600.0 2019-03-20 10:42:27.0
3 WARD SALESMAN 1250.0 2019-03-20 10:42:27.0
4 JONES MANAGER 2975.0 2019-03-20 10:42:27.0
5 MARTIN SALESMAN 1250.0 2019-03-20 10:42:27.0
6 zhao DBA 1000.0 2019-03-20 10:52:34.0
7 yan BI 100.0 2019-03-20 10:42:27.0
8 dong JAVA 200.0 2019-03-21 17:12:46.0
Time taken: 0.191 seconds, Fetched: 8 row(s)
Next, modify the Oracle data:
update inr_las set sal=400 ,etltime=sysdate where empno=8;
commit;
select * from inr_las;
EMPNO ENAME JOB SAL ETLTIME
1 er CLERK 800.00 2019/3/20 10:42:27
2 ALLEN SALESMAN 1600.00 2019/3/20 10:42:27
3 WARD SALESMAN 1250.00 2019/3/20 10:42:27
4 JONES MANAGER 2975.00 2019/3/20 10:42:27
5 MARTIN SALESMAN 1250.00 2019/3/20 10:42:27
6 zhao DBA 1000.00 2019/3/20 10:52:34
7 yan BI 100.00 2019/3/20 10:42:27
8 dong JAVA 400.00 2019/3/21 17:47:03 --already updated
Next, run the merge-mode incremental import:
[root@hadoop ~]# sqoop import --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl --username scott --password tiger --table INR_LAS --fields-terminated-by '\t' --lines-terminated-by '\n' --target-dir /user/hive/warehouse/exter_inr_las -m 1 --check-column ETLTIME --incremental lastmodified --merge-key EMPNO --last-value "2019-03-21 17:12:46"
Warning: /hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/03/13 22:18:41 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
19/03/13 22:18:42 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/03/13 22:18:42 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
19/03/13 22:18:42 INFO manager.SqlManager: Using default fetchSize of 1000
19/03/13 22:18:42 INFO tool.CodeGenTool: Beginning code generation
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hbase/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/03/13 22:18:43 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 22:18:43 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM INR_LAS t WHERE 1=0
19/03/13 22:18:43 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /hadoop
Note: /tmp/sqoop-root/compile/d4af8fb9c2b8dd33c20926713e8d23e2/INR_LAS.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/03/13 22:18:47 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/d4af8fb9c2b8dd33c20926713e8d23e2/INR_LAS.jar
19/03/13 22:18:47 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 22:18:47 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM INR_LAS t WHERE 1=0
19/03/13 22:18:47 INFO tool.ImportTool: Incremental import based on column ETLTIME
19/03/13 22:18:47 INFO tool.ImportTool: Lower bound value: TO_TIMESTAMP('2019-03-21 17:12:46', 'YYYY-MM-DD HH24:MI:SS.FF')
19/03/13 22:18:47 INFO tool.ImportTool: Upper bound value: TO_TIMESTAMP('2019-03-21 17:54:19.0', 'YYYY-MM-DD HH24:MI:SS.FF')
19/03/13 22:18:47 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 22:18:47 INFO mapreduce.ImportJobBase: Beginning import of INR_LAS
19/03/13 22:18:47 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/03/13 22:18:47 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 22:18:48 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/03/13 22:18:48 INFO client.RMProxy: Connecting to ResourceManager at /192.168.1.66:8032
19/03/13 22:18:52 INFO db.DBInputFormat: Using read commited transaction isolation
19/03/13 22:18:52 INFO mapreduce.JobSubmitter: number of splits:1
19/03/13 22:18:52 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1552482402053_0009
19/03/13 22:18:53 INFO impl.YarnClientImpl: Submitted application application_1552482402053_0009
19/03/13 22:18:53 INFO mapreduce.Job: The url to track the job: http://hadoop:8088/proxy/application_1552482402053_0009/
19/03/13 22:18:53 INFO mapreduce.Job: Running job: job_1552482402053_0009
19/03/13 22:19:02 INFO mapreduce.Job: Job job_1552482402053_0009 running in uber mode : false
19/03/13 22:19:02 INFO mapreduce.Job: map 0% reduce 0%
19/03/13 22:19:09 INFO mapreduce.Job: map 100% reduce 0%
19/03/13 22:19:10 INFO mapreduce.Job: Job job_1552482402053_0009 completed successfully
19/03/13 22:19:10 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=144379
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=38
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=4767
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=4767
Total vcore-milliseconds taken by all map tasks=4767
Total megabyte-milliseconds taken by all map tasks=4881408
Map-Reduce Framework
Map input records=1
Map output records=1
Input split bytes=87
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=414
CPU time spent (ms)=2360
Physical memory (bytes) snapshot=189968384
Virtual memory (bytes) snapshot=2140639232
Total committed heap usage (bytes)=117440512
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=38
19/03/13 22:19:10 INFO mapreduce.ImportJobBase: Transferred 38 bytes in 22.4022 seconds (1.6963 bytes/sec)
19/03/13 22:19:11 INFO mapreduce.ImportJobBase: Retrieved 1 records.
19/03/13 22:19:11 INFO tool.ImportTool: Final destination exists, will run merge job.
19/03/13 22:19:11 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
19/03/13 22:19:11 INFO client.RMProxy: Connecting to ResourceManager at /192.168.1.66:8032
19/03/13 22:19:14 INFO input.FileInputFormat: Total input paths to process : 2
19/03/13 22:19:14 INFO mapreduce.JobSubmitter: number of splits:2
19/03/13 22:19:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1552482402053_0010
19/03/13 22:19:14 INFO impl.YarnClientImpl: Submitted application application_1552482402053_0010
19/03/13 22:19:14 INFO mapreduce.Job: The url to track the job: http://hadoop:8088/proxy/application_1552482402053_0010/
19/03/13 22:19:14 INFO mapreduce.Job: Running job: job_1552482402053_0010
19/03/13 22:19:25 INFO mapreduce.Job: Job job_1552482402053_0010 running in uber mode : false
19/03/13 22:19:25 INFO mapreduce.Job: map 0% reduce 0%
19/03/13 22:19:33 INFO mapreduce.Job: map 100% reduce 0%
19/03/13 22:19:40 INFO mapreduce.Job: map 100% reduce 100%
19/03/13 22:19:40 INFO mapreduce.Job: Job job_1552482402053_0010 completed successfully
19/03/13 22:19:40 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=614
FILE: Number of bytes written=434631
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=657
HDFS: Number of bytes written=323
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=9137
Total time spent by all reduces in occupied slots (ms)=4019
Total time spent by all map tasks (ms)=9137
Total time spent by all reduce tasks (ms)=4019
Total vcore-milliseconds taken by all map tasks=9137
Total vcore-milliseconds taken by all reduce tasks=4019
Total megabyte-milliseconds taken by all map tasks=9356288
Total megabyte-milliseconds taken by all reduce tasks=4115456
Map-Reduce Framework
Map input records=9
Map output records=9
Map output bytes=590
Map output materialized bytes=620
Input split bytes=296
Combine input records=0
Combine output records=0
Reduce input groups=8
Reduce shuffle bytes=620
Reduce input records=9
Reduce output records=8
Spilled Records=18
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=503
CPU time spent (ms)=3680
Physical memory (bytes) snapshot=704909312
Virtual memory (bytes) snapshot=6395523072
Total committed heap usage (bytes)=517996544
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=361
File Output Format Counters
Bytes Written=323
19/03/13 22:19:40 INFO tool.ImportTool: Incremental import complete! To run another incremental import of all data following this import, supply the following arguments:
19/03/13 22:19:40 INFO tool.ImportTool: --incremental lastmodified
19/03/13 22:19:40 INFO tool.ImportTool: --check-column ETLTIME
19/03/13 22:19:40 INFO tool.ImportTool: --last-value 2019-03-21 17:54:19.0
19/03/13 22:19:40 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')
Now look at the contents of /user/hive/warehouse/exter_inr_las/ again: part-m-00000 has become part-r-00000, which means a reduce (merge) phase ran:
[root@hadoop ~]# hdfs dfs -cat /user/hive/warehouse/exter_inr_las/part-r-00000
1 er CLERK 800 2019-03-20 10:42:27.0
2 ALLEN SALESMAN 1600 2019-03-20 10:42:27.0
3 WARD SALESMAN 1250 2019-03-20 10:42:27.0
4 JONES MANAGER 2975 2019-03-20 10:42:27.0
5 MARTIN SALESMAN 1250 2019-03-20 10:42:27.0
6 zhao DBA 1000 2019-03-20 10:52:34.0
7 yan BI 100 2019-03-20 10:42:27.0
8 dong JAVA 400 2019-03-21 17:47:03.0
The empno=8 record has indeed been updated, so the incremental merge succeeded. Check the hive table:
hive> select * from inr_las;
OK
1 er CLERK 800.0 2019-03-20 10:42:27.0
2 ALLEN SALESMAN 1600.0 2019-03-20 10:42:27.0
3 WARD SALESMAN 1250.0 2019-03-20 10:42:27.0
4 JONES MANAGER 2975.0 2019-03-20 10:42:27.0
5 MARTIN SALESMAN 1250.0 2019-03-20 10:42:27.0
6 zhao DBA 1000.0 2019-03-20 10:52:34.0
7 yan BI 100.0 2019-03-20 10:42:27.0
8 dong JAVA 400.0 2019-03-21 17:47:03.0
Time taken: 0.196 seconds, Fetched: 8 row(s)
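The map-then-reduce merge the log shows above can be summarized as: group the old full extract and the new incremental extract by the --merge-key column, and for each key emit a single record, preferring the newer one. The following is a simplified, illustrative Python sketch of that reduce step (the sample data mirrors the empno=8 row; this is not Sqoop's actual implementation):

```python
# Simplified sketch of Sqoop's merge job with --merge-key EMPNO:
# records from the old and new extracts are grouped by the merge key,
# and per key only the newest record (larger ETLTIME here) is kept.
old_extract = {
    7: ("yan",  "BI",   100.0, "2019-03-20 10:42:27.0"),
    8: ("dong", "JAVA", 200.0, "2019-03-21 17:12:46.0"),
}
new_extract = {
    8: ("dong", "JAVA", 400.0, "2019-03-21 17:47:03.0"),
}

merged = dict(old_extract)
for empno, rec in new_extract.items():
    # the incremental record wins when it is newer, or when the key is new
    if empno not in merged or rec[3] > merged[empno][3]:
        merged[empno] = rec

print(merged)
```

This is why the output file went from part-m-00000 to part-r-00000: the merge requires a reduce phase keyed on EMPNO, whereas a plain append import is map-only.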
No problems. Due to length constraints, using sqoop job and scheduling the incremental script are covered in the next article, linked here:
Sqoop job scripts for automatic incremental imports without entering a password