Sqoop: Password-Free Automated Incremental Job Scripts

The previous article covered incremental synchronization of data into Hive with Sqoop, and also linked to my pseudo-distributed installation and usage guide for hadoop+hive+hbase+sqoop+kylin as well as the incremental-sync walkthrough. Previous article: Sqoop incremental sync of MySQL/Oracle data to Hive (merge-key/append) test document.
This article shows how to combine that incremental approach with Sqoop's built-in job mechanism, a shell script, and crontab to automate incremental synchronization.

1. Background

sqoop job --help
usage: sqoop job [GENERIC-ARGS] [JOB-ARGS] [-- [<tool-name>] [TOOL-ARGS]]
Job management arguments:
   --create <job-id> Create a new saved job
   --delete <job-id> Delete a saved job
   --exec <job-id> Run a saved job
   --help Print usage instructions
   --list List saved jobs
   --meta-connect <jdbc-uri> Specify JDBC connect string for the
                                metastore
   --show <job-id> Show the parameters for a saved job
   --verbose Print more information while working

Generic Hadoop command-line arguments:
(must preceed any tool-specific arguments)
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|resourcemanager:port> specify a ResourceManager
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
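
To summarize: a saved job stores a complete import definition, including its incremental state (last-value), in Sqoop's local metastore, so the same import can be re-run and resumed by name. A minimal lifecycle sketch (the job name my_job is hypothetical; the connection details follow the environment used in this article):

sqoop job --create my_job -- import --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl --username scott --table INR_LAS -m 1   # save the import definition
sqoop job --list              # list saved jobs
sqoop job --show my_job       # inspect saved options, including the stored last-value
sqoop job --exec my_job       # run it; on success the last-value is updated in the metastore
sqoop job --delete my_job     # remove it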

2. Detailed Walkthrough

Let's first walk through creating and executing a timestamp-based incremental append job, and then look at the merge-key approach.
1. First, create an incremental append job:

[root@hadoop bin]# sqoop job --create inc_job -- import --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl --username scott --password tiger --table INR_LAS --fields-terminated-by '\t' --lines-terminated-by '\n' --hive-import --hive-database oracle --hive-table INR_LAS --incremental append --check-column ETLTIME --last-value '2019-03-20 14:49:19' -m 1 --null-string '\\N' --null-non-string '\\N'
19/03/13 18:12:37 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
19/03/13 18:12:38 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
Exception in thread "main" java.lang.NoClassDefFoundError: org/json/JSONObject
	at org.apache.sqoop.util.SqoopJsonUtil.getJsonStringforMap(SqoopJsonUtil.java:43)
	at org.apache.sqoop.SqoopOptions.writeProperties(SqoopOptions.java:785)
	at org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage.createInternal(HsqldbJobStorage.java:399)
	at org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage.create(HsqldbJobStorage.java:379)
	at org.apache.sqoop.tool.JobTool.createJob(JobTool.java:181)
	at org.apache.sqoop.tool.JobTool.run(JobTool.java:294)
	at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
	at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
	at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
	at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
Caused by: java.lang.ClassNotFoundException: org.json.JSONObject
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

It errored out: Sqoop is missing the java-json.jar package. Download the missing jar and upload it to $SQOOP_HOME/lib:

Download the jar here
Put the downloaded jar under $SQOOP_HOME/lib, then recreate the job.
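
For example (a sketch; the local path of the downloaded jar is hypothetical):

[root@hadoop ~]# cp /root/java-json.jar $SQOOP_HOME/lib/
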
First, delete the job whose creation failed:

[root@hadoop ~]# sqoop job --delete inc_job
Warning: /hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/03/13 18:40:18 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hbase/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

Create the job:

[root@hadoop ~]# sqoop job --create inc_job -- import --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl --username "scott" --password "tiger" --table INR_LAS --fields-terminated-by '\t' --lines-terminated-by '\n' --hive-import --hive-database oracle --hive-table INR_LAS --incremental append --check-column ETLTIME --last-value '2019-03-20 14:49:19' -m 1 --null-string '\\N' --null-non-string '\\N'
Warning: /hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/03/13 18:40:26 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hbase/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/03/13 18:40:26 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

List the job we just created:

[root@hadoop ~]# sqoop job --list 
Warning: /hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/03/13 18:41:20 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hbase/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Available jobs:
  inc_job

Show the last_value saved by the job we just created:

[root@hadoop ~]# sqoop job --show inc_job
Warning: /hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/03/13 18:45:00 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hbase/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Enter password: ------- what you enter here is the OS password of the system user running the job; at execution time, by contrast, you enter the password of the database user
Job: inc_job
Tool: import
Options:
----------------------------
verbose = false
hcatalog.drop.and.create.table = false
incremental.last.value = 2019-03-20 14:49:19
db.connect.string = jdbc:oracle:thin:@192.168.1.6:1521:orcl
codegen.output.delimiters.escape = 0
codegen.output.delimiters.enclose.required = false
codegen.input.delimiters.field = 0
mainframe.input.dataset.type = p
hbase.create.table = false
split.limit = null
null.string = \\N
db.require.password = true
skip.dist.cache = false
hdfs.append.dir = true
db.table = INR_LAS
codegen.input.delimiters.escape = 0
accumulo.create.table = false
import.fetch.size = null
codegen.input.delimiters.enclose.required = false
db.username = scott
reset.onemapper = false
codegen.output.delimiters.record = 10
import.max.inline.lob.size = 16777216
sqoop.throwOnError = false
hbase.bulk.load.enabled = false
hcatalog.create.table = false
db.clear.staging.table = false
incremental.col = ETLTIME
codegen.input.delimiters.record = 0
enable.compression = false
hive.overwrite.table = false
hive.import = true
codegen.input.delimiters.enclose = 0
hive.table.name = INR_LAS
accumulo.batch.size = 10240000
hive.database.name = oracle
hive.drop.delims = false
customtool.options.jsonmap = {}
null.non-string = \\N
codegen.output.delimiters.enclose = 0
hdfs.delete-target.dir = false
codegen.output.dir = .
codegen.auto.compile.dir = true
relaxed.isolation = false
mapreduce.num.mappers = 1
accumulo.max.latency = 5000
import.direct.split.size = 0
sqlconnection.metadata.transaction.isolation.level = 2
codegen.output.delimiters.field = 9
export.new.update = UpdateOnly
incremental.mode = AppendRows
hdfs.file.format = TextFile
sqoop.oracle.escaping.disabled = true
codegen.compile.dir = /tmp/sqoop-root/compile/1173d716481c4bd8f6cb589b87a382ea
direct.import = false
temporary.dirRoot = _sqoop
hive.fail.table.exists = false
db.batch = false

Next, execute it manually:

[root@hadoop ~]# sqoop job --exec inc_job
Warning: /hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/03/13 18:47:46 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hbase/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Enter password: --------- here you enter the password of the database user. How do we make this password-free? Read on.
19/03/13 18:47:50 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
19/03/13 18:47:50 INFO manager.SqlManager: Using default fetchSize of 1000
19/03/13 18:47:50 INFO tool.CodeGenTool: Beginning code generation
19/03/13 18:47:51 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 18:47:51 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM INR_LAS t WHERE 1=0
19/03/13 18:47:51 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /hadoop
Note: /tmp/sqoop-root/compile/f383a7cc7d1bc4f9665748405ec5dec2/INR_LAS.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/03/13 18:47:55 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/f383a7cc7d1bc4f9665748405ec5dec2/INR_LAS.jar
19/03/13 18:47:55 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 18:47:55 INFO tool.ImportTool: Maximal id query for free form incremental import: SELECT MAX(ETLTIME) FROM INR_LAS
19/03/13 18:47:55 INFO tool.ImportTool: Incremental import based on column ETLTIME
19/03/13 18:47:55 INFO tool.ImportTool: Lower bound value: TO_TIMESTAMP('2019-03-20 14:49:19', 'YYYY-MM-DD HH24:MI:SS.FF')
19/03/13 18:47:55 INFO tool.ImportTool: Upper bound value: TO_TIMESTAMP('2019-03-20 15:36:07.0', 'YYYY-MM-DD HH24:MI:SS.FF')
19/03/13 18:47:55 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 18:47:55 INFO mapreduce.ImportJobBase: Beginning import of INR_LAS
19/03/13 18:47:55 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/03/13 18:47:55 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 18:47:56 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/03/13 18:47:57 INFO client.RMProxy: Connecting to ResourceManager at /192.168.1.66:8032
19/03/13 18:48:00 INFO db.DBInputFormat: Using read commited transaction isolation
19/03/13 18:48:00 INFO mapreduce.JobSubmitter: number of splits:1
19/03/13 18:48:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1552469242276_0016
19/03/13 18:48:01 INFO impl.YarnClientImpl: Submitted application application_1552469242276_0016
19/03/13 18:48:01 INFO mapreduce.Job: The url to track the job: http://hadoop:8088/proxy/application_1552469242276_0016/
19/03/13 18:48:01 INFO mapreduce.Job: Running job: job_1552469242276_0016
19/03/13 18:48:11 INFO mapreduce.Job: Job job_1552469242276_0016 running in uber mode : false
19/03/13 18:48:11 INFO mapreduce.Job:  map 0% reduce 0%
19/03/13 18:48:18 INFO mapreduce.Job:  map 100% reduce 0%
19/03/13 18:48:18 INFO mapreduce.Job: Job job_1552469242276_0016 completed successfully
19/03/13 18:48:19 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=144628
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=87
		HDFS: Number of bytes written=39
		HDFS: Number of read operations=4
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=1
		Other local map tasks=1
		Total time spent by all maps in occupied slots (ms)=4454
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=4454
		Total vcore-milliseconds taken by all map tasks=4454
		Total megabyte-milliseconds taken by all map tasks=4560896
	Map-Reduce Framework
		Map input records=1
		Map output records=1
		Input split bytes=87
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=229
		CPU time spent (ms)=2430
		Physical memory (bytes) snapshot=191975424
		Virtual memory (bytes) snapshot=2143756288
		Total committed heap usage (bytes)=116916224
	File Input Format Counters 
		Bytes Read=0
	File Output Format Counters 
		Bytes Written=39
19/03/13 18:48:19 INFO mapreduce.ImportJobBase: Transferred 39 bytes in 22.3135 seconds (1.7478 bytes/sec)
19/03/13 18:48:19 INFO mapreduce.ImportJobBase: Retrieved 1 records.
19/03/13 18:48:19 INFO mapreduce.ImportJobBase: Publishing Hive/Hcat import job data to Listeners for table INR_LAS
19/03/13 18:48:19 INFO util.AppendUtils: Creating missing output directory - INR_LAS
19/03/13 18:48:19 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 18:48:19 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM INR_LAS t WHERE 1=0
19/03/13 18:48:19 WARN hive.TableDefWriter: Column EMPNO had to be cast to a less precise type in Hive
19/03/13 18:48:19 WARN hive.TableDefWriter: Column SAL had to be cast to a less precise type in Hive
19/03/13 18:48:19 WARN hive.TableDefWriter: Column ETLTIME had to be cast to a less precise type in Hive
19/03/13 18:48:19 INFO hive.HiveImport: Loading uploaded data into Hive
19/03/13 18:48:19 INFO conf.HiveConf: Found configuration file file:/hadoop/hive/conf/hive-site.xml

Logging initialized using configuration in jar:file:/hadoop/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true
19/03/13 18:48:22 INFO SessionState: 
Logging initialized using configuration in jar:file:/hadoop/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true
19/03/13 18:48:22 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/e09a2f96-2edd-4747-a65f-4899c2863aa0
19/03/13 18:48:22 INFO session.SessionState: Created local directory: /hadoop/hive/tmp/root/e09a2f96-2edd-4747-a65f-4899c2863aa0
19/03/13 18:48:22 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/e09a2f96-2edd-4747-a65f-4899c2863aa0/_tmp_space.db
19/03/13 18:48:22 INFO conf.HiveConf: Using the default value passed in for log id: e09a2f96-2edd-4747-a65f-4899c2863aa0
19/03/13 18:48:22 INFO session.SessionState: Updating thread name to e09a2f96-2edd-4747-a65f-4899c2863aa0 main
19/03/13 18:48:22 INFO conf.HiveConf: Using the default value passed in for log id: e09a2f96-2edd-4747-a65f-4899c2863aa0
19/03/13 18:48:22 INFO ql.Driver: Compiling command(queryId=root_20190313104822_91cdb575-b0c9-4533-916c-247304d39b46): CREATE TABLE IF NOT EXISTS `oracle`.`INR_LAS` ( `EMPNO` DOUBLE, `ENAME` STRING, `JOB` STRING, `SAL` DOUBLE, `ETLTIME` STRING) COMMENT 'Imported by sqoop on 2019/03/13 10:48:19' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\011' LINES TERMINATED BY '\012' STORED AS TEXTFILE
19/03/13 18:48:25 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.1.66:9083
19/03/13 18:48:25 INFO hive.metastore: Opened a connection to metastore, current connections: 1
19/03/13 18:48:25 INFO hive.metastore: Connected to metastore.
19/03/13 18:48:25 INFO parse.CalcitePlanner: Starting Semantic Analysis
19/03/13 18:48:25 INFO parse.CalcitePlanner: Creating table oracle.INR_LAS position=27
19/03/13 18:48:25 INFO ql.Driver: Semantic Analysis Completed
19/03/13 18:48:25 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
19/03/13 18:48:25 INFO ql.Driver: Completed compiling command(queryId=root_20190313104822_91cdb575-b0c9-4533-916c-247304d39b46); Time taken: 3.251 seconds
19/03/13 18:48:25 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager
19/03/13 18:48:25 INFO ql.Driver: Executing command(queryId=root_20190313104822_91cdb575-b0c9-4533-916c-247304d39b46): CREATE TABLE IF NOT EXISTS `oracle`.`INR_LAS` ( `EMPNO` DOUBLE, `ENAME` STRING, `JOB` STRING, `SAL` DOUBLE, `ETLTIME` STRING) COMMENT 'Imported by sqoop on 2019/03/13 10:48:19' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\011' LINES TERMINATED BY '\012' STORED AS TEXTFILE
19/03/13 18:48:25 INFO sqlstd.SQLStdHiveAccessController: Created SQLStdHiveAccessController for session context : HiveAuthzSessionContext [sessionString=e09a2f96-2edd-4747-a65f-4899c2863aa0, clientType=HIVECLI]
19/03/13 18:48:25 WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
19/03/13 18:48:25 INFO hive.metastore: Mestastore configuration hive.metastore.filter.hook changed from org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl to org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook
19/03/13 18:48:26 INFO hive.metastore: Closed a connection to metastore, current connections: 0
19/03/13 18:48:26 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.1.66:9083
19/03/13 18:48:26 INFO hive.metastore: Opened a connection to metastore, current connections: 1
19/03/13 18:48:26 INFO hive.metastore: Connected to metastore.
19/03/13 18:48:26 INFO ql.Driver: Completed executing command(queryId=root_20190313104822_91cdb575-b0c9-4533-916c-247304d39b46); Time taken: 0.113 seconds
OK
19/03/13 18:48:26 INFO ql.Driver: OK
Time taken: 3.379 seconds
19/03/13 18:48:26 INFO CliDriver: Time taken: 3.379 seconds
19/03/13 18:48:26 INFO conf.HiveConf: Using the default value passed in for log id: e09a2f96-2edd-4747-a65f-4899c2863aa0
19/03/13 18:48:26 INFO session.SessionState: Resetting thread name to  main
19/03/13 18:48:26 INFO conf.HiveConf: Using the default value passed in for log id: e09a2f96-2edd-4747-a65f-4899c2863aa0
19/03/13 18:48:26 INFO session.SessionState: Updating thread name to e09a2f96-2edd-4747-a65f-4899c2863aa0 main
19/03/13 18:48:26 INFO ql.Driver: Compiling command(queryId=root_20190313104826_5da0b171-d4e8-41c5-83ef-bdcffec0fea2): 
LOAD DATA INPATH 'hdfs://192.168.1.66:9000/user/root/INR_LAS' INTO TABLE `oracle`.`INR_LAS`
19/03/13 18:48:26 INFO ql.Driver: Semantic Analysis Completed
19/03/13 18:48:26 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
19/03/13 18:48:26 INFO ql.Driver: Completed compiling command(queryId=root_20190313104826_5da0b171-d4e8-41c5-83ef-bdcffec0fea2); Time taken: 0.426 seconds
19/03/13 18:48:26 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager
19/03/13 18:48:26 INFO ql.Driver: Executing command(queryId=root_20190313104826_5da0b171-d4e8-41c5-83ef-bdcffec0fea2): 
LOAD DATA INPATH 'hdfs://192.168.1.66:9000/user/root/INR_LAS' INTO TABLE `oracle`.`INR_LAS`
19/03/13 18:48:26 INFO ql.Driver: Starting task [Stage-0:MOVE] in serial mode
19/03/13 18:48:26 INFO hive.metastore: Closed a connection to metastore, current connections: 0
Loading data to table oracle.inr_las
19/03/13 18:48:26 INFO exec.Task: Loading data to table oracle.inr_las from hdfs://192.168.1.66:9000/user/root/INR_LAS
19/03/13 18:48:26 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.1.66:9083
19/03/13 18:48:26 INFO hive.metastore: Opened a connection to metastore, current connections: 1
19/03/13 18:48:26 INFO hive.metastore: Connected to metastore.
19/03/13 18:48:26 ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
19/03/13 18:48:27 INFO ql.Driver: Starting task [Stage-1:STATS] in serial mode
19/03/13 18:48:27 INFO exec.StatsTask: Executing stats task
19/03/13 18:48:27 INFO hive.metastore: Closed a connection to metastore, current connections: 0
19/03/13 18:48:27 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.1.66:9083
19/03/13 18:48:27 INFO hive.metastore: Opened a connection to metastore, current connections: 1
19/03/13 18:48:27 INFO hive.metastore: Connected to metastore.
19/03/13 18:48:27 INFO hive.metastore: Closed a connection to metastore, current connections: 0
19/03/13 18:48:27 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.1.66:9083
19/03/13 18:48:27 INFO hive.metastore: Opened a connection to metastore, current connections: 1
19/03/13 18:48:27 INFO hive.metastore: Connected to metastore.
19/03/13 18:48:27 INFO exec.StatsTask: Table oracle.inr_las stats: [numFiles=6, numRows=0, totalSize=518, rawDataSize=0]
19/03/13 18:48:27 INFO ql.Driver: Completed executing command(queryId=root_20190313104826_5da0b171-d4e8-41c5-83ef-bdcffec0fea2); Time taken: 1.225 seconds
OK
19/03/13 18:48:27 INFO ql.Driver: OK
Time taken: 1.653 seconds
19/03/13 18:48:27 INFO CliDriver: Time taken: 1.653 seconds
19/03/13 18:48:27 INFO conf.HiveConf: Using the default value passed in for log id: e09a2f96-2edd-4747-a65f-4899c2863aa0
19/03/13 18:48:27 INFO session.SessionState: Resetting thread name to  main
19/03/13 18:48:27 INFO conf.HiveConf: Using the default value passed in for log id: e09a2f96-2edd-4747-a65f-4899c2863aa0
19/03/13 18:48:27 INFO session.SessionState: Deleted directory: /tmp/hive/root/e09a2f96-2edd-4747-a65f-4899c2863aa0 on fs with scheme hdfs
19/03/13 18:48:27 INFO session.SessionState: Deleted directory: /hadoop/hive/tmp/root/e09a2f96-2edd-4747-a65f-4899c2863aa0 on fs with scheme file
19/03/13 18:48:27 INFO hive.metastore: Closed a connection to metastore, current connections: 0
19/03/13 18:48:27 INFO hive.HiveImport: Hive import complete.
19/03/13 18:48:27 INFO hive.HiveImport: Export directory is empty, removing it.
19/03/13 18:48:27 INFO tool.ImportTool: Saving incremental import state to the metastore
19/03/13 18:48:27 INFO tool.ImportTool: Updated data for job: inc_job

The experiment above shows that every execution of the job prompts for the database user's password. To make it password-free, do the following:
When creating the job, use the --password-file parameter rather than --password. The main reason is that --password produces a warning and the job still prompts for the password at execution time, whereas with --password-file the job runs without any database password prompt. So let's redefine the job created above.
First, drop the original job:

[root@hadoop conf]# sqoop job --delete inc_job

Create the password file.
Note: Sqoop requires the password file to live on HDFS with permissions set to 400. Use echo -n so no trailing newline ends up in the file (it would otherwise become part of the password):

[root@hadoop sqoop]# mkdir pwd
[root@hadoop sqoop]# cd pwd
[root@hadoop pwd]# pwd
/hadoop/sqoop/pwd
[root@hadoop pwd]# echo -n "tiger" > scott.pwd
[root@hadoop pwd]# hdfs dfs -put scott.pwd /user/hive/warehouse
[root@hadoop pwd]# hdfs dfs -chmod 400 /user/hive/warehouse/scott.pwd
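
As a quick sanity check (optional; same path as above), confirm the file landed with mode 400 (the listing should start with -r--------):

[root@hadoop pwd]# hdfs dfs -ls /user/hive/warehouse/scott.pwd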

Recreate the job, this time specifying --password-file instead of --password:

[root@hadoop conf]# sqoop job --create inc_job -- import --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl --username "scott" --password-file /user/hive/warehouse/scott.pwd --table INR_LAS --fields-terminated-by '\t' --lines-terminated-by '\n' --hive-import --hive-database oracle --hive-table INR_LAS --incremental append --check-column ETLTIME --last-value '2019-03-20 14:49:19' -m 1 --null-string '\\N' --null-non-string '\\N'

To verify, look at the current Oracle table:

select * from inr_las;
EMPNO   ENAME   JOB       SAL      ETLTIME
1       er      CLERK     800.00   2019/3/20 10:42:27
2       ALLEN   SALESMAN  1600.00  2019/3/20 10:42:27
3       WARD    SALESMAN  1250.00  2019/3/20 10:42:27
4       JONES   MANAGER   2975.00  2019/3/20 10:42:27
5       MARTIN  SALESMAN  1250.00  2019/3/20 10:42:27
6       zhao    DBA       1000.00  2019/3/20 10:52:34
7       yan     BI        100.00   2019/3/20 10:42:27
8       dong    JAVA      5232.00  2019/3/20 15:36:07

And the current Hive table data:

hive> select * from inr_las;
OK
1	er	CLERK	800.0	2019-03-20 10:42:27.0
2	ALLEN	SALESMAN	1600.0	2019-03-20 10:42:27.0
3	WARD	SALESMAN	1250.0	2019-03-20 10:42:27.0
4	JONES	MANAGER	2975.0	2019-03-20 10:42:27.0
5	MARTIN	SALESMAN	1250.0	2019-03-20 10:42:27.0
6	zhao	DBA	1000.0	2019-03-20 10:52:34.0
7	yan	BI	100.0	2019-03-20 10:42:27.0
8	dong	JAVA	332.0	2019-03-20 14:49:19.0
8	dong	JAVA	3232.0	2019-03-20 15:13:35.0
8	dong	JAVA	4232.0	2019-03-20 15:29:03.0
8	dong	JAVA	5232.0	2019-03-20 15:36:07.0
8	dong	JAVA	5232.0	2019-03-20 15:36:07.0
8	dong	JAVA	3232.0	2019-03-20 15:13:35.0
Time taken: 0.176 seconds, Fetched: 13 row(s)

Our job's incremental --last-value was set to '2019-03-20 14:49:19', and one source row (empno=8) qualifies for the increment. Now execute the newly created job again:

[root@hadoop pwd]# sqoop job --exec inc_job
Warning: /hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/03/13 19:14:30 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hbase/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/03/13 19:14:32 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
19/03/13 19:14:32 INFO manager.SqlManager: Using default fetchSize of 1000
19/03/13 19:14:32 INFO tool.CodeGenTool: Beginning code generation
19/03/13 19:14:33 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 19:14:33 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM INR_LAS t WHERE 1=0
19/03/13 19:14:33 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /hadoop
Note: /tmp/sqoop-root/compile/8df9a3027ead0f69733bef4c331c8f15/INR_LAS.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/03/13 19:14:38 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/8df9a3027ead0f69733bef4c331c8f15/INR_LAS.jar
19/03/13 19:14:38 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 19:14:38 INFO tool.ImportTool: Maximal id query for free form incremental import: SELECT MAX(ETLTIME) FROM INR_LAS
19/03/13 19:14:38 INFO tool.ImportTool: Incremental import based on column ETLTIME
19/03/13 19:14:38 INFO tool.ImportTool: Lower bound value: TO_TIMESTAMP('2019-03-20 14:49:19', 'YYYY-MM-DD HH24:MI:SS.FF')
19/03/13 19:14:38 INFO tool.ImportTool: Upper bound value: TO_TIMESTAMP('2019-03-20 15:36:07.0', 'YYYY-MM-DD HH24:MI:SS.FF')
19/03/13 19:14:38 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 19:14:38 INFO mapreduce.ImportJobBase: Beginning import of INR_LAS
19/03/13 19:14:38 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/03/13 19:14:38 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 19:14:38 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/03/13 19:14:38 INFO client.RMProxy: Connecting to ResourceManager at /192.168.1.66:8032
19/03/13 19:14:42 INFO db.DBInputFormat: Using read commited transaction isolation
19/03/13 19:14:42 INFO mapreduce.JobSubmitter: number of splits:1
19/03/13 19:14:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1552469242276_0017
19/03/13 19:14:42 INFO impl.YarnClientImpl: Submitted application application_1552469242276_0017
19/03/13 19:14:43 INFO mapreduce.Job: The url to track the job: http://hadoop:8088/proxy/application_1552469242276_0017/
19/03/13 19:14:43 INFO mapreduce.Job: Running job: job_1552469242276_0017
19/03/13 19:14:53 INFO mapreduce.Job: Job job_1552469242276_0017 running in uber mode : false
19/03/13 19:14:53 INFO mapreduce.Job:  map 0% reduce 0%
19/03/13 19:15:00 INFO mapreduce.Job:  map 100% reduce 0%
19/03/13 19:15:00 INFO mapreduce.Job: Job job_1552469242276_0017 completed successfully
19/03/13 19:15:00 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=144775
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=87
		HDFS: Number of bytes written=39
		HDFS: Number of read operations=4
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=1
		Other local map tasks=1
		Total time spent by all maps in occupied slots (ms)=5332
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=5332
		Total vcore-milliseconds taken by all map tasks=5332
		Total megabyte-milliseconds taken by all map tasks=5459968
	Map-Reduce Framework
		Map input records=1
		Map output records=1
		Input split bytes=87
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=651
		CPU time spent (ms)=2670
		Physical memory (bytes) snapshot=188571648
		Virtual memory (bytes) snapshot=2148745216
		Total committed heap usage (bytes)=119537664
	File Input Format Counters 
		Bytes Read=0
	File Output Format Counters 
		Bytes Written=39
19/03/13 19:15:00 INFO mapreduce.ImportJobBase: Transferred 39 bytes in 22.3081 seconds (1.7482 bytes/sec)
19/03/13 19:15:00 INFO mapreduce.ImportJobBase: Retrieved 1 records.
19/03/13 19:15:00 INFO mapreduce.ImportJobBase: Publishing Hive/Hcat import job data to Listeners for table INR_LAS
19/03/13 19:15:00 INFO util.AppendUtils: Creating missing output directory - INR_LAS
19/03/13 19:15:01 INFO manager.OracleManager: Time zone has been set to GMT
19/03/13 19:15:01 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM INR_LAS t WHERE 1=0
19/03/13 19:15:01 WARN hive.TableDefWriter: Column EMPNO had to be cast to a less precise type in Hive
19/03/13 19:15:01 WARN hive.TableDefWriter: Column SAL had to be cast to a less precise type in Hive
19/03/13 19:15:01 WARN hive.TableDefWriter: Column ETLTIME had to be cast to a less precise type in Hive
19/03/13 19:15:01 INFO hive.HiveImport: Loading uploaded data into Hive
19/03/13 19:15:01 INFO conf.HiveConf: Found configuration file file:/hadoop/hive/conf/hive-site.xml

Logging initialized using configuration in jar:file:/hadoop/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true
19/03/13 19:15:04 INFO SessionState: 
Logging initialized using configuration in jar:file:/hadoop/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true
19/03/13 19:15:04 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/7feac288-289d-4d74-8641-553c5ab65618
19/03/13 19:15:04 INFO session.SessionState: Created local directory: /hadoop/hive/tmp/root/7feac288-289d-4d74-8641-553c5ab65618
19/03/13 19:15:04 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/7feac288-289d-4d74-8641-553c5ab65618/_tmp_space.db
19/03/13 19:15:04 INFO conf.HiveConf: Using the default value passed in for log id: 7feac288-289d-4d74-8641-553c5ab65618
19/03/13 19:15:04 INFO session.SessionState: Updating thread name to 7feac288-289d-4d74-8641-553c5ab65618 main
19/03/13 19:15:04 INFO conf.HiveConf: Using the default value passed in for log id: 7feac288-289d-4d74-8641-553c5ab65618
19/03/13 19:15:04 INFO ql.Driver: Compiling command(queryId=root_20190313111504_d1db4a38-1b86-4c89-84c3-3d3be9404b0f): CREATE TABLE IF NOT EXISTS `oracle`.`INR_LAS` ( `EMPNO` DOUBLE, `ENAME` STRING, `JOB` STRING, `SAL` DOUBLE, `ETLTIME` STRING) COMMENT 'Imported by sqoop on 2019/03/13 11:15:01' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\011' LINES TERMINATED BY '\012' STORED AS TEXTFILE
19/03/13 19:15:09 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.1.66:9083
19/03/13 19:15:09 INFO hive.metastore: Opened a connection to metastore, current connections: 1
19/03/13 19:15:09 INFO hive.metastore: Connected to metastore.
19/03/13 19:15:09 INFO parse.CalcitePlanner: Starting Semantic Analysis
19/03/13 19:15:09 INFO parse.CalcitePlanner: Creating table oracle.INR_LAS position=27
19/03/13 19:15:09 INFO ql.Driver: Semantic Analysis Completed
19/03/13 19:15:09 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
19/03/13 19:15:09 INFO ql.Driver: Completed compiling command(queryId=root_20190313111504_d1db4a38-1b86-4c89-84c3-3d3be9404b0f); Time taken: 5.309 seconds
19/03/13 19:15:09 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager
19/03/13 19:15:09 INFO ql.Driver: Executing command(queryId=root_20190313111504_d1db4a38-1b86-4c89-84c3-3d3be9404b0f): CREATE TABLE IF NOT EXISTS `oracle`.`INR_LAS` ( `EMPNO` DOUBLE, `ENAME` STRING, `JOB` STRING, `SAL` DOUBLE, `ETLTIME` STRING) COMMENT 'Imported by sqoop on 2019/03/13 11:15:01' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\011' LINES TERMINATED BY '\012' STORED AS TEXTFILE
19/03/13 19:15:09 INFO sqlstd.SQLStdHiveAccessController: Created SQLStdHiveAccessController for session context : HiveAuthzSessionContext [sessionString=7feac288-289d-4d74-8641-553c5ab65618, clientType=HIVECLI]
19/03/13 19:15:09 WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
19/03/13 19:15:09 INFO hive.metastore: Mestastore configuration hive.metastore.filter.hook changed from org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl to org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook
19/03/13 19:15:10 INFO hive.metastore: Closed a connection to metastore, current connections: 0
19/03/13 19:15:10 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.1.66:9083
19/03/13 19:15:10 INFO hive.metastore: Opened a connection to metastore, current connections: 1
19/03/13 19:15:10 INFO hive.metastore: Connected to metastore.
19/03/13 19:15:10 INFO ql.Driver: Completed executing command(queryId=root_20190313111504_d1db4a38-1b86-4c89-84c3-3d3be9404b0f); Time taken: 0.106 seconds
OK
19/03/13 19:15:10 INFO ql.Driver: OK
Time taken: 5.429 seconds
19/03/13 19:15:10 INFO CliDriver: Time taken: 5.429 seconds
19/03/13 19:15:10 INFO conf.HiveConf: Using the default value passed in for log id: 7feac288-289d-4d74-8641-553c5ab65618
19/03/13 19:15:10 INFO session.SessionState: Resetting thread name to  main
19/03/13 19:15:10 INFO conf.HiveConf: Using the default value passed in for log id: 7feac288-289d-4d74-8641-553c5ab65618
19/03/13 19:15:10 INFO session.SessionState: Updating thread name to 7feac288-289d-4d74-8641-553c5ab65618 main
19/03/13 19:15:10 INFO ql.Driver: Compiling command(queryId=root_20190313111510_cd9c21cf-b479-475c-a959-be8ff0ac5f01): 
LOAD DATA INPATH 'hdfs://192.168.1.66:9000/user/root/INR_LAS' INTO TABLE `oracle`.`INR_LAS`
19/03/13 19:15:10 INFO ql.Driver: Semantic Analysis Completed
19/03/13 19:15:10 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
19/03/13 19:15:10 INFO ql.Driver: Completed compiling command(queryId=root_20190313111510_cd9c21cf-b479-475c-a959-be8ff0ac5f01); Time taken: 0.415 seconds
19/03/13 19:15:10 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager
19/03/13 19:15:10 INFO ql.Driver: Executing command(queryId=root_20190313111510_cd9c21cf-b479-475c-a959-be8ff0ac5f01): 
LOAD DATA INPATH 'hdfs://192.168.1.66:9000/user/root/INR_LAS' INTO TABLE `oracle`.`INR_LAS`
19/03/13 19:15:10 INFO ql.Driver: Starting task [Stage-0:MOVE] in serial mode
19/03/13 19:15:10 INFO hive.metastore: Closed a connection to metastore, current connections: 0
Loading data to table oracle.inr_las
19/03/13 19:15:10 INFO exec.Task: Loading data to table oracle.inr_las from hdfs://192.168.1.66:9000/user/root/INR_LAS
19/03/13 19:15:10 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.1.66:9083
19/03/13 19:15:10 INFO hive.metastore: Opened a connection to metastore, current connections: 1
19/03/13 19:15:10 INFO hive.metastore: Connected to metastore.
19/03/13 19:15:10 ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
19/03/13 19:15:11 INFO ql.Driver: Starting task [Stage-1:STATS] in serial mode
19/03/13 19:15:11 INFO exec.StatsTask: Executing stats task
19/03/13 19:15:11 INFO hive.metastore: Closed a connection to metastore, current connections: 0
19/03/13 19:15:11 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.1.66:9083
19/03/13 19:15:11 INFO hive.metastore: Opened a connection to metastore, current connections: 1
19/03/13 19:15:11 INFO hive.metastore: Connected to metastore.
19/03/13 19:15:11 INFO hive.metastore: Closed a connection to metastore, current connections: 0
19/03/13 19:15:11 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.1.66:9083
19/03/13 19:15:11 INFO hive.metastore: Opened a connection to metastore, current connections: 1
19/03/13 19:15:11 INFO hive.metastore: Connected to metastore.
19/03/13 19:15:11 INFO exec.StatsTask: Table oracle.inr_las stats: [numFiles=7, numRows=0, totalSize=557, rawDataSize=0]
19/03/13 19:15:11 INFO ql.Driver: Completed executing command(queryId=root_20190313111510_cd9c21cf-b479-475c-a959-be8ff0ac5f01); Time taken: 1.296 seconds
OK
19/03/13 19:15:11 INFO ql.Driver: OK
Time taken: 1.713 seconds
19/03/13 19:15:11 INFO CliDriver: Time taken: 1.713 seconds
19/03/13 19:15:11 INFO conf.HiveConf: Using the default value passed in for log id: 7feac288-289d-4d74-8641-553c5ab65618
19/03/13 19:15:11 INFO session.SessionState: Resetting thread name to  main
19/03/13 19:15:11 INFO conf.HiveConf: Using the default value passed in for log id: 7feac288-289d-4d74-8641-553c5ab65618
19/03/13 19:15:11 INFO session.SessionState: Deleted directory: /tmp/hive/root/7feac288-289d-4d74-8641-553c5ab65618 on fs with scheme hdfs
19/03/13 19:15:11 INFO session.SessionState: Deleted directory: /hadoop/hive/tmp/root/7feac288-289d-4d74-8641-553c5ab65618 on fs with scheme file
19/03/13 19:15:11 INFO hive.metastore: Closed a connection to metastore, current connections: 0
19/03/13 19:15:11 INFO hive.HiveImport: Hive import complete.
19/03/13 19:15:11 INFO hive.HiveImport: Export directory is empty, removing it.
19/03/13 19:15:11 INFO tool.ImportTool: Saving incremental import state to the metastore
19/03/13 19:15:11 INFO tool.ImportTool: Updated data for job: inc_job

No password prompt this time. Check the Hive table data again:

hive> select * from inr_las;
OK
1	er	CLERK	800.0	2019-03-20 10:42:27.0
2	ALLEN	SALESMAN	1600.0	2019-03-20 10:42:27.0
3	WARD	SALESMAN	1250.0	2019-03-20 10:42:27.0
4	JONES	MANAGER	2975.0	2019-03-20 10:42:27.0
5	MARTIN	SALESMAN	1250.0	2019-03-20 10:42:27.0
6	zhao	DBA	1000.0	2019-03-20 10:52:34.0
7	yan	BI	100.0	2019-03-20 10:42:27.0
8	dong	JAVA	332.0	2019-03-20 14:49:19.0
8	dong	JAVA	3232.0	2019-03-20 15:13:35.0
8	dong	JAVA	4232.0	2019-03-20 15:29:03.0
8	dong	JAVA	5232.0	2019-03-20 15:36:07.0
8	dong	JAVA	5232.0	2019-03-20 15:36:07.0
8	dong	JAVA	5232.0	2019-03-20 15:36:07.0
8	dong	JAVA	3232.0	2019-03-20 15:13:35.0
Time taken: 0.161 seconds, Fetched: 14 row(s)

That's 14 rows now, one more row with empno=8: success.
However, my requirement here is that when rows in the source Oracle database are updated, their timestamps change as well, so changed rows must be found by timestamp and incrementally merged into Hive. This is where the merge-key mode comes in, which merges increments by timestamp plus primary key; creating such a job is similar to the example above.
In this example, we wrap the job in a shell script and let crontab execute the increment on a schedule.
Note: since I am testing incremental imports to feed an incremental Kylin cube and the test data volume is tiny, the Hive table is created only as an external table, without partitions.
The following walks through, step by step, how to automatically and incrementally import a table into Hive using a sqoop job combined with crontab and a shell script.
Step 1: create the table to sync on the Oracle side; here we initialize a new table from the inr_las table used above:

create table inr_job as
  select a.empno, a.ename, a.job, a.sal, sysdate etltime
    from inr_las a;

Step 2: create the target table in Hive:

hive> use oracle;
OK
Time taken: 1.425 seconds
create external table INR_JOB
(
  empno   int,
  ename   string,
  job     string,
  sal     float,
  etltime string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
location '/user/hive/warehouse/exter_inr_job'; 
Time taken: 2.836 seconds

Step 3: fully import the data into Hive.
First delete the directory specified when the external table was created above: creating the external table creates that directory automatically, and the full import below creates it as well, so leaving it in place triggers a "directory already exists" error:

[root@hadoop hadoop]# hadoop fs -rmr /user/hive/warehouse/exter_inr_job
rmr: DEPRECATED: Please use 'rm -r' instead.
19/03/14 06:01:23 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hive/warehouse/exter_inr_job

Next, the full import:

sqoop import --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl --username scott --password tiger --table INR_JOB -m 1 --target-dir /user/hive/warehouse/exter_inr_job --fields-terminated-by '\t'

After the import completes, query the Hive data (the files land in the external table's location, so Hive sees them immediately):

hive> select * from inr_job;
OK
1	er	CLERK	800.0	2019-03-22 17:24:42.0
2	ALLEN	SALESMAN	1600.0	2019-03-22 17:24:42.0
3	WARD	SALESMAN	1250.0	2019-03-22 17:24:42.0
4	JONES	MANAGER	2975.0	2019-03-22 17:24:42.0
5	MARTIN	SALESMAN	1250.0	2019-03-22 17:24:42.0
6	zhao	DBA	1000.0	2019-03-22 17:24:42.0
7	yan	BI	100.0	2019-03-22 17:24:42.0
8	dong	JAVA	400.0	2019-03-22 17:24:42.0
Time taken: 3.153 seconds, Fetched: 8 row(s)

Step 4: create the incremental sqoop job.
The --password-file /user/hive/warehouse/scott.pwd below is the password file created earlier in this article; see above for how it was built.

 sqoop job --create auto_job -- import --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl --username scott  --password-file /user/hive/warehouse/scott.pwd  --table INR_JOB --fields-terminated-by '\t' --lines-terminated-by '\n'  --target-dir /user/hive/warehouse/exter_inr_job -m 1 --check-column ETLTIME --incremental lastmodified --merge-key EMPNO --last-value "2019-03-22 17:24:42"
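
With --incremental lastmodified plus --merge-key, each execution runs in two phases: first an import of rows whose ETLTIME falls between the stored last-value and the current time, then a merge MapReduce job that collapses duplicates on EMPNO so the target directory keeps exactly one row per key (the execution log later in this article shows both phases). After a run, the consolidated output can be inspected directly (a sketch; the part file name may vary):

[root@hadoop hadoop]# hdfs dfs -ls /user/hive/warehouse/exter_inr_job
[root@hadoop hadoop]# hdfs dfs -cat /user/hive/warehouse/exter_inr_job/part-r-00000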

Look at the created job's info:

[root@hadoop hadoop]# sqoop  job --show auto_job
Warning: /hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/03/14 06:10:57 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hbase/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Job: auto_job
Tool: import
Options:
----------------------------
verbose = false
hcatalog.drop.and.create.table = false
incremental.last.value = 2019-03-22 17:24:42
db.connect.string = jdbc:oracle:thin:@192.168.1.6:1521:orcl
codegen.output.delimiters.escape = 0
codegen.output.delimiters.enclose.required = false
codegen.input.delimiters.field = 0
mainframe.input.dataset.type = p
split.limit = null
hbase.create.table = false
skip.dist.cache = false
hdfs.append.dir = false
db.table = INR_JOB
codegen.input.delimiters.escape = 0
accumulo.create.table = false
import.fetch.size = null
codegen.input.delimiters.enclose.required = false
db.username = scott
reset.onemapper = false
codegen.output.delimiters.record = 10
import.max.inline.lob.size = 16777216
sqoop.throwOnError = false
hbase.bulk.load.enabled = false
hcatalog.create.table = false
db.clear.staging.table = false
incremental.col = ETLTIME
codegen.input.delimiters.record = 0
db.password.file = /user/hive/warehouse/scott.pwd
enable.compression = false
hive.overwrite.table = false
hive.import = false
codegen.input.delimiters.enclose = 0
accumulo.batch.size = 10240000
hive.drop.delims = false
customtool.options.jsonmap = {}
codegen.output.delimiters.enclose = 0
hdfs.delete-target.dir = false
codegen.output.dir = .
codegen.auto.compile.dir = true
relaxed.isolation = false
mapreduce.num.mappers = 1
accumulo.max.latency = 5000
import.direct.split.size = 0
sqlconnection.metadata.transaction.isolation.level = 2
codegen.output.delimiters.field = 9
export.new.update = UpdateOnly
incremental.mode = DateLastModified
hdfs.file.format = TextFile
sqoop.oracle.escaping.disabled = true
codegen.compile.dir = /tmp/sqoop-root/compile/be3b358816e17c786d114afb7a4f2a6d
direct.import = false
temporary.dirRoot = _sqoop
hdfs.target.dir = /user/hive/warehouse/exter_inr_job
hive.fail.table.exists = false
merge.key.col = EMPNO
db.batch = false

Step 5: wrap it in a shell script and schedule it with cron.

[root@hadoop ~]# cd /hadoop/
[root@hadoop hadoop]# vim auto_inr.sh

Add the following content:

#!/bin/bash
log="/hadoop/auto_job_log.log"
echo "======================`date "+%Y-%m-%d %H:%M:%S"` incremental======================" >> $log
nohup sqoop job --exec auto_job >> $log 2>&1 &

Save and exit, then grant execute permission:

[root@hadoop hadoop]# chmod +x auto_inr.sh
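
One caveat with the script above: if a run ever takes longer than the cron interval, two executions can overlap and race on the job's saved last_value. A hardened variant (a sketch; flock comes from util-linux and is not part of the original script) serializes runs:

#!/bin/bash
# Same job as auto_inr.sh, but flock(1) skips a run if the previous one
# is still active, so executions never overlap.
log="/hadoop/auto_job_log.log"
lock="/tmp/auto_inr.lock"
(
  flock -n 9 || { echo "`date "+%Y-%m-%d %H:%M:%S"` previous run still active, skipping" >> "$log"; exit 1; }
  echo "======================`date "+%Y-%m-%d %H:%M:%S"` incremental======================" >> "$log"
  sqoop job --exec auto_job >> "$log" 2>&1
) 9>"$lock"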

Run it manually first, but before doing so, check the job's last_value timestamp:

[root@hadoop hadoop]# sqoop job --show auto_job
Warning: /hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/03/25 17:50:59 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hbase/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Job: auto_job
Tool: import
Options:
----------------------------
verbose = false
hcatalog.drop.and.create.table = false
incremental.last.value = 2019-03-22 17:24:42
db.connect.string = jdbc:oracle:thin:@192.168.1.6:1521:orcl

It shows 2019-03-22 17:24:42. Now check the current time:

[root@hadoop hadoop]# date
Mon Mar 25 17:54:54 CST 2019

Next, run the script manually:

[root@hadoop hadoop]# ./auto_inr.sh 

Then check the redirected log:

[root@hadoop hadoop]# cat auto_job_log.log 
======================2019-03-25 17:55:46 incremental======================
Warning: /hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/03/25 17:55:48 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hbase/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/03/25 17:55:50 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
19/03/25 17:55:50 INFO manager.SqlManager: Using default fetchSize of 1000
19/03/25 17:55:50 INFO tool.CodeGenTool: Beginning code generation
19/03/25 17:55:51 INFO manager.OracleManager: Time zone has been set to GMT
19/03/25 17:55:51 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM INR_JOB t WHERE 1=0
19/03/25 17:55:51 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /hadoop
Note: /tmp/sqoop-root/compile/6f5f7577c1f664b94d5c83b578fd3dac/INR_JOB.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/03/25 17:55:54 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/6f5f7577c1f664b94d5c83b578fd3dac/INR_JOB.jar
19/03/25 17:55:54 INFO manager.OracleManager: Time zone has been set to GMT
19/03/25 17:55:54 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM INR_JOB t WHERE 1=0
19/03/25 17:55:54 INFO tool.ImportTool: Incremental import based on column ETLTIME
19/03/25 17:55:54 INFO tool.ImportTool: Lower bound value: TO_TIMESTAMP('2019-03-22 17:24:42', 'YYYY-MM-DD HH24:MI:SS.FF')
19/03/25 17:55:54 INFO tool.ImportTool: Upper bound value: TO_TIMESTAMP('2019-03-25 17:55:54.0', 'YYYY-MM-DD HH24:MI:SS.FF')
19/03/25 17:55:54 INFO manager.OracleManager: Time zone has been set to GMT
19/03/25 17:55:54 INFO mapreduce.ImportJobBase: Beginning import of INR_JOB
19/03/25 17:55:54 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/03/25 17:55:54 INFO manager.OracleManager: Time zone has been set to GMT
19/03/25 17:55:54 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/03/25 17:55:54 INFO client.RMProxy: Connecting to ResourceManager at /192.168.1.66:8032
19/03/25 17:55:57 INFO db.DBInputFormat: Using read commited transaction isolation
19/03/25 17:55:57 INFO mapreduce.JobSubmitter: number of splits:1
19/03/25 17:55:58 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1553503985304_0009
19/03/25 17:55:58 INFO impl.YarnClientImpl: Submitted application application_1553503985304_0009
19/03/25 17:55:58 INFO mapreduce.Job: The url to track the job: http://hadoop:8088/proxy/application_1553503985304_0009/
19/03/25 17:55:58 INFO mapreduce.Job: Running job: job_1553503985304_0009
19/03/25 17:56:07 INFO mapreduce.Job: Job job_1553503985304_0009 running in uber mode : false
19/03/25 17:56:07 INFO mapreduce.Job:  map 0% reduce 0%
19/03/25 17:56:15 INFO mapreduce.Job:  map 100% reduce 0%
19/03/25 17:56:15 INFO mapreduce.Job: Job job_1553503985304_0009 completed successfully
19/03/25 17:56:15 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=144775
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=87
		HDFS: Number of bytes written=323
		HDFS: Number of read operations=4
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=1
		Other local map tasks=1
		Total time spent by all maps in occupied slots (ms)=5270
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=5270
		Total vcore-milliseconds taken by all map tasks=5270
		Total megabyte-milliseconds taken by all map tasks=5396480
	Map-Reduce Framework
		Map input records=8
		Map output records=8
		Input split bytes=87
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=73
		CPU time spent (ms)=3000
		Physical memory (bytes) snapshot=205058048
		Virtual memory (bytes) snapshot=2135244800
		Total committed heap usage (bytes)=109576192
	File Input Format Counters 
		Bytes Read=0
	File Output Format Counters 
		Bytes Written=323
19/03/25 17:56:15 INFO mapreduce.ImportJobBase: Transferred 323 bytes in 20.9155 seconds (15.4431 bytes/sec)
19/03/25 17:56:15 INFO mapreduce.ImportJobBase: Retrieved 8 records.
19/03/25 17:56:15 INFO tool.ImportTool: Final destination exists, will run merge job.
19/03/25 17:56:15 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
19/03/25 17:56:15 INFO client.RMProxy: Connecting to ResourceManager at /192.168.1.66:8032
19/03/25 17:56:18 INFO input.FileInputFormat: Total input paths to process : 2
19/03/25 17:56:18 INFO mapreduce.JobSubmitter: number of splits:2
19/03/25 17:56:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1553503985304_0010
19/03/25 17:56:19 INFO impl.YarnClientImpl: Submitted application application_1553503985304_0010
19/03/25 17:56:19 INFO mapreduce.Job: The url to track the job: http://hadoop:8088/proxy/application_1553503985304_0010/
19/03/25 17:56:19 INFO mapreduce.Job: Running job: job_1553503985304_0010
19/03/25 17:56:29 INFO mapreduce.Job: Job job_1553503985304_0010 running in uber mode : false
19/03/25 17:56:29 INFO mapreduce.Job:  map 0% reduce 0%
19/03/25 17:56:39 INFO mapreduce.Job:  map 100% reduce 0%
19/03/25 17:56:50 INFO mapreduce.Job:  map 100% reduce 100%
19/03/25 17:56:50 INFO mapreduce.Job: Job job_1553503985304_0010 completed successfully
19/03/25 17:56:50 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=1090
		FILE: Number of bytes written=436771
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=942
		HDFS: Number of bytes written=323
		HDFS: Number of read operations=9
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=14667
		Total time spent by all reduces in occupied slots (ms)=7258
		Total time spent by all map tasks (ms)=14667
		Total time spent by all reduce tasks (ms)=7258
		Total vcore-milliseconds taken by all map tasks=14667
		Total vcore-milliseconds taken by all reduce tasks=7258
		Total megabyte-milliseconds taken by all map tasks=15019008
		Total megabyte-milliseconds taken by all reduce tasks=7432192
	Map-Reduce Framework
		Map input records=16
		Map output records=16
		Map output bytes=1052
		Map output materialized bytes=1096
		Input split bytes=296
		Combine input records=0
		Combine output records=0
		Reduce input groups=8
		Reduce shuffle bytes=1096
		Reduce input records=16
		Reduce output records=8
		Spilled Records=32
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=230
		CPU time spent (ms)=5420
		Physical memory (bytes) snapshot=684474368
		Virtual memory (bytes) snapshot=6394597376
		Total committed heap usage (bytes)=511705088
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=646
	File Output Format Counters 
		Bytes Written=323
19/03/25 17:56:50 INFO tool.ImportTool: Saving incremental import state to the metastore
19/03/25 17:56:51 INFO tool.ImportTool: Updated data for job: auto_job

Note this line in the log:

19/03/25 17:55:54 INFO tool.ImportTool: Upper bound value: TO_TIMESTAMP('2019-03-25 17:55:54.0', 'YYYY-MM-DD HH24:MI:SS.FF')

This shows that the upper bound is the current time. Now look at the job's current last_value:

hcatalog.drop.and.create.table = false
incremental.last.value = 2019-03-25 17:55:54.0
db.connect.string = jdbc:oracle:thin:@192.168.1.6:1521:orcl

It matches the timestamp in the log above. If the job definition ever needs to change, delete the job and recreate it with the updated settings, manually setting the last-value to the Upper bound from the log (or record the last_value before deleting), so the rebuilt job resumes from the correct incremental point.
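
For example (a sketch reusing the creation parameters from step 4, with the new last-value taken from the log above):

[root@hadoop hadoop]# sqoop job --delete auto_job
[root@hadoop hadoop]# sqoop job --create auto_job -- import --connect jdbc:oracle:thin:@192.168.1.6:1521:orcl --username scott --password-file /user/hive/warehouse/scott.pwd --table INR_JOB --fields-terminated-by '\t' --lines-terminated-by '\n' --target-dir /user/hive/warehouse/exter_inr_job -m 1 --check-column ETLTIME --incremental lastmodified --merge-key EMPNO --last-value "2019-03-25 17:55:54.0"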
Manual invocation works fine. All that remains is crontab scheduling; run crontab -e and add the following line (an increment every two minutes):

*/2 * * * *  /hadoop/auto_inr.sh
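
Keep in mind that cron runs with a minimal environment: if sqoop or the Hadoop-related variables are only set in your login profile, the scheduled run may fail even though the manual run works. A common workaround (a sketch) is to source the profile in the crontab entry or inside the script:

*/2 * * * *  . /etc/profile; /hadoop/auto_inr.sh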

What if a table is very large and the first initialization only loads a recent slice of the data into the Hive table? If historical rows that were never loaded are changed today, will the merge-key incremental mode fail? See the next test article; once it is written the link will be placed here:
https://blog.csdn.net/qq_28356739/article/details/88803284
