Sqoop User Guide (5): job, metastore, merge, codegen

12. sqoop-job

12.1. Purpose

The job tool allows you to create and work with saved jobs. Saved jobs remember the parameters used to specify a job, so they can be re-executed by invoking the job by its handle.

If a saved job is configured to perform an incremental import, state regarding the most recently imported rows is updated in the saved job to allow the job to continually import only the newest rows.

12.2. Syntax

$ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
$ sqoop-job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]

Although the Hadoop generic arguments must precede any job arguments, the job arguments can be entered in any order with respect to one another.

Table 24. Job management options:

Argument            Description
--create <job-id>   Define a new saved job with the specified job-id (name). A second Sqoop command line, separated by a --, should be specified; this defines the saved job.
--delete <job-id>   Delete a saved job.
--exec <job-id>     Given a job defined with --create, run the saved job.
--show <job-id>     Show the parameters for a saved job.
--list              List all saved jobs.


Creating saved jobs is done with the --create action. This operation requires a -- followed by a tool name and its arguments. The tool and its arguments will form the basis of the saved job. Consider:

$ sqoop job --create myjob -- import --connect jdbc:mysql://example.com/db \
    --table mytable

This creates a job named myjob which can be executed later. The job is not run. This job is now available in the list of saved jobs:

$ sqoop job --list
Available jobs:
  myjob

We can inspect the configuration of a job with the show action:

 $ sqoop job --show myjob
 Job: myjob
 Tool: import
 Options:
 ----------------------------
 direct.import = false
 codegen.input.delimiters.record = 0
 hdfs.append.dir = false
 db.table = mytable
 ...

And if we are satisfied with it, we can run the job with exec:

$ sqoop job --exec myjob
10/08/19 13:08:45 INFO tool.CodeGenTool: Beginning code generation
...

The exec action allows you to override arguments of the saved job by supplying them after a --. For example, if the database were changed to require a username, we could specify the username and password with:

$ sqoop job --exec myjob -- --username someuser -P
Enter password:
...
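
A saved job that is no longer needed can be removed with the --delete action from Table 24; a minimal sketch, reusing the myjob name from above:

$ sqoop job --delete myjob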

Table 25. Metastore connection options:

Argument                    Description
--meta-connect <jdbc-uri>   Specifies the JDBC connect string used to connect to the metastore.

By default, a private metastore is instantiated in $HOME/.sqoop. If you have configured a hosted metastore with the sqoop-metastore tool, you can connect to it by specifying the --meta-connect argument. This is a JDBC connect string just like the ones used to connect to databases for import.
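
For example, the saved-job commands above can be pointed at a shared metastore directly on the command line (the host name below is only an illustration; the connect-string format is described in section 13.2):

$ sqoop job --list \
    --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop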

In conf/sqoop-site.xml, you can configure sqoop.metastore.client.autoconnect.url with this address, so you do not have to supply --meta-connect to use a remote metastore. This parameter can also be modified to move the private metastore to a location on your filesystem other than your home directory.
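
A minimal sketch of the corresponding entry in conf/sqoop-site.xml (the URL value is only an example of a remote metastore address):

<property>
  <name>sqoop.metastore.client.autoconnect.url</name>
  <value>jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop</value>
</property>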

If you configure sqoop.metastore.client.enable.autoconnect with the value false, then you must explicitly supply --meta-connect.

Table 26. Common options:

Argument    Description
--help      Print usage instructions
--verbose   Print more information while working

12.3. Saved jobs and passwords

The Sqoop metastore is not a secure resource. Multiple users can access its contents. For this reason, Sqoop does not store passwords in the metastore. If you create a job that requires a password, you will be prompted for that password each time you execute the job.

You can enable passwords in the metastore by setting sqoop.metastore.client.record.password to true in the configuration.
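
For example, in conf/sqoop-site.xml:

<property>
  <name>sqoop.metastore.client.record.password</name>
  <value>true</value>
</property>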

Note that you have to set sqoop.metastore.client.record.password to true if you are executing saved jobs via Oozie because Sqoop cannot prompt the user to enter passwords while being executed as Oozie tasks.

12.4. Saved jobs and incremental imports

Incremental imports are performed by comparing the values in a check column against a reference value for the most recent import. For example, if the --incremental append argument was specified, along with --check-column id and --last-value 100, all rows with id > 100 will be imported. If an incremental import is run from the command line, the value which should be specified as --last-value in a subsequent incremental import will be printed to the screen for your reference. If an incremental import is run from a saved job, this value will be retained in the saved job. Subsequent runs of sqoop job --exec someIncrementalJob will continue to import only newer rows than those previously imported.
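
As a sketch, a saved job that performs such an incremental import could be defined as follows (the connect string and table reuse the earlier myjob example; the job name someIncrementalJob matches the text above):

$ sqoop job --create someIncrementalJob -- import \
    --connect jdbc:mysql://example.com/db \
    --table mytable \
    --incremental append \
    --check-column id \
    --last-value 100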

13. sqoop-metastore

13.1. Purpose

The metastore tool configures Sqoop to host a shared metadata repository. Multiple users and/or remote users can define and execute saved jobs (created with sqoop job) defined in this metastore.

Clients must be configured to connect to the metastore in sqoop-site.xml or with the --meta-connect argument.

13.2. Syntax

$ sqoop metastore (generic-args) (metastore-args)
$ sqoop-metastore (generic-args) (metastore-args)

Although the Hadoop generic arguments must precede any metastore arguments, the metastore arguments can be entered in any order with respect to one another.

Table 27. Metastore management options:

Argument     Description
--shutdown   Shuts down a running metastore instance on the same machine.

Running sqoop-metastore launches a shared HSQLDB database instance on the current machine. Clients can connect to this metastore and create jobs which can be shared between users for execution.
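
A minimal sketch of running the tool on the metastore host, and later stopping the instance from the same machine:

$ sqoop metastore
...
$ sqoop metastore --shutdown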

The location of the metastore’s files on disk is controlled by the sqoop.metastore.server.location property in conf/sqoop-site.xml. This should point to a directory on the local filesystem.

The metastore is available over TCP/IP. The port is controlled by the sqoop.metastore.server.port configuration parameter, and defaults to 16000.
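
For reference, a sketch of those two server-side properties in conf/sqoop-site.xml (the directory path is only an example):

<property>
  <name>sqoop.metastore.server.location</name>
  <value>/var/lib/sqoop/metastore</value>
</property>
<property>
  <name>sqoop.metastore.server.port</name>
  <value>16000</value>
</property>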

Clients should connect to the metastore by specifying sqoop.metastore.client.autoconnect.url or --meta-connect with the value jdbc:hsqldb:hsql://<server-name>:<port>/sqoop. For example, jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop.

This metastore may be hosted on a machine within the Hadoop cluster, or elsewhere on the network.

Reference: http://myeyeofjava.iteye.com/blog/1704644

14. sqoop-merge

14.1. Purpose

The merge tool allows you to combine two datasets where entries in one dataset should overwrite entries of an older dataset. For example, an incremental import run in last-modified mode will generate multiple datasets in HDFS where successively newer data appears in each dataset. The merge tool will "flatten" two datasets into one, taking the newest available records for each primary key.

Translator's note: the merge tool is used after incremental imports, most often in last-modified mode. Suppose the first import writes to --target-dir old and a second, last-modified-mode import writes to --target-dir new; rows in the second dataset that share a primary key with rows in the first contain updated data, and the merge tool can then combine the two datasets into one. A worked example:

http://blog.csdn.net/coldplay/article/details/7619065

14.2. Syntax

$ sqoop merge (generic-args) (merge-args)
$ sqoop-merge (generic-args) (merge-args)

Although the Hadoop generic arguments must precede any merge arguments, the merge arguments can be entered in any order with respect to one another.

Table 28. Merge options:

Argument               Description
--class-name <class>   Specify the name of the record-specific class to use during the merge job.
--jar-file <file>      Specify the name of the jar to load the record class from.
--merge-key <col>      Specify the name of a column to use as the merge key.
--new-data <path>      Specify the path of the newer dataset.
--onto <path>          Specify the path of the older dataset.
--target-dir <path>    Specify the target path for the output of the merge job.

The merge tool runs a MapReduce job that takes two directories as input: a newer dataset, and an older one. These are specified with --new-data and --onto respectively. The output of the MapReduce job will be placed in the directory in HDFS specified by --target-dir.

When merging the datasets, it is assumed that there is a unique primary key value in each record. The column for the primary key is specified with --merge-key. Multiple rows in the same dataset should not have the same primary key, or else data loss may occur.

To parse the dataset and extract the key column, the auto-generated class from a previous import must be used. You should specify the class name and jar file with --class-name and --jar-file. If this is not available you can recreate the class using the codegen tool.
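
If the jar from the original import is no longer available, a sketch of regenerating it with codegen (the connect string, table, and class name are illustrative; the resulting jar and class are then passed to the merge tool's --jar-file and --class-name):

$ sqoop codegen --connect jdbc:mysql://example.com/db \
    --table mytable --class-name Foo --bindir .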

The merge tool is typically run after an incremental import with the date-last-modified mode (sqoop import --incremental lastmodified …).

Supposing two incremental imports were performed, where some older data is in an HDFS directory named older and newer data is in an HDFS directory named newer, these could be merged like so:

$ sqoop merge --new-data newer --onto older --target-dir merged \
    --jar-file datatypes.jar --class-name Foo --merge-key id

This would run a MapReduce job where the value in the id column of each row is used to join rows; rows in the newer dataset will be used in preference to rows in the older dataset.

This can be used with SequenceFile-, Avro-, and text-based incremental imports. The file types of the newer and older datasets must be the same.

15. sqoop-codegen

15.1. Purpose

The codegen tool generates Java classes which encapsulate and interpret imported records. The Java definition of a record is instantiated as part of the import process, but can also be performed separately. For example, if Java source is lost, it can be recreated. New versions of a class can be created which use different delimiters between fields, and so on.
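
For instance, a sketch of regenerating the class for a table with a different field delimiter (the connect string and table are taken from the invocation example in section 15.3; the delimiter choice is illustrative):

$ sqoop codegen --connect jdbc:mysql://db.example.com/corp \
    --table employees \
    --fields-terminated-by '\t'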


15.2. Syntax

$ sqoop codegen (generic-args) (codegen-args)
$ sqoop-codegen (generic-args) (codegen-args)

Although the Hadoop generic arguments must precede any codegen arguments, the codegen arguments can be entered in any order with respect to one another.

Table 29. Common arguments

Argument                             Description
--connect <jdbc-uri>                 Specify JDBC connect string
--connection-manager <class-name>    Specify connection manager class to use
--driver <class-name>                Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>           Override $HADOOP_MAPRED_HOME
--help                               Print usage instructions
-P                                   Read password from console
--password <password>                Set authentication password
--username <username>                Set authentication username
--verbose                            Print more information while working
--connection-param-file <filename>   Optional properties file that provides connection parameters

Table 30. Code generation arguments:

Argument                Description
--bindir <dir>          Output directory for compiled objects
--class-name <name>     Sets the generated class name. This overrides --package-name. When combined with --jar-file, sets the input class.
--jar-file <file>       Disable code generation; use specified jar
--outdir <dir>          Output directory for generated code
--package-name <name>   Put auto-generated classes in this package
--map-column-java <m>   Override default mapping from SQL type to Java type for configured columns

Table 31. Output line formatting arguments:

Argument                          Description
--enclosed-by <char>              Sets a required field enclosing character
--escaped-by <char>               Sets the escape character
--fields-terminated-by <char>     Sets the field separator character
--lines-terminated-by <char>      Sets the end-of-line character
--mysql-delimiters                Uses MySQL’s default delimiter set: fields: ,  lines: \n  escaped-by: \  optionally-enclosed-by: '
--optionally-enclosed-by <char>   Sets a field enclosing character

Table 32. Input parsing arguments:

Argument                                Description
--input-enclosed-by <char>              Sets a required field encloser
--input-escaped-by <char>               Sets the input escape character
--input-fields-terminated-by <char>     Sets the input field separator
--input-lines-terminated-by <char>      Sets the input end-of-line character
--input-optionally-enclosed-by <char>   Sets a field enclosing character

Table 33. Hive arguments:

Argument                     Description
--hive-home <dir>            Override $HIVE_HOME
--hive-import                Import tables into Hive (uses Hive’s default delimiters if none are set)
--hive-overwrite             Overwrite existing data in the Hive table
--create-hive-table          If set, then the job will fail if the target Hive table exists. By default this property is false.
--hive-table <table-name>    Sets the table name to use when importing to Hive
--hive-drop-import-delims    Drops \n, \r, and \01 from string fields when importing to Hive
--hive-delims-replacement    Replace \n, \r, and \01 from string fields with a user-defined string when importing to Hive
--hive-partition-key         Name of the Hive field that partitions are sharded on
--hive-partition-value <v>   String value that serves as the partition key for data imported into Hive in this job
--map-column-hive <map>      Override default mapping from SQL type to Hive type for configured columns

If Hive arguments are provided to the code generation tool, Sqoop generates a file containing the HQL statements to create a table and load data.

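For example, a sketch of a codegen invocation that also emits the Hive DDL (the connect string and table reuse the invocation example below; the --hive-import and --hive-table options come from Table 33):

$ sqoop codegen --connect jdbc:mysql://db.example.com/corp \
    --table employees \
    --hive-import --hive-table employees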

15.3. Example Invocations

Recreate the record interpretation code for the employees table of a corporate database:

$ sqoop codegen --connect jdbc:mysql://db.example.com/corp \
    --table employees

