Sqoop User Guide (Part 3): Export

9. sqoop-export

9.1. Purpose

The export tool exports a set of files from HDFS back to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters.

The default operation is to transform these into a set of INSERT statements that inject the records into the database. In "update mode," Sqoop will generate UPDATE statements that replace existing records in the database, and in "call mode" Sqoop will make a stored procedure call for each record.

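A quick sketch of the three modes (the connect string, database db on dbhost, table bar, stored procedure barproc, and HDFS directory /data/bar are all hypothetical):

# Insert mode (default): each input record becomes an INSERT
$ sqoop export --connect jdbc:mysql://dbhost/db --table bar --export-dir /data/bar

# Update mode: each input record becomes an UPDATE matched on the id column
$ sqoop export --connect jdbc:mysql://dbhost/db --table bar --update-key id \
    --export-dir /data/bar

# Call mode: one stored procedure call per input record
$ sqoop export --connect jdbc:mysql://dbhost/db --call barproc --export-dir /data/bar
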
9.2. Syntax

$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)

Although the Hadoop generic arguments must precede any export arguments, the export arguments can be entered in any order with respect to one another.

Table 18. Common arguments

Argument                               Description
--connect <jdbc-uri>                   Specify JDBC connect string
--connection-manager <class-name>      Specify connection manager class to use
--driver <class-name>                  Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>             Override $HADOOP_MAPRED_HOME
--help                                 Print usage instructions
-P                                     Read password from console
--password <password>                  Set authentication password
--username <username>                  Set authentication username
--verbose                              Print more information while working
--connection-param-file <filename>     Optional properties file that provides connection parameters

Table 19. Validation arguments

Argument                                      Description
--validate                                    Enable validation of data copied; supports single-table copy only
--validator <class-name>                      Specify validator class to use
--validation-threshold <class-name>           Specify validation threshold class to use
--validation-failurehandler <class-name>      Specify validation failure handler class to use

Table 20. Export control arguments:

Argument                                   Description
--direct                                   Use direct export fast path
--export-dir <dir>                         HDFS source path for the export
-m,--num-mappers <n>                       Use n map tasks to export in parallel
--table <table-name>                       Table to populate
--call <stored-proc-name>                  Stored procedure to call
--update-key <col-name>                    Anchor column to use for updates; use a comma-separated list of columns if there is more than one column
--update-mode <mode>                       Specify how updates are performed when new rows are found with non-matching keys in the database; legal values for mode are updateonly (default) and allowinsert
--input-null-string <null-string>          The string to be interpreted as null for string columns
--input-null-non-string <null-string>      The string to be interpreted as null for non-string columns
--staging-table <staging-table-name>       The table in which data will be staged before being inserted into the destination table
--clear-staging-table                      Indicates that any data present in the staging table can be deleted
--batch                                    Use batch mode for underlying statement execution

Note: a staging table is an auxiliary, intermediate table; data is staged there first and moved to the target table only when the job succeeds.

update-mode

By default, the Sqoop export tool turns the whole data transfer into a series of INSERT statements against the database. When the target table is empty, this default is perfectly adequate. But if the target table is non-empty and carries constraints, the INSERTs can fail on problems such as primary-key collisions, and at present the failure of a single INSERT fails the entire Sqoop export job. To avoid this, use the update-mode argument, which defines how the table is updated. It has two modes:

  • updateonly

  • allowinsert

The first mode (updateonly) is the default: the whole export is broken down into a series of UPDATE statements, and the columns that the updates match on are given by --update-key <col-name>. In this mode, an UPDATE that touches no existing data (for example, because the row to be updated does not exist in the table) causes no failure, but it does not change any data either. If you want to insert rows that do not yet exist while updating those that do, use allowinsert mode: rows without a match are INSERTed rather than UPDATEd. Choose the update-mode that fits your situation.

staging-table

As described above, while an export is running, the failure of a single INSERT can fail the whole job after some rows have already been inserted into the table, leaving the data incomplete. Ideally, either all of the data is written to the database or none of it is. To achieve this, you can create an intermediate staging table in the database beforehand. The staging table is structurally identical to the target table; data is moved from the staging table into the target table only after every statement has completed successfully, so a failed job leaves the target table untouched. When using the --staging-table argument you can also pass --clear-staging-table, which ensures the staging table is emptied before the export job starts.


The --export-dir argument and one of --table or --call are required. These specify the table to populate in the database (or the stored procedure to call), and the directory in HDFS that contains the source data.

You can control the number of mappers independently from the number of files present in the directory. Export performance depends on the degree of parallelism. By default, Sqoop will use four tasks in parallel for the export process. This may not be optimal; you will need to experiment with your own particular setup. Additional tasks may offer better concurrency, but if the database is already bottlenecked on updating indices, invoking triggers, and so on, then additional load may decrease performance. The --num-mappers or -m arguments control the number of map tasks, which is the degree of parallelism used.

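For example, a hedged sketch that raises the parallelism to eight map tasks (connect string and names hypothetical):

$ sqoop export --connect jdbc:mysql://dbhost/db --table bar \
    --export-dir /data/bar -m 8
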
MySQL provides a direct mode for exports as well, using the mysqlimport tool. When exporting to MySQL, use the --direct argument to specify this codepath. This may be higher-performance than the standard JDBC codepath.

Note:

When using export in direct mode with MySQL, the MySQL bulk utility mysqlimport must be available in the shell path of the task process.

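With mysqlimport in place, a hedged sketch (reusing the example names from Section 9.6):

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data --direct
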
The --input-null-string and --input-null-non-string arguments are optional. If --input-null-string is not specified, then the string "null" will be interpreted as null for string-type columns. If --input-null-non-string is not specified, then both the string "null" and the empty string will be interpreted as null for non-string columns. Note that the empty string will always be interpreted as null for non-string columns, in addition to any other string specified by --input-null-non-string.

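For example, data written by Hive commonly encodes null as \N; a hedged sketch (connect string and names hypothetical):

$ sqoop export --connect jdbc:mysql://dbhost/db --table bar \
    --export-dir /data/bar \
    --input-null-string '\\N' --input-null-non-string '\\N'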

Since Sqoop breaks down export process into multiple transactions, it is possible that a failed export job may result in partial data being committed to the database. This can further lead to subsequent jobs failing due to insert collisions in some cases, or lead to duplicated data in others. You can overcome this problem by specifying a staging table via the --staging-table option which acts as an auxiliary table that is used to stage exported data. The staged data is finally moved to the destination table in a single transaction.

In order to use the staging facility, you must create the staging table prior to running the export job. This table must be structurally identical to the target table. This table should either be empty before the export job runs, or the --clear-staging-table option must be specified. If the staging table contains data and the --clear-staging-table option is specified, Sqoop will delete all of the data before starting the export job.

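Putting the pieces together, a hedged sketch (host, database, and table names hypothetical):

# create a staging table structurally identical to the target (MySQL syntax)
$ mysql -h dbhost -e "CREATE TABLE db.bar_stage LIKE db.bar"

# export through the staging table, clearing any leftover staged rows first
$ sqoop export --connect jdbc:mysql://dbhost/db --table bar \
    --staging-table bar_stage --clear-staging-table \
    --export-dir /data/bar
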
Note:

Support for staging data prior to pushing it into the destination table is not available for --direct exports. It is also not available when export is invoked using the --update-key option for updating existing data, and when stored procedures are used to insert the data.

9.3. Inserts vs. Updates

By default, sqoop-export appends new rows to a table; each input record is transformed into an INSERT statement that adds a row to the target database table. If your table has constraints (e.g., a primary key column whose values must be unique) and already contains data, you must take care to avoid inserting records that violate these constraints. The export process will fail if an INSERT statement fails. This mode is primarily intended for exporting records to a new, empty table intended to receive these results.

If you specify the --update-key argument, Sqoop will instead modify an existing dataset in the database. Each input record is treated as an UPDATE statement that modifies an existing row. The row a statement modifies is determined by the column name(s) specified with --update-key. For example, consider the following table definition:

CREATE TABLE foo(
    id INT NOT NULL PRIMARY KEY,
    msg VARCHAR(32),
    bar INT);

Consider also a dataset in HDFS containing records like these:

0,this is a test,42
1,some more data,100
...

Running sqoop-export --table foo --update-key id --export-dir /path/to/data --connect … will run an export job that executes SQL statements based on the data like so:

UPDATE foo SET msg='this is a test', bar=42 WHERE id=0;
UPDATE foo SET msg='some more data', bar=100 WHERE id=1;
...

If an UPDATE statement modifies no rows, this is not considered an error; the export will silently continue. (In effect, this means that an update-based export will not insert new rows into the database.) Likewise, if the column specified with --update-key does not uniquely identify rows and multiple rows are updated by a single statement, this condition is also undetected.

The argument --update-key can also be given a comma separated list of column names, in which case Sqoop will match all keys from this list before updating any existing record.

(For example, given --update-key "c1,c2,c3", the generated condition takes the form WHERE c1=v1 AND c2=v2 AND c3=v3, so an existing row is updated only when all of the listed columns match.)

Depending on the target database, you may also specify the --update-mode argument with allowinsert mode if you want to update rows if they exist in the database already or insert rows if they do not exist yet.

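A hedged sketch of such an upsert-style export against the foo table defined above (connect string hypothetical):

$ sqoop export --connect jdbc:mysql://dbhost/db --table foo \
    --update-key id --update-mode allowinsert \
    --export-dir /path/to/data
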
Table 21. Input parsing arguments (see the file-format section above):

Argument                                 Description
--input-enclosed-by <char>               Sets a required field encloser
--input-escaped-by <char>                Sets the input escape character
--input-fields-terminated-by <char>      Sets the input field separator
--input-lines-terminated-by <char>       Sets the input end-of-line character
--input-optionally-enclosed-by <char>    Sets a field enclosing character

Table 22. Output line formatting arguments (explained above):

Argument                            Description
--enclosed-by <char>                Sets a required field enclosing character
--escaped-by <char>                 Sets the escape character
--fields-terminated-by <char>       Sets the field separator character
--lines-terminated-by <char>        Sets the end-of-line character
--mysql-delimiters                  Uses MySQL's default delimiter set: fields: ,  lines: \n  escaped-by: \  optionally-enclosed-by: '
--optionally-enclosed-by <char>     Sets a field enclosing character

Sqoop automatically generates code to parse and interpret records of the files containing the data to be exported back to the database. If these files were created with non-default delimiters (comma-separated fields with newline-separated records), you should specify the same delimiters again so that Sqoop can parse your files.

If you specify incorrect delimiters, Sqoop will fail to find enough columns per line. This will cause export map tasks to fail by throwing ParseExceptions.

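For instance, a hedged sketch for tab-separated input files (delimiters and names hypothetical):

$ sqoop export --connect jdbc:mysql://dbhost/db --table bar \
    --export-dir /data/bar \
    --input-fields-terminated-by '\t' --input-lines-terminated-by '\n'
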
Table 23. Code generation arguments:

Argument                     Description
--bindir <dir>               Output directory for compiled objects
--class-name <name>          Sets the generated class name; this overrides --package-name. When combined with --jar-file, sets the input class.
--jar-file <file>            Disable code generation; use specified jar
--outdir <dir>               Output directory for generated code
--package-name <name>        Put auto-generated classes in this package
--map-column-java <m>        Override default mapping from SQL type to Java type for configured columns

If the records to be exported were generated as the result of a previous import, then the original generated class can be used to read the data back. Specifying --jar-file and --class-name obviates the need to specify delimiters in this case.

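A hedged sketch reusing the class generated by an earlier import (the jar path and class name are hypothetical placeholders):

$ sqoop export --connect jdbc:mysql://dbhost/db --table bar \
    --export-dir /data/bar \
    --jar-file /path/to/bar.jar --class-name bar
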
The use of existing generated code is incompatible with --update-key; an update-mode export requires new code generation to perform the update. You cannot use --jar-file, and must fully specify any non-default delimiters.

9.4. Exports and Transactions

Exports are performed by multiple writers in parallel. Each writer uses a separate connection to the database; these have separate transactions from one another. Sqoop uses the multi-row INSERT syntax to insert up to 100 records per statement. Every 100 statements, the current transaction within a writer task is committed, causing a commit every 10,000 rows. This ensures that transaction buffers do not grow without bound, and cause out-of-memory conditions. Therefore, an export is not an atomic process. Partial results from the export will become visible before the export is complete.

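These batch sizes can be tuned through Hadoop properties; a hedged sketch (the property names sqoop.export.records.per.statement and sqoop.export.statements.per.transaction are assumed from Sqoop 1.x and should be verified against your version):

$ sqoop export -D sqoop.export.records.per.statement=100 \
    -D sqoop.export.statements.per.transaction=100 \
    --connect jdbc:mysql://dbhost/db --table bar --export-dir /data/bar
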
9.5. Exports and Failed Exports

Exports may fail for a number of reasons:

  • Loss of connectivity from the Hadoop cluster to the database (either due to hardware fault, or server software crashes)

  • Attempting to INSERT a row which violates a consistency constraint (for example, inserting a duplicate primary key value)

  • Attempting to parse an incomplete or malformed record from the HDFS source data

  • Attempting to parse records using incorrect delimiters

  • Capacity issues (such as insufficient RAM or disk space)

If an export map task fails due to these or other reasons, it will cause the export job to fail. The results of a failed export are undefined. Each export map task operates in a separate transaction. Furthermore, individual map tasks commit their current transaction periodically. If a task fails, the current transaction will be rolled back. Any previously-committed transactions will remain durable in the database, leading to a partially-complete export.

9.6. Example Invocations

A basic export to populate a table named bar:

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar  \
    --export-dir /results/bar_data

This example takes the files in /results/bar_data and injects their contents in to the bar table in the foo database on db.example.com. The target table must already exist in the database. Sqoop performs a set of INSERT INTO operations, without regard for existing content. If Sqoop attempts to insert rows which violate constraints in the database (for example, a particular primary key value already exists), then the export fails.

Another basic export to populate a table named bar with validation enabled:

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar  \
    --export-dir /results/bar_data --validate

An export that calls a stored procedure named barproc for every record in /results/bar_data would look like:

$ sqoop export --connect jdbc:mysql://db.example.com/foo --call barproc \
    --export-dir /results/bar_data

