Sqoop User Guide (Part 2): import

7. sqoop-import

7.1. Purpose

The import tool imports an individual table from an RDBMS to HDFS. Each row from a table is represented as a separate record in HDFS. Records can be stored as text files (one record per line), or in binary representation as Avro or SequenceFiles.

7.2. Syntax

$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)

While the Hadoop generic arguments must precede any import arguments, you can type the import arguments in any order with respect to one another.

[Note]

In this document, arguments are grouped into collections organized by function. Some collections are present in several tools (for example, the "common" arguments). An extended description of their functionality is given only on the first presentation in this document.

Table 1. Common arguments (already described above):

Argument                              Description
--connect <jdbc-uri>                  Specify JDBC connect string
--connection-manager <class-name>     Specify connection manager class to use
--driver <class-name>                 Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>            Override $HADOOP_MAPRED_HOME
--help                                Print usage instructions
-P                                    Read password from console
--password <password>                 Set authentication password
--username <username>                 Set authentication username
--verbose                             Print more information while working
--connection-param-file <filename>    Optional properties file that provides connection parameters

7.2.1. Connecting to a Database Server

Sqoop is designed to import tables from a database into HDFS. To do so, you must specify a connect string that describes how to connect to the database. The connect string is similar to a URL, and is communicated to Sqoop with the --connect argument. This describes the server and database to connect to; it may also specify the port. For example:

$ sqoop import --connect jdbc:mysql://database.example.com/employees

This string will connect to a MySQL database named employees on the host database.example.com. It is important that you do not use the URL localhost if you intend to use Sqoop with a distributed Hadoop cluster. The connect string you supply will be used on TaskTracker nodes throughout your MapReduce cluster; if you specify the literal name localhost, each node will connect to a different database (or more likely, no database at all). Instead, you should use the full hostname or IP address of the database host that can be seen by all your remote nodes.

You might need to authenticate against the database before you can access it. You can use the --username and --password or -P parameters to supply a username and a password to the database. For example:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --username aaron --password 12345
[Warning]

The --password parameter is insecure, as other users may be able to read your password from the command-line arguments via the output of programs such as ps. The -P argument will read a password from a console prompt, and is the preferred method of entering credentials. Credentials may still be transferred between nodes of the MapReduce cluster using insecure means.

Sqoop automatically supports several databases, including MySQL. Connect strings beginning with jdbc:mysql:// are handled automatically in Sqoop. (A full list of databases with built-in support is provided in the "Supported Databases" section. For some, you may need to install the JDBC driver yourself.)

You can use Sqoop with any other JDBC-compliant database. First, download the appropriate JDBC driver for the type of database you want to import, and install the .jar file in the $SQOOP_HOME/lib directory on your client machine. (This will be /usr/lib/sqoop/lib if you installed from an RPM or Debian package.) Each driver .jar file also has a specific driver class which defines the entry-point to the driver. For example, MySQL's Connector/J library has a driver class of com.mysql.jdbc.Driver. Refer to your database vendor-specific documentation to determine the main driver class. This class must be provided as an argument to Sqoop with --driver.

For example, to connect to a SQLServer database, first download the driver from microsoft.com and install it in your Sqoop lib path.

Then run Sqoop. For example:

$ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \
    --connect <connect-string> ...

When connecting to a database using JDBC, you can optionally specify extra JDBC parameters via a property file using the option --connection-param-file. The contents of this file are parsed as standard Java properties and passed into the driver while creating a connection.

[Note]

The parameters specified via the optional property file are only applicable to JDBC connections. Any fastpath connectors that use connections other than JDBC will ignore these parameters.
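
As an illustration, a minimal sketch of such a properties file and a command that references it (useUnicode and characterEncoding are ordinary MySQL Connector/J driver properties; the file name conn-params.properties is only an example):

# conn-params.properties, parsed as standard Java properties
# and handed to the JDBC driver when the connection is created
useUnicode=true
characterEncoding=UTF-8

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --connection-param-file conn-params.properties --table employees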

Table 2. Validation arguments (more details elsewhere in the guide)

Argument                                    Description
--validate                                  Enable validation of data copied, supports single table copy only.
--validator <class-name>                    Specify validator class to use.
--validation-threshold <class-name>         Specify validation threshold class to use.
--validation-failurehandler <class-name>    Specify validation failure handler class to use.
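
A minimal sketch of enabling validation on a single-table copy (the connect string and table name are the placeholder values used elsewhere in this section):

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --validate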


Table 3. Import control arguments:

Argument                            Description
--append                            Append data to an existing dataset in HDFS
--as-avrodatafile                   Imports data to Avro Data Files
--as-sequencefile                   Imports data to SequenceFiles
--as-textfile                       Imports data as plain text (default)
--boundary-query <statement>        Boundary query to use for creating splits
--columns <col,col,col…>            Columns to import from table
--direct                            Use direct import fast path
--direct-split-size <n>             Split the input stream every n bytes when importing in direct mode
--fetch-size <n>                    Number of entries to read from the database at once
--inline-lob-limit <n>              Set the maximum size for an inline LOB
-m,--num-mappers <n>                Use n map tasks to import in parallel
-e,--query <statement>              Import the results of statement
--split-by <column-name>            Column of the table used to split work units
--table <table-name>                Table to read
--target-dir <dir>                  HDFS destination dir
--warehouse-dir <dir>               HDFS parent for table destination
--where <where clause>              WHERE clause to use during import
-z,--compress                       Enable compression
--compression-codec <c>             Use Hadoop codec (default gzip)
--null-string <null-string>         The string to be written for a null value for string columns
--null-non-string <null-string>     The string to be written for a null value for non-string columns

The --null-string and --null-non-string arguments are optional. If not specified, then the string "null" will be used.

7.2.2. Selecting the Data to Import

Sqoop typically imports data in a table-centric fashion. Use the --table argument to select the table to import. For example, --table employees. This argument can also identify a VIEW or other table-like entity in a database.

By default, all columns within a table are selected for import. Imported data is written to HDFS in its "natural order;" that is, a table containing columns A, B, and C results in an import of data such as:

A1,B1,C1
A2,B2,C2
...

You can select a subset of columns and control their ordering by using the --columns argument. This should include a comma-delimited list of columns to import. For example: --columns "name,employee_id,jobtitle".

You can control which rows are imported by adding a SQL WHERE clause to the import statement. By default, Sqoop generates statements of the form SELECT <column list> FROM <table name>. You can append a WHERE clause to this with the --where argument. For example: --where "id > 400". Only rows where the id column has a value greater than 400 will be imported.
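
Putting the two together, a sketch of a complete command (the column list and WHERE clause are the ones from the examples above; the target directory is an arbitrary choice):

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --columns "name,employee_id,jobtitle" \
    --where "id > 400" --target-dir /user/someuser/employees_subset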

By default Sqoop will use the query select min(<split-by>), max(<split-by>) from <table name> to find out boundaries for creating splits. In some cases this query is not the most optimal, so you can specify any arbitrary query returning two numeric columns using the --boundary-query argument.
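
For instance, a hedged sketch of overriding the boundary query (the column id and the WHERE condition are illustrative; the query must return exactly two numeric values):

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --split-by id \
    --boundary-query "SELECT MIN(id), MAX(id) FROM employees WHERE id > 400"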

7.2.3. Free-form Query Imports

Sqoop can also import the result set of an arbitrary SQL query. Instead of using the --table, --columns and --where arguments, you can specify a SQL statement with the --query argument.

When importing a free-form query, you must specify a destination directory with --target-dir.

(Note: the destination of an import is HDFS, and an export likewise starts from HDFS; the degree of parallelism below simply means the number of map tasks, each running as its own task process.)

If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.

For example:

$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  --split-by a.id --target-dir /user/foo/joinresults

Alternately, the query can be executed once and imported serially, by specifying a single map task with -m 1:

$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  -m 1 --target-dir /user/foo/joinresults
[Note]

If you are issuing the query wrapped with double quotes ("), you will have to use \$CONDITIONS instead of just $CONDITIONS to disallow your shell from treating it as a shell variable. For example, a double quoted query may look like: "SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"
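
A sketch of a complete command using the double-quoted form (the query is the one from the note above; -m 1 is used so that no splitting column has to be chosen, and the target directory is illustrative):

$ sqoop import \
  --query "SELECT * FROM x WHERE a='foo' AND \$CONDITIONS" \
  -m 1 --target-dir /user/foo/x_results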

[Note]

The facility of using free-form query in the current version of Sqoop is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause. Use of complex queries, such as queries that have sub-queries or joins leading to ambiguous projections, can lead to unexpected results.


7.2.4. Controlling Parallelism

Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism greater than that available within your MapReduce cluster; tasks will run serially and will likely increase the amount of time required to perform the import. Likewise, do not increase the degree of parallelism higher than that which your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result.

(Note: this passage only says that the parallelism should not exceed what the MapReduce cluster and the database can support; it gives no concrete numbers or recipe, so some experimentation is needed.)
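
For example, a sketch that raises the parallelism to 8 map tasks (whether this actually helps depends on your cluster and database, as discussed above):

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees -m 8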

When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.

(In other words, the total range is divided evenly among the map tasks, one range per task process.)

If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument. For example, --split-by employee_id. Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.
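
A sketch of overriding the splitting column (employee_id is the column from the inline example; the connect string and table name are placeholders):

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --split-by employee_id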

7.2.5. Controlling the Import Process

By default, the import process will use JDBC, which provides a reasonable cross-vendor import channel. Some databases can perform imports in a more high-performance fashion by using database-specific data movement tools. For example, MySQL provides the mysqldump tool which can export data from MySQL to other systems very quickly. By supplying the --direct argument, you are specifying that Sqoop should attempt the direct import channel. This channel may be higher performance than using JDBC. Currently, direct mode does not support imports of large object columns (BLOB or CLOB columns).

When importing from PostgreSQL in conjunction with direct mode, you can split the import into separate files after individual files reach a certain size. This size limit is controlled with the --direct-split-size argument.
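
As a hedged illustration (the PostgreSQL connect string is a placeholder, and the 64 MB split size, given in bytes, is an arbitrary choice):

$ sqoop import --connect jdbc:postgresql://database.example.com/employees \
    --table employees --direct --direct-split-size 67108864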

By default, Sqoop will import a table named foo to a directory named foo inside your home directory in HDFS. For example, if your username is someuser, then the import tool will write to /user/someuser/foo/(files). You can adjust the parent directory of the import with the --warehouse-dir argument. For example:

$ sqoop import --connect <connect-str> --table foo --warehouse-dir /shared \
    ...

This command would write to a set of files in the /shared/foo/ directory.

You can also explicitly choose the target directory, like so:

$ sqoop import --connect <connect-str> --table foo --target-dir /dest \
    ...

This will import the files into the /dest directory. --target-dir is incompatible with --warehouse-dir.

When using direct mode, you can specify additional arguments which should be passed to the underlying tool. If the argument -- is given on the command-line, then subsequent arguments are sent directly to the underlying tool. For example, the following adjusts the character set used by mysqldump:

$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
    --direct -- --default-character-set=latin1

By default, imports go to a new target location. If the destination directory already exists in HDFS, Sqoop will refuse to import and overwrite that directory's contents. If you use the --append argument, Sqoop will import data to a temporary directory and then rename the files into the normal target directory in a manner that does not conflict with existing filenames in that directory.
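
A sketch of appending into an existing target directory (the paths and table name are placeholders):

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --target-dir /dest --append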

[Note]

When using the direct mode of import, certain database client utilities are expected to be present in the shell path of the task process. For MySQL the utilities mysqldump and mysqlimport are required, whereas for PostgreSQL the utility psql is required.


7.2.6. Controlling type mapping

Sqoop is preconfigured to map most SQL types to appropriate Java or Hive representatives. However, the default mapping might not be suitable for everyone and might be overridden by --map-column-java (for changing the mapping to Java) or --map-column-hive (for changing the Hive mapping).

Table 4. Parameters for overriding mapping

Argument                        Description
--map-column-java <mapping>     Override mapping from SQL to Java type for configured columns
--map-column-hive <mapping>     Override mapping from SQL to Hive type for configured columns

Sqoop expects a comma-separated list of mappings in the form <name of column>=<new type>. For example:

$ sqoop import ... --map-column-java id=String,value=Integer

Sqoop will raise an exception in case some configured mapping will not be used.

7.2.7. Incremental Imports

Sqoop provides an incremental import mode which can be used to retrieve only rows newer than some previously-imported set of rows.

The following arguments control incremental imports:

Table 5. Incremental import arguments:

Argument                Description
--check-column (col)    Specifies the column to be examined when determining which rows to import.
--incremental (mode)    Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value)    Specifies the maximum value of the check column from the previous import.

Sqoop supports two types of incremental imports: append and lastmodified. You can use the --incremental argument to specify the type of incremental import to perform.

You should specify append mode when importing a table where new rows are continually being added with increasing row id values. You specify the column containing the row's id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value. For example:

--incremental append --check-column num_iid --last-value 0

An alternate table update strategy supported by Sqoop is called lastmodified mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported. For example:

--incremental lastmodified --check-column created --last-value '2012-02-01 11:0:00'

This imports only rows whose created column is more recent than '2012-02-01 11:0:00'.

At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information.
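
Putting this together, a hedged sketch of a complete append-mode run built from the snippets above (the connect string, table name, and target directory are placeholders; num_iid is the check column from the example):

$ sqoop import --connect jdbc:mysql://database.example.com/shop \
    --table items --target-dir /user/someuser/items \
    --incremental append --check-column num_iid --last-value 0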

