Sqoop User Guide (Chinese translation), Part 7

Sqoop uses JDBC to connect to databases and adheres to published standards as much as possible. For databases which do not support standards-compliant SQL, Sqoop uses alternate codepaths to provide functionality. In general, Sqoop is believed to be compatible with a large number of databases, but it is tested with only a few.

Translator's note: what exactly are "codepaths"?

Nonetheless, several database-specific decisions were made in the implementation of Sqoop, and some databases offer additional settings which are extensions to the standard.



This section describes the databases tested with Sqoop, any exceptions in Sqoop’s handling of each database relative to the norm, and any database-specific settings available in Sqoop.


While JDBC is a compatibility layer that allows a program to access many different databases through a common API, slight differences in the SQL language spoken by each database may mean that Sqoop can’t use every database out of the box, or that some databases may be used in an inefficient manner.


When you provide a connect string to Sqoop, it inspects the protocol scheme to determine appropriate vendor-specific logic to use. If Sqoop knows about a given database, it will work automatically. If not, you may need to specify the driver class to load via --driver. This will use a generic code path which will use standard SQL to access the database. Sqoop provides some databases with faster, non-JDBC-based access mechanisms. These can be enabled by specifying the --direct parameter.


Sqoop includes vendor-specific support for the following databases:


Sqoop may work with older versions of the databases listed, but we have only tested it with the versions specified above.


Even if Sqoop supports a database internally, you may still need to install the database vendor’s JDBC driver in your $SQOOP_HOME/lib path on your client. Sqoop can load classes from any jars in $SQOOP_HOME/lib on the client and will use them as part of any MapReduce jobs it runs; unlike older versions, you no longer need to install JDBC jars in the Hadoop library path on your servers.


Translator's note: several questions here. Is Sqoop split into a client and a server? Does Sqoop ship the JDBC jars itself, so they only need to be installed on the client, or does it not ship them at all, so they must be installed separately?


22.2. MySQL

JDBC Driver: MySQL Connector/J  // driver download link

MySQL v5.0 and above offers very thorough coverage by Sqoop. Sqoop has been tested with mysql-connector-java-5.1.13-bin.jar.


22.2.1. zeroDateTimeBehavior

MySQL allows values of '0000-00-00' for DATE columns, which is a non-standard extension to SQL. When communicated via JDBC, these values are handled in one of three different ways:


  • Convert to NULL.

  • Throw an exception in the client.

  • Round to the nearest legal date ('0001-01-01').

You specify the behavior by using the zeroDateTimeBehavior property of the connect string. If a zeroDateTimeBehavior property is not specified, Sqoop uses the convertToNull behavior.

You can override this behavior. For example:


$ sqoop import --table foo \
    --connect jdbc:mysql://db.example.com/someDb?zeroDateTimeBehavior=round
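The three behaviors can be sketched in a few lines of Python. This is purely an illustration of the semantics described above; the function name and logic are my own, not part of Sqoop or MySQL Connector/J:

```python
# Illustration of the three zeroDateTimeBehavior modes applied to
# MySQL's non-standard '0000-00-00' DATE value.
def handle_zero_date(value, behavior="convertToNull"):
    if value != "0000-00-00":
        return value                          # normal dates pass through
    if behavior == "convertToNull":
        return None                           # Sqoop's default behavior
    if behavior == "exception":
        raise ValueError("Zero date value prohibited")
    if behavior == "round":
        return "0001-01-01"                   # nearest legal date
    raise ValueError("unknown behavior: " + behavior)

print(handle_zero_date("0000-00-00"))             # None
print(handle_zero_date("0000-00-00", "round"))    # 0001-01-01
```

With `round`, a zero date silently becomes '0001-01-01' in the imported data, which is why convertToNull is usually the safer default.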

22.2.2. UNSIGNED columns

Columns with type UNSIGNED in MySQL can hold values between 0 and 2^32-1 (4294967295), but the database will report the data type to Sqoop as INTEGER, which can hold values between -2147483648 and +2147483647. Sqoop cannot currently import UNSIGNED values above 2147483647.

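The mismatch is easy to see numerically. A minimal sketch (the range check is my own illustration, not Sqoop code):

```python
# MySQL reports UNSIGNED INT as INTEGER, which Sqoop maps to a signed
# 32-bit Java int. Unsigned values above the signed maximum do not fit.
JAVA_INT_MIN, JAVA_INT_MAX = -2**31, 2**31 - 1   # -2147483648 .. 2147483647
MYSQL_UNSIGNED_MAX = 2**32 - 1                   # 4294967295

def fits_in_reported_type(value):
    return JAVA_INT_MIN <= value <= JAVA_INT_MAX

print(fits_in_reported_type(2147483647))    # True: largest importable value
print(fits_in_reported_type(4294967295))    # False: import would fail
```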

22.2.3. BLOB and CLOB columns

Sqoop’s direct mode does not support imports of BLOB, CLOB, or LONGVARBINARY columns. Use JDBC-based imports for these columns; do not supply the --direct argument to the import tool.



22.2.4. Importing views in direct mode

Sqoop currently does not support importing from views in direct mode. Use JDBC-based (non-direct) mode if you need to import a view (simply omit the --direct parameter).


22.2.5. Direct-mode Transactions

For performance, each writer will commit the current transaction approximately every 32 MB of exported data. You can control this by specifying the following argument before any tool-specific arguments: -D sqoop.mysql.export.checkpoint.bytes=size, where size is a value in bytes. Set size to 0 to disable intermediate checkpoints, but individual files being exported will continue to be committed independently of one another.


Translator's note: the "export" here refers to moving data from the HDFS side to an externally running MySQL.


Sometimes you need to export large data with Sqoop to a live MySQL cluster that is under a high load serving random queries from the users of your application. While data consistency issues during the export can be easily solved with a staging table, there is still a problem with the performance impact caused by the heavy export.


Translator's note: the above describes the impact an export can have. Does this impact occur only in direct mode?

Answer: no, non-direct mode is affected as well.

Is the performance impact on MySQL or on the Hadoop cluster?

Answer: both are affected.

First off, the resources of MySQL dedicated to the import process can affect the performance of the live product, both on the master and on the slaves. Second, even if the servers can handle the import with no significant performance impact (mysqlimport should be relatively "cheap"), importing big tables can cause serious replication lag in the cluster risking data inconsistency.


With -D sqoop.mysql.export.sleep.ms=time, where time is a value in milliseconds, you can let the server relax between checkpoints and the replicas catch up by pausing the export process after transferring the number of bytes specified in sqoop.mysql.export.checkpoint.bytes. Experiment with different settings of these two parameters to achieve an export pace that doesn’t endanger the stability of your MySQL cluster.

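The interplay of the two parameters can be sketched as follows. This is a simplified model of the pacing described above, not Sqoop's actual writer implementation; the `export_rows` function and its `commit` callback are hypothetical stand-ins:

```python
import time

# Commit roughly every `checkpoint_bytes` of exported data, then pause
# for `sleep_ms` so a loaded MySQL master and its replicas can catch up.
def export_rows(rows, commit, checkpoint_bytes=32 * 1024 * 1024, sleep_ms=0):
    since_checkpoint = 0
    commits = 0
    for row in rows:
        since_checkpoint += len(row)
        if checkpoint_bytes and since_checkpoint >= checkpoint_bytes:
            commit()                            # intermediate checkpoint
            commits += 1
            since_checkpoint = 0
            if sleep_ms:
                time.sleep(sleep_ms / 1000.0)   # let the server relax
    commit()                                    # final commit for the file
    return commits + 1
```

Larger checkpoint_bytes means fewer, bigger transactions; a non-zero sleep_ms spaces them out, trading export speed for cluster stability.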

[Important]

Note that any arguments to Sqoop that are of the form -D parameter=value are Hadoop generic arguments and must appear before any tool-specific arguments (for example, --connect, --table, etc.). Don’t forget that these parameters are only supported with the --direct flag set.


22.3. PostgreSQL

Sqoop supports a JDBC-based connector for PostgreSQL: http://jdbc.postgresql.org/  // driver download link

The connector has been tested using JDBC driver version "9.1-903 JDBC 4" with PostgreSQL server 9.1.


22.3.1. Importing views in direct mode

Sqoop currently does not support importing from views in direct mode. Use JDBC-based (non-direct) mode if you need to import a view (simply omit the --direct parameter).



22.4. Oracle

JDBC Driver: Oracle JDBC Thin Driver - Sqoop is compatible with ojdbc6.jar.  // driver download link

Sqoop has been tested with Oracle 10.2.0 Express Edition. Oracle is notable in its different approach to SQL from the ANSI standard, and its non-standard JDBC driver. Therefore, several features work differently.


22.4.1. Dates and Times

Oracle JDBC represents DATE and TIME SQL types as TIMESTAMP values. Any DATE columns in an Oracle database will be imported as a TIMESTAMP in Sqoop, and Sqoop-generated code will store these values in java.sql.Timestamp fields.

When exporting data back to a database, Sqoop parses text fields as TIMESTAMP types (with the form yyyy-mm-dd HH:MM:SS.ffffffff) even if you expect these fields to be formatted with the JDBC date escape format of yyyy-mm-dd. Dates exported to Oracle should be formatted as full timestamps.
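In practice this means a plain date in your export files must be widened to the full timestamp form. A small illustrative helper (my own, not part of Sqoop; Python's %f gives six fractional digits, while the Sqoop format allows more):

```python
from datetime import datetime

# Dates exported back to Oracle should be written as full timestamps
# (yyyy-mm-dd HH:MM:SS.f...), not as the JDBC date escape form yyyy-mm-dd.
def as_full_timestamp(d):
    return d.strftime("%Y-%m-%d %H:%M:%S.%f")

print(as_full_timestamp(datetime(2011, 7, 4)))   # 2011-07-04 00:00:00.000000
```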

Oracle also includes the additional date/time types TIMESTAMP WITH TIMEZONE and TIMESTAMP WITH LOCAL TIMEZONE. To support these types, the user’s session timezone must be specified. By default, Sqoop will specify the timezone "GMT" to Oracle. You can override this setting by specifying a Hadoop property oracle.sessionTimeZone on the command-line when running a Sqoop job. For example:

$ sqoop import -D oracle.sessionTimeZone=America/Los_Angeles \
    --connect jdbc:oracle:thin:@//db.example.com/foo --table bar

Note that Hadoop parameters (-D …) are generic arguments and must appear before the tool-specific arguments (--connect, --table, and so on).

Legal values for the session timezone string are enumerated at http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14225/applocaledata.htm#i637736.

22.5. Schema Definition in Hive

Hive users will note that there is not a one-to-one mapping between SQL types and Hive types. In general, SQL types that do not have a direct mapping (for example, DATE, TIME, and TIMESTAMP) will be coerced to STRING in Hive. The NUMERIC and DECIMAL SQL types will be coerced to DOUBLE. In these cases, Sqoop will emit a warning in its log messages informing you of the loss of precision.

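The coercion rules above can be sketched as a lookup with a warning side-channel. The mapping tables here are a simplified illustration of the behavior described in the text, not Sqoop's actual type-mapping code:

```python
# SQL types without a direct Hive equivalent are forced to STRING;
# NUMERIC and DECIMAL are forced to DOUBLE, each with a warning.
COERCED = {
    "DATE": "STRING",
    "TIME": "STRING",
    "TIMESTAMP": "STRING",
    "NUMERIC": "DOUBLE",
    "DECIMAL": "DOUBLE",
}
DIRECT = {"INTEGER": "INT", "VARCHAR": "STRING", "DOUBLE": "DOUBLE"}

def hive_type(sql_type, warnings):
    if sql_type in DIRECT:
        return DIRECT[sql_type]
    hive = COERCED.get(sql_type, "STRING")
    warnings.append("SQL type %s coerced to %s; precision may be lost"
                    % (sql_type, hive))
    return hive

w = []
print(hive_type("DECIMAL", w))   # DOUBLE
print(len(w))                    # 1 warning emitted
```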

23. Notes for specific connectors

23.1. MySQL JDBC Connector

This section contains information specific to MySQL JDBC Connector.


23.1.1. Upsert functionality

The MySQL JDBC Connector supports upsert functionality using the argument --update-mode allowinsert. To achieve that, Sqoop uses the MySQL clause INSERT INTO … ON DUPLICATE KEY UPDATE. This clause does not allow the user to specify which columns should be used to decide whether to update an existing row or add a new row. Instead, this clause relies on the table’s unique keys (the primary key belongs to this set). MySQL will try to insert the new row, and if the insertion fails with a duplicate unique key error it will update the appropriate row instead. As a result, Sqoop ignores the values specified in the parameter --update-key; however, the user needs to specify at least one valid column to turn on update mode itself.

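The key point is that the unique key, not --update-key, decides between insert and update. A dict keyed by the unique key is enough to model the ON DUPLICATE KEY UPDATE semantics (this is an illustrative emulation, not what Sqoop or MySQL execute):

```python
# Emulation of INSERT ... ON DUPLICATE KEY UPDATE: if the unique key
# already exists, the existing row is updated; otherwise a new row is
# inserted. No column list is consulted to make that decision.
def upsert(table, unique_key, row):
    key = row[unique_key]
    if key in table:
        table[key].update(row)        # duplicate key -> update existing row
    else:
        table[key] = dict(row)        # otherwise insert a new row

people = {}
upsert(people, "id", {"id": 1, "name": "alice"})
upsert(people, "id", {"id": 1, "name": "bob"})   # same key: updated, not duplicated
print(len(people), people[1]["name"])            # 1 bob
```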

23.2. Microsoft SQL Connector

23.2.1. Extra arguments

The list of all extra arguments supported by the Microsoft SQL Connector is shown below:


Table41.Supported Microsoft SQL Connector extra arguments:

Argument                 Description
--schema <name>          Schema name that Sqoop should use. Default is "dbo".
--table-hints <hints>    Table hints that Sqoop should use for data movement.

Translator's note: how should "schema" best be translated?

23.2.2. Schema support

If you need to work with tables that are located in non-default schemas, you can specify schema names via the --schema argument. Custom schemas are supported for both import and export jobs. For example:

$ sqoop import ... --table custom_table -- --schema custom_schema

Translator's note: a small remaining portion has not yet been translated; stay tuned.
