Sqoop User Guide, Part 2: Import (continued)

7.2.8. File Formats

You can import data in one of two file formats: delimited text or SequenceFiles.

Delimited text is the default import format. You can also specify it explicitly by using the --as-textfile argument. This argument will write string-based representations of each record to the output files, with delimiter characters between individual columns and rows. These delimiters may be commas, tabs, or other characters. (The delimiters can be selected; see "Output line formatting arguments.") The following is the result of an example text-based import:

1,here is a message,2010-05-01
2,happy new year!,2010-01-01
3,another message,2009-11-12

Delimited text is appropriate for most non-binary data types. It also readily supports further manipulation by other tools, such as Hive.

SequenceFiles are a binary format that store individual records in custom record-specific data types. These data types are manifested as Java classes. Sqoop will automatically generate these data types for you. This format supports exact storage of all data in binary representations, and is appropriate for storing binary data (for example, VARBINARY columns), or data that will be principally manipulated by custom MapReduce programs (reading from SequenceFiles is higher-performance than reading from text files, as records do not need to be parsed).

Avro data files are a compact, efficient binary format that provides interoperability with applications written in other programming languages. Avro also supports versioning, so that when, e.g., columns are added or removed from a table, previously imported data files can be processed along with new ones.

By default, data is not compressed. You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument. This applies to SequenceFile, text, and Avro files.
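
For example, a minimal sketch compressing a SequenceFile import with Snappy; it assumes the Snappy codec is available on the cluster, and the connection string and table simply follow the examples used later in this guide:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --as-sequencefile --compress \
    --compression-codec org.apache.hadoop.io.compress.SnappyCodec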

7.2.9. Large Objects

Sqoop handles large objects (BLOB and CLOB columns) in particular ways. If this data is truly large, then these columns should not be fully materialized in memory for manipulation, as most columns are. Instead, their data is handled in a streaming fashion. Large objects can be stored inline with the rest of the data, in which case they are fully materialized in memory on every access, or they can be stored in a secondary storage file linked to the primary data storage. By default, large objects less than 16 MB in size are stored inline with the rest of the data. At a larger size, they are stored in files in the _lobs subdirectory of the import target directory. These files are stored in a separate format optimized for large record storage, which can accommodate records of up to 2^63 bytes each. The size at which lobs spill into separate files is controlled by the --inline-lob-limit argument, which takes a parameter specifying the largest lob size to keep inline, in bytes. If you set the inline LOB limit to 0, all large objects will be placed in external storage.
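
As a hedged sketch, the following forces every large object into external _lobs storage; the table name DOCUMENTS is only illustrative:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table DOCUMENTS \
    --inline-lob-limit 0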

Table 6. Output line formatting arguments:

Argument                             Description
--enclosed-by <char>                 Sets a required field enclosing character
--escaped-by <char>                  Sets the escape character
--fields-terminated-by <char>        Sets the field separator character
--lines-terminated-by <char>         Sets the end-of-line character
--mysql-delimiters                   Uses MySQL's default delimiter set: fields: ,  lines: \n  escaped-by: \  optionally-enclosed-by: '
--optionally-enclosed-by <char>      Sets a field enclosing character (applied only to fields that contain delimiter characters)

When importing to delimited files, the choice of delimiter is important. Delimiters which appear inside string-based fields may cause ambiguous parsing of the imported data by subsequent analysis passes. For example, the string "Hello, pleased to meet you" should not be imported with the end-of-field delimiter set to a comma.

Delimiters may be specified as:

  • a character (--fields-terminated-by X)

  • an escape character (--fields-terminated-by \t). Supported escape characters are:

    • \b (backspace)

    • \n (newline)

    • \r (carriage return)

    • \t (tab)

    • \" (double-quote)

    • \' (single-quote)

    • \\ (backslash)

    • \0 (NUL) - This will insert NUL characters between fields or lines, or will disable enclosing/escaping if used for one of the --enclosed-by, --optionally-enclosed-by, or --escaped-by arguments.

  • The octal representation of a UTF-8 character's code point. This should be of the form \0ooo, where ooo is the octal value. For example, --fields-terminated-by \001 would yield the ^A character.

  • The hexadecimal representation of a UTF-8 character's code point. This should be of the form \0xhhh, where hhh is the hex value. For example, --fields-terminated-by \0x10 would yield the carriage return character.

The default delimiters are a comma (,) for fields, a newline (\n) for records, no quote character, and no escape character. Note that this can lead to ambiguous/unparsable records if you import database records containing commas or newlines in the field data. For unambiguous parsing, both must be enabled. For example, via --mysql-delimiters.

If unambiguous delimiters cannot be presented, then use enclosing and escaping characters. The combination of (optional) enclosing and escaping characters will allow unambiguous parsing of lines. For example, suppose one column of a dataset contained the following values:

Some string, with a comma.
Another "string with quotes"

The following arguments would provide delimiters which can be unambiguously parsed:

$ sqoop import --fields-terminated-by , --escaped-by \\ --enclosed-by '\"' ...

(Note that to prevent the shell from mangling the enclosing character, we have enclosed that argument itself in single-quotes.)

The result of the above arguments applied to the above dataset would be:

"Some string, with a comma.","1","2","3"...
"Another \"string with quotes\"","4","5","6"...

Here the imported strings are shown in the context of additional columns ("1","2","3", etc.) to demonstrate the full effect of enclosing and escaping. The enclosing character is only strictly necessary when delimiter characters appear in the imported text. The enclosing character can therefore be specified as optional:

$ sqoop import --optionally-enclosed-by '\"' (the rest as above)...

Which would result in the following import:

"Some string, with a comma.",1,2,3...
"Another \"string with quotes\"",4,5,6...

[Note]

Even though Hive supports escaping characters, it does not handle escaping of new-line character. Also, it does not support the notion of enclosing characters that may include field delimiters in the enclosed string. It is therefore recommended that you choose unambiguous field and record-terminating delimiters without the help of escaping and enclosing characters when working with Hive; this is due to limitations of Hive’s input parsing abilities.

The --mysql-delimiters argument is a shorthand argument which uses the default delimiters for the mysqldump program. If you use the mysqldump delimiters in conjunction with a direct-mode import (with --direct), very fast imports can be achieved.
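
For instance, a minimal sketch combining the two (connection string and table follow the examples later in this guide):

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --direct --mysql-delimiters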

While the choice of delimiters is most important for a text-mode import, it is still relevant if you import to SequenceFiles with --as-sequencefile. The generated class' toString() method will use the delimiters you specify, so subsequent formatting of the output data will rely on the delimiters you choose.

Table 7. Input parsing arguments:

Argument                                 Description
--input-enclosed-by <char>               Sets a required field encloser
--input-escaped-by <char>                Sets the input escape character
--input-fields-terminated-by <char>      Sets the input field separator
--input-lines-terminated-by <char>       Sets the input end-of-line character
--input-optionally-enclosed-by <char>    Sets a field enclosing character

When Sqoop imports data to HDFS, it generates a Java class which can reinterpret the text files that it creates when doing a delimited-format import. The delimiters are chosen with arguments such as --fields-terminated-by; this controls both how the data is written to disk, and how the generated parse() method reinterprets this data. The delimiters used by the parse() method can be chosen independently of the output arguments, by using --input-fields-terminated-by, and so on. This is useful, for example, to generate classes which can parse records created with one set of delimiters, and emit the records to a different set of files using a separate set of delimiters.
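
As a hedged sketch, the import below should generate a class whose parse() method reads tab-separated records while the data written to HDFS (and the class' toString() output) uses commas; the connection string and table are placeholders borrowed from this guide's other examples:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --fields-terminated-by ',' --input-fields-terminated-by '\t'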

Table 8. Hive arguments:

Argument                       Description
--hive-home <dir>              Override $HIVE_HOME
--hive-import                  Import tables into Hive (uses Hive's default delimiters if none are set)
--hive-overwrite               Overwrite existing data in the Hive table
--create-hive-table            If set, then the job will fail if the target Hive table exists. By default this property is false.
--hive-table <table-name>      Sets the table name to use when importing to Hive
--hive-drop-import-delims      Drops \n, \r, and \01 from string fields when importing to Hive
--hive-delims-replacement      Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive
--hive-partition-key           Name of the Hive field on which partitions are sharded
--hive-partition-value <v>     String value that serves as the partition key for data imported into Hive in this job
--map-column-hive <map>        Override default mapping from SQL type to Hive type for configured columns

7.2.10. Importing Data Into Hive

Sqoop's import tool's main function is to upload your data into files in HDFS. If you have a Hive metastore associated with your HDFS cluster, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data's layout in Hive. Importing data into Hive is as simple as adding the --hive-import option to your Sqoop command line.

If the Hive table already exists, you can specify the --hive-overwrite option to indicate that the existing table in Hive must be replaced. After your data is imported into HDFS, or this step is omitted, Sqoop will generate a Hive script containing a CREATE TABLE operation defining your columns using Hive's types, and a LOAD DATA INPATH statement to move the data files into Hive's warehouse directory.
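
For instance, a minimal sketch that replaces the data of an existing Hive table of the same name (connection string and table follow this guide's other examples):

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hive-import --hive-overwrite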

The script will be executed by calling the installed copy of hive on the machine where Sqoop is run. If you have multiple Hive installations, or hive is not in your $PATH, use the --hive-home option to identify the Hive installation directory. Sqoop will use $HIVE_HOME/bin/hive from here.


[Note]

This function is incompatible with --as-avrodatafile and --as-sequencefile.

Even though Hive supports escaping characters, it does not handle escaping of the new-line character. Also, it does not support the notion of enclosing characters that may include field delimiters in the enclosed string. It is therefore recommended that you choose unambiguous field and record-terminating delimiters without the help of escaping and enclosing characters when working with Hive; this is due to limitations of Hive's input parsing abilities. If you do use --escaped-by, --enclosed-by, or --optionally-enclosed-by when importing data into Hive, Sqoop will print a warning message.

Hive will have problems using Sqoop-imported data if your database's rows contain string fields that have Hive's default row delimiters (\n and \r characters) or column delimiters (\01 characters) present in them. You can use the --hive-drop-import-delims option to drop those characters on import to give Hive-compatible text data. Alternatively, you can use the --hive-delims-replacement option to replace those characters with a user-defined string on import to give Hive-compatible text data. These options should only be used if you use Hive's default delimiters and should not be used if different delimiters are specified.
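
For example, a minimal sketch that strips Hive's delimiter characters from string fields during a Hive import (names follow this guide's other examples):

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hive-import --hive-drop-import-delims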

Sqoop will pass the field and record delimiters through to Hive. If you do not set any delimiters and do use --hive-import, the field delimiter will be set to ^A and the record delimiter will be set to \n to be consistent with Hive's defaults.

Sqoop will by default import NULL values as the string null. Hive, however, uses the string \N to denote NULL values, and therefore predicates dealing with NULL (like IS NULL) will not work correctly. You should append the parameters --null-string and --null-non-string in the case of an import job, or --input-null-string and --input-null-non-string in the case of an export job, if you wish to properly preserve NULL values. Because Sqoop is using those parameters in generated code, you need to properly escape the value \N to \\N:

$ sqoop import  ... --null-string '\\N' --null-non-string '\\N'

The table name used in Hive is, by default, the same as that of the source table. You can control the output table name with the --hive-table option.

Hive can put data into partitions for more efficient query performance. You can tell a Sqoop job to import data for Hive into a particular partition by specifying the --hive-partition-key and --hive-partition-value arguments. The partition value must be a string. Please see the Hive documentation for more details on partitioning.
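
As a hedged sketch, the partition key and value below (dept, engineering) are purely illustrative:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hive-import --hive-partition-key dept --hive-partition-value 'engineering'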

You can import compressed tables into Hive using the --compress and --compression-codec options. One downside to compressing tables imported into Hive is that many codecs cannot be split for processing by parallel map tasks. The lzop codec, however, does support splitting. When importing tables with this codec, Sqoop will automatically index the files for splitting and configuring a new Hive table with the correct InputFormat. This feature currently requires that all partitions of a table be compressed with the lzop codec.
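
A hedged sketch, assuming the third-party hadoop-lzo libraries are installed on the cluster and provide the com.hadoop.compression.lzo.LzopCodec class:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hive-import --compress \
    --compression-codec com.hadoop.compression.lzo.LzopCodec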

Table 9. HBase arguments:

Argument                      Description
--column-family <family>      Sets the target column family for the import
--hbase-create-table          If specified, create missing HBase tables
--hbase-row-key <col>         Specifies which input column to use as the row key
--hbase-table <table-name>    Specifies an HBase table to use as the target instead of HDFS

7.2.11. Importing Data Into HBase

Sqoop supports additional import targets beyond HDFS and Hive. Sqoop can also import records into a table in HBase.

By specifying --hbase-table, you instruct Sqoop to import to a table in HBase rather than a directory in HDFS. Sqoop will import data to the table specified as the argument to --hbase-table. Each row of the input table will be transformed into an HBase Put operation to a row of the output table. The key for each row is taken from a column of the input. By default Sqoop will use the split-by column as the row key column. If that is not specified, it will try to identify the primary key column, if any, of the source table. You can manually specify the row key column with --hbase-row-key. Each output column will be placed in the same column family, which must be specified with --column-family.
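
For example, a minimal sketch importing into an HBase table; the column family "info" and the row key column are illustrative placeholders:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hbase-table EMPLOYEES --column-family info \
    --hbase-row-key employee_id --hbase-create-table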

[Note]

This function is incompatible with direct import (parameter --direct).

If the target table and column family do not exist, the Sqoop job will exit with an error. You should create the target table and column family before running an import. If you specify --hbase-create-table, Sqoop will create the target table and column family if they do not exist, using the default parameters from your HBase configuration.

Sqoop currently serializes all values to HBase by converting each field to its string representation (as if you were importing to HDFS in text mode), and then inserts the UTF-8 bytes of this string in the target cell. Sqoop will skip all rows containing null values in all columns except the row key column.

Table 10. Code generation arguments:

Argument                   Description
--bindir <dir>             Output directory for compiled objects
--class-name <name>        Sets the generated class name. This overrides --package-name. When combined with --jar-file, sets the input class.
--jar-file <file>          Disable code generation; use specified jar
--outdir <dir>             Output directory for generated code
--package-name <name>      Put auto-generated classes in this package
--map-column-java <m>      Override default mapping from SQL type to Java type for configured columns

As mentioned earlier, a byproduct of importing a table to HDFS is a class which can manipulate the imported data. If the data is stored in SequenceFiles, this class will be used for the data’s serialization container. Therefore, you should use this class in your subsequent MapReduce processing of the data.

The class is typically named after the table; a table named foo will generate a class named foo. You may want to override this class name. For example, if your table is named EMPLOYEES, you may want to specify --class-name Employee instead. Similarly, you can specify just the package name with --package-name. The following import generates a class named com.foocorp.SomeTable:

$ sqoop import --connect <connect-str> --table SomeTable --package-name com.foocorp
The .java source file for your class will be written to the current working directory when you run sqoop. You can control the output directory with --outdir. For example, --outdir src/generated/.

The import process compiles the source into .class and .jar files; these are ordinarily stored under /tmp. You can select an alternate target directory with --bindir. For example, --bindir /scratch.

If you already have a compiled class that can be used to perform the import and want to suppress the code-generation aspect of the import process, you can use an existing jar and class by providing the --jar-file and --class-name options. For example:

$ sqoop import --table SomeTable --jar-file mydatatypes.jar \
    --class-name SomeTableType

This command will load the SomeTableType class out of mydatatypes.jar.

7.2.12. Additional Import Configuration Properties

There are some additional properties which can be configured by modifying conf/sqoop-site.xml. Properties can be specified the same as in Hadoop configuration files, for example:

  <property>
    <name>property.name</name>
    <value>property.value</value>
  </property>

They can also be specified on the command line in the generic arguments, for example:

sqoop import -D property.name=property.value ...

Table 11. Additional import configuration properties:

Argument                          Description
sqoop.bigdecimal.format.string    Controls how BigDecimal columns will be formatted when stored as a String. A value of true (default) will use toPlainString to store them without an exponent component (0.0000001); a value of false will use toString, which may include an exponent (1E-7).
sqoop.hbase.add.row.key           When set to false (default), Sqoop will not add the column used as a row key into the row data in HBase. When set to true, the column used as a row key will be added to the row data in HBase.
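
For instance, a minimal sketch passing one of these properties as a generic -D argument; the HBase table and column family are illustrative placeholders:

$ sqoop import -D sqoop.hbase.add.row.key=true \
    --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hbase-table EMPLOYEES --column-family info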

7.3. Example Invocations

The following examples illustrate how to use the import tool in a variety of situations.

A basic import of a table named EMPLOYEES in the corp database:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES

A basic import requiring a login:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --username SomeUser -P
Enter password: (hidden)

Selecting specific columns from the EMPLOYEES table:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --columns "employee_id,first_name,last_name,job_title"

Controlling the import parallelism (using 8 parallel tasks):

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    -m 8

Enabling the MySQL "direct mode" fast path:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --direct

Storing data in SequenceFiles, and setting the generated class name to com.foocorp.Employee:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --class-name com.foocorp.Employee --as-sequencefile

Specifying the delimiters to use in a text-mode import:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --fields-terminated-by '\t' --lines-terminated-by '\n' \
    --optionally-enclosed-by '\"'

Importing the data to Hive:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hive-import

Importing only new employees:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --where "start_date > '2010-01-01'"

Changing the splitting column from the default:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --split-by dept_id

Verifying that an import was successful:

$ hadoop fs -ls EMPLOYEES
Found 5 items
drwxr-xr-x   - someuser somegrp          0 2010-04-27 16:40 /user/someuser/EMPLOYEES/_logs
-rw-r--r--   1 someuser somegrp    2913511 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00000
-rw-r--r--   1 someuser somegrp    1683938 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00001
-rw-r--r--   1 someuser somegrp    7245839 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00002
-rw-r--r--   1 someuser somegrp    7842523 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00003

$ hadoop fs -cat EMPLOYEES/part-m-00000 | head -n 10
0,joe,smith,engineering
1,jane,doe,marketing
...

Performing an incremental import of new data, after having already imported the first 100,000 rows of a table:

$ sqoop import --connect jdbc:mysql://db.foo.com/somedb --table sometable \
    --where "id > 100000" --target-dir /incremental_dataset --append

An import of a table named EMPLOYEES in the corp database that uses validation to validate the import, using the table row count and the number of rows copied into HDFS:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp \
    --table EMPLOYEES --validate

8. sqoop-import-all-tables

8.1. Purpose

The import-all-tables tool imports a set of tables from an RDBMS to HDFS. Data from each table is stored in a separate directory in HDFS.

For the import-all-tables tool to be useful, the following conditions must be met:

  • Each table must have a single-column primary key.

  • You must intend to import all columns of each table.

  • You must not intend to use a non-default splitting column, nor impose any conditions via a WHERE clause.

8.2. Syntax

$ sqoop import-all-tables (generic-args) (import-args)
$ sqoop-import-all-tables (generic-args) (import-args)

Although the Hadoop generic arguments must precede any import arguments, the import arguments can be entered in any order with respect to one another.

Table 12. Common arguments:

Argument                              Description
--connect <jdbc-uri>                  Specify JDBC connect string
--connection-manager <class-name>     Specify connection manager class to use
--driver <class-name>                 Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>            Override $HADOOP_MAPRED_HOME
--help                                Print usage instructions
-P                                    Read password from console
--password <password>                 Set authentication password
--username <username>                 Set authentication username
--verbose                             Print more information while working
--connection-param-file <filename>    Optional properties file that provides connection parameters

Table 13. Import control arguments:

Argument                     Description
--as-avrodatafile            Imports data to Avro Data Files
--as-sequencefile            Imports data to SequenceFiles
--as-textfile                Imports data as plain text (default)
--direct                     Use direct import fast path
--direct-split-size <n>      Split the input stream every n bytes when importing in direct mode
--inline-lob-limit <n>       Set the maximum size for an inline LOB
-m,--num-mappers <n>         Use n map tasks to import in parallel
--warehouse-dir <dir>        HDFS parent for table destination
-z,--compress                Enable compression
--compression-codec <c>      Use Hadoop codec (default gzip)

These arguments behave in the same manner as they do when used for the sqoop-import tool, but the --table, --split-by, --columns, and --where arguments are invalid for sqoop-import-all-tables.
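
For example, a hedged sketch that applies a shared file format and HDFS parent directory to every imported table; the directory is an illustrative placeholder:

$ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp \
    --warehouse-dir /user/someuser/corp --as-sequencefile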

Table 14. Output line formatting arguments:

Argument                             Description
--enclosed-by <char>                 Sets a required field enclosing character
--escaped-by <char>                  Sets the escape character
--fields-terminated-by <char>        Sets the field separator character
--lines-terminated-by <char>         Sets the end-of-line character
--mysql-delimiters                   Uses MySQL's default delimiter set: fields: ,  lines: \n  escaped-by: \  optionally-enclosed-by: '
--optionally-enclosed-by <char>      Sets a field enclosing character

Table 15. Input parsing arguments:

Argument                                 Description
--input-enclosed-by <char>               Sets a required field encloser
--input-escaped-by <char>                Sets the input escape character
--input-fields-terminated-by <char>      Sets the input field separator
--input-lines-terminated-by <char>       Sets the input end-of-line character
--input-optionally-enclosed-by <char>    Sets a field enclosing character

Table 16. Hive arguments:

Argument                       Description
--hive-home <dir>              Override $HIVE_HOME
--hive-import                  Import tables into Hive (uses Hive's default delimiters if none are set)
--hive-overwrite               Overwrite existing data in the Hive table
--create-hive-table            If set, then the job will fail if the target Hive table exists. By default this property is false.
--hive-table <table-name>      Sets the table name to use when importing to Hive
--hive-drop-import-delims      Drops \n, \r, and \01 from string fields when importing to Hive
--hive-delims-replacement      Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive
--hive-partition-key           Name of the Hive field on which partitions are sharded
--hive-partition-value <v>     String value that serves as the partition key for data imported into Hive in this job
--map-column-hive <map>        Override default mapping from SQL type to Hive type for configured columns

Table 17. Code generation arguments:

Argument                  Description
--bindir <dir>            Output directory for compiled objects
--jar-file <file>         Disable code generation; use specified jar
--outdir <dir>            Output directory for generated code
--package-name <name>     Put auto-generated classes in this package

The import-all-tables tool does not support the --class-name argument. You may, however, specify a package with --package-name in which all generated classes will be placed.
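
For instance, a minimal sketch placing every generated class under the com.foocorp package used elsewhere in this guide:

$ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp \
    --package-name com.foocorp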

8.3. Example Invocations

Import all tables from the corp database:

$ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp

Verifying that it worked:

$ hadoop fs -ls
Found 4 items
drwxr-xr-x   - someuser somegrp       0 2010-04-27 17:15 /user/someuser/EMPLOYEES
drwxr-xr-x   - someuser somegrp       0 2010-04-27 17:15 /user/someuser/PAYCHECKS
drwxr-xr-x   - someuser somegrp       0 2010-04-27 17:15 /user/someuser/DEPARTMENTS
drwxr-xr-x   - someuser somegrp       0 2010-04-27 17:15 /user/someuser/OFFICE_SUPPLIES



