Hive CREATE TABLE Explained

Note: other Hive syntax is documented on the official Hive wiki; beginners are encouraged to study the first-hand material there.
Official site: https://cwiki.apache.org/confluence/display/Hive/Home#Home-UserDocumentation

Official documentation
Hive offers three ways to create a table:

Direct CREATE TABLE
CREATE TABLE AS SELECT (query-based)
CREATE TABLE LIKE

First, the official syntax:

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
 
 
 
[AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
 
 
 
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];
 
data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type  -- (Note: Available in Hive 0.7.0 and later)
 
primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
  | STRING
  | BINARY      -- (Note: Available in Hive 0.8.0 and later)
  | TIMESTAMP   -- (Note: Available in Hive 0.8.0 and later)
  | DECIMAL     -- (Note: Available in Hive 0.11.0 and later)
  | DECIMAL(precision, scale)  -- (Note: Available in Hive 0.13.0 and later)
  | DATE        -- (Note: Available in Hive 0.12.0 and later)
  | VARCHAR     -- (Note: Available in Hive 0.12.0 and later)
  | CHAR        -- (Note: Available in Hive 0.13.0 and later)
 
array_type
  : ARRAY < data_type >
 
map_type
  : MAP < primitive_type, data_type >
 
struct_type
  : STRUCT < col_name : data_type [COMMENT col_comment], ...>
 
union_type
   : UNIONTYPE < data_type, data_type, ... >  -- (Note: Available in Hive 0.7.0 and later)
 
row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
 
file_format:
  : SEQUENCEFILE
  | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
  | RCFILE      -- (Note: Available in Hive 0.6.0 and later)
  | ORC         -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET     -- (Note: Available in Hive 0.13.0 and later)
  | AVRO        -- (Note: Available in Hive 0.14.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
 
constraint_specification:
  : [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE ]
    [, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE ]




I. The difference between STRING and VARCHAR in Hive
Brief introduction:
Hive has two types for storing variable-length text.
1. Hive 0.12.0 introduced the VARCHAR type. A VARCHAR is created with a length specifier (1 to 65535) that defines the maximum number of characters the string may hold. If a string value converted to or assigned to a VARCHAR exceeds that length, it is silently truncated.
2. STRING stores variable-length text with no declared length limit. In theory a STRING can hold up to 2 GB, but performance may suffer when storing very large values; consider the large-object support provided by Sqoop instead.
II. Main differences:
1. VARCHAR is similar to STRING, but STRING has no length limit, while a VARCHAR length must be between 1 and 65535.
2. There are as yet no generic UDFs that operate directly on VARCHAR; STRING UDFs can be used instead, with VARCHAR values converted to STRING before being passed to the UDF.
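The truncation behavior described above can be seen with a simple CAST; a minimal sketch (the string literal is illustrative):

```sql
-- Casting a string longer than the declared VARCHAR length
-- silently truncates it to that length.
SELECT CAST('hello world' AS VARCHAR(5));   -- yields 'hello'
```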


As the grammar shows, there are three ways to create a table; we will walk through each one in turn.

1. Direct CREATE TABLE:
create table table_name(col_name data_type);



A more complex example

The clauses must follow the order defined in the grammar above.

CREATE EXTERNAL TABLE IF NOT EXISTS `dmp_clearlog` (
  `date_log` string COMMENT 'date in file', 
  `hour` int COMMENT 'hour', 
  `device_id` string COMMENT '(android) md5 imei / (ios) origin  mac', 
  `imei_orgin` string COMMENT 'origin value of imei', 
  `mac_orgin` string COMMENT 'origin value of mac', 
  `mac_md5` string COMMENT 'mac after md5 encrypt', 
  `android_id` string COMMENT 'androidid', 
  `os` string  COMMENT 'operating system', 
  `ip` string COMMENT 'remote real ip', 
  `app` string COMMENT 'appname' )
COMMENT 'cleared log of origin log'
PARTITIONED BY (
  `date` date COMMENT 'date used by partition'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
TBLPROPERTIES ('creator'='szh', 'create_time'='2018-06-07')
;
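Because the table above is partitioned, each day's data lives in its own partition directory. A sketch of registering one day's data (the date and the commented-out path are hypothetical, for illustration only):

```sql
-- Register a partition in the metastore
-- (hypothetical date value, for illustration only).
ALTER TABLE dmp_clearlog ADD IF NOT EXISTS
  PARTITION (`date` = '2018-06-07');

-- Or load a local file directly into that partition:
-- LOAD DATA LOCAL INPATH '/tmp/clearlog.csv'
--   INTO TABLE dmp_clearlog PARTITION (`date` = '2018-06-07');
```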


Below we explain the parts that differ from a relational database.
row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

Hive maps files on HDFS onto a table structure, using delimiters (such as ',', ';', or '^') to separate columns; row format specifies the serialization and deserialization rules.
For example, take the following records:
1,xiaoming,book-TV-code,beijing:chaoyang-shanghai:pudong
2,lilei,book-code,nanjing:jiangning-taiwan:taibei
3,lihua,music-book,heilongjiang:haerbin

The comma separates the columns (FIELDS TERMINATED BY char), giving id, name, hobby (an array, split by COLLECTION ITEMS TERMINATED BY char), and address (a map of key:value pairs, split by MAP KEYS TERMINATED BY char); LINES TERMINATED BY char separates individual records and defaults to the newline character.


file format (the storage format of the file on HDFS)
The default is TEXTFILE, i.e. plain text, which can be opened and read directly.

Based on the file content above, create a table t1:
create table t1(
    id      int
   ,name    string
   ,hobby   array<string>
   ,add     map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
;

Now insert some data.
Note: plain INSERT (as opposed to INSERT OVERWRITE from a query) is rarely used, because inserting even a single row launches a MapReduce job; here we use LOAD DATA instead.
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

Then load the file:
load data local inpath '/home/hadoop/Desktop/data' overwrite into table t1;
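Once loaded, the array and map columns can be addressed by index and by key; a sketch, with keys taken from the sample data above:

```sql
-- Access complex-typed columns: arrays by index, maps by key.
-- `add` is backquoted because ADD is also a Hive keyword.
SELECT id,
       name,
       hobby[0]         AS first_hobby,
       `add`['beijing'] AS beijing_district  -- NULL for rows without this key
FROM t1;
```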


external
A table created without EXTERNAL is a managed (internal) table; one created with EXTERNAL is an external table.
Differences:
Managed table data is managed by Hive itself; external table data is managed on HDFS by the user.
Managed table data is stored under hive.metastore.warehouse.dir (default: /user/hive/warehouse); the storage location of an external table is specified by the user.
Dropping a managed table deletes both the metadata and the stored data; dropping an external table deletes only the metadata, leaving the files on HDFS untouched.
Changes to a managed table are synced to the metastore automatically, whereas after modifying an external table's structure or partitions you need to repair it (MSCK REPAIR TABLE table_name;).
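A sketch of the repair flow, assuming a hypothetical partitioned external table named `logs`: when another job writes new partition directories straight into the table's HDFS location, MSCK makes them visible:

```sql
-- Partition directories written directly to HDFS are not visible
-- until they are registered in the metastore:
MSCK REPAIR TABLE logs;

-- Afterwards the newly discovered partitions appear:
SHOW PARTITIONS logs;
```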

Create an external table t2:
create external table t2(
    id      int
   ,name    string
   ,hobby   array<string>
   ,add     map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
location '/user/t2'
;


Load the data:
load data local inpath '/home/hadoop/Desktop/data' overwrite into table t2;
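To confirm whether a table is managed or external, DESCRIBE FORMATTED reports its type; a quick sketch:

```sql
-- "Table Type" shows MANAGED_TABLE or EXTERNAL_TABLE,
-- and "Location" shows where the data files live.
DESCRIBE FORMATTED t2;
```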


2. Creating a table from a query (CTAS)
The table is created with an AS SELECT clause: the result of the subquery is stored in the new table, so the table is populated with data.
Typically used for intermediate tables.
CREATE TABLE new_key_value_store
   ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
   STORED AS RCFile
   AS
SELECT (key % 1024) new_key, concat(key, value) key_value_pair
FROM key_value_store
SORT BY new_key, key_value_pair;

3. Creating a table with LIKE
Creates a table with exactly the same structure, but with no data.
Also commonly used for intermediate tables.
CREATE TABLE empty_key_value_store
LIKE key_value_store;
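A quick sketch to confirm that the copied schema is identical while the new table itself is empty:

```sql
-- The new table has the same columns and storage format...
SHOW CREATE TABLE empty_key_value_store;

-- ...but contains no rows.
SELECT COUNT(*) FROM empty_key_value_store;
```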
 

 
