Hive Study Notes: Data Operations

Hive data operations

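Hive command-line operations
hive -d,--define <key=value>  define a key-value variable that can be used in Hive commands
hive --database <databasename>  specify the database to use
hive -e "hql"  run an HQL statement without entering the CLI; useful in scripts
hive -f fileName  run the HQL statements stored in a file
hive -h hostname  connect to a Hive server on the given host (an option of older CLI releases)
hive -H,--help  print help information
hive --hiveconf <property=value>  set the value of a configuration property
hive --hivevar <key=value>  define a variable for substitution, same as --define
hive -i filename  initialization file; put Hive setup statements here, for example the add jar statements needed for custom functions
hive -S,--silent  silent mode; suppresses informational output
hive -v,--verbose  verbose mode; prints execution details such as the executed SQL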

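Example: query the test table and write the printed result to /home/data/select_result.txt
hive -S -e "select * from test" > /home/data/select_result.txt
The same query, with the executed SQL statement included in the output
hive -v -e "select * from test" > /home/data/select_result.txt
Run the SQL stored in a file
hive -f /home/colin/hive-1.2.1/select_test

In the Hive CLI, use list to see the files, jars, and archives in the distributed cache (for example, a jar added with add jar shows up under list jar).
In the Hive CLI, use source to run a file at a given path, for example a SQL file: source /home/colin/hive-1.2.1/select_test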

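Hive variables
Configuration variables
set val_Name=val_Value;
${hiveconf:val_Name}
Linux environment variables
${env:VAR_NAME}; run env in the shell to list all environment variables
Example: define the variable val_test, set it to yang, and use it as a query condition
set val_test=yang;
select * from test2 where name='${hiveconf:val_test}';
Show the HIVE_HOME environment variable
select '${env:HIVE_HOME}' from test;
Note: the path is printed once for every row in the test table.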
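
The same substitution can also be driven from a shell script. The following is a minimal sketch, assuming the test2 table and its name column from the example above; the file path is made up for illustration, and the script only prints the command it would run when no hive client is installed.

```shell
#!/bin/sh
# Hypothetical file path; test2 and its name column come from the example above.
query_file="${TMPDIR:-/tmp}/select_by_name.hql"

# Quote the heredoc delimiter so the shell leaves ${hiveconf:...} untouched
# and Hive performs the substitution itself.
cat > "$query_file" <<'EOF'
select * from test2 where name='${hiveconf:val_test}';
EOF

if command -v hive >/dev/null 2>&1; then
    # --hiveconf sets val_test for this invocation, like "set val_test=yang;"
    hive -S --hiveconf val_test=yang -f "$query_file"
else
    echo "hive not found; would run: hive -S --hiveconf val_test=yang -f $query_file"
fi
```

Keeping the query in a file and passing the value on the command line lets one script serve several values without editing the HQL.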

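Loading data into Hive
Loading data into managed (internal) tables
    Load at table creation time
    create table tableName as select col_1,col_2... from tableName2;
    Specify the data location at table creation time
    create table tableName(col_name type comment ...) location 'path';
       path is an HDFS path and must be the directory that contains the data files, not a file; everything under that directory becomes the table's data. No table directory is created under hive/warehouse, because the given path is used as the table directory.
       Note: a managed table takes ownership of the data at that location; dropping the table deletes the data together with the directory.
    Load local data
    load data local inpath 'localpath' [overwrite] into table tableName;
    Load data from HDFS
    load data inpath 'hdfspath' [overwrite] into table tableName;
         Note: this moves the data from the given HDFS location into the table directory.
    Copy data into place with Hadoop commands (from the Hive shell or from the Linux shell)
    hdfs dfs -copyFromLocal /home/data /data
    In the Hive shell: dfs -copyFromLocal /home/data data;
    Hadoop dfs commands run directly inside Hive; Linux commands run too when prefixed with !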

    Load data from a query
    insert [overwrite|into] table tableName select col1,col2... from tableName2 where ...
    from tableName2 insert [overwrite|into] table tableName select col1,col2... where ...
    Note: the selected column names may differ from those of the target table, because columns are matched by position. Hive does not check fields or types when data is loaded; the check happens only at query time.
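Loading data into external tables
    Specify the data location at table creation time (an external table does not own its data)
    create external table tableName(col_name type comment ...) location 'path';
    With insert statements, same as for managed tables
    With Hadoop commands, same as for managed tables
Loading data into partitioned tables
    Managed partitioned tables load like managed tables
    External partitioned tables load like external tables
    The difference is that a partition must be specified. For an external partitioned table the directory layout must match the table's partitions; data in a directory cannot be queried until the matching partition has been added, even if the files are already there. Once the layout matches, add the partition: alter table tableName add partition (dt=20150820);
    load data local inpath 'path' [overwrite] into table tableName partition(pName='..');
    insert [overwrite|into] table tableName partition(pName='..') select col1,col2.. from tableName2 where ...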
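
When an external partitioned table already has several partition directories in place, the add partition statements can be generated instead of typed one by one. This is a sketch only: the table name and the dt values are placeholders, and the script writes the statements to a file without running them.

```shell
#!/bin/sh
# Hypothetical table name and partition values, used only for illustration.
table="tableName"
hql_file="${TMPDIR:-/tmp}/add_partitions.hql"

# Start from an empty file, then write one "add partition" statement per date.
: > "$hql_file"
for dt in 20150820 20150821 20150822; do
    echo "alter table $table add if not exists partition (dt=$dt);" >> "$hql_file"
done

cat "$hql_file"
# To apply the statements with a real Hive client: hive -f "$hql_file"
```

The if not exists clause keeps the script safe to run again after some of the partitions have already been added.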

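Notes: if a row format delimiter is set to more than one character, only the first character takes effect.
      When data is loaded, a field whose value cannot be converted to the column type comes back as NULL.
      Likewise, when data is inserted, a selected value that cannot be converted to the target column type is inserted as NULL.
      In the files on HDFS, NULL is stored as \N.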
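
To make the \N convention concrete, here is a small sketch that builds a tab-separated text file of the kind Hive stores for a delimited table and counts the rows holding a NULL marker. The file name and the two columns (name, age) are invented for the example.

```shell
#!/bin/sh
# Hypothetical file name and columns (name, age), for illustration only.
data_file="${TMPDIR:-/tmp}/people.txt"

# Tab-separated rows; \N marks a NULL field, as Hive writes it in text files.
printf 'yang\t25\n'   >  "$data_file"
printf 'colin\t\\N\n' >> "$data_file"
printf '\\N\t30\n'    >> "$data_file"

# Count the rows that contain at least one NULL marker.
null_rows=$(awk -F '\t' '{ for (i = 1; i <= NF; i++) if ($i == "\\N") { n++; break } } END { print n + 0 }' "$data_file")
echo "rows with NULL: $null_rows"   # prints: rows with NULL: 2
```

A file like this could then be loaded with load data local inpath, and the \N fields would be read back as NULL.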

Exporting data from Hive
Export methods:
   Hadoop commands
          get
          text
   insert ... directory
       insert overwrite [local] directory 'path' [row format delimited fields terminated by '\t' lines terminated by '\n'] select col1,col2.. from tableName
            Note: with local the data is exported to the local file system, otherwise to HDFS. row format used to apply only to local exports; in Hive 1.2.1 it also works for exports to HDFS.
   Shell commands with pipes
   Third-party tools
Examples:
hdfs dfs -get /user/hive/warehouse/test4/* ./data/newdata
hdfs dfs -text /user/hive/warehouse/test5/* > ./data/newdata
(-text can output several formats, such as compressed or serialized files)
hive -S -e "select * from test4" | grep yang > ./data/newdata
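
The pipe approach can be extended with ordinary text tools. The sketch below turns tab-separated rows, which is how the Hive CLI prints query results by default, into CSV. The sample_rows function is a stand-in for the real hive call so the script runs without a cluster, and the conversion assumes no field contains a comma, tab, or quote.

```shell
#!/bin/sh
# Convert tab-separated rows (the Hive CLI's default output format) to CSV.
# Assumes no field contains a comma, a tab, or a quote.
tsv_to_csv() {
    tr '\t' ','
}

# Stand-in for: hive -S -e "select * from test4"
sample_rows() {
    printf 'yang\t25\ncolin\t30\n'
}

sample_rows | grep yang | tsv_to_csv   # prints: yang,25
# With a real Hive client, the same pipeline would be:
#   hive -S -e "select * from test4" | grep yang | tsv_to_csv > ./data/newdata.csv
```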

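Hive dynamic partitions
The partition values are not known in advance and are taken from the query result, so there is no need to add each partition with alter table.
Parameters to set for dynamic partitioning:
-- enable dynamic partitioning
set hive.exec.dynamic.partition=true;
-- two modes: strict requires at least one static partition, placed first; nonstrict has no such restriction
set hive.exec.dynamic.partition.mode=nonstrict;
-- maximum number of dynamic partitions created per node
set hive.exec.max.dynamic.partitions.pernode=10000;
-- maximum number of dynamic partitions created in total
set hive.exec.max.dynamic.partitions=100000;
-- maximum number of files one job may create
set hive.exec.max.created.files=150000;
-- limit on the number of files open at the same time
set dfs.datanode.max.xcievers=8192;

insert overwrite table test7 partition(dt) select name,time as dt from test6;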
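
For comparison, this is roughly what an insert looks like under strict mode, where a static partition has to come first. It is a sketch under assumptions: test8 is a hypothetical table partitioned by (country, dt) that does not appear elsewhere in these notes, and the script only writes the HQL to a file.

```shell
#!/bin/sh
# Hypothetical table test8 partitioned by (country, dt); not part of the notes above.
hql_file="${TMPDIR:-/tmp}/dynamic_partition.hql"

cat > "$hql_file" <<'EOF'
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=strict;
-- country is static and comes first; dt is dynamic and is taken
-- from the last column of the select list.
insert overwrite table test8 partition(country='cn', dt)
select name, time as dt from test6;
EOF

cat "$hql_file"
# To run it with a real Hive client: hive -f "$hql_file"
```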

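Table property operations
Rename a table:
alter table tableName rename to newName
Change a column:
alter table tableName change column old_col new_col newType comment '....' after colName
(to make it the first column, replace after colName with first)
Add columns:
alter table tableName add columns (c1 type comment '..',c2 type comment '...')
Modify table properties
View the table properties:
desc formatted tablename
These are the table properties that can be modified: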
Table Parameters:
   COLUMN_STATS_ACCURATE   false
   last_modified_by       colin
   last_modified_time     1440154819
   numFiles               0
   numRows                -1
   rawDataSize            -1
   totalSize              0
   transient_lastDdlTime   1440154819
Set a property:
alter table tableName set tblproperties('propertiesName'='.....');
For example, change the comment:
alter table tableName set tblproperties("comment"="xxxxx");
Change the SerDe properties:
Table without partitions
alter table tableName set serdeproperties('field.delim'='\t');
Partitioned table
alter table tableName partition(dt='xxxx') set serdeproperties('field.delim'='\t');
Change the location:
alter table tableName [partition(..)] set location 'path';
Convert between managed and external tables:

alter table tableName set tblproperties ('EXTERNAL'='TRUE|FALSE');
EXTERNAL must be written in uppercase.

For more property operations see: https://cwiki.apache.org/confluence/display/Hive/Home
