Installation and Deployment (8): Hive + Sqoop Installation, Deployment, and Usage

Hive + Sqoop Installation

hadoop 2.7.2

spark 2.0.0

zookeeper 3.4.8

kafka 0.10.0.0

hbase 1.2.2

jdk1.8.0_101

ubuntu 14.04.04 x64



References:
http://blog.csdn.net/yinedent/article/details/48275407
http://blog.csdn.net/suijiarui/article/details/51137316


I. Hive 2.1.0
1. Download
https://mirrors.tuna.tsinghua.edu.cn/apache/hive/stable-2/
https://mirrors.tuna.tsinghua.edu.cn/apache/hive/stable-2/apache-hive-2.1.0-bin.tar.gz


2. Extract
root@py-server:/server# tar xvzf apache-hive-2.1.0-bin.tar.gz 
root@py-server:/server# mv apache-hive-2.1.0-bin/ hive


3. Environment variables
vi ~/.bashrc
export HIVE_HOME=/server/hive
export PATH=$PATH:$HIVE_HOME/bin
source ~/.bashrc


4. Configuration
4.1 Copy the configuration files out of the templates
root@py-server:/server/hive/conf# cp hive-exec-log4j2.properties.template hive-exec-log4j2.properties
root@py-server:/server/hive/conf# cp hive-log4j2.properties.template hive-log4j2.properties
root@py-server:/server/hive/conf# cp hive-env.sh.template hive-env.sh
root@py-server:/server/hive/conf# cp hive-default.xml.template hive-site.xml
root@py-server:/server/hive/conf# ll
total 504
drwxr-xr-x 2 root root    4096 Aug 12 15:03 ./
drwxr-xr-x 9 root root    4096 Aug 12 14:40 ../
-rw-r--r-- 1 root staff   1596 Jun  3 18:43 beeline-log4j2.properties.template
-rw-r--r-- 1 root staff 225729 Jun 17 08:03 hive-default.xml.template
-rw-r--r-- 1 root root    2378 Aug 12 15:03 hive-env.sh
-rw-r--r-- 1 root staff   2378 Jun  3 18:43 hive-env.sh.template
-rw-r--r-- 1 root root    2299 Aug 12 15:02 hive-exec-log4j2.properties
-rw-r--r-- 1 root staff   2299 Jun  3 18:43 hive-exec-log4j2.properties.template
-rw-r--r-- 1 root root    2950 Aug 12 15:02 hive-log4j2.properties
-rw-r--r-- 1 root staff   2950 Jun  3 18:43 hive-log4j2.properties.template
-rw-r--r-- 1 root root  225729 Aug 12 15:03 hive-site.xml
-rw-r--r-- 1 root staff   2049 Jun 10 17:00 ivysettings.xml
-rw-r--r-- 1 root staff   2768 Jun  3 18:43 llap-cli-log4j2.properties.template
-rw-r--r-- 1 root staff   4241 Jun  3 18:43 llap-daemon-log4j2.properties.template
-rw-r--r-- 1 root staff   2662 Jun  9 02:47 parquet-logging.properties


4.2 Edit hive-env.sh
# HADOOP_HOME=${bin}/../../hadoop
HADOOP_HOME=/server/hadoop
# Hive Configuration Directory can be controlled by:
# export HIVE_CONF_DIR=
HIVE_CONF_DIR=/server/hive/conf


4.3.1 Edit hive-site.xml
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>fm1106</value>
    <description>password to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://py-server:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=utf8&amp;useSSL=true</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>Username to use against metastore database</description>
  </property>
  <property>
    <name>hive.zookeeper.quorum</name>
    <value>py-server,py-11,py-12,py-13,py-14</value>
    <description>
      List of ZooKeeper servers to talk to. This is needed for: 
      1. Read/write locks - when hive.lock.manager is set to 
      org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager, 
      2. When HiveServer2 supports service discovery via Zookeeper.
      3. For delegation token storage if zookeeper store is used, if
      hive.cluster.delegation.token.store.zookeeper.connectString is not set
      4. LLAP daemon registry service
    </description>
  </property>
[This block can be left unchanged:
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://py-server:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
  ]
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/server/tmp/hive</value>
    <description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/&lt;username&gt; is created, with ${hive.scratch.dir.permission}.</description>
  </property>
    <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/server/tmp/hive</value>
    <description>Local scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/server/tmp/hive</value>
    <description>Temporary local directory for added resources in the remote file system.</description>
  </property>
Note: Thrift is only needed for remote metastore access. If you have not set up the Thrift metastore service, do not change this property, otherwise you will get errors.
Also note that inside the XML file the & characters in the JDBC URL must be escaped as &amp;, and that useSSL=true avoids the MySQL warning: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.

Note:
If you get an SSL error, change hive-site.xml as follows:
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://py-server:3306/hive?useSSL=false</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
  </property>


4.3.1.2
Create the directories and grant permissions:
mkdir -p /server/tmp/hive
chmod -R 775 /server/tmp/hive
hadoop fs -mkdir -p /tmp
hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse
References:
https://chu888chu888.gitbooks.io/hadoopstudy/content/Content/8/chapter0807.html
http://www.aboutyun.com/thread-10937-1-1.html
http://blog.csdn.net/suijiarui/article/details/51137316


4.3.2 Add the MySQL JDBC driver jar
If you use MySQL as the metastore database, put the MySQL JDBC connector jar into ${HIVE_HOME}/lib:
cp ${JAVA_HOME}/lib/mysql-connector-java-5.1.39-bin.jar $HIVE_HOME/lib


4.3.3 Grant MySQL privileges
MySQL must allow remote login for the Hive user. If MySQL and Hive run on the same server, local login must also be granted.
root@py-server:/server# mysql -uroot -p
mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'localhost'IDENTIFIED BY 'fm1106' WITH GRANT OPTION;
Query OK, 0 rows affected, 1 warning (0.36 sec)


mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'py-server'IDENTIFIED BY 'fm1106' WITH GRANT OPTION;
Query OK, 0 rows affected, 1 warning (0.00 sec)


mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'fm1106' WITH GRANT OPTION;
Query OK, 0 rows affected, 1 warning (0.00 sec)


mysql> flush privileges;
Query OK, 0 rows affected (0.25 sec)


4.4 Create directories
If you changed the log directory, create it as well, e.g. mkdir logs
root@py-server:/server# hadoop fs -mkdir /user/hive
root@py-server:/server# hadoop fs -mkdir /user/hive/warehouse


4.5 Replace the ZooKeeper jar
root@py-server:/server# cp /server/zookeeper/zookeeper-3.4.6.jar $HBASE_HOME/lib
root@py-server:/server# cp /server/zookeeper/zookeeper-3.4.6.jar $HIVE_HOME/lib
If the jar is already there, there is no need to copy it.
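To check whether a ZooKeeper jar is already present (a quick check, assuming the standard lib layout):
ls $HIVE_HOME/lib | grep zookeeper
ls $HBASE_HOME/lib | grep zookeeper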


4.6 Create the metastore database
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| gfdata             |
| hive               |
| mysql              |
| performance_schema |
| stockdata          |
| sys                |
+--------------------+
7 rows in set (0.05 sec)
If the hive database already exists, there is no need to run create database hive;
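If it does not exist yet, create it first (a minimal sketch, run before initializing the schema):
mysql> create database hive;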


4.7 Start Hive
4.7.1
Initialize the metastore schema before first use:
root@py-server:/server/tmp# schematool -initSchema -dbType mysql
4.7.2
Start the Hive metastore service first, in the background:
root@py-server:/server/tmp# hive --service metastore&
[1] 10609
root@py-server:/server/tmp# Starting Hive Metastore Server
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/server/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/server/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
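To confirm that the metastore service is up, you can check that it is listening on its default port 9083 (a quick check, assuming the default metastore port configured above):
netstat -nlp | grep 9083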


4.7.3 Start the Hive CLI
root@py-server:/server/tmp# hive


4.7.4 Query
hive> show tables;
OK
Time taken: 1.219 seconds
hive> show databases;
OK
default
Time taken: 0.013 seconds, Fetched: 1 row(s)
hive> 
4.7.5 Create a database
hive>  create database testdb;
OK
Time taken: 0.303 seconds
hive>  show databases;
OK
default
testdb
Time taken: 0.011 seconds, Fetched: 2 row(s)
4.7.6 Create a table
Reference: http://blog.itpub.net/26143577/viewspace-720092/
hive> create table test_hive2 (id int,id2 int,name string)  row format delimited fields terminated by '\t';
OK
Time taken: 0.601 seconds
4.7.7 Load a text file
References: http://blog.csdn.net/dst1213/article/details/51419072
http://blog.csdn.net/yinedent/article/details/48275407
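A minimal sketch of loading a tab-delimited text file into the test_hive2 table created in 4.7.6 (the file path /server/tmp/test_hive2.txt and its contents are hypothetical):
# /server/tmp/test_hive2.txt, tab-separated columns: id, id2, name
# 1	100	user_a
# 2	200	user_b
hive> load data local inpath '/server/tmp/test_hive2.txt' into table test_hive2;
hive> select * from test_hive2;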




II. Sqoop
Note that 1.99.7 is not compatible with 1.4.6 and not feature complete; it is not intended for production deployment.
For a production environment, follow the recommendation and use 1.4.6; if you want to install 1.99.7, see my earlier article.
http://apache.fayea.com/sqoop/1.4.6/
http://apache.fayea.com/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz


1 Extract
root@py-server:/server# tar xvzf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz 
root@py-server:/server# mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha/ sqoop


2 Environment variables
vi ~/.bashrc
export SQOOP_HOME=/server/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
source ~/.bashrc


3 Configuration
cd $SQOOP_HOME/conf
cp sqoop-env-template.sh sqoop-env.sh
If the following components are installed, set the corresponding paths in sqoop-env.sh:
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/server/hadoop


#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/server/hadoop


#set the path to where bin/hbase is available
export HBASE_HOME=/server/hbase


#Set the path to where bin/hive is available
export HIVE_HOME=/server/hive


#Set the path for where zookeper config dir is
export ZOOCFGDIR=/server/zookeeper/conf


4 Add the MySQL JDBC jar
cp ${JAVA_HOME}/lib/mysql-connector-java-5.1.39-bin.jar $SQOOP_HOME/lib


5 Environment variables
vi ~/.bashrc
export CLASSPATH=$CLASSPATH:$SQOOP_HOME/lib
source ~/.bashrc


6 Test
6.1 List databases
root@py-server:/server/zookeeper/conf# sqoop list-databases --connect jdbc:mysql://py-server:3306/?useSSL=false --username root -P
Result:
root@py-server:/server/zookeeper/conf# sqoop list-databases --connect jdbc:mysql://py-server:3306/?useSSL=false --username root -P
Warning: /server/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /server/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
16/08/12 17:50:15 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Enter password: 
16/08/12 17:50:20 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
gfdata
hive
mysql
performance_schema
stockdata
sys
root@py-server:/server/zookeeper/conf# 
Adding useSSL=false suppresses the long list of SSL warnings; you can also simply omit it:
sqoop list-databases --connect jdbc:mysql://py-server:3306/ --username root -P
7 Usage
7.1 List tables
sqoop list-tables --connect jdbc:mysql://py-server:3306/gfdata?useSSL=false --username root --password abc
Result:
root@py-server:/server/zookeeper/conf# sqoop list-tables --connect jdbc:mysql://py-server:3306/gfdata?useSSL=false --username root --password fm1106
Warning: /server/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /server/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
16/08/12 17:54:09 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
16/08/12 17:54:09 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/08/12 17:54:09 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
000010
000011
000030
000059
000065
000420


7.2 Copy a MySQL table's structure into Hive (e.g. copy MySQL test.t1 into the Hive test database as mysql_t1)
sqoop create-hive-table --connect jdbc:mysql://py-server:3306/gfdata?useSSL=false --table 000010 --username root --password 123456 --hive-table gfdata.mysql_000010
Time taken: 0.015 seconds, Fetched: 1 row(s)
hive> show tables;
OK
mysql_000010
mysql_603608


Note: this command can be executed multiple times without error.


7.3 Import data from a MySQL table into Hive
# Create the database:
hive> create database gfdata;
OK
Time taken: 0.861 seconds
hive> show databases;
OK
default
gfdata
testdb
Time taken: 0.132 seconds, Fetched: 3 row(s)


# Append data
Get the table structure via 7.2, or create the table by hand, e.g. hive> create table mysql_603608 ... (a sketch follows):
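A sketch of the manual create, assuming the same columns as the s000010 table defined in step 7.4 (the column names and types are assumptions; Hive's default field delimiter \001 matches what the Sqoop Hive import writes):
hive> create table mysql_603608 (`date` string, open double, high double, low double, close double, volume bigint, amount bigint);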


root@py-server:/server/zookeeper/conf# sqoop import --connect jdbc:mysql://py-server:3306/gfdata?useSSL=false --username root --password fm1106 --table 603608 --hive-import --hive-table gfdata.mysql_603608
Result:
16/08/12 21:45:28 INFO hive.HiveImport: Export directory is contains the _SUCCESS file only, removing the directory.
hive> select * from mysql_603608;
OK
2016-02-18 11.51 13.86 11.51 13.86 58617 819801
2016-02-19 15.27 15.27 15.27 15.27 42024 652212
Note: by default Hive does not print column names.
Reference: http://blog.csdn.net/qiaochao911/article/details/9035225
Solution: see below, and the appendix at the end of this post.
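A quick workaround in the current CLI session uses the standard hive.cli.print.header setting (the more complete row-to-vertical patch is described in the appendix below):
hive> set hive.cli.print.header=true;
hive> select * from mysql_603608 limit 2;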


# Overwrite data
hive>create database test;
hive>use test;
hive>create table test
The statements above can be replaced by step 7.2.
sqoop import --connect jdbc:mysql://py-server:3306/test?useSSL=false --username root --password 123456 --table t1 --hive-import --hive-overwrite --hive-table test.mysql_t1
Note: if the MySQL table has no primary key, you need to add the --autoreset-to-one-mapper parameter (a sketch follows).
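For a source table without a primary key, the same import with the flag added (a sketch, reusing the test.t1 example above):
sqoop import --connect jdbc:mysql://py-server:3306/test?useSSL=false --username root --password 123456 --table t1 --autoreset-to-one-mapper --hive-import --hive-overwrite --hive-table test.mysql_t1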


7.4 Export data from a Hive table into MySQL
This works the same as exporting from HDFS into MySQL. Note that --table must refer to an empty table, created in MySQL beforehand.
step 1
mysql> create database gftest;
Query OK, 1 row affected (0.00 sec)


step 2
mysql> create table s000010 (date DATE PRIMARY KEY NOT NULL, open Double, high Double, low Double, close Double, volume Bigint, amount Bigint);
Query OK, 0 rows affected (0.19 sec)
mysql> show tables;
+------------------+
| Tables_in_gftest |
+------------------+
| s000010          |
+------------------+
1 row in set (0.00 sec)




step 3
root@py-server:/server/zookeeper/conf# hadoop fs -ls /user/hive/warehouse
Found 2 items
drwxrwxr-x   - root supergroup          0 2016-08-12 21:08 /user/hive/warehouse/gfdata.db
drwxrwxr-x   - root supergroup          0 2016-08-12 17:26 /user/hive/warehouse/test_hive2
root@py-14:~# hadoop fs -ls /user/hive/warehouse/gfdata.db
Found 1 items
drwxrwxr-x   - root supergroup          0 2016-08-12 21:08 /user/hive/warehouse/gfdata.db/mysql_000010


root@py-server:/server/zookeeper/conf# sqoop export --connect jdbc:mysql://py-server:3306/hive?useSSL=false --username root --password s123456 --table s603608 --export-dir /user/hive/warehouse/gfdata.db/mysql_603608  --fields-terminated-by '\001'
[Note]: you must add --fields-terminated-by '\001', otherwise the export fails with Error: java.io.IOException: Can't export data, please check failed map task logs
15/12/02 02:01:13 INFO mapreduce.ExportJobBase: Exported 0 records.
15/12/02 02:01:13 ERROR tool.ExportTool: Error during export: Export job failed!




References:
http://blog.csdn.net/dst1213/article/details/51419072
http://blog.csdn.net/yinedent/article/details/48275407
http://blog.csdn.net/wzy0623/article/details/50921702






#################################################
By default, Hive queries do not display column names.
Reference: http://blog.csdn.net/qiaochao911/article/details/9035225
When a table has many columns, it is often hard to tell which value belongs to which column, which makes everyday troubleshooting and problem-locating inconvenient. At a colleague's request, I read the Hive CLI source code and made some small adjustments to add column-header printing and a row-to-vertical display mode.


Before enabling the row-to-vertical display:


hive>
>
> select * from example_table where dt='2012-03-31-02' limit 2;
OK
1333133185 0cf49387a23d9cec25da3d76d6988546 3CD5E9A1721861AE6688260ED26206C2 guanwang 1.1 3d3b0a5eca816ba47fc270967953f881 192.168.1.2.13331317500.0 NA 0 31/Mar/2012:02:46:44 +080 222.71.121.111 2012-03-31-02
1333133632 0cf49387a23d9cec25da3d76d6988546 3CD5E9A1721861AE6688260ED26206C2 10002 1.1 e4eec776b973366be21518b709486f3c 110.6.100.57.1332909301867.6 NA 0 31/Mar/2012:02:54:16 +080 110.6.74.219 2012-03-31-02
Time taken: 0.62 seconds
After enabling the row-to-vertical display:


set hive.cli.print.header=true; // print column names
set hive.cli.print.row.to.vertical=true; // enable row-to-vertical display; column-name printing must be enabled first
set hive.cli.print.row.to.vertical.num=1; // number of columns displayed per line
> select * from example_table where pt='2012-03-31-02' limit 2;
OK
datetime col_1 col_2 channel version pcs cookie trac new time ip
datetime=1333133185
col_1=0cf49387a23d9cec25da3d76d6988546
col_2=3CD5E9A1721861AE6688260ED26206C2
channel=test_name1
version=1.1
pcs=3d3b0a5eca816ba47fc270967953f881
cookie=192.168.1.2.13331317500.0
trac=NA
new=0
time=31/Mar/2012:02:46:44 +080
ip=222.71.121.111
-------------------------Gorgeous-split-line-----------------------
datetime=1333133632
col_1=0cf49387a23d9cec25da3d76d6988546
col_2=3CD5E9A1721861AE6688260ED26206C2
channel=test_name2
version=1.1
pcs=e4eec776b973366be21518b709486f3c
cookie=110.6.100.57.1332909301867.6
trac=NA
new=0
time=31/Mar/2012:02:54:16 +080
ip=110.6.74.219
--------------------------Gorgeous-split-line-----------------------
Time taken: 0.799 seconds
With the row-to-vertical display enabled, each row is printed one column per line with the column name prefixed to the value, which makes it much easier to track down problems.