Building a data warehouse with Hadoop + Hive, plus some tests

Please credit the source when reposting: www.bagbaby.cn
http://hi.baidu.com/dd_shop

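Background and current state
The current logging setup hardly counts as a system: all the logs simply sit on a few servers, with the data shared over NFS and processed in place. This causes a number of problems:
a) Data is stored haphazardly, with no systematic directory management;
b) Storage space is limited, and expanding it is very troublesome;
c) CV/PV and other logs are scattered across locations, which makes merging them awkward;
d) Media service log data is stored in one place, and its sheer volume makes lightweight backups difficult;
e) Data loss happens from time to time, with no way to recover;
f) Data retrieval is slow and is often the bottleneck in computation;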

Requirements analysis
a) Log collection & storage
Log name | Daily log size | Backup | Owner
.............

b) Log preprocessing
See the attachment: it covers only the PV and CV logs; I have not seen the other log formats.
Excel: ****log.xlsx

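Data preprocessing
d) Normalizing the data (cleaning)
Handling missing data
In the PV logs, many records turn out to have empty attribute values, for example areaID, IP, referUrl, open, and so on. Empty attribute values can be handled with the following methods:
Ignore the record. If a record has a missing attribute value, exclude it from the data mining process, especially when the missing value is the class attribute that the main classification relies on. This method is not very effective, though, particularly when the proportion of records with missing values differs widely from one attribute to another.
Fill in missing values manually. This is generally time-consuming, and for large data sets with many missing values it is hardly feasible.
Fill in missing values with a default. Every missing value of an attribute is replaced with one predetermined value.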
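The first and third methods above can be sketched in a few lines of Java. This is only an illustration under assumptions: the four-column layout (areaID, IP, referUrl, open), the tab separator, the default values, and the class and method names are all hypothetical and do not come from the real PV log format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MissingValueFill {
    // Hypothetical defaults, one per column: areaID, IP, referUrl, open.
    static final String[] DEFAULTS = {"0", "0.0.0.0", "-", "0"};

    /** Returns a copy of the record with every null or blank field replaced by its column default. */
    static String[] fillDefaults(String[] record) {
        // copyOf pads a short record with nulls, which are then filled below.
        String[] out = Arrays.copyOf(record, DEFAULTS.length);
        for (int i = 0; i < out.length; i++) {
            if (out[i] == null || out[i].trim().isEmpty()) {
                out[i] = DEFAULTS[i];
            }
        }
        return out;
    }

    /** Alternative policy: drop every record that has at least one missing field. */
    static List<String[]> dropIncomplete(List<String[]> records) {
        List<String[]> kept = new ArrayList<>();
        for (String[] r : records) {
            boolean complete = r.length == DEFAULTS.length;
            for (String f : r) {
                if (f == null || f.trim().isEmpty()) {
                    complete = false;
                }
            }
            if (complete) {
                kept.add(r);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String line = "12\t\thttp://example.com/\t";
        // The limit -1 keeps trailing empty fields instead of discarding them.
        String[] filled = fillDefaults(line.split("\t", -1));
        System.out.println(String.join("\t", filled));
    }
}
```

Which policy fits depends on how the attribute is used later: dropping records is simple but loses data, while a default keeps the record at the cost of an artificial value.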
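Handling inconsistent data
Real-world data often contains records whose contents are inconsistent. Some of these inconsistencies can be resolved manually by checking them against external references. For example, when the encoding differs from server to server, preprocessing can help correct the inconsistencies that arise from the encodings used.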
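e) Data transformation
Data transformation mainly means normalizing the data. For example, in a customer database with an age attribute and a salary attribute, salary values are much larger than age values. Without normalization, distances computed on salary will far exceed those computed on age, which means the influence of the salary attribute on the overall distance between data objects is wrongly exaggerated.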
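One common way to do this is min-max normalization, which rescales each attribute to the range 0 to 1 so that age and salary contribute on the same scale. The text does not say which normalization is meant, so the choice of min-max here is an assumption, and the class name and sample values are made up for illustration.

```java
import java.util.Arrays;

final class MinMaxNormalizer {
    /** Rescales values linearly so the smallest maps to 0.0 and the largest to 1.0. */
    static double[] normalize(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant column has zero range; map it to 0.0 to avoid dividing by zero.
            out[i] = range == 0.0 ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample values for three customers.
        double[] age = normalize(new double[] {20, 30, 60});
        double[] salary = normalize(new double[] {2000, 5000, 12000});
        System.out.println(Arrays.toString(age));
        System.out.println(Arrays.toString(salary));
    }
}
```

After rescaling, a difference of 10 years in age and a difference of 3000 in salary both become values between 0 and 1, so neither dominates a distance calculation merely because of its unit.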
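f) Data merging
Complex analysis of a large database usually takes a great deal of time, which often makes such analysis impractical, especially when interactive data mining is needed. Data reduction techniques help derive a compact data set from the original huge one while preserving the integrity of the original. Mining the compact data set is clearly more efficient, and the results are essentially the same as those obtained from the original data set.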
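The text does not name a specific reduction technique. As one simple possibility, the sketch below uses systematic sampling, keeping every n-th record; whether a sample like this really preserves the results depends on the data and the analysis, so treat it as an assumption rather than the method used here. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SystematicSample {
    /** Keeps the records at positions 0, step, 2*step, ... of the input. */
    static <T> List<T> everyNth(List<T> records, int step) {
        if (step < 1) {
            throw new IllegalArgumentException("step must be >= 1");
        }
        List<T> sample = new ArrayList<>();
        for (int i = 0; i < records.size(); i += step) {
            sample.add(records.get(i));
        }
        return sample;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("r0", "r1", "r2", "r3", "r4", "r5", "r6");
        // With step 3 the sample is roughly one third of the original size.
        System.out.println(everyNth(log, 3));
    }
}
```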

Hadoop introduction
。。。。。。。。
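The Hadoop family
Hadoop as a whole consists of the following subprojects:
Member | Purpose
Hadoop Common | The lowest-level module of the Hadoop stack. It provides utilities for the other Hadoop subprojects, such as configuration file and log handling.
Avro | An RPC (and data serialization) project led by Doug Cutting, somewhat similar to Google's protobuf and Facebook's Thrift. Avro is intended to serve as Hadoop's future RPC layer, making RPC communication faster and the data structures more compact.
Chukwa | A monitoring system for large clusters, built on Hadoop and contributed by Yahoo.
HBase | An open-source, column-oriented distributed database built on the Hadoop Distributed File System.
HDFS | The distributed file system.
Hive | Similar to CloudBase: software that provides data-warehouse SQL functionality on top of the Hadoop distributed computing platform. It simplifies summarization and ad hoc querying of the massive data stored in Hadoop. Hive provides a query language, QL, which is based on SQL and convenient to use.
MapReduce | Implements the MapReduce programming framework.
Pig | A high-level, SQL-like query language built on MapReduce. It compiles operations into the Map and Reduce phases of the MapReduce model, and users can define their own functions. Developed by Yahoo's grid computing group as yet another clone of a Google project, in this case Sawzall.
ZooKeeper | An open-source implementation of Google's Chubby. It is a reliable coordination system for large distributed systems, providing configuration maintenance, naming, distributed synchronization, group services, and so on. ZooKeeper's goal is to encapsulate complex, error-prone key services and offer users a simple interface on top of an efficient, stable system.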

Hadoop installation
h) Operating system
Linux 2.6.31-20-generic, Ubuntu 9.10
i) Required software
ssh
apt-get install openssh-server

rsync
apt-get install rsync

java 1.6
apt-get install sun-java6-jre sun-java6-jdk

ant
apt-get install ant
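j) Environment setup: passwordless SSH login:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
Note: the scp step below is only needed for a cluster (it was highlighted in blue in the original). On a single machine, skip it and append ~/.ssh/id_dsa.pub to ~/.ssh/authorized_keys directly.
scp ~/.ssh/id_dsa.pub hadoop@*.*.*.*:/home/hadoop/id_dsa.pub
cat ~/id_dsa.pub >> ~/.ssh/authorized_keys
Test the login:
ssh localhost or ssh *.*.*.*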

k) Build
i. Download it from the official website; I won't go into that here.
ii. We install everything Hadoop-related under /usr/local/
tar zxvf hadoop-0.20.2.tar.gz
ln -s hadoop-0.20.2 hadoop
cd hadoop

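iii. Configure Hadoop (I copied the official default configuration rather than writing my own. What is shown here is the single-node setup; for a cluster see: http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html )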
conf/core-site.xml:

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

conf/hdfs-site.xml:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

conf/mapred-site.xml:

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>

iv. Format Hadoop
hadoop namenode -format
If this error appears: "Error: JAVA_HOME is not set.", the Java home has not been configured.
We set JAVA_HOME globally:
vi /etc/environment
Add JAVA_HOME, and add /usr/local/hadoop/bin to PATH:
JAVA_HOME="/usr/lib/jvm/java-6-sun"
v. Start Hadoop
start-all.sh
vi. Check that Hadoop is running normally
netstat -nl | more
tcp6 0 0 127.0.0.1:9000 :::* LISTEN
tcp6 0 0 127.0.0.1:9001 :::* LISTEN
tcp6 0 0 :::50090 :::* LISTEN
tcp6 0 0 :::50070 :::* LISTEN

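vii. Test
hadoop fs -put CHANGES.txt input/
hadoop fs -ls input
This example counts how many times each string matching the regular expression occurs (in effect, a word count):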
hadoop jar hadoop-*-examples.jar grep input output '[a-z.]+'
root@hadoop-desktop:/usr/local/hadoop# hadoop fs -cat output/* |more
cat: Source must be a file.
3828 .
1969 via
1375 to
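To make the output above easier to read, here is a minimal local sketch of what the grep example computes: every substring matching the regular expression is counted, and the counts are listed in decreasing order. It runs on in-memory strings rather than on HDFS, and the class and method names are hypothetical; it is not the code of the Hadoop example itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LocalGrepCount {
    /** Counts every substring of every line that matches the regular expression. */
    static Map<String, Integer> countMatches(Iterable<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            while (m.find()) {
                counts.merge(m.group(), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the entries ordered by decreasing count, like the job output. */
    static List<Map.Entry<String, Integer>> sortedByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        return entries;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("fixed via patch.", "via to");
        for (Map.Entry<String, Integer> e : sortedByCount(countMatches(lines, "[a-z.]+"))) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

Because the pattern '[a-z.]+' includes the dot, a lone "." also counts as a match, which is why "." heads the list in the real output.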

l) API overview
See the attachment: Word: hadoop的API.docx
Hive installation
n) Download: get the latest version from the official site; I won't go into that here.
o) Unpack:
tar zxvf hive-0.5.0-bin.tar.gz
ln -s hive-0.5.0-bin hive
p) Configure the Hive environment
vi /etc/environment
HIVE_HOME="/usr/local/hive/"
q) Create the Hive storage directory
hadoop fs -mkdir /user/hive/warehouse
hadoop fs -chmod g+w /user/hive/warehouse
r) Start Hive
hive
This brings up the hive> prompt.
Create the pokes table:
hive> CREATE TABLE pokes (foo INT, bar STRING);
Load the test data; the file being loaded has 2 columns.
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
hive> select count(1) from pokes;

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201004131133_0018, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201004131133_0018
Kill Command = /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201004131133_0018
2010-04-13 16:32:12,188 Stage-1 map = 0%, reduce = 0%
2010-04-13 16:32:29,536 Stage-1 map = 100%, reduce = 0%
2010-04-13 16:32:38,768 Stage-1 map = 100%, reduce = 33%
2010-04-13 16:32:44,916 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201004131133_0018
OK
500
Time taken: 38.379 seconds

hive> select count(bar),bar from pokes group by bar;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201004131133_0017, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201004131133_0017
Kill Command = /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201004131133_0017
2010-04-13 16:26:55,791 Stage-1 map = 0%, reduce = 0%
2010-04-13 16:27:11,165 Stage-1 map = 100%, reduce = 0%
2010-04-13 16:27:20,268 Stage-1 map = 100%, reduce = 33%
2010-04-13 16:27:25,348 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201004131133_0017
OK
3 val_0
1 val_10
……………
Time taken: 37.979 seconds
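The GROUP BY query above runs as a single MapReduce job. As a rough mental model (not Hive's actual execution plan), the map phase emits the grouping key for each row and the reduce phase counts the rows that share a key. The sketch below mimics that locally; the class and method names are hypothetical, the sample rows merely stand in for kv1.txt, and it assumes bar is never NULL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GroupBySketch {
    /** Map phase: for each (foo, bar) row, emit the grouping key bar. */
    static List<String> mapPhase(List<String[]> rows) {
        List<String> keys = new ArrayList<>();
        for (String[] row : rows) {
            keys.add(row[1]);
        }
        return keys;
    }

    /** Shuffle and reduce phase: bring equal keys together and count them. */
    static Map<String, Integer> reducePhase(List<String> keys) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample rows in the same (foo, bar) shape as the pokes table.
        List<String[]> rows = Arrays.asList(
            new String[] {"0", "val_0"},
            new String[] {"10", "val_10"},
            new String[] {"0", "val_0"},
            new String[] {"0", "val_0"});
        for (Map.Entry<String, Integer> e : reducePhase(mapPhase(rows)).entrySet()) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```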

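s) Hive API:
See: http://hadoop.apache.org/hive/docs/current/api/org/apache/hadoop/hive/conf/
t) Using MySQL as the Hive metastore
See: http://www.mazsoft.com/blog/post/2010/02/01/Setting-up-HadoopHive-to-use-MySQL-as-metastore.aspx
This stores the meta information in MySQL, so that the table list is not lost if HDFS goes down. It seems unnecessary to me, because once HDFS is down the meta information is of little use anyway. (By default Hive keeps this metadata in an embedded Derby database on the local file system, not in HDFS.)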

Some tests of Hive & Hadoop:
v) Loading gz or bz2 source data: comparison of space used & time taken:

hive> load data local inpath 'ok.txt.gz' overwrite into table page_test2 partition(dt='2010-04-16');
Copying data from file:/usr/local/ok.txt.gz
Loading data to table page_test2 partition {dt=2010-04-16}
OK
Time taken: 3.649 seconds
Below is the size of the file in which Hadoop stores the Hive table:
root@hadoop:/tmp/hadoop-root/dfs/data/current# du -ch blk_-945326243445352181
22M blk_-945326243445352181
22M total

w) Loading a plain text file:
hive> load data local inpath 'ok.txt' overwrite into table page_test partition(dt='2010-04-17');
Copying data from file:/usr/local/ok.txt
Loading data to table page_test partition {dt=2010-04-17}
OK
Time taken: 41.593 seconds
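Below is the size of the file in which Hadoop stores the Hive table (note that 64 MB matches the default HDFS block size, so this block is probably only one of several that hold the 196 MB file):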
root@hadoop:/tmp/hadoop-root/dfs/data/current# du -ch blk_7538941016314062501
64M blk_7538941016314062501
64M total

x) Source file sizes:
root@hadoop:/usr/local# du -ch ok.txt
196M ok.txt
196M total

root@hadoop:/usr/local# du -ch ok.txt.gz
22M ok.txt.gz
22M total

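y) Hive query comparison:
The results show that querying the compressed data is even slightly faster than querying the uncompressed data, which is odd.
Querying with Hive after importing the gz file and creating the partition:
hive> select count(1) from page_test2 a where a.dt='2010-04-16';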
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201004131133_0026, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201004131133_0026
Kill Command = /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201004131133_0026
2010-04-16 13:43:39,435 Stage-1 map = 0%, reduce = 0%
2010-04-16 13:47:30,921 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201004131133_0026
OK
17166483
Time taken: 239.447 seconds
Querying with Hive QL after importing the txt file and creating the partition:
hive> select count(1) from page_test a where a.dt='2010-04-16';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201004131133_0025, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201004131133_0025
Kill Command = /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201004131133_0025
2010-04-16 13:37:11,927 Stage-1 map = 0%, reduce = 0%
2010-04-16 13:42:01,382 Stage-1 map = 100%, reduce = 22%
2010-04-16 13:42:13,683 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201004131133_0025
OK
17166483
Time taken: 314.291 seconds
The Hive query against the txt data without a partition was not recorded; it took a little over 400 seconds.
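To put the measurements in proportion, the ratios can be worked out from the figures reported in this section (196M vs 22M source size, 41.593 s vs 3.649 s load time, 314.291 s vs 239.447 s query time). The class name is hypothetical. One plausible explanation for the faster gz query, which was not verified here, is that reading 22 MB from disk costs less than reading 196 MB, and on a single node that saving can outweigh the cost of decompression.

```java
final class LoadQueryRatios {
    /** Returns how many times larger (or slower) the first figure is than the second. */
    static double ratio(double first, double second) {
        return first / second;
    }

    public static void main(String[] args) {
        // All figures are taken from the measurements reported in this section.
        System.out.printf("size, txt vs gz:       %.2fx%n", ratio(196, 22));
        System.out.printf("load time, txt vs gz:  %.2fx%n", ratio(41.593, 3.649));
        System.out.printf("query time, txt vs gz: %.2fx%n", ratio(314.291, 239.447));
    }
}
```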

Hive development
a) Start the Hive service:
Start the Hive service on port 10000
HIVE_PORT=10000 ./bin/hive --service hiveserver

b) Check whether the service has started:
netstat -nl | grep 10000
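The same check can be made from Java before a client tries to connect. This is a small sketch using only the standard library; the class and method names are hypothetical, and a successful TCP connection only shows that something is listening on the port, not that it is a working Hive server.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class PortCheck {
    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timeout, or unknown host: treat all as "not listening".
            return false;
        }
    }

    public static void main(String[] args) {
        // Port 10000 is the HIVE_PORT value used when starting the service above.
        System.out.println("listening on 10000: " + isListening("localhost", 10000, 1000));
    }
}
```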

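c) Write a test program:
This is the example from the official documentation. It compiled for me, but running it produced an error, and I have not tracked down the cause.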

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveJdbcClient {
    private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

    /**
     * @param args
     * @throws SQLException
     */
    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
            System.exit(1);
        }
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        String tableName = "testHiveDriverTable";
        stmt.executeQuery("drop table " + tableName);
        ResultSet res = stmt.executeQuery("create table " + tableName + " (key int, value string)");
        // show tables
        String sql = "show tables '" + tableName + "'";
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        if (res.next()) {
            System.out.println(res.getString(1));
        }
        // describe table
        sql = "describe " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1) + "\t" + res.getString(2));
        }

        // load data into table
        // NOTE: filepath has to be local to the hive server
        // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
        String filepath = "/tmp/a.txt";
        sql = "load data local inpath '" + filepath + "' into table " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);

        // select * query
        sql = "select * from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
        }

        // regular hive query
        sql = "select count(1) from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1));
        }
    }
}

dy_252
