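Hive, part of the Hadoop ecosystem, lets users write SQL-like (HQL) statements, which it translates into MapReduce jobs for Hadoop to execute. Spark is a distributed in-memory computing system that handles the distributed-computation layer and runs faster than Hadoop MapReduce. Shark is a component built on Spark: an open-source, distributed, fault-tolerant, in-memory analytics system. It reuses the existing Hive client and metastore and stays compatible with Hive's features while running much faster than Hive, translating HQL into many small tasks that execute on Spark. Shark is a strong tool for real-time big-data query and analysis. It supports HiveQL, Hive data formats, and UDFs, and it can query data stored in HDFS and HBase. Its SQL queries are reported to run up to 100 times faster than Hive, and its machine-learning workloads up to 100 times faster than Hadoop.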
- Spark 0.8.1
- Shark 0.8.1
- Hive 0.9.0
- Hadoop 2.0.0-CDH4.4.0
- Scala 2.9.3
Download links for each component are listed below; each can be fetched directly with wget.
Name | Download URL |
---|---|
Spark 0.8.1 | http://d3kbcqa49mib13.cloudfront.net/spark-0.8.1-incubating-bin-cdh4.tgz |
Shark 0.8.1 | https://github.com/amplab/shark/releases/download/v0.8.1/shark-0.8.1-bin-cdh4.tgz |
Hive 0.9.0 | https://github.com/amplab/shark/releases/download/v0.8.1/hive-0.9.0-bin.tgz |
Hadoop2-CDH4.3 | http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.3.0.tar.gz |
Scala 2.9.3 | |
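The downloads above can be scripted. The following is a minimal sketch, not part of the original setup: `DEST`, `DRY_RUN`, `run`, and `fetch_all` are illustrative names, and `/usr/lib` is assumed as the install prefix only because the configuration later in this post uses paths under it. Scala is left out because no download URL is given for it. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: fetch and unpack the archives listed above.
# DEST and DRY_RUN are hypothetical knobs introduced for this example.
DEST="${DEST:-/usr/lib}"    # assumed install prefix
DRY_RUN="${DRY_RUN:-1}"     # 1 = print commands only, 0 = execute them

URLS="
http://d3kbcqa49mib13.cloudfront.net/spark-0.8.1-incubating-bin-cdh4.tgz
https://github.com/amplab/shark/releases/download/v0.8.1/shark-0.8.1-bin-cdh4.tgz
https://github.com/amplab/shark/releases/download/v0.8.1/hive-0.9.0-bin.tgz
http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.3.0.tar.gz
"

run() {
  # Echo the command; execute it only when DRY_RUN=0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

fetch_all() {
  for url in $URLS; do
    file="${url##*/}"                  # archive name = last path component
    run wget -c "$url"                 # -c resumes a partial download
    run tar -xzf "$file" -C "$DEST"    # unpack under $DEST
  done
}

fetch_all
```

Run it once as is to review the printed commands, then again with `DRY_RUN=0` to perform the downloads.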
Spark:
$cd $YOUR_SPARK_HOME/
$vim conf/spark-env.sh
Edit spark-env.sh:
SCALA_HOME=/usr/lib/spark-0.8.1-incubating-bin-cdh4/scala-2.9.3
JAVA_HOME=/usr/java/jdk1.7.0_25
export CLASSPATH=/usr/java/jdk1.7.0_25/lib
SPARK_MASTER_IP=192.168.10.220
SPARK_MASTER_PORT=8081
SPARK_MASTER_WEBUI_PORT=8090
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=4g
SPARK_WORKER_PORT=8091
SPARK_WORKER_WEBUI_PORT=8092
SPARK_WORKER_INSTANCES=1
export SPARK_JAVA_OPTS="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps"
export HADOOP_HOME=/usr/lib/hadoop
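Before starting the cluster, it can help to confirm that the settings above are actually picked up. The sketch below is one way to do that under stated assumptions: `check_spark_env` is a hypothetical helper, and the list of variables it checks is a selection from the file above, not a requirement documented by Spark. It sources the file in a subshell so nothing leaks into the calling shell.

```shell
#!/bin/sh
# Sketch: report variables from spark-env.sh that are unset or empty.
# check_spark_env is a hypothetical helper name used for illustration.
check_spark_env() {
  # $1: path to spark-env.sh (use a path containing a slash)
  (
    . "$1" || exit 1
    missing=0
    for name in SCALA_HOME JAVA_HOME SPARK_MASTER_IP SPARK_MASTER_PORT SPARK_WORKER_MEMORY; do
      eval "value=\${$name:-}"         # indirect lookup of the variable named in $name
      if [ -z "$value" ]; then
        echo "missing: $name"
        missing=1
      fi
    done
    exit "$missing"                    # 0 when every variable is set
  )
}
```

For example, `check_spark_env "$YOUR_SPARK_HOME/conf/spark-env.sh" && echo "spark-env.sh looks complete"`.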
Hive:
If MySQL is used to store the Hive metastore, copy mysql-connector-java-3.1.13-bin.jar into $HIVE_HOME/lib.
#cd $YOUR_HIVE_HOME
#cp conf/hive-env.sh.template conf/hive-env.sh
#cp conf/hive-default.xml.template conf/hive-site.xml
#vim conf/hive-env.sh
Edit hive-env.sh:
....
HADOOP_HOME=$YOUR_HADOOP2_HOME
....
Edit hive-site.xml:
.....
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/SHARK_DATABASE?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>shark</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>shark</value>
<description>password to use against metastore database</description>
</property>
......
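Because hive-site.xml is copied from a large template, it is easy to edit the wrong copy of a property. The sketch below prints the value that follows a given property name so the metastore settings can be double-checked. `hive_prop` is a hypothetical helper, and it assumes the layout shown above, with `<name>` and `<value>` on separate lines; it is not a general XML parser.

```shell
#!/bin/sh
# Sketch: print the <value> that follows a given <name> in hive-site.xml.
# hive_prop is a hypothetical helper; it relies on the one-tag-per-line layout above.
hive_prop() {
  # $1: property name, $2: path to hive-site.xml
  awk -v key="$1" '
    index($0, "<name>" key "</name>") { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, "")             # drop everything up to the opening tag
      sub(/<\/value>.*/, "")           # drop the closing tag and anything after it
      print
      exit                             # only the first match is reported
    }
  ' "$2"
}
```

For example, `hive_prop javax.jdo.option.ConnectionURL "$YOUR_HIVE_HOME/conf/hive-site.xml"` should print the JDBC URL configured above.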
Shark:
#cd $YOUR_SHARK_HOME
#cp conf/shark-env.sh.template conf/shark-env.sh
#vim conf/shark-env.sh
Edit shark-env.sh:
......
export SCALA_HOME=/usr/lib/scala-2.9.3
export HADOOP_HOME=/usr/lib/hadoop
export HIVE_HOME=/usr/lib/shark-0.8.1-bin-cdh4/hive-0.9.0-bin/
export MASTER=spark://192.168.10.220:8081
export SPARK_HOME=/usr/lib/spark-0.8.1-incubating-bin-cdh4
export HIVE_CONF_DIR=/usr/lib/shark-0.8.1-bin-cdh4/hive-0.9.0-bin/conf
# Java options
# On EC2, change the local.dir to /mnt/tmp
SPARK_JAVA_OPTS="-Dspark.local.dir=/tmp "
SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 "
SPARK_JAVA_OPTS+="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps "
export SPARK_JAVA_OPTS
......
Start and test. Format the NameNode only on a first-time setup, since formatting erases existing HDFS metadata:
#cd $YOUR_HADOOP_HOME
#bin/hadoop namenode -format
#sbin/start-all.sh
#cd $YOUR_SPARK_HOME
#bin/start-all.sh
#cd $YOUR_SHARK_HOME
#bin/shark
Starting the Shark Command Line Client
......
......
shark>show databases;
shark>create database SHARK_DB;
shark>use SHARK_DB;
shark>create table tbl_test(ID STRING);
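The same smoke test can be kept in a file so it is repeatable. This is a sketch under assumptions: `SQL_FILE` is an illustrative name, and it assumes the Shark 0.8.1 command-line client accepts the `-f <file>` option that the Hive CLI provides, which is worth confirming on your installation. `IF NOT EXISTS` is added so the statements can be rerun without errors.

```shell
#!/bin/sh
# Sketch: write the smoke-test statements from above into a reusable file.
# SQL_FILE is a hypothetical location chosen for this example.
SQL_FILE="${SQL_FILE:-/tmp/shark_smoke.sql}"

cat > "$SQL_FILE" <<'EOF'
show databases;
create database if not exists SHARK_DB;
use SHARK_DB;
create table if not exists tbl_test(ID STRING);
show tables;
EOF

# Assumption: bin/shark accepts -f like the Hive CLI does.
echo "run with: \$YOUR_SHARK_HOME/bin/shark -f $SQL_FILE"
```

The script only writes the file and prints the command, so it is safe to run before the cluster is up.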
The Shark jobs run one after another in a queue. Their status on the Spark cluster looks like this:
Spark Master at spark://192.168.10.220:8081
- URL: spark://192.168.10.220:8081
- Workers: 4
- Cores: 16 Total, 16 Used
- Memory: 57.6 GB Total, 4.0 GB Used
- Applications: 2 Running, 11 Completed
Workers
Id | Address | State | Cores | Memory |
---|---|---|---|---|
worker-20140518173608-CHBM220-34134 | CHBM220:8081 | ALIVE | 4 (4 Used) | 14.4 GB (1024.0 MB Used) |
worker-20140518173608-CHBM221-41264 | CHBM221:8081 | ALIVE | 4 (4 Used) | 14.4 GB (1024.0 MB Used) |
worker-20140518173610-CHBM223-36479 | CHBM223:8081 | ALIVE | 4 (4 Used) | 14.4 GB (1024.0 MB Used) |
worker-20140518173611-CHBM224-41840 | CHBM224:8081 | ALIVE | 4 (4 Used) | 14.4 GB (1024.0 MB Used) |
Running Applications
ID | Name | Cores | Memory per Node | Submitted Time | User | State | Duration |
---|---|---|---|---|---|---|---|
app-20140519085613-0012 | Shark::CHBM220 | 0 | 1024.0 MB | 2014/05/19 08:56:13 | root | WAITING | 9.4 min |
app-20140519083651-0009 | Shark::CHBM220 | 16 | 1024.0 MB | 2014/05/19 08:36:51 | root | RUNNING | 29 min |
Completed Applications
ID | Name | Cores | Memory per Node | Submitted Time | User | State | Duration |
---|---|---|---|---|---|---|---|
app-20140519083741-0011 | Shark::CHBM220 | 0 | 1024.0 MB | 2014/05/19 08:37:41 | root | FINISHED | 8.7 min |
app-20140519083723-0010 | Shark::CHBM220 | 0 | 1024.0 MB | 2014/05/19 08:37:23 | root | FINISHED | 1 s |
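In the status above, the running application holds all 16 cores while the second one waits with 0 cores. In Spark's standalone mode an application takes every available core unless the `spark.cores.max` property caps it. The fragment below is a sketch of how that cap could be added to conf/shark-env.sh; the value 8 is purely illustrative and has not been tested on this cluster.

```shell
# Sketch for conf/shark-env.sh: cap the cores one Shark session may claim.
# The value 8 is an assumption for illustration (half of the 16 cores shown above).
SPARK_JAVA_OPTS="${SPARK_JAVA_OPTS:-}-Dspark.cores.max=8 "
export SPARK_JAVA_OPTS
```

Place these lines before the final `export SPARK_JAVA_OPTS` in shark-env.sh, or keep the export shown here, and restart the Shark client for the setting to take effect.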