Environment
OS: CentOS 6.5
Software versions: hadoop-2.6.2, jdk1.8.0_65, scala-2.11.7, spark-1.4.1-bin-hadoop2.6
Cluster layout:
master: www 192.168.78.110
slave1: node1 192.168.78.111
slave2: node2 192.168.78.112
hosts file:
192.168.78.110 www
192.168.78.111 node1
192.168.78.112 node2
Make sure each of the three machines can ping the other two by hostname.
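A quick connectivity check, run on each node (hostnames as listed in /etc/hosts above):
# verify hostname resolution and reachability for all three nodes
for h in www node1 node2; do
  ping -c 1 "$h" > /dev/null && echo "$h OK" || echo "$h UNREACHABLE"
done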
1. Download Hadoop, Scala, and Spark, and extract them under /opt/hadoop
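If /opt/hadoop does not exist yet, create it first; a small sketch, assuming sudo rights and the hadoop user seen in the prompts below:
sudo mkdir -p /opt/hadoop
sudo chown hadoop:hadoop /opt/hadoop    # match the ownership shown in the listing below
cd /opt/hadoop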
[hadoop@www hadoop]$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.4.1-bin-hadoop2.6.tgz
[hadoop@www hadoop]$ wget http://downloads.typesafe.com/scala/2.11.7/scala-2.11.7.tgz?_ga=1.262254604.1613215006.1446896742
[hadoop@www hadoop]$ wget http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.6.2/hadoop-2.6.2.tar.gz
[hadoop@www hadoop]$ tar -xzvf spark-1.4.1-bin-hadoop2.6.tgz   # extract the archives
[hadoop@www hadoop]$ tar -xzvf scala-2.11.7.tgz
[hadoop@www hadoop]$ tar -xzvf hadoop-2.6.2.tar.gz
The resulting layout:
[hadoop@www hadoop]$ pwd
/opt/hadoop
[hadoop@www hadoop]$ ll
total 12
drwxr-xr-x. 11 hadoop hadoop 4096 Nov  8 08:30 hadoop-2.6.2
drwxr-xr-x.  6 hadoop hadoop 4096 Nov  8 18:40 scala-2.11.7
drwxr-xr-x. 11 hadoop hadoop 4096 Nov  8 18:40 spark-1.4.1-bin-hadoop2.6
2. Configure the fully distributed Hadoop cluster; see http://blog.csdn.net/erujo/article/details/49716841 for details. The start scripts used later also require passwordless SSH from the master to every node, as sketched below.
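A minimal sketch of the SSH setup, assuming no key pair exists yet (run as the hadoop user on the master):
# generate a passwordless key and push it to every node, master included
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
for h in www node1 node2; do
  ssh-copy-id hadoop@"$h"    # verify afterwards with: ssh $h hostname
done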
3. Edit ~/.bashrc to configure the environment variables
[hadoop@www scala-2.11.7]$ vimx ~/.bashrc
# User specific aliases and functions
export JAVA_HOME=/usr/java/jdk1.8.0_65
export SCALA_HOME=/opt/hadoop/scala-2.11.7
export HADOOP_HOME=/opt/hadoop/hadoop-2.6.2
export SPARK_HOME=/opt/hadoop/spark-1.4.1-bin-hadoop2.6
PATH=$PATH:${SCALA_HOME}/bin:${SPARK_HOME}/bin:${HADOOP_HOME}/bin
[hadoop@www scala-2.11.7]$ source !$
source ~/.bashrc
Test Scala:
[hadoop@www scala-2.11.7]$ scala
Welcome to Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65).
Type in expressions to have them evaluated.
Type :help for more information.
scala>    // reaching this prompt means the installation succeeded
Copy it to the slave machines:
[hadoop@www scala-2.11.7]$ scp ~/.bashrc [email protected]:~/.bashrc
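The command above only reaches node1; node2 needs the same file. A small loop covering both slaves (run source ~/.bashrc on each afterwards):
for h in node1 node2; do
  scp ~/.bashrc hadoop@"$h":~/.bashrc
done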
4. Configure Spark on the master
4.1 spark-env.sh
[hadoop@www hadoop]$ cd spark-1.4.1-bin-hadoop2.6/conf/
[hadoop@www conf]$ mv spark-env.sh.template spark-env.sh
[hadoop@www conf]$ vimx spark-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_65
export SCALA_HOME=/opt/hadoop/scala-2.11.7
export SPARK_MASTER_IP=192.168.78.110          # address the standalone Master binds to
export SPARK_WORKER_MEMORY=2g                  # memory each Worker may hand out to executors
export HADOOP_CONF_DIR=/opt/hadoop/hadoop-2.6.2/etc/hadoop   # lets Spark find the HDFS/YARN config
4.2 slaves
[hadoop@www conf]$ vimx slaves
node1
node2
Once both files are configured, copy the Spark directory to the slave nodes, as sketched below.
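A sketch of the copy, assuming /opt/hadoop already exists on both slaves and is writable by the hadoop user; Scala goes along as well, so the SCALA_HOME set in ~/.bashrc is valid everywhere:
for h in node1 node2; do
  scp -r /opt/hadoop/spark-1.4.1-bin-hadoop2.6 hadoop@"$h":/opt/hadoop/
  scp -r /opt/hadoop/scala-2.11.7 hadoop@"$h":/opt/hadoop/
done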
5. Start the Spark cluster and check the processes
[hadoop@www conf]$ /opt/hadoop/hadoop-2.6.2/sbin/start-all.sh
[hadoop@www conf]$ /opt/hadoop/spark-1.4.1-bin-hadoop2.6/sbin/start-all.sh
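The first start-all.sh is Hadoop's, the second Spark's. Hadoop 2.x prints a deprecation warning for its start-all.sh; the explicit equivalent is:
/opt/hadoop/hadoop-2.6.2/sbin/start-dfs.sh    # NameNode, SecondaryNameNode, DataNodes
/opt/hadoop/hadoop-2.6.2/sbin/start-yarn.sh   # ResourceManager, NodeManagers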
Check the processes:
master
[hadoop@www spark-1.4.1-bin-hadoop2.6]$ jps
8725 Jps
8724 Master
6679 ResourceManager
6504 SecondaryNameNode
6264 NameNode
slave
[hadoop@node1 spark-1.4.1-bin-hadoop2.6]$ jps
8880 Worker
8993 Jps
6770 NodeManager
6349 DataNode
If all of these processes are present, the cluster started successfully.
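With passwordless SSH in place, every node can be checked from the master in one pass (jps is called by full path because $JAVA_HOME/bin is not on the PATH set above):
for h in www node1 node2; do
  echo "== $h =="
  ssh "$h" /usr/java/jdk1.8.0_65/bin/jps
done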
6. Start the spark-shell console
[hadoop@www spark-1.4.1-bin-hadoop2.6]$ spark-shell --master spark://192.168.78.110:7077
(Without the --master flag, spark-shell runs in local mode and never touches the standalone cluster started in step 5.)
Earlier we uploaded a test.log file to the /input directory in HDFS; now we read it with Spark and run a word count. If the file is not there yet, upload it first, as sketched below.
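A minimal upload, assuming a local test.log in the current directory:
hdfs dfs -mkdir -p /input
hdfs dfs -put test.log /input/
hdfs dfs -ls /input          # confirm the file is there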
scala> val file = sc.textFile("hdfs://www:9000/input/test.log")   // host:port must match fs.defaultFS in core-site.xml; "master" is not in the hosts file above
scala> val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_)
scala> count.collect()
The tail of the output shows:
15/11/08 19:49:28 INFO scheduler.DAGScheduler: Job 0 finished: collect at <console>:26, took 16.682841 s
res0: Array[(String, Int)] = Array((hadoop,1), (hello,2), (world,1))
The job details are also visible at http://192.168.78.110:4040/stages (the standalone Master's own web UI runs on port 8080).
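As a further sanity check of the standalone cluster, the bundled SparkPi example can be submitted directly; a sketch assuming the Master listens on the default port 7077 and the examples jar sits under $SPARK_HOME/lib, as in the 1.4.1 binary distribution:
spark-submit --master spark://192.168.78.110:7077 \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/lib/spark-examples-*.jar 10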
Stop Spark:
[hadoop@www spark-1.4.1-bin-hadoop2.6]$ /opt/hadoop/spark-1.4.1-bin-hadoop2.6/sbin/stop-all.sh
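Spark's stop-all.sh stops only the Master and Workers; shut Hadoop down separately:
/opt/hadoop/hadoop-2.6.2/sbin/stop-yarn.sh
/opt/hadoop/hadoop-2.6.2/sbin/stop-dfs.sh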