(The overview section, originally marked in red, is quoted from Dong's blog.)
Apache Spark currently supports three distributed deployment modes: standalone, Spark on Mesos, and Spark on YARN. The first is similar to the model adopted by MapReduce 1.0, implementing fault tolerance and resource management internally. The latter two are the direction of future development: part of the fault tolerance and resource management is delegated to a unified resource-management system. Running Spark on top of a general-purpose resource manager lets it share a cluster with other computing frameworks, such as MapReduce. The biggest benefits are lower operating costs and higher resource utilization (resources are allocated on demand).
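For context, which of these modes you get is determined by the master URL you pass to spark-submit or spark-shell. The host names and ports below are placeholders, not values from this post:

```shell
# Master URL forms (hosts/ports are placeholders):
#   local[N]          - local mode, N worker threads in one JVM
#   spark://host:7077 - standalone cluster
#   mesos://host:5050 - Spark on Mesos
#   yarn              - Spark on YARN (in 1.6 also yarn-client / yarn-cluster)
bin/spark-shell --master local[2]
```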
Spark's local mode is similar to Hadoop's single-machine mode; it is meant for debugging and for getting started.
1. First download Spark from the official site: http://spark.apache.org/downloads.html. Be careful to pick the right package: download the pre-built binary (already compiled) for your Hadoop version.
One more thing to note: open http://spark.apache.org/documentation.html, select the version you downloaded, and check which Java (and other) versions it requires. It is best to follow those requirements, or you will run into all sorts of problems.
2. Try the local spark-shell first and write a Scala word count.
Quick start: http://spark.apache.org/docs/latest/quick-start.html
No configuration is needed; just launch spark-shell. Later, once you have set up a cluster, appending the master's URL to the spark-shell command will launch it against the cluster.
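For example (spark://guo:7077 is the master address that appears later in this post; substitute your own):

```shell
# local REPL, no configuration needed
bin/spark-shell
# the same REPL attached to a standalone cluster
bin/spark-shell --master spark://guo:7077
```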
guo@guo:~$ cd /opt/spark-1.6.1-bin-hadoop2.6/
guo@guo:/opt/spark-1.6.1-bin-hadoop2.6$ bin/spark-shell
When spark-shell starts, it automatically creates a SparkContext object named sc.
scala> val testlog=sc.textFile("test.log")
testlog: org.apache.spark.rdd.RDD[String] = test.log MapPartitionsRDD[5] at textFile at <console>:27
# Look at the first line
scala> testlog.first
res2: String = hello world
# Count the total number of lines (parentheses can be omitted on a no-argument method)
scala> testlog.count
res4: Long = 3
# Look at the first three lines; take() returns an array
scala> testlog.take(3)
res5: Array[String] = Array(hello world, hello kitty, hello guo)
The main logic is the single line below:
scala> val wordcount=testlog.flatMap(line=>line.split(" ")).map(word=>(word,1)).reduceByKey((a,b)=>a+b)
wordcount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:29
The collect method returns an array:
scala> wordcount.collect
res6: Array[(String, Int)] = Array((hello,3), (world,1), (guo,1), (kitty,1))
Iterate over the array:
scala> wordcount.collect.foreach(println)
(hello,3)
(world,1)
(guo,1)
(kitty,1)
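As a sanity check, the same counts can be reproduced with standard Unix tools. This is only an analogy to what the Spark pipeline computes, assuming a test.log containing the three lines from the session above:

```shell
# recreate the sample input (three lines, as in the shell session above)
printf 'hello world\nhello kitty\nhello guo\n' > test.log
# one word per line, then group and count: the shell analogue of flatMap + reduceByKey
tr ' ' '\n' < test.log | sort | uniq -c | sort -rn
```

The top line of the output shows that "hello" appears 3 times, matching the Spark result.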
Press Ctrl+D (or type :q) to exit the spark-shell.
Or run it directly in IDEA:
/**
 * Created by guo on 16-4-24.
 */
import org.apache.spark.{SparkContext, SparkConf}

object WordCount {
  def main(args: Array[String]): Unit = {
    // "local" runs Spark in-process; replace it with a master URL to run on a cluster
    val conf = new SparkConf().setAppName("wordcount").setMaster("local")
    val sc = new SparkContext(conf)
    val textFile = sc.textFile("/home/guo/test.log")
    // split each line into words, pair each word with 1, then sum the counts per word
    val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
    wordCounts.collect().foreach(println)
    sc.stop()
  }
}
Aren't you falling in love with Scala already?
3. Pseudo-distributed mode
Start the master:
guo@guo:~$ cd /opt/spark-1.6.1-bin-hadoop2.6/
guo@guo:/opt/spark-1.6.1-bin-hadoop2.6$ sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-1.6.1-bin-hadoop2.6/logs/spark-guo-org.apache.spark.deploy.master.Master-1-guo.out
guo@guo:/opt/spark-1.6.1-bin-hadoop2.6$ jps
2323 NameNode
3187 NodeManager
5204 Jps
2870 ResourceManager
2489 DataNode
2700 SecondaryNameNode
5116 Master
After starting the master, checking the processes shows a new Master process; if it is not there, look at the log to find out what went wrong. If it is there, open port 8080 in a browser. The page contains a line like URL: spark://guo:7077, which is the master's address; we will use it later. The HDFS and YARN processes do not belong to Spark; I had started those earlier.
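You can also check the master from the command line instead of the browser. This is a sketch, assuming the master web UI is running on the default port 8080 of the local machine:

```shell
# fetch the master web UI page and pull out the spark:// URL it advertises
curl -s http://localhost:8080/ | grep -o 'spark://[^< ]*'
```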
Start a slave:
guo@guo:/opt/spark-1.6.1-bin-hadoop2.6$ sbin/start-slave.sh spark://guo:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-1.6.1-bin-hadoop2.6/logs/spark-guo-org.apache.spark.deploy.worker.Worker-1-guo.out
guo@guo:/opt/spark-1.6.1-bin-hadoop2.6$ jps
2323 NameNode
3187 NodeManager
2870 ResourceManager
5241 Worker
2489 DataNode
5306 Jps
2700 SecondaryNameNode
5116 Master
If you look in /tmp you will find two new files, spark-guo-org.apache.spark.deploy.master.Master-1.pid and spark-guo-org.apache.spark.deploy.worker.Worker-1.pid, which hold the process IDs. If you configure nothing, they are stored in /tmp by default; since files in /tmp get cleaned up automatically, in production you should set SPARK_PID_DIR.
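A minimal sketch of doing that, assuming you keep the pid files under the installation directory (any non-volatile path works):

```shell
# in conf/spark-env.sh: store daemon pid files outside /tmp
export SPARK_PID_DIR=/opt/spark-1.6.1-bin-hadoop2.6/pids
```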
What if you want to start multiple workers?
You can do it like this: stop the slave first, then set SPARK_WORKER_INSTANCES and start it again:
guo@guo:/opt/spark-1.6.1-bin-hadoop2.6$ sbin/stop-slave.sh
stopping org.apache.spark.deploy.worker.Worker
guo@guo:/opt/spark-1.6.1-bin-hadoop2.6$ export SPARK_WORKER_INSTANCES=3
guo@guo:/opt/spark-1.6.1-bin-hadoop2.6$ sbin/start-slave.sh spark://guo:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-1.6.1-bin-hadoop2.6/logs/spark-guo-org.apache.spark.deploy.worker.Worker-1-guo.out
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-1.6.1-bin-hadoop2.6/logs/spark-guo-org.apache.spark.deploy.worker.Worker-2-guo.out
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-1.6.1-bin-hadoop2.6/logs/spark-guo-org.apache.spark.deploy.worker.Worker-3-guo.out
guo@guo:/opt/spark-1.6.1-bin-hadoop2.6$ jps
5536 Worker
2323 NameNode
3187 NodeManager
5606 Worker
2870 ResourceManager
2489 DataNode
5675 Worker
5740 Jps
2700 SecondaryNameNode
5116 Master
What if you want HA (two masters)?
Modify this script (of course, this is only a simulation):
guo@guo:/opt/spark-1.6.1-bin-hadoop2.6$ gedit ./sbin/start-master.sh
"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
Change the 1 to 2 (it is the instance number Spark uses to name the log and pid files).
Then start another master:
guo@guo:/opt/spark-1.6.1-bin-hadoop2.6$ sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-1.6.1-bin-hadoop2.6/logs/spark-guo-org.apache.spark.deploy.master.Master-2-guo.out
guo@guo:/opt/spark-1.6.1-bin-hadoop2.6$ jps
5536 Worker
2323 NameNode
3187 NodeManager
5827 Master
5606 Worker
2870 ResourceManager
2489 DataNode
5675 Worker
2700 SecondaryNameNode
5116 Master
5935 Jps
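Note that this trick merely launches a second Master process for demonstration; it does not give you automatic failover. Real standalone-mode HA relies on ZooKeeper for leader election, configured roughly like this in conf/spark-env.sh (the ZooKeeper addresses and the directory are placeholders):

```shell
# enable ZooKeeper-based master recovery for the standalone cluster
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"
```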
(Tip: in vi, typing /XX searches for XX.)