Setting up a local Spark environment and writing a first demo program
- Machine: Windows 10, 64-bit
- Development language: Java
- JDK: 1.8
1. Spark and Hadoop environment variable configuration
Download Spark from http://spark.apache.org/downloads.html. I downloaded spark-1.6.1-bin-hadoop2.6 (Spark 1.6.1), and also downloaded hadoop-2.6.0.tar.gz.
Spark builds on Hadoop and calls into the Hadoop libraries at runtime, so it will fail if the Hadoop runtime environment is not configured. Set the following environment variables:
- SPARK_HOME D:\software\bigdata\spark-1.6.1-bin-hadoop2.6
- HADOOP_HOME D:\software\bigdata\hadoop-2.6.0
- Append to PATH: %SPARK_HOME%\bin %SPARK_HOME%\sbin %HADOOP_HOME%\bin
At this point, type spark-shell at a cmd prompt; if it starts up normally, the setup succeeded.
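If something still looks wrong, a quick way to confirm that the JVM can actually see these variables is to read them back from Java before touching Spark. A minimal sketch (the EnvCheck class name is only illustrative, not part of the demo):

// Minimal sanity check: print the environment variables configured above.
// A null value means the variable was not picked up (restart the terminal/IDE after setting it).
public class EnvCheck {
    public static void main(String[] args) {
        System.out.println("SPARK_HOME  = " + System.getenv("SPARK_HOME"));
        System.out.println("HADOOP_HOME = " + System.getenv("HADOOP_HOME"));
    }
}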
2. Building the demo
POM:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.credo</groupId>
<artifactId>spark-test</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<!-- http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.1</version>
</dependency>
</dependencies>
<build>
<pluginManagement>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.5.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
<compilerArgument>-proc:none</compilerArgument>
</configuration>
</plugin>
</plugins>
</pluginManagement>
</build>
</project>
Main method:
package org.credo;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Arrays;
import java.util.UUID;

/**
 * Created by ZhaoQian on 2016/6/12.
 */
public class spark {

    public static void main(String[] args) {
        System.out.println("================spark begin==============================");
        System.setProperty("hadoop.home.dir", "D:\\software\\bigdata\\hadoop-2.6.0");

        // Create a Java Spark context
        SparkConf sparkConf = new SparkConf().setAppName("wordCount");
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);

        // Read a file
        JavaRDD<String> input = javaSparkContext.textFile("D:\\logger\\server.log2");

        /** The plain anonymous-class version */
//        JavaRDD<String> words = input.flatMap(
//                new FlatMapFunction<String, String>() {
//                    @Override
//                    public Iterable<String> call(String s) throws Exception {
//                        return Arrays.asList(s.split(" "));
//                    }
//                }
//        );
//        // Convert to key-value pairs and count
//        JavaPairRDD<String, Integer> counts = words.mapToPair(new PairFunction<String, String, Integer>() {
//            @Override
//            public Tuple2<String, Integer> call(String s) throws Exception {
//                return new Tuple2<String, Integer>(s, 1);
//            }
//        }).reduceByKey(new Function2<Integer, Integer, Integer>() {
//            @Override
//            public Integer call(Integer v1, Integer v2) throws Exception {
//                return v1 + v2;
//            }
//        });

        // Split into words: the commented-out block above uses anonymous classes, below are lambda expressions.
        JavaRDD<String> words = input
                .flatMap((FlatMapFunction<String, String>) s -> Arrays.asList(s.split(" ")));
        JavaPairRDD<String, Integer> counts = words
                .mapToPair((PairFunction<String, String, Integer>) s -> new Tuple2<>(s, 1))
                .reduceByKey((Function2<Integer, Integer, Integer>) (v1, v2) -> v1 + v2);

        // The output files contain the word counts as ("word", count) pairs.
        counts.saveAsTextFile("D://logger//" + UUID.randomUUID().toString());
        System.out.println("================spark end==============================");
    }
}
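To see the result in the console instead of opening the output files, a few lines can be appended at the end of main() above. This is only a sketch: collect() pulls the whole result into driver memory, so it is suitable for small inputs only.

        // Optional: print the counts on the driver in addition to (or instead of) writing files.
        // Reuses the counts RDD and javaSparkContext from the main method above.
        for (Tuple2<String, Integer> entry : counts.collect()) {
            System.out.println(entry._1() + " : " + entry._2());
        }
        // Stop the context once the job is done.
        javaSparkContext.stop();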
3. Problems encountered along the way and how to solve them:
Fixing the "A master URL must be set in your configuration" error
When running the Spark test program SparkPi and clicking Run, the following error appeared:
- Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
- at org.apache.spark.SparkContext.<init>(SparkContext.scala:185)
- at SparkPi$.main(SparkPi.scala:12)
- at SparkPi.main(SparkPi.scala)
- at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
- at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
- at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
- at java.lang.reflect.Method.invoke(Method.java:606)
- at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
- Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
The message shows that the master URL for the program has not been set, so it needs to be configured. The master URL passed to Spark can take the following forms:
- local: run locally with a single thread
- local[K]: run locally with K threads (K cores)
- local[*]: run locally with as many threads as there are available cores
- spark://HOST:PORT: connect to the given Spark standalone cluster master; the port must be specified
- mesos://HOST:PORT: connect to the given Mesos cluster; the port must be specified
- yarn-client: connect to a YARN cluster in client mode; HADOOP_CONF_DIR must be configured
- yarn-cluster: connect to a YARN cluster in cluster mode; HADOOP_CONF_DIR must be configured
Enter "-Dspark.master=local" under VM options, which tells the program to run locally with a single thread, and run it again (a programmatic alternative is sketched below).
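Alternatively, the master can be set in code on the SparkConf so that no extra VM option is needed. A minimal sketch based on the demo above:

        // Sketch: set the master programmatically instead of passing -Dspark.master.
        SparkConf sparkConf = new SparkConf()
                .setAppName("wordCount")
                .setMaster("local[*]");   // or "local" for a single thread
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);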
- Failed to locate the winutils binary in the hadoop binary path (java.io.IOException) [caused by missing files or permissions, or an incorrectly configured Hadoop environment]: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html