Spark RDD QuickStart

Preface

This chapter walks through the Spark RDD QuickStart and records the related setup steps and the errors encountered along the way.


Spark Cluster and Local Cluster

For a local cluster it is enough to configure the spark-env.sh and slaves files, both of which live in the conf directory. See the earlier posts in this series for the remaining setup.

  • slaves file
# slaves file
# A Spark Worker will be started on each of the machines listed below.
localhost
#192.168.31.80
  • spark-env.sh file
# spark-env.sh file
export SCALA_HOME=/Users/Sean/Software/Scala/scala-2.11.8
#export SCALA_HOME=/Users/Sean/Software/Scala/scala-2.11.7
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home
export SPARK_MASTER_IP=127.0.0.1
export SPARK_LOCAL_IP=127.0.0.1
export SPARK_WORKER_MEMORY=1g
#export SPARK_WORKER_MEMORY=512M
export SPARK_EXECUTOR_CORES=2
export SPARK_MASTER_PORT=7077 
#export master=spark://192.168.31.80:7070 

This configures each worker node with 1 GB of memory and 2 cores.

  • Startup script sbin/start-all.sh
cd sbin
./start-all.sh
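
To sanity-check the startup, jps should now show the standalone daemons (a quick sketch; jps also prints PIDs, omitted here):

jps
# expected: the Master and Worker daemons started above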

Java RDD Program

  • Running locally & on the cluster
SparkConf sparkConf = new SparkConf().setAppName("Spark-Overview-WordCount");
if (runLocalFlag) {
    sparkConf.setMaster("local[2]");
} else {
    sparkConf.setMaster("spark://localhost:7077")
        .setJars(new String[] {"/Users/sean/Documents/Gitrep/bigdata/spark/target/spark-demo.jar"});
}
// Obtain the context object.
JavaSparkContext context = new JavaSparkContext(sparkConf);
  1. Local mode: setMaster("local[2]").
  2. Cluster mode: setMaster("spark://localhost:7077").setJars(new String[] {"/Users/sean/Documents/Gitrep/bigdata/spark/target/spark-demo.jar"}).
  3. Note: in cluster mode the code must be packaged into a jar whose path is passed to setJars. Submitting to a remote cluster is done with the spark-submit command, as sketched below.
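
A minimal spark-submit invocation might look like the following; the main class name is borrowed from the stack trace in the Q&A below, so treat it and the jar path as illustrative:

./bin/spark-submit \
  --master spark://localhost:7077 \
  --class com.yanxml.bigdata.java.spark.rdd.FirstRDDDemo \
  /Users/sean/Documents/Gitrep/bigdata/spark/target/spark-demo.jar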
  • RDD operations
JavaSparkContext sc = new JavaSparkContext(sparkConf);

List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
int sum = distData.reduce((a, b) -> a + b);
// distData.map((x) -> x * 10);
// distData.map((x) -> x).reduce((a, b) -> (a + b));

System.out.println(sum);

The example uses the map and reduce operators (the map calls are left commented out above); a self-contained runnable version follows.
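
For reference, here is a sketch of the complete program with the map step enabled; the class name MapReduceDemo and the local[2] master are illustrative assumptions:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class MapReduceDemo {
    public static void main(String[] args) {
        // Local mode keeps the example self-contained; see the cluster notes above.
        SparkConf sparkConf = new SparkConf()
                .setAppName("Spark-Overview-MapReduce")
                .setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        JavaRDD<Integer> distData = sc.parallelize(data);

        // map is a lazy transformation; reduce is the action that triggers it.
        int sum = distData.map(x -> x * 10)        // 10, 20, 30, 40, 50
                          .reduce((a, b) -> a + b); // 150

        System.out.println(sum); // prints 150
        sc.close();
    }
}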


Q&A

Solutions:

  1. When submitting in cluster mode, setJars(xxxx) was missing. Package the code into a jar and ship it with the job.
  2. Use the local development mode local[2].
    [1] Submitting a Spark program from IDEA
    [2] Solutions for the Spark SerializedLambda error
    [3] Troubleshooting "Caused by: java.lang.ClassCastException" when connecting to a Spark cluster from IDEA
JavaSparkContext sc = new JavaSparkContext(sparkConf);
// This does not seem to help much, though.
sc.setLogLevel("WARN");
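
If setLogLevel does not take effect, a commonly used alternative (an assumption on my part, not verified in this post) is to lower the root log level in conf/log4j.properties, created by copying conf/log4j.properties.template:

# conf/log4j.properties
log4j.rootCategory=WARN, console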
  • Q5: No resources were allocated to the application.
    Increase the memory and other resources allocated to the worker nodes, or cap what the application requests, as sketched below.
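
From the application side, one way to fit within the 1g/2-core worker configured earlier is to cap the executor resources explicitly; the values here are illustrative:

SparkConf sparkConf = new SparkConf()
        .setAppName("Spark-Overview-WordCount")
        .setMaster("spark://localhost:7077")
        // Ask for no more than the worker offers (1g / 2 cores above).
        .set("spark.executor.memory", "512m")
        .set("spark.cores.max", "2");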

  • Q6: Cast exception during task deserialization. (See Q2 and Q3 for the fix; the usual root cause is that the jar containing the lambda classes was not shipped to the executors, i.e. setJars was missing.)

20/07/02 21:06:35 ERROR Executor: Exception in task 1.2 in stage 0.0 (TID 4)
java.io.IOException: unexpected exception type
	at java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1582)
	at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1154)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1817)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
	at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1148)
	... 27 more
Caused by: java.lang.IllegalArgumentException: Invalid lambda deserialization
	at com.yanxml.bigdata.java.spark.rdd.FirstRDDDemo.$deserializeLambda$(FirstRDDDemo.java:10)
	... 37 more