Creating, Debugging, and Running a Spark Project with Maven

Original

qq_23617681

2020-02-20 15:29

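There are two ways to set up a Spark project: as a plain Java project or as a Maven project.

Because Maven makes managing jar dependencies easy, this post builds the Spark project with Maven.

Spark supports four languages: Scala, Java, Python, and R.

Scala runs on the JVM, so it requires a JDK.

It is also the language Spark itself is written in, and the official API documentation covers Scala best.

If you have the choice, Scala is the best language for developing Spark programs.

Java and Python are also well supported in Spark, and the official documentation explains their APIs in full.

R has only been supported since Spark 1.4, so official resources are scarce and there is little material online.

I have been working in Java, so to get started quickly and build a Spark example that actually runs, this post develops the Spark program in Java.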

Prerequisites:

1. Maven is installed.

2. Hadoop is installed.

3. Spark is installed.

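Basic steps for building a Spark project with Maven:

1. Create a new Maven project.

2. Create a JavaSparkPi class.

3. Paste in the JavaSparkPi.java code from the unpacked Spark distribution. The listing here already includes the setMaster("local") call that step 6 adds.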

package sparkTest;

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

import java.util.ArrayList;
import java.util.List;

/** 
 * Computes an approximation to pi
 * Usage: JavaSparkPi [slices]
 */
public final class JavaSparkPi {

  public static void main(String[] args) throws Exception {
    SparkConf sparkConf = new SparkConf().setAppName("JavaSparkPi").setMaster("local");
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);

    int slices = (args.length == 1) ? Integer.parseInt(args[0]) : 2;
    int n = 100000 * slices;
    List<Integer> l = new ArrayList<Integer>(n);
    for (int i = 0; i < n; i++) {
      l.add(i);
    }

    JavaRDD<Integer> dataSet = jsc.parallelize(l, slices);

    int count = dataSet.map(new Function<Integer, Integer>() {
      @Override
      public Integer call(Integer integer) {
        double x = Math.random() * 2 - 1;
        double y = Math.random() * 2 - 1;
        return (x * x + y * y < 1) ? 1 : 0;
      }
    }).reduce(new Function2<Integer, Integer, Integer>() {
      @Override
      public Integer call(Integer integer, Integer integer2) {
        return integer + integer2;
      }
    });

    System.out.println("Pi is roughly " + 4.0 * count / n);

    jsc.stop();
  }
}
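
The map and reduce calls in the listing above implement a Monte Carlo estimate: points are drawn uniformly in the square [-1, 1] x [-1, 1], and the fraction that lands inside the unit circle approaches pi/4 (circle area pi divided by square area 4). A minimal plain-Java sketch of the same arithmetic, with no Spark involved, may make that easier to see. The class name LocalPiSketch, the fixed seed, and the use of Java 8 streams are assumptions made for illustration; none of them come from the Spark example.

```java
import java.util.Random;
import java.util.stream.IntStream;

/** Plain-Java sketch of the Monte Carlo estimate that JavaSparkPi distributes. */
final class LocalPiSketch {

  /**
   * Draws n random points in the square [-1, 1] x [-1, 1] and returns
   * 4 * (points inside the unit circle) / n.
   */
  static double estimatePi(int n, long seed) {
    Random random = new Random(seed); // fixed seed so runs are repeatable
    long inside = IntStream.range(0, n)
        // "map" step: 1 if the point falls inside the unit circle, else 0
        .map(i -> {
          double x = random.nextDouble() * 2 - 1;
          double y = random.nextDouble() * 2 - 1;
          return (x * x + y * y < 1) ? 1 : 0;
        })
        // "reduce" step: add up the hits
        .sum();
    return 4.0 * inside / n;
  }

  public static void main(String[] args) {
    System.out.println("Pi is roughly " + estimatePi(200000, 42L));
  }
}
```

In the Spark version the same two steps run over the partitions of the RDD, so the slices argument only changes how the work is split, not the arithmetic.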

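4. Download the spark-core_2.10-1.6.1 jar and the jars it depends on. With Maven this means declaring org.apache.spark:spark-core_2.10:1.6.1 as a dependency in pom.xml, and Maven fetches the jar and its transitive dependencies.

5. Import the jars into the project so that the source compiles without errors.

6. Run the program. It fails with the error: A master URL must be set in your configuration.

Fix: call setMaster("local") on the SparkConf. "local" means the program runs locally.

In code: SparkConf sparkConf = new SparkConf().setAppName("JavaSparkPi").setMaster("local");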

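7. Run it again. It fails with: sparkDriver failed.

Cause: the IP address and port are wrong.

Fix, step 1: open spark-env.sh and add SPARK_MASTER_IP=127.0.0.1 and SPARK_LOCAL_IP=127.0.0.1. It is best to use the IP address rather than a name such as localhost.

Step 2: open the system hosts file with sudo gedit /etc/hosts and add the following.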

				192.168.0.115 localhost peter-HP-ENVY-Notebook
				255.255.255.255 broadcasthost
				127.0.0.1    localhost localhost.localdomain localhost4 localhost4.localdomain4 peter-HP-ENVY-Notebook
				::1          	    localhost localhost.localdomain localhost6 localhost6.localdomain6 peter-HP-ENVY-Notebook

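I have not dug into the exact cause. A rough explanation follows.

Line 1 maps the machine's IP address to localhost and to the machine name.

Line 2 maps the broadcast address 255.255.255.255 to the name broadcasthost.

Line 3 lists the aliases of the local machine for 127.0.0.1, the IPv4 loopback address.

Line 4 does the same for ::1, the IPv6 loopback address.

Note: 127.0.0.1 is the loopback address, another way of referring to the local machine. 192.168.0.115 is the machine's real IP address on the LAN.

The sparkDriver failed error means the sparkDriver service could not come up on its port (port 0). In my case the hosts file had no entry for the real IP address; before the change it contained only 127.0.0.1 localhost.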
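
To see what the JVM makes of the hosts file, a small check like the one below can help. It is only a sketch: HostCheck is a hypothetical helper, not part of Spark, and it assumes that the address the driver ends up using depends on how the local host name resolves, which is what the fix above suggests. It uses only java.net.InetAddress from the JDK.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical helper that shows how the JVM resolves this machine's host name. */
final class HostCheck {

  /** Returns true if the given literal IP address is a loopback address. */
  static boolean isLoopback(String ipLiteral) {
    try {
      // For a literal IP address, getByName only parses it; no lookup is needed.
      return InetAddress.getByName(ipLiteral).isLoopbackAddress();
    } catch (UnknownHostException e) {
      throw new IllegalArgumentException("not a valid IP literal: " + ipLiteral, e);
    }
  }

  public static void main(String[] args) {
    try {
      InetAddress local = InetAddress.getLocalHost();
      System.out.println("host name  : " + local.getHostName());
      System.out.println("resolves to: " + local.getHostAddress());
      System.out.println("loopback?  : " + local.isLoopbackAddress());
    } catch (UnknownHostException e) {
      // The host name has no usable entry in /etc/hosts or DNS.
      System.out.println("cannot resolve local host name: " + e.getMessage());
    }
  }
}
```

If the host name cannot be resolved at all, or resolves to an address the machine does not own, that would be consistent with the driver failing to start.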

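Further experiments:

1. After shutting down Hadoop's HDFS distributed file system with stop-dfs.sh, the sparkPi program still runs successfully.

Analysis: the sparkPi program does not read any files from HDFS, so it does not need HDFS.

The computation is done by Spark itself, so MapReduce is not needed.

YARN does not need to be running for resource scheduling either. Hadoop programs themselves can also run without YARN.

2. After stopping Spark's master and worker with stop-all.sh, the sparkPi program still runs fine.

Analysis: this example sets the master to local, so everything runs in one local process and no master or worker daemons are needed.

If the master were set to a standalone URL (spark://host:port) instead, the master and workers would have to be running.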
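
The difference between the two cases comes down to the form of the master URL. The sketch below classifies the common forms; MasterUrlSketch is a hypothetical helper written for this post, not a Spark API, and it only covers the URL shapes I am sure about (local, local[N], local[*], spark://, mesos://, and the yarn variants).

```java
/** Hypothetical helper (not part of Spark) that classifies common master URL strings. */
final class MasterUrlSketch {

  /** Returns true if the master URL points at an external cluster manager. */
  static boolean needsRunningCluster(String master) {
    if (master == null || master.isEmpty()) {
      throw new IllegalArgumentException("A master URL must be set");
    }
    // "local", "local[4]" and "local[*]" run everything inside the driver JVM.
    if (master.equals("local") || master.matches("local\\[(\\d+|\\*)\\]")) {
      return false;
    }
    // "spark://host:port" is a standalone master; "mesos://..." and "yarn..."
    // hand scheduling to Mesos or YARN.
    if (master.startsWith("spark://") || master.startsWith("mesos://")
        || master.startsWith("yarn")) {
      return true;
    }
    throw new IllegalArgumentException("Unrecognized master URL: " + master);
  }

  public static void main(String[] args) {
    System.out.println(needsRunningCluster("local"));                  // false
    System.out.println(needsRunningCluster("spark://127.0.0.1:7077")); // true
  }
}
```

Port 7077 in the example is the default port of a standalone master; the address itself is just a placeholder.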

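Supplementary notes:

1. Files I checked while tracking down the sparkDriver failed error. Spark only reads these files once they have been copied to a name without the .template suffix.

spark-defaults.conf.template: not configured. I still need to look into what this file does and when to set it.

slaves.template: not configured. It determines the workers' IP addresses.

spark-env.sh: modified. It determines the master's IP address, along with many other parameters. My settings are as follows.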

# set spark environment
    export JAVA_HOME=/usr/lib/jvm/java
    export SCALA_HOME=/opt/scala
    export SPARK_MASTER_IP=127.0.0.1
    export SPARK_LOCAL_IP=127.0.0.1
    export SPARK_WORKER_CORES=2
    export SPARK_WORKER_MEMORY=1g
    export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

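2. How to package the program as a jar and run it on a Spark cluster. Here is a first outline.

Package the program as a jar with Maven or Eclipse.

Run the jar from the command line with spark-submit [options] ×××.jar [arguments]. Options for spark-submit itself, such as --class and --master, go before the jar; anything after the jar is passed to the program as arguments.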
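
The argument order is easy to get wrong, so here is a small sketch that assembles the command as a list. SubmitCommandSketch and the jar path target/spark-test.jar are hypothetical names used for illustration; --class and --master are real spark-submit options, and sparkTest.JavaSparkPi is the class from the listing earlier in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that assembles a spark-submit command line in the expected order. */
final class SubmitCommandSketch {

  static List<String> build(String mainClass, String master, String jar, String... appArgs) {
    List<String> cmd = new ArrayList<>();
    cmd.add("spark-submit");
    // spark-submit options come before the application jar
    cmd.add("--class");
    cmd.add(mainClass);
    cmd.add("--master");
    cmd.add(master);
    cmd.add(jar);
    // everything after the jar is passed to the application's main(String[] args)
    cmd.addAll(Arrays.asList(appArgs));
    return cmd;
  }

  public static void main(String[] args) {
    // The jar path is a placeholder; "4" becomes args[0], the slices count in JavaSparkPi.
    System.out.println(String.join(" ",
        build("sparkTest.JavaSparkPi", "local[2]", "target/spark-test.jar", "4")));
  }
}
```

One caveat: settings made directly on the SparkConf take precedence over spark-submit flags, so a hard-coded setMaster("local") would override --master. For a cluster run the setMaster call would need to be removed from the code.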

Tip:

The hostname command shows the machine name of the local host.
