SparkPi Source Code Walkthrough

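On Ubuntu, using Eclipse with Maven, I built my first Spark program, SparkPi, and it ran and produced the correct result.

This post walks through the SparkPi source code to get familiar with the Spark Java API, in preparation for later Java-based Spark programming.

The SparkPi source code is as follows: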

package sparkTest;

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

import java.util.ArrayList;
import java.util.List;

/** 
 * Computes an approximation to pi
 * Usage: JavaSparkPi [slices]
 */
public final class JavaSparkPi {

  public static void main(String[] args) throws Exception {
    SparkConf sparkConf = new SparkConf().setAppName("JavaSparkPi").setMaster("local");
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);

    int slices = (args.length == 1) ? Integer.parseInt(args[0]) : 2;
    int n = 100000 * slices;
    List<Integer> l = new ArrayList<Integer>(n);
    for (int i = 0; i < n; i++) {
      l.add(i);
    }

    JavaRDD<Integer> dataSet = jsc.parallelize(l, slices);

    int count = dataSet.map(new Function<Integer, Integer>() {
      @Override
      public Integer call(Integer integer) {
        double x = Math.random() * 2 - 1;
        double y = Math.random() * 2 - 1;
        return (x * x + y * y < 1) ? 1 : 0;
      }
    }).reduce(new Function2<Integer, Integer, Integer>() {
      @Override
      public Integer call(Integer integer, Integer integer2) {
        return integer + integer2;
      }
    });

    System.out.println("Pi is roughly " + 4.0 * count / n);

    jsc.stop();
  }
}

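Source code analysis:

1. SparkContext is the core object of a Spark program. The first step of every Spark program is to create and configure a SparkContext (here, a JavaSparkContext built from a SparkConf).

The SparkContext tells Spark how to access the cluster.

setAppName() sets the application name, which is shown in the cluster UI.

setMaster() takes the URL of a Spark, Mesos, or YARN cluster, or the string "local" to run the program in local mode.

In real programs the master is usually not hard-coded, because that makes the program inflexible.

The usual approach is to launch the program with spark-submit and pass the master as a command-line argument (the --master option).

For local testing, the master can simply be set to local.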

2. parallelize() distributes the input collection and turns it into an RDD, on which RDD operations such as transformations and actions can then be run.

    parallelize()

3. map is a transformation: it applies the function passed to map once to every element of the dataset and produces a new RDD.

    map()

4. reduce aggregates all elements of the RDD using the function passed to it, producing a single final result.

    reduce()
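The map and reduce steps in items 3 and 4 can be tried without a cluster. The following is a minimal sketch in plain Java that mirrors the SparkPi logic with java.util.stream instead of an RDD; the class name LocalPiSketch, the estimate method, and the fixed seed are illustrative assumptions, not part of the original program. Each sample maps to 1 if the random point falls inside the unit circle and to 0 otherwise, and the reduce step sums those values. Because the circle covers π/4 of the enclosing square, 4 * count / n approximates π.

```java
import java.util.Random;
import java.util.stream.IntStream;

// Hypothetical local stand-in for the SparkPi map/reduce pipeline (no Spark involved).
final class LocalPiSketch {

    // Estimates pi from n random points; the seed only makes runs repeatable.
    static double estimate(int n, long seed) {
        Random random = new Random(seed);
        int count = IntStream.range(0, n)
                // "map" step: 1 if the point (x, y) lies inside the unit circle, else 0
                .map(i -> {
                    double x = random.nextDouble() * 2 - 1;
                    double y = random.nextDouble() * 2 - 1;
                    return (x * x + y * y < 1) ? 1 : 0;
                })
                // "reduce" step: add up the hits
                .reduce(0, Integer::sum);
        return 4.0 * count / n;
    }

    public static void main(String[] args) {
        System.out.println("Pi is roughly " + estimate(200000, 42L));
    }
}
```

In SparkPi the same two steps run on the partitions of the RDD, and only the final count comes back to the program that prints the result.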

5. Passing functions into map and reduce is a common pattern in Spark programs. They can be written in two ways.

new Function(): the traditional form, an anonymous inner class.

Lambda expression: more concise (requires Java 8 or later).

6. Function is an interface, where T1 is the input type and R is the return type. Function2 works the same way, except that its call method takes two inputs, T1 and T2.

Interface Function<T1,R>; Function2<T1,T2,R>

    new Function<Integer, Integer>()

    new Function2<Integer, Integer, Integer>()
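To make items 5 and 6 concrete, here is a small sketch of the two styles side by side. So that it runs without Spark on the classpath, it declares its own Function and Function2 interfaces as stand-ins with the same shape as Spark's (a single call method); these nested interfaces, the class name FunctionStyles, and the helper method are assumptions made for illustration only.

```java
import java.io.Serializable;

// Illustration only: local stand-ins shaped like Spark's Function and Function2.
final class FunctionStyles {

    // One input (T1), one result (R).
    interface Function<T1, R> extends Serializable {
        R call(T1 v1) throws Exception;
    }

    // Two inputs (T1, T2), one result (R).
    interface Function2<T1, T2, R> extends Serializable {
        R call(T1 v1, T2 v2) throws Exception;
    }

    // Traditional style: an anonymous inner class implementing call().
    static final Function2<Integer, Integer, Integer> SUM_ANONYMOUS =
            new Function2<Integer, Integer, Integer>() {
                @Override
                public Integer call(Integer a, Integer b) {
                    return a + b;
                }
            };

    // Lambda style: the same behavior in one line.
    static final Function2<Integer, Integer, Integer> SUM_LAMBDA = (a, b) -> a + b;

    // A one-argument function written as a lambda.
    static final Function<Integer, Integer> SQUARE = x -> x * x;

    // Helper that invokes both styles and reports whether they agree.
    static boolean sameResult(int a, int b) {
        try {
            return SUM_ANONYMOUS.call(a, b).equals(SUM_LAMBDA.call(a, b));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(SUM_ANONYMOUS.call(2, 3)); // anonymous class
        System.out.println(SUM_LAMBDA.call(2, 3));    // lambda
        System.out.println(SQUARE.call(4));           // one-argument lambda
    }
}
```

With the real Spark interfaces, the same lambda forms can be passed directly to map() and reduce() on Java 8 or later, which removes most of the boilerplate seen in the SparkPi listing.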

7. Besides map and reduce, Spark supports many other data operations, such as filter(), collect(), and count().

Additional notes:

1. How is data parallelized?

    List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
    JavaRDD<Integer> distData = sc.parallelize(data);

The parallelize() call turns the ordinary collection data into a distributed dataset (distData), which the Spark cluster can then process in parallel.

    sc.parallelize(data, 10)

The number of partitions can also be set manually through the second argument; Spark runs one task for each partition.

If the number of partitions is not specified, Spark chooses it automatically.
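To picture what setting the number of partitions means, the sketch below cuts a list into a given number of contiguous slices in plain Java. It only illustrates the idea that each slice becomes the input of one task; the class SliceSketch and its slice method are hypothetical, and the index arithmetic is an assumption, not a statement of how Spark itself divides a collection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of splitting a collection into partitions (not Spark code).
final class SliceSketch {

    // Splits data into numSlices contiguous sublists of nearly equal size.
    static <T> List<List<T>> slice(List<T> data, int numSlices) {
        if (numSlices < 1) {
            throw new IllegalArgumentException("numSlices must be at least 1");
        }
        List<List<T>> slices = new ArrayList<List<T>>();
        int n = data.size();
        for (int i = 0; i < numSlices; i++) {
            // Integer arithmetic spreads any remainder across the slices.
            int start = (int) ((long) i * n / numSlices);
            int end = (int) ((long) (i + 1) * n / numSlices);
            slices.add(new ArrayList<T>(data.subList(start, end)));
        }
        return slices;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(slice(data, 2)); // prints [[1, 2], [3, 4, 5]]
    }
}
```

Asking for more slices than there are elements simply leaves some slices empty, which is one reason a very large partition count on a small collection buys nothing.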

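2. Which external data sources does Spark support?

Spark input can come from Hadoop HDFS, the local file system, Cassandra, HBase, Amazon S3, and other sources.

textFile() turns a file into an RDD. Once that is done, transformations and actions such as map and reduce can be applied to the new RDD.

The path argument of textFile() can be a directory, a compressed file, or a wildcard pattern.

By default textFile() creates one partition per file block (the HDFS block size, 64 MB by default in older Hadoop versions and 128 MB in newer ones). A larger number of partitions can be requested through an optional second argument.

A note on the statement val lines = textFile("test.txt"): it does not load test.txt into memory and does not perform any operation on the file. lines is merely a pointer to the file.

3. An RDD can be created in two ways: by parallelizing an existing collection, or by referencing a dataset in external storage such as Hadoop HDFS or HBase.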

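4. Spark supports two kinds of operations: transformations and actions.

Transformations: create a new RDD from an existing dataset. For example, map is a transformation.

Actions: run a computation on the RDD's dataset and return a value to the driver program. For example, reduce is an action.

Transformations are lazy. They are only recorded, and nothing is computed until an action needs a result.

This lazy design makes Spark more efficient: the intermediate results of transformations do not have to be returned to the driver, only the result of the action.

Consider the following operations: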

    JavaRDD<String> lines = sc.textFile("data.txt");
    JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
    int totalLength = lineLengths.reduce((a, b) -> a + b);

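The first two steps are not executed right away. Only the final step, reduce, triggers execution of the whole Spark program.

At that point Spark breaks the computation into tasks that run on separate machines in the cluster. Each machine runs its part of the map and a local reduce, and returns only its result to the driver program.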
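Java's own java.util.stream API is lazy in a similar way, which makes this behavior easy to observe locally. The sketch below is an analogy in plain Java, not Spark code: the class LazyAnalogy, the sample strings, and the call counter are assumptions for illustration. The counter shows that the mapping function has not run at all when the pipeline is defined, and runs once per element only when the terminal reduce is called.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Analogy only: Java streams are lazy in a way that resembles RDD transformations.
final class LazyAnalogy {

    // Returns {calls before reduce, total length, calls after reduce}.
    static int[] run() {
        AtomicInteger calls = new AtomicInteger();
        List<String> lines = Arrays.asList("spark", "pi", "rdd");

        // Defining the map step only records it; the lambda is not invoked yet.
        Stream<Integer> lineLengths = lines.stream().map(s -> {
            calls.incrementAndGet();
            return s.length();
        });
        int callsBefore = calls.get();

        // The terminal reduce triggers the actual work.
        int totalLength = lineLengths.reduce(0, Integer::sum);
        int callsAfter = calls.get();

        return new int[] {callsBefore, totalLength, callsAfter};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run())); // prints [0, 10, 3]
    }
}
```

The analogy stops at laziness: a Java stream runs in one JVM and can be consumed only once, whereas an RDD is distributed and can be recomputed or persisted.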

If lineLengths will be used again later, it can be persisted as follows, before the reduce, so that it is kept in memory after it is first computed.

lineLengths.persist(StorageLevel.MEMORY_ONLY());

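5. An RDD can be kept in memory with persist() or cache(), which speeds up later accesses to it.

RDDs can also be persisted on disk, or replicated across multiple nodes.

Which storage level to choose depends on the situation.

Spark automatically evicts old cached data, in least-recently-used order, when the cache fills up. To remove an RDD without waiting for that, call unpersist() manually.

6. What is the driver program, which the official documentation mentions repeatedly?

It can be understood as the Spark program we write: the process that runs main() and creates the SparkContext.

7. Spark provides two ways to share data.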

broadcast variables

accumulators

[Figure: a simple reference diagram of how Spark works]

