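An RDD is an immutable, distributed collection of objects: it holds only references to the objects, while the objects themselves live on the nodes of the cluster.
It is resilient and fault-tolerant.
Transformations: every operation produces a new RDD; the original RDD is never modified after it is created.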

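By default, an RDD is partitioned with a hash algorithm.
The number of partitions depends on the number of nodes and the size of the data.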
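
As a rough illustration of hash partitioning, the sketch below assigns each key to a partition by taking its hash modulo the number of partitions. This is plain Python, not Spark code: `hash_partition` and `num_partitions` are hypothetical names, and integer keys are used because Python hashes a small non-negative integer to itself, which keeps the result deterministic.

```python
from collections import defaultdict

def hash_partition(keys, num_partitions):
    """Group keys into partitions by hash(key) modulo num_partitions."""
    partitions = defaultdict(list)
    for key in keys:
        # Python's % is non-negative for a positive divisor, so this is a valid index.
        partitions[hash(key) % num_partitions].append(key)
    return dict(partitions)

# Hypothetical data: ten integer keys spread over three partitions.
parts = hash_partition(range(10), 3)
```

Equal keys always land in the same partition, which is the property that key-based operations rely on.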

RDD Creation

Parallelizing a collection: the collection is split into partitions, and the partitions are distributed across the cluster
Reading data from an external source
Transformation of an existing RDD
Streaming API

Transformations

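A transformation creates a new RDD from an existing one, for example by splitting, filtering, or sorting.
Several transformations can be chained and executed in sequence.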

Actions

An action triggers the entire DAG (Directed Acyclic Graph) of transformations.
To trigger the computation, we run an action.
An action instructs Spark to compute the result of a series of transformations.

There are two types:

Driver: for example collect and count
Distributed: for example saveAsTextFile

Actions can:

view results in the console
collect data into native objects of the language in use
write data to an output data source

flightData2015 = spark\
    .read\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .csv("/data/flight-data/csv/2015-summary.csv")

flightData2015.sort("count").explain()

== Physical Plan ==
*Sort [count#195 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#195 ASC NULLS FIRST, 200)
    +- *FileScan csv [DEST_COUNTRY_NAME#193,ORIGIN_COUNTRY_NAME#194,count#195] ...

Shuffling

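Moving data between partitions in order to repartition it is called shuffling.

The plan above uses 200 shuffle partitions, which is the default. To lower the number to 5: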

spark.conf.set("spark.sql.shuffle.partitions", "5")
flightData2015.sort("count").take(2)

The more shuffling a job does, the more stages it has, and the more performance suffers.

Narrow Dependencies

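Simple one-to-one transformations such as filter, map, and flatMap: each partition of the child RDD depends on exactly one partition of the parent RDD.
The data is transformed on the same node (the node that holds the parent partition) and is not transferred across executors.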

Narrow dependencies are in the same stage of the job execution.

Wide Dependencies

Operations that repartition or redistribute data, such as aggregateByKey and reduceByKey.

Wide dependencies introduce new stages in the job execution.
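
The difference between the two kinds of dependency can be sketched in plain Python. This is a simulation rather than Spark code: `narrow_map` and `wide_reduce_by_key` are hypothetical helpers, a list of lists stands in for the partitions of an RDD, and integer keys keep the hash deterministic.

```python
from collections import defaultdict
from functools import reduce

def narrow_map(partitions, fn):
    """Narrow: each output partition is built from exactly one input partition."""
    return [[fn(x) for x in part] for part in partitions]

def wide_reduce_by_key(partitions, fn, num_partitions):
    """Wide: records with the same key are moved (shuffled) into one partition."""
    shuffled = [defaultdict(list) for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            # A record may leave its original partition here: this is the shuffle.
            shuffled[hash(key) % num_partitions][key].append(value)
    return [{k: reduce(fn, vs) for k, vs in bucket.items()} for bucket in shuffled]

keys = [[1, 2, 1], [2, 3, 1]]                      # two input partitions
pairs = narrow_map(keys, lambda k: (k, 1))         # stays within each partition
counts = wide_reduce_by_key(pairs, lambda a, b: a + b, 2)
```

In `narrow_map` no record ever changes partition, so the step can be pipelined within a stage. In `wide_reduce_by_key` every output partition may need input from every input partition, which is why such a step marks a stage boundary.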

Broadcast variables

Broadcast variables are shared variables across all executors.

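The driver can only broadcast data that it holds itself; it cannot broadcast an RDD, because an RDD is only a reference to distributed data.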

Accumulators

Accumulators are shared variables across executors: tasks can only add to them, and the driver reads the result.

Lazy Evaluation

Spark waits until the very last moment to execute the graph of computation instructions.
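
The idea can be sketched in plain Python. The `LazyDataset` class below is a hypothetical stand-in, not a Spark API: transformations only record a step and return a new object, and nothing runs until the `collect` action is called.

```python
class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps            # tuple of ("map" | "filter", function)

    def map(self, fn):
        # Returns a new dataset; the original is left unchanged.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def collect(self):
        """Action: run every recorded step over the source data."""
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

calls = []

def double(x):
    calls.append(x)                    # side effect shows when work really happens
    return x * 2

ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = list(calls)      # still empty: no step has run yet
result = ds.collect()                  # the action triggers the whole chain
```

Because the whole chain is known before anything runs, an engine has the chance to optimize it as a unit, which is the motivation for lazy evaluation.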

You can register any DataFrame as a table or view (a temporary table)
and query it with pure SQL.
There is no performance difference between writing SQL queries and writing DataFrame code:
both compile to the same underlying plan.

For example:

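# Register the DataFrame as a temporary view so that SQL can refer to it by name
flightData2015.createOrReplaceTempView("flight_data_2015")

sqlWay = spark.sql("""
    SELECT DEST_COUNTRY_NAME, count(1)
    FROM flight_data_2015
    GROUP BY DEST_COUNTRY_NAME
    """)
sqlWay.explain()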

and

dataFrameWay = flightData2015\
    .groupBy("DEST_COUNTRY_NAME")\
    .count()
    
dataFrameWay.explain()

both produce the same execution plan:

*HashAggregate(keys=[DEST_COUNTRY_NAME#182], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#182, 5)
    +- *HashAggregate(keys=[DEST_COUNTRY_NAME#182], functions=[partial_count(1)])
        +- *FileScan csv [DEST_COUNTRY_NAME#182] ...
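
Reading the plan from the bottom up: each partition first computes a partial count per key, the Exchange step shuffles those partial results by key hash into 5 partitions, and the final HashAggregate merges them. The plain-Python sketch below imitates the three steps. It is a simulation, not Spark code: the function names and sample rows are hypothetical, and `zlib.crc32` is an assumed stand-in for the hash function.

```python
import zlib
from collections import Counter

def partial_count(partition):
    """Before the shuffle: count rows per key inside one partition."""
    return Counter(partition)

def exchange(partials, num_partitions):
    """Shuffle: route each (key, partial count) to a partition chosen by key hash."""
    buckets = [[] for _ in range(num_partitions)]
    for partial in partials:
        for key, count in partial.items():
            buckets[zlib.crc32(key.encode()) % num_partitions].append((key, count))
    return buckets

def final_count(bucket):
    """After the shuffle: merge the partial counts for each key."""
    totals = Counter()
    for key, count in bucket:
        totals[key] += count
    return totals

# Hypothetical sample rows: two input partitions of DEST_COUNTRY_NAME values.
input_partitions = [
    ["United States", "Ireland", "United States"],
    ["Ireland", "Egypt", "United States"],
]
partials = [partial_count(p) for p in input_partitions]
buckets = exchange(partials, 5)
totals = Counter()
for bucket in buckets:
    totals.update(final_count(bucket))
```

Aggregating before the shuffle means that only one record per key and partition crosses the network instead of every row.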

For the query

maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")

maxSql.show()

the DataFrame equivalent is

from pyspark.sql.functions import desc

flightData2015\
    .groupBy("DEST_COUNTRY_NAME")\
    .sum("count")\
    .withColumnRenamed("sum(count)", "destination_total")\
    .sort(desc("destination_total"))\
    .limit(5)\
    .show()

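The logic of the program is a chain of steps: read the CSV, group by DEST_COUNTRY_NAME, sum count, rename the column, sort in descending order, keep the first 5 rows, and show them.

The execution plan is: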

TakeOrderedAndProject(limit=5, orderBy=[destination_total#16194L DESC], outpu...
+- *HashAggregate(keys=[DEST_COUNTRY_NAME#7323], functions=[sum(count#7325L)])
    +- Exchange hashpartitioning(DEST_COUNTRY_NAME#7323, 5)
        +- *HashAggregate(keys=[DEST_COUNTRY_NAME#7323], functions=[partial_sum...
            +- InMemoryTableScan [DEST_COUNTRY_NAME#7323, count#7325L]
                +- InMemoryRelation [DEST_COUNTRY_NAME#7323, ORIGIN_COUNTRY_NA...
                    +- *Scan csv [DEST_COUNTRY_NAME#7578,ORIGIN_COUNTRY_NAME...

DataFrames Versus Datasets

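The types of Datasets are checked at compile time,
whereas the types of DataFrames are checked at runtime.

Datasets are available only in JVM-based languages.
In most cases you will probably prefer DataFrames. A DataFrame is equivalent to a Dataset of type Row.
The Row type is an internal representation optimized for in-memory computation (the Catalyst format), and it has a lower garbage-collection cost than JVM types.

The partitioning scheme defines where the data is stored (its physical distribution).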

Columns

simple types, such as integer or string
complex types, such as array or map
can be null

Schemas define the names and types of the columns.

Structured API Execution

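Write DataFrame/Dataset/SQL code
If the code is valid, Spark converts it into a Logical Plan
Spark transforms the Logical Plan into a Physical Plan and optimizes it
Spark executes the Physical Plan on the cluster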

Logical Planning

The logical plan contains only transformations; it makes no reference to executors or drivers.
The result is passed to the Catalyst Optimizer.

Physical Planning

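The physical plan specifies how the logical plan will be executed.
Spark compares different strategies using a cost model. The result is a series of RDDs and transformations.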

Execution

Spark runs all of this code over RDDs.
It performs further optimizations at runtime and generates native Java bytecode.

A StructType is a collection of StructFields, each of which has:

a name
a type
a flag saying whether it is nullable
metadata (optional)

Columns and Expressions

Columns provide a subset of expression functionality.
An expression is a logical tree, for example (((col("someCol") + 5) * 200) - 6) < col("otherCol")
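
To make the tree structure concrete, here is a minimal plain-Python sketch in which operators build tree nodes instead of computing values. It does not use Spark: `Expr`, `Col`, `Lit`, `BinOp`, and this `col` are hypothetical classes and helpers that only imitate the shape of the idea.

```python
import operator

class Expr:
    """Base node of a tiny expression tree; operators build new nodes."""
    def __add__(self, other): return BinOp(operator.add, "+", self, lit(other))
    def __sub__(self, other): return BinOp(operator.sub, "-", self, lit(other))
    def __mul__(self, other): return BinOp(operator.mul, "*", self, lit(other))
    def __lt__(self, other): return BinOp(operator.lt, "<", self, lit(other))

class Col(Expr):
    def __init__(self, name): self.name = name
    def eval(self, row): return row[self.name]     # look the column up in a row
    def __repr__(self): return self.name

class Lit(Expr):
    def __init__(self, value): self.value = value
    def eval(self, row): return self.value         # a literal ignores the row
    def __repr__(self): return repr(self.value)

class BinOp(Expr):
    def __init__(self, fn, symbol, left, right):
        self.fn, self.symbol, self.left, self.right = fn, symbol, left, right
    def eval(self, row): return self.fn(self.left.eval(row), self.right.eval(row))
    def __repr__(self): return f"({self.left} {self.symbol} {self.right})"

def lit(value):
    """Wrap plain Python values as literal nodes; leave tree nodes untouched."""
    return value if isinstance(value, Expr) else Lit(value)

def col(name):
    return Col(name)

# Building the expression does no arithmetic; it only constructs the tree.
expr = (((col("someCol") + 5) * 200) - 6) < col("otherCol")
is_less = expr.eval({"someCol": 1, "otherCol": 2000})   # evaluated against one row
```

Nothing is computed until `eval` receives a row, which is what lets an engine inspect and rewrite the tree before running it.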

DataFrame Transformations

How Spark Performs Joins

Two concepts need to be understood first:

node-to-node communication strategy
per node computation strategy

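There are two different communication strategies: a shuffle join (all-to-all communication) or a broadcast join.
These internal optimizations are likely to change over time as the cost-based optimizer and the communication strategies improve.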

Big table–to–big table

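Spark uses a shuffle join.

Every node (worker node) talks to every other node, and they share data across nodes.
If the data is not partitioned well, this communication is expensive.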

Big table–to–small table

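When the table is small enough to fit into the memory of a single worker node, Spark can replicate the small DataFrame to every worker node.

The SQL interface supports join hints (written in comment syntax). They are not enforced, however, and the optimizer may choose to ignore them.
The available options are MAPJOIN, BROADCAST, and BROADCASTJOIN.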

SELECT /*+ MAPJOIN(graduateProgram) */ * FROM person JOIN graduateProgram
    ON person.graduate_program = graduateProgram.id

If you broadcast data that is too large, the driver node may crash.
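
The per-node work of a broadcast join can be sketched in plain Python. This is a simulation under stated assumptions, not Spark's implementation: `broadcast_hash_join` is a hypothetical helper, the rows are invented sample data, and a list of lists stands in for the partitions of the big table.

```python
def broadcast_hash_join(big_partitions, small_table, big_key, small_key):
    """Inner-join each partition of the big table against a copy of the small table."""
    # Built once from the small table; every partition uses the complete lookup.
    lookup = {row[small_key]: row for row in small_table}
    joined = []
    for partition in big_partitions:           # each partition is joined locally
        out = []
        for row in partition:
            match = lookup.get(row[big_key])
            if match is not None:              # inner join: unmatched rows are dropped
                out.append((row, match))
        joined.append(out)
    return joined

# Hypothetical sample data shaped like the person and graduateProgram tables.
person_partitions = [
    [{"name": "A", "graduate_program": 0}, {"name": "B", "graduate_program": 1}],
    [{"name": "C", "graduate_program": 1}, {"name": "D", "graduate_program": 9}],
]
graduate_program = [{"id": 0, "degree": "Masters"}, {"id": 1, "degree": "Ph.D."}]

joined = broadcast_hash_join(person_partitions, graduate_program,
                             "graduate_program", "id")
summary = [[(p["name"], g["degree"]) for p, g in part] for part in joined]
```

No row of the big table ever changes partition; only the small table is copied, which is why this strategy avoids the all-to-all communication of a shuffle join.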

Little table–to–little table

It is best to let Spark decide how to join them.

Broadcast Variables

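Broadcast variables are shared across the cluster, immutable, and not encapsulated in a function closure.
Normally, an object is accessed inside a closure simply by referencing it. For large objects, however, the worker nodes may have to deserialize the object many times.
If the same object is used across multiple actions and jobs, it is re-sent to the workers with every job.

Broadcast variables, by contrast, are shared on every machine and do not have to be sent each time.

A broadcast variable is sent only when we trigger an action. Its value is accessed through the value method.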
