Contents
- Indirect Performance Enhancements
  - Scala Versus Java Versus Python Versus R
  - DataFrame Versus SQL Versus Datasets Versus RDDs
  - Object Serialization in RDDs
  - Cluster/Application Sizing and Sharing
  - File-based Long-term Data Storage
  - Splittable File Types and Compression
  - Memory Pressure and Garbage Collection
  - Measuring the Impact of Garbage Collection
- Direct Performance Enhancements
  - Repartitioning and Coalescing
  - Temporary Data Storage (Caching)
Preface
For instance, if you had an extremely fast network, many of your Spark jobs would run faster, because shuffles are so often one of the costlier steps in a Spark job. Most likely, you won't have much ability to control such things; therefore, we're going to discuss the things you can control through code choices or configuration.
In other words, with good enough hardware Spark will run well, but hardware is usually outside our control, so we focus on the factors we can control.
Before Tuning
One of the best things you can do to figure out how to improve performance is to
implement good monitoring and job history tracking. Without this information, it
can be difficult to know whether you're really improving job performance.
General Directions for Tuning
- Code-level design choices (e.g., RDDs versus DataFrames)
- Data at rest
- Joins
- Aggregations
- Data in flight
- Individual application properties
- Inside of the Java Virtual Machine (JVM) of an executor
- Worker nodes
- Cluster and deployment properties
We can either do so indirectly, by setting configuration values or changing the runtime environment; these changes should improve things across Spark applications or across Spark jobs. Alternatively, we can try to directly change execution characteristics or design choices at the individual Spark job, stage, or task level. These kinds of fixes are very specific to that one area of our application and therefore have limited overall impact.
In short: tune indirectly through Spark configuration to make the program run better, or tune directly by changing execution characteristics and the design of individual Spark jobs.
1 Indirect Performance Enhancements
We'll skip the obvious ones like "improve your hardware" and focus more on the things within your control.
1 Design Choices
Good design choices help a program run more stably and cope better with external change.
1 Scala Versus Java Versus Python Versus R
適合R語言:For instance, if you want to perform some singlenode machine learning after performing a large ETL job, we might recommend running your Extract, Transform, and Load (ETL) code as SparkR code and then
using R’s massive machine learning ecosystem to run your single-node machine learning algorithms
2 DataFrames Versus SQL Versus Datasets Versus RDDs
Across all languages, DataFrames, Datasets, and SQL are equivalent in speed: DataFrame, SQL, and Dataset code all compile down to RDDs, and Spark optimizes how they execute.
If you develop directly against RDDs, Scala or Java is recommended.
2 Object Serialization in RDDs
Kryo serialization is more compact and efficient than Java serialization, but it takes more work: you must set the spark.kryo.classesToRegister configuration and register the classes you intend to serialize via conf.registerKryoClasses().
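As a sketch, the equivalent spark-defaults.conf entries might look like this; the class names are placeholders for your own classes:

```
spark.serializer               org.apache.spark.serializer.KryoSerializer
spark.kryo.classesToRegister   com.example.CustomerRecord,com.example.Order
```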
3 Cluster Configurations
1 Cluster/Application Sizing and Sharing
There are a lot of options for how you want to share resources at the cluster level or at the application level.
2 Dynamic Allocation
This means that your application can give resources back to the cluster if they are no longer used, and request them again later when there is demand. To enable it, set spark.dynamicAllocation.enabled to true.
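A minimal spark-defaults.conf sketch for dynamic allocation; the executor counts and the idle timeout are illustrative values, and note that dynamic allocation also depends on the external shuffle service:

```
spark.dynamicAllocation.enabled              true
spark.dynamicAllocation.minExecutors         1
spark.dynamicAllocation.maxExecutors         20
spark.dynamicAllocation.executorIdleTimeout  60s
spark.shuffle.service.enabled                true
```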
4 Scheduling
Set spark.scheduler.mode to FAIR to allow better sharing of resources across multiple users, or set --max-executor-cores, which specifies the maximum number of executor cores that your application will need. Specifying this value can ensure that your application does not take up all the resources on the cluster.
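As a sketch, fair scheduling plus a cap on cores might be configured as follows; spark.cores.max is the configuration key that caps an application's total cores on standalone and Mesos clusters, and the value here is illustrative:

```
spark.scheduler.mode  FAIR
spark.cores.max       24
```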
5 Data at Rest
Making sure that you’re storing your data for effective reads later on is
absolutely essential to successful big data projects.
1 File-based Long-term Data Storage
One of the easiest ways to optimize your Spark jobs is to follow best practices when storing data and choose the most efficient storage format possible. The most efficient file format you can generally choose is Apache Parquet.
2 Splittable File Types and Compression
Whatever storage you use, make sure the files are splittable, so that different tasks can read the data in parallel.
Compression formats:
ZIP and TAR archives cannot be split, which means that if you have 10 files inside one ZIP file and 10 cores, only one core can read that data, because the archive cannot be read in parallel.
Files compressed with gzip, bzip2, or lz4 are generally splittable if they were written by a parallel processing framework like Hadoop or Spark.
For your own input data, the simplest way to make it splittable is to upload it as separate files, ideally each no larger than a few hundred megabytes.
3 Table partitioning
Partition tables by frequently used columns, such as date or customerID, so that unnecessary reads can be pruned.
The downside of partitioning is that it can produce many small files.
4 Bucketing
Bucketing your data allows Spark to "pre-partition" data according to how joins or aggregations are likely to be performed by readers. This improves performance and stability because the data can be spread evenly across partitions instead of skewing into one or two. It can also help prevent a shuffle before a join and therefore speed up data access.
5 The number of files
How data is written:
One way of controlling data partitioning when you write your data is through a write option introduced in Spark 2.2. To control how many records go into each file, you can specify the maxRecordsPerFile option on the write operation.
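A minimal sketch of that option; it assumes an existing DataFrame `df` in a live Spark session, and the output path is a placeholder:

```python
# Limit each output file to at most 5,000 records (Spark >= 2.2).
df.write.option("maxRecordsPerFile", 5000).parquet("/data/flights-out")
```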
6 Data locality
Data locality means scheduling tasks on the nodes that already hold the data they need, avoiding data transfer over the network.
7 Statistics collection
Collect statistics on tables or columns. There are two kinds of statistics: table-level and column-level statistics. Statistics collection is available only on named tables, not on arbitrary DataFrames or RDDs.
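In Spark SQL, these statistics are gathered with the ANALYZE TABLE statement; the table and column names below are placeholders:

```sql
-- Table-level statistics
ANALYZE TABLE flights COMPUTE STATISTICS;
-- Column-level statistics
ANALYZE TABLE flights COMPUTE STATISTICS FOR COLUMNS origin, dest;
```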
6 Shuffle Configurations
Configuring Spark’s external shuffle service (discussed in Chapters 16 and 17)
can often increase performance because it allows nodes to read shuffle data from
remote machines even when the executors on those machines are busy (e.g., with
garbage collection).
7 Memory Pressure and Garbage Collection
1 When does memory pressure occur?
When a Spark application takes up too much memory at runtime, when garbage collection runs too frequently, or when execution slows down because too many JVM objects are no longer in use but have not yet been collected. Prefer the Structured APIs: they not only improve performance but also ease memory pressure, because they reduce the reliance on JVM objects.
2 Measuring the impact of garbage collection
The first step in GC tuning is to collect statistics on how frequently garbage collection occurs and how much time each collection takes. You can do this by adding the following to spark.executor.extraJavaOptions:
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
After setting this, the next time the Spark application runs, a message is printed to the worker's logs whenever a garbage collection occurs: the messages appear in the stdout files on the worker nodes, not on the driver.
3 Garbage collection tuning
To tune GC, you first need to understand JVM memory management.
1 JVM memory management
- Java heap space is divided into two regions: Young and Old. The Young generation is meant to hold short-lived objects whereas the Old generation is intended for objects with longer lifetimes.
- The Young generation is further divided into three regions: Eden, Survivor1, and Survivor2
2 JVM garbage collection steps
- When Eden is full, a minor garbage collection is run on Eden and objects that are alive from Eden and Survivor1 are copied to Survivor2
- The Survivor regions are swapped.
- If an object is old enough or if Survivor2 is full, that object is moved to Old.
- Finally, when Old is close to full, a full garbage collection is invoked. This involves tracing through all the objects on the heap, deleting the unreferenced ones, and moving the others to fill up unused space, so it is generally the slowest garbage collection operation.
3 The goal of garbage collection tuning
The goal of garbage collection tuning in Spark is to ensure that only long-lived cached datasets are stored in the Old generation and that the Young generation is sufficiently sized to store all short-lived objects. This will help avoid full garbage collections that collect temporary objects created during task execution.
4 Applying GC tuning
- If full garbage collection runs frequently before a task completes, there is not enough memory for executing tasks; reduce the amount of memory used for caching by lowering spark.memory.fraction (part of the UnifiedMemoryManager).
A quick overview of unified memory management, which divides memory into three parts (assume a 1300 MB JVM heap):
- 300 MB of reserved memory, which cannot be changed and is used internally by Spark
- User memory: (total − 300 MB) × 0.4 = 400 MB
- Storage and execution memory: (total − 300 MB) × 0.6 = 600 MB, further divided by spark.memory.storageFraction into a storage region and an execution region that can borrow from each other (see SparkSQL 內核解析, p. 162, for details)
- If minor collections are frequent but major garbage collections are rare, consider enlarging the Eden region; for example, if Eden's planned size is E, scale it up to 4/3 E.
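The unified-memory arithmetic above can be sketched as follows; the 1300 MB heap and the default fractions (0.6 for spark.memory.fraction, 0.5 for spark.memory.storageFraction) are the assumptions from these notes, not values read from a live cluster:

```python
# Sketch of UnifiedMemoryManager sizing, reproducing the 1300 MB example.
RESERVED_MB = 300  # fixed reserved memory for Spark internals

def unified_memory_regions(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Return (user_mb, unified_mb, storage_mb) for a given JVM heap size."""
    usable = heap_mb - RESERVED_MB
    unified = usable * memory_fraction        # shared execution + storage pool
    user = usable * (1 - memory_fraction)     # user data structures
    storage = unified * storage_fraction      # storage region (can borrow/lend)
    return user, unified, storage

user, unified, storage = unified_memory_regions(1300)
print(user, unified, storage)  # roughly 400, 600, and 300 MB
```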
2 Direct Performance Enhancements
1 Parallelism
To make a stage run faster, increase its parallelism. If the stage processes a large amount of data, the recommendation is 2-3 tasks per CPU core.
To set this, tune the following two parameters to match the number of CPU cores:
spark.default.parallelism
spark.sql.shuffle.partitions
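For example, on a cluster with 50 executor cores, a starting point of roughly two tasks per core would be the following; the numbers are illustrative, not a universal recommendation:

```
spark.default.parallelism     100
spark.sql.shuffle.partitions  100
```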
2 Improved Filtering
Move filters to the earliest part of your Spark job that you can. Sometimes these filters can be pushed into the data sources themselves, meaning you can avoid reading and working with data that is irrelevant to your end result. Filtering early, ideally at the data source, spares you from ever touching useless data.
3 Repartitioning and Coalescing
Repartitioning comes with a shuffle.
1 coalesce
If you're reducing the number of overall partitions in a DataFrame or RDD, first try the coalesce method, which will not perform a shuffle but rather merges partitions on the same node into one partition.
2 repartition
The slower repartition method will also shuffle data across the network to achieve even load balancing. Repartitioning can be particularly helpful when performing joins or prior to a cache call.
3 custom partitioning
For finer control over how data is distributed, you can apply a custom partitioning rule at the RDD level.
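A sketch of the three approaches; it assumes a live Spark session, a DataFrame `df`, and a key-value RDD `pair_rdd`, all placeholders:

```python
from pyspark.sql import functions as F

shrunk = df.coalesce(5)                   # merge down to 5 partitions, no shuffle
balanced = df.repartition(200)            # full shuffle into 200 even partitions
by_col = df.repartition(F.col("date"))    # shuffle by a column, e.g. before a join

# Custom partitioning is an RDD-level feature: supply your own partition function.
custom = pair_rdd.partitionBy(10, lambda key: hash(key) % 10)
```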
4 User-Defined Functions (UDFs)
Avoid UDFs where you can and prefer the Structured APIs. UDFs are expensive because they force representing data as objects in the JVM, and sometimes do this multiple times per record in a query. You should try to use the Structured APIs as much as possible to perform your manipulations, simply because they will perform the transformations much more efficiently than you can in a high-level language.
5 Temporary Data Storage (Caching)
Frequently reused datasets can be cached to avoid reading them repeatedly, but caching has its own cost in serialization, deserialization, and storage; a dataset that is not reused often is not worth caching.
Caching is a lazy operation, meaning that things will be cached only as they are accessed.
RDD caching
An RDD cache stores the actual physical data (the bits). When this data is accessed again, Spark returns the proper data. This is done through the RDD reference.
Structured API caching
A Structured API cache is based on the physical plan. This means that we effectively store the physical plan as our key (as opposed to the object reference) and perform a lookup prior to the execution of a Structured job. This can cause confusion, because sometimes you might expect to access raw data but, since someone else already cached the data, you're actually accessing their cached version.
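A short sketch of lazy caching; it assumes a live SparkSession `spark`, and the path and column name are placeholders:

```python
df = spark.read.parquet("/data/flights")
df.cache()                               # lazy: only marks df for caching
df.count()                               # first action materializes the cache
df.groupBy("dest").count().collect()     # subsequent jobs reuse the cached data
```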
6 Joins
7 Aggregations
8 Broadcast Variables