Spark Performance Tuning: Combined Tuning at the Configuration and Code Levels

 

Preface

Before You Tune

General Areas for Tuning

1 Indirect Performance Enhancements
  1 Design Choices
    1 Scala Versus Java Versus Python Versus R
    2 DataFrames Versus SQL Versus Datasets Versus RDDs
  2 Object Serialization in RDDs
  3 Cluster Configurations
    1 Cluster/Application sizing and sharing
    2 Dynamic allocation
  4 Scheduling
  5 Data at Rest
    1 File-based long-term data storage
    2 Splittable file types and compression
    3 Table partitioning
    4 Bucketing
    5 The number of files
    6 Data locality
    7 Statistics collection
  6 Shuffle Configurations
  7 Memory Pressure and Garbage Collection
    1 When does memory pressure occur?
    2 Measuring the impact of garbage collection
    3 Garbage collection tuning

2 Direct Performance Enhancements
  1 Parallelism
    How to set it
  2 Improved Filtering
  3 Repartitioning and Coalescing
    1 coalesce
    2 repartition
    3 custom partitioning
  4 User-Defined Functions (UDFs)
  5 Temporary Data Storage (Caching)
    RDD cache
    Structured API cache
  6 Joins
  7 Aggregations
  8 Broadcast Variables

Preface

 

For instance, if you had an extremely fast network, that would make many of your Spark jobs faster because shuffles are so often one of the costlier steps in a Spark job. Most likely, you won’t have much ability to control such things; therefore, we’re going to discuss the things you can control through code choices or configuration.

In other words, with sufficiently good hardware Spark will run well, but that is largely outside our control; the focus here is on the things we can control.

Before You Tune

One of the best things you can do to figure out how to improve performance is to implement good monitoring and job history tracking. Without this information, it can be difficult to know whether you’re really improving job performance.

General Areas for Tuning

 

  • Code-level design choices (e.g., RDDs versus DataFrames)
  • Data at rest
  • Joins
  • Aggregations
  • Data in flight
  • Individual application properties
  • Inside of the Java Virtual Machine (JVM) of an executor
  • Worker nodes
  • Cluster and deployment properties

 

We can either do so indirectly by setting configuration values or changing the runtime environment. These should improve things across Spark Applications or across Spark jobs. Alternatively, we can try to directly change execution characteristics or design choices at the individual Spark job, stage, or task level. These kinds of fixes are very specific to that one area of our application and therefore have limited overall impact.

That is, we can tune indirectly by adjusting Spark configuration so the application runs better, or directly by changing the execution characteristics and design of individual Spark jobs.

 

1 Indirect Performance Enhancements

We’ll skip the obvious ones like “improve your hardware” and focus more on the things within your control.

1 Design Choices

Good design choices make the application run more stably and cope better with external changes.

1 Scala Versus Java Versus Python Versus R

A case where R fits well: for instance, if you want to perform some single-node machine learning after performing a large ETL job, we might recommend running your Extract, Transform, and Load (ETL) code as SparkR code and then using R’s massive machine learning ecosystem to run your single-node machine learning algorithms.

2 DataFrames Versus SQL Versus Datasets Versus RDDs

Across all languages, DataFrames, Datasets, and SQL are equivalent in speed.

DataFrame, SQL, and Dataset code all compile down to RDDs, and Spark optimizes that execution.

If you develop directly with RDDs, Scala or Java is recommended.

2 Object Serialization in RDDs

Kryo serialization is more compact and efficient than Java serialization, but it takes more setup: you need to set the spark.kryo.classesToRegister configuration, and the classes you intend to serialize must be registered, for example via:

conf.registerKryoClasses()
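A minimal sketch of what this might look like in Scala (MyClass1 and MyClass2 are hypothetical application classes):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kryo-example")
  // Switch from the default Java serialization to Kryo.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Register the application classes that Kryo will serialize.
  .registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
```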

3 Cluster Configurations

1 Cluster/Application sizing and sharing

There are a lot of options for how you want to share resources at the cluster level or at the application level.

2 Dynamic allocation

This means that your application can give resources back to the cluster if they are no longer used, and request them again later when there is demand. To enable it, set spark.dynamicAllocation.enabled to true.
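A configuration sketch, assuming a cluster where the external shuffle service is available (the executor bounds are purely illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  // The external shuffle service is typically required so executors can be
  // removed without losing the shuffle files they wrote.
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "20")
```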

4 Scheduling

Set spark.scheduler.mode to FAIR to allow better sharing of resources across multiple users, or set --max-executor-cores, which specifies the maximum number of executor cores that your application will need. Specifying this value can ensure that your application does not take up all the resources on the cluster.
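A minimal sketch of turning on the FAIR scheduler from application configuration (defining scheduler pools beyond this is optional):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.scheduler.mode", "FAIR")
```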

5 Data at Rest

Making sure that you’re storing your data for effective reads later on is absolutely essential to successful big data projects.

1 File-based long-term data storage

One of the easiest ways to optimize your Spark jobs is to follow best practices when storing data and choose the most efficient storage format possible. The most efficient file format you can generally choose is Apache Parquet.
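A short sketch of writing a DataFrame as Parquet, assuming an existing DataFrame `df` (the path is hypothetical):

```scala
// Write in a columnar, compressed, splittable format.
df.write
  .mode("overwrite")
  .parquet("/data/events_parquet")
```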

 

2 Splittable file types and compression

Whatever storage you use, the files should be splittable so that different tasks can read different parts of the data in parallel.

Compression formats:

ZIP and TAR archives do not support splitting: if 10 files are packed into one ZIP file and you have 10 cores, only one core can read the data, because the archive cannot be read in parallel.

Formats such as gzip, bzip2, and lz4 are generally splittable, provided they were written by a parallel framework such as Hadoop or Spark (each partition is written as its own file).

For your own input data, the simplest way to make it splittable is to upload it as separate files, ideally each no larger than a few hundred megabytes.

3 Table partitioning

Partition the table by columns that are frequently filtered on, such as date or customerId; this lets queries skip data they do not need.

The downside of partitioning is that it can produce a large number of small files.
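A sketch of writing a partitioned table (the column and path names are hypothetical):

```scala
df.write
  .partitionBy("date")                 // one directory per distinct date value
  .parquet("/data/orders_partitioned")
```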

4 Bucketing

Bucketing your data allows Spark to “pre-partition” data according to how joins or aggregations are likely to be performed by readers.

This improves performance and stability because the data is spread evenly across partitions instead of being skewed into one or two of them.

This can help prevent a shuffle before a join and therefore help speed up data access.
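A sketch of bucketing on a join key (the table name, column, and bucket count are hypothetical; bucketing requires saving as a named table):

```scala
df.write
  .bucketBy(48, "customer_id")   // pre-partition by the expected join key
  .sortBy("customer_id")
  .saveAsTable("orders_bucketed")
```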

5 The number of files

Controlling this when writing data:

One way of controlling data partitioning when you write your data is through a write option introduced in Spark 2.2. To control how many records go into each file, you can specify the maxRecordsPerFile option to the write operation.
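A sketch using that option (the limit and path are illustrative):

```scala
df.write
  .option("maxRecordsPerFile", 5000)   // at most 5,000 records per output file
  .parquet("/data/limited_files")
```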

6 Data locality

7 Statistics collection

Collect statistics on tables or columns.

There are two kinds of statistics: table-level and column-level statistics. Statistics collection is available only on named tables, not on arbitrary DataFrames or RDDs.
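A sketch of collecting both kinds of statistics with SQL, assuming an existing SparkSession `spark` (the table and column names are hypothetical):

```scala
// Table-level statistics.
spark.sql("ANALYZE TABLE flights COMPUTE STATISTICS")
// Column-level statistics for selected columns.
spark.sql("ANALYZE TABLE flights COMPUTE STATISTICS FOR COLUMNS origin, dest")
```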

6 Shuffle Configurations

Configuring Spark’s external shuffle service (discussed in Chapters 16 and 17) can often increase performance because it allows nodes to read shuffle data from remote machines even when the executors on those machines are busy (e.g., with garbage collection).
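A minimal sketch of the relevant application-side setting (the shuffle service itself must also be running on the cluster nodes):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
```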

 

7 Memory Pressure and Garbage Collection

1 When does memory pressure occur?

Memory pressure occurs when a Spark application uses too much memory at runtime, when garbage collection runs frequently, or when execution slows down because the JVM is holding many objects that are no longer needed but have not yet been collected. Prefer the Structured APIs where possible: they not only improve performance but also reduce memory pressure, because they rely far less on JVM objects.

2 Measuring the impact of garbage collection

The first step in garbage collection tuning is to gather statistics on how frequently collections occur and how long each one takes.

This can be done by adding the following JVM options:

spark.executor.extraJavaOptions  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

After setting this and running the Spark job, a message is printed to the logs each time a garbage collection occurs.

The messages appear in the stdout files on the worker nodes, not on the driver.
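A sketch of applying these options programmatically when building the application's configuration (the same value can instead be placed in spark-defaults.conf):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
```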

3 Garbage collection tuning

To tune garbage collection, you first need to understand JVM memory management.

1 JVM memory management

The Java heap is divided into two regions: Young and Old.

  • Java heap space is divided into two regions: Young and Old. The Young generation is meant to hold short-lived objects whereas the Old generation is intended for objects with longer lifetimes.
  • The Young generation is further divided into three regions: Eden, Survivor1, and Survivor2

2 JVM garbage collection steps

  1. When Eden is full, a minor garbage collection is run on Eden and objects that are alive from Eden and Survivor1 are copied to Survivor2
  2. The Survivor regions are swapped.
  3. If an object is old enough or if Survivor2 is full, that object is moved to Old.
  4. Finally, when Old is close to full, a full garbage collection is invoked. This involves tracing through all the objects on the heap, deleting the unreferenced ones, and moving the others to fill up unused space, so it is generally the slowest garbage collection operation.

3 The goal of garbage collection tuning

The aim is to ensure that only long-lived cached datasets are stored in the Old generation, while the Young generation stays large enough to hold short-lived objects; this helps avoid full garbage collections.

The goal of garbage collection tuning in Spark is to ensure that only long-lived cached datasets are stored in the Old generation and that the Young generation is sufficiently sized to store all short-lived objects. This will help avoid full garbage collections to collect temporary objects created during task execution.

4 Practical garbage collection tuning

  1. If you observe frequent full garbage collections before a task completes, there is not enough memory available for executing tasks; reduce the amount of memory used for caching by lowering spark.memory.fraction (this is based on unified memory management, UnifiedMemoryManager).

A brief overview of unified memory management, which divides memory into three parts. Assume the JVM has 1300 MB in total:

  • 300 MB is reserved memory; this cannot be changed and is used internally by Spark.
  • User memory: (total - 300 MB) * 0.4 = 400 MB.
  • Storage and execution memory: (total - 300 MB) * 0.6 = 600 MB, further divided into storage memory (controlled by spark.memory.storageFraction) and execution memory; these two parts can borrow from each other dynamically (see page 162 of 《SparkSQL內核解析》 for details).

  2. If minor collections occur frequently but major garbage collections are rare, consider giving more memory to the Eden region; for example, if Eden is estimated to need E, increase it to roughly 4/3 E.
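A sketch of the kind of settings the first adjustment maps to (the values are illustrative, not recommendations):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Shrink the storage+execution share to leave more room for task objects.
  .set("spark.memory.fraction", "0.5")
  // Portion of that share protected for cached data.
  .set("spark.memory.storageFraction", "0.5")
```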

2 Direct Performance Enhancements

1 Parallelism

To speed up a stage, try increasing its degree of parallelism.

The recommendation is roughly 2-3 tasks per CPU core when the stage processes a large amount of data.

How to set it

Adjust the following two parameters to match the number of CPU cores:

spark.default.parallelism

spark.sql.shuffle.partitions
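A sketch of aligning both values with the cluster's core count (the numbers are illustrative, e.g. 100 cores with ~3 tasks per core):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.default.parallelism", "300")     // RDD operations
  .set("spark.sql.shuffle.partitions", "300")  // Structured API shuffles
```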

 

2 Improved Filtering

Move filters to the earliest part of your Spark job that you can. Sometimes, these filters can be pushed into the data sources themselves, and this means that you can avoid reading and working with data that is irrelevant to your end result.

Applying filters early allows unneeded data to be filtered out at the data source, so you avoid reading useless data.
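A sketch of filtering right after the read so the predicate can be pushed down where the source format supports it, assuming an existing SparkSession `spark` (the path and column names are hypothetical):

```scala
import org.apache.spark.sql.functions.col

val flights = spark.read.parquet("/data/flights")
  .filter(col("year") === 2017)        // candidate for predicate pushdown
  .select("origin", "dest", "delay")   // column pruning: read only what is needed
```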

3 Repartitioning and Coalescing

A repartition is accompanied by a shuffle.

1 coalesce

If you’re reducing the number of overall partitions in a DataFrame or RDD, first try the coalesce method, which will not perform a shuffle but rather merge partitions on the same node into one partition.
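For example (the partition count is illustrative):

```scala
// Reduce to 10 partitions without a full shuffle.
val fewer = df.coalesce(10)
```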

2 repartition

The slower repartition method will also shuffle data across the network to achieve even load balancing. Repartitions can be particularly helpful when performing joins or prior to a cache call.
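For example, repartitioning by a join key before a join (the column name is hypothetical):

```scala
import org.apache.spark.sql.functions.col

// Full shuffle that co-locates rows with the same customer_id.
val byCustomer = df.repartition(200, col("customer_id"))
```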

3 custom partitioning

If you need finer control over how data is partitioned, you can define custom partitioning rules at the RDD level.
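A sketch of a custom Partitioner, assuming an existing SparkContext `sc` (the key scheme and partition count are hypothetical):

```scala
import org.apache.spark.Partitioner

// Route keys to partitions by their first character.
class FirstLetterPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int =
    math.abs(key.toString.headOption.getOrElse(' ').toInt) % parts
}

val pairs = sc.parallelize(Seq("alpha" -> 1, "beta" -> 2, "gamma" -> 3))
val custom = pairs.partitionBy(new FirstLetterPartitioner(8))
```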

4 User-Defined Functions (UDFs)

Avoid UDFs where possible and prefer the Structured APIs.

UDFs are expensive because they force representing data as objects in the JVM and sometimes do this multiple times per record in a query. You should try to use the Structured APIs as much as possible to perform your manipulations simply because they are going to perform the transformations in a much more efficient manner than you can do in a high-level language.
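A sketch contrasting a UDF with the equivalent built-in function (the column name is hypothetical):

```scala
import org.apache.spark.sql.functions.{col, udf, upper}

// Expensive: every value is deserialized into a JVM object for the lambda.
val toUpperUdf = udf((s: String) => s.toUpperCase)
val withUdf = df.select(toUpperUdf(col("name")))

// Preferred: the built-in expression stays inside the optimized execution engine.
val withBuiltin = df.select(upper(col("name")))
```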

5 Temporary Data Storage (Caching)

Datasets that are reused frequently can be cached to avoid reading them repeatedly. The cost of caching is serialization and deserialization, so there is no need to cache a Dataset that is not reused often.

Caching is a lazy operation, meaning that things will be cached only as they are accessed.
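A short usage sketch, assuming an existing SparkSession `spark` (the path and column are hypothetical):

```scala
val trips = spark.read.parquet("/data/trips")
trips.cache()                          // lazy: nothing is stored yet
trips.count()                          // first action materializes the cache
trips.groupBy("city").count().show()   // later jobs reuse the cached data
```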

RDD cache

An RDD cache stores the actual, physical data (the bits). When this data is accessed again, Spark looks it up through the RDD reference and returns the proper data.

Structured API Cache

A Structured API cache is based on the physical plan. This means that we effectively store the physical plan as our key (as opposed to the object reference) and perform a lookup prior to the execution of a Structured job. This can cause confusion because sometimes you might be expecting to access raw data but, because someone else already cached the data, you’re actually accessing their cached version.

The confusing part: if someone else has already cached the data, what you access is their cached version rather than the raw data.

6 Joins

 

7 Aggregations

 

8 Broadcast Variables

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 
