Contents
- Indirect Performance Enhancements
  - Scala Versus Java Versus Python Versus R
  - DataFrame Versus SQL Versus Datasets Versus RDDs
  - Object Serialization in RDDs
  - Cluster/Application Sizing and Sharing
  - File-based Long-term Data Storage
  - Splittable File Types and Compression
  - Memory Pressure and Garbage Collection
  - Measuring the Impact of Garbage Collection
- Direct Performance Enhancements
  - Repartitioning and Coalescing
  - Temporary Data Storage (Caching)
Preface
For instance, if you had an extremely fast network, many of your Spark jobs would run faster, because shuffles are so often one of the costlier steps in a Spark job. Most likely, you won't have much ability to control such things; therefore, we're going to discuss the things you can control through code choices or configuration.
In other words, with good enough hardware Spark will run well, but hardware is usually outside our control, so we focus on the factors we can control.
Before Tuning
One of the best things you can do to figure out how to improve performance is to
implement good monitoring and job history tracking. Without this information, it
can be difficult to know whether you're really improving job performance.
General Directions for Tuning
- Code-level design choices (e.g., RDDs versus DataFrames)
- Data at rest
- Joins
- Aggregations
- Data in flight
- Individual application properties
- Inside of the Java Virtual Machine (JVM) of an executor
- Worker nodes
- Cluster and deployment properties
We can either do so indirectly, by setting configuration values or changing the runtime environment; these changes should improve things across Spark applications or across Spark jobs. Alternatively, we can try to directly change execution characteristics or design choices at the individual Spark job, stage, or task level. These kinds of fixes are very specific to that one area of our application and therefore have limited overall impact.
In short: tune indirectly through Spark configuration to make the program run better, or tune directly by changing execution characteristics and the design of individual Spark jobs.
1 Indirect Performance Enhancements
We'll skip the obvious ones like "improve your hardware" and focus more on the things within your control.
1 Design Choices
Good design choices help a program run more stably and cope better with external change.
1 Scala Versus Java Versus Python Versus R
適合R語言:For instance, if you want to perform some singlenode machine learning after performing a large ETL job, we might recommend running your Extract, Transform, and Load (ETL) code as SparkR code and then
using R’s massive machine learning ecosystem to run your single-node machine learning algorithms
2 DataFrames Versus SQL Versus Datasets Versus RDDs
Across all languages, DataFrames, Datasets, and SQL are equivalent in speed: DataFrame, SQL, and Dataset code all compile down to RDDs, and Spark optimizes how they execute.
If you develop directly against RDDs, Scala or Java is recommended.
2 Object Serialization in RDDs
Kryo serialization is more compact and efficient than Java serialization, but it takes more work: you must set the spark.kryo.classesToRegister configuration and register the classes you intend to serialize via conf.registerKryoClasses().
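As a sketch, the equivalent spark-defaults.conf entries might look like this; the class names are placeholders for your own classes:

```
spark.serializer               org.apache.spark.serializer.KryoSerializer
spark.kryo.classesToRegister   com.example.CustomerRecord,com.example.Order
```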
3 Cluster Configurations
1 Cluster/Application Sizing and Sharing
There are a lot of options for how you want to share resources at the cluster level or at the application level.
2 Dynamic Allocation
This means that your application can give resources back to the cluster if they are no longer used, and request them again later when there is demand. To enable it, set spark.dynamicAllocation.enabled to true.
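A minimal spark-defaults.conf sketch for dynamic allocation; the executor counts and the idle timeout are illustrative values, and note that dynamic allocation also depends on the external shuffle service:

```
spark.dynamicAllocation.enabled              true
spark.dynamicAllocation.minExecutors         1
spark.dynamicAllocation.maxExecutors         20
spark.dynamicAllocation.executorIdleTimeout  60s
spark.shuffle.service.enabled                true
```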
4 Scheduling
Set spark.scheduler.mode to FAIR to allow better sharing of resources across multiple users, or set --max-executor-cores, which specifies the maximum number of executor cores that your application will need. Specifying this value can ensure that your application does not take up all the resources on the cluster.
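As a sketch, fair scheduling plus a cap on cores might be configured as follows; spark.cores.max is the configuration key that caps an application's total cores on standalone and Mesos clusters, and the value here is illustrative:

```
spark.scheduler.mode  FAIR
spark.cores.max       24
```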
5 Data at Rest
Making sure that you’re storing your data for effective reads later on is
absolutely essential to successful big data projects.
1 File-based Long-term Data Storage
One of the easiest ways to optimize your Spark jobs is to follow best practices when storing data and choose the most efficient storage format possible. The most efficient file format you can generally choose is Apache Parquet.
2 Splittable File Types and Compression
Whatever storage you use, make sure the files are splittable, so that different tasks can read the data in parallel.
Compression formats:
ZIP and TAR archives cannot be split, which means that if you have 10 files inside one ZIP file and 10 cores, only one core can read that data, because the archive cannot be read in parallel.
Files compressed with gzip, bzip2, or lz4 are generally splittable if they were written by a parallel processing framework like Hadoop or Spark.
For your own input data, the simplest way to make it splittable is to upload it as separate files, ideally each no larger than a few hundred megabytes.
3 Table partitioning
Partition tables by frequently used columns, such as date or customerID, so that unnecessary reads can be pruned.
The downside of partitioning is that it can produce many small files.
4 Bucketing
Bucketing your data allows Spark to "pre-partition" data according to how joins or aggregations are likely to be performed by readers. This improves performance and stability because the data can be spread evenly across partitions instead of skewing into one or two. It can also help prevent a shuffle before a join and therefore speed up data access.
5 The number of files
How data is written:
One way of controlling data partitioning when you write your data is through a write option introduced in Spark 2.2. To control how many records go into each file, you can specify the maxRecordsPerFile option on the write operation.
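A minimal sketch of that option; it assumes an existing DataFrame `df` in a live Spark session, and the output path is a placeholder:

```python
# Limit each output file to at most 5,000 records (Spark >= 2.2).
df.write.option("maxRecordsPerFile", 5000).parquet("/data/flights-out")
```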
6 Data locality
Data locality means scheduling tasks on the nodes that already hold the data they need, avoiding data transfer over the network.
7 Statistics collection
Collect statistics on tables or columns. There are two kinds of statistics: table-level and column-level statistics. Statistics collection is available only on named tables, not on arbitrary DataFrames or RDDs.
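In Spark SQL, these statistics are gathered with the ANALYZE TABLE statement; the table and column names below are placeholders:

```sql
-- Table-level statistics
ANALYZE TABLE flights COMPUTE STATISTICS;
-- Column-level statistics
ANALYZE TABLE flights COMPUTE STATISTICS FOR COLUMNS origin, dest;
```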
6 Shuffle Configurations
Configuring Spark’s external shuffle service (discussed in Chapters 16 and 17)
can often increase performance because it allows nodes to read shuffle data from
remote machines even when the executors on those machines are busy (e.g., with
garbage collection).
7 Memory Pressure and Garbage Collection
1 When does memory pressure occur?
When a Spark application takes up too much memory at runtime, when garbage collection runs too frequently, or when execution slows down because too many JVM objects are no longer in use but have not yet been collected. Prefer the Structured APIs: they not only improve performance but also ease memory pressure, because they reduce the reliance on JVM objects.
2 Measuring the impact of garbage collection
The first step in GC tuning is to collect statistics on how frequently garbage collection occurs and how much time each collection takes. You can do this by adding the following to spark.executor.extraJavaOptions:
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
After setting this, the next time the Spark application runs, a message is printed to the worker's logs whenever a garbage collection occurs: the messages appear in the stdout files on the worker nodes, not on the driver.
3 Garbage collection tuning
To tune GC, you first need to understand JVM memory management.
1 JVM memory management
- Java heap space is divided into two regions: Young and Old. The Young generation is meant to hold short-lived objects whereas the Old generation is intended for objects with longer lifetimes.
- The Young generation is further divided into three regions: Eden, Survivor1, and Survivor2
2 JVM garbage collection steps
- When Eden is full, a minor garbage collection is run on Eden and objects that are alive from Eden and Survivor1 are copied to Survivor2
- The Survivor regions are swapped.
- If an object is old enough or if Survivor2 is full, that object is moved to Old.
- Finally, when Old is close to full, a full garbage collection is invoked. This involves tracing through all the objects on the heap, deleting the unreferenced ones, and moving the others to fill up unused space, so it is generally the slowest garbage collection operation.
3 The goal of garbage collection tuning
The goal of garbage collection tuning in Spark is to ensure that only long-lived cached datasets are stored in the Old generation and that the Young generation is sufficiently sized to store all short-lived objects. This will help avoid full garbage collections that collect temporary objects created during task execution.
4 Applying GC tuning
- If full garbage collection runs frequently before a task completes, there is not enough memory for executing tasks; reduce the amount of memory used for caching by lowering spark.memory.fraction (part of the UnifiedMemoryManager).
A quick overview of unified memory management, which divides memory into three parts (assume a 1300 MB JVM heap):
- 300 MB of reserved memory, which cannot be changed and is used internally by Spark
- User memory: (total − 300 MB) × 0.4 = 400 MB
- Storage and execution memory: (total − 300 MB) × 0.6 = 600 MB, further divided by spark.memory.storageFraction into a storage region and an execution region that can borrow from each other (see SparkSQL 內核解析, p. 162, for details)
- If minor collections are frequent but major garbage collections are rare, consider enlarging the Eden region; for example, if Eden's planned size is E, scale it up to 4/3 E.
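The unified-memory arithmetic above can be sketched as follows; the 1300 MB heap and the default fractions (0.6 for spark.memory.fraction, 0.5 for spark.memory.storageFraction) are the assumptions from these notes, not values read from a live cluster:

```python
# Sketch of UnifiedMemoryManager sizing, reproducing the 1300 MB example.
RESERVED_MB = 300  # fixed reserved memory for Spark internals

def unified_memory_regions(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Return (user_mb, unified_mb, storage_mb) for a given JVM heap size."""
    usable = heap_mb - RESERVED_MB
    unified = usable * memory_fraction        # shared execution + storage pool
    user = usable * (1 - memory_fraction)     # user data structures
    storage = unified * storage_fraction      # storage region (can borrow/lend)
    return user, unified, storage

user, unified, storage = unified_memory_regions(1300)
print(user, unified, storage)  # roughly 400, 600, and 300 MB
```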
2 Direct Performance Enhancements
1 Parallelism
To make a stage run faster, increase its parallelism. If the stage processes a large amount of data, the recommendation is 2-3 tasks per CPU core.
To set this, tune the following two parameters to match the number of CPU cores:
spark.default.parallelism
spark.sql.shuffle.partitions
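For example, on a cluster with 50 executor cores, a starting point of roughly two tasks per core would be the following; the numbers are illustrative, not a universal recommendation:

```
spark.default.parallelism     100
spark.sql.shuffle.partitions  100
```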
2 Improved Filtering
Move filters to the earliest part of your Spark job that you can. Sometimes these filters can be pushed into the data sources themselves, meaning you can avoid reading and working with data that is irrelevant to your end result. Filtering early, ideally at the data source, spares you from ever touching useless data.
3 Repartitioning and Coalescing
Repartitioning comes with a shuffle.
1 coalesce
If you're reducing the number of overall partitions in a DataFrame or RDD, first try the coalesce method, which will not perform a shuffle but rather merges partitions on the same node into one partition.
2 repartition
The slower repartition method will also shuffle data across the network to achieve even load balancing. Repartitioning can be particularly helpful when performing joins or prior to a cache call.
3 custom partitioning
For finer control over how data is distributed, you can apply a custom partitioning rule at the RDD level.
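A sketch of the three approaches; it assumes a live Spark session, a DataFrame `df`, and a key-value RDD `pair_rdd`, all placeholders:

```python
from pyspark.sql import functions as F

shrunk = df.coalesce(5)                   # merge down to 5 partitions, no shuffle
balanced = df.repartition(200)            # full shuffle into 200 even partitions
by_col = df.repartition(F.col("date"))    # shuffle by a column, e.g. before a join

# Custom partitioning is an RDD-level feature: supply your own partition function.
custom = pair_rdd.partitionBy(10, lambda key: hash(key) % 10)
```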
4 User-Defined Functions (UDFs)
Avoid UDFs where you can and prefer the Structured APIs. UDFs are expensive because they force representing data as objects in the JVM, and sometimes do this multiple times per record in a query. You should try to use the Structured APIs as much as possible to perform your manipulations, simply because they will perform the transformations much more efficiently than you can in a high-level language.
5 Temporary Data Storage (Caching)
Frequently reused datasets can be cached to avoid reading them repeatedly, but caching has its own cost in serialization, deserialization, and storage; a dataset that is not reused often is not worth caching.
Caching is a lazy operation, meaning that things will be cached only as they are accessed.
RDD caching
An RDD cache stores the actual physical data (the bits). When this data is accessed again, Spark returns the proper data. This is done through the RDD reference.
Structured API caching
A Structured API cache is based on the physical plan. This means that we effectively store the physical plan as our key (as opposed to the object reference) and perform a lookup prior to the execution of a Structured job. This can cause confusion, because sometimes you might expect to access raw data but, since someone else already cached the data, you're actually accessing their cached version.
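A short sketch of lazy caching; it assumes a live SparkSession `spark`, and the path and column name are placeholders:

```python
df = spark.read.parquet("/data/flights")
df.cache()                               # lazy: only marks df for caching
df.count()                               # first action materializes the cache
df.groupBy("dest").count().collect()     # subsequent jobs reuse the cached data
```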
6 Joins
7 Aggregations
8 Broadcast Variables