Speed comparison of the same SQL query implemented in Spark with RDD, DataFrame, and Spark SQL (measured)

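I decided to run a rather boring experiment. As is well known, data analysis in Spark today is generally done in one of three ways: distributed programming with the RDD API, the DataFrame API, or SQL statements executed through Spark SQL. Leaving data loading time aside, which of them actually runs fastest?

I won't cover the principles of RDDs and DataFrames or the differences between them here; see the article "RDD DataFrame Dataset 三者的優缺點 , 三者之間的創建 , 以及相互轉換" (pros and cons of RDD, DataFrame, and Dataset, how to create each, and how to convert between them).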

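Dataset

The dataset used in this article is a large traffic and travel dataset from a ride-hailing company in a certain city. At the publisher's request, I will not redistribute it. Interested readers can view and download it from "海口市-交通流量時空演變特徵可視分析" (Haikou: visual analysis of the spatio-temporal evolution of traffic flow).

The dataset contains the platform's daily order data for the city from May 1 to October 31, 2017, including the longitude and latitude of each order's origin and destination, as well as order attributes such as order type, travel category, and number of passengers.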

Loading the data

First, load the local data file. Since this is only a test, the Spark application naturally runs in local (multi-threaded) mode.

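// sc is the SparkContext of the running application
val path = "data/dwv_order_make_1.txt"  // data file, relative to the current folder
val textRdd = sc.textFile(path, 2)      // RDD of lines, with at least 2 partitions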

Building a DataFrame from a schema and Rows

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Schema of the order table; the trailing comments give the column index in the raw file.
val structFields = Array(
  StructField("order_id", StringType, nullable = false),          // 0
  StructField("product_id", IntegerType, nullable = true),        // 1
  StructField("city_id", StringType, nullable = true),            // 2
  StructField("district", StringType, nullable = true),           // 3
  StructField("county", StringType, nullable = true),             // 4
  StructField("type", IntegerType, nullable = true),              // 5
  StructField("combo_type", IntegerType, nullable = true),        // 6
  StructField("traffic_type", IntegerType, nullable = true),      // 7
  StructField("passenger_count", IntegerType, nullable = true),   // 8
  StructField("driver_product_id", StringType, nullable = true),  // 9
  StructField("start_dest_distance", FloatType, nullable = true), // 10
  StructField("arrive_time", StringType, nullable = true),        // 11
  StructField("departure_time", StringType, nullable = true),     // 12
  StructField("others", StringType, nullable = true)              // 13
)
val structType = StructType(structFields) // build the schema
// Split each tab-separated line and convert the fields to the types declared in the schema.
val orderRdd = textRdd.map(_.split("\t")).map(x => Row(x(0), x(1).toInt, x(2), x(3), x(4), x(5).toInt,
  x(6).toInt, x(7).toInt, x(8).toInt, x(9), x(10).toFloat, x(11), x(12), x(13)))
val df = spark.createDataFrame(orderRdd, structType)
df.show

The output of this code is as follows:

+--------------+----------+-------+--------+------+----+----------+------------+---------------+-----------------+-------------------+-------------------+-------------------+------+
|      order_id|product_id|city_id|district|county|type|combo_type|traffic_type|passenger_count|driver_product_id|start_dest_distance|        arrive_time|     departure_time|others|
+--------------+----------+-------+--------+------+----+----------+------------+---------------+-----------------+-------------------+-------------------+-------------------+------+
|17592725309572|         3|     83|    0898|460107|   0|         0|           0|              0|                3|             5770.0|2017-05-19 11:04:02|2017-05-19 11:01:35|    16|
|17592725855561|         3|     83|    0898|460106|   0|         0|           0|              0|                3|             3839.0|2017-05-19 11:35:04|2017-05-19 11:31:29|    11|
|17592726010947|         3|     83|    0898|460106|   0|         0|           0|              0|                3|             6780.0|2017-05-19 11:51:08|2017-05-19 11:47:13|    17|
+--------------+----------+-------+--------+------+----+----------+------------+---------------+-----------------+-------------------+-------------------+-------------------+------+

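Spark SQL speed test

I picked several commonly used SQL clauses to raise the computational complexity and better match real business workloads.

The query selects the order IDs of orders that fall within a given two-minute window, have county code '460107', and are carpool orders (or have no order status), sorted by the initial pickup distance (start_dest_distance).

Assuming df has been registered as a temporary view named orders (for example with df.createOrReplaceTempView("orders")), the SQL statement is as follows: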

SELECT orders.order_id FROM orders
WHERE orders.departure_time >= '2017-05-19 17:41:00'
  AND orders.departure_time <= '2017-05-19 17:43:00'
  AND orders.county = '460107'
  AND (orders.traffic_type = 0 OR orders.traffic_type = 4)
ORDER BY orders.start_dest_distance
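In SQL, AND binds more tightly than OR, so how the two traffic_type alternatives are grouped decides which rows survive the filter. The sketch below shows the difference in plain Scala, where && and || follow the same precedence. The Order case class, the object name Precedence, and the three sample records are hypothetical and exist only for illustration; they are not rows from the dataset.

```scala
object Precedence {
  // Hypothetical, simplified order record holding only the columns used by the filter.
  final case class Order(id: String, departure: String, county: String, trafficType: Int)

  val from = "2017-05-19 17:41:00"
  val to   = "2017-05-19 17:43:00"

  // No grouping: evaluated as (time AND county AND type = 0) OR type = 4.
  def ungrouped(o: Order): Boolean =
    o.departure >= from && o.departure <= to && o.county == "460107" &&
      o.trafficType == 0 || o.trafficType == 4

  // Grouped: time AND county AND (type = 0 OR type = 4).
  def grouped(o: Order): Boolean =
    o.departure >= from && o.departure <= to && o.county == "460107" &&
      (o.trafficType == 0 || o.trafficType == 4)

  // Made-up records: "b" lies outside the time window and the county.
  val sample: Seq[Order] = Seq(
    Order("a", "2017-05-19 17:42:00", "460107", 0),
    Order("b", "2017-05-19 09:00:00", "460106", 4),
    Order("c", "2017-05-19 17:42:30", "460107", 4)
  )

  def main(args: Array[String]): Unit = {
    println(sample.filter(ungrouped).map(_.id)) // List(a, b, c)
    println(sample.filter(grouped).map(_.id))   // List(a, c)
  }
}
```

Only the grouped form restricts traffic_type 4 orders to the time window and county, which is what the chained where calls of the DataFrame version and the chained filter calls of the RDD version do.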

The result is as follows (only the first five rows are kept):

+--------------+
|      order_id|
+--------------+
|17592733902055|
|17592733986620|
|17592733839582|
|17592734004880|
|17592733964943|

Parsing and executing this SQL statement with Spark SQL took 9610 ms in total.
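The measurement code itself is not shown in this post. A minimal sketch of one way to take such a reading, assuming the figure is wall-clock time around the action that triggers execution (show or collect): the helper timed and the object name Timing are hypothetical and not part of Spark. Because Spark transformations are lazy, the timer has to wrap the action, not just the chain of transformations.

```scala
object Timing {
  /** Runs `block` once and returns its result together with the elapsed
    * wall-clock time in milliseconds. */
  def timed[A](block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload; in the experiment this would be the Spark action,
    // e.g. the call that ends in show(1000) or collect().
    val (sum, ms) = timed { (1 to 1000000).map(_.toLong).sum }
    println(s"sum=$sum took $ms ms")
  }
}
```

Repeating each measurement a few times and setting aside the first run tends to give steadier numbers, since the first run also pays for JVM warm-up.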

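DataFrame speed test

A DataFrame in Spark SQL is similar to a table in a relational database. Queries that would be run against a table in a relational database can be expressed on a DataFrame by calling its API.

The calling code is as follows: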

df.where("departure_time >='2017-05-19 17:41:00'")
      .where("departure_time <= '2017-05-19 17:43:00'")
      .where("county = '460107'")
      .where("traffic_type = 0 OR traffic_type = 4")
      .orderBy("start_dest_distance")
      .select("order_id")
      .show(1000)

The result is as follows (only the first five rows are kept):

+--------------+
|      order_id|
+--------------+
|17592733902055|
|17592733986620|
|17592733839582|
|17592734004880|
|17592733964943|

As you can see, the result matches that of Spark SQL. Using the API provided by DataFrame took 8771 ms in total.

RDD speed test

Below is an implementation I wrote myself with the RDD API: it builds an RDD of tuples and then filters and sorts it.

    val tRDD = textRdd.map(_.split("\t"))
    // Tuple layout: (departure_time, county, traffic_type, start_dest_distance, order_id)
    val f1RDD = tRDD.map(x => (x(12), x(4), x(7).toInt, x(10).toFloat, x(0)))
    val rRDD = f1RDD.filter(_._1 >= "2017-05-19 17:41:00").filter(_._1 <= "2017-05-19 17:43:00")
      .filter(_._2 == "460107").filter(x => x._3 == 4 || x._3 == 0).sortBy(_._4).map(_._5)
    rRDD.collect().foreach(println)

The result is as follows (only the first five rows are kept); it matches the previous two:

17592733902055
17592733986620
17592733839582
17592734004880
17592733964943

This code took 8925 ms in total to run.

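Summary

To quote a passage from "RDD、DataFrame和DataSet的區別" (the differences between RDD, DataFrame, and DataSet):

The RDD API is functional and emphasizes immutability; in most scenarios it tends to create new objects rather than modify existing ones. While this yields a clean and tidy API, it also means Spark applications tend to create large numbers of temporary objects at runtime, putting pressure on the GC. On top of the existing RDD API, we can certainly use the mapPartitions method to override how data is created within a single RDD partition, and reuse mutable objects to reduce object allocation and GC overhead, but this sacrifices code readability and requires developers to have some understanding of Spark's runtime mechanisms, which raises the barrier to entry. Spark SQL, on the other hand, already reuses objects internally wherever possible. Although this breaks immutability internally, the data is converted back to immutable form before being returned to the user. By developing with the DataFrame API, you get these optimizations for free.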
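To make the mapPartitions remark concrete, here is a minimal sketch of a function with the shape Iterator[T] => Iterator[U] that RDD.mapPartitions accepts. It counts orders per county within one partition by updating a single mutable map, instead of emitting one (county, 1) tuple per line. The object name PartitionCount and the function countByCounty are hypothetical, and the sample lines in main are made up; the county position (column 4) follows the schema defined earlier.

```scala
import scala.collection.mutable

object PartitionCount {
  /** Counts orders per county for the lines of one partition, reusing a single
    * mutable map rather than allocating one tuple per input line. */
  def countByCounty(lines: Iterator[String]): Iterator[(String, Int)] = {
    val counts = mutable.HashMap.empty[String, Int]
    lines.foreach { line =>
      val fields = line.split("\t")
      if (fields.length > 4) { // county is column 4 in the schema
        counts.update(fields(4), counts.getOrElse(fields(4), 0) + 1)
      }
    }
    counts.iterator // one (county, count) pair per distinct county in this partition
  }

  def main(args: Array[String]): Unit = {
    // A local iterator stands in for one partition; the lines are made-up samples.
    val partition = Iterator(
      "1\t3\t83\t0898\t460107",
      "2\t3\t83\t0898\t460107",
      "3\t3\t83\t0898\t460106"
    )
    countByCounty(partition).foreach(println)
  }
}
```

On the RDD from this article, the call would presumably look like textRdd.mapPartitions(PartitionCount.countByCounty).reduceByKey(_ + _), with reduceByKey merging the per-partition counts. The code is noticeably less direct than a map followed by reduceByKey, which is the readability cost the quote describes.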

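The code and the measurements also show that spark.sql is convenient and quick to write, while the DataFrame API was slightly faster in this test (8771 ms versus 9610 ms). Hand-written RDD code (8925 ms) offers no advantage in speed or maintainability unless it is needed to solve a functional problem. For everyday applications that the DataFrame API can cover, I prefer to develop Spark applications with the DataFrame API.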
