Spark DataFrame vs. Dataset: Creating DataFrames and a Detailed Look at DataFrames

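I. Concept Overview

1. Correspondence

Unstructured data, such as log files -----> Datasets

Semi-structured data, such as CSV files -----> DataFrames

Structured data, such as Parquet files -----> SQL tables and views

All of the Structured APIs can be used for both batch and streaming computation.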

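2. DataFrames and Datasets

Spark has two kinds of structured collections:

DataFrames

Datasets

The two differ in subtle ways.

In Spark, both DataFrames and Datasets represent immutable, lazily evaluated plans that specify which operations to apply to a given set of data in order to produce some output.

Tables and views are essentially the same thing as DataFrames; the only difference is that we work with them through SQL rather than DataFrame code.

Chapter 10 discusses Spark SQL in detail.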

3. Schemas

Definition:

A schema defines the column names and types of a DataFrame.

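4. Structured Spark Types

Spark is effectively a programming language of its own. Internally it uses the Catalyst engine, which maintains its own type information throughout planning and execution. In other words, whichever language you develop in, Spark translates your code into Catalyst's internal representation before running it; an operation such as map, for example, exists in many languages.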

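1. DataFrame vs. Dataset

DataFrames: untyped

Calling them untyped is not entirely accurate; the point is that DataFrames have their types checked only at runtime.

To Spark, a DataFrame is simply a Dataset of type Row.

Datasets: typed

Types are checked for correctness at compile time.

Datasets are available only in JVM languages; their types are specified with Scala case classes or Java beans.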

Quoted from the book:

In essence, within the Structured APIs, there are two more APIs, the “untyped” DataFrames and the “typed” Datasets. To say that DataFrames are untyped is slightly inaccurate; they have types, but Spark maintains them completely and only checks whether those types line up to those specified in the schema at runtime. Datasets, on the other hand, check whether types conform to the specification at compile time. Datasets are only available to Java Virtual Machine (JVM)–based languages (Scala and Java) and we specify types with case classes or Java beans.

For the most part, you’re likely to work with DataFrames. To Spark (in Scala), DataFrames are simply Datasets of Type Row. The “Row” type is Spark’s internal representation of its optimized in-memory format for computation. This format makes for highly specialized and efficient computation because rather than using JVM types, which can cause high garbage-collection and object instantiation costs, Spark can operate on its own internal format without incurring any of those costs. To Spark (in Python or R), there is no such thing as a Dataset: everything is a DataFrame and therefore we always operate on that optimized format.
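The compile-time versus runtime distinction described above can be sketched as follows. This is only an illustration: the `Flight` case class, its field names, and the sample values are hypothetical, and the session is a local one created just for the example (in spark-shell, `getOrCreate()` simply returns the existing session).

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, introduced only for this illustration.
case class Flight(dest: String, origin: String, count: Long)

val spark = SparkSession.builder().master("local[*]").appName("typed-vs-untyped").getOrCreate()
import spark.implicits._

// Typed: a Dataset[Flight]; the Scala compiler knows every field and its type.
val ds: Dataset[Flight] = Seq(Flight("United States", "Romania", 15L)).toDS()
val typedCounts: Array[Long] = ds.map(f => f.count + 1).collect()
// ds.map(f => f.cnt) would be rejected by the compiler: Flight has no field `cnt`.

// Untyped: DataFrame is an alias for Dataset[Row]; column names are resolved
// only when Spark analyzes the query at runtime.
val df: DataFrame = ds.toDF()
val untypedCounts: Array[Long] = df.select($"count" + 1).collect().map(_.getLong(0))
// df.select("cnt") compiles, but fails at runtime with an AnalysisException.
```

Both variants compute the same values; what differs is when a mistake such as a misspelled field name is caught.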

Tip

When you use DataFrames, you are taking advantage of Spark's optimized internal format.

Chapter 11 covers this in detail.

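2. Columns

Put simply, think of a Spark Column as a column in a table.

A column can represent a simple type, such as an integer or a string,

a complex type, such as an array or a map, or a null value.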

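3. Rows

A Row represents one record of data; every record in a DataFrame must be of type Row.

Rows can be created from SQL, from RDDs, from data sources, or manually from scratch.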

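4. Spark Types

Spark's internal types map to types in each supported programming language.

See the Java types reference. Note that Spark SQL is continually evolving, so these mappings may change.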
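As a small illustration of this mapping on the Scala side, the sketch below builds a DataFrame from ordinary Scala values and inspects the Spark types that end up in its schema. The column names `i`, `s`, `d`, and `b` are arbitrary choices for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DataType

val spark = SparkSession.builder().master("local[*]").appName("spark-types").getOrCreate()
import spark.implicits._

// Scala Int, String, Double and Boolean values in a tuple ...
val typesDf = Seq((1, "a", 2.0, true)).toDF("i", "s", "d", "b")

// ... are represented by Spark's IntegerType, StringType, DoubleType and BooleanType.
val sparkTypes: Seq[DataType] = typesDf.schema.fields.map(_.dataType).toSeq

typesDf.printSchema()
```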

5. Structured API Execution

Overview

Steps:

1. Write DataFrame/Dataset/SQL Code.

2. If valid code, Spark converts this to a Logical Plan.

3. Spark transforms this Logical Plan to a Physical Plan, checking for optimizations along the way.

4. Spark then executes this Physical Plan (RDD manipulations) on the cluster.

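1. Logical Planning

The first phase of execution turns user code into a logical plan.

Spark uses the catalog to analyze the unresolved logical plan built from the user's code, resolving the tables and columns it refers to; if a reference cannot be resolved, the plan is rejected.

2. Physical Planning

Once the logical plan is ready, Spark generates candidate physical plans with different execution strategies and uses a cost model to pick the most efficient one.

3. Execution

After selecting a physical plan, Spark runs it as RDD operations, applies further optimizations at runtime, and returns the result to the user.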
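The planning phases above can be observed with `explain(true)`, which prints the parsed, analyzed, and optimized logical plans followed by the physical plan. The sketch below uses a tiny in-memory DataFrame as a stand-in for real data and assumes a classic (non-Connect) local session, where an unresolvable column is rejected as soon as the query is built.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import scala.util.Try

val spark = SparkSession.builder().master("local[*]").appName("plans").getOrCreate()
import spark.implicits._

// Small stand-in dataset; the values are illustrative only.
val flights = Seq(("United States", "Romania", 15L), ("United States", "Croatia", 1L))
  .toDF("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")

val query = flights.where($"count" > 2).select("DEST_COUNTRY_NAME")

// Prints the logical plans (parsed, analyzed, optimized) and the physical plan.
query.explain(true)

// Execution happens only when an action such as collect() is called.
val result: Array[String] = query.collect().map(_.getString(0))

// A column the analyzer cannot resolve is rejected before any job runs.
val rejected: Boolean = Try(flights.select("no_such_column")).failed.toOption
  .exists(_.isInstanceOf[AnalysisException])
```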

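II. Hands-on Practice

Now for the hands-on part.

Creating DataFrames

1. Working with Schemas

1. Should the schema be read from the source or defined by hand?

A schema can be taken from the data source or defined explicitly by hand.

Location of the JSON data files:

https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/data/flight-data

2. A schema is a StructType made up of StructFields

Each StructField has a name, a type, and a Boolean flag that indicates whether the column may contain null values.

Spark can also attach metadata to a column; metadata is a way of storing information about that column.

Schemas can contain other StructTypes; Chapter 6 covers this in detail.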
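A minimal sketch of both approaches follows. To stay self-contained it reads two in-memory JSON records instead of the flight-data file; the record values and the `{"hello":"world"}` metadata are purely illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, Metadata, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("schemas").getOrCreate()
import spark.implicits._

// Explicit schema: each StructField has a name, a type, a nullable flag,
// and optionally some column metadata.
val flightSchema = StructType(Array(
  StructField("DEST_COUNTRY_NAME", StringType, nullable = true),
  StructField("ORIGIN_COUNTRY_NAME", StringType, nullable = true),
  StructField("count", LongType, nullable = true,
    metadata = Metadata.fromJson("""{"hello":"world"}"""))
))

// Two in-memory JSON records standing in for the flight-data file.
val jsonLines = Seq(
  """{"DEST_COUNTRY_NAME":"United States","ORIGIN_COUNTRY_NAME":"Romania","count":15}""",
  """{"DEST_COUNTRY_NAME":"United States","ORIGIN_COUNTRY_NAME":"Croatia","count":1}"""
).toDS()

val inferred = spark.read.json(jsonLines)                       // schema inferred from the data
val explicit = spark.read.schema(flightSchema).json(jsonLines)  // schema supplied by hand

inferred.printSchema()
explicit.printSchema()
```

Inference is convenient for exploration, while an explicit schema avoids surprises such as a numeric column being read with a different precision than expected.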

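2. Columns and Expressions

1. Ways to get a column

There are many ways to construct or refer to columns,

but the two simplest are the col() and column() functions.

Scala also offers syntactic sugar for creating a column: $"myColumn"

2. Explicit column references

df.col("count")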
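The different ways of referring to a column can be compared side by side. In this sketch the one-row DataFrame is a hypothetical stand-in; all four references select the same `count` column.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, column}

val spark = SparkSession.builder().master("local[*]").appName("columns").getOrCreate()
import spark.implicits._

val df = Seq(("United States", 15L)).toDF("DEST_COUNTRY_NAME", "count")

val refs = Seq(
  col("count"),     // functions.col
  column("count"),  // functions.column, same meaning as col
  $"count",         // Scala-only shorthand, requires spark.implicits._
  df.col("count")   // explicit reference bound to this particular DataFrame
)

val selected = df.select(refs: _*)
```

The explicit `df.col` form is mainly useful when two DataFrames in a join share a column name.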

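2. Expressions

What is an expression?

An expression is a set of transformations applied to one or more values in a record of a DataFrame.

expr("someCol") is equivalent to col("someCol").

Columns as expressions

In practice:

(((col("someCol") + 5) * 200) - 6) < col("otherCol")

and

expr("(((someCol + 5) * 200) - 6) < otherCol")

have the same effect.

Whether you write DataFrame code or SQL, what ultimately gets executed is the same.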
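A quick way to convince yourself of this equivalence is to filter the same data with both forms and compare the results. The two-row DataFrame below is made up for the purpose.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

val spark = SparkSession.builder().master("local[*]").appName("expressions").getOrCreate()
import spark.implicits._

// Hypothetical data: only the first row satisfies the condition.
val df = Seq((1, 2000), (10, 100)).toDF("someCol", "otherCol")

// The same predicate, built from Column objects and parsed from a string.
val viaColumns = (((col("someCol") + 5) * 200) - 6) < col("otherCol")
val viaExpr = expr("(((someCol + 5) * 200) - 6) < otherCol")

val a = df.where(viaColumns).collect()
val b = df.where(viaExpr).collect()
```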

Getting a DataFrame's columns

Use the columns property:

spark.read.format("json").load("/data/flight-data/json/2015-summary.json")

.columns

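2. Records and Rows

Definition

In Spark, each row in a DataFrame is a single record.

Spark represents a record as an object of type Row and manipulates Row objects using column expressions; internally, a Row object is an array of bytes.

Creating Rows

Only DataFrames have schemas; Rows themselves do not. If you create a Row by hand, you must specify its values in the same order as the fields of the schema.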

// in Scala

import org.apache.spark.sql.Row

val myRow = Row("Hello", null, 1, false)

Accessing data in a Row

// in Scala

myRow(0) // type Any

myRow(0).asInstanceOf[String] // String

myRow.getString(0) // String

myRow.getInt(2) // Int

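3. DataFrame Transformations

The sections above outlined the parts of a DataFrame; next we focus on how to create and manipulate DataFrames.

In summary, we can:

Add rows or columns
Remove rows or columns
Transform a row into a column (or vice versa)
Sort rows by the values in columns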
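Most of the operations in this list map onto a single DataFrame method each. The sketch below shows one possible call per operation on a small made-up dataset; it is not an exhaustive tour, and the row-to-column transformation is left out.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, desc, lit}

val spark = SparkSession.builder().master("local[*]").appName("transformations").getOrCreate()
import spark.implicits._

// Illustrative data only.
val flights = Seq(
  ("United States", "Romania", 15L),
  ("United States", "Croatia", 1L),
  ("Egypt", "United States", 15L)
).toDF("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")

val withOne = flights.withColumn("one", lit(1))            // add a column
val withoutOrigin = withOne.drop("ORIGIN_COUNTRY_NAME")    // remove a column
val moreRows = flights.union(flights)                      // add rows
val busy = flights.where(col("count") > 2)                 // remove rows
val sorted = flights.orderBy(desc("count"), col("DEST_COUNTRY_NAME")) // sort by column values
```

Each call returns a new DataFrame; `flights` itself is never modified.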

1. Creating DataFrames

① DataFrames can be created from data sources, which Chapter 9 covers in detail.

They can also be registered as temporary views:

// in Scala

val df = spark.read.format("json")

.load("/data/flight-data/json/2015-summary.json")

df.createOrReplaceTempView("dfTable")

② From a collection of Rows, converted to a DataFrame

// in Scala

import org.apache.spark.sql.Row

import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType}

val myManualSchema = new StructType(Array(

new StructField("some", StringType, true),

new StructField("col", StringType, true),

new StructField("names", LongType, false)))

val myRows = Seq(Row("Hello", null, 1L))

val myRDD = spark.sparkContext.parallelize(myRows)

val myDf = spark.createDataFrame(myRDD, myManualSchema)

myDf.show()
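For quick experiments, a shorter route that may be convenient is `toDF` on a Scala `Seq`, made available by `spark.implicits._`. The sketch below uses arbitrary values and column names; note that types and nullability are then derived from the Scala types rather than chosen by you.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("toDF").getOrCreate()
import spark.implicits._

// Column types come from the tuple's Scala types: String, Int, Long.
val quickDf = Seq(("Hello", 2, 1L)).toDF("col1", "col2", "col3")

quickDf.printSchema()
```

Because nullability is inferred (primitive columns are non-nullable, String columns are nullable), the explicit Row-plus-schema approach remains the better fit when you need full control.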

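How to work with DataFrames

select
selectExpr
Functions provided by the org.apache.spark.sql.functions package

2. select and selectExpr

These two methods are the DataFrame equivalent of running SQL queries against a table of data.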

Because select followed by a series of expr is such a common pattern, Spark has a shorthand for doing this efficiently: selectExpr. This is probably the most convenient interface for everyday use:

// in Scala

df.selectExpr("DEST_COUNTRY_NAME as newColumnName", "DEST_COUNTRY_NAME").show(2)
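To make the shorthand concrete, the sketch below runs the same projection three ways: select with expr, selectExpr, and SQL against a temporary view. The two-row dataset is a made-up stand-in for the flight data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[*]").appName("selectExpr").getOrCreate()
import spark.implicits._

// Illustrative data only.
val df = Seq(("United States", "Romania", 15L), ("Egypt", "Egypt", 3L))
  .toDF("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")
df.createOrReplaceTempView("dfTable")

// select with a series of expr(...) calls
val viaSelect = df.select(expr("DEST_COUNTRY_NAME as newColumnName"), expr("DEST_COUNTRY_NAME"))

// the selectExpr shorthand for the same thing
val viaSelectExpr = df.selectExpr("DEST_COUNTRY_NAME as newColumnName", "DEST_COUNTRY_NAME")

// the equivalent SQL query
val viaSql = spark.sql("SELECT DEST_COUNTRY_NAME AS newColumnName, DEST_COUNTRY_NAME FROM dfTable")
```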

Moreover, selectExpr can be used to build new DataFrames; in fact, we can add any valid non-aggregating SQL statement:

// in Scala

df.selectExpr(

"*", // include all original columns

"(DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry")

.show(2)

-- in SQL

SELECT *, (DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry

FROM dfTable LIMIT 2

Giving an output of:

3. Converting to Spark Types

4. Adding Columns

5. Renaming Columns

6. Reserved Characters and Keywords

7. Case Sensitivity

8. Removing Columns

9. Changing a Column's Type

10. Filtering Rows

11. Getting Unique Rows

12. Random Samples

13. Random Splits

14. Concatenating and Appending Rows

15. Sorting Rows

16. Limit

17. Repartition and Coalesce
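A minimal sketch of several of the operations listed above, run on a hypothetical 100-row DataFrame built with `spark.range`. The column names, seed, fractions, and partition counts are arbitrary choices for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("row-ops").getOrCreate()

val nums = spark.range(0, 100).toDF("n")   // 100 rows: 0 to 99

val renamed = nums.withColumnRenamed("n", "id")                          // renaming a column
val asString = nums.withColumn("n", col("n").cast("string"))             // changing a column's type
val evens = nums.where(col("n") % 2 === 0)                               // filtering rows
val parities = nums.select((col("n") % 2).as("parity")).distinct()       // unique rows
val sampled = nums.sample(withReplacement = false, fraction = 0.5, seed = 5L) // random sample
val Array(small, large) = nums.randomSplit(Array(0.25, 0.75), seed = 5L) // random split
val appended = nums.union(nums)                                          // appending rows
val top3 = nums.orderBy(col("n").desc).limit(3)                          // sorting and limit
val five = nums.repartition(5)                                           // full shuffle into 5 partitions
val two = five.coalesce(2)                                               // merge partitions without a full shuffle
```

The sizes of `sampled`, `small`, and `large` depend on the seed, so treat the fractions as approximate targets, not exact row counts.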
