RDD
Datasets and DataFrames
The official documentation explains Datasets and DataFrames as follows:
A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not have support for the Dataset API. But due to Python's dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally: row.columnName). The case for R is similar.
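As a minimal Scala sketch of the paragraph above — constructing a Dataset from JVM objects and applying typed transformations. The Person case class, the application name, and the local-mode session are illustrative assumptions, not taken from the text:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical JVM object; any case class with an Encoder works.
case class Person(name: String, age: Int)

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-example")
      .master("local[*]")      // local mode, for illustration only
      .getOrCreate()
    import spark.implicits._   // brings Encoders for case classes into scope

    // Construct a Dataset[Person] from JVM objects...
    val people = Seq(Person("Ann", 34), Person("Bob", 17)).toDS()

    // ...then manipulate it with strongly typed lambda functions.
    val adultNames = people.filter(_.age >= 18).map(_.name)
    adultNames.show()

    spark.stop()
  }
}
```

Note that filter and map here operate on Person objects directly, so field access like _.age is checked at compile time — the "strong typing" benefit the quote refers to.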
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row], while in the Java API, users need to use Dataset<Row> to represent a DataFrame.
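The type-alias relationship can be seen directly in Scala. A rough sketch, assuming a local-mode session; the column names and sample data are invented for illustration:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object DataFrameExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-example")
      .master("local[*]")      // local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    // A DataFrame built from an in-memory collection; it could just as
    // well come from structured files, Hive tables, or an existing RDD.
    val df: DataFrame = Seq(("Ann", 34), ("Bob", 17)).toDF("name", "age")

    // In the Scala API, DataFrame is literally Dataset[Row],
    // so this assignment compiles without any conversion.
    val sameThing: Dataset[Row] = df

    // Rows are untyped: columns are resolved by name at runtime.
    sameThing.filter($"age" >= 18).select($"name").show()

    spark.stop()
  }
}
```

The contrast with the earlier typed example is the point: here $"age" is resolved at runtime rather than checked against a case class at compile time, which is exactly the Dataset-vs-DataFrame trade-off the quote describes.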