Preface
A summary of the table-related APIs in PySpark SQL.
- SparkSession
SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read parquet files.
- table(tableName)
Returns the specified table as a DataFrame.
>>> df.createOrReplaceTempView("table1")
>>> df2 = spark.table("table1")
>>> sorted(df.collect()) == sorted(df2.collect())
True
- SQLContext(sparkContext, sparkSession=None, jsqlContext=None)
Serves the same role as SparkSession and has been superseded by it.
- cacheTable(tableName)
Caches the specified table in memory.
- clearCache()
Removes all cached tables from the in-memory cache.
- createExternalTable(tableName, path=None, source=None, schema=None, **options)
Creates an external table based on the dataset at the given path and returns the corresponding DataFrame.
- dropTempTable(tableName)
Removes the specified temporary table from the catalog.
>>> sqlContext.registerDataFrameAsTable(df, "table1")
>>> sqlContext.dropTempTable("table1")
- registerDataFrameAsTable(df, tableName)
Registers the given DataFrame as a temporary table.
>>> sqlContext.registerDataFrameAsTable(df, "table1")
- sql(sqlQuery)
Returns a DataFrame representing the result of the given query.
>>> sqlContext.registerDataFrameAsTable(df, "table1")
>>> df2 = sqlContext.sql("SELECT field1 AS f1, field2 as f2 from table1")
>>> df2.collect()
[Row(f1=1, f2='row1'), Row(f1=2, f2='row2'), Row(f1=3, f2='row3')]
- table(tableName)
Returns the specified table or view as a DataFrame.
>>> sqlContext.registerDataFrameAsTable(df, "table1")
>>> df2 = sqlContext.table("table1")
>>> sorted(df.collect()) == sorted(df2.collect())
True
- tableNames(dbName=None)
Returns the names of all tables in database dbName as a list.
- tables(dbName=None)
Returns all tables in database dbName as a DataFrame, similar to:
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| | user| true|
+--------+---------+-----------+
- uncacheTable(tableName)
Removes the specified table from the in-memory cache.
- createGlobalTempView(name)
Creates the DataFrame as a global temporary view.
>>> df.createGlobalTempView("people")
>>> df2 = spark.sql("select * from global_temp.people")
>>> sorted(df.collect()) == sorted(df2.collect())
True
>>> df.createGlobalTempView("people")
Traceback (most recent call last):
...
AnalysisException: u"Temporary table 'people' already exists;"
>>> spark.catalog.dropGlobalTempView("people")
- createOrReplaceTempView(name)
Creates a local temporary view with this DataFrame, replacing any existing view of the same name.
- createTempView(name)
Creates a local temporary view with this DataFrame; throws an AnalysisException if a view with that name already exists.
>>> df.createTempView("people")
- registerTempTable(name)
Registers this DataFrame as a temporary table with the given name.
- cacheTable(tableName)
- clearCache()
- createExternalTable()
- createTable()
- dropGlobalTempView(viewName)
>>> spark.createDataFrame([(1, 1)]).createGlobalTempView("my_table")
>>> spark.table("global_temp.my_table").collect()
[Row(_1=1, _2=1)]
>>> spark.catalog.dropGlobalTempView("my_table")
>>> spark.table("global_temp.my_table")
Traceback (most recent call last):
...
AnalysisException: ...
- dropTempView(viewName)
Drops the local temporary view with the given name from the catalog.
- isCached(tableName)
Returns True if the specified table is currently cached in memory.