PySpark Primer Series: Summary and Hands-On Guide to pyspark.sql.Column Functions

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('sparksqlColumn').getOrCreate()

df = spark.read.csv('../data/data.csv', header=True)

df.show(3)

+---+----+----+------+----+------+----------+-------------------+----+----+----+
|_c0|對手|勝負|主客場|命中|投籃數|投籃命中率|          3分命中率|籃板|助攻|得分|
+---+----+----+------+----+------+----------+-------------------+----+----+----+
|  0|勇士|  勝|    客|  10|    23|     0.435|              0.444|   6|  11|  27|
|  1|國王|  勝|    客|   8|    21|     0.381|0.28600000000000003|   3|   9|  27|
|  2|小牛|  勝|    主|  10|    19|     0.526|              0.462|   3|   7|  29|
+---+----+----+------+----+------+----------+-------------------+----+----+----+
only showing top 3 rows

df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- 對手: string (nullable = true)
 |-- 勝負: string (nullable = true)
 |-- 主客場: string (nullable = true)
 |-- 命中: string (nullable = true)
 |-- 投籃數: string (nullable = true)
 |-- 投籃命中率: string (nullable = true)
 |-- 3分命中率: string (nullable = true)
 |-- 籃板: string (nullable = true)
 |-- 助攻: string (nullable = true)
 |-- 得分: string (nullable = true)

from pyspark.sql.types import IntegerType, FloatType

df = df.withColumn('命中', df['命中'].cast(IntegerType()))
df = df.withColumn('投籃數', df['投籃數'].cast(IntegerType()))
df = df.withColumn('投籃命中率', df['投籃命中率'].cast(FloatType()))
df = df.withColumn('3分命中率', df['3分命中率'].cast(FloatType()))
df = df.withColumn('籃板', df['籃板'].cast(IntegerType()))
df = df.withColumn('助攻', df['助攻'].cast(IntegerType()))
df = df.withColumn('得分', df['得分'].cast(IntegerType()))

df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- 對手: string (nullable = true)
 |-- 勝負: string (nullable = true)
 |-- 主客場: string (nullable = true)
 |-- 命中: integer (nullable = true)
 |-- 投籃數: integer (nullable = true)
 |-- 投籃命中率: float (nullable = true)
 |-- 3分命中率: float (nullable = true)
 |-- 籃板: integer (nullable = true)
 |-- 助攻: integer (nullable = true)
 |-- 得分: integer (nullable = true)

alias

Gives the column an alias (a new name).

df.select(df['對手'].alias('比賽對手')).show(3)

+--------+
|比賽對手|
+--------+
|    勇士|
|    國王|
|    小牛|
+--------+
only showing top 3 rows

asc

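Sorts by a column in ascending order.

asc_nulls_first(): null values come first
asc_nulls_last(): null values come last

# Sort by 得分 (points scored) in ascending order and print the first 5 opponents (對手) with their scores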
df.select('對手', '得分').orderBy(df['得分'].asc()).show(5)

+------+----+
|  對手|得分|
+------+----+
|  灰熊|  20|
|  掘金|  21|
|  灰熊|  22|
|  鵜鶘|  26|
|步行者|  26|
+------+----+
only showing top 5 rows

astype()

Converts the data type; astype is an alias of cast.

between

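A boolean expression that evaluates to true if the value of this expression lies between the given lower and upper bounds. It can be used to filter the Rows that satisfy the condition.

# Flag rows whose 得分 (points scored) is between 15 and 20 (bounds included), then keep the flagged rows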
df1 = df.select('對手', df['得分'].between(15, 20).alias('selected_df'))
df1.filter(df1['selected_df'] == True).show()

+----+-----------+
|對手|selected_df|
+----+-----------+
|灰熊|       true|
+----+-----------+

bitwiseAND: bitwise AND

bitwiseOR: bitwise OR

bitwiseXOR: bitwise XOR

contains(other)

Tests whether the column contains other. Returns a boolean column based on a string match.

df.filter(df['對手'].contains('小')).show()

+---+----+----+------+----+------+----------+---------+----+----+----+
|_c0|對手|勝負|主客場|命中|投籃數|投籃命中率|3分命中率|籃板|助攻|得分|
+---+----+----+------+----+------+----------+---------+----+----+----+
|  2|小牛|  勝|    主|  10|    19|     0.526|    0.462|   3|   7|  29|
+---+----+----+------+----+------+----------+---------+----+----+----+

desc()

desc_nulls_first()

desc_nulls_last()

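Sorts by a column in descending order; desc_nulls_first() and desc_nulls_last() additionally place null values first or last.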

df.select('對手', '得分').orderBy(df['得分'].desc()).show(5)

+------+----+
|  對手|得分|
+------+----+
|  爵士|  56|
|開拓者|  48|
|  太陽|  48|
|  猛龍|  38|
|  灰熊|  38|
+------+----+
only showing top 5 rows

endswith(other)

Returns a boolean column that is True for strings ending with other.

df.filter(df['對手'].endswith('熊')).show()

+---+----+----+------+----+------+----------+---------+----+----+----+
|_c0|對手|勝負|主客場|命中|投籃數|投籃命中率|3分命中率|籃板|助攻|得分|
+---+----+----+------+----+------+----------+---------+----+----+----+
|  3|灰熊|  負|    主|   8|    20|       0.4|     0.25|   5|   8|  22|
|  6|灰熊|  負|    客|   6|    19|     0.316|    0.222|   4|   8|  20|
| 12|灰熊|  勝|    主|  11|    25|      0.44|    0.429|   4|   8|  38|
| 16|灰熊|  勝|    客|   9|    20|      0.45|      0.5|   5|   7|  29|
+---+----+----+------+----+------+----------+---------+----+----+----+

eqNullSafe(other)

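Null-safe equality test: compares the column with a given value, treating null as an ordinary comparable value.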

from pyspark.sql import Row

df1 = spark.createDataFrame([Row(id=1, value='foo'), Row(id=2, value=None)])
df1.select('id', 'value', df1.value.eqNullSafe('foo'), df1.value.eqNullSafe(None)).show()

+---+-----+---------------+----------------+
| id|value|(value <=> foo)|(value <=> NULL)|
+---+-----+---------------+----------------+
|  1|  foo|           true|           false|
|  2| null|          false|            true|
+---+-----+---------------+----------------+

isNotNull()

Returns True if the current expression is not null.

df1.select('id', 'value', df1.value.isNotNull()).show()

+---+-----+-------------------+
| id|value|(value IS NOT NULL)|
+---+-----+-------------------+
|  1|  foo|               true|
|  2| null|              false|
+---+-----+-------------------+

isNull()

Returns True if the current expression is null.

df1.select('id', 'value', df1.value.isNull()).show()

+---+-----+---------------+
| id|value|(value IS NULL)|
+---+-----+---------------+
|  1|  foo|          false|
|  2| null|           true|
+---+-----+---------------+

isin()

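A boolean expression that evaluates to true if the value of this expression is contained in the evaluated values of the arguments.

# Select the rows whose opponent (對手) is one of ['灰熊', '76人', '騎士']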
df.filter(df['對手'].isin(['灰熊', '76人', '騎士'])).show()

+---+----+----+------+----+------+----------+---------+----+----+----+
|_c0|對手|勝負|主客場|命中|投籃數|投籃命中率|3分命中率|籃板|助攻|得分|
+---+----+----+------+----+------+----------+---------+----+----+----+
|  3|灰熊|  負|    主|   8|    20|       0.4|     0.25|   5|   8|  22|
|  4|76人|  勝|    客|  10|    20|       0.5|     0.25|   3|  13|  27|
|  6|灰熊|  負|    客|   6|    19|     0.316|    0.222|   4|   8|  20|
|  7|76人|  負|    主|   8|    21|     0.381|    0.429|   4|   7|  29|
| 11|騎士|  勝|    主|   8|    21|     0.381|    0.429|  11|  13|  35|
| 12|灰熊|  勝|    主|  11|    25|      0.44|    0.429|   4|   8|  38|
| 16|灰熊|  勝|    客|   9|    20|      0.45|      0.5|   5|   7|  29|
+---+----+----+------+----+------+----------+---------+----+----+----+

like(other)

Like SQL's LIKE: returns a boolean column based on a SQL LIKE match.

# Return the rows whose opponent (對手) starts with '灰'
df.select('對手', '勝負', '主客場', '得分').where(df['對手'].like('灰%')).show()

+----+----+------+----+
|對手|勝負|主客場|得分|
+----+----+------+----+
|灰熊|  負|    主|  22|
|灰熊|  負|    客|  20|
|灰熊|  勝|    主|  38|
|灰熊|  勝|    客|  29|
+----+----+------+----+

otherwise(value)

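Evaluates a list of conditions and returns one of multiple possible result expressions; otherwise(value) supplies the result for rows that match none of the conditions.

# Add a flag column: 1 when the opponent (對手) is 灰熊, 0 for every other opponent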
from pyspark.sql import functions as F
df.withColumn('flag', F.when(df['對手'] == '灰熊', 1).otherwise(0)).show(5)

+---+----+----+------+----+------+----------+---------+----+----+----+----+
|_c0|對手|勝負|主客場|命中|投籃數|投籃命中率|3分命中率|籃板|助攻|得分|flag|
+---+----+----+------+----+------+----------+---------+----+----+----+----+
|  0|勇士|  勝|    客|  10|    23|     0.435|    0.444|   6|  11|  27|   0|
|  1|國王|  勝|    客|   8|    21|     0.381|    0.286|   3|   9|  27|   0|
|  2|小牛|  勝|    主|  10|    19|     0.526|    0.462|   3|   7|  29|   0|
|  3|灰熊|  負|    主|   8|    20|       0.4|     0.25|   5|   8|  22|   1|
|  4|76人|  勝|    客|  10|    20|       0.5|     0.25|   3|  13|  27|   0|
+---+----+----+------+----+------+----------+---------+----+----+----+----+
only showing top 5 rows

rlike(other)

SQL RLIKE expression (LIKE with a regular expression). Returns a boolean column based on a regex match.

df.filter(df['對手'].rlike('熊$')).show()

+---+----+----+------+----+------+----------+---------+----+----+----+
|_c0|對手|勝負|主客場|命中|投籃數|投籃命中率|3分命中率|籃板|助攻|得分|
+---+----+----+------+----+------+----------+---------+----+----+----+
|  3|灰熊|  負|    主|   8|    20|       0.4|     0.25|   5|   8|  22|
|  6|灰熊|  負|    客|   6|    19|     0.316|    0.222|   4|   8|  20|
| 12|灰熊|  勝|    主|  11|    25|      0.44|    0.429|   4|   8|  38|
| 16|灰熊|  勝|    客|   9|    20|      0.45|      0.5|   5|   7|  29|
+---+----+----+------+----+------+----------+---------+----+----+----+

startswith(other)

Returns a boolean column that is True for strings starting with other.

df.select('對手', df['對手'].startswith('灰').alias('灰%')).show(5)

+----+-----+
|對手|  灰%|
+----+-----+
|勇士|false|
|國王|false|
|小牛|false|
|灰熊| true|
|76人|false|
+----+-----+
only showing top 5 rows

substr(startPos, length)

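Returns a Column that is a substring of this column. startPos is 1-based.

# Return the first character of the opponent (對手) name, aliased as '子串' (substring)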
df.select('對手', df['對手'].substr(1, 1).alias('子串')).show(3)

+----+----+
|對手|子串|
+----+----+
|勇士|  勇|
|國王|  國|
|小牛|  小|
+----+----+
only showing top 3 rows

when(condition, value)

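Evaluates a list of conditions and returns one of multiple possible result expressions. If Column.otherwise() is not invoked, None is returned for unmatched conditions.

# Mark rows where 得分 (points scored) is greater than 25 with 1, otherwise 0; the flag column is named 'score_flag'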
from pyspark.sql import functions as F

df.select('對手','得分', F.when(df['得分'] > 25, 1).otherwise(0).alias('score_flag')).show(5)

+----+----+----------+
|對手|得分|score_flag|
+----+----+----------+
|勇士|  27|         1|
|國王|  27|         1|
|小牛|  29|         1|
|灰熊|  22|         0|
|76人|  27|         1|
+----+----+----------+
only showing top 5 rows

getItem(key)

An expression that gets an item at position ordinal out of a list, or gets an item by key out of a dict.

df1 = spark.createDataFrame([([1, 2], {'key': 'value'})], ['l', 'd'])
df1.select(df1.l.getItem(0), df1.d.getItem('key')).show()

+----+------+
|l[0]|d[key]|
+----+------+
|   1| value|
+----+------+

Link to the official documentation: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column
