Loading a Tab-Delimited File with PySpark SQL
Preparing the data file
To make the later steps easy to reproduce, first generate the data file data.txt with the following Python code:
data = [
    'x1\t1\t2',
    'x2\t2\t2',
    'x3\t3\t2',
    'x4\t4\t2',
    'x5\t5\t2',
]

with open('data.txt', 'w') as f:
    for line in data:
        f.write('{}\n'.format(line))
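As a quick sanity check (not part of the original walkthrough), the generated file can be read back with Python's built-in csv module, using a tab delimiter. The snippet below regenerates the file so it is self-contained:

```python
import csv

# Same content as above: five tab-separated rows of (id, age, number)
data = ['x1\t1\t2', 'x2\t2\t2', 'x3\t3\t2', 'x4\t4\t2', 'x5\t5\t2']
with open('data.txt', 'w') as f:
    for line in data:
        f.write('{}\n'.format(line))

# Read the file back, splitting each line on the tab delimiter
with open('data.txt', newline='') as f:
    rows = list(csv.reader(f, delimiter='\t'))

print(rows[0])   # ['x1', '1', '2']
print(len(rows)) # 5
```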
Loading the data file with PySpark SQL
import os
from os.path import abspath

from pyspark.sql import SparkSession

# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('spark-warehouse')

# Start from a clean slate: remove any warehouse/metastore left by earlier runs
os.system("rm -rf {}".format(warehouse_location))
os.system("rm -rf {}".format('metastore_db'))

spark = SparkSession.builder \
    .appName("sql") \
    .master('local') \
    .config('spark.sql.warehouse.dir', warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

data_file = 'data.txt'
table = 'Info'

# Declare the table schema and tell Hive the fields are tab-separated
cmd = "create table if not exists {}".format(table) + \
      " (id STRING, age INT, number INT)" + \
      r" row format delimited fields terminated by '\t'"
print(cmd)
spark.sql(cmd)

# Load the local file into the table, then query the first two rows
spark.sql("load data local inpath '{}' into table {}".format(data_file, table))
spark.sql("select * from {} limit 2".format(table)).show()

spark.stop()
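The DDL above maps each line of the file onto the schema (id STRING, age INT, number INT) by splitting on the tab character. The same per-row mapping can be sketched in plain Python; this is a simplified illustration of what `row format delimited fields terminated by '\t'` does, not Hive's actual parser:

```python
def parse_row(line):
    # Mimic: fields terminated by '\t', then cast to the declared
    # column types (id STRING, age INT, number INT)
    id_, age, number = line.rstrip('\n').split('\t')
    return (id_, int(age), int(number))

rows = [parse_row(l) for l in ['x1\t1\t2', 'x2\t2\t2']]
print(rows)  # [('x1', 1, 2), ('x2', 2, 2)]
```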
The result is as follows: