AttributeError: 'DataFrame' object has no attribute 'map'

[root@master pyspark]# spark-submit spark_python_sql.py
19/05/04 17:03:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/05/04 17:03:16 INFO SparkContext: Running Spark version 2.4.1
19/05/04 17:03:16 INFO SparkContext: Submitted application: python Spark Wordcount...
19/05/04 17:03:17 INFO SecurityManager: Changing view acls to: root
19/05/04 17:03:17 INFO SecurityManager: Changing modify acls to: root
19/05/04 17:03:17 INFO SecurityManager: Changing view acls groups to:
19/05/04 17:03:17 INFO SecurityManager: Changing modify acls groups to:
19/05/04 17:03:17 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
19/05/04 17:03:17 INFO Utils: Successfully started service 'sparkDriver' on port 40129.
19/05/04 17:03:17 INFO SparkEnv: Registering MapOutputTracker
19/05/04 17:03:17 INFO SparkEnv: Registering BlockManagerMaster
19/05/04 17:03:17 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
19/05/04 17:03:17 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
19/05/04 17:03:17 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-d8b28dd5-9cd1-4837-8eb7-77e63976c377
19/05/04 17:03:17 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
19/05/04 17:03:17 INFO SparkEnv: Registering OutputCommitCoordinator
19/05/04 17:03:17 INFO Utils: Successfully started service 'SparkUI' on port 4040.
19/05/04 17:03:18 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://master:4040
19/05/04 17:03:18 INFO Executor: Starting executor ID driver on host localhost
19/05/04 17:03:18 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46628.
19/05/04 17:03:18 INFO NettyBlockTransferService: Server created on master:46628
19/05/04 17:03:18 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/05/04 17:03:18 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, master, 46628, None)
19/05/04 17:03:18 INFO BlockManagerMasterEndpoint: Registering block manager master:46628 with 366.3 MB RAM, BlockManagerId(driver, master, 46628, None)
19/05/04 17:03:18 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, master, 46628, None)
19/05/04 17:03:18 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, master, 46628, None)
Traceback (most recent call last):
  File "/home/badou/pyspark/spark_python_sql.py", line 53, in <module>
    """).map(lambda output: output.session_id + "\t" + str(output.cnt))
  File "/root/anaconda3/envs/py27/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1300, in __getattr__
AttributeError: 'DataFrame' object has no attribute 'map'

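After a lot of searching, I found the cause. Before Spark 2.0, spark_df.map was an alias for spark_df.rdd.map(). That alias was removed in Spark 2.0, so on any 2.x release (the log above shows I was running 2.4.1) calling map directly on a DataFrame fails with 'DataFrame' object has no attribute 'map'. The fix is to convert to an RDD explicitly and call spark_df.rdd.map().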
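The traceback ends in `__getattr__` because, as far as I understand it, PySpark treats any attribute that is not a real DataFrame method as a column lookup and raises AttributeError when no such column exists. The sketch below imitates that behaviour in plain Python so it runs without Spark. `ToyDataFrame` and `ToyRDD` are hypothetical stand-ins written for this illustration, not PySpark classes.

```python
class ToyRDD(object):
    """Hypothetical stand-in for an RDD: this is where map() lives."""

    def __init__(self, rows):
        self._rows = list(rows)

    def map(self, func):
        # Apply func to every row and wrap the results in a new ToyRDD.
        return ToyRDD(func(row) for row in self._rows)

    def collect(self):
        return list(self._rows)


class ToyDataFrame(object):
    """Hypothetical stand-in for a DataFrame with named columns."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self._rows = list(rows)

    @property
    def rdd(self):
        # Explicit conversion, analogous to spark_df.rdd.
        return ToyRDD(self._rows)

    def __getattr__(self, name):
        # Reached only when normal attribute lookup fails.
        # Names that are not columns produce the error from the traceback.
        if name not in self.columns:
            raise AttributeError(
                "'%s' object has no attribute '%s'"
                % (self.__class__.__name__, name))
        index = self.columns.index(name)
        return [row[index] for row in self._rows]


df = ToyDataFrame(["session_id", "cnt"], [("s1", 3), ("s2", 1)])

try:
    df.map(lambda row: row)  # no map on the DataFrame itself
except AttributeError as err:
    error_message = str(err)

# Going through .rdd first works.
lines = df.rdd.map(lambda row: row[0] + "\t" + str(row[1])).collect()
```

Here `df.cnt` still resolves because `cnt` is a column, while `df.map` falls through to the AttributeError branch.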

Here is the script with the fix applied:
#coding=utf-8
import os
import time
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row

if __name__ == "__main__":
    sparkConf = SparkConf().setAppName("python Spark Wordcount...").setMaster("local[2]")

    sc = SparkContext(conf=sparkConf)  # create the SparkContext
    sc.setLogLevel("WARN")  # set the log level

    # create SQLContext
    # note: the keyword argument is sparkContext and takes the sc instance, not the SparkContext class
    sqlcontext = SQLContext(sparkContext=sc)

    # transform function
    def map_func(line):
        # split line
        arr = line.split("\t")
        # return a Row (the Row type is worth reading up on)
        return Row(track_time=arr[0], url=arr[1], session_id=arr[2],
                   referer=arr[3], ip=arr[4], end_user_id=arr[5], city_id=arr[6])

    # read data from local/hdfs and transform to RDD[Row]
    page_views_rdd = sc\
        .textFile("/home/badou/pyspark/page_views.data")\
        .map(map_func)
    # create DataFrame
    page_views_df = sqlcontext.createDataFrame(page_views_rdd)

    # # print
    # print(page_views_df.count())
    # page_views_df.show()

    # Data analysis with SQL
    # register temp table (first register the DataFrame as a temporary table)
    page_views_df.registerTempTable("temp_page_views")

    # Requirement: group by session_id and count the rows, i.e. page views (PV) per session
    # .rdd converts the DataFrame to an RDD explicitly, so that map is available
    session_pv = sqlcontext.sql("""
        SELECT
            session_id, count(1) AS cnt
        FROM
            temp_page_views
        GROUP BY
            session_id
        ORDER BY
            cnt DESC
        LIMIT
            10
    """).rdd.map(lambda output: output.session_id + "\t" + str(output.cnt))

    for result in session_pv.collect():
        print(result)
    # this approach is not used very often

    time.sleep(100000)  # sleep for a while so the web UI can be monitored
    sc.stop()  # SparkContext is shut down with stop(); it has no close() method
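To check what the query should return on a small sample without starting Spark, the same aggregation can be sketched in plain Python. This is only an illustration: `top_sessions_by_pv` and the sample lines are made up for this example, and skipping short lines is my own assumption (the script above would raise an IndexError on them instead).

```python
from collections import Counter


def top_sessions_by_pv(lines, limit=10):
    """Count lines per session_id and return the top `limit` as tab-separated strings."""
    counts = Counter()
    for line in lines:
        fields = line.split("\t")
        if len(fields) < 7:
            continue  # assumption: silently skip malformed lines
        counts[fields[2]] += 1  # session_id is the third field, as in map_func
    # most_common sorts by count in descending order, like ORDER BY cnt DESC LIMIT n
    return [session_id + "\t" + str(cnt)
            for session_id, cnt in counts.most_common(limit)]


# Hypothetical sample in the layout map_func expects:
# track_time, url, session_id, referer, ip, end_user_id, city_id
sample_lines = [
    "2019-05-04 17:00:00\thttp://example.com/a\tsess-1\t-\t10.0.0.1\tu1\t1",
    "2019-05-04 17:00:01\thttp://example.com/b\tsess-1\t-\t10.0.0.1\tu1\t1",
    "2019-05-04 17:00:02\thttp://example.com/a\tsess-2\t-\t10.0.0.2\tu2\t2",
    "2019-05-04 17:00:03\thttp://example.com/c\tsess-3\t-\t10.0.0.3\tu3\t3",
    "2019-05-04 17:00:04\thttp://example.com/c\tsess-1\t-\t10.0.0.1\tu1\t1",
    "2019-05-04 17:00:05\thttp://example.com/d\tsess-3\t-\t10.0.0.3\tu3\t3",
]

top_sessions = top_sessions_by_pv(sample_lines)
for result in top_sessions:
    print(result)
```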