Connecting PyCharm to a remote server to run Spark: configuration and troubleshooting

Configuring the Spark runtime environment:

Step 1:
After connecting PyCharm to the remote server, set the project's Python interpreter path (either the server's system Python or the Python interpreter inside a virtual environment). The example in this article uses a remote virtual environment.
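
A quick way to confirm the remote interpreter is wired up correctly is to run a trivial script through it; the printed path should match the interpreter you selected (the path in the comment is the example virtual environment used in this article).

# Should print something like /root/miniconda3/envs/ai/bin/python3
import sys
print(sys.executable)
print(sys.version)
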
Step 2:
Copy the pyspark folder from the python directory of the Spark installation into the site-packages directory of the Python interpreter:

site-packages path of the server's system Python interpreter:
cd /usr/local/python3/lib/python3.6/site-packages
site-packages path of the virtual-environment Python interpreter:
cd /root/miniconda3/envs/ai/lib/python3.6/site-packages

The python directory inside the Spark installation:
/export/servers/spark-2.3.4-bin-hadoop2.7/python
From inside Spark's python directory, run:

# Copy pyspark into the site-packages directory of the virtual environment's interpreter
cp -r pyspark /root/miniconda3/envs/ai/lib/python3.6/site-packages
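
If you prefer not to copy files into site-packages, a common alternative (shown here as a sketch, not the author's method) is to put Spark's bundled Python sources on sys.path at the top of the script; the py4j zip name assumed below is the one that ships with this Spark 2.3.4 install.

# Make Spark's bundled pyspark and py4j importable without copying them
import glob
import os
import sys

SPARK_HOME = "/export/servers/spark-2.3.4-bin-hadoop2.7"
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
# Spark ships py4j as a zip under python/lib (py4j-0.10.7-src.zip for Spark 2.3.4)
sys.path.insert(0, glob.glob(os.path.join(SPARK_HOME, "python", "lib", "py4j-*-src.zip"))[0])

from pyspark.sql import SparkSession  # should now import without the copy step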

Step 3:
Configure the environment variables for PyCharm. They can either be set at the top of your code (before the SparkSession is created) or added to the Run/Debug Configuration.
Method 1: set them directly in code:

import os
# SPARK_HOME environment variable
SPARK_HOME = "/export/servers/spark-2.3.4-bin-hadoop2.7"
# Two Python versions coexist on the system, so PYSPARK_PYTHON and
# PYSPARK_DRIVER_PYTHON must both point at python3.
# Either the system python3 or the virtual-environment python3 works:
# PYSPARK_PYTHON = "/root/miniconda3/envs/ai/bin/python3"
# PYSPARK_DRIVER_PYTHON = "/root/miniconda3/envs/ai/bin/python3"
PYSPARK_PYTHON = "/usr/local/python3/bin/python3"
PYSPARK_DRIVER_PYTHON = "/usr/local/python3/bin/python3"
# Export them to the process environment
os.environ['SPARK_HOME'] = SPARK_HOME
os.environ['PYSPARK_PYTHON'] = PYSPARK_PYTHON
os.environ['PYSPARK_DRIVER_PYTHON'] = PYSPARK_DRIVER_PYTHON

Method 2: set the environment variables in the Run/Debug Configuration:
Click Edit Configurations and add the variables under Environment variables.
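
The same three variables used in Method 1 go in as name/value pairs; with the paths from this article they would be:

SPARK_HOME=/export/servers/spark-2.3.4-bin-hadoop2.7
PYSPARK_PYTHON=/usr/local/python3/bin/python3
PYSPARK_DRIVER_PYTHON=/usr/local/python3/bin/python3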

Step 4: run a test example
Test code:

import os
# SPARK_HOME environment variable
SPARK_HOME = "/export/servers/spark-2.3.4-bin-hadoop2.7"
# Two Python versions coexist on the system, so point PYSPARK_PYTHON and
# PYSPARK_DRIVER_PYTHON at python3 explicitly
PYSPARK_PYTHON = "/usr/local/python3/bin/python3"
PYSPARK_DRIVER_PYTHON = "/usr/local/python3/bin/python3"
os.environ['SPARK_HOME'] = SPARK_HOME
os.environ['PYSPARK_PYTHON'] = PYSPARK_PYTHON
os.environ['PYSPARK_DRIVER_PYTHON'] = PYSPARK_DRIVER_PYTHON

from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("test").getOrCreate()
sc = spark.sparkContext
# Word count: split each line into words, pair each word with 1, then sum per word
counts = (sc.textFile('file:///root/tmp/word.txt')
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda x, y: x + y))
output = counts.collect()
print(output)
spark.stop()
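
The script reads file:///root/tmp/word.txt on the server. A minimal sketch to create some sample input first (the contents are made up for illustration):

# Create a small test file on the server before running the job
with open("/root/tmp/word.txt", "w") as f:
    f.write("hello spark\nhello spark\nhello\n")

With that input, the printed result would look like [('hello', 3), ('spark', 2)]; the order of the pairs is not guaranteed.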

Troubleshooting
1. If you get a "No module named 'py4j'" error, install the py4j module:

# Install py4j
pip install py4j

Error output:

ssh://root@192.168.141.130:22/root/miniconda3/envs/ai/bin/python3.6 -u /root/Desktop/kafka_test.py
Traceback (most recent call last):
  File "/root/Desktop/kafka_test.py", line 13, in <module>
    from pyspark.sql import SparkSession
  File "/root/miniconda3/envs/ai/lib/python3.6/site-packages/pyspark/__init__.py", line 46, in <module>
    from pyspark.context import SparkContext
  File "/root/miniconda3/envs/ai/lib/python3.6/site-packages/pyspark/context.py", line 29, in <module>
    from py4j.protocol import Py4JError
ModuleNotFoundError: No module named 'py4j'
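
After installing, a quick check (run with the same interpreter PyCharm uses) is to import py4j directly; for the py4j bundled with Spark 2.3.4 this should print something like 0.10.7:

# Confirm py4j is importable from this interpreter and show its version
import py4j
print(py4j.__version__)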

2. If you get a "No module named 'pyspark'" error, copy the pyspark directory from Spark's python directory into the site-packages directory of your Python interpreter, as described in Step 2.
Error output:

ssh://root@192.168.141.130:22/root/miniconda3/envs/ai/bin/python3.6 -u /root/Desktop/kafka_test.py
Traceback (most recent call last):
  File "/root/Desktop/kafka_test.py", line 13, in <module>
    from pyspark.sql import SparkSession
ModuleNotFoundError: No module named 'pyspark'
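
A small diagnostic sketch to confirm which interpreter the script actually runs under and whether pyspark is visible on its sys.path:

# Print the interpreter path and where pyspark would be imported from, if anywhere
import importlib.util
import sys

print(sys.executable)
spec = importlib.util.find_spec("pyspark")
print(spec.origin if spec else "pyspark is not on sys.path")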

3. If you get the exception "Exception: Python in worker has different version 3.7 than that in driver 3.6, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.", the worker and driver Python versions conflict: the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON paths are not configured correctly. Re-check the interpreter and environment-variable settings from the steps above.
Error output:

2019-10-16 18:23:00 ERROR TaskSetManager:70 - Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "/root/Desktop/kafka_test.py", line 13, in <module>
    output = counts.collect()
  File "/export/servers/spark-2.3.4-bin-hadoop2.7/python/pyspark/rdd.py", line 814, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/export/servers/spark-2.3.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/export/servers/spark-2.3.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/export/servers/spark-2.3.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 181, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 3.7 than that in driver 3.6, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:336)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:475)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:458)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:290)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1126)
	at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1132)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
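
A sketch for narrowing the mismatch down: print the driver's Python version and the two variables the error message refers to (before the SparkSession is created), and make sure both paths point at interpreters of the same minor version.

# Show the driver interpreter and the worker/driver overrides Spark will use
import os
import sys

print("driver python:", sys.executable, sys.version.split()[0])
print("PYSPARK_PYTHON        =", os.environ.get("PYSPARK_PYTHON"))
print("PYSPARK_DRIVER_PYTHON =", os.environ.get("PYSPARK_DRIVER_PYTHON"))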