Python--spark job--linux

Notes for my own reference.

Mission:

Run a Python project as a Spark job that is triggered by an upstream Spark job.

Run the Python project as a Spark job:

spark-submit --py-files pkg.zip main.py

Packaging:

Ways to package the Python project:

Zip: zip -r HiveValidation.zip ./

libs: bin/pip install --install-option --install-lib=./ pyspark

virtualenv : pip install virtualenv

URL: https://stackoverflow.com/questions/17486578/how-can-you-bundle-all-your-python-code-into-a-single-zip-file

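The link above, which I read earlier, packages with virtualenv. If Anaconda is available it is more convenient: it is a tool for managing Python and lets you pin the Python version.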

Steps:

0.  List the Python environments on the machine

conda env list

1.  Create a clean Python environment. For a given version (2 or 3), conda installs the highest available release by default; python=2.7 installed 2.7.16

conda create -n env_name python=2.7

2.  Activate the environment env_name

conda activate env_name 

3.  List the required dependencies in requirements.txt

pandas
numpy
tomorrow
thrift

4.  Install every dependency listed in requirements.txt into the deps folder

pip install -r requirements.txt -t deps   

5.  Zip everything under the current path except the contents of dist

zip -r pkg.zip ./ -x "dist/*"     
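
The zip step can also be done from Python, which may help when the build machine has no zip binary. This is only a sketch: the function name build_package and its parameters are my own, not part of any tool mentioned above. It walks the project folder, skips the top-level dist folder and never adds the archive to itself.

```python
import os
import zipfile


def build_package(src_dir, zip_path, exclude_dirs=("dist",)):
    """Zip everything under src_dir except the excluded top-level folders."""
    zip_abs = os.path.abspath(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(src_dir):
            if os.path.relpath(root, src_dir) == ".":
                # Prune excluded top-level folders so os.walk never enters them.
                dirs[:] = [d for d in dirs if d not in exclude_dirs]
            for name in files:
                full = os.path.join(root, name)
                if os.path.abspath(full) == zip_abs:
                    continue  # do not add the archive to itself
                # Store paths relative to src_dir, e.g. deps/pandas/__init__.py
                zf.write(full, os.path.relpath(full, src_dir))
    return zip_path


if __name__ == "__main__":
    print(build_package(".", "pkg.zip"))
```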

Questions:

Question0: java.lang.IllegalStateException: User did not initialize spark context!

Solution: switch from YARN cluster mode to YARN client mode (my Python project does not need Spark itself; the Spark job only acts as the trigger)

YARN cluster vs. YARN client: https://blog.csdn.net/kaaosidao/article/details/77948121

# SparkContext
from pyspark import SparkContext, SparkConf
sc = SparkContext()
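
For reference, a sketch of how the client-mode submit command could be assembled from Python. The helper name build_submit_command is made up for illustration; --master, --deploy-mode and --py-files are standard spark-submit options, but whether the cluster needs further options is not covered here.

```python
def build_submit_command(main_py="main.py", py_files="pkg.zip", deploy_mode="client"):
    """Return the spark-submit argument list for a YARN job."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,  # "client" avoids the cluster-mode error above
        "--py-files", py_files,
        main_py,
    ]


if __name__ == "__main__":
    # Print the command; the list could instead be passed to subprocess.call.
    print(" ".join(build_submit_command()))
```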

Question1: cannot find/load modules (standard libs or custom modules), Python 3.7

Solution:

Step1: create two test envs: one with the full deps for running and packing, one with no deps for testing the unpacked package

Step2: pip install -r requirements.txt -t deps

  SubQuestion1.1: cannot pip install sasl

  Solution: pip install git+https://github.com/JoshRosen/python-sasl.git@fix-build-with-newer-xcode (a patched version of sasl)

Step3: pack, upload and run; the custom libs still cannot be found (see python.log), perhaps due to the Python version

Step4: check the Python env on the remote Spark machine; go to Question2

Question2: port the env from Python 3.7 (local machine) to Python 2.7.5 (remote Spark)

Solution:

Step1: print the remote Python version: 2.7.5. The local Python is 3.7, so create a new Python 2.7.16 env with Anaconda

Step2: repeat the solution to Question1; modules/libs still cannot be found, so I suspected the uploaded packages were not being picked up

Tips: print the working directory, its contents and the Python version from inside the job

import os

print('current path: {}'.format(os.getcwd()))
print('Dirs: {}'.format(os.listdir(os.getcwd())))

os.system('echo Current python: `python --version`')     # terminal output: Current python: 'python 2.7.5'
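
The same checks can be gathered in one helper. A minimal sketch; describe_runtime is a name I made up. Note that sys.version reports the interpreter actually running the script, which is not necessarily the python found on the shell PATH by the echo command above.

```python
import os
import sys


def describe_runtime():
    """Collect the working directory, its entries and the interpreter in use."""
    cwd = os.getcwd()
    return {
        "cwd": cwd,
        "entries": sorted(os.listdir(cwd)),
        "python": sys.version.split()[0],
        "executable": sys.executable,
    }


if __name__ == "__main__":
    for key, value in sorted(describe_runtime().items()):
        print("{}: {}".format(key, value))
```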

Question3: the uploaded zip file is not unzipped, so the deps cannot be imported, which explains the exceptions

Solution: unzip the archive from main.py (or read the files directly from the zip)

import os

os.system('unzip -q -o ./pkg.zip')  # -q: quiet, -o: overwrite existing files
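
If the unzip binary may be missing on the remote node, the standard-library zipfile module can do the same job. A sketch only; extract_package and its default arguments are my own naming.

```python
import zipfile


def extract_package(zip_path="pkg.zip", target_dir="."):
    """Extract zip_path into target_dir, overwriting files that already exist."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir)
        return zf.namelist()


if __name__ == "__main__":
    import os
    if os.path.exists("pkg.zip"):
        print(extract_package())
```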

Question4: the remote Python prefers to load its own system modules when they exist, instead of the uploaded deps

Solution: put the uploaded deps ahead of the system libs with sys.path.insert

import sys
sys.path.insert(0, 'deps')

Tips: add the current path, deps/ and python/ to sys.path; then .py files under deps/ can be imported

import os
import sys

sys.path.insert(0, os.path.join(os.getcwd()))
sys.path.insert(0, os.path.join(os.getcwd(), 'deps'))
sys.path.insert(0, os.path.join(os.getcwd(), 'python'))
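
A small helper version of the same idea, written as a sketch: prepend_to_sys_path is a hypothetical name, and the folder names are the ones used in these notes. Folders are inserted in the order given, so the last one ends up first on sys.path, and a path that is already present is moved rather than duplicated.

```python
import os
import sys


def prepend_to_sys_path(subdirs=("", "deps", "python"), base=None):
    """Insert base/<subdir> at the front of sys.path, one after another."""
    base = os.getcwd() if base is None else base
    added = []
    for sub in subdirs:
        path = os.path.abspath(os.path.join(base, sub))
        if path in sys.path:
            sys.path.remove(path)  # avoid duplicate entries
        sys.path.insert(0, path)
        added.append(path)
    return added


if __name__ == "__main__":
    prepend_to_sys_path()
    print(sys.path[:3])
```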

Question5: failed to load/restore the pyspark environment on the remote machine

Solution: delete the Spark-related environment variables that are already set, then reload

```python
import os

# Remove the Spark-related variables if they are already set in the environment.
if 'PYSPARK_GATEWAY_PORT' in os.environ:
    del os.environ['PYSPARK_GATEWAY_PORT']
if 'PYSPARK_GATEWAY_SECRET' in os.environ:
    del os.environ['PYSPARK_GATEWAY_SECRET']
if 'SPARK_HOME' in os.environ:
    del os.environ['SPARK_HOME']
```
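
The three checks can be folded into one loop. A sketch under my own naming (clear_env, PYSPARK_KEYS); it returns the removed values so they can be logged or restored later.

```python
import os

PYSPARK_KEYS = ("PYSPARK_GATEWAY_PORT", "PYSPARK_GATEWAY_SECRET", "SPARK_HOME")


def clear_env(keys=PYSPARK_KEYS, environ=None):
    """Remove the given keys from environ and return what was removed."""
    environ = os.environ if environ is None else environ
    removed = {}
    for key in keys:
        if key in environ:
            removed[key] = environ.pop(key)
    return removed
```

Calling clear_env() with no arguments works on os.environ; passing a plain dict makes it easy to try out without touching the real environment.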

Question6: Python 2.7.5 cannot import pandas because the numpy, pytz and dateutil versions do not match (perhaps they are too new)

Solution: remove the -info metadata and recompile on Linux; this needs a server with root access

Question7: cannot find module sasl on Linux

Solution: recompile on Linux as root

pip install sasl

Question8: Hive query failed: 'ascii' codec can't encode character u'\xb0' in position 63: ordinal not in range(128)

(Python 2.7 behaves differently from Python 3 here)

import os
import sys

os.environ['LC_ALL'] = 'en_US.utf8'
reload(sys)  # Python 2 only: restores sys.setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding('utf8')

https://blog.csdn.net/crazyhacking/article/details/39375535
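
Python 3 has no sys.setdefaultencoding and no builtin reload, so the workaround above does not carry over. An alternative is to encode explicitly wherever text leaves the program. This is a sketch, not what I ran on the cluster; to_utf8_bytes is a hypothetical helper and it is written to run on both Python 2 and 3.

```python
import sys

# str on Python 3, unicode on Python 2 (the name is only evaluated on Python 2).
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def to_utf8_bytes(value):
    """Return value as UTF-8 bytes; byte strings pass through unchanged."""
    if isinstance(value, text_type):
        return value.encode("utf-8")
    return value


if __name__ == "__main__":
    # u'\xb0' is the degree sign that triggered the error above.
    print(repr(to_utf8_bytes(u"25\xb0C")))
```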
