TensorFlowOnSpark Installation

1. Spark Cluster Environment

For Spark installation and configuration, see "Spark Installation". This environment uses 6 workstations, planned as follows:

No.  Hostname   IP               Role
1    bdml-c01   192.168.200.170  client
2    bdml-m01   192.168.200.171  namenode / resourcemanager / master
3    bdml-s01   192.168.200.172  datanode / nodemanager / worker
4    bdml-s02   192.168.200.173  datanode / nodemanager / worker
5    bdml-s03   192.168.200.174  datanode / nodemanager / worker
6    bdml-s04   192.168.200.175  datanode / nodemanager / worker
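For reference, a cluster like this typically needs every node to resolve the others by hostname. A sketch of the matching /etc/hosts entries (assuming the worker hostnames run bdml-s01 through bdml-s04) might look like:

```
192.168.200.170  bdml-c01
192.168.200.171  bdml-m01
192.168.200.172  bdml-s01
192.168.200.173  bdml-s02
192.168.200.174  bdml-s03
192.168.200.175  bdml-s04
```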

The TensorFlowOnSpark installation follows "Getting Started TensorFlowOnSpark on Hadoop Cluster". That article is somewhat misleading: it led me to set up a dedicated virtual machine just to compile TensorFlow, when in fact, unless you need the RDMA feature, no compilation is necessary at all. Setting up Google's Bazel build environment to compile TensorFlow cost me quite a bit of time.

2. Software Versions

Red Hat 7.2 / CentOS 7.2

hadoop 2.6.0

spark 1.6.0

scala 2.10.6

python 2.7.12 

tensorflow 1.0.1

TensorFlowOnSpark master

3. Installation

The biggest hazard during installation is version conflicts. After several attempts the versions finally matched; reinstalling from scratch and writing down the steps made the whole thing easy.

1) Software list:

Python-2.7.12.tgz

setuptools-23.0.0.tar.gz

pip-8.1.2.tar.gz

pbr-0.11.0-py2.py3-none-any.whl

funcsigs-1.0.2-py2.py3-none-any.whl

six-1.10.0-py2.py3-none-any.whl

mock-2.0.0-py2.py3-none-any.whl

protobuf-3.1.0.post1-py2.py3-none-any.whl

pydoop-1.2.0.tar.gz

numpy-1.11.1-cp27-cp27mu-manylinux1_x86_64.whl

scipy-0.17.1-cp27-cp27mu-manylinux1_x86_64.whl

wheel-0.29.0-py2.py3-none-any.whl

tensorflow-1.0.0-cp27-none-linux_x86_64.whl

2) Build and install Python 2.7.12; first prepare the environment

yum install zlib-devel -y
yum install bzip2-devel -y
yum install openssl-devel -y 
yum install ncurses-devel -y 
yum install sqlite-devel -y
yum install readline-devel -y 

Extract Python-2.7.12.tgz, then run configure, make, and make install:

./configure --prefix="/home/hadoop/Python" --enable-unicode=ucs4
make && make install
3) Install the required packages into Python

Install setuptools-23.0.0.tar.gz and pip-8.1.2.tar.gz first, in that order; the remaining packages can then be installed with pip install. If you install a wrong version, remove it with pip uninstall, and use pip list to check what is installed. pydoop, however, must be built from source, and it does not currently support Python 3. Building it requires the JVM environment to be set up, otherwise jni.h will not be found.

Install pydoop:

ln -s /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.91-2.6.2.3.el7.x86_64/include/jni.h ./jni.h
ln -s /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.91-2.6.2.3.el7.x86_64/include/linux/jni_md.h ./jni_md.h
tar -xvf pydoop-1.2.0.tar.gz
cd pydoop-1.2.0
python setup.py build
python setup.py install

The result after completion (screenshot omitted).
Create Python.zip and upload it to HDFS:

cd ~/Python
zip -r Python.zip *
mv Python.zip ../
hdfs dfs -put Python.zip /user/hadoop/
hdfs dfs -ls /user/hadoop/

4) Build tensorflow-hadoop-1.0-SNAPSHOT.jar

Download the source from https://github.com/tensorflow/ecosystem, then build and package it. The example files fail to compile and need to be skipped:

mvn package -Dmaven.test.skip=true
Upload tensorflow-hadoop-1.0-SNAPSHOT.jar to HDFS:

hdfs dfs -put tensorflow-hadoop-1.0-SNAPSHOT.jar
hdfs dfs -ls /user/hadoop
5) Install TensorFlowOnSpark

Download it from https://github.com/yahoo/TensorFlowOnSpark and extract it into /home/hadoop (the hadoop user's home directory); that is all it takes.

Create tfspark.zip:

cd TensorFlowOnSpark/src
zip -r ../tfspark.zip *
At this point the installation is basically complete.
4. Running the MNIST Test Program

1) Prepare the data

Download the MNIST dataset and copy it into the MLdata/mnist directory under /home/hadoop:

t10k-images-idx3-ubyte.gz

t10k-labels-idx1-ubyte.gz

train-images-idx3-ubyte.gz

train-labels-idx1-ubyte.gz

Create the zip file:

cd MLdata/mnist
zip -r mnist.zip *
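For background, the *-ubyte.gz files above use the IDX binary format. A minimal sketch of parsing an IDX header in Python (the function name and synthetic header below are illustrative, not part of the MNIST tooling):

```python
import struct

def parse_idx_header(data):
    """Parse the header of an IDX file (the MNIST on-disk format).

    The first 4 bytes are a magic number: two zero bytes, a type code
    (0x08 = unsigned byte), and the number of dimensions. Each dimension
    size then follows as a big-endian uint32.
    """
    zero1, zero2, type_code, ndim = struct.unpack_from(">BBBB", data, 0)
    assert zero1 == 0 and zero2 == 0
    dims = struct.unpack_from(">" + "I" * ndim, data, 4)
    return type_code, dims

# Synthetic header mimicking train-images-idx3-ubyte:
# magic 0x00000803, then 60000 images of 28x28.
header = struct.pack(">IIII", 0x00000803, 60000, 28, 28)
print(parse_idx_header(header))  # (8, (60000, 28, 28))
```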
2) Run in feed_dict mode, with the following steps

# step 1 set environment variables
export PYTHON_ROOT=~/Python
export LD_LIBRARY_PATH=${PATH}
export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=Python/bin/python"
export PATH=${PYTHON_ROOT}/bin/:$PATH
export QUEUE=default

# step 2 upload files to HDFS
hdfs dfs -rm /user/${USER}/mnist/mnist.zip
hdfs dfs -put ${HOME}/MLdata/mnist/mnist.zip /user/${USER}/mnist/mnist.zip

# step 3 convert the images and labels to CSV files
hdfs dfs -rm -r /user/${USER}/mnist/csv
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 4 \
--executor-memory 4G \
--archives hdfs:///user/${USER}/Python.zip#Python,hdfs:///user/${USER}/mnist/mnist.zip#mnist \
TensorFlowOnSpark/examples/mnist/mnist_data_setup.py \
--output mnist/csv \
--format csv
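As a rough illustration of what the CSV conversion produces: each image is flattened into one line of comma-separated pixel values, and each label is written as a one-hot record. The helper names below are hypothetical; the actual logic lives in mnist_data_setup.py:

```python
def image_to_csv_line(image_rows):
    """Flatten a 2-D image (a list of pixel rows) into one CSV line."""
    return ",".join(str(p) for row in image_rows for p in row)

def label_to_csv_line(label, num_classes=10):
    """One-hot encode a digit label as a CSV line of floats."""
    return ",".join("1.0" if i == label else "0.0" for i in range(num_classes))

tiny_image = [[0, 255], [128, 64]]  # a 2x2 stand-in for a 28x28 image
print(image_to_csv_line(tiny_image))  # 0,255,128,64
print(label_to_csv_line(3))           # 0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
```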


# step 4 train
hadoop fs -rm -r mnist_model
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 3 \
--executor-memory 8G \
--py-files ${HOME}/TensorFlowOnSpark/tfspark.zip,${HOME}/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.executor.memoryOverhead=6144 \
--archives hdfs:///user/${USER}/Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
${HOME}/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
--images mnist/csv/train/images \
--labels mnist/csv/train/labels \
--mode train \
--model mnist_model


# step 5 inference
hadoop fs -rm -r predictions
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 3 \
--executor-memory 8G \
--py-files ${HOME}/TensorFlowOnSpark/tfspark.zip,${HOME}/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.executor.memoryOverhead=6144 \
--archives hdfs:///user/${USER}/Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
${HOME}/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
--images mnist/csv/test/images \
--labels mnist/csv/test/labels \
--mode inference \
--model mnist_model \
--output predictions


# step 6 view the results (there may be multiple files)
hdfs dfs -ls predictions
hdfs dfs -cat predictions/part-00001
hdfs dfs -cat predictions/part-00002
hdfs dfs -cat predictions/part-00003

# view the Spark job status in the web UI
http://bdml-m01:8088/cluster/apps/
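The prediction files contain one text line per test image. A hedged sketch of pulling the label and prediction out of such a line (the exact line format may differ between TensorFlowOnSpark versions, so the sample string and regex here are assumptions):

```python
import re

# Hypothetical sample line from a predictions/part-* file.
sample = "2017-04-20T10:51:12 Label: 7, Prediction: 7"

# Extract the two digit fields; the pattern assumes "Label: N, Prediction: M".
m = re.search(r"Label:\s*(\d+),\s*Prediction:\s*(\d+)", sample)
label, pred = int(m.group(1)), int(m.group(2))
print(label == pred)  # True for a correct prediction
```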

3) Run in QueueRunner mode, with the following steps

# step 1 set environment variables
export PYTHON_ROOT=~/Python
export LD_LIBRARY_PATH=${PATH}
export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=Python/bin/python"
export PATH=${PYTHON_ROOT}/bin/:$PATH
export QUEUE=default

# step 2 upload files to HDFS
hdfs dfs -rm /user/${USER}/mnist/mnist.zip
hdfs dfs -rm -r /user/${USER}/mnist/tfr
hdfs dfs -put ${HOME}/MLdata/mnist/mnist.zip /user/${USER}/mnist/mnist.zip

# step 3 convert the images and labels to TFRecords
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 4 \
--executor-memory 4G \
--archives hdfs:///user/${USER}/Python.zip#Python,hdfs:///user/${USER}/mnist/mnist.zip#mnist \
--jars hdfs:///user/${USER}/tensorflow-hadoop-1.0-SNAPSHOT.jar \
${HOME}/TensorFlowOnSpark/examples/mnist/mnist_data_setup.py \
--output mnist/tfr \
--format tfr
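For background on the tfr format: a TFRecord file frames each serialized Example with a little-endian length header and CRC32C checksums. A simplified Python sketch of that framing (the CRC fields are zeroed here because crc32c is not in the standard library, so a real TFRecord reader would reject this output; it only illustrates the layout):

```python
import struct

def frame_record(payload):
    """Sketch of TFRecord on-disk framing (CRC fields zeroed for brevity).

    A real TFRecord stores: uint64 length, uint32 masked CRC32C of the
    length bytes, the payload itself, then uint32 masked CRC32C of the
    payload.
    """
    length = struct.pack("<Q", len(payload))
    fake_crc = struct.pack("<I", 0)  # placeholder; real files use masked crc32c
    return length + fake_crc + payload + fake_crc

rec = frame_record(b"example-bytes")
print(struct.unpack("<Q", rec[:8])[0])  # 13, the payload length
```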

# step 4 train
hadoop fs -rm -r mnist_model
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 4 \
--executor-memory 4G \
--py-files ${HOME}/TensorFlowOnSpark/tfspark.zip,${HOME}/TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.executor.memoryOverhead=4096 \
--archives hdfs:///user/${USER}/Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
${HOME}/TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py \
--images mnist/tfr/train \
--format tfr \
--mode train \
--model mnist_model

# step 5 inference
hadoop fs -rm -r predictions
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 4 \
--executor-memory 4G \
--py-files ${HOME}/TensorFlowOnSpark/tfspark.zip,${HOME}/TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.executor.memoryOverhead=4096 \
--archives hdfs:///user/${USER}/Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
${HOME}/TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py \
--images mnist/tfr/test \
--mode inference \
--model mnist_model \
--output predictions

# step 6 view the results (there may be multiple files)
hdfs dfs -ls predictions
hdfs dfs -cat predictions/part-00001
hdfs dfs -cat predictions/part-00002
hdfs dfs -cat predictions/part-00003

# view the Spark job status in the web UI
http://bdml-m01:8088/cluster/apps/









