1. Spark cluster environment
Refer to 《Spark 安裝》 for installing and configuring Spark. This environment uses six workstations, laid out as follows:
| No. | Hostname | IP | Role |
| --- | -------- | -- | ---- |
| 1 | bdml-c01 | 192.168.200.170 | client |
| 2 | bdml-m01 | 192.168.200.171 | namenode, resourcemanager, master |
| 3 | bdml-s01 | 192.168.200.172 | datanode, nodemanager, worker |
| 4 | bdml-s02 | 192.168.200.173 | datanode, nodemanager, worker |
| 5 | bdml-s03 | 192.168.200.174 | datanode, nodemanager, worker |
| 6 | bdml-s04 | 192.168.200.175 | datanode, nodemanager, worker |
The TensorFlowOnSpark installation followed 《Getting Started TensorFlowOnSpark on Hadoop Cluster》. That article is somewhat misleading: I set up a dedicated virtual machine just to compile TensorFlow, when in fact no compilation is needed at all unless you want the RDMA feature. Configuring Google's Bazel build environment in order to compile TensorFlow cost me quite a bit of time.
2. Software versions
Red Hat 7.2 / CentOS 7.2
hadoop 2.6.0
spark 1.6.0
scala 2.10.6
python 2.7.12
tensorflow 1.0.1
TensorFlowOnSpark master
3. Installation
The biggest worry during an installation like this is version conflicts. It took several attempts before the versions lined up; after that, a clean reinstall with the steps written down made the whole thing straightforward.
1) Software list:
Python-2.7.12.tgz
setuptools-23.0.0.tar.gz
pip-8.1.2.tar.gz
pbr-0.11.0-py2.py3-none-any.whl
funcsigs-1.0.2-py2.py3-none-any.whl
six-1.10.0-py2.py3-none-any.whl
mock-2.0.0-py2.py3-none-any.whl
protobuf-3.1.0.post1-py2.py3-none-any.whl
pydoop-1.2.0.tar.gz
numpy-1.11.1-cp27-cp27mu-manylinux1_x86_64.whl
scipy-0.17.1-cp27-cp27mu-manylinux1_x86_64.whl
wheel-0.29.0-py2.py3-none-any.whl
tensorflow-1.0.0-cp27-none-linux_x86_64.whl
2) Build and install Python 2.7.12; first install the build dependencies:
yum install -y zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel
Unpack Python-2.7.12.tgz, then run configure, make, and make install:
./configure --prefix="/home/hadoop/Python" --enable-unicode=ucs4
make && make install
3) Install the required packages into Python
Install setuptools-23.0.0.tar.gz first, then pip-8.1.2.tar.gz; the remaining packages can be installed with pip install. If a wrong version gets installed, remove it with pip uninstall; use pip list to see what is installed. pydoop, however, must be built from source, and it currently does not support Python 3. Building it requires the JVM headers to be reachable, otherwise the build fails with jni.h not found.
Install pydoop:
ln -s /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.91-2.6.2.3.el7.x86_64/include/jni.h ./jni.h
ln -s /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.91-2.6.2.3.el7.x86_64/include/linux/jni_md.h ./jni_md.h
tar -xvf pydoop-1.2.0.tar.gz
cd pydoop-1.2.0
python setup.py build
python setup.py install
Create Python.zip and upload it to HDFS:
cd ~/Python
zip -r Python.zip *
mv Python.zip ../
hdfs dfs -put Python.zip /user/hadoop/
hdfs dfs -ls /user/hadoop/
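Since the executors later unpack this archive under the alias `Python` and look for `Python/bin/python`, a quick sanity check of the zip before uploading can save a failed job. A minimal sketch using Python's `zipfile` module (the helper name and the stand-in archive are my own, not part of TensorFlowOnSpark):

```python
import zipfile

def check_python_zip(path):
    """Return True if the archive contains the python interpreter.

    Spark's --archives option unpacks the zip under the given alias,
    so 'bin/python' must exist relative to the zip root.
    """
    with zipfile.ZipFile(path) as zf:
        return "bin/python" in zf.namelist()

# Example: build a tiny stand-in archive and verify it.
if __name__ == "__main__":
    import os, tempfile
    demo = os.path.join(tempfile.mkdtemp(), "Python.zip")
    with zipfile.ZipFile(demo, "w") as zf:
        zf.writestr("bin/python", b"")            # placeholder for the interpreter
        zf.writestr("lib/python2.7/os.py", b"")   # placeholder for the stdlib
    print(check_python_zip(demo))                  # -> True
```

A common failure mode this catches is a zip made without `-r`, which stores directory entries but none of their contents.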
4) Build tensorflow-hadoop-1.0-SNAPSHOT.jar
Download the source from https://github.com/tensorflow/ecosystem and build the package. The example code does not compile, so skip the tests:
mvn package -Dmaven.test.skip=true
Upload tensorflow-hadoop-1.0-SNAPSHOT.jar to HDFS:
hdfs dfs -put tensorflow-hadoop-1.0-SNAPSHOT.jar
hdfs dfs -ls /user/hadoop
5) Install TensorFlowOnSpark
Download it from https://github.com/yahoo/TensorFlowOnSpark and unpack it into /home/hadoop (the hadoop user's home directory); no compilation is needed.
Create tfspark.zip:
cd TensorFlowOnSpark/src
zip -r ../tfspark.zip *
At this point the installation is basically complete.
4. Running the MNIST test program
1) Prepare the data
Download the MNIST dataset and copy it into the MLdata/mnist directory under /home/hadoop:
t10k-images-idx3-ubyte.gz
t10k-labels-idx1-ubyte.gz
train-images-idx3-ubyte.gz
train-labels-idx1-ubyte.gz
Create the zip file:
cd MLdata/mnist
zip -r mnist.zip *
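The idx files above share a simple big-endian binary layout: a 4-byte magic number (2051 for image files, 2049 for label files) followed by one 4-byte size per dimension. A minimal header-parsing sketch (standard MNIST idx format; the function name is my own):

```python
import struct

def parse_idx_header(data):
    """Parse the header of an MNIST idx file.

    Layout (big-endian): a uint32 magic number, then one uint32 per
    dimension. The low byte of the magic encodes the dimension count:
    2051 (0x803) = images, 3 dims; 2049 (0x801) = labels, 1 dim.
    """
    magic = struct.unpack(">I", data[:4])[0]
    ndims = magic & 0xFF
    dims = struct.unpack(">" + "I" * ndims, data[4:4 + 4 * ndims])
    return magic, dims

# Header of a hypothetical train-images file: 60000 images of 28x28.
header = struct.pack(">IIII", 2051, 60000, 28, 28)
print(parse_idx_header(header))  # -> (2051, (60000, 28, 28))
```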
2) Run with feed_dict; the steps are as follows
# step 1: set environment variables
export PYTHON_ROOT=~/Python
export LD_LIBRARY_PATH=${PATH}
export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=Python/bin/python"
export PATH=${PYTHON_ROOT}/bin/:$PATH
export QUEUE=default
# step 2: upload the files to HDFS
hdfs dfs -rm /user/${USER}/mnist/mnist.zip
hdfs dfs -put ${HOME}/MLdata/mnist/mnist.zip /user/${USER}/mnist/mnist.zip
# step 3: convert the images and labels to CSV files
hdfs dfs -rm -r /user/${USER}/mnist/csv
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 4 \
--executor-memory 4G \
--archives hdfs:///user/${USER}/Python.zip#Python,hdfs:///user/${USER}/mnist/mnist.zip#mnist \
TensorFlowOnSpark/examples/mnist/mnist_data_setup.py \
--output mnist/csv \
--format csv
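Conceptually, the conversion above flattens each 28x28 image into one comma-separated row of pixel values, with the labels written to a parallel output directory. A simplified single-machine sketch of that idea (not the actual mnist_data_setup.py code; the function name is my own):

```python
def image_to_csv_row(image):
    """Flatten a 2-D image (a list of rows of pixel ints) into one CSV line."""
    return ",".join(str(px) for row in image for px in row)

# A hypothetical 2x2 "image":
print(image_to_csv_row([[0, 128], [255, 7]]))  # -> 0,128,255,7
```

In the real job, Spark writes one such line per image across many part files, which is why the training step below reads a directory (mnist/csv/train/images) rather than a single file.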
# step 4: train
hadoop fs -rm -r mnist_model
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 3 \
--executor-memory 8G \
--py-files ${HOME}/TensorFlowOnSpark/tfspark.zip,${HOME}/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.executor.memoryOverhead=6144 \
--archives hdfs:///user/${USER}/Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
${HOME}/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
--images mnist/csv/train/images \
--labels mnist/csv/train/labels \
--mode train \
--model mnist_model
# step 5: inference
hadoop fs -rm -r predictions
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 3 \
--executor-memory 8G \
--py-files ${HOME}/TensorFlowOnSpark/tfspark.zip,${HOME}/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.executor.memoryOverhead=6144 \
--archives hdfs:///user/${USER}/Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
${HOME}/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
--images mnist/csv/test/images \
--labels mnist/csv/test/labels \
--mode inference \
--model mnist_model \
--output predictions
# step 6: view the results (there may be multiple files)
hdfs dfs -ls predictions
hdfs dfs -cat predictions/part-00001
hdfs dfs -cat predictions/part-00002
hdfs dfs -cat predictions/part-00003
# check the Spark job status in the web UI
http://bdml-m01:8088/cluster/apps/
3) Run with QueueRunner; the steps are as follows
# step 1: set environment variables
export PYTHON_ROOT=~/Python
export LD_LIBRARY_PATH=${PATH}
export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=Python/bin/python"
export PATH=${PYTHON_ROOT}/bin/:$PATH
export QUEUE=default
# step 2: upload the files to HDFS
hdfs dfs -rm /user/${USER}/mnist/mnist.zip
hdfs dfs -rm -r /user/${USER}/mnist/tfr
hdfs dfs -put ${HOME}/MLdata/mnist/mnist.zip /user/${USER}/mnist/mnist.zip
# step 3: convert the images and labels to TFRecords
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 4 \
--executor-memory 4G \
--archives hdfs:///user/${USER}/Python.zip#Python,hdfs:///user/${USER}/mnist/mnist.zip#mnist \
--jars hdfs:///user/${USER}/tensorflow-hadoop-1.0-SNAPSHOT.jar \
${HOME}/TensorFlowOnSpark/examples/mnist/mnist_data_setup.py \
--output mnist/tfr \
--format tfr
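The tensorflow-hadoop jar and the conversion above produce TFRecord files, whose on-disk framing is: an 8-byte little-endian record length, a 4-byte CRC of the length, the serialized record bytes, and a 4-byte CRC of the data. A minimal reader sketch that walks this framing while skipping CRC verification (an assumption made for brevity; real readers verify the masked CRC32C values):

```python
import struct

def read_tfrecords(data):
    """Return the payload bytes of each record in a TFRecord byte stream.

    Framing per record: uint64 length (little-endian), 4-byte length
    CRC, `length` payload bytes, 4-byte payload CRC. CRC checks are
    skipped here for brevity.
    """
    pos, records = 0, []
    while pos < len(data):
        (length,) = struct.unpack("<Q", data[pos:pos + 8])
        pos += 8 + 4                       # skip length field and its CRC
        records.append(data[pos:pos + length])
        pos += length + 4                  # skip payload and its CRC
    return records

# Build a fake two-record stream with dummy CRCs and read it back.
def frame(payload):
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4

stream = frame(b"hello") + frame(b"world")
print(read_tfrecords(stream))  # -> [b'hello', b'world']
```

In the actual files each payload is a serialized tf.train.Example protobuf holding the image and label.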
# step 4: train
hadoop fs -rm -r mnist_model
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 4 \
--executor-memory 4G \
--py-files ${HOME}/TensorFlowOnSpark/tfspark.zip,${HOME}/TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.executor.memoryOverhead=4096 \
--archives hdfs:///user/${USER}/Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
${HOME}/TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py \
--images mnist/tfr/train \
--format tfr \
--mode train \
--model mnist_model
# step 5: inference
hadoop fs -rm -r predictions
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 4 \
--executor-memory 4G \
--py-files ${HOME}/TensorFlowOnSpark/tfspark.zip,${HOME}/TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.executor.memoryOverhead=4096 \
--archives hdfs:///user/${USER}/Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
${HOME}/TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py \
--images mnist/tfr/test \
--mode inference \
--model mnist_model \
--output predictions
# step 6: view the results (there may be multiple files)
hdfs dfs -ls predictions
hdfs dfs -cat predictions/part-00001
hdfs dfs -cat predictions/part-00002
hdfs dfs -cat predictions/part-00003
# check the Spark job status in the web UI
http://bdml-m01:8088/cluster/apps/