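1. Requirements
Install a Hadoop and Spark cluster on CentOS 6, and run TensorFlow code through TensorFlowOnSpark in YARN mode. (Ideally, the cluster can be configured and run without an Internet connection.)
2. System environment (topology)
Operating system: CentOS 6.5 Final; Hadoop: 2.7.4; Spark: 1.5.1-Hadoop2.6; TensorFlow: 1.3.0; TensorFlowOnSpark: latest download from GitHub; Python: 2.7.12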
s0.centos.com: memory 1.5 GB; 1 core; namenode/resourcemanager
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>2</value>
</property>
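As a rough sanity check on the YARN limits above, the following sketch parses the same three properties and estimates whether a 1024 MB Spark executor (the value passed to spark-submit later in this post) fits into one container. The wrapping `<configuration>` element and the overhead rule (10% of executor memory with a 384 MB floor) are assumptions made for illustration; verify them against the Hadoop and Spark versions actually in use.

```python
import xml.etree.ElementTree as ET

# The three properties from above, wrapped in a root element so they parse
# as a single document (the wrapper is an assumption about the full file).
YARN_FRAGMENT = """
<configuration>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>2</value>
  </property>
</configuration>
"""


def parse_yarn_properties(xml_text):
    """Return {property name: integer value} for every <property> element."""
    root = ET.fromstring(xml_text.strip())
    return {
        prop.findtext("name"): int(prop.findtext("value"))
        for prop in root.iter("property")
    }


def executor_container_mb(executor_memory_mb, overhead_fraction=0.10,
                          min_overhead_mb=384):
    """Estimate the container size requested for one executor.

    The overhead rule is an assumed default, not a value from this post.
    """
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_fraction))
    return executor_memory_mb + overhead


props = parse_yarn_properties(YARN_FRAGMENT)
needed_mb = executor_container_mb(1024)
fits = needed_mb <= props["yarn.scheduler.maximum-allocation-mb"]
print("container request: %d MB, fits: %s" % (needed_mb, fits))
```

If the estimate exceeded the maximum allocation, YARN would reject the container request, so lowering --executor-memory or raising the two memory limits would be the first things to try.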
3. References
Building TensorFlow on CentOS 6: https://blog.abysm.org/2016/06/building-tensorflow-centos-6/
TensorFlowOnSpark GitHub wiki, installing TensorFlowOnSpark: https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_YARN
TensorFlowOnSpark GitHub wiki, converting TensorFlow code: https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion-Guide
4. Steps
1. Install devtoolset-6 and Python:
Install the SCL repository: yum install -y centos-release-scl
Install devtoolset: yum install -y devtoolset-6
Install Python:
yum install python27 python27-numpy python27-python-devel python27-python-wheel
Install some commonly used packages: yum install -y vim zip unzip openssh-clients
2. Download Bazel. Version 0.5.1 is used here (a 0.4.x version was also downloaded; the packages are hard to download).
First run:
export CC=/opt/rh/devtoolset-6/root/usr/bin/gcc
Then enter the build environment:
scl enable devtoolset-6 python27 bash
Then run the following commands in order:
unzip bazel-0.5.1-dist.zip -d bazel-0.5.1-dist
cd bazel-0.5.1-dist
# compile
./compile.sh
# install
mkdir -p ~/bin
cp output/bazel ~/bin/
exit  # leave the scl environment
# This build takes a long time.
3. Download the TensorFlow 1.3.0 source code and unpack it.
4. Enter tensorflow-1.3.0 and change the tf_extension_linkopts function in tensorflow/tensorflow.bzl to the following form (adding -lrt):
def tf_extension_linkopts():
  return ["-lrt"]  # link against librt
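For repeated builds, the same edit could be scripted rather than made by hand. The sketch below works on the file's text and assumes the unmodified function returns an empty list (`return []`); that original form, the helper name `add_lrt_linkopt`, and the sample string are assumptions for illustration, so inspect the real tensorflow.bzl before relying on it.

```python
import re

# Matches the function header and the start of its return statement,
# followed by an empty list literal.
_LINKOPTS_PATTERN = re.compile(
    r"(def tf_extension_linkopts\(\):\s*\n\s*return\s*)\[\s*\]"
)


def add_lrt_linkopt(bzl_text):
    """Return bzl_text with tf_extension_linkopts() returning ["-lrt"].

    Raises ValueError when the expected empty-list return is not found,
    so an unexpected file layout is reported instead of silently skipped.
    """
    patched, count = _LINKOPTS_PATTERN.subn(r'\1["-lrt"]', bzl_text, count=1)
    if count != 1:
        raise ValueError("tf_extension_linkopts() with an empty list not found")
    return patched


# Hypothetical excerpt standing in for the real file contents.
sample = (
    "def tf_extension_linkopts():\n"
    "  return []  # No extension link opts\n"
)
print(add_lrt_linkopt(sample))
```

To apply it to the real file, read tensorflow/tensorflow.bzl, pass its text through `add_lrt_linkopt`, and write the result back.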
5. Build and install TensorFlow:
Install the basic software: yum install -y patch
Then enter the build environment:
scl enable devtoolset-6 python27 bash
cd tensorflow-1.3.0
./configure
# build
~/bin/bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
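exit  # leave the build environment
# This build also takes a long time. Building TensorFlow 1.3 with Bazel 0.4.x fails with a message that the Bazel version is too low.
After the build, a TensorFlow installation package is generated in /tmp/tensorflow_pkg. It is built for the current system, that is, CentOS.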
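6. Install a custom Python package (stay connected to the network for this step):
The goal is to use TensorFlow and TensorFlowOnSpark without network access. Following the TensorFlowOnSpark GitHub wiki, build a standalone Python and install TensorFlow, TensorFlowOnSpark, and other commonly used modules into it. This package can then be uploaded to HDFS so that every worker node shares the environment from the same Python.zip.
export PYTHON_ROOT=~/Python  # set the environment variable, then download Python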
curl -O https://www.python.org/ftp/python/2.7.12/Python-2.7.12.tgz
tar -xvf Python-2.7.12.tgz
Build and install Python:
pushd Python-2.7.12
./configure --prefix="${PYTHON_ROOT}" --enable-unicode=ucs4
make
make install
popd
Install pip:
pushd "${PYTHON_ROOT}"
curl -O https://bootstrap.pypa.io/get-pip.py
bin/python get-pip.py
popd
Install TensorFlow:
pushd "${PYTHON_ROOT}"
bin/pip install /tmp/tensorflow_pkg/tensorflow-1.3.0-cp27-none-linux_x86_64.whl
popd
Installing TensorFlow automatically installs commonly used Python packages such as numpy.
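The wheel file name itself records which interpreter and platform the package was built for, which is why a wheel built on this CentOS machine matches the self-built Python 2.7. A minimal sketch that splits the name into its fields follows; it handles only the five-field form without the optional build tag, and the helper name is hypothetical.

```python
def parse_wheel_filename(filename):
    """Split a wheel file name into its dash-separated fields.

    Only the five-field form (name-version-python-abi-platform) is handled.
    """
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError("not a wheel file: %s" % filename)
    parts = filename[:-len(suffix)].split("-")
    if len(parts) != 5:
        raise ValueError("unexpected number of fields in: %s" % filename)
    name, version, python_tag, abi_tag, platform_tag = parts
    return {
        "name": name,
        "version": version,
        "python_tag": python_tag,      # e.g. cp27 means CPython 2.7
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,  # e.g. linux_x86_64
    }


info = parse_wheel_filename("tensorflow-1.3.0-cp27-none-linux_x86_64.whl")
for key in ("name", "version", "python_tag", "abi_tag", "platform_tag"):
    print("%s: %s" % (key, info[key]))
```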
Install TensorFlowOnSpark:
pushd "${PYTHON_ROOT}"
bin/pip install tensorflowonspark
popd
Package the fully equipped Python and upload it to HDFS:
pushd "${PYTHON_ROOT}"
zip -r Python.zip *
popd
hadoop fs -put ${PYTHON_ROOT}/Python.zip
TensorFlow is now ready to use.
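The archive layout matters: the spark-submit command later in this post unpacks Python.zip under the alias Python and points PYSPARK_PYTHON at Python/bin/python, so bin/python has to sit at the top level of the archive. The sketch below zips a directory's contents the way `zip -r Python.zip *` does when run inside it, then lists the members. The demo directory and placeholder file are made up for illustration and do not stand for a real Python installation.

```python
import os
import tempfile
import zipfile


def zip_directory_contents(src_dir, zip_path):
    """Zip everything under src_dir with paths relative to src_dir."""
    zip_abs = os.path.abspath(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _dirs, files in os.walk(src_dir):
            for file_name in files:
                full_path = os.path.join(folder, file_name)
                if os.path.abspath(full_path) == zip_abs:
                    continue  # do not add the archive to itself
                archive.write(full_path, os.path.relpath(full_path, src_dir))


def archive_members(zip_path):
    """Return the sorted member names stored in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())


# Demo with a throwaway directory that mimics ${PYTHON_ROOT}/bin/python.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "bin"))
with open(os.path.join(demo_root, "bin", "python"), "w") as handle:
    handle.write("placeholder")

demo_zip = os.path.join(tempfile.mkdtemp(), "Python.zip")
zip_directory_contents(demo_root, demo_zip)
members = archive_members(demo_zip)
print(members)
```

Zipping the parent directory instead would nest everything one level deeper, and the interpreter path used at submit time would no longer resolve.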
7. Modify the TensorFlow code. For example, the following TensorFlow code runs in a plain TensorFlow environment:
# from __future__ import absolute_import
# from __future__ import division
# from __future__ import print_function
import numpy as np
import tensorflow as tf

X_FEATURE = 'x'  # Name of the input feature.
train_percent = 0.8


def load_data(data_file_name):
    # Each row holds four feature values followed by the class label.
    data = np.loadtxt(open(data_file_name), delimiter=",", skiprows=0)
    return data


def data_selection(iris, train_per):
    # Shuffle the rows, then split off the last column as the target.
    data, target = np.hsplit(iris[np.random.permutation(iris.shape[0])], np.array([-1]))
    row_split_index = int(data.shape[0] * train_per)
    # Rows before the split index are for training, the rest for testing.
    x_train, x_test = (data[:row_split_index], data[row_split_index:])
    y_train, y_test = (target[:row_split_index], target[row_split_index:])
    return x_train, x_test, y_train.astype(int), y_test.astype(int)


def run():
    # Load dataset.
    data_file = 'iris01.csv'
    iris = load_data(data_file)
    # x_train, x_test, y_train, y_test = model_selection.train_test_split(
    #     iris.data, iris.target, test_size=0.2, random_state=42)
    x_train, x_test, y_train, y_test = data_selection(iris, train_percent)

    # Build 3 layer DNN with 10, 20, 10 units respectively.
    feature_columns = [
        tf.feature_column.numeric_column(
            X_FEATURE, shape=np.array(x_train).shape[1:])]
    classifier = tf.estimator.DNNClassifier(
        feature_columns=feature_columns, hidden_units=[10, 20, 10], n_classes=3)

    # Train.
    train_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={X_FEATURE: x_train}, y=y_train, num_epochs=None, shuffle=True)
    classifier.train(input_fn=train_input_fn, steps=200)

    # Predict.
    test_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={X_FEATURE: x_test}, y=y_test, num_epochs=1, shuffle=False)
    predictions = classifier.predict(input_fn=test_input_fn)
    y_predicted = np.array(list(p['class_ids'] for p in predictions))
    y_predicted = y_predicted.reshape(np.array(y_test).shape)

    # Score with sklearn.
    # score = metrics.accuracy_score(y_test, y_predicted)
    # print('Accuracy (sklearn): {0:f}'.format(score))

    # Print predicted and actual labels side by side.
    print(np.concatenate((y_predicted, y_test), axis=1))

    # Score with tensorflow.
    scores = classifier.evaluate(input_fn=test_input_fn)
    print('Accuracy (tensorflow): {0:f}'.format(scores['accuracy']))
    print(classifier.params)


if __name__ == '__main__':
    run()
The iris01.csv data is as follows:
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
4.7,3.2,1.3,0.2,0
4.6,3.1,1.5,0.2,0
5.0,3.6,1.4,0.2,0
5.4,3.9,1.7,0.4,0
4.6,3.4,1.4,0.3,0
5.0,3.4,1.5,0.2,0
4.4,2.9,1.4,0.2,0
4.9,3.1,1.5,0.1,0
5.4,3.7,1.5,0.2,0
4.8,3.4,1.6,0.2,0
4.8,3.0,1.4,0.1,0
4.3,3.0,1.1,0.1,0
5.8,4.0,1.2,0.2,0
5.7,4.4,1.5,0.4,0
5.4,3.9,1.3,0.4,0
5.1,3.5,1.4,0.3,0
5.7,3.8,1.7,0.3,0
5.1,3.8,1.5,0.3,0
5.4,3.4,1.7,0.2,0
5.1,3.7,1.5,0.4,0
4.6,3.6,1.0,0.2,0
5.1,3.3,1.7,0.5,0
4.8,3.4,1.9,0.2,0
5.0,3.0,1.6,0.2,0
5.0,3.4,1.6,0.4,0
5.2,3.5,1.5,0.2,0
5.2,3.4,1.4,0.2,0
4.7,3.2,1.6,0.2,0
4.8,3.1,1.6,0.2,0
5.4,3.4,1.5,0.4,0
5.2,4.1,1.5,0.1,0
5.5,4.2,1.4,0.2,0
4.9,3.1,1.5,0.1,0
5.0,3.2,1.2,0.2,0
5.5,3.5,1.3,0.2,0
4.9,3.1,1.5,0.1,0
4.4,3.0,1.3,0.2,0
5.1,3.4,1.5,0.2,0
5.0,3.5,1.3,0.3,0
4.5,2.3,1.3,0.3,0
4.4,3.2,1.3,0.2,0
5.0,3.5,1.6,0.6,0
5.1,3.8,1.9,0.4,0
4.8,3.0,1.4,0.3,0
5.1,3.8,1.6,0.2,0
4.6,3.2,1.4,0.2,0
5.3,3.7,1.5,0.2,0
5.0,3.3,1.4,0.2,0
7.0,3.2,4.7,1.4,1
6.4,3.2,4.5,1.5,1
6.9,3.1,4.9,1.5,1
5.5,2.3,4.0,1.3,1
6.5,2.8,4.6,1.5,1
5.7,2.8,4.5,1.3,1
6.3,3.3,4.7,1.6,1
4.9,2.4,3.3,1.0,1
6.6,2.9,4.6,1.3,1
5.2,2.7,3.9,1.4,1
5.0,2.0,3.5,1.0,1
5.9,3.0,4.2,1.5,1
6.0,2.2,4.0,1.0,1
6.1,2.9,4.7,1.4,1
5.6,2.9,3.6,1.3,1
6.7,3.1,4.4,1.4,1
5.6,3.0,4.5,1.5,1
5.8,2.7,4.1,1.0,1
6.2,2.2,4.5,1.5,1
5.6,2.5,3.9,1.1,1
5.9,3.2,4.8,1.8,1
6.1,2.8,4.0,1.3,1
6.3,2.5,4.9,1.5,1
6.1,2.8,4.7,1.2,1
6.4,2.9,4.3,1.3,1
6.6,3.0,4.4,1.4,1
6.8,2.8,4.8,1.4,1
6.7,3.0,5.0,1.7,1
6.0,2.9,4.5,1.5,1
5.7,2.6,3.5,1.0,1
5.5,2.4,3.8,1.1,1
5.5,2.4,3.7,1.0,1
5.8,2.7,3.9,1.2,1
6.0,2.7,5.1,1.6,1
5.4,3.0,4.5,1.5,1
6.0,3.4,4.5,1.6,1
6.7,3.1,4.7,1.5,1
6.3,2.3,4.4,1.3,1
5.6,3.0,4.1,1.3,1
5.5,2.5,4.0,1.3,1
5.5,2.6,4.4,1.2,1
6.1,3.0,4.6,1.4,1
5.8,2.6,4.0,1.2,1
5.0,2.3,3.3,1.0,1
5.6,2.7,4.2,1.3,1
5.7,3.0,4.2,1.2,1
5.7,2.9,4.2,1.3,1
6.2,2.9,4.3,1.3,1
5.1,2.5,3.0,1.1,1
5.7,2.8,4.1,1.3,1
6.3,3.3,6.0,2.5,2
5.8,2.7,5.1,1.9,2
7.1,3.0,5.9,2.1,2
6.3,2.9,5.6,1.8,2
6.5,3.0,5.8,2.2,2
7.6,3.0,6.6,2.1,2
4.9,2.5,4.5,1.7,2
7.3,2.9,6.3,1.8,2
6.7,2.5,5.8,1.8,2
7.2,3.6,6.1,2.5,2
6.5,3.2,5.1,2.0,2
6.4,2.7,5.3,1.9,2
6.8,3.0,5.5,2.1,2
5.7,2.5,5.0,2.0,2
5.8,2.8,5.1,2.4,2
6.4,3.2,5.3,2.3,2
6.5,3.0,5.5,1.8,2
7.7,3.8,6.7,2.2,2
7.7,2.6,6.9,2.3,2
6.0,2.2,5.0,1.5,2
6.9,3.2,5.7,2.3,2
5.6,2.8,4.9,2.0,2
7.7,2.8,6.7,2.0,2
6.3,2.7,4.9,1.8,2
6.7,3.3,5.7,2.1,2
7.2,3.2,6.0,1.8,2
6.2,2.8,4.8,1.8,2
6.1,3.0,4.9,1.8,2
6.4,2.8,5.6,2.1,2
7.2,3.0,5.8,1.6,2
7.4,2.8,6.1,1.9,2
7.9,3.8,6.4,2.0,2
6.4,2.8,5.6,2.2,2
6.3,2.8,5.1,1.5,2
6.1,2.6,5.6,1.4,2
7.7,3.0,6.1,2.3,2
6.3,3.4,5.6,2.4,2
6.4,3.1,5.5,1.8,2
6.0,3.0,4.8,1.8,2
6.9,3.1,5.4,2.1,2
6.7,3.1,5.6,2.4,2
6.9,3.1,5.1,2.3,2
5.8,2.7,5.1,1.9,2
6.8,3.2,5.9,2.3,2
6.7,3.3,5.7,2.5,2
6.7,3.0,5.2,2.3,2
6.3,2.5,5.0,1.9,2
6.5,3.0,5.2,2.0,2
6.2,3.4,5.4,2.3,2
5.9,3.0,5.1,1.8,2
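To make the data layout concrete (four feature columns followed by an integer class label), here is a dependency-free sketch of the same shuffle-and-split idea that data_selection implements with NumPy. It uses five rows copied from the data above; the function names and the `seed` parameter are hypothetical and exist only so the example is repeatable.

```python
import csv
import random

# Five rows copied from iris01.csv above.
SAMPLE_CSV = """5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
7.0,3.2,4.7,1.4,1
6.4,3.2,4.5,1.5,1
6.3,3.3,6.0,2.5,2
"""


def load_rows(text):
    """Parse CSV text into (features, label) pairs."""
    rows = []
    for record in csv.reader(text.splitlines()):
        if record:
            features = [float(value) for value in record[:-1]]
            rows.append((features, int(record[-1])))
    return rows


def shuffle_and_split(rows, train_fraction, seed=None):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    split_index = int(len(shuffled) * train_fraction)
    return shuffled[:split_index], shuffled[split_index:]


rows = load_rows(SAMPLE_CSV)
train, test = shuffle_and_split(rows, 0.8, seed=0)
print("%d training rows, %d test rows" % (len(train), len(test)))
```

Shuffling before splitting matters for this file because the rows are grouped by class; a split without shuffling would leave the test set with only the last class.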
So how should the code be modified?
1) Import the required packages:
from pyspark.context import SparkContext
from pyspark.conf import SparkConf
from tensorflowonspark import TFCluster,TFNode
#from com.yahoo.ml.tf import TFCluster, TFNode
from datetime import datetime
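Note: when importing TFCluster, do not follow the import shown on the official site; import it from tensorflowonspark instead.
2) Modify the main function. Here, for example, the function run only needs two added parameters: (argv, ctx).
3) Replace the original call to the main function with the call below. In this example, the original entry point simply called run; now TFCluster.run has to be called instead, with the run function passed as its second argument: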
sc = SparkContext(conf=SparkConf().setAppName("your_app_name"))
num_executors = int(sc._conf.get("spark.executor.instances"))
num_ps = 1
tensorboard = True
cluster = TFCluster.run(sc, run, sys.argv, num_executors, num_ps, tensorboard, TFCluster.InputMode.TENSORFLOW)
cluster.shutdown()
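Two details of this call are easy to trip over. First, `sc._conf.get("spark.executor.instances")` returns None when the job is submitted without --num-executors, and `int(None)` raises a TypeError. Second, num_ps executors act as parameter servers and the rest as workers, so num_ps must be smaller than num_executors. The sketch below illustrates both points with plain Python; the helper names and the index order of the roles are illustrative assumptions, not TensorFlowOnSpark's actual implementation.

```python
def parse_executor_count(raw_value):
    """Convert the spark.executor.instances setting to an int.

    Raises ValueError with a readable message when the setting is missing.
    """
    if raw_value is None:
        raise ValueError(
            "spark.executor.instances is not set; pass --num-executors")
    return int(raw_value)


def assign_roles(num_executors, num_ps):
    """Split executor indices into parameter servers and workers.

    The ordering (ps first, then workers) is an assumption for illustration.
    """
    if num_ps < 0 or num_ps >= num_executors:
        raise ValueError("need 0 <= num_ps < num_executors")
    return {
        "ps": list(range(num_ps)),
        "worker": list(range(num_ps, num_executors)),
    }


num_executors = parse_executor_count("3")  # the value set by --num-executors 3
roles = assign_roles(num_executors, num_ps=1)
print(roles)
```

With the three executors requested later in this post and num_ps = 1, two executors remain for training.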
The job can then be run. The modified code is as follows:
# from __future__ import absolute_import
# from __future__ import division
# from __future__ import print_function
from pyspark.context import SparkContext
from pyspark.conf import SparkConf
from tensorflowonspark import TFCluster, TFNode
# from com.yahoo.ml.tf import TFCluster, TFNode
from datetime import datetime
import numpy as np
import sys
# from sklearn import metrics
# from sklearn import model_selection
import tensorflow as tf

X_FEATURE = 'x'  # Name of the input feature.
train_percent = 0.8


def load_data(data_file_name):
    # Each row holds four feature values followed by the class label.
    data = np.loadtxt(open(data_file_name), delimiter=",", skiprows=0)
    return data


def data_selection(iris, train_per):
    # Shuffle the rows, then split off the last column as the target.
    data, target = np.hsplit(iris[np.random.permutation(iris.shape[0])], np.array([-1]))
    row_split_index = int(data.shape[0] * train_per)
    # Rows before the split index are for training, the rest for testing.
    x_train, x_test = (data[:row_split_index], data[row_split_index:])
    y_train, y_test = (target[:row_split_index], target[row_split_index:])
    return x_train, x_test, y_train.astype(int), y_test.astype(int)


def map_run(argv, ctx):
    # Load dataset.
    data_file = 'iris01.csv'
    iris = load_data(data_file)
    # x_train, x_test, y_train, y_test = model_selection.train_test_split(
    #     iris.data, iris.target, test_size=0.2, random_state=42)
    x_train, x_test, y_train, y_test = data_selection(iris, train_percent)

    # Build 3 layer DNN with 10, 20, 10 units respectively.
    feature_columns = [
        tf.feature_column.numeric_column(
            X_FEATURE, shape=np.array(x_train).shape[1:])]
    classifier = tf.estimator.DNNClassifier(
        feature_columns=feature_columns, hidden_units=[10, 20, 10], n_classes=3)

    # Train.
    train_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={X_FEATURE: x_train}, y=y_train, num_epochs=None, shuffle=True)
    classifier.train(input_fn=train_input_fn, steps=200)

    # Predict.
    test_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={X_FEATURE: x_test}, y=y_test, num_epochs=1, shuffle=False)
    predictions = classifier.predict(input_fn=test_input_fn)
    y_predicted = np.array(list(p['class_ids'] for p in predictions))
    y_predicted = y_predicted.reshape(np.array(y_test).shape)

    # Score with sklearn.
    # score = metrics.accuracy_score(y_test, y_predicted)
    # print('Accuracy (sklearn): {0:f}'.format(score))

    # Print predicted and actual labels side by side.
    print(np.concatenate((y_predicted, y_test), axis=1))

    # Score with tensorflow.
    scores = classifier.evaluate(input_fn=test_input_fn)
    print('Accuracy (tensorflow): {0:f}'.format(scores['accuracy']))
    print(classifier.params)


if __name__ == '__main__':
    sc = SparkContext(conf=SparkConf().setAppName("your_app_name"))
    num_executors = int(sc._conf.get("spark.executor.instances"))
    num_ps = 1
    tensorboard = False
    # Run map_run on every executor; TensorFlow reads its own input.
    cluster = TFCluster.run(sc, map_run, sys.argv, num_executors, num_ps, tensorboard, TFCluster.InputMode.TENSORFLOW)
    cluster.shutdown()
8. Set the environment variables and run:
1) Upload iris01.csv to HDFS: hdfs dfs -put iris01.csv
2) Set the environment variables:
export PYTHON_ROOT=./Python
export LD_LIBRARY_PATH=${PATH}
export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=Python/bin/python"
export PATH=${PYTHON_ROOT}/bin/:$PATH
#export QUEUE=gpu
# set paths to libjvm.so, libhdfs.so, and libcuda*.so
#export LIB_HDFS=/opt/cloudera/parcels/CDH/lib64 # for CDH (per @wangyum)
export LIB_HDFS=$HADOOP_PREFIX/lib/native
export LIB_JVM=$JAVA_HOME/jre/lib/amd64/server
#export LIB_CUDA=/usr/local/cuda-7.5/lib64
# for CPU mode:
export QUEUE=default
3) Submit the job:
/usr/local/spark-1.5.1-bin-hadoop2.6/bin/spark-submit --master yarn --deploy-mode cluster --num-executors 3 --executor-memory 1024m --archives hdfs://s0:8020/user/root/Python.zip#Python,/root/iris01.csv /root/iris_c.py
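The command packs several settings into one line. As a reading aid, the sketch below rebuilds the same argument list piece by piece, which makes the comma-separated --archives value (the HDFS archive with its `#Python` alias plus the local CSV file) easier to see. The function name is hypothetical; every path and value is copied from the command above.

```python
def build_submit_command(spark_submit, script, archives,
                         num_executors, executor_memory):
    """Return the spark-submit argument list for a YARN cluster-mode job."""
    return [
        spark_submit,
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        # Entries are comma-separated; "#Python" names the unpacked directory.
        "--archives", ",".join(archives),
        script,
    ]


command = build_submit_command(
    spark_submit="/usr/local/spark-1.5.1-bin-hadoop2.6/bin/spark-submit",
    script="/root/iris_c.py",
    archives=[
        "hdfs://s0:8020/user/root/Python.zip#Python",
        "/root/iris01.csv",
    ],
    num_executors=3,
    executor_memory="1024m",
)
print(" ".join(command))
```

Passing `command` to `subprocess.check_call` would submit the job from a Python script instead of the shell.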
4) Check the YARN logs to confirm that the job ran successfully.
5. Problems and solutions
File "iris_c.py", line 6, in <module>
from com.yahoo.ml.tf import TFCluster, TFNode
ImportError: No module named com.yahoo.ml.tf
from com.yahoo.ml.tf import TFCluster, TFNode
=>
from tensorflowonspark import TFCluster,TFNode
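Before submitting a job, it can help to check which of the two module paths the packaged interpreter actually provides. The following sketch tries each candidate name in turn and reports the first one that imports; the helper name is hypothetical, and the check is only meaningful when run with the interpreter that the executors will use (for example ${PYTHON_ROOT}/bin/python).

```python
import importlib


def first_importable(module_names):
    """Return the first module name that imports cleanly, or None."""
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            continue  # not installed under this name; try the next one
        return name
    return None


# Old module path first, then the pip-installed package name.
chosen = first_importable(["com.yahoo.ml.tf", "tensorflowonspark"])
print("importable module: %s" % chosen)
```

A result of None means neither name is available, which points at the Python environment rather than at the import line.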
6. Summary
Stay grounded and stay focused.
When reposting, please credit the blog: http://blog.csdn.net/fansy1990