Distributed TensorFlow

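Training a neural network on a large dataset usually demands substantial compute and takes a long time. TensorFlow therefore provides a distributed deployment mode: a training job is split into several smaller tasks that are assigned to different machines and computed cooperatively, which can cut training time considerably.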

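Let us first look at training in the simpler setups:
1) Single CPU, single GPU

This is the simplest case. Both the parameters and the computation can be defined on the GPU; if the model's parameters are large and GPU memory is insufficient, the parameters have to be placed on the CPU instead.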

import tensorflow as tf

with tf.device('/cpu:0'):  # the variables could also be placed on the GPU
    w = tf.get_variable('w', [1], tf.float32, initializer=tf.constant_initializer(2))
    b = tf.get_variable('b', [1], tf.float32, initializer=tf.constant_initializer(5))
with tf.device('/gpu:0'):
    add = w + b
    mut = w * b
init = tf.global_variables_initializer()  # replaces the deprecated tf.initialize_all_variables()
with tf.Session() as sess:
    sess.run(init)
    tensor1, tensor2 = sess.run([add, mut])
    print(tensor1)
    print(tensor2)

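2) Single CPU, multiple GPUs

Here different GPUs can be assigned different parts of the training. Shared operations are generally defined on the CPU and parallel operations on the individual GPUs. In deep learning, for example, parameter definitions and parameter updates are usually kept on the CPU; each GPU computes the gradients for its own batch and sends them to the CPU, which averages them and updates the parameters.

For complete multi-GPU deep learning training code, see: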

https://github.com/tensorflow/models/blob/master/inception/inception/inception_train.py

import tensorflow as tf

with tf.device('/cpu:0'):
    w = tf.get_variable('w', [1], tf.float32, initializer=tf.constant_initializer(2))
    b = tf.get_variable('b', [2], tf.float32, initializer=tf.constant_initializer(5))
with tf.device('/gpu:0'):  # this example needs two visible GPUs
    add = w + b
with tf.device('/gpu:1'):
    mut = w * b
init = tf.global_variables_initializer()  # replaces the deprecated tf.initialize_all_variables()
with tf.Session() as sess:
    sess.run(init)
    print(sess.run([add, mut]))
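
The gradient-averaging scheme described for the multi-GPU case can be sketched without TensorFlow. The snippet below is only a minimal illustration, not the code from the linked inception_train.py: the model y = w * x, the function names tower_gradient and average_and_update, and the sample batches are all assumptions made for this example. Each "tower" stands in for one GPU computing the gradient on its own batch; the averaging and the update stand in for the work done on the CPU.

```python
def tower_gradient(w, batch):
    """Gradient of the mean squared error of y = w * x on one batch (one 'GPU')."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def average_and_update(w, batches, learning_rate):
    """'CPU' side: average the per-tower gradients and update the shared parameter."""
    grads = [tower_gradient(w, batch) for batch in batches]  # one gradient per tower
    mean_grad = sum(grads) / len(grads)
    return w - learning_rate * mean_grad


# Hypothetical data: two towers, each with its own batch drawn from y = 2x.
batches = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (-1.0, -2.0)]]
w = 0.0
for _ in range(50):
    w = average_and_update(w, batches, learning_rate=0.05)
print(w)  # approaches 2.0
```

Because every tower evaluates its gradient against the same value of w, the result matches a single update on the combined data when the batches have equal size.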

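3) Multiple CPUs, multiple GPUs

In this case each process is given a defined role, so that the roles can cooperate with a clear division of labor.

The concepts of cluster, job, and task:

A task can be thought of as one process running on a machine; several tasks make up a job.

Jobs come in two kinds, ps (parameter server) and worker, which serve parameters and perform computation respectively; together the jobs form the cluster.

Distributed TensorFlow has two architectural modes: in-graph and between-graph.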

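in-graph mode

In-graph mode extends computation from a single machine with multiple GPUs to multiple machines with multiple GPUs, but the data is still distributed from a single node. The advantage is simple configuration: each additional compute node only needs to start a server, call join, expose a network endpoint, and wait there for work. Such an endpoint is used just like a local GPU: wrapping operations in tf.device("/job:worker/task:n") places them on that compute node, in the same way operations are pinned to a particular GPU. The drawback is that the training data is still distributed from one node and has to be shipped to the other machines, which severely limits parallel training speed. This mode is not recommended for training on large datasets.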
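
The single-node data distribution of in-graph mode can be pictured with a small sketch. It contains no TensorFlow calls; in_graph_plan is a hypothetical helper that only shows how one client would split a batch and pair each piece with a worker device string of the form used above.

```python
def in_graph_plan(batch, num_workers):
    """Split one batch on the client and pair each shard with a worker device string."""
    shard_size = (len(batch) + num_workers - 1) // num_workers  # ceiling division
    plan = []
    for n in range(num_workers):
        device = "/job:worker/task:%d" % n
        shard = batch[n * shard_size:(n + 1) * shard_size]
        plan.append((device, shard))
    return plan


# Hypothetical batch of ten samples spread over three workers.
plan = in_graph_plan(list(range(10)), num_workers=3)
for device, shard in plan:
    print(device, shard)
```

Every shard in the plan originates on the client and has to travel over the network to its worker, which is the bottleneck the paragraph describes.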

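between-graph mode

In between-graph mode the trained parameters live on the parameter servers and the data does not need to be distributed: it is stored in shards on the compute nodes. Each compute node works through its own shard and, when done, reports its parameter updates to the parameter servers, which apply them. The advantage is that no training data has to be distributed, which saves a great deal of time, especially at TB scale, so between-graph mode is the recommended choice for deep learning on large datasets.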
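
One simple way for each worker to pick its own shard is by striding over the sample list with its task index. This is only a sketch under assumptions: local_shard is a hypothetical helper, and the script later in this article does not shard its toy data at all.

```python
def local_shard(samples, task_index, num_workers):
    """Return the samples owned by one worker: every num_workers-th item."""
    return samples[task_index::num_workers]


# Hypothetical dataset of ten sample ids and three workers.
samples = list(range(10))
shards = [local_shard(samples, i, 3) for i in range(3)]
print(shards)
```

Each worker can compute its shard locally from task_index alone, so no node has to hand out data to the others.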

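Synchronous and asynchronous updates

Both modes support synchronous and asynchronous updates.

Synchronous update: the data is split into several parts, and each part computes its own gradients from the current parameters. Once every part has finished, the partial gradients are collected and combined into a total gradient, which is then used to update the parameters.
Asynchronous update: in synchronous mode every update has to wait until all parts have finished computing their gradients, so throughput is bounded by the slowest part and the other parts spend a lot of time waiting. In asynchronous mode each part simply computes its own gradients and updates the parameters with them; the parts neither communicate with nor wait for one another.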
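
The difference between the two update rules can be seen on a toy model. The sketch below is an illustration under assumptions: the model y = w * x, the function names, and the data are made up, and the asynchronous case is simulated by applying the workers' gradients one after another rather than with real concurrency.

```python
def shard_gradient(w, shard):
    """Mean gradient of (w * x - y)^2 with respect to w over one data shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def sync_step(w, shards, learning_rate):
    """Synchronous: wait for every shard's gradient, average, then update once."""
    grads = [shard_gradient(w, shard) for shard in shards]  # all use the same w
    return w - learning_rate * sum(grads) / len(grads)


def async_step(w, shards, learning_rate):
    """Asynchronous (simulated): each worker applies its own gradient when ready."""
    for shard in shards:  # arrival order is assumed to be the list order
        w = w - learning_rate * shard_gradient(w, shard)
    return w


# Hypothetical shards for two workers, both drawn from y = 2x.
shards = [[(1.0, 2.0)], [(2.0, 4.0)]]
w_sync = sync_step(0.0, shards, learning_rate=0.1)
w_async = async_step(0.0, shards, learning_rate=0.1)
print(w_sync, w_async)
```

The synchronous step makes one update from the averaged gradient, while the asynchronous step makes one update per worker, so the second worker already sees the parameter changed by the first.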

The code below is used to explain the functions involved and how they are used:

import numpy as np
import tensorflow as tf

flags = tf.app.flags
# Role name of this process
flags.DEFINE_string('job_name', None, 'job name: worker or ps')
# Index of this task within its job
flags.DEFINE_integer('task_index', None, 'Index of task within the job')
# Host addresses and ports
flags.DEFINE_string('ps_hosts', 'localhost:1681', 'Comma-separated list of hostname:port pairs')
flags.DEFINE_string('worker_hosts', 'localhost:1682,localhost:1683', 'Comma-separated list of hostname:port pairs')
# Directory for saved files
flags.DEFINE_string('log_dir', 'log/super/', 'directory path')
# Training settings
flags.DEFINE_integer('training_epochs', 20, 'training epochs')
FLAGS = flags.FLAGS

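The code above is straightforward: it only defines some flags.

1) job_name and task_index are passed on the command line at launch and determine the role (mainly ps or worker) and the task index of the process.

2) ps_hosts and worker_hosts list the hosts taking part in training as hostname:port pairs separated by commas.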

# Generate synthetic data
train_X = np.linspace(-1, 1, 100)
train_Y = 2 * train_X + np.random.randn(*train_X.shape) * 0.3  # y = 2x plus noise

tf.reset_default_graph()

ps_hosts = FLAGS.ps_hosts.split(',')
worker_hosts = FLAGS.worker_hosts.split(',')
cluster_spec = tf.train.ClusterSpec({'ps': ps_hosts, 'worker': worker_hosts})

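This code mainly does two things:

1) It generates synthetic data.

2) It splits ps_hosts and worker_hosts and passes them to tf.train.ClusterSpec(), which holds the address and port of every ps and worker node that will run the job. Every node executes this code, so all of them know which members the cluster has and what role each one plays, ps or worker.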
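
What the cluster description amounts to can be shown in plain Python. The helpers build_cluster and task_address below are hypothetical names used only for illustration; they mimic the dictionary handed to tf.train.ClusterSpec() and show that a task's index is simply its position in the host list of its job.

```python
def build_cluster(ps_hosts, worker_hosts):
    """Turn comma-separated host strings into a {job_name: [address, ...]} dict."""
    return {
        "ps": [h.strip() for h in ps_hosts.split(",") if h.strip()],
        "worker": [h.strip() for h in worker_hosts.split(",") if h.strip()],
    }


def task_address(cluster, job_name, task_index):
    """Look up the hostname:port of one task; the list position is the task index."""
    return cluster[job_name][task_index]


# Default flag values from the script above.
cluster = build_cluster("localhost:1681", "localhost:1682,localhost:1683")
print(task_address(cluster, "worker", 1))  # localhost:1683
```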

# Create the server for this task
server = tf.train.Server(cluster_spec,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)
# A ps task blocks in join() and serves requests
if FLAGS.job_name == 'ps':
    print("waiting...")
    server.join()

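tf.train.Server() determines which task this process is from its arguments. If the job name is ps, the program blocks at join() and acts as the parameter service, waiting for the worker nodes to connect and send it parameter updates. If it is a worker task, execution continues with the computation that follows.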

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster_spec)):
    X = tf.placeholder("float")
    Y = tf.placeholder("float")
    # Model parameters
    W = tf.Variable(tf.random_normal([1]), name="weight")
    b = tf.Variable(tf.zeros([1]), name="bias")

    global_step = tf.contrib.framework.get_or_create_global_step()  # training step counter

    # Forward pass
    z = tf.multiply(X, W) + b
    # Loss for backpropagation
    cost = tf.reduce_mean(tf.square(Y - z))
    learning_rate = 0.01
    # Gradient descent
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost, global_step=global_step)

    init = tf.global_variables_initializer()

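The device function passed to tf.device() is produced by tf.train.replica_device_setter().

The full parameter list of tf.train.replica_device_setter() is given at the end of this article.

worker_device names the specific task (job and task index) on which this worker's operations run.
cluster specifies the roles and their addresses, so that the graph nodes of the whole job can be placed.

init = tf.global_variables_initializer() initializes all variables defined before it; variables defined afterwards will not be initialized by it.

Variables defined under this with statement are automatically placed on the parameter servers; if there are several parameter servers, they are assigned in round-robin order.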
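
The round-robin assignment can be sketched in a few lines of plain Python. round_robin_placement is a hypothetical helper written for this illustration, not TensorFlow's implementation, and the variable names are just the ones used in the script above.

```python
def round_robin_placement(variable_names, num_ps_tasks):
    """Assign variables to ps tasks in creation order, cycling through the tasks."""
    placement = {}
    for i, name in enumerate(variable_names):
        placement[name] = "/job:ps/task:%d" % (i % num_ps_tasks)
    return placement


# Assumed setup: three variables and two parameter-server tasks.
placement = round_robin_placement(["weight", "bias", "global_step"], num_ps_tasks=2)
for name, device in placement.items():
    print(name, "->", device)
```

With the single ps task configured in this article, every variable ends up on /job:ps/task:0.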

sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                         init_op=init,
                         global_step=global_step)

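tf.train.Supervisor() acts like a supervisor. In distributed training many machines are running at once, and tasks such as initializing parameters, saving the model, and writing summaries are handled by the supervisor for you, so you do not have to do them by hand. Sharing parameters in a distributed environment is also awkward to manage, which is another reason tf.train.Supervisor() exists.

      is_chief indicates whether this process is the chief supervisor. Here the worker with task_index=0 is made the chief; it is responsible for initializing the parameters and for saving the model and the summaries.
      init_op is the op used to initialize the variables.
      global_step is shared by all compute nodes and is incremented by 1 automatically each time the optimizer's minimize op runs.

Because the parameters are already initialized through init_op, there is no need to run sess.run(init) again; doing so would wipe out the variables loaded from a saved model.

Some other parameters:

logdir is the directory where checkpoint and summary files are saved. When training starts, the supervisor looks in logdir for a checkpoint; if one exists it is loaded automatically, otherwise the variables are initialized with init_op.
saver takes the saver object used for checkpoints, and the supervisor then saves checkpoint files automatically. Set it to None to disable automatic saving.
summary_op likewise enables automatic saving of summaries. Setting it to None disables automatic saving.
save_model_secs is the time interval, in seconds, between checkpoint saves.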
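
The start-up behaviour described for is_chief, init_op, and logdir can be summarised as a small decision function. This is a simplified sketch, not the Supervisor's actual code: startup_action and its return labels are hypothetical, and the check relies only on the fact that TensorFlow keeps a state file named checkpoint in the checkpoint directory.

```python
import os
import tempfile


def startup_action(logdir, is_chief):
    """Decide how a task obtains its initial variable values (simplified)."""
    if not is_chief:
        return "wait_for_chief"  # non-chief tasks wait until the model is ready
    if logdir and os.path.exists(os.path.join(logdir, "checkpoint")):
        return "restore"  # a checkpoint state file was found in logdir
    return "run_init_op"  # fresh start: initialize with init_op


# Hypothetical run: an empty log directory first, then one with a checkpoint file.
logdir = tempfile.mkdtemp()
first = startup_action(logdir, is_chief=True)
open(os.path.join(logdir, "checkpoint"), "w").close()
second = startup_action(logdir, is_chief=True)
print(first, second)
```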

# Connect to this task's server and create the session
with sv.managed_session(server.target) as sess:
    total_steps = FLAGS.training_epochs * len(train_X)
    step = sess.run(global_step)  # may be nonzero if other workers have already trained
    print(step)

    # global_step is shared, so the workers together stop around total_steps
    while step < total_steps:
        for (x, y) in zip(train_X, train_Y):
            _, step = sess.run([optimizer, global_step], feed_dict={X: x, Y: y})

            loss = sess.run(cost, feed_dict={X: train_X, Y: train_Y})
            print("Step:", step + 1, "cost=", loss, "W=", sess.run(W), "b=", sess.run(b))

    print(" Finished!")
sv.stop()

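The code above uses managed_session of tf.train.Supervisor() to open and manage a session. The session is only responsible for computation; communication and coordination are left to the supervisor.

To save summaries in this program, use sv.summary_computed(); to save a checkpoint manually, use sv.saver.save(). Manual saving still works when automatic checkpointing is enabled. If the program is interrupted, the supervisor automatically reloads the model parameters when it is run again, so saver.restore() does not need to be called by hand.

In a distributed deployment, a few further points apply to saving summaries:

1) A worker that is not the chief supervisor cannot use sv.summary_computed(); calling it there does nothing useful and raises an error.

2) When summary and checkpoint saving is controlled by hand, that code has to be removed from every worker other than the chief supervisor. Letting the supervisor save at fixed time intervals avoids this, and a single script is then enough for all workers.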

The complete code follows.

To run it, open three terminals and enter one command in each:

1) python distribute.py --job_name=ps --task_index=0
2) python distribute.py --job_name=worker --task_index=0
3) python distribute.py --job_name=worker --task_index=1

import numpy as np
import tensorflow as tf

flags = tf.app.flags

# Role name of this process
flags.DEFINE_string('job_name', None, 'job name: worker or ps')
# Index of this task within its job
flags.DEFINE_integer('task_index', None, 'Index of task within the job')

# Host addresses and ports
flags.DEFINE_string('ps_hosts', 'localhost:1681', 'Comma-separated list of hostname:port pairs')
flags.DEFINE_string('worker_hosts', 'localhost:1682,localhost:1683', 'Comma-separated list of hostname:port pairs')
# Directory for saved files
flags.DEFINE_string('log_dir', 'log/super/', 'directory path')

# Training settings
flags.DEFINE_integer('training_epochs', 20, 'training epochs')

FLAGS = flags.FLAGS

# Generate synthetic data
train_X = np.linspace(-1, 1, 100)
train_Y = 2 * train_X + np.random.randn(*train_X.shape) * 0.3  # y = 2x plus noise

tf.reset_default_graph()

ps_hosts = FLAGS.ps_hosts.split(',')
worker_hosts = FLAGS.worker_hosts.split(',')
cluster_spec = tf.train.ClusterSpec({'ps': ps_hosts, 'worker': worker_hosts})
# Create the server for this task
server = tf.train.Server(cluster_spec,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)

# A ps task blocks in join() and serves requests
if FLAGS.job_name == 'ps':
    print("waiting...")
    server.join()

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster_spec)):
    X = tf.placeholder("float")
    Y = tf.placeholder("float")
    # Model parameters
    W = tf.Variable(tf.random_normal([1]), name="weight")
    b = tf.Variable(tf.zeros([1]), name="bias")

    global_step = tf.contrib.framework.get_or_create_global_step()  # training step counter

    # Forward pass
    z = tf.multiply(X, W) + b
    # Loss for backpropagation
    cost = tf.reduce_mean(tf.square(Y - z))
    learning_rate = 0.01
    # Gradient descent
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost, global_step=global_step)

    init = tf.global_variables_initializer()


sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                         init_op=init,
                         global_step=global_step)

# Connect to this task's server and create the session
with sv.managed_session(server.target) as sess:
    total_steps = FLAGS.training_epochs * len(train_X)
    step = sess.run(global_step)  # may be nonzero if other workers have already trained
    print(step)

    # global_step is shared, so the workers together stop around total_steps
    while step < total_steps:
        for (x, y) in zip(train_X, train_Y):
            _, step = sess.run([optimizer, global_step], feed_dict={X: x, Y: y})

            loss = sess.run(cost, feed_dict={X: train_X, Y: train_Y})
            print("Step:", step + 1, "cost=", loss, "W=", sess.run(W), "b=", sess.run(b))

    print(" Finished!")
sv.stop()
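
The launch commands given before the script follow directly from the two host lists: one process per entry, with the task index equal to the entry's position. The helper below is a hypothetical convenience for printing those commands and is not part of the original script; it uses only the --job_name and --task_index flags defined there.

```python
def launch_commands(script, ps_hosts, worker_hosts):
    """Build one launch command per task listed in the comma-separated host strings."""
    commands = []
    for job_name, hosts in (("ps", ps_hosts), ("worker", worker_hosts)):
        for task_index, _ in enumerate(hosts.split(",")):
            commands.append("python %s --job_name=%s --task_index=%d"
                            % (script, job_name, task_index))
    return commands


# Default host lists from the script; distribute.py is the file name used above.
commands = launch_commands("distribute.py", "localhost:1681",
                           "localhost:1682,localhost:1683")
for command in commands:
    print(command)
```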

For reference, the signature and docstring of tf.train.replica_device_setter():

def replica_device_setter(ps_tasks=0, ps_device="/job:ps",
                          worker_device="/job:worker", merge_devices=True,
                          cluster=None, ps_ops=None, ps_strategy=None):
  """Return a `device function` to use when building a Graph for replicas.

  Device Functions are used in `with tf.device(device_function):` statement to
  automatically assign devices to `Operation` objects as they are constructed,
  Device constraints are added from the inner-most context first, working
  outwards. The merging behavior adds constraints to fields that are yet unset
  by a more inner context. Currently the fields are (job, task, cpu/gpu).

  If `cluster` is `None`, and `ps_tasks` is 0, the returned function is a no-op.
  Otherwise, the value of `ps_tasks` is derived from `cluster`.

  By default, only Variable ops are placed on ps tasks, and the placement
  strategy is round-robin over all ps tasks. A custom `ps_strategy` may be used
  to do more intelligent placement, such as
  `tf.contrib.training.GreedyLoadBalancingStrategy`.

  For example,

  ```python
  # To build a cluster with two ps jobs on hosts ps0 and ps1, and 3 worker
  # jobs on hosts worker0, worker1 and worker2.
  cluster_spec = {
      "ps": ["ps0:2222", "ps1:2222"],
      "worker": ["worker0:2222", "worker1:2222", "worker2:2222"]}
  with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
    # Build your graph
    v1 = tf.Variable(...)  # assigned to /job:ps/task:0
    v2 = tf.Variable(...)  # assigned to /job:ps/task:1
    v3 = tf.Variable(...)  # assigned to /job:ps/task:0
  # Run compute
  ```

  Args:
    ps_tasks: Number of tasks in the `ps` job.  Ignored if `cluster` is
      provided.
    ps_device: String.  Device of the `ps` job.  If empty no `ps` job is used.
      Defaults to `ps`.
    worker_device: String.  Device of the `worker` job.  If empty no `worker`
      job is used.
    merge_devices: `Boolean`. If `True`, merges or only sets a device if the
      device constraint is completely unset. merges device specification rather
      than overriding them.
    cluster: `ClusterDef` proto or `ClusterSpec`.
    ps_ops: List of strings representing `Operation` types that need to be
      placed on `ps` devices.  If `None`, defaults to `STANDARD_PS_OPS`.
    ps_strategy: A callable invoked for every ps `Operation` (i.e. matched by
      `ps_ops`), that takes the `Operation` and returns the ps task index to
      use.  If `None`, defaults to a round-robin strategy across all `ps`
      devices.

  Returns:
    A function to pass to `tf.device()`.

  Raises:
    TypeError if `cluster` is not a dictionary or `ClusterDef` protocol buffer,
    or if `ps_strategy` is provided but not a callable.
  """
