NVCaffe 0.16.2 Multi-GPU Training: A Code Walkthrough

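NVIDIA has optimized Caffe in its NVCaffe fork. This post studies the code behind multi-GPU training in NVCaffe, focusing on how parameters are updated and how the GPUs communicate. Corrections are welcome.
Main references first:
1. The NVCaffe GitHub home page
2. The blogger @KFXW wrote a series of posts on reading the NVCaffe source code, which inspired me a great deal. Many thanks!
3. Another blogger, @漚江一流, described the training process (Caffe, LeNet) in great detail. I learned a lot from the posts on forward and backward propagation and on weight updates. Many thanks as well!
4. I also consulted posts by other bloggers that I unfortunately did not keep track of. Thank you all!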

With that out of the way, let's start with the main function.
The main function main()

int main(int argc, char** argv) {
  // Run tool or show usage.
  caffe::GlobalInit(&argc, &argv);
  // Set the device
  vector<int> gpus;
  get_gpus(&gpus);
#ifndef CPU_ONLY
  if (gpus.size() > 0) {
    Caffe::SetDevice(gpus[0]);
  }
#endif
  if (argc == 2) {
      // If caffe is launched for training as ./build/tools/caffe train,
      // the key looked up in g_brew_map is argv[1], i.e. 'train', so train() is what actually gets called.
      return GetBrewFunction(caffe::string(argv[1]))();  // ------->
  } else {
    gflags::ShowUsageWithFlagsRestrict(argv[0], "tools/caffe");
  }
}

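The RegisterBrewFunction macro is placed after each function that implements one of the tool's main actions, and adds the function's name and its function pointer to g_brew_map. The four registered functions are train(), test(), device_query() and time().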

#define RegisterBrewFunction(func) \
namespace { \
class __Registerer_##func { \
 public: /* NOLINT */ \
  __Registerer_##func() { \
    g_brew_map[#func] = &func; \
  } \
}; \
__Registerer_##func g_registerer_##func; \
}

GetBrewFunction() returns g_brew_map[name], i.e. the function that implements the requested action.

static BrewFunction GetBrewFunction(const caffe::string& name) {
  if (g_brew_map.count(name)) {
    return g_brew_map[name];
  }
  // Handling of an unknown name is omitted from this excerpt.
}
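The registration trick above can be tried in isolation. The following is a minimal sketch, assuming nothing beyond the standard library: the names brew_map_sketch, REGISTER_BREW_SKETCH, brew_train and brew_test are hypothetical stand-ins and do not exist in NVCaffe. It shows how a file-scope object's constructor fills the map before main() runs, and adds an explicit "not found" result to the lookup.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-ins; none of the *_sketch names exist in NVCaffe.
typedef int (*BrewFunctionSketch)();
typedef std::map<std::string, BrewFunctionSketch> BrewMapSketch;

// A function-local static sidesteps static-initialization-order issues.
inline BrewMapSketch& brew_map_sketch() {
  static BrewMapSketch m;
  return m;
}

// Same idea as RegisterBrewFunction: a file-scope object whose constructor
// stores (name, function pointer) in the map before main() starts.
#define REGISTER_BREW_SKETCH(func) \
namespace { \
struct RegistererSketch_##func { \
  RegistererSketch_##func() { brew_map_sketch()[#func] = &func; } \
}; \
RegistererSketch_##func g_registerer_sketch_##func; \
}

int brew_train() { return 0; }
REGISTER_BREW_SKETCH(brew_train)

int brew_test() { return 1; }
REGISTER_BREW_SKETCH(brew_test)

// Lookup that reports an unknown name explicitly instead of falling off
// the end of the function.
inline BrewFunctionSketch get_brew_function_sketch(const std::string& name) {
  BrewMapSketch::const_iterator it = brew_map_sketch().find(name);
  return it == brew_map_sketch().end() ? nullptr : it->second;
}

// Mirrors `return GetBrewFunction(argv[1])();` from main().
inline int run_brew_sketch(const std::string& name) {
  BrewFunctionSketch fn = get_brew_function_sketch(name);
  assert(fn != nullptr && "unknown action");
  return fn();
}
```

Calling run_brew_sketch("brew_train") dispatches through the map exactly as main() dispatches on argv[1]. The stringified macro argument (#func) is what makes the function name double as the map key.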

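Inside train(), the code first reads the training parameters from solver.prototxt and sets Caffe's mode (GPU or CPU) and the device id(s). That part is omitted here; the analysis focuses on how the Solver class carries out the whole training run.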

int train() {
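  // Call the static member function CreateSolver() of SolverRegistry to obtain a Solver pointer, and wrap it in the shared_ptr solver.
  // solver_param holds the settings parsed from solver.prototxt (the net definition and the solver configuration). With multiple GPUs, the Solver created here is the root_solver of the training run (rank 0, on gpus[0]).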
  shared_ptr<caffe::Solver> solver(caffe::SolverRegistry::CreateSolver(solver_param)); //-----> solver_factory.hpp CreateSolver()
  solver->SetActionFunction(signal_handler.GetActionFunction());
  // Multi-GPU training has to coordinate inter-GPU communication with computation that runs asynchronously.
  if (gpus.size() > 1)   {
    caffe::P2PManager p2p_mgr(solver, gpus.size(), solver->param()); 
                         //-----> parallel.cpp   P2PManager::P2PManager()
    p2p_mgr.Run(gpus); //-------> parallel.cpp  P2PManager::Run(const vector<int>& gpus)
  }   else {  // gpus.size() <= 1
    LOG(INFO) << "Starting Optimization";
    // Call the Solver's Solve() method to start optimization.
    solver->Solve();   //-------> solver.cpp  Solver::Solve(const char* resume_file)
  }
  LOG(INFO) << "Optimization Done in " << Caffe::time_from_init();
  return 0;
}

solver_factory.hpp creates the Solver.

  static Solver* CreateSolver(const SolverParameter& param, size_t rank = 0U,
      Solver* root_solver = NULL) {
    const string& type = param.type();
    CreatorRegistry& registry = Registry();
    CHECK_EQ(registry.count(type), 1) << "Unknown solver type: " << type
        << " (known types: " << SolverTypeListString() << ")";
    Solver* solver = registry[type](param, rank, root_solver);
    return solver;
  }

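Although solver refers to the object through the base class Solver, C++ polymorphism means that virtual member functions called through this smart pointer resolve to the subclass implementations.
Because the default solver type in caffe.proto is SGD, an SGDSolver object (sgd_solvers.hpp) is instantiated. SGDSolver inherits from Solver.
class SGDSolver : public Solver
Its constructor is: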

  explicit SGDSolver(const SolverParameter& param,
      size_t rank = 0U, Solver *root_solver = NULL)
      : Solver(param, rank, root_solver) { PreSolve(); }

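So the constructor of the parent class Solver runs first. A Solver contains a Net object, and a Net in turn contains Layer objects and Blob objects. Overall, initialization proceeds roughly as follows:

Create an SGDSolver object -> SGDSolver constructor -> Solver constructor -> create a Net instance -> Net constructor -> create each Layer instance -> each Layer's constructor -> set up each Blob, which completes the initialization of the whole network.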
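The factory lookup and the construction order can be reproduced with a toy hierarchy. This is only a sketch under stated assumptions: NetSketch, SolverSketch, SGDSolverSketch, registry_sketch and create_solver_sketch are hypothetical names, the registry holds a single hard-coded entry, and the real classes do far more work in their constructors. It records which constructor body finishes first and shows virtual dispatch through a base-class pointer.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Records the order in which constructor bodies complete.
inline std::vector<std::string>& ctor_log() {
  static std::vector<std::string> log;
  return log;
}

struct NetSketch {
  NetSketch() { ctor_log().push_back("Net"); }
};

struct SolverSketch {
  // The member net_ is built before the Solver constructor body runs.
  SolverSketch() : net_(new NetSketch()) { ctor_log().push_back("Solver"); }
  virtual ~SolverSketch() {}
  virtual std::string type() const = 0;
  std::unique_ptr<NetSketch> net_;
};

struct SGDSolverSketch : SolverSketch {
  // The base-class constructor has already finished at this point.
  SGDSolverSketch() { ctor_log().push_back("SGDSolver"); }
  std::string type() const override { return "SGD"; }
};

typedef std::function<SolverSketch*()> CreatorSketch;

// Toy registry keyed by solver type, in the spirit of SolverRegistry.
inline std::map<std::string, CreatorSketch>& registry_sketch() {
  static std::map<std::string, CreatorSketch> r = {
      {"SGD", [] { return static_cast<SolverSketch*>(new SGDSolverSketch()); }}};
  return r;
}

inline std::unique_ptr<SolverSketch> create_solver_sketch(const std::string& type) {
  assert(registry_sketch().count(type) == 1 && "unknown solver type");
  return std::unique_ptr<SolverSketch>(registry_sketch()[type]());
}
```

After create_solver_sketch("SGD"), the log reads Net, Solver, SGDSolver: the innermost object is completed first, which is the same inside-out order as the chain described above.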

parallel.cpp: the P2PManager constructor
Note: the solver created in caffe.cpp is the root_solver.

P2PManager::P2PManager(shared_ptr<Solver> root_solver,
    int nranks, const SolverParameter& solver_param) :
      nranks_(nranks),
      syncs_(nranks),
      root_solver_(root_solver)

parallel.cpp: the Run() function

void P2PManager::Run(const vector<int>& gpus) {
  ......
  SolverParameter param = root_solver_->param();
  this->shared_ = make_shared<SharedScores<float>>(nranks_);
  for (int i = 0; i < gpus.size(); ++i) {
    param.set_device_id(gpus[i]);
    // make_shared returns a shared_ptr to a P2PSync, stored in syncs_[i].
    // Each GPU gets its own P2PSync, used for asynchronous P2P communication among the GPUs.
    syncs_[i] = make_shared<P2PSync>(this, root_solver_, i, gpus.size(), param);
                                 // -------> parallel.cpp  P2PSync::P2PSync()
#ifndef CPU_ONLY
#ifdef USE_NCCL
    syncs_[i]->aux_ = &nccl_id_;
#else
    LOG(FATAL) << "Multi-GPU execution not available - rebuild with USE_NCCL";
#endif  // USE_NCCL
#endif  // CPU_ONLY
    syncs_[i]->shared_ = this->shared_;
  }
  LOG(INFO)<< "Starting Optimization";
  for (int i = 0; i < syncs_.size(); ++i) {
    // Start the internal thread
    syncs_[i]->StartInternalThread(true, static_cast<uint64_t>(param.random_seed())); // -------> internal_thread.cpp 
       // InternalThread::StartInternalThread(bool set_cpu_affinity, uint64_t random_seed)
  }
  for (int i = 0; i < syncs_.size(); ++i) {
    syncs_[i]->WaitAll();
  }
  ......
}

P2PSync inherits from Solver::Callback and InternalThread

class P2PSync : public Solver::Callback, public InternalThread
Constructor:

P2PSync::P2PSync(P2PManager* mgr, shared_ptr<Solver> root_solver,
    int rank, int nranks, const SolverParameter& solver_param)
    : InternalThread(solver_param.device_id(), rank, 1, false),
      mgr_(mgr),
      rank_(rank),
      nranks_(nranks),
      initial_iter_(root_solver->iter()),
      solver_(),
      root_solver_(root_solver),
      solver_param_(solver_param)

InternalThread constructor:

InternalThread::InternalThread(int target_device, size_t rank, size_t threads, bool delayed)
    : target_device_(target_device),
      rank_(rank),
      aux_(nullptr),
      threads_(threads),
      delay_flags_(threads, make_shared<Flag>(!delayed))

InternalThread: starting the thread
Note: the InternalThread instance is created with threads = 1, so threads_.size() = 1 here.

void InternalThread::StartInternalThread(bool set_cpu_affinity, uint64_t random_seed) {
  ......
  const int solver_count = Caffe::solver_count();
  try {
    for (size_t id = 0; id < threads_.size(); ++id) {
      // Assign a new boost::thread object to threads_[id]; the thread runs the entry function. In practice there is only one thread.
      threads_[id] = boost::thread(&InternalThread::entry, this, id, target_device_, mode,
          random_seed, solver_count, rank_, set_cpu_affinity);  // ------>
    }
  } catch (std::exception& e) {
    LOG(FATAL) << "Thread exception: " << e.what();
  }
}

The function the thread runs: InternalThread::entry

void InternalThread::entry(int thread_id, int device, Caffe::Brew mode, uint64_t random_seed,
    int solver_count, size_t rank, bool set_cpu_affinity) {
  ......
  Caffe::set_mode(mode);
  Caffe::set_random_seed(random_seed);
  Caffe::set_solver_count(solver_count);
  if (threads_.size() == 1) {
    InternalThreadEntry();   // ---------> internal_thread.hpp, a virtual function implemented by the subclass
  } else {
    InternalThreadEntryN(thread_id);  // ---------> internal_thread.hpp, a virtual function implemented by the subclass
  }
}

Since threads_.size() = 1, the next step is InternalThreadEntry(), which is implemented by the subclass P2PSync.
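The dispatch from entry() to the subclass hook can be illustrated with std::thread in place of boost::thread. This is a rough sketch, not the NVCaffe class: InternalThreadSketch and WorkerSketch are hypothetical names, and device selection, random seeds and CPU affinity are all left out.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

class InternalThreadSketch {
 public:
  explicit InternalThreadSketch(size_t threads) : threads_(threads) {}
  virtual ~InternalThreadSketch() { WaitAll(); }

  void StartInternalThread() {
    assert(!threads_.empty());
    for (size_t id = 0; id < threads_.size(); ++id) {
      // Each slot receives a running thread that executes entry(id).
      threads_[id] = std::thread(&InternalThreadSketch::entry, this, id);
    }
  }

  void WaitAll() {
    for (std::thread& t : threads_) {
      if (t.joinable()) t.join();
    }
  }

 protected:
  virtual void InternalThreadEntry() {}
  virtual void InternalThreadEntryN(size_t /*thread_id*/) {}

 private:
  // With a single thread the plain hook runs; otherwise the indexed one.
  void entry(size_t thread_id) {
    if (threads_.size() == 1) {
      InternalThreadEntry();
    } else {
      InternalThreadEntryN(thread_id);
    }
  }

  std::vector<std::thread> threads_;
};

// Plays the role of P2PSync: one thread, work done in InternalThreadEntry().
class WorkerSketch : public InternalThreadSketch {
 public:
  WorkerSketch() : InternalThreadSketch(1), ran_(false) {}
  // Join before the derived part is destroyed so the hook never runs on a
  // partially destroyed object.
  ~WorkerSketch() override { WaitAll(); }
  bool ran() const { return ran_; }

 protected:
  void InternalThreadEntry() override { ran_ = true; }

 private:
  bool ran_;
};
```

A caller would construct a WorkerSketch, call StartInternalThread(), then WaitAll(), which is the same start-then-wait shape as the two loops in P2PManager::Run().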

parallel.cpp

void P2PSync::InternalThreadEntry() {
  if (rank_ == 0) { // rank 0 holds the root_solver, which was created in train() in caffe.cpp
    Caffe::set_root_solver(true);
    solver_ = root_solver_; 
    solver_->root_add_callback(this);
  } else { // create a Solver for each of the other GPUs
    Caffe::set_root_solver(false);
    solver_.reset(caffe::SolverRegistry::CreateSolver(solver_param_, rank_, root_solver_.get()));
  }
  solver_->set_callback(this);
#ifndef CPU_ONLY
#ifdef USE_NCCL
  ncclUniqueId* nccl_id = reinterpret_cast<ncclUniqueId*>(this->aux_);
  soft_barrier();
  NCCL_CHECK(ncclCommInitRank(&nccl_comm_, nranks_, *nccl_id, rank_));
  soft_barrier();
#endif
#endif
  init_streams();
  // Call the Solver's Solve() method to start optimization.
  if (solver_->Solve()) {
    mgr_->EarlyCancel(this);
  }
}

solver.cpp: Solver::Solve() mainly calls Step() to run the iterations

bool Solver::Solve(const char* resume_file) {
  ......
  int start_iter = iter_;  
  ......
  // Core code
  // param_.max_iter() is max_iter from solver.prototxt; iter_ is initialized to 0 when the Solver is constructed.
  Step(param_.max_iter() - iter_);
  ......
  return false;
}

Now into Step()

void Solver::Step(int iters) {
  // Set the first and last iteration numbers
  const int start_iter = iter_;
  const int stop_iter = iter_ + iters;
  // The reported loss is the mean of the last average_loss losses; average_loss is set in solver.prototxt and defaults to 1.
  // losses_ stores the last average_loss losses, and smoothed_loss_ is the mean that is finally reported.
  int average_loss = this->param_.average_loss();
  losses_.clear();
  smoothed_loss_ = 0;
  const Caffe::Brew mode = Caffe::mode();
  const int solver_count = Caffe::solver_count();
  const bool root_solver = this->is_root();
  net_->set_solver(this);
#ifndef CPU_ONLY
  for (const shared_ptr<Blob>& param : net_->learnable_params()) {
    // To prevent allocations inside on_start call:
    param->allocate_data(mode == Caffe::GPU);
  }
  // Initialize the storage for the gradients of the net's learnable parameters, zeroing all of it
  net_->InitializeLearnableDiffSpace();
  // When there are multiple GPU devices
  if (solver_count > 1) { 
    // we need to sync all threads before starting, otherwise some cuda init,
    // malloc or other cuda stuff could interlock with in-loop cuda GPU sync
    // called in on_start.
    callback_soft_barrier();
    {
      unique_ptr<unique_lock<shared_mutex>> lock;
      if (root_solver) {
        lock.reset(new unique_lock<shared_mutex>(GPUMemory::read_write_mutex()));
      }
      callback_soft_barrier();
      // on_start() uses ncclBcast, together with some synchronization calls, to distribute the net's learnable parameters to every GPU device
      callback_->on_start(net_->learnable_params()); // -------> parallel.cpp P2PSync::on_start()
    }
    callback_soft_barrier();
    LOG(INFO) << "Starting Optimization on GPU " << Caffe::current_device();
  }
  const bool use_multi_gpu_testing = Caffe::solver_count() > 1;
  const string mgpu_str = use_multi_gpu_testing ? "[MultiGPU] " : "";
#else
  const bool use_multi_gpu_testing = false;
  const string mgpu_str;
#endif
  uint64_t random_seed = param_.random_seed() >= 0 ?
      static_cast<uint64_t>(param_.random_seed()) : Caffe::next_seed();
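  // *** Before the iteration loop, start a new thread reduce_thread_ dedicated to weight updates. It runs Solver::Reduce(), which calls Net::ReduceAndUpdate(), so that gradient reduction and weight updates across the GPUs run asynchronously, in parallel with the computation.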
  reduce_thread_.reset(new boost::thread(&Solver::Reduce, this,
      Caffe::current_device(), mode, random_seed, solver_count, root_solver));
  // Start iterating
  while (iter_ < stop_iter) {
    if (param_.snapshot_diff()) {
      net_->ClearParamDiffs();  // zero the weight gradients
    }  // we clean them in ApplyUpdate otherwise
    // Just started or restored?
    const bool first_loop = iter_ == 0 || iterations_last_ < 0;
    // Testing
    ......TestAll(); // code omitted
    const bool display = this->display();
    net_->set_debug_info(display && param_.debug_info());
    // accumulate the loss and gradient
    float loss = 0.F;
    if (first_loop) {
      iterations_last_ = iter_;
      iteration_timer_.Start();
      init_flag_.set();
    }
    iteration_start_signal();
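    // iter_size is set in solver.prototxt (default 1). Each pass computes gradients and loss on batch_size samples, and the result is averaged over iter_size passes.
    // apply_update is true only on the last of the iter_size passes, so the parameters are updated once per iter_size passes, i.e. once per iter_size * batch_size images.
    // The benefit is lower memory use than a single large batch_size. It helps when a large batch_size exhausts GPU memory (out of memory).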
    for (int i = 0; i < param_.iter_size(); ++i) {
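      // *** Forward and backward passes: forward computes the model's final output and the loss; backward computes the gradients of each layer and its parameters.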
      loss += net_->ForwardBackward(i + 1 == param_.iter_size());  
            //-------> net.cpp  Net::ForwardBackward(bool apply_update)
      if (i == 0) {
        if (first_loop) {
          iter0_flag_.set();
          net_->wait_layers_init();
        }
        iter_size_complete_ = true;
      }
    }
    loss /= param_.iter_size(); // the final loss is the average over iter_size passes
    iteration_wait();
    if (requested_early_exit_) {
      total_lapse_ += iteration_timer_.Seconds();
      break;
    }
    // average the loss across iterations for smoothed reporting
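    // Smooth the loss.
    // Caffe trains with SGD, so the whole dataset cannot be fed to the model at once,
    // and the loss on one batch may differ from the average loss over all samples.
    // Averaging the current loss with the losses of recent iterations reduces the oscillation of the reported loss.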
    UpdateSmoothedLoss(loss, start_iter, average_loss);
    if (display || iter_ <= 2 || iter_ + 1 >= stop_iter) {
      ...... display // code omitted
    }
    // Increment the internal iter_ counter -- its value should always indicate
    // the number of times the weights have been updated.
    ++iter_;
    SolverAction::Enum request = GetRequestedAction();
    // Save a snapshot if needed.
    if ((param_.snapshot()
         && iter_ % param_.snapshot() == 0
         && Caffe::root_solver()) ||
         (request == SolverAction::SNAPSHOT)) {
      Snapshot();
    }
    if (SolverAction::STOP == request) {
      requested_early_exit_ = true;
      total_lapse_ += iteration_timer_.Seconds();
      // Break out of training loop.
      break;
    }
  }
  Finalize();
}
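The iter_size comments in Step() claim that accumulating over several small passes matches one large batch. A small worked check of that claim follows, as a sketch under assumptions: the toy loss 0.5 * (w * x - y)^2 with a single scalar weight, equally sized micro-batches, and the hypothetical helper names mean_grad and accumulated_grad.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean gradient of 0.5 * (w * x - y)^2 with respect to w over samples
// [begin, end).
inline double mean_grad(double w, const std::vector<double>& x,
                        const std::vector<double>& y, size_t begin, size_t end) {
  assert(begin < end && end <= x.size() && x.size() == y.size());
  double g = 0.0;
  for (size_t i = begin; i < end; ++i) g += (w * x[i] - y[i]) * x[i];
  return g / static_cast<double>(end - begin);
}

// Accumulate the gradient over iter_size micro-batches of batch_size
// samples each, then normalize by iter_size.
inline double accumulated_grad(double w, const std::vector<double>& x,
                               const std::vector<double>& y,
                               size_t batch_size, size_t iter_size) {
  double diff = 0.0;
  for (size_t i = 0; i < iter_size; ++i) {
    diff += mean_grad(w, x, y, i * batch_size, (i + 1) * batch_size);
  }
  return diff / static_cast<double>(iter_size);
}
```

With four samples, two passes of batch_size 2 give the same gradient as one pass over all four, which is why iter_size trades time for memory without changing the update (for layers whose output does not depend on batch statistics).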

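Step() implements the asynchronous model in which computation and weight updates overlap across the GPUs, in two ways:
1. Before the iteration loop it starts a new thread, reduce_thread_, dedicated to weight updates. The thread runs Solver::Reduce(), which calls Net::ReduceAndUpdate(), so that gradient reduction and weight updates across the GPUs run asynchronously, in parallel with the computation.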

reduce_thread_.reset(new boost::thread(&Solver::Reduce, this, Caffe::current_device(), mode, random_seed, solver_count, root_solver));

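2. ForwardBackward(bool apply_update) in net.cpp is modified, mainly through Backward(bool apply_update). net.cpp maintains a blocking queue for the asynchronous hand-off; its elements are the ids of the parameters that need to be updated.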
BlockingQueue<int> reduction_queue_;
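The hand-off between the backward pass and the reduce thread is a producer/consumer pattern. Below is a minimal sketch of it using only the standard library. BlockingQueueSketch is a hypothetical stand-in for Caffe's BlockingQueue, the sentinel value -1 is an assumption, and the "update" is just recording the id.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

const int END_OF_ITERATION_SKETCH = -1;  // hypothetical sentinel value

class BlockingQueueSketch {
 public:
  void push(int v) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(v);
    }
    cond_.notify_one();
  }

  // Blocks until an element is available.
  int pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    cond_.wait(lock, [this] { return !queue_.empty(); });
    int v = queue_.front();
    queue_.pop();
    return v;
  }

 private:
  std::queue<int> queue_;
  std::mutex mutex_;
  std::condition_variable cond_;
};

// The calling thread plays the backward pass (producer); the spawned thread
// plays reduce_thread_ (consumer).
inline std::vector<int> run_one_iteration_sketch(const std::vector<int>& param_ids) {
  BlockingQueueSketch q;
  std::vector<int> updated;
  std::thread reducer([&q, &updated] {
    for (;;) {
      int id = q.pop();
      if (id == END_OF_ITERATION_SKETCH) break;
      updated.push_back(id);  // stands in for reduce + ApplyUpdate
    }
  });
  for (int id : param_ids) q.push(id);  // one push per finished gradient
  q.push(END_OF_ITERATION_SKETCH);
  reducer.join();
  assert(updated.size() == param_ids.size());
  return updated;
}
```

Because the consumer starts working as soon as the first id arrives, gradients of the last layers can be reduced while earlier layers are still running backward, which is the overlap that Step() is built around.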

First the forward and backward computation, then the weight update.
net.cpp: the forward and backward functions

float Net::ForwardBackward(bool apply_update) {
  float loss;
  Forward(&loss); // forward pass
  Backward(apply_update); // backward pass
  return loss;
}

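The focus is on Backward(), which calls BackwardFromToAu(). Once the backward pass has computed the gradients of a layer, and that layer's parameters need to be updated, the ids of those parameters are pushed onto reduction_queue_.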

void Net::Backward(bool apply_update) {
  BackwardFromToAu(layers_.size() - 1, 0, apply_update);
}
void Net::BackwardFromToAu(int start, int end, bool apply_update) {
  for (int i = start; i >= end; --i) {
    // Run the backward pass layer by layer; each layer's own Backward() computes its gradients.
    layers_[i]->Backward(top_vecs_[i], bottom_need_backward_[i], bottom_vecs_[i]);
    if (!apply_update) {
      continue;
    }
    for (int j = 0; j < layers_[i]->blobs().size(); ++j) {
      if (layers_[i]->skip_apply_update(j)) {
        continue;
      }
      int param_id = layer_index_params_[make_pair(i, j)];
      if (param_owners_[param_id] < 0) {
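        // Once a layer has been processed and its parameters need updating, push the ids of those parameters onto reduction_queue_.
        // For example, LeNet has 8 learnable parameter blobs (the weights and biases of conv1, conv2, ip1 and ip2), with ids 0-7, so 0-7 are pushed onto reduction_queue_. AlexNet has 16, with ids 0-15, so 0-15 are pushed.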
        reduction_queue_.push(learnable_param_ids_[param_id]); 
      }  // leave it to the owner otherwise
    }
  }
  if (apply_update) {
    // After batch_size * iter_size images have been processed, push the END_OF_ITERATION marker
    reduction_queue_.push(END_OF_ITERATION);
  }
}
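The order in which ids reach the queue follows directly from the two loops in BackwardFromToAu(). The sketch below replays those loops for a toy net, assuming that parameter ids are assigned in forward layer order and that no parameters are shared or skipped; backward_enqueue_order and kEndMarkerSketch are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int kEndMarkerSketch = -1;  // hypothetical stand-in for END_OF_ITERATION

// params_per_layer[i] is the number of learnable blobs owned by layer i.
// Assumption: parameter ids are assigned in forward layer order.
inline std::vector<int> backward_enqueue_order(const std::vector<int>& params_per_layer,
                                               bool apply_update) {
  std::vector<int> first_id(params_per_layer.size());
  int next = 0;
  for (size_t i = 0; i < params_per_layer.size(); ++i) {
    assert(params_per_layer[i] >= 0);
    first_id[i] = next;
    next += params_per_layer[i];
  }
  std::vector<int> pushed;
  for (int i = static_cast<int>(params_per_layer.size()) - 1; i >= 0; --i) {
    // layers_[i]->Backward(...) would run here.
    if (!apply_update) continue;
    for (int j = 0; j < params_per_layer[i]; ++j) {
      pushed.push_back(first_id[i] + j);
    }
  }
  if (apply_update) pushed.push_back(kEndMarkerSketch);
  return pushed;
}
```

For a net with a two-blob layer, a parameter-free layer and another two-blob layer, the queue receives 2, 3, 0, 1 and then the end marker: layers arrive last to first, while the blobs inside one layer arrive in ascending order. When apply_update is false (all but the last of the iter_size passes) nothing is pushed.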

Now the parameter update. The reduce_thread_ thread is responsible for updating the weights and calls Reduce() in solver.cpp

void Solver::Reduce(int device, Caffe::Brew mode, uint64_t random_seed,
    int solver_count, bool root_solver) {
  Caffe::set_mode(mode);
#ifndef CPU_ONLY
  if (Caffe::mode() == Caffe::GPU) {
    CUDA_CHECK(cudaSetDevice(device));
#ifndef NO_NVML
    nvml::setCpuAffinity(rank_);
#endif
  }
#endif
  Caffe::set_random_seed(random_seed);
  Caffe::set_solver_count(solver_count);
  Caffe::set_root_solver(root_solver);
  net_->ReduceAndUpdate(); // ---------> net.cpp  Net::ReduceAndUpdate()
}

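Next, ReduceAndUpdate() in net.cpp.
With multiple GPUs, a reduction must run after every iteration. For better performance, multiple layers are grouped into buckets. The Net parameter reduce_buckets sets the approximate number of buckets (default 6).
reduce_buckets is defined in caffe.proto: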

  // While using multiple GPUs we have to run reduction process after every iteration.
  // For better performance we unify multiple layers in buckets.
  // This parameter sets approximate number of buckets to combine layers to.
  // Default value is good for majority of nets.
  optional int32 reduce_buckets = 18 [default = 6];

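This parameter is then used to derive max_params_per_bucket (the maximum number of parameters per bucket) and bucket_space_count (the approximate size of one bucket, in elements), which determine how many pending parameters must accumulate from reduction_queue_ before a weight update is triggered.
The ReduceAndUpdate() thread keeps popping elements from reduction_queue_ and tracks the total size of the parameters received so far (received_count). When pending parameters are present and certain conditions hold (for example received_count >= bucket_space_count; see the if statement in the code for the full set), it calls the reduction function ReduceBucket() and then ApplyUpdate() of the instantiated solver (for example the implementation in sgd_solver.cpp) to update the parameters.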
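The two derived limits are plain arithmetic and can be checked by hand. The sketch below repeats the same integer and float operations as the source; bucket_limits and BucketLimitsSketch are hypothetical names, and the element total of 1600 used in the example is a made-up figure, not AlexNet's real size.

```cpp
#include <cassert>
#include <cstddef>

struct BucketLimitsSketch {
  int max_params_per_bucket;
  size_t bucket_space_count;
};

// num_params            ~ learnable_params_.size()
// learnable_space_count ~ learnable_space_count_ (total elements)
// reduce_buckets        ~ the reduce_buckets net parameter
inline BucketLimitsSketch bucket_limits(size_t num_params,
                                        size_t learnable_space_count,
                                        int reduce_buckets) {
  assert(reduce_buckets > 0 && num_params > 0);
  BucketLimitsSketch r;
  // Integer division, clamped to at least one parameter per bucket.
  r.max_params_per_bucket = static_cast<int>(num_params + 1UL) / reduce_buckets;
  if (r.max_params_per_bucket < 1) r.max_params_per_bucket = 1;
  // Average elements per parameter times parameters per bucket.
  r.bucket_space_count = static_cast<size_t>(
      static_cast<float>(learnable_space_count + 1UL) / num_params *
      r.max_params_per_bucket);
  return r;
}
```

With 16 parameters, 1600 elements and reduce_buckets = 6, this gives max_params_per_bucket = 17 / 6 = 2 and bucket_space_count = 1601 / 16 * 2 = 200 after truncation. A net with only 4 parameters hits the clamp and gets one parameter per bucket.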

The Net::ReduceAndUpdate() function

void Net::ReduceAndUpdate() {
#ifndef CPU_ONLY
  cudaStream_t stream;
  CUBLAS_CHECK(cublasGetStream(handle, &stream));
  int max_params_per_bucket = 0;
  size_t bucket_space_count = 0UL;
  if (Caffe::solver_count() > 1) {
    CHECK_GT(reduce_buckets_, 0);
    max_params_per_bucket = (int) (learnable_params_.size() + 1UL) / (int) reduce_buckets_; // maximum number of parameters per bucket; e.g. AlexNet has 16 parameters, so with reduce_buckets = 6 this is 17 / 6 = 2 (integer division)
    if (max_params_per_bucket < 1) {
      max_params_per_bucket = 1;
    }
    bucket_space_count =
        size_t((float)(learnable_space_count_ + 1UL) /
            learnable_params_ptrs_.size() * max_params_per_bucket); // approximate size of one bucket, in elements
  }
  int id_from = -1, id_to = -1;
  size_t received_count = 0U; // total size (in elements) of the pending parameters popped from reduction_queue_
  std::list<int> au_ids;
#endif
  const bool clear_grads = !solver_->param().snapshot_diff();
  while (true) {
    int param_id = reduction_queue_.pop(); // pop the next element from reduction_queue_
    SolverAction::Enum request = solver_->GetRequestedAction();
    if (SolverAction::STOP == request) {
#ifndef CPU_ONLY
      CUDA_CHECK(cudaStreamSynchronize(stream));
#endif
      solver_->request_early_exit();
      break;
    }
    if (param_id == END_OF_BATCH) {
#ifndef CPU_ONLY
      CUDA_CHECK(cudaStreamSynchronize(stream));
#endif
      break;
    }
    if (param_id != END_OF_ITERATION) {
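      if (Caffe::solver_count() > 1) { // multiple GPUs
#ifndef CPU_ONLY
        if (max_params_per_bucket == 1) { // each bucket holds at most one parameter
          Reduce(param_id); // Calls the Reduce() reduction function. I do not fully understand this step: after it, the code below still goes on to call ReduceBucket(), so I am not sure why it is needed.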
        }
#else
        NO_GPU;
#endif
      } else { // Caffe::solver_count() <= 1, i.e. CPU only or a single GPU: call the ApplyUpdate() weight-update function directly
        if (global_grad_scale_ != 1.F) {
          this->learnable_params()[param_id]->scale_diff(1.F/global_grad_scale_, handle, true);
        }
        solver_->ApplyUpdate(param_id, handle, clear_grads);
        continue;
      }
    }
#ifndef CPU_ONLY
    //
    if (learnable_params_.size() > 0 && Caffe::solver_count() > 1) {
      // Is bucket big enough? Done with iteration? Next param_id doesn't fit?
      // Type changed?
      if (received_count >= bucket_space_count ||
          (param_id == END_OF_ITERATION && id_from != -1) || // leftovers
          (id_from != -1 && param_id < id_from - 1) ||
          (id_to != -1 && param_id > id_to + 1) ||
          (id_from != -1 && learnable_params_[id_from]->diff_type()
                         != learnable_params_[param_id]->diff_type())) {
        Type dtype = learnable_params_[id_from]->diff_type();
        size_t count = 0U;
        for (int i = id_from; i <= id_to; ++i) {
          count += even(learnable_params_[i]->count());
        }
        ReduceBucket(count, dtype, learnable_params_ptrs_[id_from]); // call the ReduceBucket() bucket reduction function
        for (int i : au_ids) {
          if (global_grad_scale_ != 1.F) {
            this->learnable_params()[i]->scale_diff(1.F/ global_grad_scale_, handle, true);
          }
          solver_->ApplyUpdate(i, handle, clear_grads); // call the ApplyUpdate() weight-update function
        }
        au_ids.clear();
        // After reducing and updating, if the iteration is not over, restart id_from and id_to at the current param_id and reset received_count to its size.
        if (param_id != END_OF_ITERATION) {
          id_from = id_to = param_id;
          received_count = (size_t) even(learnable_params_[param_id]->count());
          au_ids.emplace_back(param_id);
        }
      } else if (param_id != END_OF_ITERATION) { // update conditions not met and the iteration is not over: extend id_from / id_to and accumulate received_count
        if (id_from == -1 || param_id < id_from) {
          id_from = param_id;
        }
        if (id_to == -1 || param_id > id_to) {
          id_to = param_id;
        }
        received_count += even(learnable_params_[param_id]->count());
        au_ids.emplace_back(param_id);
      }
    }
#endif
    // End of iteration
    if (param_id == END_OF_ITERATION) {
#ifndef CPU_ONLY
      CUDA_CHECK(cudaStreamSynchronize(stream));
      received_count = 0U;
      id_from = id_to = -1;
      au_ids.clear();
#endif
      solver_->iteration_complete_signal();
    }
  }
  DLOG(INFO) << "[" << Caffe::current_device() << "] Leaving ReduceAndUpdate thread";
}
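The bucket bookkeeping in ReduceAndUpdate() is easier to follow when replayed on a short id sequence. The following sketch keeps only the range and size conditions of the if statement; the diff-type check, the per-parameter Reduce() path and all CUDA calls are left out. simulate_buckets, BucketSim and kEndOfIterationSim are hypothetical names, and even_sim assumes that even() rounds a count up to the next even number.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int kEndOfIterationSim = -1;  // hypothetical sentinel value

// Assumption: even() rounds an element count up to the next even number.
inline size_t even_sim(size_t n) { return n + (n & 1U); }

struct BucketSim {
  int id_from;
  int id_to;
  size_t count;  // elements handed to one ReduceBucket() call
};

// Replays the bucket logic for a sequence of popped ids and returns the
// buckets in the order they would be reduced.
inline std::vector<BucketSim> simulate_buckets(const std::vector<size_t>& param_counts,
                                               const std::vector<int>& arrivals,
                                               size_t bucket_space_count) {
  std::vector<BucketSim> flushed;
  int id_from = -1, id_to = -1;
  size_t received_count = 0;
  for (int param_id : arrivals) {
    const bool end = (param_id == kEndOfIterationSim);
    assert(end || (param_id >= 0 &&
                   static_cast<size_t>(param_id) < param_counts.size()));
    // Guarding on id_from != -1 keeps the sketch safe for an empty bucket.
    const bool flush = id_from != -1 &&
        (received_count >= bucket_space_count || end ||
         param_id < id_from - 1 || param_id > id_to + 1);
    if (flush) {
      BucketSim b = {id_from, id_to, 0};
      for (int i = id_from; i <= id_to; ++i) b.count += even_sim(param_counts[i]);
      flushed.push_back(b);  // ReduceBucket() + ApplyUpdate() would run here
      id_from = id_to = -1;
      received_count = 0;
    }
    if (!end) {
      // Start a new bucket or extend the current contiguous id range.
      if (id_from == -1 || param_id < id_from) id_from = param_id;
      if (id_to == -1 || param_id > id_to) id_to = param_id;
      received_count += even_sim(param_counts[param_id]);
    }
  }
  return flushed;
}
```

With parameter sizes {10, 3, 20, 5} and arrivals 2, 3, 0, 1, end, a generous size limit yields two buckets, [2, 3] with 26 elements and [0, 1] with 14: id 0 is not adjacent to the range [2, 3], so it forces a flush. Lowering the limit to 20 elements splits the first range and gives three buckets.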

The function above uses two variables, learnable_space_count_ and learnable_params_ptrs_, which are set by Net::InitializeLearnableDiffSpace(). That function comes first.

void Net::InitializeLearnableDiffSpace() {
  learnable_space_count_ = 0;
  size_t workspace_size = 0UL;
  // vector<void*> learnable_params_ptrs_;
  // vector<shared_ptr<Blob>> learnable_params_
  learnable_params_ptrs_.resize(learnable_params_.size());
  for (int i = 0; i < learnable_params_.size(); ++i) {
    learnable_params_[i]->lock_diff();
    learnable_space_count_ += even(learnable_params_[i]->count()); // learnable_space_count_ holds the total number of parameter elements
    workspace_size += even(learnable_params_[i]->count()) *
        tsize(learnable_params_[i]->diff_type());  // workspace_size is the total number of bytes taken by all parameter gradients
  }
  // Size have at least one byte, otherwise cudaMalloc fails if net has no
  // learnable parameters. Times two.
  if (workspace_size < 2) {
    workspace_size = 2;
  }
  // GPUMemory::Workspace learnable_space_;
  learnable_space_.reserve(workspace_size); // allocate workspace_size bytes of GPU memory for learnable_space_ (gpu_memory.hpp)
  unsigned char* ptr = reinterpret_cast<unsigned char*>(learnable_space_.data());  // pointer to the start of learnable_space_
  caffe_gpu_memset(workspace_size, 0, ptr);  // zero-initialize the space
  for (int i = 0; i < learnable_params_.size(); ++i) {
    learnable_params_[i]->set_gpu_diff(static_cast<void*>(ptr));
    learnable_params_ptrs_[i] = ptr; // learnable_params_ptrs_ records the start address of each parameter's gradient
    ptr += even(learnable_params_[i]->count()) * tsize(learnable_params_[i]->diff_type()); // advance ptr to the next parameter
  }
}
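Packing all gradients into one contiguous buffer is what lets ReduceBucket() reduce a whole id range with a single call that starts at learnable_params_ptrs_[id_from]. The layout arithmetic can be sketched on the host as below. plan_layout, LayoutSketch and even_up are hypothetical names, offsets stand in for the device pointers, and even_up assumes that even() rounds up to the next even count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: even() rounds an element count up to the next even number.
inline size_t even_up(size_t n) { return n + (n & 1U); }

struct LayoutSketch {
  size_t total_elements;             // analogue of learnable_space_count_
  size_t total_bytes;                // analogue of workspace_size
  std::vector<size_t> byte_offsets;  // analogue of learnable_params_ptrs_
};

// counts[i] is the element count of parameter i, elem_sizes[i] the size in
// bytes of its gradient type (e.g. 2 for float16, 4 for float).
inline LayoutSketch plan_layout(const std::vector<size_t>& counts,
                                const std::vector<size_t>& elem_sizes) {
  assert(counts.size() == elem_sizes.size());
  LayoutSketch out;
  out.total_elements = 0;
  out.total_bytes = 0;
  for (size_t i = 0; i < counts.size(); ++i) {
    out.byte_offsets.push_back(out.total_bytes);  // where parameter i starts
    out.total_elements += even_up(counts[i]);
    out.total_bytes += even_up(counts[i]) * elem_sizes[i];
  }
  // Same lower bound as the source, so the allocation is never empty.
  if (out.total_bytes < 2) out.total_bytes = 2;
  return out;
}
```

For counts {5, 4, 3} with element sizes {4, 2, 4}, the padded counts are 6, 4 and 4, so the parameters start at byte offsets 0, 24 and 32 and the buffer is 48 bytes long. Since a bucket is reduced with one element count and one data type, a change of gradient type has to end the bucket, which matches the diff_type() condition in ReduceAndUpdate().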

gpu_memory.hpp: reserve() of the Workspace struct, which in turn calls try_reserve().

    void reserve(size_t size, int device = current_device()) {
      if (!try_reserve(size, device))  
      {
        LOG(FATAL) << "Out of memory: failed to allocate " << size
            << " bytes on device " << device;
      }
    }

gpu_memory.cpp

bool GPUMemory::Workspace::try_reserve(size_t size, int device) {
  bool status = true;
  if (size > size_ || ptr_ == nullptr) {
    release();
    if (device != INVALID_DEVICE) {
      device_ = device;  // switch from default to specific one
    }
    status = mgr_.try_allocate(&ptr_, size, device_); // call try_allocate() to allocate memory on the specified device
    if (status) {
      CHECK_NOTNULL(ptr_);
      size_ = size;
    }
  }
  return status;
}
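try_reserve() follows a grow-only policy: it reallocates only when the request exceeds the current size or nothing has been allocated yet. A host-memory sketch of that policy follows, with std::malloc standing in for the GPU memory manager; WorkspaceSketch and its allocations() counter are hypothetical and exist only to make the behaviour observable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

class WorkspaceSketch {
 public:
  WorkspaceSketch() : ptr_(nullptr), size_(0), allocations_(0) {}
  ~WorkspaceSketch() { release(); }
  WorkspaceSketch(const WorkspaceSketch&) = delete;
  WorkspaceSketch& operator=(const WorkspaceSketch&) = delete;

  // Reallocate only if the request does not fit in the current buffer.
  bool try_reserve(size_t size) {
    assert(size > 0);
    bool status = true;
    if (size > size_ || ptr_ == nullptr) {
      release();
      ptr_ = std::malloc(size);  // stands in for mgr_.try_allocate(...)
      status = (ptr_ != nullptr);
      if (status) {
        size_ = size;
        ++allocations_;
      }
    }
    return status;
  }

  void release() {
    std::free(ptr_);
    ptr_ = nullptr;
    size_ = 0;
  }

  void* data() const { return ptr_; }
  size_t size() const { return size_; }
  int allocations() const { return allocations_; }

 private:
  void* ptr_;
  size_t size_;
  int allocations_;
};
```

Reserving 64 bytes and then 32 bytes triggers a single allocation, and the buffer stays at 64 bytes. Only a larger request, such as 128 bytes, releases the old buffer and allocates again. The old contents are not preserved across a regrow, which is consistent with InitializeLearnableDiffSpace() zeroing the space right after reserving it.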

In the end the try_allocate function in gpu_memory.cpp performs the allocation. That code is not covered here.

bool GPUMemory::Manager::try_allocate(void** ptr, size_t size, int device, int group)

Back to the reduction functions Reduce() and ReduceBucket()

#ifndef CPU_ONLY
void Net::Reduce(int param_id) {
  solver_->callback()->reduce_barrier();
  {
    unique_ptr<unique_lock<shared_mutex>> lock;
    if (solver_->is_root()) {
      lock.reset(new unique_lock<shared_mutex>(GPUMemory::read_write_mutex()));
    }
    solver_->callback()->reduce_barrier();
    solver_->callback()->allreduce(param_id); //-------->solver.hpp virtual void allreduce(int param_id) = 0
                    // -------->parallel.cpp  P2PSync::allreduce(int param_id)
    solver_->callback()->reduce_barrier();
  }
  this->learnable_params()[param_id]->gpu_scale_diff(1.F / Caffe::solver_count(),
      solver_->callback()->cublas_handle(), true);
  // Also need to barrier to make sure lock isn't undone
  // until all have completed, but the current nature of
  // NCCL makes this unnecessary.
  // solver_->callback()->reduce_barrier();
}
void Net::ReduceBucket(size_t count, Type bucket_type, void* bucket) {
  solver_->callback()->reduce_barrier();
  {
    unique_ptr<unique_lock<shared_mutex>> lock;
    if (solver_->is_root()) {
      lock.reset(new unique_lock<shared_mutex>(GPUMemory::read_write_mutex()));
    }
    solver_->callback()->reduce_barrier();
    solver_->callback()->allreduce_bucket(count, bucket, bucket_type);  //-------->solver.hpp virtual void allreduce_bucket(int count, void* bucket, Type type) = 0
// -------->parallel.cpp  P2PSync::allreduce_bucket(int count, void* bucket, Type type)
    solver_->callback()->reduce_barrier();
  }
  Tensor::gpu_scal(count, bucket_type, bucket, 1.F / Caffe::solver_count(),
      solver_->callback()->cublas_handle(), true);
}
#endif

parallel.cpp: allreduce() and allreduce_bucket()

void P2PSync::allreduce(int param_id) {
#ifndef CPU_ONLY
#ifdef USE_NCCL
  const shared_ptr<Blob>& param = solver_->net()->learnable_params()[param_id];
  NCCL_CHECK(ncclAllReduce(param->current_diff_memory(true),
      param->current_mutable_diff_memory(true),
      even(param->count()),
      nccl::nccl_type(param->diff_type()),
      ncclSum,
      nccl_comm_,
      comm_stream_->get()));
  CUDA_CHECK(cudaStreamSynchronize(comm_stream_->get()));
#endif  // USE_NCCL
#endif  // CPU_ONLY
}
void P2PSync::allreduce_bucket(int count, void* bucket, Type type) {
#ifndef CPU_ONLY
#ifdef USE_NCCL
  NCCL_CHECK(ncclAllReduce(bucket, bucket, count, nccl::nccl_type(type),
                           ncclSum, nccl_comm_, comm_stream_->get()));
  CUDA_CHECK(cudaStreamSynchronize(comm_stream_->get()));
#endif  // USE_NCCL
#endif  // CPU_ONLY
}

ncclAllReduce() is declared in nccl.h and performs all-reduce communication among multiple GPUs.

/* Reduction opperation selector */
typedef enum { ncclSum        = 0,
               ncclProd       = 1,
               ncclMax        = 2,
               ncclMin        = 3,
               nccl_NUM_OPS   = 4 } ncclRedOp_t;
/* Reduces data arrays of length count in sendbuff using op operation, and leaves
 * identical copies of result on each GPUs recvbuff.
 * Sendbuff and recvbuff are assumed to reside on the same device.
 * Must be called separately for each communicator in communicator clique. */
ncclResult_t  ncclAllReduce(const void* sendbuff, void* recvbuff, int count,
    ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream);
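Reduce() and ReduceBucket() both follow the same recipe: a sum all-reduce, then a scale by 1 / solver_count, so every GPU ends up with the average gradient. The effect can be simulated on the CPU with one vector per rank. This sketch makes no NCCL calls; allreduce_sum_sketch and average_gradients_sketch are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place sum all-reduce over per-rank buffers, simulated on the CPU:
// afterwards every rank holds the element-wise sum (the ncclSum semantics).
inline void allreduce_sum_sketch(std::vector<std::vector<float> >& ranks) {
  assert(!ranks.empty());
  const size_t n = ranks[0].size();
  std::vector<float> sum(n, 0.0f);
  for (const std::vector<float>& r : ranks) {
    assert(r.size() == n);
    for (size_t i = 0; i < n; ++i) sum[i] += r[i];
  }
  for (std::vector<float>& r : ranks) r = sum;
}

// Sum all-reduce followed by scaling with 1 / number of ranks, mirroring
// the gpu_scale_diff / gpu_scal call after the all-reduce.
inline void average_gradients_sketch(std::vector<std::vector<float> >& ranks) {
  allreduce_sum_sketch(ranks);
  const float scale = 1.0f / static_cast<float>(ranks.size());
  for (std::vector<float>& r : ranks) {
    for (float& v : r) v *= scale;
  }
}
```

With two ranks holding {1, 2} and {3, 6}, both end up with {2, 4}. Since every replica then applies the same update to the same weights, the replicas stay in step without a separate weight broadcast after each iteration.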

The weight-update function: ApplyUpdate() of the SGDSolver class

The theory and formulas behind SGD are skipped here; only the code is analyzed.

sgd_solver.cpp

template<typename Dtype>
void SGDSolver<Dtype>::ApplyUpdate(int param_id, void* handle, bool clear_grads) {
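  // Get the learning rate for this iteration
  float rate = GetLearningRate();
  // If the L2 norm of the gradients exceeds the clip_gradients threshold, the gradients are scaled down so that the norm equals the threshold.
  // clip_gradients was introduced to deal with exploding gradients.
  // If the weights change too violently in a single iteration, the loss can easily diverge.
  // Intuitively, clip_gradients keeps the weight update within a reasonable range.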
  ClipGradients(handle);
  // Normalization: when iter_size is greater than 1, the gradients are also divided by iter_size
  Normalize(param_id, handle);
  // Regularization
  Regularize(param_id, handle);
  // Compute the update value
  ComputeUpdateValue(param_id, handle, rate, clear_grads);
}
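Gradient clipping by L2 norm is short enough to write out. This is a sketch of the general technique on a single flat vector, not the ClipGradients() implementation: clip_gradients_sketch is a hypothetical name, and treating a negative threshold as "disabled" is an assumption modelled on the clip_gradients solver setting.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scales diff in place so that its L2 norm does not exceed clip_gradients.
// Returns the scale factor that was applied (1.0 if nothing changed).
// Assumption: a negative threshold means clipping is disabled.
inline double clip_gradients_sketch(std::vector<double>& diff, double clip_gradients) {
  if (clip_gradients < 0.0) return 1.0;
  double sumsq = 0.0;
  for (double d : diff) sumsq += d * d;
  const double l2norm = std::sqrt(sumsq);
  if (l2norm > clip_gradients) {
    const double scale = clip_gradients / l2norm;
    for (size_t i = 0; i < diff.size(); ++i) diff[i] *= scale;
    return scale;
  }
  return 1.0;
}
```

A gradient of {3, 4} has norm 5. With a threshold of 1 it is scaled by 0.2 to roughly {0.6, 0.8}, so its direction is kept and only its length changes. With a threshold of 10 it is left untouched.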

Regularization

template<typename Dtype>
void SGDSolver<Dtype>::Regularize(int param_id, void* handle) {
  if (Caffe::mode() == Caffe::CPU) {
    // Get all the parameters to optimize
    const vector<shared_ptr<Blob>>& net_params = this->net_->learnable_params();
    // Get the per-parameter weight-decay multipliers
    const vector<float>& net_params_weight_decay = this->net_->params_weight_decay();
    // Get the global weight decay of the model
    float weight_decay = this->param_.weight_decay();
    // Get the regularization type, L1 or L2
    string regularization_type = this->param_.regularization_type();
    // The effective decay of a parameter is the global weight decay times that parameter's own decay multiplier
    float local_decay = weight_decay * net_params_weight_decay[param_id];
    if (local_decay) {
      if (regularization_type == "L2") {
        // add weight decay  
        // Apply the regularization; for L2 the gradient becomes diff_ = local_decay * data_ + diff_
        // caffe_axpy means ax_plus_y. i.e., Y = alpha*X + Y
        // template <typename Dtype> 
        // void caffe_axpy(const int N, const Dtype alpha, const Dtype* X, Dtype* Y);
        caffe_axpy<Dtype>(net_params[param_id]->count(), local_decay,
            net_params[param_id]->cpu_data<Dtype>(),
            net_params[param_id]->mutable_cpu_diff<Dtype>());
      } else if (regularization_type == "L1") {
        caffe_cpu_sign<Dtype>(net_params[param_id]->count(),
            net_params[param_id]->cpu_data<Dtype>(), temp_[param_id]->mutable_cpu_data());
        caffe_axpy<Dtype>(net_params[param_id]->count(), local_decay, temp_[param_id]->cpu_data(),
            net_params[param_id]->mutable_cpu_diff<Dtype>());
      } else {
        LOG(FATAL) << "Unknown regularization type: " << regularization_type;
      }
    }
  } else if (Caffe::mode() == Caffe::GPU) {
#ifndef CPU_ONLY
    //Fused with ComputeUpdateValue
#else
    NO_GPU;
#endif
  } else {
    LOG(FATAL) << "Unknown caffe mode: " << Caffe::mode();
  }
}
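The CPU branch of Regularize() boils down to one axpy per regularization type. The following sketch writes both cases as plain loops over std::vector. regularize_sketch is a hypothetical name, and the sign convention for L1 (zero weights contribute nothing) is an assumption about caffe_cpu_sign.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Adds the weight-decay term to the gradient in place.
//   L2: diff += local_decay * data
//   L1: diff += local_decay * sign(data)
inline void regularize_sketch(const std::vector<double>& data,
                              std::vector<double>& diff,
                              double local_decay, const std::string& type) {
  assert(data.size() == diff.size());
  assert((type == "L2" || type == "L1") && "unknown regularization type");
  if (local_decay == 0.0) return;
  const bool l2 = (type == "L2");
  for (size_t i = 0; i < data.size(); ++i) {
    // sign() is -1, 0 or +1 (assumed behaviour of caffe_cpu_sign).
    const double sign = (data[i] > 0.0) - (data[i] < 0.0);
    diff[i] += local_decay * (l2 ? data[i] : sign);
  }
}
```

With weights {2, -4, 0}, gradients {0.5, 0.5, 0.5} and local_decay = 0.5, L2 yields {1.5, -1.5, 0.5} while L1 yields {1.0, 0.0, 0.5}: L2 pulls harder on large weights, L1 pulls with constant strength regardless of magnitude.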

Computing the update value

template<typename Dtype>
void
SGDSolver<Dtype>::ComputeUpdateValue(int param_id, void* handle, float rate, bool clear_grads) {
  shared_ptr<Blob> param = this->net_->learnable_params()[param_id];
  // history_ holds the previous update value (the momentum buffer)
  shared_ptr<TBlob<Dtype>> history = history_[param_id];
  // Get the vector of learning-rate multipliers for all parameters
  const vector<float>& net_params_lr = this->net_->params_lr();
  // Get the momentum value
  float momentum = GetMomentum();
  // The actual learning rate is the global learning rate times the parameter's lr_mult
  // local_rate = global_rate * lr_mult
  // lr_mult is the layer's learning-rate multiplier, set in train_test.prototxt
  float local_rate = rate * net_params_lr[param_id];
  // Compute the update to history, then copy it to the parameter diff.
  if (Caffe::mode() == Caffe::CPU) {
    // history_ = learning_rate*diff_ + momentum*history_
    caffe_cpu_axpby<Dtype>(param->count(), local_rate, param->cpu_diff<Dtype>(), momentum,
        history->mutable_cpu_data());
    // Copy the newly computed update value (history) into the diff_ of the parameter Blob
    caffe_copy<Dtype>(param->count(), history->cpu_data(), param->mutable_cpu_diff<Dtype>());
    param->Update(); // apply the update: data_ -= diff_
    if (clear_grads) {
      param->set_diff(0.F);
    }
  } else if (Caffe::mode() == Caffe::GPU) {
#ifndef CPU_ONLY
    const std::string& regularization_type = this->param_.regularization_type();
    const float decay = local_decay(param_id);
    const Type gtype = param->diff_type();
    // Call sgd_reg_update_all_and_clear_gpu()
    if (gtype == tp<float16>()) {
      sgd_reg_update_all_and_clear_gpu<float16, Dtype>(param->count(),
          param->mutable_gpu_diff<float16>(),
          param->mutable_gpu_data<Dtype>(),
          history->mutable_gpu_data(),
          momentum, local_rate, regularization_type, decay,  handle, clear_grads);
    } else if (gtype == tp<float>()) {
      sgd_reg_update_all_and_clear_gpu<float, Dtype>(param->count(),
          param->mutable_gpu_diff<float>(),
          param->mutable_gpu_data<Dtype>(),
          history->mutable_gpu_data(),
          momentum, local_rate, regularization_type, decay,  handle, clear_grads);
    } else if (gtype == tp<double>()) {
      sgd_reg_update_all_and_clear_gpu<double, Dtype>(param->count(),
          param->mutable_gpu_diff<double>(),
          param->mutable_gpu_data<Dtype>(),
          history->mutable_gpu_data(),
          momentum, local_rate, regularization_type, decay,  handle, clear_grads);
    } else {
      LOG(FATAL) << "Gradient type " << Type_Name(gtype) << " is not supported";
    }
#else
    NO_GPU;
#endif
  } else {
    LOG(FATAL) << "Unknown caffe mode: " << Caffe::mode();
  }
}
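The CPU branch of ComputeUpdateValue() is the classic momentum update, and its three calls collapse into one loop. The sketch below follows that branch step by step; sgd_update_sketch is a hypothetical name, and the subtraction stands in for Blob::Update().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One momentum-SGD step on the CPU, mirroring the three calls in the source:
//   history = local_rate * diff + momentum * history   (caffe_cpu_axpby)
//   diff    = history                                  (caffe_copy)
//   data   -= diff                                     (Blob::Update)
inline void sgd_update_sketch(std::vector<double>& data, std::vector<double>& diff,
                              std::vector<double>& history, double local_rate,
                              double momentum, bool clear_grads) {
  assert(data.size() == diff.size() && data.size() == history.size());
  for (size_t i = 0; i < data.size(); ++i) {
    history[i] = local_rate * diff[i] + momentum * history[i];
    diff[i] = history[i];
    data[i] -= diff[i];
    if (clear_grads) diff[i] = 0.0;  // ready for the next accumulation
  }
}
```

Starting from weight 1, gradient 2 and history 4 with local_rate = 0.5 and momentum = 0.5, the new history is 0.5 * 2 + 0.5 * 4 = 3 and the weight becomes 1 - 3 = -2. The previous step keeps contributing through the history term, which is what smooths the trajectory.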

Declaration of sgd_reg_update_all_and_clear_gpu():

#ifndef CPU_ONLY
template<typename Gtype, typename Wtype>
void sgd_reg_update_all_and_clear_gpu(int N,
    Gtype* g, Wtype* w, Wtype* h,
    float momentum, float local_rate, const std::string& regularization_type, float local_decay,
    void* handle, bool clear_grads);
#endif

sgd_reg_update_all_and_clear_gpu is implemented in sgd_solver.cu

template<typename Gtype, typename Wtype>
void sgd_reg_update_all_and_clear_gpu(int N,
  Gtype* g, Wtype* w, Wtype* h,
  float momentum, float local_rate, const std::string& reg_type, float local_decay,
  void* handle,  bool clear_grads) {
  cublasHandle_t cublas_handle =
      handle == nullptr ? Caffe::cublas_handle() : reinterpret_cast<cublasHandle_t>(handle);
  cudaStream_t stream;
  CUBLAS_CHECK(cublasGetStream(cublas_handle, &stream));
  // NOLINT_NEXT_LINE(whitespace/operators)
  SGDRegUpdateAllAndClear<<<CAFFE_GET_BLOCKS(N), CAFFE_CUDA_NUM_THREADS, 0, stream>>> (N,
    g, w, h,
    momentum, local_rate, local_decay, reg_type == "L2",  clear_grads);
  CUDA_POST_KERNEL_CHECK;
  CUDA_CHECK(cudaStreamSynchronize(stream));
}
