Conditional Random Fields (3): Learning and Prediction

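After two days of reading theory, it is finally time for learning and prediction. I downloaded and installed CRF++-0.58 and will walk through its source code to understand the main CRF workflow.
The CRF++ source is written in C++ and has three main native interfaces: crf_learn and crf_test for learning and prediction, and libcrfpp for calling a model from other languages. The official documentation treats the learn step as the encoder and the test step as the decoder.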

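1. The learning process

The simplest command for running crf_learn is

crf_learn template_file train_file model_file

The three arguments are the feature template, the training data, and the file in which the model is stored. Besides these three required arguments there are optional ones that control training; they map onto the remaining arguments of Encoder::learn shown below (maxitr, freq, eta, C, thread_num and so on).

In the actual program, crf_learn.cpp simply calls the crfpp_learn function.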

int main(int argc, char **argv) {
  return crfpp_learn(argc, argv);
}

crfpp_learn lives in encoder.cpp. It parses the arguments mentioned above and passes them to Encoder::learn.

bool Encoder::learn(const char *templfile,
                    const char *trainfile,
                    const char *modelfile,
                    bool textmodelfile,
                    size_t maxitr,
                    size_t freq,
                    double eta,
                    double C,
                    unsigned short thread_num,
                    unsigned short shrinking_size,
                    int algorithm)

This function first extracts features from the training data according to the feature template:

CHECK_FALSE(feature_index.open(templfile, trainfile))
      << feature_index.what();

Then it picks the algorithm to run according to the algorithm argument:

switch (algorithm) {
    case MIRA:
      if (!runMIRA(x, &feature_index, &alpha[0],
                   maxitr, C, eta, shrinking_size, thread_num)) {
        WHAT_ERROR("MIRA execute error");
      }
      break;
    case CRF_L2:
      if (!runCRF(x, &feature_index, &alpha[0],
                  maxitr, C, eta, shrinking_size, thread_num, false)) {
        WHAT_ERROR("CRF_L2 execute error");
      }
      break;
    case CRF_L1:
      if (!runCRF(x, &feature_index, &alpha[0],
                  maxitr, C, eta, shrinking_size, thread_num, true)) {
        WHAT_ERROR("CRF_L1 execute error");
      }
      break;

Finally it saves the model:

  if (!feature_index.save(modelfile, textmodelfile)) {
    WHAT_ERROR(feature_index.what());
  }

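Feature extraction happens in feature_index's open, which calls two functions, openTemplate and openTagSet. The former reads the template file and builds the templates; the latter reads the training file line by line and collects statistics on the data (though I did not see any matching against the feature templates there).

The runCRF function is also in encoder.cpp: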

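/*****
 * x: list of training sentences
 * feature_index: the object holding the collected features
 * alpha: weights (costs) of the feature functions
 * maxitr: maximum number of iterations
 * C: cost-related hyperparameter that balances overfitting against underfitting
 * eta: convergence threshold
 * shrinking_size: I have not figured out what this does
 * thread_num: number of threads
 * orthant: selects the regularizer, false for L2 and true for L1
 *****/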
bool runCRF(const std::vector<TaggerImpl* > &x,
            EncoderFeatureIndex *feature_index,
            double *alpha,
            size_t maxitr,
            float C,
            double eta,
            unsigned short shrinking_size,
            unsigned short thread_num,
            bool orthant) {

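runCRF() creates CRFEncoderThread workers according to thread_num. Once they are configured it starts them, and after each pass it computes diff, the relative change of the objective: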

double diff = (itr == 0 ? 1.0 :
                   std::abs(old_obj - thread[0].obj)/old_obj);

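It then calls lbfgs.optimize to update the parameters. Training stops when diff has been below eta three times in a row, or when the iteration count reaches maxitr.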
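The stopping rule is easy to get wrong by one, so here is a minimal sketch of it that replays a recorded list of objective values. stoppingIteration and the variable names old_obj and converge are made up for this example; it only mirrors the rule described above and is not the CRF++ loop itself.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of CRF++): replays a sequence of objective
// values and returns the iteration index at which training would stop.
std::size_t stoppingIteration(const std::vector<double> &objectives,
                              double eta, std::size_t maxitr) {
  double old_obj = 1e37;  // placeholder, never read on the first iteration
  int converge = 0;       // consecutive iterations with diff < eta
  std::size_t itr = 0;
  for (; itr < objectives.size() && itr < maxitr; ++itr) {
    const double obj = objectives[itr];
    // Relative change of the objective; fixed to 1.0 on the first pass.
    const double diff =
        (itr == 0 ? 1.0 : std::abs(old_obj - obj) / old_obj);
    old_obj = obj;
    converge = (diff < eta) ? converge + 1 : 0;
    if (converge == 3) return itr;  // three small changes in a row
  }
  return itr;  // iteration budget (or the recorded values) ran out
}
```

A single iteration with a large change resets the counter, so the three small changes really have to be consecutive.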

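2. The learning algorithm

As mentioned in the previous section, the thread that does the work is CRFEncoderThread, and it mainly does one thing: compute the gradient.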

 void run() {
    obj = 0.0;
    err = zeroone = 0;
    std::fill(expected.begin(), expected.end(), 0.0);
    for (size_t i = start_i; i < size; i += thread_num) {
      obj += x[i]->gradient(&expected[0]);
      int error_num = x[i]->eval();
      err += error_num;
      if (error_num) {
        ++zeroone;
      }
    }
  }
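The loop header in run() is what splits the training set between workers: worker start_i visits sentences start_i, start_i + thread_num, start_i + 2 * thread_num and so on, so each worker's obj and expected are only partial sums that have to be added together afterwards. A minimal sketch of that partition, with sentencesForThread as a made-up name for illustration:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: lists the sentence indices that worker `start_i`
// visits when `thread_num` workers share `size` sentences.
std::vector<std::size_t> sentencesForThread(std::size_t start_i,
                                            std::size_t size,
                                            std::size_t thread_num) {
  std::vector<std::size_t> indices;
  if (thread_num == 0) return indices;  // guard against an endless loop
  for (std::size_t i = start_i; i < size; i += thread_num) {
    indices.push_back(i);  // same stride as the loop in run()
  }
  return indices;
}
```

With 7 sentences and 3 workers, worker 0 gets sentences 0, 3 and 6, worker 1 gets 1 and 4, and worker 2 gets 2 and 5.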

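The forward-backward algorithm, the Viterbi algorithm and everything else the gradient needs live in tagger.cpp. The TaggerImpl class contains the main computation as well as the tagging and prediction work. The tagging and prediction methods are exposed to other languages, but the main computation is not (this is the case for both Java and Python). But I digress; on to the actual algorithm.
The code looks like this: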

/************
 * expected: the gradient vector
 */
double TaggerImpl::gradient(double *expected) {
  if (x_.empty()) return 0.0;

  buildLattice();  // build the lattice: link nodes and edges
  forwardbackward();  // forward-backward algorithm
  double s = 0.0;

  for (size_t i = 0;   i < x_.size(); ++i) {
    for (size_t j = 0; j < ysize_; ++j) {
      node_[i][j]->calcExpectation(expected, Z_, ysize_);  // accumulate expectations
    }
  }

  // gradient computation follows
  for (size_t i = 0;   i < x_.size(); ++i) {
    for (const int *f = node_[i][answer_[i]]->fvector; *f != -1; ++f) {
      --expected[*f + answer_[i]];
    }
    s += node_[i][answer_[i]]->cost;  // UNIGRAM cost
    const std::vector<Path *> &lpath = node_[i][answer_[i]]->lpath;
    for (const_Path_iterator it = lpath.begin(); it != lpath.end(); ++it) {
      if ((*it)->lnode->y == answer_[(*it)->lnode->x]) {
        for (const int *f = (*it)->fvector; *f != -1; ++f) {
          --expected[*f +(*it)->lnode->y * ysize_ +(*it)->rnode->y];
        }
        s += (*it)->cost;  // BIGRAM COST
        break;
      }
    }
  }

  viterbi();  // call for eval(): Viterbi algorithm

  return Z_ - s ;
}

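There are five main parts: building the graph, the forward-backward algorithm, expectation computation, gradient computation, and the Viterbi algorithm.

2.1 Building the graph

The graph is the lattice described in "Conditional Random Fields (2): Probability Computation".

Between start and stop, every possible value of Y at each position (a node) has to be connected by edges (paths). Personally I find that this graph representation makes the probabilities and expectations of the various cases (paths) more intuitive later on.
The Path (edge) data structure is as follows: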

struct Path {
  Node      *rnode;  // right node (position i)
  Node      *lnode;  // left node (position i-1)
  const int *fvector;  // ids of the features on this edge (-1 terminated)
  double     cost;  // cost of the edge

  Path() : rnode(0), lnode(0), fvector(0), cost(0.0) {}

  // for CRF
  void calcExpectation(double *expected, double, size_t) const;
  void add(Node *_lnode, Node *_rnode) ;

  void clear() {
    rnode = lnode = 0;
    fvector = 0;
    cost = 0.0;
  }
};

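The graph is built mainly by calling feature_index_->rebuildFeatures(this), which creates all nodes and edges at every position. After that, the cost of every node and of every path in each node's left-path set is computed. (The code is in void TaggerImpl::buildLattice() in tagger.cpp.)

2.2 Forward-backward algorithm
The forward-backward algorithm is simple. The code: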

void TaggerImpl::forwardbackward() {
  if (x_.empty()) {
    return;
  }

  for (int i = 0; i < static_cast<int>(x_.size()); ++i) {
    for (size_t j = 0; j < ysize_; ++j) {
      node_[i][j]->calcAlpha();  // forward recursion: alpha of every node, first position to last
    }
  }

  for (int i = static_cast<int>(x_.size() - 1); i >= 0;  --i) {
    for (size_t j = 0; j < ysize_; ++j) {
      node_[i][j]->calcBeta();  // backward recursion: beta of every node, last position to first
    }
  }

  Z_ = 0.0;
  for (size_t j = 0; j < ysize_; ++j) {
    Z_ = logsumexp(Z_, node_[0][j]->beta, j == 0);    // normalization factor Z (in log space)
  }

  return;
}

The formulas were summarized in the earlier post, so I will not repeat them here.
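The last loop of forwardbackward() folds the beta values of the first position into Z_ with logsumexp, which keeps everything in log space. As a rough standalone illustration of why that avoids overflow, here is a minimal sketch; logAdd and logPartition are names made up for this example and are not the CRF++ implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// log(exp(x) + exp(y)), computed by factoring out the larger argument so
// that std::exp never sees a large positive number.
double logAdd(double x, double y) {
  const double hi = std::max(x, y);
  const double lo = std::min(x, y);
  return hi + std::log1p(std::exp(lo - hi));
}

// Hypothetical helper: log of the sum of exp(scores[j]). The first element
// initialises the accumulator, like the `j == 0` flag passed to logsumexp.
// Returns 0.0 for an empty input.
double logPartition(const std::vector<double> &scores) {
  double z = 0.0;
  for (std::size_t j = 0; j < scores.size(); ++j) {
    z = (j == 0) ? scores[j] : logAdd(z, scores[j]);
  }
  return z;
}
```

Calling logPartition on two scores of 1000 gives 1000 + log 2, whereas the naive log(exp(1000) + exp(1000)) overflows to infinity.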

2.3 Expectation computation
The code:

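/********
 * expected: the vector that will hold the gradient; it starts out as the expectation, so it doubles as the storage for the expectation.
 * Z: normalization factor
 * size: number of values y can take.
 */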
void Node::calcExpectation(double *expected, double Z, size_t size) const {
  const double c = std::exp(alpha + beta - cost - Z);   // marginal probability of this node
  for (const int *f = fvector; *f != -1; ++f) {
    expected[*f + y] += c;   // a feature that fires has value 1, so p*1 = p; summing c gives the expected feature count.
  }
  for (const_Path_iterator it = lpath.begin(); it != lpath.end(); ++it) {
    (*it)->calcExpectation(expected, Z, size);  // add the probability of each incoming edge to the expectation.
  }
}

The formulas for this were also summarized earlier.
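To make the exp(alpha + beta - cost - Z) expression concrete, here is a minimal sketch on a toy lattice stored as plain matrices instead of Node and Path objects. nodeMarginals, the Matrix alias and the node/edge score layout are assumptions made for this example; the one thing taken from the code above is that alpha and beta both include the node's own score, which is why it is subtracted once.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// log(exp(x) + exp(y)) without overflow.
double logAdd(double x, double y) {
  const double hi = std::max(x, y);
  const double lo = std::min(x, y);
  return hi + std::log1p(std::exp(lo - hi));
}

// Hypothetical toy lattice: node[i][j] is the score of label j at position i,
// edge[k][j] the score of moving from label k to label j.
// Returns marginal[i][j] = P(y_i = j | x).
Matrix nodeMarginals(const Matrix &node, const Matrix &edge) {
  const std::size_t n = node.size();
  if (n == 0 || node[0].empty()) return Matrix();
  const std::size_t L = node[0].size();
  Matrix alpha(n, std::vector<double>(L)), beta(n, std::vector<double>(L));

  for (std::size_t i = 0; i < n; ++i) {  // forward pass
    for (std::size_t j = 0; j < L; ++j) {
      double a = 0.0;
      for (std::size_t k = 0; i > 0 && k < L; ++k) {
        const double v = alpha[i - 1][k] + edge[k][j];
        a = (k == 0) ? v : logAdd(a, v);
      }
      alpha[i][j] = a + node[i][j];  // node score added last
    }
  }
  for (std::size_t i = n; i-- > 0;) {  // backward pass
    for (std::size_t j = 0; j < L; ++j) {
      double b = 0.0;
      for (std::size_t k = 0; i + 1 < n && k < L; ++k) {
        const double v = beta[i + 1][k] + edge[j][k];
        b = (k == 0) ? v : logAdd(b, v);
      }
      beta[i][j] = b + node[i][j];
    }
  }
  double Z = beta[0][0];  // log of the normalization factor
  for (std::size_t j = 1; j < L; ++j) Z = logAdd(Z, beta[0][j]);

  Matrix marginal(n, std::vector<double>(L));
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < L; ++j) {
      // alpha and beta both contain node[i][j], so subtract it once.
      marginal[i][j] = std::exp(alpha[i][j] + beta[i][j] - node[i][j] - Z);
    }
  }
  return marginal;
}
```

With two positions, two labels, all edge scores 0 and a score of log 3 on label 0 at the first position, the marginal of that node comes out as 3/4, and the marginals at every position sum to 1.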

2.4 Gradient computation

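This is the theory!
The gradient is computed to prepare for optimization, so here is a summary of gradient computation and the optimization algorithms.
《統計學習方法》 (Statistical Learning Methods) describes two ways of training a CRF: improved iterative scaling (IIS) and quasi-Newton methods.
In improved iterative scaling the log-likelihood is

$$L(w) = L_{\tilde P}(P_w) = \log \prod_{x,y} P_w(y|x)^{\tilde P(x,y)} = \sum_{x,y} \tilde P(x,y) \log P_w(y|x)$$

where $\tilde P(x,y)$ is the empirical joint distribution of the training set.
IIS is the optimization algorithm used to train maximum entropy models, and the CRF probability model is very similar to the maximum entropy model.
Let the current parameter vector be $w = (w_1, \dots, w_K)^T$ and the current increment vector be $\delta = (\delta_1, \dots, \delta_K)^T$, so that after this update the parameter vector becomes $w + \delta$. Writing $f_k(y,x) = \sum_i f_k(y_{i-1}, y_i, x, i)$ for feature $k$ summed over all positions, the change in the log-likelihood is:

$$L(w+\delta) - L(w) = \sum_{x,y} \tilde P(x,y) \sum_{k=1}^{K} \delta_k f_k(y,x) - \sum_x \tilde P(x) \log \frac{Z_{w+\delta}(x)}{Z_w(x)}$$

Using the inequality

$$-\log \alpha \ge 1 - \alpha, \quad \alpha > 0$$

we get a lower bound on the improvement of the log-likelihood:

$$L(w+\delta) - L(w) \ge \sum_{x,y} \tilde P(x,y) \sum_{k=1}^{K} \delta_k f_k(y,x) + 1 - \sum_x \tilde P(x) \sum_y P_w(y|x) \exp \sum_{k=1}^{K} \delta_k f_k(y,x)$$

Optimizing this lower bound also raises the log-likelihood. After relaxing the bound once more with Jensen's inequality, so that each $\delta_k$ can be solved for separately, take the partial derivative of the right-hand side with respect to $\delta_k$ and set it to 0, which gives

$$E_{\tilde P}[f_k] = \sum_{x,y} \tilde P(x) P_w(y|x) f_k(y,x) \exp(\delta_k T(x,y))$$

where

$$T(x,y) = \sum_{k=1}^{K} f_k(y,x)$$

For $k = 1, 2, \dots, K_1$, substitute the transition feature $t_k$ for $f_k$ and solve for the increments of the transition features; for $k = K_1 + l$, $l = 1, 2, \dots, K_2$, substitute the state feature $s_l$ and solve for the increments of the state features.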

And this is the code!!

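 // gradient computation follows
  for (size_t i = 0;   i < x_.size(); ++i) {
    for (const int *f = node_[i][answer_[i]]->fvector; *f != -1; ++f) {  // answer_[i] is the gold label of position i from the training data
      --expected[*f + answer_[i]];   // expected starts out as the model expectation; fvector is a -1 terminated list of feature ids, so subtract 1 for every feature on the gold label.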
    }
    s += node_[i][answer_[i]]->cost;  // UNIGRAM cost
    const std::vector<Path *> &lpath = node_[i][answer_[i]]->lpath;
    for (const_Path_iterator it = lpath.begin(); it != lpath.end(); ++it) {  // walk the incoming edges, find the one whose left node also carries the gold label, and decrement its features as well.
      if ((*it)->lnode->y == answer_[(*it)->lnode->x]) {
        for (const int *f = (*it)->fvector; *f != -1; ++f) {
          --expected[*f +(*it)->lnode->y * ysize_ +(*it)->rnode->y];
        }
        s += (*it)->cost;  // BIGRAM COST
        break;
      }
    }
  }

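Does the program look inconsistent with the theory? It is indeed different. My first guess was that the program uses relaxed features, but the simpler explanation is that the code does not do iterative scaling at all: expected ends up holding the model expectation minus the observed count of each feature, which is the gradient of the negative log-likelihood Z_ - s that gradient() returns, and that gradient is handed to L-BFGS. That is the quasi-Newton route, the second method in the book.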
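The "expected minus observed" pattern is easier to see with the feature indexing stripped away. The sketch below assumes a deliberately simplified model with one indicator feature per label, so that the feature id is just the label id; unigramGradient and its arguments are hypothetical and only mirror the two steps of the code above (accumulate marginals, then decrement at the gold label).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical simplification: one indicator feature per label, so the
// feature id equals the label id. marginal[i][j] = P(y_i = j | x) and
// answer[i] is the gold label at position i.
std::vector<double> unigramGradient(
    const std::vector<std::vector<double>> &marginal,
    const std::vector<std::size_t> &answer) {
  const std::size_t num_labels = marginal.empty() ? 0 : marginal[0].size();
  std::vector<double> gradient(num_labels, 0.0);
  for (std::size_t i = 0; i < marginal.size(); ++i) {
    for (std::size_t j = 0; j < num_labels; ++j) {
      gradient[j] += marginal[i][j];  // model expectation of feature j
    }
    gradient[answer[i]] -= 1.0;       // minus its observed count
  }
  return gradient;
}
```

With marginals {0.75, 0.25} and {0.5, 0.5} and gold labels 0 and 1, the result is {0.25, -0.25}: the components sum to zero, and the gradient vanishes exactly when the model's expected counts match the observed ones.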

2.5 Viterbi algorithm

I have not read this part closely yet and will come back to it later. The code:

void TaggerImpl::viterbi() {
  for (size_t i = 0;   i < x_.size(); ++i) {
    for (size_t j = 0; j < ysize_; ++j) {
      double bestc = -1e37;
      Node *best = 0;
      const std::vector<Path *> &lpath = node_[i][j]->lpath;
      for (const_Path_iterator it = lpath.begin(); it != lpath.end(); ++it) {
        double cost = (*it)->lnode->bestCost +(*it)->cost +
            node_[i][j]->cost;  // cost is used as a score here: larger is better.
        if (cost > bestc) {   // keep the left node that gives the highest-scoring path into this node.
          bestc = cost;
          best  = (*it)->lnode;
        }
      }
      node_[i][j]->prev     = best;    // remember the best left node as this node's predecessor.
      node_[i][j]->bestCost = best ? bestc : node_[i][j]->cost;   // store the best score as this node's bestCost so that later nodes can trace back the best score and path.
    }
  }

  double bestc = -1e37;
  Node *best = 0;
  size_t s = x_.size()-1;
  for (size_t j = 0; j < ysize_; ++j) {
    if (bestc < node_[s][j]->bestCost) {
      best  = node_[s][j];
      bestc = node_[s][j]->bestCost;
    }
  }

  for (Node *n = best; n; n = n->prev) {
    result_[n->x] = n->y;   // store the best path.
  }

  cost_ = -node_[x_.size()-1][result_[x_.size()-1]]->bestCost;  
}
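The same dynamic program can be written against plain score matrices, which makes the back-pointer bookkeeping easier to follow. This is a minimal sketch under the same toy assumptions as before (node[i][j] and edge[k][j] hold scores, larger is better); viterbiPath is a made-up name and not the CRF++ function.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical toy decoder: node[i][j] scores label j at position i and
// edge[k][j] scores the transition k -> j. Larger scores are better.
std::vector<std::size_t> viterbiPath(const Matrix &node, const Matrix &edge) {
  const std::size_t n = node.size();
  if (n == 0 || node[0].empty()) return {};
  const std::size_t L = node[0].size();
  Matrix best(n, std::vector<double>(L));
  std::vector<std::vector<std::size_t>> prev(n,
                                             std::vector<std::size_t>(L, 0));

  best[0] = node[0];  // no predecessor at the first position
  for (std::size_t i = 1; i < n; ++i) {
    for (std::size_t j = 0; j < L; ++j) {
      std::size_t arg = 0;
      double top = best[i - 1][0] + edge[0][j];
      for (std::size_t k = 1; k < L; ++k) {
        const double s = best[i - 1][k] + edge[k][j];
        if (s > top) { top = s; arg = k; }
      }
      best[i][j] = top + node[i][j];
      prev[i][j] = arg;  // back-pointer to the best left label
    }
  }

  std::size_t last = 0;  // best label at the final position
  for (std::size_t j = 1; j < L; ++j) {
    if (best[n - 1][j] > best[n - 1][last]) last = j;
  }

  std::vector<std::size_t> path(n);
  path[n - 1] = last;
  for (std::size_t i = n - 1; i > 0; --i) path[i - 1] = prev[i][path[i]];
  return path;
}
```

With node scores {1, 0}, {0, 0}, {0, 2} and a penalty of -5 for switching labels, the decoder returns labels 1, 1, 1: the strong final score for label 1 outweighs the small initial preference for label 0 because switching is too expensive.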

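2.6 Optimization algorithm
Optimization in the program goes through lbfgs.optimize(), which in turn calls lbfgs_optimize() to do the real work. lbfgs_optimize() uses the L-BFGS optimization algorithm. I know nothing about this algorithm, so I will not make things up for now.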

void LBFGS::lbfgs_optimize(int size,
                           int msize,
                           double *x,
                           double f,
                           const double *g,
                           double *diag,
                           double *w,
                           bool orthant,
                           double C,
                           double *v,
                           double *xi,
                           int *iflag) {

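3. The prediction algorithm

Roughly speaking, the Viterbi algorithm computes, for every label at position i, the best unnormalized score of any path ending in that label, records which predecessor achieved it, and moves forward position by position until position n; tracing the recorded predecessors back then gives the best path. (I have not studied this function systematically; this is only my reading of the code.)

Prediction is reportedly done mainly through the following interface: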

bool TaggerImpl::parse() {
  CHECK_FALSE(feature_index_->buildFeatures(this))
      << feature_index_->what();   // build features

  if (x_.empty()) {
    return true;
  }
  buildLattice();   // build the lattice
  if (nbest_ || vlevel_ >= 1) {
    forwardbackward();   // forward-backward algorithm
  }
  viterbi();   // Viterbi: find the best path
  if (nbest_) {
    initNbest();
  }

  return true;
}

4. Summary

I finally have a rough understanding of how a CRF is used. Next comes hands-on practice.
