[Object Detection Series] YOLOv3 and YOLOv4: training on your own data (PyTorch version) + calling the trained model from OpenCV + OpenVINO inference engine acceleration

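Foreword: after a lot of stumbling I finally got a first working result. This is suitable for beginners, because I am one myself!

Main contents: some theory; how to prepare your own dataset, train and predict; how to call the trained model from VS2017 on Windows.

PS: YOLOv3 and YOLOv4 are handled by the same code. The only differences are the cfg file you edit and the pretrained network you load. During training I noticed that GPU usage with v4 goes up and down while with v3 it stays at a steady value, and that v4 needs a much smaller batch size than v3 and trains more slowly.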

I. Code location:

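II. Principle: I will mainly go over the network structure, because without it you cannot match things up with the code. The network output is best shown as a picture. The red part can be regarded as the backbone, e.g. darknet53; the blue part outputs results on the 13*13 layer, the orange part on the 26*26 layer, and the green part on the 52*52 layer. Where does 13*13 come from? The 416*416 input is downsampled by a factor of 32, and 416/32 = 13. Many parameters in the cfg file correspond one-to-one with this picture; for the mapping see https://blog.csdn.net/gbz3300255/article/details/106255335. The main thing is to focus on the inputs and outputs, because those are what you have to match against the code.

There are three colored circles because detection is multi-scale. Detection runs three times, each with a different receptive field: 32x downsampling has the largest receptive field and suits large objects, 16x suits medium-sized objects, and 8x has the smallest receptive field and suits small objects. The anchor box sizes are set in the cfg file. In the figure below, red marks where the anchor centers are.

I will not go into how the loss function is computed or how the prior boxes are built. Here we only care about inputs and outputs.

The inputs and outputs of the code above are as follows.

Input: assume a 416*416*3 image. (The default input size in the program is 320*640, so watch out for this when training on your own data.)

Output: data of shape [1, (13*13 + 26*26 + 52*52)*3, 85]

What is (13*13 + 26*26 + 52*52)? It is the total number of detection centers, and the factor of 3 is there because every center has 3 prior boxes. So (13*13 + 26*26 + 52*52)*3 = 10647 is the total number of detection boxes produced.

What is 85? It is the dimension of the feature vector at one point of the 13*13, 26*26 or 52*52 feature map. Where does it come from? The network detects 80 classes, so the box at a point has 80 probability values, one confidence per class; each box also has 4 position values (x, y, w, h) and 1 confidence for the box itself. So the feature vector of the box at that point is (1 + 4 + 80) = 85-dimensional.

I ramble on about this because it is the value we have to change in the cfg file when training on our own data.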
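To make the arithmetic above concrete, here is a small sketch that computes the output shape from the input size and the class count. The function name and its parameters are made up for illustration and are not part of the training code; it assumes a square input, the three strides 32/16/8 and 3 prior boxes per cell, as described above.

```python
def yolo_output_shape(input_size=416, num_classes=80,
                      strides=(32, 16, 8), anchors_per_scale=3):
    """Return (batch, total_boxes, values_per_box) for a square input image."""
    # One grid per detection scale: 13, 26 and 52 cells per side for a 416 input.
    grids = [input_size // s for s in strides]
    total_boxes = sum(g * g for g in grids) * anchors_per_scale
    # 4 box values (x, y, w, h) + 1 box confidence + one score per class.
    values_per_box = 4 + 1 + num_classes
    return (1, total_boxes, values_per_box)


print(yolo_output_shape())               # the 80-class case from the text
print(yolo_output_shape(num_classes=4))  # a 4-class dataset
```

With 4 classes the last dimension shrinks from 85 to 9, which is why the cfg file has to change along with the class count.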

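III. Steps for training on your own dataset:

1. Prepare the dataset: first understand what the dataset YOLOv3 needs looks like. Put simply, it wants one label file per image. A pile of images and the matching pile of files make up the image dataset and the label dataset. Label file names correspond one-to-one with image names, and each line of a label file contains: class index, box x center, box y center, box width, box height. Note that the class index is simply 0 1 2 3 4 ..., and the remaining values are floats obtained by dividing by the image width or height. The figure below is extremely handy.

To get started quickly and see results, I just downloaded a ready-made dataset, CCTSDB. A program reads its boxes and writes them out in the form shown in man007.txt in the figure above.

The code is not convenient to post, but what it does is simple. It reads the CCTSDB dataset image by image along with the corresponding json file, extracts the classes and boxes, numbers the classes 0 1 2 ..., computes the box values as in the figure above, and writes each box as one line like: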

0 0.669 0.5785714285714286 0.032 0.08285714285714285
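Since the conversion script itself is not posted, here is a minimal sketch of just the box arithmetic, assuming the source annotation gives pixel corners (xmin, ymin, xmax, ymax). The function name, the corner format and the 1000x700 image in the usage lines are assumptions for illustration only; the real CCTSDB json layout is not reproduced here.

```python
def to_yolo_line(class_id, xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a pixel-corner box into one YOLO label line.

    Output fields: class x_center y_center width height,
    with the last four normalized by the image width or height.
    """
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return f"{class_id} {x_center} {y_center} {width} {height}"


# Hypothetical box on a hypothetical 1000x700 image.
line = to_yolo_line(0, 653, 376, 685, 434, img_w=1000, img_h=700)
print(line)
```

All four box values end up between 0 and 1, so the same label file stays valid if the image is resized later.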

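1.1 Prepare the text files: train.txt, test.txt, val.txt and the labels text files

train.txt records the names of the images in the dataset, like this. The images themselves are stored under /data/images/.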

BloodImage_00091
BloodImage_00156
BloodImage_00389
BloodImage_00030
BloodImage_00124
BloodImage_00278
BloodImage_00261

test.txt has the same format as above; it contains the file names of the images to test.

BloodImage_00258
BloodImage_00320
BloodImage_00120

val.txt has the same format as above; it contains the file names of the validation images.

BloodImage_00777
BloodImage_00951

Labels text files: every image in images has a matching labels text file in the form below, with a name like BloodImage_00091.txt.

0 0.669 0.5785714285714286 0.032 0.08285714285714285

All labels text files go in /data/labels/ of the code above.
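The three list files can be written by hand, but a few lines of script save time. This is only a sketch: the function names, the 10%/10% validation and test fractions and the fixed seed are my own assumptions, not something the training code requires.

```python
import random


def split_names(names, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle image names and split them into train / val / test lists."""
    names = sorted(names)                 # sort first so the result is reproducible
    random.Random(seed).shuffle(names)
    n_val = int(len(names) * val_frac)
    n_test = int(len(names) * test_frac)
    val = names[:n_val]
    test = names[n_val:n_val + n_test]
    train = names[n_val + n_test:]
    return train, val, test


def write_list(path, names):
    """Write one image name per line, as in train.txt above."""
    with open(path, "w") as f:
        f.write("\n".join(names) + "\n")


# Made-up names in the same style as the listing above.
names = [f"BloodImage_{i:05d}" for i in range(20)]
train, val, test = split_names(names)
print(len(train), len(val), len(test))
```

In practice the names would come from listing the images folder with the extensions stripped, followed by one write_list call each for train.txt, val.txt and test.txt.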

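1.2 Prepare the rbc.data file. The file name is arbitrary; just remember to pass this name to the program as an argument. Its contents are as follows.

The first entry is the number of classes. The ones below are the paths of the training image list, the test image list and so on, plus the path of the text file holding the name of each class.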

classes=4
train=data/train.txt
valid=data/test.txt
names=data/rbc.names
backup=backup/
eval=coco
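The .data file is just key=value pairs, one per line. The sketch below shows how such a file can be read into a dictionary; it is an illustration of the format, not the parser the training code actually uses, and the function name is made up.

```python
def read_data_file(text):
    """Parse key=value lines into a dict, skipping blank lines and # comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, value = line.split("=", 1)   # split on the first '=' only
        options[key.strip()] = value.strip()
    return options


example = """classes=4
train=data/train.txt
valid=data/test.txt
names=data/rbc.names
backup=backup/
eval=coco"""

cfg = read_data_file(example)
print(cfg["classes"], cfg["train"], cfg["names"])
```

Note that every value comes back as a string, so classes has to be converted with int() before it is used as a number.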

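1.3 Prepare the rbc.names file. The file name is arbitrary; just remember to pass this name to the program as an argument. Its contents are as follows.

These are the names of the four classes. I was lazy and just wrote a, b, c, d; change them to match your own classes.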

a
b
c
d

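1.4 Prepare the image data: training images go in images, test images go in samples. The images in images correspond one-to-one with the text files in labels.

The final storage layout looks something like this, under the data folder.

2. Modify the cfg file: decide which model to use, then edit the matching cfg file. For example, I trained with yolov3, so I went to the cfg folder, found yolov3.cfg and edited it. I only changed the number of classes and the filters value, because filters depends on the number of classes. From the network structure you can see that yolov3 needs changes in 3 places. As for the rest, such as the anchor sizes: if the original boxes differ a lot from the objects you want to detect, I recommend re-clustering to compute a new set of anchors.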

classes = 4


#filters=3 * (5 + classes )
filters= 27  #3 * (5 + 4)
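The filters formula in the comment above is easy to get wrong by hand, so here is a tiny sketch that evaluates it. The function name is made up; the formula is the one from the cfg comment, with 3 prior boxes per scale.

```python
def yolo_filters(num_classes, anchors_per_scale=3):
    """filters for the convolutional layer that feeds each detection layer.

    Each prior box predicts 4 box values + 1 confidence + one score per class.
    """
    return anchors_per_scale * (5 + num_classes)


for n in (1, 4, 80):
    print(n, "classes ->", "filters =", yolo_filters(n))
```

For the 4-class example this gives 27, matching the value in the cfg snippet, and for the original 80 classes it gives 255.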

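Modify the anchors: if the contents of your own training set differ from the images the default anchors were made for, you really must change them. The code for computing anchors is below, borrowed from an expert.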

# -*- coding: utf-8 -*-
import numpy as np
import random
import argparse
import os
# Command-line arguments
parser = argparse.ArgumentParser(description='Use this script to generate YOLO-V3 anchor boxes\n')
parser.add_argument('--input_annotation_txt_dir',required=True,type=str,help='directory holding the annotation txt files of the images (avoid non-ASCII characters in the path)')
parser.add_argument('--output_anchors_txt',required=True,type=str,help='output text file that stores the anchor boxes')
parser.add_argument('--input_num_anchors',required=True,default=6,type=int,help='number of clusters to compute (number of anchor boxes)')
parser.add_argument('--input_cfg_width',required=True,type=int,help="width in the cfg file")
parser.add_argument('--input_cfg_height',required=True,type=int,help="height in the cfg file")
args = parser.parse_args()
'''
centroids: cluster centers, shape num x 2, type ndarray
annotation_array: one of the annotated boxes
'''
def IOU(annotation_array,centroids):
    similarities = []
    # one annotated box
    w,h = annotation_array
    for centroid in centroids:
        c_w,c_h = centroid
        if c_w >=w and c_h >= h:# case 1: the centroid contains the box
            similarity = w*h/(c_w*c_h)
        elif c_w >= w and c_h <= h:# case 2: the centroid is wider but shorter
            similarity = w*c_h/(w*h + (c_w - w)*c_h)
        elif c_w <= w and c_h >= h:# case 3: the centroid is narrower but taller
            similarity = c_w*h/(w*h +(c_h - h)*c_w)
        else:# case 4: the box contains the centroid
            similarity = (c_w*c_h)/(w*h)
        similarities.append(similarity)
    # convert the list to an ndarray
    return np.array(similarities,np.float32) # 1-D array of shape (num,)
 
'''
k_means: k-means clustering
annotations_array: widths and heights of all N annotated boxes, shape N x 2, type ndarray
centroids: cluster centers, shape num x 2, type ndarray
'''
def k_means(annotations_array,centroids,eps=0.00005,iterations=200000):
    N = annotations_array.shape[0]# number of annotated boxes
    num = centroids.shape[0]
    # loss value from the previous iteration
    distance_sum_pre = -1
    assignments_pre = -1*np.ones(N,dtype=np.int64)
    iteration = 0
    # main loop
    while(True):
        iteration += 1
        distances = []
        # distance (1 - IOU) between every annotated box and every centroid
        for i in range(N):
            distance = 1 - IOU(annotations_array[i],centroids)
            distances.append(distance)
        # convert the list to an ndarray of shape N x num
        distances_array = np.array(distances,np.float32)
        # index of the nearest centroid for every box (argmin over each row)
        assignments = np.argmin(distances_array,axis=1)
        # sum of the distances, used as the k-means loss
        distances_sum = np.sum(distances_array)
        # compute the new centroids
        centroid_sums = np.zeros(centroids.shape,np.float32)
        for i in range(N):
            centroid_sums[assignments[i]] += annotations_array[i]# per-cluster sum
        for j in range(num):
            count = np.sum(assignments==j)
            if count > 0:# keep the old centroid when a cluster is empty (avoids dividing by zero)
                centroids[j] = centroid_sums[j]/count
        # change in distance between two consecutive iterations
        diff = abs(distances_sum-distance_sum_pre)
        # print progress
        print("iteration: {},distance: {}, diff: {}, avg_IOU: {}\n".format(iteration,distances_sum,diff,np.sum(1-distances_array)/(N*num)))
        # three ways out of the while loop: 1: assignments unchanged, 2: change below eps, 3: iteration limit reached
        if (assignments==assignments_pre).all():
            print("stopping: cluster assignments did not change between two iterations\n")
            break
        if diff < eps:
            print("stopping: change below eps\n")
            break
        if iteration > iterations:
            print("stopping: iteration limit reached\n")
            break
        # remember this iteration for the next comparison
        distance_sum_pre = distances_sum
        assignments_pre = assignments.copy()
if __name__=='__main__':
    # number of clusters, i.e. number of anchor boxes
    num_clusters = args.input_num_anchors
    # list every annotation file (.txt) in the directory
    names = os.listdir(args.input_annotation_txt_dir)
    # widths and heights of the annotated boxes
    annotations_w_h = []
    for name in names:
        txt_path = os.path.join(args.input_annotation_txt_dir,name)
        # read every line of the txt file
        with open(txt_path,'r') as f:
            for line in f:
                fields = line.split()
                if len(fields) != 5:# skip blank or malformed lines
                    continue
                # fields are: class x_center y_center width height
                annotations_w_h.append((float(fields[3]),float(fields[4])))
    # convert the list to a numpy array of shape (N,2), N being the number of boxes
    annotations_array = np.array(annotations_w_h,dtype=np.float32)
    N = annotations_array.shape[0]
    # random initial centroids for k-means
    random_indices = [random.randrange(N) for i in range(num_clusters)]
    centroids = annotations_array[random_indices]
    # k-means clustering
    k_means(annotations_array,centroids,0.00005,200000)
    # sort the centroids by width
    widths = centroids[:,0]
    sorted_indices = np.argsort(widths)
    anchors = centroids[sorted_indices]
    # write the anchors, scaled to the cfg input size, one "w,h" pair per line
    with open(args.output_anchors_txt,'w') as f_anchors:
        for anchor in anchors:
            f_anchors.write('%d,%d'%(int(anchor[0]*args.input_cfg_width),int(anchor[1]*args.input_cfg_height)))
            f_anchors.write('\n')

Run it like this:

python kmean.py --input_annotation_txt_dir data/labels --output_anchors_txt 123456.txt --input_num_anchors 9 --input_cfg_width 640 --input_cfg_height 320

The resulting file looks like this:

Write these values into the cfg.
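The script writes one "w,h" pair per line, while the cfg wants all pairs on a single anchors line. Here is a sketch of that reshaping. The function name and the sample values are made up, and sorting by area (smallest first) is my assumption about the order the cfg expects; the script above sorts by width only.

```python
def anchors_to_cfg_line(text):
    """Turn 'w,h' lines from the anchors file into one cfg 'anchors = ...' line."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        w, h = (int(v) for v in line.split(","))
        pairs.append((w, h))
    # Assumption: list anchors from smallest to largest area.
    pairs.sort(key=lambda p: p[0] * p[1])
    return "anchors = " + ",  ".join(f"{w},{h}" for w, h in pairs)


# Hypothetical file contents, not real clustering results.
sample = "30,61\n10,13\n16,30\n"
print(anchors_to_cfg_line(sample))
```

The same anchors line has to be pasted into every detection section of the cfg, since each of them carries its own copy.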

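3. Modify the code:

This is where I hit the most pitfalls. For example, train.py defines its own list of hyperparameters, so some of the values configured in the cfg no longer take effect.

To change the batch size I had to edit train.py. The default is 16.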

parser.add_argument('--batch-size', type=int, default=16)  # effective bs = batch_size * accumulate = 16 * 4 = 64

There are many others, such as the setting for single-class versus multi-class detection.

parser.add_argument('--single-cls', action='store_false', help='train as single-class dataset')

Choice of optimizer, SGD or Adam:

parser.add_argument('--adam', action='store_true', help='use adam optimizer')

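Learning rate: with Adam, if the loss drops very slowly during training and stays large, uncomment this line to lower the learning rate and see whether it helps.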

#hyp['lr0'] *= 0.1  # reduce lr (i.e. SGD=5E-3, Adam=5E-4)

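The key problem: some settings still do not take effect even after being set in the code above.

For example, it defaults to single-class detection and I changed it to multi-class, but training was still wrong. I later found the parameter value was not taking effect, and in the end I forced it.

4. Training: adjust the arguments to suit yourself.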

python train.py --data data/rbc.data --cfg cfg/yolov3.cfg --epochs 2000

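5. Most common errors during training:

Running out of GPU memory: the fix is to reduce the batch size.

The code loads a pretrained model. Download one yourself and put it at the path given by the argument, for example the yolov3.weights file for yolov3.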

parser.add_argument('--weights', type=str, default='weights/yolov3.weights', help='initial weights path')

6. Prediction:

python detect.py --cfg cfg/yolov3-tiny.cfg --weights weights/best.pt

A pitfall I hit during prediction: I work on two computers, one for training and one for testing, and got this error:

RuntimeError: Error(s) in loading state_dict for Darknet

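Cause: an error introduced by the two machines running different PyTorch versions.

Fix: change the model-loading part of detect.py. I did not dig into the theory and just followed an expert's approach. The second argument of load_state_dict is strict; passing False tells PyTorch to ignore missing or unexpected keys.

Change model.load_state_dict(torch.load(weights, map_location=device)['model'])
to:
model.load_state_dict(torch.load(weights, map_location=device)['model'], strict=False)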

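7. Wouldn't it be nice to call it from OpenCV.

1. See reference 5 for a detailed walkthrough; I am just carrying it over. There is one error in it: use it as-is and you will find, heh, that the detections have fairly high confidence but the boxes are nonsense O(∩_∩)O. Already fixed. The training result above is best.pt, whereas the VS2017 project below loads a .weights file. The conversion is already in the code: models.py has a save_weights function you can use directly. In my setup the conversion turns best.pt into converted.weights. The remaining ones are absolute paths; check for yourself which files get read.

2. One thing to note: the image scaling used below is not the same as YOLOv3's. It should be changed, but I was lazy and did not. The consequence of not changing it is that objects go undetected. I used 1280*720 images and scaled them to 512, and the scaled image happened to be 512*288, which is a multiple of 32. Remember: the goal of scaling is to make the width and height multiples of 32 without changing the aspect ratio of the original image (distortion is not allowed). So in general you scale along rows or columns, then pad in one direction up to a multiple of 32. I dashed off some code, then suddenly realized I did not need it. I am posting it anyway and will fill in the rest when I have time.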

void YoloResize(Mat in, Mat &out)
{
	int w = in.cols;
	int h = in.rows;
	int target_w = 512;
	int target_h = 512;
	float ratio0 = (float)target_w / w;
	float ratio1 = (float)target_h / h;
	float scale = min(ratio0, ratio1);// the smaller of the two ratios

	// make sure at least one of width and height matches the target size
	int nw = int(w * scale);
	int nh = int(h * scale);
	// resize the image
	cv::resize(in, out, cv::Size(nw, nh), 0, 0, cv::INTER_CUBIC);
	// set the output size, padding up to a multiple of 32; the resized image goes in the middle of the output
	if (ratio0 <= ratio1)//
	{
		// pad top and bottom
		int addh = (32 - nh % 32) % 32;// rows needed to reach the next multiple of 32
		int newh = nh + addh;
	}
	else
	{
		// pad left and right
	}
}
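The function above stops before the padding step, so here is a sketch of just the size arithmetic: scale to fit the target while keeping the aspect ratio, then pad each side up to the next multiple of 32 with the image centered. The function name and the returned dictionary keys are made up for illustration, and it rounds the scaled size instead of truncating it as the C++ does.

```python
def letterbox_shape(w, h, target=512, stride=32):
    """Compute resized size and padding so both sides become multiples of stride."""
    scale = min(target / w, target / h)        # keep the aspect ratio
    nw, nh = int(round(w * scale)), int(round(h * scale))
    pad_w = (stride - nw % stride) % stride    # 0 when already a multiple
    pad_h = (stride - nh % stride) % stride
    left, top = pad_w // 2, pad_h // 2         # center the resized image
    return {
        "resized": (nw, nh),
        "padded": (nw + pad_w, nh + pad_h),
        "left": left, "right": pad_w - left,
        "top": top, "bottom": pad_h - top,
    }


print(letterbox_shape(1280, 720))   # the 1280*720 case from the text
print(letterbox_shape(1000, 700))   # a size that does need padding
```

For the 1280*720 case the resized image is already 512*288 and no padding is needed, which matches what I observed above.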

The complete calling code is here:

// This code is written at BigVision LLC.
//It is subject to the license terms in the LICENSE file found in this distribution and at http://opencv.org/license.html

#include <fstream>
#include <sstream>
#include <iostream>
#include <opencv2/dnn.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>


using namespace cv;
using namespace dnn;
using namespace std;

// Initialize the parameters
float confThreshold = 0.5; // Confidence threshold
float nmsThreshold = 0.4;  // Non-maximum suppression threshold
int inpWidth = 512;  // Width of network's input image
int inpHeight = 192; // Height of network's input image
vector<string> classes;

// Remove the bounding boxes with low confidence using non-maxima suppression
void postprocess(Mat& frame, const vector<Mat>& out);

// Draw the predicted bounding box
void drawPred(int classId, float conf, int left, int top, int right, int bottom, Mat& frame);

// Get the names of the output layers
vector<String> getOutputsNames(const Net& net);

int main(int argc, char** argv)
{

//*
	string classesFile = "E:\\LL\\rbc.names";
	ifstream ifs(classesFile.c_str());
	string line;
	while (getline(ifs, line)) classes.push_back(line);

	// Give the configuration and weight files for the model
	String modelConfiguration = "E:\\LL\\yolov3_new.cfg";
	String modelWeights = "E:\\LL\\converted.weights";

	// Load the network
	Net net = readNetFromDarknet(modelConfiguration, modelWeights);
	net.setPreferableBackend(DNN_BACKEND_OPENCV);
	net.setPreferableTarget(DNN_TARGET_CPU);

	// Open a video file or an image file or a camera stream.
	string str, outputFile;
	//VideoCapture cap("E:\\SSS.mp4");
	VideoWriter video;
	Mat frame, blob;



	// Create a window
	static const string kWinName = "Deep learning object detection in OpenCV";
	namedWindow(kWinName, WINDOW_NORMAL);

	// Process frames.
	while (waitKey(1) != 27)
	{
		// get frame from the video
		//cap >> frame;

		frame = imread("E:\\LL\\1.jpg");

		// Stop the program if reached end of video
		if (frame.empty()) {
			//waitKey(3000);
			break;
		}
		// Create a 4D blob from a frame.
		cout << "inpWidth = " << inpWidth << endl;
		cout << "inpHeight = " << inpHeight << endl;
		blobFromImage(frame, blob, 1 / 255.0, cv::Size(inpWidth, inpHeight), Scalar(0, 0, 0), true, false);

		//Sets the input to the network
		net.setInput(blob);

		// Runs the forward pass to get output of the output layers
		vector<Mat> outs;
		net.forward(outs, getOutputsNames(net));

		// Remove the bounding boxes with low confidence
		postprocess(frame, outs);

		// Put efficiency information. The function getPerfProfile returns the overall time for inference(t) and the timings for each of the layers(in layersTimes)
		vector<double> layersTimes;
		double freq = getTickFrequency() / 1000;
		double t = net.getPerfProfile(layersTimes) / freq;
		string label = format("Inference time for a frame : %.2f ms", t);
		putText(frame, label, Point(0, 15), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 0, 255));

		// Write the frame with the detection boxes
		Mat detectedFrame;
		frame.convertTo(detectedFrame, CV_8U);

		imshow(kWinName, frame);
		waitKey(100000);
	}

	//cap.release();

	
	//*/
	return 0;
}

// Remove the bounding boxes with low confidence using non-maxima suppression
void postprocess(Mat& frame, const vector<Mat>& outs)
{
	vector<int> classIds;
	vector<float> confidences;
	vector<Rect> boxes;

	for (size_t i = 0; i < outs.size(); ++i)
	{
		// Scan through all the bounding boxes output from the network and keep only the
		// ones with high confidence scores. Assign the box's class label as the class
		// with the highest score for the box.
		float* data = (float*)outs[i].data;
		for (int j = 0; j < outs[i].rows; ++j, data += outs[i].cols)
		{
			Mat scores = outs[i].row(j).colRange(5, outs[i].cols);
			Point classIdPoint;
			double confidence;
			// Get the value and location of the maximum score
			minMaxLoc(scores, 0, &confidence, 0, &classIdPoint);
			if (confidence > confThreshold)
			{
				int centerX = (int)(data[0] * frame.cols);
				int centerY = (int)(data[1] * frame.rows);
				int width = (int)(data[2] * frame.cols);
				int height = (int)(data[3] * frame.rows);
				int left = centerX - width / 2;
				int top = centerY - height / 2;

				classIds.push_back(classIdPoint.x);
				confidences.push_back((float)confidence);
				boxes.push_back(Rect(left, top, width, height));
			}
		}
	}

	// Perform non maximum suppression to eliminate redundant overlapping boxes with
	// lower confidences
	vector<int> indices;
	NMSBoxes(boxes, confidences, confThreshold, nmsThreshold, indices);
	for (size_t i = 0; i < indices.size(); ++i)
	{
		int idx = indices[i];
		Rect box = boxes[idx];
		drawPred(classIds[idx], confidences[idx], box.x, box.y,
			box.x + box.width, box.y + box.height, frame);
	}
}

// Draw the predicted bounding box
void drawPred(int classId, float conf, int left, int top, int right, int bottom, Mat& frame)
{
	//Draw a rectangle displaying the bounding box
	rectangle(frame, Point(left, top), Point(right, bottom), Scalar(255, 178, 50), 3);

	//Get the label for the class name and its confidence
	string label = format("%.2f", conf);
	if (!classes.empty())
	{
		CV_Assert(classId < (int)classes.size());
		label = classes[classId] + ":" + label;
	}

	//Display the label at the top of the bounding box
	int baseLine;
	Size labelSize = getTextSize(label, FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);
	top = max(top, labelSize.height);
	rectangle(frame, Point(left, top - round(1.5*labelSize.height)), Point(left + round(1.5*labelSize.width), top + baseLine), Scalar(255, 255, 255), FILLED);
	putText(frame, label, Point(left, top), FONT_HERSHEY_SIMPLEX, 0.75, Scalar(0, 0, 0), 1);
}

// Get the names of the output layers
vector<String> getOutputsNames(const Net& net)
{
	static vector<String> names;
	if (names.empty())
	{
		//Get the indices of the output layers, i.e. the layers with unconnected outputs
		vector<int> outLayers = net.getUnconnectedOutLayers();

		//get the names of all the layers in the network
		vector<String> layersNames = net.getLayerNames();

		// Get the names of the output layers in names
		names.resize(outLayers.size());
		for (size_t i = 0; i < outLayers.size(); ++i)
			names[i] = layersNames[outLayers[i] - 1];
	}
	return names;
}
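The two ideas in postprocess are turning a normalized center-format box into a pixel rectangle and then dropping overlapping boxes. Below is a plain-Python sketch of both. It is a simplified greedy suppression written for illustration, not OpenCV's NMSBoxes implementation, and all function names and sample boxes are made up. It assumes x and width are normalized by the image width, and y and height by the image height.

```python
def decode_box(cx, cy, w, h, frame_w, frame_h):
    """Normalized center-format box -> (left, top, width, height) in pixels."""
    width = int(w * frame_w)
    height = int(h * frame_h)
    left = int(cx * frame_w) - width // 2
    top = int(cy * frame_h) - height // 2
    return left, top, width, height


def iou(a, b):
    """Intersection over union of two (left, top, width, height) boxes."""
    iw = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thr=0.5, iou_thr=0.4):
    """Greedy non-maximum suppression; returns indices of the boxes to keep."""
    order = sorted((i for i, s in enumerate(scores) if s > score_thr),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # keep a box only if it does not overlap too much with a better one
        if all(iou(boxes[i], boxes[k]) <= iou_thr for k in keep):
            keep.append(i)
    return keep


boxes = [(100, 100, 50, 50), (105, 102, 50, 50), (300, 200, 40, 40)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
print(decode_box(0.5, 0.5, 0.25, 0.5, frame_w=1280, frame_h=720))
```

The default thresholds mirror confThreshold and nmsThreshold in the C++ listing, so the second sample box is suppressed by the first while the third one survives.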

Here is a result image to look at.