Update note:
The author's source (https://github.com/xingyizhou/CenterNet) is based on PyTorch 0.4.1, which supports CUDA only up to version 9. If you want PyTorch 1.0+ so you can use CUDA 10.0 or later, download the author's source, follow all the other steps as-is, and simply replace the DCNv2 source with the code from https://github.com/CharlesShang/DCNv2. I have trained multiple models this way on PyTorch 1.2 + RTX 2080 Ti + CUDA 10.0 without any problems:
cd CenterNet/src/lib/models/networks
rm -r DCNv2
git clone https://github.com/CharlesShang/DCNv2.git
cd DCNv2
python setup.py build develop
[Original post]
CenterNet is an anchor-free detector that is both accurate and fast; judging from the numbers in the author's paper, its overall metrics are quite strong:
The last entry, CenterNet-HG (using the Hourglass-104 backbone), has an AP only slightly below FSAF (which apparently has no source code released yet) and is much stronger than the YOLO and R-CNN families. Its FPS is only 7.8, but that is sufficient for video detection without strict real-time requirements, so I decided to try it.
First download the author's source: git clone https://github.com/xingyizhou/CenterNet.git. According to the install instructions (https://github.com/xingyizhou/CenterNet/blob/master/readme/INSTALL.md), the required environment is:
Ubuntu 16.04, with Anaconda Python 3.6 and PyTorch v0.4.1
The code is written against PyTorch 0.4.1, whose highest supported CUDA version is CUDA 9; it does not support the CUDA 10.0/10.1 installed on our server. I first tried creating an isolated conda environment with PyTorch 1.3 or 1.0.0 (which support CUDA 10), but running the code raised errors saying some APIs used by CenterNet are no longer supported (more on that later). Installing CUDA 9 on a shared server is not something to do casually: anyone who has installed CUDA knows how troublesome it can be, and a bad install can leave the server unreachable or with a black screen. So Docker is the best option. First pull a PyTorch 0.4.1 + CUDA 9.0 devel image from hub.docker.com:
docker pull pytorch/pytorch:0.4.1-cuda9-cudnn7-devel
Then create and run a container. The default working directory inside the container is /workspace, so map the work_pytorch directory containing the CenterNet source to /workspace; also map port 12000 in case the model later needs to be wrapped as a server, and pass --ipc=host to avoid shared-memory errors during multi-GPU distributed training:
nvidia-docker run --ipc=host -d -it --name pytorch0.41 -v /home/fychen/AI/work_pytorch:/workspace -p 12000:12000 pytorch/pytorch:0.4.1-cuda9-cudnn7-devel bash
After entering the container, apply the following change (PyTorch is installed under /opt/conda inside the container) to replace torch.backends.cudnn.enabled with False on line 1254 of torch/nn/functional.py:
sed -i "1254s/torch\.backends\.cudnn\.enabled/False/g" /opt/conda/lib/python3.6/site-packages/torch/nn/functional.py
Then run the following commands to install pycocotools:
git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI
pip install cython
make
python setup.py install --user
Next run the following commands to compile parts of the CenterNet code:
cd /workspace/CenterNet
pip install -r requirements.txt
cd src/lib/models/networks/DCNv2
./make.sh
cd /workspace/CenterNet/src/lib/external
make
Also install some packages CenterNet depends on (without them you will get errors):
apt-get install libglib2.0-dev libsm6 libxrender1 libxext6
Then download the corresponding pre-trained model. The backbone I use is hourglass, and the trained model will be used for object detection. According to https://github.com/xingyizhou/CenterNet/blob/master/readme/MODEL_ZOO.md :
Download the ctdet_coco_hg model in the first row: click the model link on the right to download the file ctdet_coco_hg.pth. It is hosted on drive.google.com, so you may need a proxy to reach it; fortunately I had downloaded it earlier with a colleague's tool and uploaded it here (due to upload limits, ctdet_coco_hg.pth is split into four parts: part1, part2, part3, part4). Put the downloaded pre-trained model file ctdet_coco_hg.pth under CenterNet/models/.
Since ctdet_coco_hg requires a COCO-format dataset, if yours is in PASCAL VOC2007 format you will need a script to convert it to COCO2017 format (the author's GitHub page does not say whether COCO2014 or COCO2017 format is required, but reading the code in coco.py shows it is COCO2017). As an aside, I could not find any article describing the difference between the COCO2014 and COCO2017 formats, so based on my own conversion experience (I have converted PASCAL VOC2007 to both) here are examples:
# COCO2014-format dataset (object detection part):
coco/annotations/instances_train2014.json, instances_minival2014.json
coco/images/train2014/COCO_train2014_000000000001.jpg
coco/images/val2014/COCO_val2014_000000000001.jpg
# COCO2017-format dataset (object detection part). Whether the coco directory has an images subdirectory holding train2017 and val2017 depends on the model's code: CenterNet does not need one, while another network such as EfficientDet does. You can keep your COCO dataset layout fixed and, for any model whose expected layout differs, modify the data-loading code in its COCO dataset implementation instead:
CenterNet expects this COCO layout:
coco/annotations/instances_train2017.json, instances_val2017.json
coco/train2017/000000000001.jpg
coco/val2017/000000010001.jpg
EfficientDet expects this layout:
COCO
├── annotations
│   ├── instances_train2017.json
│   └── instances_val2017.json
└── images
    ├── train2017
    └── val2017
As you can see, COCO2014 also required image file names to carry a prefix, an awkward requirement that COCO2017 dropped.
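The VOC-to-COCO conversion mentioned above can be sketched as follows. This is a minimal, stdlib-only sketch assuming one VOC annotation XML per image with the usual <object>/<bndbox> layout; the class list and file names are hypothetical examples, not part of the author's scripts:

```python
import json
import xml.etree.ElementTree as ET

CLASSES = ['fire']  # hypothetical single-class dataset; index 0 -> COCO category id 1

def voc_xml_to_coco(xml_strings):
    """Convert a list of VOC annotation XML strings into a COCO2017-style
    instances dict with 'images', 'annotations', and 'categories' keys."""
    coco = {'images': [], 'annotations': [],
            'categories': [{'id': i + 1, 'name': n} for i, n in enumerate(CLASSES)]}
    ann_id = 1
    for img_id, xml_str in enumerate(xml_strings, start=1):
        root = ET.fromstring(xml_str)
        size = root.find('size')
        coco['images'].append({
            'id': img_id,
            'file_name': root.findtext('filename'),
            'width': int(size.findtext('width')),
            'height': int(size.findtext('height')),
        })
        for obj in root.iter('object'):
            box = obj.find('bndbox')
            xmin, ymin = float(box.findtext('xmin')), float(box.findtext('ymin'))
            xmax, ymax = float(box.findtext('xmax')), float(box.findtext('ymax'))
            w, h = xmax - xmin, ymax - ymin
            coco['annotations'].append({
                'id': ann_id, 'image_id': img_id,
                'category_id': CLASSES.index(obj.findtext('name')) + 1,
                'bbox': [xmin, ymin, w, h],  # COCO bbox is [x, y, width, height]
                'area': w * h, 'iscrowd': 0,
            })
            ann_id += 1
    return coco

if __name__ == '__main__':
    xml = """<annotation><filename>000000000001.jpg</filename>
    <size><width>640</width><height>480</height></size>
    <object><name>fire</name><bndbox><xmin>10</xmin><ymin>20</ymin>
    <xmax>110</xmax><ymax>220</ymax></bndbox></object></annotation>"""
    print(json.dumps(voc_xml_to_coco([xml])))
```

A real conversion script would walk the VOC Annotations directory, split train/val, and write the two instances_*2017.json files with json.dump.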
Once your own COCO2017 dataset is ready, copy or link it under CenterNet/data/, so the paths look like:
CenterNet/data/coco/annotations/...
CenterNet/data/coco/train2017/...
CenterNet/data/coco/val2017/...
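Linking (rather than copying) the dataset into place can be sketched like this; a stdlib-only sketch, and the example paths are hypothetical:

```python
import os

def link_coco_dataset(src_root, centernet_root):
    """Symlink an existing COCO2017 dataset into CenterNet/data/coco
    instead of copying it, which saves disk space and keeps one
    canonical copy of the data."""
    dst = os.path.join(centernet_root, 'data', 'coco')
    os.makedirs(dst, exist_ok=True)
    for name in ('annotations', 'train2017', 'val2017'):
        target = os.path.join(dst, name)
        if not os.path.exists(target):
            os.symlink(os.path.join(src_root, name), target)

# Example usage (hypothetical paths):
# link_coco_dataset('/data/coco2017', '/workspace/CenterNet')
```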
With the pre-trained model and the COCO dataset ready, the next step is to adapt the CenterNet code to your own dataset:
1) In CenterNet/src/lib/datasets/dataset/coco.py change:
num_classes = 1 # changed from 80 (COCO) to 1; my dataset has only one object class
...
self.annot_path = os.path.join(
self.data_dir, 'annotations',
#'image_info_test-dev2017.json').format(split)
'instances_val2017.json') # image_info_test-dev2017.json belongs to the COCO2017 dataset; change it to my own file
class_name = ['__background__', 'fire'] # comment out or delete the 80 COCO classes and add my own
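Before training it is worth sanity-checking that the annotation file agrees with the values edited into coco.py; a minimal stdlib sketch (the helper name and example path are mine, not CenterNet's):

```python
import json

def check_annotations(ann, num_classes, class_name):
    """Verify that a COCO instances dict matches the values edited into
    coco.py: category count and names. class_name[0] is '__background__'
    and is not a COCO category, hence the [1:] slice."""
    cats = sorted(ann['categories'], key=lambda c: c['id'])
    assert len(cats) == num_classes, 'num_classes mismatch'
    assert [c['name'] for c in cats] == class_name[1:], 'class_name mismatch'
    valid_ids = {c['id'] for c in cats}
    for a in ann['annotations']:
        assert a['category_id'] in valid_ids, 'annotation uses unknown category'
    return True

# Example usage (hypothetical path):
# ann = json.load(open('data/coco/annotations/instances_val2017.json'))
# check_annotations(ann, 1, ['__background__', 'fire'])
```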
2) In CenterNet/src/main.py:
torch.backends.cudnn.benchmark = False # otherwise startup fails; see the errors section below
...
if opt.val_intervals > 0 and epoch % opt.val_intervals == 0:
...
if log_dict_val[opt.metric] < best:
best = log_dict_val[opt.metric]
save_model(os.path.join(opt.save_dir, 'model_best.pth'),epoch, model)
save_model(os.path.join(opt.save_dir, 'lite_model_last.pth'),epoch, model)
best starts at 1e10, so the condition log_dict_val[opt.metric] < best is satisfied the first time epoch reaches opt.val_intervals (default 5), at which point best is updated to log_dict_val[opt.metric]. From then on the condition is rarely satisfied again, so model_best.pth is rarely regenerated. My dataset of a few thousand images is not large, yet one epoch still takes about 15 minutes, and sometimes I urgently need a model to test. So it is best to add the extra line above: every opt.val_intervals epochs it saves a lighter model file without the optimizer state (only about 700 MB), whereas the model_last.pth saved every epoch includes the optimizer:
save_model(os.path.join(opt.save_dir, 'model_last.pth'), epoch, model, optimizer)
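The size difference comes from whether the optimizer state is included in the checkpoint dict. A torch-free sketch of that logic (the real save_model lives in src/lib/models/model.py and uses torch.save; this only mimics the checkpoint-dict construction):

```python
def build_checkpoint(epoch, model_state, optimizer_state=None):
    """Mimic the checkpoint layout saved by CenterNet's save_model:
    the dict always holds the epoch and model weights, and includes
    the optimizer state only when one is passed. Omitting it is what
    makes the 'lite' checkpoint much smaller than model_last.pth."""
    data = {'epoch': epoch, 'state_dict': model_state}
    if optimizer_state is not None:
        data['optimizer'] = optimizer_state
    return data

# Saved every epoch, with optimizer state (large):
full = build_checkpoint(5, {'w': [0.1]}, optimizer_state={'lr': 2.5e-4})
# Saved every opt.val_intervals epochs, without optimizer state (light):
lite = build_checkpoint(5, {'w': [0.1]})
```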
3) In lib/opts.py (used by test and demo; change the number of classes from 80 to 1):
'ctdet': {'default_resolution': [512, 512], 'num_classes': 1,
'mean': [0.408, 0.447, 0.470], 'std': [0.289, 0.274, 0.278],
'dataset': 'coco'}
Now training can be started with the following command (using five GPUs, numbers 2-6):
nohup python -u main.py ctdet --exp_id coco_hg --arch hourglass --batch_size 24 --master_batch 4 --lr 2.5e-4 --load_model ../models/ctdet_coco_hg.pth --gpus 2,3,4,5,6 &
To train from scratch without the pre-trained model, run instead:
nohup python -u main.py ctdet --exp_id coco_hg --arch hourglass --batch_size 24 --master_batch 4 --lr 2.5e-4 --gpus 2,3,4,5,6 &
If training is interrupted unexpectedly, restart it with --resume, which makes the script automatically load the last saved model_last.pth so training resumes from its latest state:
nohup python -u main.py ctdet --resume --exp_id coco_hg --arch hourglass --batch_size 24 --master_batch 4 --lr 2.5e-4 --gpus 2,3,4,5,6 &
To change the default total number of training epochs (140), say to 100, add --num_epochs=100 on the command line or edit CenterNet/src/lib/opts.py.
So far training has taken a little over three days for 265 epochs; train_loss is still decreasing steadily with no large rebounds, so accuracy appears to still be improving.
By epoch 300 the various loss values had stabilized, so I stopped training and ran validation:
python -u test.py ctdet --exp_id coco_hg --arch hourglass --keep_res --resume
This evaluates the images under data/coco/val2017/ and reports the final AP.
Copy a few images to data/coco/test/ and run:
python -u demo.py ctdet --arch hourglass --demo ../data/coco/test/ --load_model ../exp/ctdet/coco_hg/model_last.pth
The images pop up annotated with the detection results so you can inspect the model's performance; use the --load_model parameter to test a specific model.
Errors encountered and their fixes:
1. Traceback (most recent call last):
File "main.py", line 12, in <module>
from models.model import create_model, load_model, save_model
File "/home/fychen/AI/CenterNet/src/lib/models/model.py", line 12, in <module>
from .networks.pose_dla_dcn import get_pose_net as get_dla_dcn
File "/home/fychen/AI/CenterNet/src/lib/models/networks/pose_dla_dcn.py", line 16, in <module>
from .DCNv2.dcn_v2 import DCN
File "/home/fychen/AI/CenterNet/src/lib/models/networks/DCNv2/dcn_v2.py", line 11, in <module>
from .dcn_v2_func import DCNv2Function
File "/home/fychen/AI/CenterNet/src/lib/models/networks/DCNv2/dcn_v2_func.py", line 9, in <module>
from ._ext import dcn_v2 as _backend
File "/home/xfychenrt/AI/CenterNet/src/lib/models/networks/DCNv2/_ext/dcn_v2/__init__.py", line 3, in <module>
from ._dcn_v2 import lib as _lib, ffi as _ffi
ImportError: /home/fychen/AI/CenterNet/src/lib/models/networks/DCNv2/_ext/dcn_v2/_dcn_v2.so: undefined symbol: __cudaPopCallConfiguration
Wrong CUDA version (CUDA 10); the matching CUDA 9 must be installed.
2. Traceback (most recent call last):
File "main.py", line 12, in <module>
from models.model import create_model, load_model, save_model
File "/home/fychen/AI/CenterNet/src/lib/models/model.py", line 12, in <module>
from .networks.pose_dla_dcn import get_pose_net as get_dla_dcn
File "/home/fychen/AI/CenterNet/src/lib/models/networks/pose_dla_dcn.py", line 16, in <module>
from .DCNv2.dcn_v2 import DCN
File "/home/fychen/AI/CenterNet/src/lib/models/networks/DCNv2/dcn_v2.py", line 11, in <module>
from .dcn_v2_func import DCNv2Function
File "/home/fychen/AI/CenterNet/src/lib/models/networks/DCNv2/dcn_v2_func.py", line 9, in <module>
from ._ext import dcn_v2 as _backend
File "/home/fychen/AI/CenterNet/src/lib/models/networks/DCNv2/_ext/dcn_v2/__init__.py", line 2, in <module>
from torch.utils.ffi import _wrap_function
File "/home/fychen/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/utils/ffi/__init__.py", line 1, in <module>
raise ImportError("torch.utils.ffi is deprecated. Please use cpp extensions instead.")
ImportError: torch.utils.ffi is deprecated. Please use cpp extensions instead.
The CenterNet code targets the PyTorch 0.4.1 API but PyTorch 1.0.0 or 1.3 is installed; roll PyTorch back to 0.4.1.
3. ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)
Process Process-1:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 110, in _worker_loop
data_queue.put((idx, samples))
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 341, in put
obj = _ForkingPickler.dumps(obj)
File "/opt/conda/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 190, in reduce_storage
fd, size = storage._share_fd_()
RuntimeError: unable to write to file </torch_1552_747694723>
Traceback (most recent call last):
File "main.py", line 102, in <module>
main(opt)
File "main.py", line 70, in main
log_dict_train, _ = trainer.train(epoch, train_loader)
File "/workspace/CenterNet/src/lib/trains/base_trainer.py", line 119, in train
return self.run_epoch('train', epoch, data_loader)
File "/workspace/CenterNet/src/lib/trains/base_trainer.py", line 61, in run_epoch
for iter_id, batch in enumerate(data_loader):
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 330, in __next__
idx, batch = self._get_batch()
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 309, in _get_batch
return self.data_queue.get()
File "/opt/conda/lib/python3.6/queue.py", line 164, in get
self.not_empty.wait()
File "/opt/conda/lib/python3.6/threading.py", line 295, in wait
waiter.acquire()
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 227, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1552) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.
Insufficient shared memory. Commit the container as a new image to preserve all changes made so far, exit the current container, then create a new container from that image, making sure to add --ipc=host to the run command so the container uses the host's shared memory.
4. Traceback (most recent call last):
File "main.py", line 102, in <module>
main(opt)
File "main.py", line 70, in main
log_dict_train, _ = trainer.train(epoch, train_loader)
File "/workspace/CenterNet/src/lib/trains/base_trainer.py", line 119, in train
return self.run_epoch('train', epoch, data_loader)
File "/workspace/CenterNet/src/lib/trains/base_trainer.py", line 69, in run_epoch
output, loss, loss_stats = model_with_loss(batch)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
Arnold img_path=== /workspace/CenterNet/src/lib/../../data/coco/train2017/00001753.jpg
result = self.forward(*input, **kwargs)
File "/workspace/CenterNet/src/lib/trains/base_trainer.py", line 19, in forward
outputs = self.model(batch['input'])
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/workspace/CenterNet/src/lib/models/networks/large_hourglass.py", line 255, in forward
inter = self.pre(image)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
input = module(input)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/workspace/CenterNet/src/lib/models/networks/large_hourglass.py", line 27, in forward
conv = self.conv(x)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (11) : invalid argument at /opt/conda/conda-bld/pytorch_1532579805626/work/aten/src/THC/THCGeneral.cpp:663
Fix in main.py:
torch.backends.cudnn.benchmark = False
5. Warning when running demo.py:
Creating model...
loaded ../exp/ctdet/coco_hg/model_last.pth, epoch 300
Skip loading parameter hm.0.1.weight, required shapetorch.Size([80, 256, 1, 1]), loaded shapetorch.Size([1, 256, 1, 1]).
If you see this, your model does not fully load the pre-trained weight. Please make sure you have correctly specified --arch xxx or set the correct --num_classes for your own dataset.
Fix: in lib/opts.py set num_classes in default_dataset_info to your actual number of classes (1 in my case).
To use the debugger, set --debug on the command line (or opt.debug in code) to a value of at least 1, and remember to change coco_class_name in src/lib/utils/debugger.py to your own dataset's class names.