Without further ado, let's get started!
Link to the official TensorFlow implementation
Link to the official guide for training on PASCAL VOC 2012
Link to a reference blog post
(1) Set up Docker
- Create a local directory:
mkdir deeplabv3+
- Download the code:
git clone https://github.com/tensorflow/models
- Create a new container:
sudo nvidia-docker run -it -v /home/mass/tzr/deeplabv3+/models-master/:/home registry.docker-cn.com/ufoym/deepo:all-py36-jupyter /bin/bash
- Rename the container:
sudo docker rename old_name new_name
- Start and attach to the container:
sudo docker start deeplabv3_plus
sudo docker attach deeplabv3_plus
(2) Prepare the dataset
- Go into the deeplab/datasets folder and run the script that downloads the VOC 2012 dataset and converts it to TFRecord:
sh download_and_convert_voc2012.sh
- The resulting layout of deeplab/datasets/ is shown in the figure; the folders under exp have to be created by hand.
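The exp folders that have to be created by hand can be made in one go. A minimal sketch, assuming the sub-folder names that the later --train_logdir/--eval_logdir flags in this post point at:

```shell
# Create the experiment folders that the train/eval/vis steps expect.
# The names below are inferred from the --train_logdir and --eval_logdir
# flags used later in this post; adjust them if your layout differs.
mkdir -p deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train
mkdir -p deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/eval
mkdir -p deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/vis
```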
(3) Train
- Download a pretrained model and put it under the deeplab/backbone/ folder. I used this one:
wget http://download.tensorflow.org/models/deeplabv3_pascal_train_aug_2018_01_04.tar.gz
- Train:
python deeplab/train.py --logtostderr --training_number_of_steps=30000 --train_split="train" --model_variant="xception_65" --atrous_rates=6 --atrous_rates=12 --atrous_rates=18 --output_stride=16 --decoder_output_stride=4 --train_crop_size="513,513" --train_batch_size=1 --dataset="pascal_voc_seg" --tf_initial_checkpoint='/home/research/deeplab/backbone/deeplabv3_pascal_train_aug/model.ckpt' --train_logdir='/home/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train' --dataset_dir='/home/research/deeplab/datasets/pascal_voc_seg/tfrecord'
Basically you just tune the flags to your machine's capacity. A brief note on the most important ones:
- training_number_of_steps: number of training iterations
- train_batch_size: batch size
- tf_initial_checkpoint: path to the pretrained weights
'/home/research/deeplab/backbone/deeplabv3_pascal_train_aug/model.ckpt'
- train_logdir: path where training logs are written
'/home/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train'
- dataset_dir: path to the TFRecord dataset
'/home/research/deeplab/datasets/pascal_voc_seg/tfrecord'
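To avoid retyping (and mistyping) the long paths, the training command can be assembled in a small wrapper that keeps every path in one place. This is only a convenience sketch; the paths are the same ones used throughout this post, so edit them for your setup, inspect the printed command, then run it:

```shell
# Keep the long paths in one place and print the final training command,
# so it can be inspected before actually running it.
RESEARCH=/home/research
CKPT=$RESEARCH/deeplab/backbone/deeplabv3_pascal_train_aug/model.ckpt
LOGDIR=$RESEARCH/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train
DATA=$RESEARCH/deeplab/datasets/pascal_voc_seg/tfrecord

TRAIN_CMD="python deeplab/train.py --logtostderr \
 --training_number_of_steps=30000 --train_split=train \
 --model_variant=xception_65 \
 --atrous_rates=6 --atrous_rates=12 --atrous_rates=18 \
 --output_stride=16 --decoder_output_stride=4 \
 --train_crop_size=513,513 --train_batch_size=1 \
 --dataset=pascal_voc_seg \
 --tf_initial_checkpoint=$CKPT \
 --train_logdir=$LOGDIR --dataset_dir=$DATA"

echo "$TRAIN_CMD"    # inspect, then run with: eval "$TRAIN_CMD"
```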
Now for the important part: the pitfalls! Here they come! Skip this and you'll regret it!
- Pitfall 1
train.py is full of imports like from deeplab import ...,
even though train.py itself sits inside the deeplab folder. Ugh, I died! Solution: invoke train.py from deeplab's parent directory, research (i.e. as python deeplab/train.py).
- Pitfall 2
No module named "nets"
And I died again!! nets lives in research/slim/nets!!!
Solution: change from nets.mobilenet import mobilenet_v2 to from slim.nets.mobilenet import mobilenet_v2
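An alternative that avoids editing source files is the fix the tensorflow/models research README itself documents: put both research/ and research/slim on PYTHONPATH before launching training. Run this from the research directory:

```shell
# Run from the research/ directory. Putting research/ and research/slim
# on PYTHONPATH lets both "import deeplab" and "import nets" resolve
# without modifying any imports in the source.
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
```

Note this export only lasts for the current shell session, so it has to be repeated (or added to ~/.bashrc) each time you re-enter the container.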
- Pitfall 3
TypeError: MonitoredTrainingSession() got an unexpected keyword argument 'summary_dir'
Died yet again!!! This is a TensorFlow version problem: I was on 1.8.0, and upgrading to 1.10 or later fixes it.
Solution: pip install tensorflow-gpu==1.10.0
- Pitfall 4
InvalidArgumentError (see above for traceback): Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
[[Node: image_pooling/BatchNorm/moving_variance_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](image_pooling/BatchNorm/moving_variance_1/tag, image_pooling/BatchNorm/moving_variance/read/_9643)]]
Died a fourth time!!!! The batch_size is set too small; there are two fixes.
Fix 1: lower training_number_of_steps, delete all files in the train_logdir folder, and re-run.
Fix 2: set --fine_tune_batch_norm=False for train.py, which lets you keep --training_number_of_steps=30000 (I tested both; this one works better).
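The re-run fix above requires clearing the training log directory first, because training otherwise resumes from the latest checkpoint there, which is the one that already diverged. A sketch, using the same --train_logdir path as earlier in this post (adjust to your own):

```shell
# Clear old checkpoints and event files before re-running after a NaN
# crash; otherwise training resumes from the diverged checkpoint.
# TRAIN_LOGDIR matches the --train_logdir flag used earlier in this post.
TRAIN_LOGDIR=/home/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train
rm -rf "$TRAIN_LOGDIR"/*    # -f: no error if the directory is already empty
```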
(4) Eval
Evaluate the results:
python deeplab/eval.py --logtostderr --eval_split="val" --model_variant="xception_65" --atrous_rates=6 --atrous_rates=12 --atrous_rates=18 --output_stride=16 --decoder_output_stride=4 --eval_crop_size="513,513" --dataset="pascal_voc_seg" --checkpoint_dir='/home/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train' --eval_logdir='/home/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/eval' --dataset_dir='/home/research/deeplab/datasets/pascal_voc_seg/tfrecord'
Notes on the important flags:
- checkpoint_dir: path to the trained model
'/home/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train'
- eval_logdir: path where the evaluation results are stored
'/home/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/eval'
- dataset_dir: path to the TFRecord dataset
'/home/research/deeplab/datasets/pascal_voc_seg/tfrecord'
(5) Vis