Enough preamble, let's dive right in!
Official TensorFlow implementation (code link)
Official guide to training on PASCAL VOC 2012 (link)
Reference blog post (link)
(1) Set up Docker
- Create a local directory:
mkdir deeplabv3+
- Download the code:
git clone https://github.com/tensorflow/models
- Create a new container:
sudo nvidia-docker run -it -v /home/mass/tzr/deeplabv3+/models-master/:/home registry.docker-cn.com/ufoym/deepo:all-py36-jupyter /bin/bash
- Rename the container:
sudo docker rename old_name new_name
- Start and attach to the container:
sudo docker start deeplabv3_plus
sudo docker attach deeplabv3_plus
(2) Prepare the dataset
- Go into the deeplab/datasets folder and run the script that downloads the VOC 2012 dataset and converts it to TFRecord:
sh download_and_convert_voc2012.sh
- The resulting layout under deeplab/datasets/ is shown in the figure; the folders under exp/ have to be created by hand.
(3) Train
- Download a pretrained model and put it under the deeplab/backbone/ folder. I used this one:
wget http://download.tensorflow.org/models/deeplabv3_pascal_train_aug_2018_01_04.tar.gz
- Training:
python deeplab/train.py --logtostderr --training_number_of_steps=30000 --train_split="train" --model_variant="xception_65" --atrous_rates=6 --atrous_rates=12 --atrous_rates=18 --output_stride=16 --decoder_output_stride=4 --train_crop_size="513,513" --train_batch_size=1 --dataset="pascal_voc_seg" --tf_initial_checkpoint='/home/research/deeplab/backbone/deeplabv3_pascal_train_aug/model.ckpt' --train_logdir='/home/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train' --dataset_dir='/home/research/deeplab/datasets/pascal_voc_seg/tfrecord'
Basically you tune these flags to match your machine. A few of the important ones:
- training_number_of_steps: number of training iterations
- train_batch_size: batch size
- tf_initial_checkpoint: path to the pretrained weights
'/home/research/deeplab/backbone/deeplabv3_pascal_train_aug/model.ckpt'
- train_logdir: path for training logs and checkpoints
'/home/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train'
- dataset_dir: path to the TFRecord dataset
'/home/research/deeplab/datasets/pascal_voc_seg/tfrecord'
Now for the key part: the pitfalls! Here they come, and you'll regret skipping this!
- Pit 1
train.py is full of imports like from deeplab import ...
But train.py itself lives inside the deeplab folder, argh, I died! Solution: run train.py from research/, the parent directory of deeplab.
- Pit 2
No module named "nets"
Argh, died again!! nets lives in research/slim/nets!!!
Solution: change from nets.mobilenet import mobilenet_v2 to from slim.nets.mobilenet import mobilenet_v2
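Pits 1 and 2 are really the same problem: Python cannot see research/ (which provides the deeplab package) or research/slim (which provides nets) on its module path. Instead of moving files or editing imports one by one, you can put both directories on sys.path; the official guide does the equivalent by exporting PYTHONPATH from the research/ directory. A sketch, assuming the repo is mounted at /home as in the docker command above, so the code lives under /home/research:

```python
import sys

# Make `from deeplab import ...` and `from nets.mobilenet import ...` resolve:
# research/ provides the deeplab package, research/slim provides nets.
sys.path.append("/home/research")
sys.path.append("/home/research/slim")
```

The shell equivalent is `export PYTHONPATH=$PYTHONPATH:/home/research:/home/research/slim`, run once per container session.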
- Pit 3
TypeError: MonitoredTrainingSession() got an unexpected keyword argument 'summary_dir'
Died yet again!!! It's a TensorFlow version issue; I was on 1.8.0, and upgrading to 1.10 or later fixed it.
Solution: pip install tensorflow-gpu==1.10
- Pit 4
InvalidArgumentError (see above for traceback): Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
[[Node: image_pooling/BatchNorm/moving_variance_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](image_pooling/BatchNorm/moving_variance_1/tag, image_pooling/BatchNorm/moving_variance/read/_9643)]]
Died yet again!!!! The batch_size was set too small. There are two ways out:
Solution 1: reduce training_number_of_steps, delete everything in the train_logdir (deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train), and rerun.
Solution 2: set --fine_tune_batch_norm=False for train.py, which lets you keep --training_number_of_steps=30000 (tested, this one works better).
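A toy sketch of the intuition behind Pit 4: with train_batch_size=1 the per-batch variance is always 0, so batch norm's normalized output collapses and the variance statistics it accumulates while fine-tuning become degenerate. (That this is the exact path to the NaN in the histogram summary is my reading of the error, not something the logs state; the arithmetic below is just the intuition.)

```python
import math

def batch_norm(batch, eps=1e-5):
    """Toy batch normalization over a single feature."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

# With a batch of one, the batch variance is always 0 and the output
# collapses to 0, whatever the input value is:
print(batch_norm([3.7]))        # -> [0.0]
print(batch_norm([1.0, 3.0]))   # a batch of two normalizes sensibly
```

This is why --fine_tune_batch_norm=False (freeze the pretrained statistics instead of re-estimating them from tiny batches) is the usual fix when the batch size can't be raised.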
(4) Eval
Evaluate the trained model:
python deeplab/eval.py --logtostderr --eval_split="val" --model_variant="xception_65" --atrous_rates=6 --atrous_rates=12 --atrous_rates=18 --output_stride=16 --decoder_output_stride=4 --eval_crop_size="513,513" --dataset="pascal_voc_seg" --checkpoint_dir='/home/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train' --eval_logdir='/home/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/eval' --dataset_dir='/home/research/deeplab/datasets/pascal_voc_seg/tfrecord'
Important flags:
- checkpoint_dir: path to the trained model
'/home/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train'
- eval_logdir: path where the evaluation results are written
'/home/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/eval'
- dataset_dir: path to the TFRecord dataset
'/home/research/deeplab/datasets/pascal_voc_seg/tfrecord'
(5) Vis