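Full series
[win 10] Getting started with maskrcnn-benchmark (1): environment setup and an introduction to the COCO dataset
[win 10] Getting started with maskrcnn-benchmark (2): starting training
[win 10] Getting started with maskrcnn-benchmark (3): faster-rcnn inference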
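0. Environment setup
I followed the official installation instructions; for problems along the way, see my earlier blog posts. This time I did the whole setup on win10, and the demos all run fine. On the PyTorch version: the official instructions say it must be 1.0.0, but in my tests 1.1.0 works as well. Use torchvision==0.3.0, not 0.4.0, which raises an error; this is described in an issue.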
conda install pytorch==1.1.0 torchvision==0.3.0 cudatoolkit=10.0 -c pytorch
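1. Introduction to the COCO dataset
I'll skip running the demo. This was my first time using COCO, and you have to download the dataset yourself from the COCO website. That led to an annoying problem: I didn't know which files to download and couldn't find the instance dataset, only stuff.
Thanks to a Zhihu article (see Reference), I got a basic understanding of the COCO API. Because this is a large public dataset, it has a public API and conventions, so the dataloader differs from the usual ones. COCO stands for Common Objects in COntext; it is a dataset provided by a Microsoft team that can be used for image recognition. The images in MS COCO are split into training, validation and test sets. COCO collected its images by searching Flickr for 80 object categories and various scene types, and it used Amazon's Mechanical Turk (AMT).
The three annotation types, object instances, object keypoints and image captions, share these basic structures: info, image and license.
For instance segmentation, the annotation file contains these 5 keys.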
{
"info": info,
"licenses": [license],
"images": [image],
"annotations": [annotation],
"categories": [category]
}
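As a quick sanity check after downloading, the five keys above can be verified in a few lines of Python. This is only a sketch: the helper name missing_top_level_keys is my own, and the sample below is a tiny in-memory stand-in rather than a real annotation file.

```python
import json

# The five top-level keys listed above for an instance annotation file.
REQUIRED_KEYS = {"info", "licenses", "images", "annotations", "categories"}


def missing_top_level_keys(dataset):
    """Return the sorted list of required top-level keys that `dataset` lacks."""
    return sorted(REQUIRED_KEYS - set(dataset))


# Hypothetical miniature annotation file, serialized and parsed the same way
# a real file would be after reading it from disk.
sample_text = json.dumps({
    "info": {"year": 2017},
    "licenses": [],
    "images": [],
    "annotations": [],
    "categories": [],
})
sample = json.loads(sample_text)

print(missing_top_level_keys(sample))        # []
print(missing_top_level_keys({"info": {}}))  # the other four keys
```

For a real file, replace the json.loads call with json.load on the opened annotation file; expect it to take a while, since the instance files are large.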
Structure of the three shared keys
info{
"year": int,
"version": str,
"description": str,
"contributor": str,
"url": str,
"date_created": datetime,
}
license{
"id": int,
"name": str,
"url": str,
}
image{
"id": int,
"width": int,
"height": int,
"file_name": str,
"license": int,
"flickr_url": str,
"coco_url": str,
"date_captured": datetime,
}
Opening that JSON file was painfully slow, so only a small part of it is expanded in detail below.
1.1 info
"info":
{"description": "COCO 2017 Dataset",
"url": "http://cocodataset.org",
"version": "1.0",
"year": 2017,
"contributor": "COCO Consortium",
"date_created": "2017/09/01"},
1.2 images
images is an array containing many image instances; for the structure of an image, see the shared structures above.
"images":
[{"license": 4,
"file_name": "000000397133.jpg",
"coco_url": "http://images.cocodataset.org/val2017/000000397133.jpg",
"height": 427,
"width": 640,
"date_captured": "2013-11-14 17:02:52",
"flickr_url": "http://farm7.staticflickr.com/6116/6255196340_da26cf2c9e_z.jpg",
"id": 397133},
{"license": 1,
"file_name": "000000037777.jpg",
"coco_url": "http://images.cocodataset.org/val2017/000000037777.jpg",
"height": 230,
"width": 352,
"date_captured": "2013-11-14 20:55:31",
"flickr_url": "http://farm9.staticflickr.com/8429/7839199426_f6d48aa585_z.jpg",
"id": 37777},
{"license": 4,
"file_name": "000000252219.jpg",
"coco_url": "http://images.cocodataset.org/val2017/000000252219.jpg",
"height": 428,
"width": 640,
...
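Since annotations refer to images only by id, it helps to index the images array once. A minimal sketch, using two of the records shown above (trimmed to a few fields); index_images_by_id is a name I made up for illustration:

```python
def index_images_by_id(images):
    """Map each image's "id" to its full record for constant-time lookup."""
    return {img["id"]: img for img in images}


# Two records copied from the val2017 excerpt above, with some fields omitted.
images = [
    {"id": 397133, "file_name": "000000397133.jpg", "height": 427, "width": 640},
    {"id": 37777, "file_name": "000000037777.jpg", "height": 230, "width": 352},
]

by_id = index_images_by_id(images)
print(by_id[37777]["file_name"])                     # 000000037777.jpg
print(by_id[397133]["width"], by_id[397133]["height"])  # 640 427
```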
1.3 licenses
licenses is also an array, like images.
"licenses":
[{"url": "http://creativecommons.org/licenses/by-nc-sa/2.0/",
"id": 1,
"name": "Attribution-NonCommercial-ShareAlike License"},
{"url": "http://creativecommons.org/licenses/by-nc/2.0/",
"id": 2,
"name": "Attribution-NonCommercial License"},
{"url": "http://creativecommons.org/licenses/by-nc-nd/2.0/",
"id": 3,
"name": "Attribution-NonCommercial-NoDerivs License"},
{"url": "http://creativecommons.org/licenses/by/2.0/",
"id": 4,
"name": "Attribution License"},
{"url": "http://creativecommons.org/licenses/by-sa/2.0/",
"id": 5,
"name": "Attribution-ShareAlike License"},
{"url": "http://creativecommons.org/licenses/by-nd/2.0/",
"id": 6,"name": "Attribution-NoDerivs License"},
...
1.4 annotations
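The basic annotation is shown below. "segmentation": RLE or [polygon] needs some explanation. The segmentation format depends on whether the instance is a single object (iscrowd=0, in which case the polygon format is used) or a group of objects (iscrowd=1, in which case the RLE format is used). For example, when three people overlap, iscrowd=1, while a single person standing next to them is 0.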
annotation{
"id": int,
"image_id": int,
"category_id": int,
"segmentation": RLE or [polygon],
"area": float,
"bbox": [x,y,width,height],
"iscrowd": 0 or 1,
}
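The bbox field stores the top-left corner plus width and height, while many detection pipelines work with corner coordinates. The sketch below converts between the two and drops crowd annotations. The annotation values are invented for illustration, and the conversion is the plain x + width form; some libraries apply an extra one-pixel offset, which is not modeled here.

```python
def xywh_to_xyxy(bbox):
    """Convert [x, y, width, height] to [x_min, y_min, x_max, y_max]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


def non_crowd_boxes(annotations):
    """Return corner-format boxes for annotations with iscrowd == 0."""
    return [xywh_to_xyxy(a["bbox"]) for a in annotations if a["iscrowd"] == 0]


# Hypothetical annotations: one single object and one crowd region.
annotations = [
    {"id": 1, "image_id": 10, "category_id": 1,
     "bbox": [10.0, 20.0, 30.0, 40.0], "iscrowd": 0},
    {"id": 2, "image_id": 10, "category_id": 1,
     "bbox": [0.0, 0.0, 100.0, 50.0], "iscrowd": 1},
]

print(non_crowd_boxes(annotations))  # [[10.0, 20.0, 40.0, 60.0]]
```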
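A polygon is a polygonal outline, and RLE is run-length encoding. The official comment on RLE is reproduced below; here is a rough reading of its examples: given M=[0 0 1 1 1 0 1] the RLE counts would be [2 3 1 1], or for M=[1 1 1 1 1 1 0] the counts would be [0 6 1] (note that the odd counts are always the numbers of zeros). Note that when M starts with 1, the first count in the encoding is 0, because the counts always begin with the run of zeros. It's that simple, a neat encoding!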
# RLE is a simple yet efficient format for storing binary masks. RLE
# first divides a vector (or vectorized image) into a series of piecewise
# constant regions and then for each piece simply stores the length of
# that piece. For example, given M=[0 0 1 1 1 0 1] the RLE counts would
# be [2 3 1 1], or for M=[1 1 1 1 1 1 0] the counts would be [0 6 1]
# (note that the odd counts are always the numbers of zeros). Instead of
# storing the counts directly, additional compression is achieved with a
# variable bitrate representation based on a common scheme called LEB128.
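To make the two examples in the comment concrete, here is a small sketch of the uncompressed counting step only. It works on a flat list of 0/1 values and does not implement the LEB128-style compression mentioned at the end of the comment; the function names rle_encode and rle_decode are my own.

```python
def rle_encode(mask):
    """Return run lengths for a flat 0/1 sequence, starting with the zeros run.

    If the mask starts with 1, the first count is 0, so that odd-positioned
    counts (1st, 3rd, ...) are always runs of zeros.
    """
    counts = []
    current = 0  # value of the run being counted; encoding starts with zeros
    run = 0
    for value in mask:
        if value == current:
            run += 1
        else:
            counts.append(run)
            current = value
            run = 1
    counts.append(run)
    return counts


def rle_decode(counts):
    """Rebuild the flat 0/1 sequence from run lengths."""
    mask = []
    value = 0
    for count in counts:
        mask.extend([value] * count)
        value = 1 - value
    return mask


print(rle_encode([0, 0, 1, 1, 1, 0, 1]))  # [2, 3, 1, 1]
print(rle_encode([1, 1, 1, 1, 1, 1, 0]))  # [0, 6, 1]
print(rle_decode([0, 6, 1]))              # [1, 1, 1, 1, 1, 1, 0]
```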
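Specifically:
polygon: used for a single object; it lists the x,y coordinates of the polygon outline. The count of numbers is always even: n numbers represent n/2 points.
RLE: to represent a per-pixel annotation, each pixel is marked 0 or 1, where 1 means the object is present, and the result is run-length encoded.
area is the area of the encoded mask, i.e. the area of the annotated region. For a rectangular box it would be height times width; for a polygon or RLE it is computed differently.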
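The polygon layout and the idea of "area computed differently" can be sketched as follows. The flat list is split into (x, y) pairs, and the shoelace formula gives the geometric area of the outline. Treat this as an approximation for intuition only: the area stored in the dataset is the area of the encoded mask, so the numbers need not match exactly. The rectangle below is a made-up example.

```python
def polygon_to_points(flat):
    """Split a flat [x1, y1, x2, y2, ...] list into (x, y) pairs."""
    if len(flat) % 2 != 0:
        raise ValueError("polygon must contain an even number of values")
    return list(zip(flat[0::2], flat[1::2]))


def shoelace_area(flat):
    """Geometric area of a simple polygon given as a flat coordinate list."""
    points = polygon_to_points(flat)
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]  # wrap around to the first point
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Hypothetical 4 x 3 rectangle: 8 numbers, so 8 / 2 = 4 points.
rectangle = [0, 0, 4, 0, 4, 3, 0, 3]
print(polygon_to_points(rectangle))  # [(0, 0), (4, 0), (4, 3), (0, 3)]
print(shoelace_area(rectangle))      # 12.0
```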
1.5 categories
The previous section covered a lot of ground and took some digesting. categories is simply the class (cls) field.
{
"id": int,
"name": str,
"supercategory": str,
}
Example:
{
"supercategory": "person",
"id": 1,
"name": "person"
},
{
"supercategory": "vehicle",
"id": 2,
"name": "bicycle"
},
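For training, category ids usually need to be mapped to names and to contiguous labels, because COCO category ids have gaps (80 classes, but ids run past 80). A minimal sketch: the first two entries come from the example above, the third is a hypothetical entry added only to show a gap, and reserving label 0 for background is my assumption, not something the file format prescribes.

```python
def build_category_maps(categories):
    """Return (id -> name, id -> contiguous label starting at 1)."""
    ordered = sorted(categories, key=lambda c: c["id"])
    id_to_name = {c["id"]: c["name"] for c in ordered}
    # Assumption: label 0 is reserved for background, so labels start at 1.
    id_to_label = {c["id"]: i + 1 for i, c in enumerate(ordered)}
    return id_to_name, id_to_label


categories = [
    {"supercategory": "person", "id": 1, "name": "person"},
    {"supercategory": "vehicle", "id": 2, "name": "bicycle"},
    # Hypothetical entry, only here to demonstrate a gap in the ids.
    {"supercategory": "example", "id": 5, "name": "example_class"},
]

id_to_name, id_to_label = build_category_maps(categories)
print(id_to_name[2])  # bicycle
print(id_to_label)    # {1: 1, 2: 2, 5: 3}
```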
Reference
- https://zhuanlan.zhihu.com/p/29393415