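I have recently been using Caffe_Windows for CNN-based image classification. Data collection was not my responsibility before, but today I decided to get that part working too, so that later I can experiment with recognition tasks of my own. Training datasets matter a great deal for a CNN, so fetching the data is an essential step.
Sample dataset: wget -c https://storage.googleapis.com/openimages/2016_08/images_2016_08_v5.tar.gz
First, take a look at the dataset: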
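# -*- coding: utf-8 -*-
import csv
import os                   # used by the download code later on
from urllib import request  # used by the download code later on

# errors='ignore' skips any bytes that cannot be decoded as gb18030.
with open('./validation/images.csv', 'r', encoding='gb18030', errors='ignore') as file:
    imagereader = csv.DictReader(file)
    for item in imagereader:
        print(item)  # each row is a dict keyed by the CSV header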
DictReader is chosen deliberately over reader: each row comes back as a dict, which is easier to work with. Part of the output is shown below:
With this, downloading an image only requires reading
item['OriginalURL']
from each row.
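As a small illustration of this lookup, here is a minimal sketch that runs on an in-memory CSV instead of the real file. Only the OriginalURL column name comes from the dataset as used above; the ImageID column, the example.com URLs, and the helper name original_urls are made up for the example.

```python
import csv
import io

# Made-up sample rows; only the OriginalURL column name is taken from the post.
SAMPLE = """ImageID,OriginalURL
0001,https://example.com/photos/cat.jpg
0002,https://example.com/photos/dog.jpg
"""


def original_urls(fileobj):
    """Yield the non-empty OriginalURL value of each CSV row."""
    for row in csv.DictReader(fileobj):
        url = (row.get("OriginalURL") or "").strip()
        if url:
            yield url


urls = list(original_urls(io.StringIO(SAMPLE)))
print(urls)
```

Because the rows are dicts, the code keeps working if the column order in the CSV changes, which would not be true for index-based access with csv.reader.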
A first implementation:
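# This loop replaces the print loop in the previous snippet
# (same imagereader, os and request).
for item in imagereader:
    for url in item['OriginalURL'].split('\n'):
        url = url.strip()
        if not url:
            continue
        filename = url.split('/')[-1]
        print("Download:", url)
        renum = 3  # remaining attempts
        # Skip files that already exist; retry up to three times on errors.
        while not os.path.exists(filename) and renum > 0:
            try:
                # Read the whole body first, so a failed request
                # never leaves an empty file behind.
                with request.urlopen(url, timeout=3) as web:
                    data = web.read()
                with open(filename, 'wb') as img:
                    img.write(data)
            except IOError as e:
                print(e)
                renum -= 1

I also added a check for already-downloaded files and a timeout. Testing shows that this version is very slow.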
To speed things up, use multiple threads:
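# -*- coding: utf-8 -*-
import csv
import os
import queue
import threading
from urllib import request


class CsvReaderImage(threading.Thread):
    """Worker thread that downloads URLs taken from a shared queue."""

    def __init__(self, urls):
        threading.Thread.__init__(self)
        self._urls = urls

    def run(self):
        # run() is what start() executes in the new thread; calling a
        # method directly would run it in the main thread instead.
        while True:
            try:
                url = self._urls.get_nowait()
            except queue.Empty:
                return
            filename = url.split('/')[-1]
            print("Download:", url)
            renum = 3  # remaining attempts
            while not os.path.exists(filename) and renum > 0:
                try:
                    with request.urlopen(url, timeout=3) as web:
                        data = web.read()
                    with open(filename, 'wb') as img:
                        img.write(data)
                except IOError as e:
                    print(e)
                    renum -= 1


if __name__ == '__main__':
    # Read every URL once, then let three threads share the queue.
    urls = queue.Queue()
    with open('./validation/images.csv', 'r', encoding='gb18030', errors='ignore') as file:
        for item in csv.DictReader(file):
            for url in item['OriginalURL'].split('\n'):
                if url.strip():
                    urls.put(url.strip())
    threads = [CsvReaderImage(urls) for _ in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Download results: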
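Takeaways:
The functionality works, but some things are still not handled well: avoiding repeated downloads, which calls for a cache, and resuming interrupted downloads. I will optimize and polish this later when I find the time.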
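One possible direction for the first of these points, as a rough sketch and not a finished implementation: treat the output directory itself as the cache, and write each image to a temporary ".part" file that is renamed only after the download completes, so an interrupted run never leaves a half-written file that looks finished. The names fetch, download_once and download_all, the ".part" suffix, and the injectable fetcher parameter are all assumptions made for this example. True resuming of a partial file (for instance with HTTP Range requests) is not covered here, and two URLs that end in the same file name would collide.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib import request
from urllib.parse import urlsplit


def fetch(url, timeout=3):
    """Default fetcher: return the response body as bytes."""
    with request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_once(url, out_dir, fetcher=fetch):
    """Download url into out_dir unless a finished file is already there."""
    # Use only the path part of the URL, so query strings do not end up
    # in the file name.
    name = os.path.basename(urlsplit(url).path)
    target = os.path.join(out_dir, name)
    if os.path.exists(target):
        return "cached"  # finished earlier, nothing to do
    data = fetcher(url)
    part = target + ".part"
    with open(part, "wb") as f:
        f.write(data)
    # Rename only after the write succeeded; os.replace overwrites atomically
    # on the same filesystem.
    os.replace(part, target)
    return "downloaded"


def download_all(urls, out_dir, workers=3, fetcher=fetch):
    """Download all urls with a small thread pool; results keep input order."""
    os.makedirs(out_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: download_once(u, out_dir, fetcher), urls))
```

Running download_all a second time on the same list and directory skips every file that was completed in the first run, which is the "cache" behaviour in its simplest form. Passing a different fetcher makes it possible to try the logic without touching the network.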