Object detection with the YOLOv3 framework: crawling Baidu and Google Images to build a VOC-format dataset

Original

胡椒面er

2020-06-22 00:16

The image data comes from Baidu Images and Google Images.

Partly based on this article: https://blog.csdn.net/wobeatit/article/details/79559314

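Because Google Images results are of better quality, method 1 is recommended:
use the googleimagesdownload tool to crawl Google Images.
This requires getting around the firewall so that Google Images is reachable, for example with a browser plugin or a self-hosted Amazon AWS server.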

The methods below have only been tested on Ubuntu.

1. Crawling Google Images on Ubuntu with the google-images-download tool

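If you have a proxy and can reach Google Images, use this approach. It is stable and highly recommended!
The official documentation:
Project page: googleimagesdownload
Installation: installing googleimagesdownload
Usage examples: usage examples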

It can be installed directly with pip:

pip install google_images_download

googleimagesdownload -k "滅火器箱" --size medium -l 1000 --chromedriver ./chromedriver

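Command-line arguments:
-k "keywords to search for"
--size image size, e.g. medium
-l limit on the number of images to download
--chromedriver path to the Chrome driver
There are many tutorials on downloading and installing the Chrome driver, for example: https://blog.csdn.net/qq_41188944/article/details/79039690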

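With -l set to 1000 as above, only 409 images were actually downloaded because of image copyright restrictions; change the search term and run the download again to get more.
Google Images results are of high quality and are essentially already labelled images.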

The tool creates a downloads folder in the terminal's current directory and places the crawled images there.
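
Since one search term rarely yields enough images, the command can be assembled once per keyword from a small script. The following is a minimal sketch, not part of the tool: build_command and run_for_keywords are hypothetical helper names, and only the flags already shown above (-k, --size, -l, --chromedriver) are used. The check on limit reflects the tool's documented requirement that requests above 100 images per keyword need a chromedriver.

```python
import shlex
import subprocess


def build_command(keyword, limit=100, size="medium", chromedriver=None):
    """Build the googleimagesdownload argument list for one keyword."""
    # Assumption taken from the tool's docs: more than 100 images per
    # keyword requires Selenium and a chromedriver binary.
    if limit > 100 and chromedriver is None:
        raise ValueError("limit > 100 requires a chromedriver path")
    cmd = ["googleimagesdownload", "-k", keyword, "--size", size, "-l", str(limit)]
    if chromedriver is not None:
        cmd += ["--chromedriver", chromedriver]
    return cmd


def run_for_keywords(keywords, **options):
    """Run the tool once per keyword (needs the tool installed and network access)."""
    for keyword in keywords:
        cmd = build_command(keyword, **options)
        print("running:", " ".join(shlex.quote(part) for part in cmd))
        subprocess.run(cmd, check=True)


# Only prints the argument list; nothing is downloaded here.
print(build_command("滅火器箱", limit=1000, chromedriver="./chromedriver"))
```

Calling run_for_keywords(["滅火器箱", "滅火器"], limit=1000, chromedriver="./chromedriver") would then run the same download for each search term in turn.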

2. Crawling Baidu Images with a Python script

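(1) Install the Chrome browser and the Chrome driver
Chrome driver installation: https://blog.csdn.net/qq_41188944/article/details/79039690

(2) Install the selenium library with pip install selenium (the script also imports bs4, so beautifulsoup4 must be installed as well)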

# ******* This script needs a local Chrome browser, the Chrome driver, and the selenium and beautifulsoup4 libraries *******
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import time
import urllib.request
from bs4 import BeautifulSoup as bs
import re
import os
# ****************************************************
# The original Baidu Images search URL is kept commented out here;
# the active URL below points at Shutterstock search instead.
#base_url_part1 = 'https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=index&fr=&hs=0&xthttps=111111&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word='
#base_url_part2 = '&oq=bagua&rsp=0'
base_url_part1 = 'https://www.shutterstock.com/zh/search/'
base_url_part2 = ''  # base_url_part1 and base_url_part2 are fixed and need no changes
search_query = '滅火器'  # search keyword, change as needed
location_driver = '/usr/bin/chromedriver'  # path to the Chrome driver on this machine


class Crawler:
    def __init__(self):
        self.url = base_url_part1 + search_query + base_url_part2

    # Start the Chrome browser through its driver
    def start_browser(self):
        chrome_options = Options()
        chrome_options.add_argument("--disable-infobars")
        # Selenium 4 style; on Selenium 3 pass executable_path=location_driver instead of service=...
        driver = webdriver.Chrome(service=Service(location_driver), options=chrome_options)
        # Maximize the window, because each pass only sees images inside the viewport
        driver.maximize_window()
        # Open the page to crawl
        driver.get(self.url)
        return driver

    def downloadImg(self, driver):
        t = time.localtime()
        # Folder name is today's date, e.g. 2020-6-22
        foldername = '{}-{}-{}'.format(t.tm_year, t.tm_mon, t.tm_mday)
        picpath = '/home/hujinlei/dev/DataSet/BaiduImage/%s' % foldername  # local download directory
        # Create the directory if it does not exist
        os.makedirs(picpath, exist_ok=True)
        # Image URLs already downloaded, to avoid fetching the same one twice
        seen_urls = set()
        x = 0  # running index used as the file name
        pos = 0  # current scroll offset in pixels
        for i in range(80):  # number of scroll steps; with 1 the page is effectively not scrolled
            pos += 500  # scroll down 500 pixels each step
            js = "document.documentElement.scrollTop=%d" % pos
            driver.execute_script(js)
            time.sleep(2)  # give lazily loaded images time to appear
            # Get the page source
            html_page = driver.page_source
            # Parse the page with BeautifulSoup4
            soup = bs(html_page, "html.parser")
            # Collect <img> tags whose src is an https URL containing .jpg or .png
            imglist = soup.find_all('img', {'src': re.compile(r'https:.*\.(jpg|png)')})

            for img in imglist:
                src = img['src']
                if src in seen_urls:
                    continue
                seen_urls.add(src)
                # Every file is saved with a .jpg name, whatever the source format
                target = '{}/{}.jpg'.format(picpath, x)
                try:
                    urllib.request.urlretrieve(src, target)
                except OSError as err:  # URLError and HTTPError are OSError subclasses
                    print('skipped {}: {}'.format(src, err))
                    continue
                x += 1

    def run(self):
        print('\t\t\t**************************************\n\t\t\t**\t\tWelcome to Use Spider\t\t**\n\t\t\t**************************************')
        driver = self.start_browser()
        try:
            self.downloadImg(driver)
        finally:
            driver.quit()  # close the browser even if a download step fails
        print("Download has finished.")


if __name__ == '__main__':
    craw = Crawler()
    craw.run()
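
The extract-and-deduplicate step of the script can be tried on its own, without a browser. The sketch below swaps BeautifulSoup for the standard-library html.parser so that it runs with no extra packages; ImgSrcParser and new_image_urls are illustrative names that do not appear in the script, and the example.com URLs in the usage lines are placeholders.

```python
import re
from html.parser import HTMLParser

# Same filter as the script: an https URL that contains .jpg or .png
IMG_SRC = re.compile(r"https:.*\.(jpg|png)")


class ImgSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag that matches IMG_SRC."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src and IMG_SRC.search(src):
            self.sources.append(src)


def new_image_urls(html, seen):
    """Return matching URLs in html that are not yet in seen, and add them to seen."""
    parser = ImgSrcParser()
    parser.feed(html)
    fresh = []
    for src in parser.sources:
        if src not in seen:
            seen.add(src)
            fresh.append(src)
    return fresh


seen = set()
first = '<img src="https://img.example.com/1.jpg"><img src="data:image/gif;base64,AAAA">'
second = first + '<img src="https://img.example.com/2.png"/>'
print(new_image_urls(first, seen))   # URLs found after the first scroll step
print(new_image_urls(second, seen))  # only URLs that appeared since then
```

Because the page source is re-read after every scroll step, most images show up again each time; keeping the seen set across calls is what prevents repeated downloads.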

3. How to batch-rename the downloaded images and build VOC, COCO, and similar datasets
To be added later.
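
Until then, here is a minimal sketch of one possible approach, not the author's method. It copies the crawled images into a VOC-style folder tree with six-digit sequential names, as used in VOC2007. The function name copy_with_voc_names, the set of accepted extensions, and writing every name into trainval.txt are assumptions made for illustration; the Annotations folder is only created, since labelling is a separate step.

```python
import shutil
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def copy_with_voc_names(src_dir, voc_root, start=1):
    """Copy images from src_dir to voc_root/JPEGImages as 000001.<ext>, 000002.<ext>, ..."""
    voc_root = Path(voc_root)
    for sub in ("JPEGImages", "Annotations", "ImageSets/Main"):
        (voc_root / sub).mkdir(parents=True, exist_ok=True)
    # Sort by path so the numbering is reproducible between runs
    images = sorted(p for p in Path(src_dir).iterdir()
                    if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES)
    names = []
    for index, path in enumerate(images, start=start):
        name = "{:06d}".format(index)
        # The original extension is kept; the file content is not converted
        shutil.copy2(path, voc_root / "JPEGImages" / (name + path.suffix.lower()))
        names.append(name)
    # List every image id, one per line, for later train/val splitting
    listing = voc_root / "ImageSets" / "Main" / "trainval.txt"
    listing.write_text("".join(name + "\n" for name in names))
    return names
```

The images are copied rather than moved so that the crawler output stays untouched if the numbering has to be redone, for example copy_with_voc_names("downloads/滅火器箱", "VOC2007").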
