Baidu Index and 360 Index crawler in Python: based on selenium + chrome and image recognition

Original

小天狼星666

2018-09-04 21:52

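I. Preface:

1. This post mainly covers crawling Baidu Index; 360 Index can be obtained in a similar way.

2. You must be logged in to Baidu Index to see any data, and logging in too frequently triggers a CAPTCHA and an SMS verification code.

3. Baidu Index values are delivered as HTML combined with encrypted binary data, so they cannot be read by simply locating a DOM node and extracting its text.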

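II. Crawling approach:

1. Use selenium + chrome to simulate logging in to a Baidu account and obtain the cookies.

2. Because login sometimes requires a CAPTCHA, save the cookies and reuse them for later simulated logins.

3. Log in with the saved cookies, enter the keyword to reach the page that shows the index, and save a screenshot of the whole window locally.

4. Open the screenshot, locate the region that contains the search index, and crop it.

5. Run Tesseract-OCR on the cropped image. If the digits are recognized inaccurately, train the data with jTessBoxEditor to improve accuracy.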

III. Main code walkthrough:

1. Login

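import time
from selenium import webdriver

url = 'http://index.baidu.com/'
# Selenium 3 style; Selenium 4 passes the driver path through a Service object instead.
driver = webdriver.Chrome(
    executable_path='C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe')
# Cookies can only be added for the domain that is currently open, so load the site first.
driver.get(url)
cookieList = []  # paste the cookie dicts printed by driver.get_cookies() here
for cookie in cookieList:
    driver.add_cookie(cookie)
# Reload so the request is sent with the login cookies.
driver.get(url)
time.sleep(3)
driver.refresh()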

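I have removed the contents of cookieList here. To obtain them, enter your account and password by hand during the first simulated login and read the cookies with driver.get_cookies(). The program is as follows (it only retrieves the cookies; once they have been added to cookieList, it no longer needs to be part of the crawler):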

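import time
from selenium import webdriver

url = 'http://index.baidu.com/'
driver = webdriver.Chrome(
    executable_path='C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe')
driver.get(url)
# Pause for 30 seconds so you can log in by hand in the browser window.
time.sleep(30)
cookies = driver.get_cookies()  # list of dicts, one per cookie
print(cookies)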

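The 30-second pause leaves time to enter the account details. Paste the printed cookies (a list of dictionaries) into cookieList in the login code, and later runs skip both the CAPTCHA and the password entry.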
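Instead of pasting the printed cookies into the source by hand, they can be kept in a file. The following is only a minimal sketch of that idea, not part of the original program: the helper names save_cookies, load_cookies and add_cookies, and the JSON file format, are assumptions made for illustration. The only Selenium calls relied on are driver.get_cookies() and driver.add_cookie(), both of which appear in the code above.

```python
import json
from pathlib import Path


def save_cookies(cookies, path):
    """Write the list of cookie dicts (as returned by driver.get_cookies()) to a JSON file."""
    Path(path).write_text(
        json.dumps(cookies, ensure_ascii=False, indent=2), encoding="utf-8")


def load_cookies(path):
    """Read the cookie list back; return [] if the file does not exist yet."""
    file = Path(path)
    if not file.exists():
        return []
    return json.loads(file.read_text(encoding="utf-8"))


def add_cookies(driver, cookies):
    """Replay saved cookies into a driver that has already opened the target site."""
    for cookie in cookies:
        driver.add_cookie(cookie)
```

With these helpers, the one-off login script would end with save_cookies(driver.get_cookies(), 'cookies.json'), and the crawler would call add_cookies(driver, load_cookies('cookies.json')) in place of the hard-coded cookieList loop. The file name is hypothetical.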

2. Enter the keyword and maximize the window

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait (up to 10 s, polling every 0.5 s) for the search box, then type the keyword.
search_box = WebDriverWait(driver, 10, 0.5).until(
    EC.element_to_be_clickable((By.XPATH, "//input[@class='search-input']")))
search_box.send_keys(keyword)
# Class name kept exactly as in the original code.
span = WebDriverWait(driver, 10, 0.5).until(
    EC.element_to_be_clickable((By.XPATH, "//span[@class='search-input-cancle']")))
span.click()
driver.maximize_window()

3. Move the mouse onto the rectangle that contains the index, then shift it so that the viewbox appears

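from selenium.webdriver import ActionChains

time.sleep(2)
WebDriverWait(driver, 10, 0.5).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, '#trend > svg > rect')))
# The second matching <rect> is used as the hover target.
element = driver.find_elements(By.CSS_SELECTOR, '#trend > svg > rect')[1]
time.sleep(2)
# x, y: hover offsets relative to the rectangle; changing x moves to another day.
ActionChains(driver).move_to_element_with_offset(element, x, y).perform()
time.sleep(3)
# Screenshot of the window; the index value is cropped out of it in the next step.
driver.get_screenshot_as_file(str(index) + '.png')
WebDriverWait(driver, 10, 0.5).until(
    EC.element_to_be_clickable((By.XPATH, "//div[@id='viewbox']")))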

4. Locate the viewbox, crop the screenshot, and run image recognition

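import re
import pytesseract
from PIL import Image

element = driver.find_element(By.XPATH, "//div[@id='viewbox']")
getElementImage(driver, element, str(index) + '.png', 'day' + str(index) + '.png', keyword)
time.sleep(2)
number = Image.open('day' + str(index) + '.png')
# 'fontyp' is the name of the trained language data used by the original code.
number = pytesseract.image_to_string(number, lang='fontyp')
# Drop commas, periods, and whitespace.
number = re.sub(r'[,.\s]', '', number)
# Correct letters that the OCR confuses with digits.
number = number.replace('z', '2').replace('i', '7').replace('e', '9')
print(number)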
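The clean-up applied to the OCR output can be isolated in a small function, which makes it easy to check without a browser or Tesseract installed. This is a sketch only: the function name clean_ocr_number and the choice to return None for unreadable text are assumptions, while the substitutions (z to 2, i to 7, e to 9) are the ones used in the code above.

```python
import re

# Letter-for-digit confusions handled in the post: z -> 2, i -> 7, e -> 9.
OCR_FIXES = str.maketrans({"z": "2", "i": "7", "e": "9"})


def clean_ocr_number(raw):
    """Turn raw OCR text into an int, or return None if it is not purely digits."""
    # Remove commas, periods, and any whitespace (including newlines).
    text = re.sub(r"[,.\s]", "", raw)
    # Replace the letters that are known to be misread digits.
    text = text.translate(OCR_FIXES)
    # Accept only ASCII digits so that int() cannot fail.
    if re.fullmatch(r"[0-9]+", text):
        return int(text)
    return None
```

Returning None rather than a partly cleaned string lets the caller notice a failed recognition and, for example, retry that day or log it for later retraining.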

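def getElementImage(driver, element, fromPath, toPath, keyword):
    """
    Crop the part of a screenshot that corresponds to an element.
    :param driver: the WebDriver instance
    :param element: the element to locate
    :param fromPath: path of the source screenshot
    :param toPath: path where the cropped image is saved
    :param keyword: the search keyword (its length shifts the crop box)
    """
    # Element coordinates on the page
    locations = element.location
    # Scroll offset (for cross-browser compatibility)
    scroll = driver.execute_script("return window.scrollY;")
    top = locations['y'] - scroll
    # Element size
    sizes = element.size
    # Extra horizontal offset that grows with the keyword length
    add_length = (len(keyword) - 2) * sizes['width'] / 15
    # Bounding box of the index value: (left, upper, right, lower)
    rangle = (
        int(locations['x'] + sizes['width'] / 4 + add_length) - 2,
        int(top + sizes['height'] / 2),
        int(locations['x'] + sizes['width'] * 2 / 3) + 2,
        int(top + sizes['height']))
    time.sleep(2)
    image = Image.open(fromPath)
    cropImg = image.crop(rangle)
    cropImg.save(toPath)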
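To see what the crop arithmetic in getElementImage produces, the box calculation can be pulled out into a pure function and tried with sample numbers. This is a sketch for illustration: the name index_crop_box and the sample values are assumptions, and the formula is copied from the function above.

```python
def index_crop_box(location, size, scroll_y, keyword_len):
    """Return (left, upper, right, lower) of the index value inside the screenshot.

    location and size are dicts shaped like Selenium's element.location and
    element.size; scroll_y is the page's vertical scroll offset in pixels.
    """
    # Convert the page coordinate to a coordinate inside the visible window.
    top = location["y"] - scroll_y
    # Longer keywords push the number further to the right.
    extra = (keyword_len - 2) * size["width"] / 15
    left = int(location["x"] + size["width"] / 4 + extra) - 2
    upper = int(top + size["height"] / 2)
    right = int(location["x"] + size["width"] * 2 / 3) + 2
    lower = int(top + size["height"])
    return (left, upper, right, lower)


# Sample values: a 150 x 60 viewbox at (100, 500) with the page scrolled by 200 px.
box = index_crop_box({"x": 100, "y": 500}, {"width": 150, "height": 60}, 200, 2)
print(box)  # (135, 330, 202, 360)
```

The box covers the lower half of the viewbox, from roughly a quarter to two thirds of its width. The calculation assumes that one CSS pixel equals one screenshot pixel; on a display with scaling enabled the screenshot may be larger, and the coordinates would then need to be multiplied by the scale factor.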

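IV. Optimization

1. To collect 30 days of data, moving the mouse to the right in steps of 41.68 pixels works well. This is not the average computed from the rectangle that holds the data (41.86): a step of 41.68 works without problems, whereas a step of 41.86 leaves the viewbox hidden. I don't know why; if you do, please leave a comment. Many thanks.

2. Train a dataset with jTessBoxEditor to improve recognition accuracy; see https://www.cnblogs.com/zhang-ke/p/7606572.html for details.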
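The 41.68-pixel step from item 1 can be turned into the list of horizontal offsets to hover over, one per day. This is a rough sketch: the function name hover_offsets and the starting offset of 1 pixel are assumptions, and only the step size comes from the text.

```python
def hover_offsets(days=30, step=41.68, start=1.0):
    """Return the x offset to hover at for each day, left to right."""
    # Rounding keeps floating-point noise out of the values passed to the browser.
    return [round(start + i * step, 2) for i in range(days)]
```

Each value would be passed as x to move_to_element_with_offset, taking one screenshot per day. Whether the offset is measured from the element's top-left corner or from its center depends on the Selenium version (the origin moved to the center in Selenium 4.3), so the starting offset may need adjusting.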

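V. Closing

This post covers crawling the daily Baidu Index over a 30-day window; readers can extend it to other time ranges or to regional indexes. Crawling 360 Index is similar, although at the time of writing 360 Index shows a region option that still cannot be opened.

The Baidu Index crawler code is on GitHub: https://github.com/kingdomrushing/SpiderbaiduIndex-python

Contact QQ: 2422035338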
