[Python Crawler] Logging in to Baidu with a Selenium + PhantomJS Script

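Important notice: use this crawler script against Baidu services with caution, and do not use it for anything illegal. You alone bear the consequences, which may include a permanent account ban.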

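The script in this article runs on Selenium + PhantomJS + Python 3.
What Selenium and PhantomJS are and how to install them is not covered here; a quick search will turn up plenty of guides. The Selenium + PhantomJS combination is well suited to scenarios such as logins that require a verification code and crawling dynamically rendered pages. Try it and you will see 😊
PhantomJS download link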

Straight to the code:

First, the imports:

#!/usr/bin/env python3.6
# -*- coding:UTF-8 -*- 

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys
import time
import requests

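Next, set a request header for Selenium and initialize the driver. Note that the User-Agent must be that of a desktop browser; otherwise a different page is served, and the element lookups further down fail with errors.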

dcap = dict(DesiredCapabilities.PHANTOMJS)
# User-Agent string of Google Chrome on Windows 10
dcap['phantomjs.page.settings.userAgent'] = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36")
# Change executable_path to wherever phantomjs.exe lives on your machine
driver = webdriver.PhantomJS(executable_path='F:/我的下載/Google/phantomjs-2.1.1-windows/bin/phantomjs.exe', desired_capabilities=dcap)
driver.set_page_load_timeout(30)
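
The `dict(...)` call in the snippet above makes a copy, so the custom User-Agent is not written into the capability template shared by the whole process. A minimal sketch of that idea follows; `BASE_CAPS` is a simplified stand-in for `DesiredCapabilities.PHANTOMJS` and `with_user_agent` is a hypothetical helper, neither of which is part of Selenium.

```python
# Simplified stand-in for DesiredCapabilities.PHANTOMJS (assumed contents,
# for illustration only; the real template lives in Selenium).
BASE_CAPS = {"browserName": "phantomjs", "javascriptEnabled": True}

# Desktop Chrome User-Agent string taken from the article.
DESKTOP_UA = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36")


def with_user_agent(base_caps, user_agent):
    """Return a copy of base_caps with the PhantomJS User-Agent setting added."""
    caps = dict(base_caps)  # shallow copy: the shared template stays untouched
    caps["phantomjs.page.settings.userAgent"] = user_agent
    return caps


desktop_caps = with_user_agent(BASE_CAPS, DESKTOP_UA)
```

The returned dictionary could then be passed as `desired_capabilities`, exactly as `dcap` is in the script.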

Collect all the XPaths (almost every desktop browser can copy an element's XPath). You do not need to change these unless Baidu redesigns its login page.

# Baidu
# Baidu login page URL (opens on the QR-code login view)
url = "https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F&sms=5"
# Text link that switches to account login
gotologin_xpath = '//*[@id="TANGRAM__PSP_3__footerULoginBtn"]'
# Username text box
user_xpath = '//*[@id="TANGRAM__PSP_3__userName"]'
# Password text box
pwd_xpath = '//*[@id="TANGRAM__PSP_3__password"]'
# Login button
login_xpath = '//*[@id="TANGRAM__PSP_3__submit"]'
# Phone verification -> text box for the verification code
certify_phone_edittext_xpath = '//*[@id="TANGRAM__30__input_vcode"]'
# Button that requests the SMS verification code
certify_phone_bt_xpath = '//*[@id="TANGRAM__30__button_send_mobile"]'
# Submit button of the phone verification form
certify_phone_submit_xpath = '//*[@id="TANGRAM__30__button_submit"]'
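
The `u` parameter in the login URL above is simply the percent-encoded address to return to after login. As a small sketch, the same URL can be assembled with the standard library; `build_login_url` is a hypothetical helper name, and the other query parameters are copied verbatim from the article without any claim about what they mean to Baidu.

```python
from urllib.parse import quote


def build_login_url(redirect_to="http://www.baidu.com/"):
    """Rebuild the article's login URL with a percent-encoded redirect target."""
    # safe="" forces ':' and '/' to be encoded as %3A and %2F as well.
    encoded = quote(redirect_to, safe="")
    return "https://passport.baidu.com/v2/?login&tpl=mn&u=" + encoded + "&sms=5"


url = build_login_url()
```

Building the URL this way makes it harder to mistype the encoded part by hand if the redirect target ever needs to change.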

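XPath itself is not explained here either; look it up if it is unfamiliar. As an example of how to obtain one, the XPath of the "Log in" tag can be found as follows:
1>. Move the mouse over the tag whose XPath you want, then right-click and choose 'Inspect'.
2>. As shown in the figure below, the element inspector jumps to that tag automatically. Right-click the highlighted (blue) area and choose Copy -> Copy XPath. The XPath text is now on your clipboard and can be pasted wherever you need it with Ctrl+V.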
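
To get a feel for what an expression such as `//*[@id="..."]` selects, it can be tried offline. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, on a made-up and heavily simplified fragment; `SAMPLE` is not Baidu's real markup, and `find_by_id` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the login form markup.
SAMPLE = """
<form id="TANGRAM__PSP_3__form">
  <input id="TANGRAM__PSP_3__userName" type="text" />
  <input id="TANGRAM__PSP_3__password" type="password" />
  <input id="TANGRAM__PSP_3__submit" type="submit" />
</form>
""".strip()

root = ET.fromstring(SAMPLE)


def find_by_id(tree_root, element_id):
    """Return the first descendant whose id attribute matches, or None."""
    # ".//*[@id='...']" is the relative form of the browser's '//*[@id="..."]'.
    return tree_root.find(".//*[@id='" + element_id + "']")


user_input = find_by_id(root, "TANGRAM__PSP_3__userName")
print(user_input.tag, user_input.get("type"))
```

Because an `id` is meant to be unique within a page, an XPath of this form is a fairly robust way to pin down a single element.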

The following code is arguably the core of the whole crawler script, and it looks quite simple, doesn't it?

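driver.get(url) # load the Baidu login page through the Selenium driver
time.sleep(3) # wait 3 s for the page to finish loading; otherwise the screenshot or the element lookups below fail with errors
driver.get_screenshot_as_file('./scraping.png') # screenshot of the current page state
gotologin = driver.find_element_by_xpath(gotologin_xpath) # locate the switch-to-account-login link
gotologin.click() # simulate a click on the switch-to-account-login link
time.sleep(1) # this sleep is arguably unnecessary: switching to account login only runs local JavaScript and needs no server round trip
driver.get_screenshot_as_file('./scraping_2.png') # another screenshot, to compare with the one taken before click()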
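
The fixed `time.sleep` calls above either wait longer than needed or not long enough. A common alternative is to poll until the element can actually be found. Below is a minimal, library-independent sketch of that idea; `wait_until` is a hypothetical helper, not a Selenium API (Selenium ships its own `WebDriverWait` in `selenium.webdriver.support.ui` for the same purpose).

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Exceptions raised by condition() are treated as "not ready yet".
    Raises TimeoutError if nothing truthy is returned within timeout seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except Exception:
            # For example, the element is not in the page yet. A real script
            # would probably narrow this to the lookup exception it expects.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Possible use with the script's driver (assumes `driver` and
# `gotologin_xpath` as defined in the article):
# gotologin = wait_until(lambda: driver.find_element_by_xpath(gotologin_xpath))
```

With a helper like this the script proceeds as soon as the element appears, and it fails with a clear timeout instead of an obscure lookup error when the page never loads.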

Below is a comparison of the page state before and after the simulated click on the switch-to-account-login link.

Only after the previous step has brought up the account-and-password login form can we locate elements on it:

# baidu
baidu_user_textedit = driver.find_element_by_xpath(user_xpath)
baidu_pwd_textedit = driver.find_element_by_xpath(pwd_xpath)
baidu_login_textedit = driver.find_element_by_xpath(login_xpath)
# ActionChains queues up a chain of actions; whether to use it or not is for you to judge
actions = (ActionChains(driver)
           .click(baidu_user_textedit).send_keys("<Baidu account name>")
           .click(baidu_pwd_textedit).send_keys("<password>")
           .send_keys(Keys.RETURN))
# The queued actions only take effect once perform() is called
actions.perform()
# Wait 3 s, then take another screenshot to see the current state
time.sleep(3)
driver.get_screenshot_as_file('./scraping_3.png')

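The previous step already submitted the login form. In actual runs, however, Baidu then redirected to a phone verification page, and an SMS verification code was required every time, so this extra step handles the phone verification flow.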

try:
    certify_phone_edittext = driver.find_element_by_xpath(certify_phone_edittext_xpath)
    certify_phone_bt = driver.find_element_by_xpath(certify_phone_bt_xpath)
    certify_phone_submit = driver.find_element_by_xpath(certify_phone_submit_xpath)

    driver.get_screenshot_as_file('./scraping_3.1.png')
    if certify_phone_edittext:
        certify_phone_bt.click()  # request the SMS verification code
        # Prompt on the command line for the code received on the phone
        msg_certify = input("Enter the verification code sent to your phone: ")
        if msg_certify:
            certify_phone_edittext.click()
            certify_phone_edittext.send_keys(msg_certify)
            # certify_phone_edittext.send_keys(Keys.RETURN)
            certify_phone_submit.click()  # submit the verification code
            time.sleep(2)
            driver.get_screenshot_as_file('./scraping_4.png')
            # Click the login button again
            baidu_login_textedit.click()
            driver.get_screenshot_as_file('./scraping_5.png')
        else:
            print("No verification code was entered.")
except Exception as e:
    print("Exception->", e)
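
The broad `try/except` above treats "no phone verification page appeared" and every other failure the same way. One way to separate the two is to turn a missing element into `None` and branch on that. The following is only a sketch: `find_or_none` is a hypothetical helper, and the `not_found` default is deliberately generic so the example runs without Selenium installed; in the real script one would likely pass `NoSuchElementException` from `selenium.common.exceptions`.

```python
def find_or_none(driver, xpath, not_found=Exception):
    """Return the element located by xpath, or None if the lookup raises not_found."""
    try:
        return driver.find_element_by_xpath(xpath)
    except not_found:
        return None


# Possible use in the script (assumes `driver` and the XPath variables
# defined in the article):
# code_box = find_or_none(driver, certify_phone_edittext_xpath)
# if code_box is None:
#     print("No phone verification page was shown.")
# else:
#     ...  # continue with the verification flow
```

Keeping the lookup failure separate makes it easier to tell a login that simply did not need phone verification from one that broke for another reason.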

scraping_3.1.png:

That is essentially it. All that remains is to check whether the login succeeded.

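from selenium.common.exceptions import NoSuchElementException

# Look for the username tag on the page after the simulated login; if it exists, the login succeeded.
login_check_xpath = '//*[@id="s_username_top"]/span'
try:
    # find_element_by_xpath raises NoSuchElementException when nothing matches
    # (it never returns a falsy value), so the failure branch must be an except clause
    driver.find_element_by_xpath(login_check_xpath)
except NoSuchElementException:
    print("Failed to log in to Baidu.")
else:
    print("Logged in successfully.")
    html = driver.page_source  # get the page's HTML
    # soup = BeautifulSoup(html, 'lxml')  # parse the HTML
    with open("baidu_login_aft.html", "w", encoding="utf-8") as f:
        f.write(html)
# Finally, do not forget to shut down the driver (quit() also ends the PhantomJS process)
driver.quit()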

Command-line screenshot while running:

The Baidu home page retrieved after a successful login:

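Postscript: this script does not handle logins that require an image CAPTCHA, because none came up during development and debugging. I am also still studying automatic recognition of image CAPTCHAs and will add it to this login script in a later update.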
