[Python Crawler] Logging in to Baidu with a Selenium + PhantomJS Script

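Important: use this crawler script against Baidu services with caution, and never use it for anything illegal. You alone bear the consequences, which may include permanent suspension of your account.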

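The script in this article runs on Selenium + PhantomJS + Python 3.
I won't explain what Selenium and PhantomJS are or how to install them here; a quick search will turn that up. A Selenium + PhantomJS setup is a good fit for scenarios such as logins that require a verification code and crawling dynamically rendered pages, as you will see once you try it 😊
PhantomJS download link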

Straight to the code.

First, the imports:

#!/usr/bin/env python3.6
# -*- coding:UTF-8 -*- 

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys
import time
import requests

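Next, set a request header for Selenium and initialize the driver. Note that the User-Agent must be that of a desktop browser; otherwise a different page is served, and the element lookups further down fail with errors.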

dcap = dict(DesiredCapabilities.PHANTOMJS)
# User-Agent string of Chrome on Windows 10
dcap['phantomjs.page.settings.userAgent']=("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36")
driver = webdriver.PhantomJS(executable_path='F:/我的下载/Google/phantomjs-2.1.1-windows/bin/phantomjs.exe',desired_capabilities=dcap)
driver.set_page_load_timeout(30)
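
The `executable_path` above is hard-coded to my own machine, so it will not work anywhere else as written. Below is a minimal sketch of one way to look the path up instead. The environment variable name `PHANTOMJS_PATH` and the helper `resolve_phantomjs_path` are my own inventions for illustration; neither Selenium nor PhantomJS reads such a variable by itself.

```python
import os
import shutil


def resolve_phantomjs_path(env_var="PHANTOMJS_PATH"):
    """Return a PhantomJS executable path, or None if none can be found.

    PHANTOMJS_PATH is a hypothetical variable name chosen for this sketch.
    """
    configured = os.environ.get(env_var)
    if configured:
        return configured
    # Fall back to searching the directories listed in PATH.
    return shutil.which("phantomjs")


phantomjs_path = resolve_phantomjs_path()
if phantomjs_path is None:
    print("PhantomJS not found; set PHANTOMJS_PATH or add it to PATH.")
else:
    print("Using PhantomJS at:", phantomjs_path)
```

The result can then be passed as `executable_path=phantomjs_path` when creating the driver.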

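Next, collect all the XPaths (nearly every desktop browser can copy an element's XPath). You do not need to change these unless Baidu redesigns its login page.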

# Baidu
# URL of the Baidu login page (the QR-code login page)
url = "https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F&sms=5"
# Text link that switches to username/password login
gotologin_xpath = '//*[@id="TANGRAM__PSP_3__footerULoginBtn"]'
# Username input box
user_xpath = '//*[@id="TANGRAM__PSP_3__userName"]'
# Password input box
pwd_xpath = '//*[@id="TANGRAM__PSP_3__password"]'
# Login button
login_xpath = '//*[@id="TANGRAM__PSP_3__submit"]'
# Phone verification -> input box for the verification code
certify_phone_edittext_xpath = '//*[@id="TANGRAM__30__input_vcode"]'
# Button that requests the SMS verification code
certify_phone_bt_xpath = '//*[@id="TANGRAM__30__button_send_mobile"]'
# Submit button of the phone verification form
certify_phone_submit_xpath = '//*[@id="TANGRAM__30__button_submit"]'

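I won't explain what XPath is either; look it up if it is new to you. As an example of how to obtain one, here is how to get the XPath of the "Login" element:
1>. Move the mouse over the element whose XPath you want, right-click, and choose "Inspect".
2>. As shown in the figure below, the inspector jumps straight to that element. Right-click the highlighted (blue) area and choose Copy -> Copy XPath. The XPath text is now on your clipboard, and you can paste it wherever you need it with Ctrl+V.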
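
Before pointing a copied XPath at the live site, it can help to try it on a small fragment first. The sketch below uses only the standard library; `SAMPLE_FORM` is a made-up stand-in for the real login form, and `find_by_copied_xpath` is a hypothetical helper name. Note that `xml.etree.ElementTree` supports only a subset of XPath, which happens to be enough for the `//*[@id="..."]` form that the browser produces here.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed fragment standing in for the real login form.
SAMPLE_FORM = """
<form>
  <input id="TANGRAM__PSP_3__userName" type="text" />
  <input id="TANGRAM__PSP_3__password" type="password" />
  <input id="TANGRAM__PSP_3__submit" type="submit" />
</form>
"""


def find_by_copied_xpath(root, xpath):
    """Evaluate a browser-copied XPath such as //*[@id="..."] with ElementTree.

    ElementTree only accepts relative paths, so a leading "." is added.
    """
    return root.find("." + xpath)


root = ET.fromstring(SAMPLE_FORM)
user_box = find_by_copied_xpath(root, '//*[@id="TANGRAM__PSP_3__userName"]')
print(user_box.tag, user_box.attrib["type"])

# An XPath that matches nothing yields None rather than an exception.
missing = find_by_copied_xpath(root, '//*[@id="no_such_id"]')
print(missing)
```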

This snippet is arguably the core of the whole crawler script. Looks pretty simple, doesn't it?

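driver.get(url) # load the Baidu login page through the Selenium driver
time.sleep(3) # wait 3 s for the page to finish loading; otherwise the screenshot or the element lookups below fail
driver.get_screenshot_as_file('./scraping.png') # screenshot the current state of the page
gotologin = driver.find_element_by_xpath(gotologin_xpath) # locate the switch-to-account-login link
gotologin.click() # simulate a click on the switch-to-account-login link
time.sleep(1) # not strictly needed: the switch only runs local JavaScript and does not talk to the server
driver.get_screenshot_as_file('./scraping_2.png') # screenshot again, to compare with the one taken before click()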
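
Fixed `time.sleep()` calls either wait too long or not long enough. Selenium ships `WebDriverWait` and `expected_conditions` in `selenium.webdriver.support` for this purpose; the sketch below shows the same polling idea in plain Python so it can be read without a browser at hand. The helper name `wait_for_xpath` and its default timings are my own assumptions, and `driver` is expected to offer `find_elements_by_xpath` like the Selenium driver used in this article.

```python
import time


def wait_for_xpath(driver, xpath, timeout=10.0, interval=0.5):
    """Poll until at least one element matches xpath, then return it.

    Raises TimeoutError if nothing matches within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        # find_elements_* returns an empty list instead of raising.
        matches = driver.find_elements_by_xpath(xpath)
        if matches:
            return matches[0]
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "no element matched %r within %s s" % (xpath, timeout)
            )
        time.sleep(interval)
```

With this in place, `gotologin = wait_for_xpath(driver, gotologin_xpath)` could stand in for the `sleep(3)` plus lookup pair.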

Here is the page before and after the simulated click on the switch-to-account-login link:

Only after the previous step has brought up the username/password form can we locate the elements on it:

# baidu
baidu_user_textedit=driver.find_element_by_xpath(user_xpath)
baidu_pwd_textedit=driver.find_element_by_xpath(pwd_xpath)
baidu_login_textedit=driver.find_element_by_xpath(login_xpath)
# ActionChains builds a chain of actions; whether it is better than calling the elements directly is for you to judge
actions = ActionChains(driver).click(baidu_user_textedit).send_keys("<Baidu username>").click(baidu_pwd_textedit).send_keys("<password>").send_keys(Keys.RETURN)
# A chain only takes effect once perform() is called
actions.perform()
# Wait 3 s, then take another screenshot to see the current state
time.sleep(3)
driver.get_screenshot_as_file('./scraping_3.png')
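
For comparison, here is a rough sketch of the same step without an action chain, calling `clear()`, `send_keys()` and `click()` on the elements directly. The function names and the environment variables `BAIDU_USER` and `BAIDU_PWD` are hypothetical, picked only so that the credentials do not have to sit in the source file.

```python
import os


def fill_login_form(driver, user_xpath, pwd_xpath, login_xpath,
                    username, password):
    """Type the credentials into the form and click the login button."""
    user_box = driver.find_element_by_xpath(user_xpath)
    pwd_box = driver.find_element_by_xpath(pwd_xpath)
    login_button = driver.find_element_by_xpath(login_xpath)

    # Clear first so that pre-filled text does not get mixed in.
    user_box.clear()
    user_box.send_keys(username)
    pwd_box.clear()
    pwd_box.send_keys(password)
    login_button.click()


def read_credentials():
    """Read credentials from the environment instead of hard-coding them.

    BAIDU_USER and BAIDU_PWD are hypothetical names used by this sketch.
    """
    return os.environ.get("BAIDU_USER", ""), os.environ.get("BAIDU_PWD", "")
```

Which style reads better is a matter of taste; the direct calls make it a little easier to see which element receives which input.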

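Step 5 already simulated the click on the login button. When the script actually runs, however, the browser is then redirected to a phone verification page, and every login requires an SMS verification code. That is why there is an extra step 6 to handle the SMS verification.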

try:
    certify_phone_edittext = driver.find_element_by_xpath(certify_phone_edittext_xpath)
    certify_phone_bt = driver.find_element_by_xpath(certify_phone_bt_xpath)
    certify_phone_submit = driver.find_element_by_xpath(certify_phone_submit_xpath)

    driver.get_screenshot_as_file('./scraping_3.1.png')
    if certify_phone_edittext:
        certify_phone_bt.click()  # request the verification code
        # Prompt on the command line for the code received on the phone
        msg_certify = input("Enter the verification code sent to your phone: ")
        if msg_certify:
            certify_phone_edittext.click()
            certify_phone_edittext.send_keys(msg_certify)
#           certify_phone_edittext.send_keys(Keys.RETURN)
            certify_phone_submit.click()
            time.sleep(2)
            driver.get_screenshot_as_file('./scraping_4.png')
            # Submit the verification code
            baidu_login_textedit.click()
            driver.get_screenshot_as_file('./scraping_5.png')
        else:
            print("No verification code was entered.")
except Exception as e:
    # Also reached when the verification page never appears and the lookups above fail
    print("Exception->", e)
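
The code typed at the prompt goes to the page unchecked, so a stray space or a typo costs one verification attempt. Here is a small sketch of tidying the input first. The 4 to 8 digit range is purely my assumption, not a documented Baidu rule, and both function names are made up for this example.

```python
import re


def normalize_sms_code(raw, min_len=4, max_len=8):
    """Return the code with whitespace removed, or None if it looks invalid.

    The 4-8 digit range is an assumption made for this sketch; adjust it to
    whatever the SMS actually contains.
    """
    code = "".join(raw.split())
    if re.fullmatch(r"[0-9]+", code) and min_len <= len(code) <= max_len:
        return code
    return None


def prompt_for_code(ask=input, attempts=3):
    """Ask up to `attempts` times and return a valid code, or None."""
    for _ in range(attempts):
        code = normalize_sms_code(ask("Enter the SMS verification code: "))
        if code is not None:
            return code
        print("That does not look like a verification code, please retry.")
    return None
```

In the script above, `msg_certify = prompt_for_code()` could stand in for the bare `input()` call; the existing `if msg_certify:` check then also covers the case where no valid code was given.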

scraping_3.1.png:

That is essentially it. All that remains is to check whether the login succeeded.

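# Check for the username element on the page after login; if it exists, the login succeeded.
login_check_xpath = '//*[@id="s_username_top"]/span'
# find_elements_by_xpath returns an empty list when nothing matches, whereas
# find_element_by_xpath raises, so the else branch would never be reached.
login_check = driver.find_elements_by_xpath(login_check_xpath)
if login_check:
    print("Logged in successfully.")
    html = driver.page_source  # get the HTML of the page
    # soup = BeautifulSoup(html, 'lxml')  # parse the HTML
    with open("baidu_login_aft.html", "w", encoding="utf-8") as f:
        f.write(html)
else:
    print("Failed to log in to Baidu.")
# Finally, don't forget to shut down the driver; quit() also ends the PhantomJS process
driver.quit()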
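
The script imports `requests` but never uses it. One plausible use is to carry the logged-in session over to plain HTTP requests. The sketch below covers only the conversion step: `driver.get_cookies()` returns a list of dictionaries with `name` and `value` keys, and the helper (a name I made up) turns such a list into a `Cookie` header value. The sample cookies are invented and are not real Baidu cookie names.

```python
def cookies_to_header(cookies):
    """Build a Cookie header value from Selenium-style cookie dicts."""
    return "; ".join("%s=%s" % (c["name"], c["value"]) for c in cookies)


# Same shape as the list returned by driver.get_cookies(); the names and
# values here are made up for illustration.
sample_cookies = [
    {"name": "session_id", "value": "abc123", "domain": ".example.com"},
    {"name": "user_token", "value": "xyz789", "domain": ".example.com"},
]

headers = {"Cookie": cookies_to_header(sample_cookies)}
print(headers)
```

In the real script the cookies have to be read before the driver is shut down, and the resulting dictionary can then be passed as `headers=headers` to `requests.get()`.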

Screenshot of the command line while the script runs:

The Baidu home page retrieved after a successful login:

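Note: this script does not handle image CAPTCHAs at login, because none came up during development and debugging. I am still studying automatic recognition of image CAPTCHAs and will add it to this login script in a later update.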
