Web Scraping Study Notes (16): Selenium 2020.5.20

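Foreword

This section covers Selenium.
I have to say the lesson itself was not great:
a lot of it was never explained clearly.
So for now I am just jotting down notes
and plan to go over the material on my own later.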

A few useful references
Official documentation
Python-specific documentation
Detailed walkthrough

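1. Introduction

How a browser works

The three building blocks of a web page

  • HTML is responsible for content
  • CSS is responsible for styling
  • JavaScript is responsible for behavior

From a data point of view, the data displayed on a web page can come from:

  • the HTML file itself
  • AJAX endpoints
  • content loaded by JavaScript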
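
To make the distinction above concrete, here is a minimal sketch that uses only Python's standard library. The page markup, the `news` id, and the `VisibleTextParser` class are made up for illustration. It shows that text injected by JavaScript is absent from the raw HTML a plain HTTP client would download, which is the gap a real browser fills.

```python
from html.parser import HTMLParser

# Hypothetical page: the <div id="news"> is empty until the script runs in a browser.
STATIC_HTML = """
<html>
 <body>
  <h1>Welcome</h1>
  <div id="news"></div>
  <script>
    document.getElementById("news").textContent = "Loaded by JavaScript";
  </script>
 </body>
</html>
"""


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the source code inside <script> tags."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        text = data.strip()
        if text and not self.in_script:
            self.texts.append(text)


parser = VisibleTextParser()
parser.feed(STATIC_HTML)
# Only the static heading is found; the news text never appears
# because nothing executed the script.
print(parser.texts)
```

A parser like this never runs the script, so the `news` text can only be obtained by calling the underlying AJAX endpoint directly or by letting a browser render the page.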

See also
A detailed explanation of the underlying principles

Selenium

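Selenium was created in 2004 by Jason Huggins, then a test engineer at ThoughtWorks.
Its purpose is automated testing: it exercises web page interactions so that nobody has to repeat them by hand.
The same tool can load pages automatically, which lets a scraper collect data from them.

Supported browsers and platforms: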

  • Google Chrome
  • Internet Explorer 7, 8, 9, 10, and 11 on appropriate combinations of Vista, Windows 7, Windows 8, and Windows 8.1.
  • Firefox
  • Safari
  • Opera
  • PhantomJS (a headless browser, no longer maintained)
  • Android (with Selendroid or appium)
  • iOS (with ios-driver or appium)

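2. Installation and usage

# Install selenium (run this in a shell):
pip install selenium
# Import webdriver:
from selenium import webdriver
# Configure options (run Chrome without a visible window):
option = webdriver.ChromeOptions()
option.add_argument('headless')
# Create the driver (Selenium 3 style call; the first argument is the path to chromedriver):
driver = webdriver.Chrome('./chromedriver', options=option)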
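
For readers on Selenium 4, the same setup can be sketched as follows. Recent Selenium 4 releases no longer accept the driver path as a positional or `executable_path` argument and take a `Service` object instead. The function name `make_headless_chrome` and its default path are hypothetical, and the sketch assumes a chromedriver binary matching the installed Chrome exists at that path.

```python
def make_headless_chrome(driver_path="./chromedriver"):
    """Build a headless Chrome driver using Selenium 4 keyword arguments.

    `driver_path` is an assumed location for the chromedriver binary.
    """
    # Imported lazily so this module can be loaded where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # no visible browser window

    service = Service(executable_path=driver_path)
    return webdriver.Chrome(service=service, options=options)
```

Calling `driver = make_headless_chrome()` returns a driver that is used exactly like the one above; remember to call `driver.quit()` when finished so the browser process does not linger.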

3. Page interaction

# Find an element:
element = driver.find_element_by_id("passwd-id")
element = driver.find_element_by_name("passwd")
element = driver.find_element_by_xpath("//input[@id='passwd-id']")
# Type text:
element.send_keys("some text")
# Click:
element.click()
# Action chains (drag `element` onto `target`):
from selenium.webdriver import ActionChains
target = driver.find_element_by_name("target")  # hypothetical drop destination
action_chains = ActionChains(driver)
action_chains.drag_and_drop(element, target).perform()
# Switch between windows (here: to the last handle in the list):
window_handles = driver.window_handles
driver.switch_to.window(window_handles[-1])
# Save a screenshot of the page:
driver.save_screenshot('screen.png')

4. Locating elements

# Find a single element
find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
# Find multiple elements
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
# Locating by id
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
  </form>
 </body>
</html>
login_form = driver.find_element_by_id('loginForm')
# Locating by name
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
   <input name="continue" type="button" value="Clear" />
  </form>
 </body>
</html>
username = driver.find_element_by_name('username')
password = driver.find_element_by_name('password')
# Locating by link text
<html>
 <body>
  <p>Are you sure you want to do this?</p>
  <a href="continue.html">Continue</a>
  <a href="cancel.html">Cancel</a>
 </body>
</html>
continue_link = driver.find_element_by_link_text('Continue')
continue_link = driver.find_element_by_partial_link_text('Conti')
# Locating by tag name
<html>
 <body>
  <h1>Welcome</h1>
  <p>Site content goes here.</p>
 </body>
</html>
heading1 = driver.find_element_by_tag_name('h1')
# Locating by class name
<html>
 <body>
  <p class="content">Site content goes here.</p>
 </body>
</html>
content = driver.find_element_by_class_name('content')
# Locating by CSS selector
<html>
 <body>
  <p class="content">Site content goes here.</p>
 </body>
</html>
content = driver.find_element_by_css_selector('p.content')
# Two generic methods (the only locator methods left since Selenium 4.3 removed find_element_by_*)
from selenium.webdriver.common.by import By
driver.find_element(By.XPATH, '//button[text()="Some text"]')
driver.find_elements(By.XPATH, '//button')
# Locator strategies available as attributes of By
ID = "id"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
NAME = "name"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"
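
Because each By attribute is just a string, a locator can be stored as a plain (strategy, value) tuple and passed later as `driver.find_element(*locator)`. The sketch below is one possible way to translate the older `find_element_by_*` method names into such tuples. The helper `to_locator` and the `BY_STRINGS` table are hypothetical names; the strategy strings are copied from the list above.

```python
# Strategy strings copied from the By constants, keyed by the legacy method suffix.
BY_STRINGS = {
    "id": "id",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "name": "name",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}


def to_locator(legacy_method, value):
    """Turn ('find_element_by_id', 'loginForm') into ('id', 'loginForm')."""
    for prefix in ("find_elements_by_", "find_element_by_"):
        if legacy_method.startswith(prefix):
            suffix = legacy_method[len(prefix):]
            return (BY_STRINGS[suffix], value)
    raise ValueError(f"not a legacy locator method: {legacy_method}")


# A small page-object style table of locators for the login form shown earlier.
LOGIN_PAGE = {
    "form": to_locator("find_element_by_id", "loginForm"),
    "username": to_locator("find_element_by_name", "username"),
    "content": to_locator("find_elements_by_css_selector", "p.content"),
}
print(LOGIN_PAGE)
```

With a live driver, `driver.find_element(*LOGIN_PAGE["username"])` would then do the same job as `find_element_by_name('username')`.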
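# Locating by XPath (recommended); three ways to reach the username input
username = driver.find_element_by_xpath("//form/input[@name='username']")  # input named "username" directly inside a form
username = driver.find_element_by_xpath("//form[@id='loginForm']/input[1]")  # first input inside the form with id "loginForm"
username = driver.find_element_by_xpath("//input[@name='username']")  # first input named "username" anywhere on the page
# Locating by link text is also recommended
continue_link = driver.find_element_by_link_text('Continue')
continue_link = driver.find_element_by_partial_link_text('Conti')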
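
XPath expressions like the ones above can be tried out without a browser. The sketch below runs them against the login form markup using the standard library's `xml.etree.ElementTree`. This is only an approximation of what Selenium does: ElementTree supports a limited subset of XPath, needs well-formed markup, and requires paths on an element to start with `.//` rather than `//`.

```python
import xml.etree.ElementTree as ET

# The login form from the examples above, as well-formed markup.
LOGIN_PAGE = """
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
  </form>
 </body>
</html>
"""

root = ET.fromstring(LOGIN_PAGE.strip())

# Equivalent of //input[@name='username']
by_name = root.find(".//input[@name='username']")

# Equivalent of //form[@id='loginForm']/input[1]
first_input = root.find(".//form[@id='loginForm']/input[1]")

# Equivalent of find_elements: every input on the page
all_inputs = root.findall(".//input")

print(by_name.get("type"))
print(first_input.get("name"))
print([node.get("name") for node in all_inputs])
```

If an expression finds nothing here, it is worth rechecking before blaming page load timing in the real browser.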

5. Waits

# Explicit wait: poll for up to 10 seconds until the element is present
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()
# Built-in expected conditions
title_is
title_contains
presence_of_element_located
visibility_of_element_located
visibility_of
presence_of_all_elements_located
text_to_be_present_in_element
text_to_be_present_in_element_value
frame_to_be_available_and_switch_to_it
invisibility_of_element_located
element_to_be_clickable
staleness_of
element_to_be_selected
element_located_to_be_selected
element_selection_state_to_be
element_located_selection_state_to_be
alert_is_present
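
The idea behind an explicit wait can be sketched in plain Python: call a condition repeatedly until it returns something truthy or the timeout expires. This is a simplified illustration, not Selenium's implementation. The names `wait_until` and `fake_element_lookup` are hypothetical, and the real `WebDriverWait` additionally ignores `NoSuchElementException` while polling and raises `TimeoutException` rather than `TimeoutError`.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page element that only "appears" on the third poll.
attempts = {"count": 0}


def fake_element_lookup():
    attempts["count"] += 1
    return "myDynamicElement" if attempts["count"] >= 3 else None


found = wait_until(fake_element_lookup, timeout=2.0, poll_interval=0.05)
print(found, "found after", attempts["count"], "polls")
```

This also explains why a lambda that calls `find_elements` works as a condition: an empty list is falsy, so the wait keeps polling until at least one element shows up.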

6. Examples

Interacting with Baidu

import time

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

option = webdriver.ChromeOptions()
option.add_argument('headless')

# Replace with the chromedriver build that matches your operating system
driver = webdriver.Chrome(
    executable_path='/Users/seancheney/Documents/kkb_python/headless/chromedriver',  # absolute path
    options=option  # the options configured above
)

url = 'https://www.baidu.com'
# Open the site
driver.get(url)

# Print the title of the current page
print(driver.title)

# Type text into the search box
timeout = 5
search_content = WebDriverWait(driver, timeout).until(
    # Alternative: lambda d: d.find_element_by_xpath('//input[@id="kw"]')
    EC.presence_of_element_located((By.XPATH, '//input[@id="kw"]'))
)
search_content.send_keys('python')

# Give the page a moment to settle
time.sleep(3)

# Simulate a click on the "百度一下" (search) button
search_button = WebDriverWait(driver, timeout).until(
    EC.presence_of_element_located((By.XPATH, '//input[@id="su"]'))
)
search_button.click()

# Print the search results
search_results = WebDriverWait(driver, timeout).until(
    # Alternative XPath: '//h3[@class="t c-title-en"] | //h3[@class="t"]'
    EC.presence_of_all_elements_located((By.XPATH, '//h3[contains(@class,"t")]/a[1]'))
)
print(search_results)
for item in search_results:
    print(item.text)

# End the session and close the browser
driver.quit()

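Scraping Toutiao news

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

option = webdriver.ChromeOptions()

driver = webdriver.Chrome(
    executable_path='/Users/seancheney/Documents/kkb_python/headless/chromedriver',
    options=option
)

# Toutiao (今日頭條)
url = 'https://www.toutiao.com'

driver.get(url)
print(driver.page_source)

timeout = 5
# Wait until at least one headline link is present
coin_links = WebDriverWait(driver, timeout).until(
    EC.presence_of_all_elements_located((By.XPATH, '//div[@ga_event="article_title_click"]/a'))
)

# Print the text and target URL of each headline
for item in coin_links:
    print(item.text)
    print(item.get_attribute('href'))

# End the session and close the browser
driver.quit()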
Closing thoughts

Selenium really is a handy automation tool.
I plan to spend more time with the official documentation.
