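Preface

This section covers Selenium.
I have to say the lecture on this topic was not great;
a lot of it was never explained clearly.
I am jotting down notes for now
and plan to go through the material again on my own later.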

Some references worth reading:
Official documentation
 Python-specific documentation
 Detailed walkthrough

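1. Introduction

How a browser works

The three building blocks of a web page

HTML is responsible for content
CSS is responsible for styling
JavaScript is responsible for behavior

From a data point of view, what a page displays can come from:

the HTML file itself
AJAX endpoints
content loaded by JavaScript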
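
To make the difference between these sources concrete, here is a minimal sketch that uses only Python's standard library. The page markup and the `/api/news` endpoint are made up for illustration. A plain HTML parser sees only what was shipped in the file, so the list that a browser would fill in after running the script stays empty:

```python
from html.parser import HTMLParser

# Hypothetical page: the heading is in the HTML file, the list items
# would only appear after a browser runs the script.
STATIC_HTML = """
<html>
 <body>
  <h1>Headlines</h1>
  <ul id="news"></ul>
  <script>
   fetch('/api/news').then(r => r.json()).then(render);
  </script>
 </body>
</html>
"""


class PageScanner(HTMLParser):
    """Collect <h1> text and count <li> elements in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.list_items = 0
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
        elif tag == "li":
            self.list_items += 1

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.headings.append(data.strip())


scanner = PageScanner()
scanner.feed(STATIC_HTML)
print(scanner.headings)    # ['Headlines']
print(scanner.list_items)  # 0
```

Data of the second and third kind only shows up after scripts have run, which is the case where driving a real browser helps.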

See also
A detailed explanation of the principles

Selenium

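Created in 2004 by Jason Huggins, a test engineer at ThoughtWorks.
Its purpose is automated testing: it checks web page interactions and saves testers from repetitive manual work.
It can also load pages automatically so that a crawler can scrape the rendered data.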

Supported browsers and platforms:

Google Chrome
Internet Explorer 7, 8, 9, 10, and 11 on appropriate combinations of Vista, Windows 7, Windows 8, and Windows 8.1.
Firefox
Safari
Opera
PhantomJS (a headless browser, no longer maintained)
Android (with Selendroid or Appium)
iOS (with ios-driver or Appium)

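2. Installation and usage

# Install selenium (run in a shell):
pip install selenium
# Import webdriver:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# Set options ('--headless' runs Chrome without a visible window):
option = webdriver.ChromeOptions()
option.add_argument('--headless')
# Create the driver (Selenium 4 style; Selenium 3 took the path and chrome_options= directly):
driver = webdriver.Chrome(service=Service('./chromedriver'), options=option)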
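
A driver keeps a browser process alive until `quit()` is called, so it is worth making sure cleanup happens even when a script fails halfway. The following is a small sketch of one way to do that with a context manager. `managed_driver` and `FakeDriver` are hypothetical names introduced here, not part of Selenium; the fake only exists so the sketch runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver via factory() and always call quit() on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in so the sketch runs without a browser."""

    def __init__(self):
        self.visited = []
        self.closed = False

    def get(self, url):
        self.visited.append(url)

    def quit(self):
        self.closed = True


fake = FakeDriver()
with managed_driver(lambda: fake) as driver:
    driver.get("https://example.com")

print(fake.visited, fake.closed)
```

With a real browser, the factory would be something like `lambda: webdriver.Chrome(...)` with the same arguments used to create the driver above.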

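3. Page interaction

# Find elements (Selenium 4 style; the find_element_by_* helpers were removed in Selenium 4.3):
from selenium.webdriver.common.by import By
element = driver.find_element(By.ID, "passwd-id")
element = driver.find_element(By.NAME, "passwd")
element = driver.find_element(By.XPATH, "//input[@id='passwd-id']")
# Type text:
element.send_keys("some text")
# Click:
element.click()
# Action chains (target is another element, located the same way, to drop onto):
from selenium.webdriver import ActionChains
action_chains = ActionChains(driver)
action_chains.drag_and_drop(element, target).perform()
# Switch between windows/tabs:
window_handles = driver.window_handles
driver.switch_to.window(window_handles[-1])
# Save a screenshot of the page:
driver.save_screenshot('screen.png')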
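
Picking `window_handles[-1]` assumes the newest window is last in the list, which, as far as I know, the WebDriver specification does not guarantee. A more defensive sketch compares the handles before and after the action that opens the window. `find_new_window` is a hypothetical helper name, and the handle strings below are invented:

```python
def find_new_window(handles_before, handles_after):
    """Return the single handle present in handles_after but not before."""
    before = set(handles_before)
    new_handles = [h for h in handles_after if h not in before]
    if len(new_handles) != 1:
        raise LookupError(
            f"expected exactly one new window, found {len(new_handles)}"
        )
    return new_handles[0]


# Invented handle values, only to show the behaviour.
print(find_new_window(["A", "B"], ["C", "A", "B"]))  # C

# With a real driver the flow would look like:
#   before = driver.window_handles
#   link.click()  # opens a new tab
#   driver.switch_to.window(find_new_window(before, driver.window_handles))
```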

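4. Locating elements

# Find a single element (Selenium 3 helpers; removed in Selenium 4.3, where find_element(By.<STRATEGY>, value) replaces them)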
find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
# Find multiple elements (each call returns a list)
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
# Locate by id
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
  </form>
 </body>
</html>
from selenium.webdriver.common.by import By  # Selenium 4 style, replaces find_element_by_id and friends
login_form = driver.find_element(By.ID, 'loginForm')
# Locate by name
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
   <input name="continue" type="button" value="Clear" />
  </form>
 </body>
</html>
username = driver.find_element(By.NAME, 'username')
password = driver.find_element(By.NAME, 'password')
# Locate by link text
<html>
 <body>
  <p>Are you sure you want to do this?</p>
  <a href="continue.html">Continue</a>
  <a href="cancel.html">Cancel</a>
 </body>
</html>
continue_link = driver.find_element(By.LINK_TEXT, 'Continue')
continue_link = driver.find_element(By.PARTIAL_LINK_TEXT, 'Conti')
# Locate by tag name
<html>
 <body>
  <h1>Welcome</h1>
  <p>Site content goes here.</p>
 </body>
</html>
heading1 = driver.find_element(By.TAG_NAME, 'h1')
# Locate by class name
<html>
 <body>
  <p class="content">Site content goes here.</p>
 </body>
</html>
content = driver.find_element(By.CLASS_NAME, 'content')
# Locate by CSS selector
<html>
 <body>
  <p class="content">Site content goes here.</p>
 </body>
</html>
content = driver.find_element(By.CSS_SELECTOR, 'p.content')
# Two general-purpose methods (public, and the only locator API from Selenium 4.3 on)
from selenium.webdriver.common.by import By
driver.find_element(By.XPATH, '//button[text()="Some text"]')
driver.find_elements(By.XPATH, '//button')
# Locator strategies available on By
ID = "id"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
NAME = "name"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"
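
Since each `find_element_by_*` helper maps onto one of the strategy strings listed above, old code can be translated mechanically. Below is a rough sketch of that mapping. `to_locator` is a hypothetical helper written for this note, not a Selenium function, and it needs no browser to run:

```python
# Strategy strings copied from the By constants listed above.
STRATEGIES = {
    "id": "id",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "name": "name",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}


def to_locator(legacy_name, value):
    """Translate e.g. ('find_element_by_id', 'kw') into ('id', 'kw')."""
    for prefix in ("find_elements_by_", "find_element_by_"):
        if legacy_name.startswith(prefix):
            suffix = legacy_name[len(prefix):]
            return STRATEGIES[suffix], value
    raise ValueError(f"not a legacy locator helper: {legacy_name!r}")


locator = to_locator("find_element_by_css_selector", "p.content")
print(locator)  # ('css selector', 'p.content')

# With a real driver the tuple unpacks straight into the generic call:
#   driver.find_element(*locator)
```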
# XPath is the recommended way to locate elements
login_form = driver.find_element(By.XPATH, "//form[input/@name='username']")  # selects the <form> that contains the username input
username = driver.find_element(By.XPATH, "//form[@id='loginForm']/input[1]")
username = driver.find_element(By.XPATH, "//input[@name='username']")
# Link text is the recommended way to locate links
continue_link = driver.find_element(By.LINK_TEXT, 'Continue')
continue_link = driver.find_element(By.PARTIAL_LINK_TEXT, 'Conti')
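
XPath expressions like these can be tried out without a browser. The sketch below runs two of them against the login form using the standard library's `xml.etree.ElementTree`. Note the assumptions: ElementTree implements only a subset of XPath (so the `//form[input/@name='username']` form is left out), paths start with `.//` instead of `//`, and the markup must be well-formed XML. It illustrates what the paths select, not how Selenium evaluates them.

```python
import xml.etree.ElementTree as ET

LOGIN_PAGE = """
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
   <input name="continue" type="button" value="Clear" />
  </form>
 </body>
</html>
"""

root = ET.fromstring(LOGIN_PAGE)

# First <input> child of the form with id="loginForm".
by_position = root.find(".//form[@id='loginForm']/input[1]")
# Any <input> whose name attribute is "username".
by_attribute = root.find(".//input[@name='username']")
# A name shared by several elements needs the plural lookup.
all_continue = root.findall(".//input[@name='continue']")

print(by_position is by_attribute)             # True
print([e.get("value") for e in all_continue])  # ['Login', 'Clear']
```

The last lookup shows why the plural `find_elements` exists: two inputs share the name `continue`, and a singular lookup would return only the first.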

5. Waits

# Explicit wait: poll for up to 10 seconds until the element is present in the DOM
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()
# Built-in expected conditions
title_is
title_contains
presence_of_element_located
visibility_of_element_located
visibility_of
presence_of_all_elements_located
text_to_be_present_in_element
text_to_be_present_in_element_value
frame_to_be_available_and_switch_to_it
invisibility_of_element_located
element_to_be_clickable
staleness_of
element_to_be_selected
element_located_to_be_selected
element_selection_state_to_be
element_located_selection_state_to_be
alert_is_present
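
Conceptually, an explicit wait is a polling loop: call a condition repeatedly until it returns something truthy or the timeout expires. The sketch below shows that idea with the standard library only. It is a simplified illustration under my own assumptions, not Selenium's actual `WebDriverWait` code; `wait_until` and its parameters are hypothetical names.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5,
               ignored=(LookupError,)):
    """Call condition() until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except ignored:
            # Treat "not found yet" errors as "keep waiting".
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulate an element that only "appears" on the third poll.
attempts = iter([None, None, "ready"])
print(wait_until(lambda: next(attempts), timeout=1, poll_interval=0.01))
```

The conditions listed above play the role of `condition` here: each one is a callable that receives the driver and returns a truthy value once the page is in the desired state.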

6. Examples

Interacting with Baidu

import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

option = webdriver.ChromeOptions()
option.add_argument('--headless')

# Replace with the chromedriver that matches your own operating system
driver = webdriver.Chrome(
    service=Service('/Users/seancheney/Documents/kkb_python/headless/chromedriver'),  # absolute path
    options=option  # the options configured above
)

url = 'https://www.baidu.com'
# Open the site
driver.get(url)

# Print the current page title
print(driver.title)

# Type text into the search box
timeout = 5
search_content = WebDriverWait(driver, timeout).until(
    # equivalent: lambda d: d.find_element(By.XPATH, '//input[@id="kw"]')
    EC.presence_of_element_located((By.XPATH, '//input[@id="kw"]'))
)
search_content.send_keys('python')

# Give the page a moment to settle
time.sleep(3)

# Simulate clicking the "百度一下" (search) button
search_button = WebDriverWait(driver, timeout).until(
    lambda d: d.find_element(By.XPATH, '//input[@id="su"]'))
search_button.click()

# Print the search results
search_results = WebDriverWait(driver, timeout).until(
    # alternative XPath: '//h3[@class="t c-title-en"] | //h3[@class="t"]'
    lambda d: d.find_elements(By.XPATH, '//h3[contains(@class,"t")]/a[1]')
)
print(search_results)
for item in search_results:
    print(item.text)

# quit() closes every window and ends the driver process
driver.quit()

Scraping Toutiao headlines

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait

option = webdriver.ChromeOptions()

driver = webdriver.Chrome(
    service=Service('/Users/seancheney/Documents/kkb_python/headless/chromedriver'),
    options=option
)

# Toutiao (今日頭條)
url = 'https://www.toutiao.com'

driver.get(url)
print(driver.page_source)

# Wait until the headline links have been rendered by JavaScript
timeout = 5
news_links = WebDriverWait(driver, timeout).until(
    lambda d: d.find_elements(By.XPATH, '//div[@ga_event="article_title_click"]/a')
)

for item in news_links:
    print(item.text)
    print(item.get_attribute('href'))

driver.quit()

Conclusion

Selenium really is a handy automation tool.
I plan to spend more time with the official documentation later.

Web Scraping Study Notes (16): Selenium, 2020.5.20
