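Preface

This section covers Selenium.
I have to say the course did not teach this part well;
a lot of it was not explained clearly.
I am writing these notes down for now
and plan to go over the material again on my own later.

A few references:
Official documentation
 Documentation for the Python bindings
 Detailed walkthrough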

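1. Introduction

How a browser works

The three elements of a web page

HTML handles the content
CSS handles the styling
JavaScript handles the behavior

From a data perspective, the data shown on a page comes from:

the HTML file
AJAX interfaces
content loaded by JavaScript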
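
The difference between these sources matters for scraping: a plain HTTP fetch only sees what is in the HTML file, while the rest arrives later. The sketch below illustrates this with the standard library only. The page, the JSON body, and the helper names (`TitleParser`, `titles_from_html`, `titles_from_ajax`) are all made up for illustration.

```python
import json
from html.parser import HTMLParser

# Hypothetical page: one headline is in the HTML, the feed is filled in later.
STATIC_HTML = """
<html><body>
  <h3 class="title">Headline in the HTML file</h3>
  <div id="feed"></div>
  <script>/* would fill #feed after the page loads */</script>
</body></html>
"""

# Hypothetical JSON body returned by an AJAX interface.
AJAX_BODY = '{"items": [{"title": "Headline from the AJAX interface"}]}'


class TitleParser(HTMLParser):
    """Collect the text of every <h3> element."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False

    def handle_data(self, data):
        if self.in_h3 and data.strip():
            self.titles.append(data.strip())


def titles_from_html(html):
    """Headlines that are present in the raw HTML."""
    parser = TitleParser()
    parser.feed(html)
    return parser.titles


def titles_from_ajax(body):
    """Headlines delivered separately as JSON."""
    return [item["title"] for item in json.loads(body)["items"]]


print(titles_from_html(STATIC_HTML))
print(titles_from_ajax(AJAX_BODY))
```

Content that JavaScript builds inside the browser shows up in neither of these, which is the case where driving a real browser with Selenium helps.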

See also
Detailed explanation of how it works

Selenium

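Created in 2004 by Jason Huggins, a test engineer at ThoughtWorks.
Its purpose is automated testing: checking web page interactions and avoiding repetitive manual work.
The tool can also load pages automatically so that a crawler can scrape the data.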

Supported browsers and systems:

Google Chrome
Internet Explorer 7, 8, 9, 10, and 11 on appropriate combinations of Vista, Windows 7, Windows 8, and Windows 8.1.
Firefox
Safari
Opera
PhantomJS (a headless browser, no longer maintained)
Android (with Selendroid or appium)
iOS (with ios-driver or appium)

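2. Installation and usage

# Install selenium (shell command):
pip install selenium
# Import webdriver:
from selenium import webdriver
# Set options:
option = webdriver.ChromeOptions()
option.add_argument('--headless')
# Create the driver (Selenium 3 style; Selenium 4 passes the driver path
# through selenium.webdriver.chrome.service.Service instead)
driver = webdriver.Chrome('./chromedriver', options=option)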

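3. Page interaction

# Find elements (find_element_by_* is the Selenium 3 API; it was removed in Selenium 4.3):
element = driver.find_element_by_id("passwd-id")
element = driver.find_element_by_name("passwd")
element = driver.find_element_by_xpath("//input[@id='passwd-id']")
# Type text:
element.send_keys("some text")
# Click
element.click()
# Action chains ("source" and "target" are placeholder element names)
from selenium.webdriver import ActionChains
source = driver.find_element_by_name("source")
target = driver.find_element_by_name("target")
action_chains = ActionChains(driver)
action_chains.drag_and_drop(source, target).perform()
# Switch between windows
window_handles = driver.window_handles
driver.switch_to.window(window_handles[-1])
# Save a screenshot of the page
driver.save_screenshot('screen.png')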

4. Locating elements

# Find a single element
find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
# Find multiple elements
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
# Locate by id
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
  </form>
 </body>
</html>
login_form = driver.find_element_by_id('loginForm')
# Locate by name
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
   <input name="continue" type="button" value="Clear" />
  </form>
</body>
</html>
username = driver.find_element_by_name('username')
password = driver.find_element_by_name('password')
# Locate by link text
<html>
 <body>
  <p>Are you sure you want to do this?</p>
  <a href="continue.html">Continue</a>
  <a href="cancel.html">Cancel</a>
</body>
</html>
continue_link = driver.find_element_by_link_text('Continue')
continue_link = driver.find_element_by_partial_link_text('Conti')
# Locate by tag name
<html>
 <body>
  <h1>Welcome</h1>
  <p>Site content goes here.</p>
</body>
</html>
heading1 = driver.find_element_by_tag_name('h1')
# Locate by class name
<html>
 <body>
  <p class="content">Site content goes here.</p>
</body>
</html>
content = driver.find_element_by_class_name('content')
# Locate by CSS selector
<html>
 <body>
  <p class="content">Site content goes here.</p>
</body>
</html>
content = driver.find_element_by_css_selector('p.content')
# Two general-purpose methods (this form still works in Selenium 4)
from selenium.webdriver.common.by import By
driver.find_element(By.XPATH, '//button[text()="Some text"]')
driver.find_elements(By.XPATH, '//button')
# Locator strategies available as attributes of By
ID = "id"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
NAME = "name"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"
# XPath is the recommended way to locate elements
username = driver.find_element_by_xpath("//form/input[@name='username']")
username = driver.find_element_by_xpath("//form[@id='loginForm']/input[1]")
username = driver.find_element_by_xpath("//input[@name='username']")
# Link text is the recommended way to locate links
continue_link = driver.find_element_by_link_text('Continue')
continue_link = driver.find_element_by_partial_link_text('Conti')
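
The XPath expressions can be tried out without a browser. The sketch below runs two of them against the login form from the name example, using Python's standard-library `xml.etree.ElementTree`. Two assumptions to keep in mind: the sample page is written as well-formed XML, and ElementTree supports only a subset of XPath, so the paths start with `.//` rather than `//`.

```python
import xml.etree.ElementTree as ET

# The login form from the "locate by name" example, as well-formed XML.
LOGIN_PAGE = """
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
   <input name="continue" type="button" value="Clear" />
  </form>
 </body>
</html>
"""

root = ET.fromstring(LOGIN_PAGE.strip())

# Attribute predicate: the input whose name is "username".
by_name = root.find(".//input[@name='username']")

# Positional predicate: the first input inside the form with id "loginForm".
by_position = root.find(".//form[@id='loginForm']/input[1]")

# findall returns every match, like the find_elements_* variants.
both_continue = root.findall(".//input[@name='continue']")

print(by_name.attrib)
print(by_position.attrib)
print([button.get("value") for button in both_continue])
```

Both single-element lookups land on the same input, and the `continue` query returns two elements, which is why a unique attribute or a more specific path is needed to tell the Login and Clear buttons apart.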

5. Waits

# Explicit wait
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()
# Built-in expected conditions
title_is
title_contains
presence_of_element_located
visibility_of_element_located
visibility_of
presence_of_all_elements_located
text_to_be_present_in_element
text_to_be_present_in_element_value
frame_to_be_available_and_switch_to_it
invisibility_of_element_located
element_to_be_clickable
staleness_of
element_to_be_selected
element_located_to_be_selected
element_selection_state_to_be
element_located_selection_state_to_be
alert_is_present
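
To make the mechanism concrete, here is a minimal sketch of what an explicit wait does: call a condition repeatedly until it returns something truthy or the timeout expires. It is a simplified stand-in written for these notes, not Selenium's actual implementation. `wait_until`, `WaitTimeout`, and `FakeDriver` are hypothetical names, and no browser is involved.

```python
import time


class WaitTimeout(Exception):
    """Stand-in for Selenium's TimeoutException."""


def wait_until(driver, condition, timeout=10.0, poll_interval=0.5, ignored=()):
    """Call condition(driver) until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition(driver)
            if result:
                return result
        except ignored:
            pass  # treat "not there yet" errors as a failed attempt
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


class FakeDriver:
    """Hypothetical driver whose element only 'appears' on the third lookup."""

    def __init__(self):
        self.lookups = 0

    def find(self, element_id):
        self.lookups += 1
        if self.lookups < 3:
            raise LookupError(element_id)
        return f"<element id={element_id}>"


driver = FakeDriver()
element = wait_until(
    driver,
    lambda d: d.find("myDynamicElement"),
    timeout=2.0,
    poll_interval=0.01,
    ignored=(LookupError,),
)
print(element, "after", driver.lookups, "lookups")
```

WebDriverWait follows the same pattern: it passes the driver to the condition, polls every 0.5 seconds by default, and ignores NoSuchElementException between attempts, which is why a plain `lambda d: d.find_element(...)` can serve as a condition.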

6. Examples

Interacting with Baidu

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

option = webdriver.ChromeOptions()
option.add_argument('--headless')

# Replace with the chromedriver build that matches your operating system
driver = webdriver.Chrome(
    executable_path='/Users/seancheney/Documents/kkb_python/headless/chromedriver',  # absolute path
    options=option  # the options set above
)

url = 'https://www.baidu.com'
# Open the site
driver.get(url)

# Print the title of the current page
print(driver.title)

# Type text into the search box
timeout = 5
search_content = WebDriverWait(driver, timeout).until(
    # lambda d: d.find_element_by_xpath('//input[@id="kw"]')
    EC.presence_of_element_located((By.XPATH, '//input[@id="kw"]'))
)
search_content.send_keys('python')

# Give the page time to react
import time
time.sleep(3)

# Simulate a click on the "百度一下" (Baidu search) button
search_button = WebDriverWait(driver, timeout).until(
    lambda d: d.find_element_by_xpath('//input[@id="su"]'))
search_button.click()

# Print the search results
search_results = WebDriverWait(driver, timeout).until(
    # lambda d: d.find_elements_by_xpath('//h3[@class="t c-title-en"] | //h3[@class="t"]')
    lambda e: e.find_elements_by_xpath('//h3[contains(@class,"t")]/a[1]')
)
print(search_results)
for item in search_results:
    print(item.text)

# Quit the browser and end the driver session
driver.quit()
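
One detail in the result XPath above is worth a second look: `contains(@class,"t")` is a substring test, so it matches any class value containing the letter t, not only the class `t`. The sketch below uses plain Python strings (no browser) to contrast substring matching with whole-token matching. The class values are made up for illustration, and the XPath in the second docstring is the common token-matching idiom rather than something taken from the course.

```python
def substring_match(class_attr, token):
    """Mimics XPath contains(@class, token): a plain substring test."""
    return token in class_attr


def token_match(class_attr, token):
    """Mimics contains(concat(' ', normalize-space(@class), ' '), ' token ')."""
    return token in class_attr.split()


# Hypothetical class attribute values, for illustration only.
samples = ["t", "t c-title-en", "title", "result-footer"]

for value in samples:
    print(
        f"{value!r:18} "
        f"substring={substring_match(value, 't')} "
        f"token={token_match(value, 't')}"
    )
```

The substring test accepts all four values, while the token test accepts only the two whose class list actually contains `t`.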

Scraping Toutiao headlines

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait

option = webdriver.ChromeOptions()

driver = webdriver.Chrome(
    executable_path='/Users/seancheney/Documents/kkb_python/headless/chromedriver',
    options=option
)

# Toutiao (今日头条)
url = 'https://www.toutiao.com'

driver.get(url)
print(driver.page_source)

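timeout = 5
# Wait until the headline links are present, then collect them
news_links = WebDriverWait(driver, timeout).until(
    lambda d: d.find_elements_by_xpath('//div[@ga_event="article_title_click"]/a')
)

for item in news_links:
    print(item.text)
    print(item.get_attribute('href'))

# Quit the browser and end the driver session
driver.quit()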

Web Scraping Study Notes (16): Selenium, 2020.5.20