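Purpose of the PATH environment variable: it tells the system where to look for executable (.exe) files.
1. With both Python and Anaconda installed, how should the environment variables be configured to use Anaconda?
(1) Note the two relevant directories
C:\Anaconda3 ---- contains python.exe
C:\Anaconda3\Scripts ---- contains pip.exe
(2) Put these two directories at the very front of PATH, so that when the system looks for python and pip it finds the Anaconda copies first.
2. How can python2 and python3 coexist?
When we type the python command in cmd, the system searches the directories listed in PATH, in order, for an executable with the same name as the command and runs the first match.
With both python2 and python3 installed, it is enough to rename python.exe in each installation, for example to python2.exe for python2 and python3.exe for python3. Then, to start python3, simply type python3.
The same principle applies to pip.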
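The lookup order described above can be illustrated with a small sketch. It uses the standard-library shutil.which, which searches a list of directories in order, as a stand-in for what cmd does. The helper names make_fake_exe and first_match and the file name demo_tool.exe are made up for this illustration.

```python
import os
import shutil
import stat
import tempfile


def make_fake_exe(directory, name):
    """Create an empty file marked executable (hypothetical helper)."""
    path = os.path.join(directory, name)
    with open(path, "w"):
        pass
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


def first_match(command, directories):
    """Return the first executable called `command`, searching `directories` in order."""
    return shutil.which(command, path=os.pathsep.join(directories))


def demo():
    """Put the same file name in two directories and search them in both orders."""
    with tempfile.TemporaryDirectory() as dir_a, tempfile.TemporaryDirectory() as dir_b:
        exe_a = make_fake_exe(dir_a, "demo_tool.exe")
        exe_b = make_fake_exe(dir_b, "demo_tool.exe")
        # Whichever directory comes first in the search path wins.
        a_first = first_match("demo_tool.exe", [dir_a, dir_b]) == exe_a
        b_first = first_match("demo_tool.exe", [dir_b, dir_a]) == exe_b
        return a_first, b_first


if __name__ == "__main__":
    print(demo())
```

This is why moving the Anaconda directories to the front of PATH is enough to make them take priority: nothing else about the other installation has to change.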
I. selenium
(1) Driving the Chrome browser with selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
# 1. Create a browser driver
driver = webdriver.Chrome()
# 2. Request the URL
driver.get('http://www.baidu.com/')
# Print the page title
print(driver.title)
# Print the cookies
print(driver.get_cookies())
# Type a keyword into the search box
search_box = driver.find_element_by_id('kw')
search_box.send_keys(u'爬蟲')
# Take a screenshot
driver.save_screenshot('before_click.png')
submit_btn = driver.find_element_by_id('su')
submit_btn.click()
driver.save_screenshot('after_click.png')
# WebElement object: select all (Ctrl+A), then cut (Ctrl+X) to empty the box
search_box = driver.find_element_by_id('kw')
search_box.send_keys(Keys.CONTROL, 'a')
search_box.send_keys(Keys.CONTROL, 'x')
# Ways to look up a WebElement object
# search_box = driver.find_element_by_id('kw')
search_box = driver.find_element_by_css_selector('#kw')
# driver.find_element_by_xpath()
search_box.send_keys('scrapy')
submit_btn.submit()
# Print the element's coordinates
print(search_box.location)
# Print the element's size
print(search_box.size)
Switching to selenium + PhantomJS:
there is no visible browser window, which makes it run faster.
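(2) Summary of common selenium methods
1. Get the URL of the current page
Attribute: current_url
Example: driver.current_url
2. Get an element's coordinates
Attribute: location
Explanation: first find the element you want, then read its location attribute
Example: driver.find_element_by_xpath("xpath").location
3. Submit a form
Method: submit
Explanation: find the form element and call submit on it directly
Example: driver.find_element_by_id("form1").submit()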
4. Get the value of a CSS property
Method: value_of_css_property(property_name)
Example: driver.find_element_by_css_selector("input.btn").value_of_css_property("font-size")
5. Get the value of an element attribute
Method: get_attribute(name)
Example: driver.find_element_by_id("kw").get_attribute("value")
6. Check whether an element is selected
Method: is_selected()
Example: driver.find_element_by_id("form1").is_selected()
7. Get an element's size
Attribute: size
Example: driver.find_element_by_id("iptPassword").size
Return value: {'width': 250, 'height': 30}
8. Check whether an element is displayed
Method: is_displayed()
Example: driver.find_element_by_id("iptPassword").is_displayed()
9. Check whether an element is enabled
Method: is_enabled()
Example: driver.find_element_by_id("iptPassword").is_enabled()
10. Get an element's text
Attribute: text
Example: driver.find_element_by_id("iptUsername").text
11. Type a value into an element
Method: send_keys(*values)
Example: driver.find_element_by_id("iptUsername").send_keys('admin')
12. Get an element's tag name
Attribute: tag_name
Example: driver.find_element_by_id("iptUsername").tag_name
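13. Delete all of the browser's cookies
Method: delete_all_cookies()
Example: driver.delete_all_cookies()
14. Delete a specific cookie
Method: delete_cookie(name)
Example: driver.delete_cookie("my_cookie_name")
15. Close the browser window
Method: close()
Example: driver.close()
16. Close the browser and quit the driver program
Method: quit()
Example: driver.quit()
17. Go back to the previous page
Method: back()
Example: driver.back()
18. Clear an input box
Method: clear() (called on the element, not on the driver)
Example: driver.find_element_by_id("kw").clear()
19. Maximize the browser window
Method: maximize_window()
Example: driver.maximize_window()
20. Get the browser's name
Attribute: name
Example: driver.name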
21. Get the cookies of the current session
Method: get_cookies()
Example: driver.get_cookies()
22. Look up a cookie by its name
Method: get_cookie(name)
Example: driver.get_cookie("NET_SessionId")
23. Take a screenshot of the current page
Method: save_screenshot(filename), where the file name should end in .png
Example: driver.save_screenshot("D:\\Program Files\\Python27\\NM.png")
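As a small illustration of items 21 and 22: get_cookies() returns a list of dictionaries, each with at least 'name' and 'value' keys. The sketch below flattens such a list into a plain name-to-value mapping. The helper name cookies_to_dict and the sample cookie data are made up for the example, and no browser is needed to run it.

```python
def cookies_to_dict(cookies):
    """Map cookie name -> value for a list shaped like driver.get_cookies() output."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


# Hypothetical sample in the same shape that driver.get_cookies() returns.
sample_cookies = [
    {"name": "NET_SessionId", "value": "abc123", "path": "/", "secure": False},
    {"name": "lang", "value": "zh", "path": "/", "secure": False},
]

if __name__ == "__main__":
    print(cookies_to_dict(sample_cookies))
```

With a live session this could be called as cookies_to_dict(driver.get_cookies()), for instance to hand the cookies to another HTTP client.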
(3) Ways to find page elements in selenium
1. Find by id
driver.find_element_by_id('kw')
2. Find by CSS selector
search_box = driver.find_element_by_css_selector('#kw')
3. Find by XPath
driver.find_element_by_xpath('//input[@id="kw"]')
II. selenium + PhantomJS
(1) Workflow for requesting a page
from selenium import webdriver
import time
1. Create the driver object
driver = webdriver.PhantomJS()
2. Request the URL
driver.get(url)
3. Wait
time.sleep(5)
Three kinds of waiting
1. Forced wait
import time
time.sleep(10)
2. Implicit wait
driver.implicitly_wait(10)
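An implicit wait sets a maximum time (10 seconds here) for element lookups: when find_element does not find the element right away, the driver keeps polling the DOM until the element appears or the time runs out, and only then raises NoSuchElementException. It is set once and applies to every lookup for the lifetime of the driver. It is not a wait for the whole page (js, css, images) to finish loading, i.e. for the loading spinner to stop; with the default page load strategy, driver.get() itself already blocks until the page has loaded.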
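3. Explicit wait
Imports
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
Steps
- Create the wait object
wait = WebDriverWait(
    driver,  # browser driver object
    10,      # maximum wait time in seconds
    0.5,     # polling interval in seconds
)
- wait.until(condition): the program continues only once the condition holds
Conditions are provided by a dedicated selenium module, expected_conditions (imported as EC).
The two most commonly used conditions are:
- EC.presence_of_element_located(locator)
- EC.presence_of_all_elements_located(locator)
Both check whether elements are present in the DOM.
The first succeeds as soon as one matching element is present and yields that element.
The second also succeeds once at least one matching element is present, and yields a list of all elements that match at that moment.
Both take a locator, which is a tuple:
(lookup strategy (By.ID, By.XPATH, By.CSS_SELECTOR), the expression to look up)
e.g. (By.ID, 'kw')
- The return value of wait.until is whatever the condition yields: the WebElement (or list of WebElements) found by the locator.
If you need to do something with that WebElement afterwards, this makes it convenient. If the condition is still not met when the time runs out, a TimeoutException is raised.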
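To make the mechanism concrete, here is a minimal pure-Python sketch of the polling loop behind an explicit wait. It is not selenium's implementation; poll_until, its parameters, and the toy condition are names invented for this illustration. The idea is the same, though: call the condition repeatedly at a fixed interval, return its result as soon as it is truthy, and give up with an error once the time limit passes.

```python
import time


def poll_until(condition, timeout=10.0, poll_frequency=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_frequency)


def make_delayed_condition(ready_after_calls, value):
    """Toy condition (hypothetical): falsy for the first few calls, then `value`."""
    state = {"calls": 0}

    def condition():
        state["calls"] += 1
        return value if state["calls"] > ready_after_calls else None

    return condition


if __name__ == "__main__":
    element = poll_until(make_delayed_condition(3, "<element kw>"),
                         timeout=2.0, poll_frequency=0.05)
    print(element)
```

Compared with a forced wait, the loop returns as soon as the condition holds, so the full timeout is only spent when something is actually missing.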
4. Get the page content
html = driver.page_source
5. Parse the page content with the lxml module
from lxml import etree
tree = etree.HTML(html)
III. Projects
(1) Douban Books (procedural, forced wait)
import time
from selenium import webdriver
from lxml import etree

base_url = 'https://search.douban.com/book/subject_search?search_text=python&cat=1001&start=%s'
driver = webdriver.PhantomJS()


def get_text(text):
    # Return the first XPath match, or '' if there is none
    if text:
        return text[0]
    return ''


def get_books(text):
    html = etree.HTML(text)
    div_list = html.xpath('//div[@id="root"]/div/div/div/div/div')
    for div in div_list:
        book = {}
        # Book title
        book_name = get_text(div.xpath('.//div[@class="detail"]/div[@class="title"]/a/text()'))
        # Rating
        book_score = get_text(div.xpath('.//span[@class="rating_nums"]/text()'))
        # Number of ratings
        book_appraise = get_text(div.xpath('.//span[@class="pl"]/text()'))
        # Detail page link
        book_url = get_text(div.xpath('.//div[@class="title"]/a/@href'))
        # Author(s), publisher, publication date, price
        book_info = get_text(div.xpath('.//div[@class="meta abstract"]/text()')).split(' /')
        # Skip entries without a title/link or with incomplete metadata
        if all([book_name, book_url]) and len(book_info) >= 3:
            book['書名'] = book_name  # title
            book['評分'] = book_score  # rating
            book['評價數'] = book_appraise[1:-1]  # rating count, first and last characters dropped
            book['詳情頁路由'] = book_url  # detail page URL
            book['作者'] = '/'.join(book_info[:-3])  # author(s)
            book['出版社'] = book_info[-3]  # publisher
            book['價格'] = book_info[-1]  # price
            book['出版日期'] = book_info[-2]  # publication date
            print(book)


if __name__ == '__main__':
    for i in range(10):
        # 15 results per page
        driver.get(base_url % (i * 15))
        time.sleep(2)
        html_str = driver.page_source
        get_books(html_str)
    driver.quit()
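The string handling in get_books is easier to follow on a concrete value. The sketch below applies the same index-from-the-end logic to a made-up metadata string and a made-up rating-count string. The sample values and the helper name split_meta are assumptions for illustration, not real Douban data. As a slight variation on the script, it splits on '/' and strips whitespace from each field.

```python
def split_meta(meta):
    """Split 'authors / publisher / date / price' into named fields.

    The last three fields are publisher, date and price; everything before
    them is treated as the author list (a book may have several authors).
    """
    parts = [part.strip() for part in meta.split('/')]
    if len(parts) < 3:
        return None  # not enough fields to index from the end
    return {
        'authors': '/'.join(parts[:-3]),
        'publisher': parts[-3],
        'date': parts[-2],
        'price': parts[-1],
    }


# Hypothetical sample values shaped like the text the XPath expressions return.
sample_meta = 'Author A / Author B / Some Press / 2016-7 / 89.00'
sample_count = '(1234 ratings)'

if __name__ == '__main__':
    print(split_meta(sample_meta))
    # Slicing [1:-1] drops the first and last characters, here the parentheses.
    print(sample_count[1:-1])
```

Indexing from the end is what lets the author field absorb any number of names, since only the last three positions are fixed.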
Class-based version, explicit wait
import time
from selenium import webdriver
from lxml import etree
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from urllib import parse


class Douban(object):
    def __init__(self, url):
        self.url = url
        # `driver` is the module-level driver created in the __main__ block
        self.wait = WebDriverWait(driver, 10)
        self.parse()

    def get_text(self, text):
        # Return the first XPath match, or '' if there is none
        if text:
            return text[0]
        return ''

    def get_content_by_selenium(self, url, xpath):
        driver.get(url)
        # Wait
        # time.sleep(3)
        # until() takes a condition
        # the locator is a tuple
        webelement = self.wait.until(EC.presence_of_element_located((By.XPATH, xpath)))
        return driver.page_source

    def parse(self):
        html_str = self.get_content_by_selenium(self.url, '//div[@id="root"]/div/div/div/div')
        html = etree.HTML(html_str)
        div_list = html.xpath('//div[@id="root"]/div/div/div/div/div')
        for div in div_list:
            book = {}
            # Book title
            book_name = self.get_text(div.xpath(
                './/div[@class="detail"]/div[@class="title"]/a/text()'))
            # Rating
            book_score = self.get_text(
                div.xpath('.//span[@class="rating_nums"]/text()'))
            # Number of ratings
            book_appraise = self.get_text(div.xpath('.//span[@class="pl"]/text()'))
            # Detail page link
            book_url = self.get_text(div.xpath('.//div[@class="title"]/a/@href'))
            # Author(s), publisher, publication date, price
            book_info = self.get_text(
                div.xpath('.//div[@class="meta abstract"]/text()')).split(' /')
            # Skip entries without a title/link or with incomplete metadata
            if all([book_name, book_url]) and len(book_info) >= 3:
                book['書名'] = book_name  # title
                book['評分'] = book_score  # rating
                book['評價數'] = book_appraise[1:-1]  # rating count, first and last characters dropped
                book['詳情頁路由'] = book_url  # detail page URL
                book['作者'] = '/'.join(book_info[:-3])  # author(s)
                book['出版社'] = book_info[-3]  # publisher
                book['價格'] = book_info[-1]  # price
                book['出版日期'] = book_info[-2]  # publication date
                print(book)


if __name__ == '__main__':
    driver = webdriver.PhantomJS()
    base_url = 'https://search.douban.com/book/subject_search?'
    kw = 'python'
    for i in range(5):
        params = {
            'search_text': kw,
            'cat': '1001',
            'start': str(i * 15),
        }
        url = base_url + parse.urlencode(params)
        Douban(url)
    driver.quit()
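For reference, the query-string construction in the __main__ block can be run on its own with only the standard library. The sketch reuses the base URL and parameters from the script above; the function name build_page_urls and the constant BASE_URL are introduced here for illustration.

```python
from urllib import parse

BASE_URL = 'https://search.douban.com/book/subject_search?'


def build_page_urls(keyword, pages, page_size=15):
    """Return one search URL per result page, offsetting `start` by page_size."""
    urls = []
    for i in range(pages):
        params = {
            'search_text': keyword,
            'cat': '1001',
            'start': str(i * page_size),
        }
        # urlencode joins the pairs with '&' and percent-encodes the values.
        urls.append(BASE_URL + parse.urlencode(params))
    return urls


if __name__ == '__main__':
    for url in build_page_urls('python', 3):
        print(url)
```

Building the query with urlencode rather than string formatting means a non-ASCII keyword is percent-encoded automatically.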
(2) Tencent Careers
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from lxml import etree


def wait_get_content(url, xpath):
    # `driver` and `wait` are the module-level objects created in the __main__ block
    driver.get(url)
    wait.until(EC.presence_of_element_located((By.XPATH, xpath)))
    return driver.page_source


def get_text(value):
    # Return the first XPath match, or '' if there is none
    if value:
        return value[0]
    return ''


def get_info(url):
    html_str = wait_get_content(url, '//div[@class="recruit-wrap recruit-margin"]')
    html = etree.HTML(html_str)
    div_list = html.xpath('//div[@class="recruit-wrap recruit-margin"]/div')
    for div in div_list:
        item = {}
        title = get_text(div.xpath('.//a/h4/text()'))
        region = get_text(div.xpath('.//a/p/span[2]/text()'))
        # Named job_type so the built-in type() is not shadowed
        job_type = get_text(div.xpath('.//a/p/span[3]/text()'))
        date = get_text(div.xpath('.//a/p/span[4]/text()'))
        item['title'] = title
        item['region'] = region
        item['type'] = job_type
        item['date'] = date
        print(item)


if __name__ == '__main__':
    driver = webdriver.PhantomJS()
    wait = WebDriverWait(driver, 10)
    base_url = 'https://careers.tencent.com/search.html?index=%s'
    for i in range(1, 2):
        get_info(base_url % i)
    driver.quit()
Class-based version
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from lxml import etree


class Tencent(object):
    def __init__(self, url):
        self.url = url
        self.get_info()

    def wait_get_content(self, url, xpath):
        # `driver` and `wait` are the module-level objects created in the __main__ block
        driver.get(url)
        wait.until(EC.presence_of_element_located((By.XPATH, xpath)))
        return driver.page_source

    def get_text(self, value):
        # Return the first XPath match, or '' if there is none
        if value:
            return value[0]
        return ''

    def get_info(self):
        html_str = self.wait_get_content(self.url,
                                         '//div[@class="recruit-wrap recruit-margin"]')
        html = etree.HTML(html_str)
        div_list = html.xpath('//div[@class="recruit-wrap recruit-margin"]/div')
        for div in div_list:
            item = {}
            title = self.get_text(div.xpath('.//a/h4/text()'))
            region = self.get_text(div.xpath('.//a/p/span[2]/text()'))
            # Named job_type so the built-in type() is not shadowed
            job_type = self.get_text(div.xpath('.//a/p/span[3]/text()'))
            date = self.get_text(div.xpath('.//a/p/span[4]/text()'))
            item['title'] = title
            item['region'] = region
            item['type'] = job_type
            item['date'] = date
            print(item)


if __name__ == '__main__':
    driver = webdriver.PhantomJS()
    wait = WebDriverWait(driver, 10)
    base_url = 'https://careers.tencent.com/search.html?index=%s'
    for i in range(1, 6):
        Tencent(base_url % i)
    driver.quit()
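Both class-based scripts rely on the module-level names driver and wait. One possible variation, sketched here under assumptions, passes the driver in explicitly and always calls quit() when done. FakeDriver is a stand-in invented so the sketch runs without a browser, and PageFetcher, fetch and crawl are hypothetical names, not part of selenium.

```python
class FakeDriver:
    """Stand-in for a selenium driver (hypothetical), recording what was called."""

    def __init__(self):
        self.visited = []
        self.closed = False
        self.page_source = ''

    def get(self, url):
        self.visited.append(url)
        self.page_source = '<html><body>%s</body></html>' % url

    def quit(self):
        self.closed = True


class PageFetcher:
    """Uses a driver for the duration of a `with` block and quits it afterwards."""

    def __init__(self, driver):
        self.driver = driver

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.driver.quit()  # runs even if fetching raised an exception
        return False

    def fetch(self, url):
        self.driver.get(url)
        return self.driver.page_source


def crawl(driver, base_url, pages):
    """Fetch pages 1..pages and return their HTML sources."""
    with PageFetcher(driver) as fetcher:
        return [fetcher.fetch(base_url % i) for i in range(1, pages + 1)]


if __name__ == '__main__':
    fake = FakeDriver()
    sources = crawl(fake, 'https://careers.tencent.com/search.html?index=%s', 2)
    print(len(sources), fake.closed)
```

With a real browser, crawl would receive webdriver.PhantomJS() (or another driver) instead of FakeDriver(), and an explicit wait could be added inside fetch before reading page_source.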