16. Web Crawler Tutorial 2: The PhantomJS Headless Browser and Driving PhantomJS with the selenium Module

Original

天降攻城獅

2019-07-04 09:35

[Baidu Cloud Search, search for all kinds of resources: http://www.bdyss.cn]

[Netdisk Search, search for all kinds of resources: http://www.swpan.cn]

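The PhantomJS headless browser

PhantomJS is a JavaScript-scriptable headless browser built on the WebKit engine, that is, a browser with no visible window. Because it actually runs the page's JavaScript, it can retrieve content that a site loads through JS, including data the browser fetches asynchronously after the initial page load.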

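Download page: http://phantomjs.org/download... Download the build that matches your operating system.

After downloading, unzip the PhantomJS archive and move the extracted folder into your Python installation folder.

Then add the bin folder inside the PhantomJS folder to the system PATH environment variable.

Open cmd and run the command: phantomjs. If PhantomJS starts instead of being reported as an unknown command, the installation succeeded.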
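The same check can be done from Python before any selenium code runs. This is only a minimal sketch: it assumes the executable is named phantomjs (phantomjs.exe on Windows), and find_phantomjs is an illustrative helper name, not part of selenium or PhantomJS.

```python
import shutil


def find_phantomjs(name="phantomjs"):
    """Return the full path of the executable if it is on PATH, else None.

    `name` is an assumption: adjust it if your binary is named differently.
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = find_phantomjs()
    if path is None:
        print("phantomjs was not found on PATH; check the environment variable")
    else:
        print("phantomjs found at", path)
```

If the function returns None, the bin folder is most likely missing from PATH, or the terminal was opened before the variable was changed.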

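selenium is a Python module that can be used to drive PhantomJS.

Using the selenium module with PhantomJS

webdriver.PhantomJS(): instantiate a PhantomJS browser object
get('url'): open a web page
find_element_by_xpath('xpath expression'): find the matching element by an XPath expression
clear(): clear the contents of an input box
send_keys('text'): type text into an input box
click(): trigger a click
get_screenshot_as_file('screenshot path and file name'): take a screenshot of the page and save it to that path
page_source: get the HTML source of the page
quit(): close the PhantomJS browser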

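#!/usr/bin/env python
# -*- coding:utf8 -*-
# Note: webdriver.PhantomJS and find_element_by_xpath exist only in older
# selenium releases (both were removed during the selenium 4 series).
from selenium import webdriver  # selenium is used to drive PhantomJS
import time
import re

llqdx = webdriver.PhantomJS()  # instantiate a PhantomJS browser object
llqdx.get("https://www.baidu.com/")  # open the URL

# time.sleep(3)   # wait 3 seconds
# llqdx.get_screenshot_as_file('H:/py/17/img/123.png')  # save a screenshot of the page to this path

# Simulate user actions
llqdx.find_element_by_xpath('//*[@id="kw"]').clear()                    # find the input box by XPath, clear() empties it
llqdx.find_element_by_xpath('//*[@id="kw"]').send_keys('叫賣錄音網')     # find the input box by XPath, send_keys() types the text into it
llqdx.find_element_by_xpath('//*[@id="su"]').click()                    # find the search button by XPath, click() clicks it

time.sleep(3)   # wait 3 seconds for the results page to load
llqdx.get_screenshot_as_file('H:/py/17/img/123.png')  # save a screenshot to this path (selenium writes PNG data)

neir = llqdx.page_source   # get the page's HTML source
print(neir)
llqdx.quit()    # close the browser

pat = "<title>(.*?)</title>"
title = re.compile(pat).findall(neir)  # match the page title with a regular expression
print(title)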
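The fixed time.sleep(3) in the script is a guess: it wastes time when the page loads quickly and is too short when the page is slow. One alternative is to poll for a condition until it becomes true. The helper below is a minimal sketch of that idea in plain Python; wait_until and its parameters are illustrative names, not part of selenium (selenium ships its own WebDriverWait class in selenium.webdriver.support.ui for the same purpose).

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without the condition becoming true.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)


# Hypothetical usage with the browser object from the script above,
# assuming the results page puts the search term in its title:
# wait_until(lambda: '叫賣錄音網' in llqdx.title, timeout=10)
```

The condition is always checked at least once, so a page that is already ready returns immediately.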

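Disguising the PhantomJS browser and scrolling to load data

Some sites load their data dynamically and only fetch more content as the scroll bar moves down the page.

Implementation

DesiredCapabilities: the object used to disguise the browser (for example, to set the User-Agent)
execute_script(): execute JavaScript code in the page

current_url: get the URL of the current page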

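#!/usr/bin/env python
# -*- coding:utf8 -*-
# Note: webdriver.PhantomJS exists only in older selenium releases
# (it was removed in selenium 4).
from selenium import webdriver  # selenium is used to drive PhantomJS
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities   # capabilities used to disguise the browser
import time
import re

dcap = dict(DesiredCapabilities.PHANTOMJS)
# Replace PhantomJS's default User-Agent with that of an ordinary desktop browser
dcap['phantomjs.page.settings.userAgent'] = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0')
print(dcap)
llqdx = webdriver.PhantomJS(desired_capabilities=dcap)  # instantiate a PhantomJS browser object with these capabilities

llqdx.get("https://www.jd.com/")  # open the URL

# Simulate user actions
for j in range(20):
    # scrollTo(x, y): keep x at 0 and move y down by 1280 pixels on each pass
    js3 = 'window.scrollTo(0, ' + str((j + 1) * 1280) + ')'
    llqdx.execute_script(js3)  # run the JavaScript to move the scroll bar
    time.sleep(1)              # give the newly exposed content time to load

llqdx.get_screenshot_as_file('H:/py/17/img/123.png')  # save a screenshot to this path (selenium writes PNG data)

url = llqdx.current_url   # URL of the current page
print(url)

neir = llqdx.page_source   # get the page's HTML source
print(neir)
llqdx.quit()    # close the browser

pat = "<title>(.*?)</title>"
title = re.compile(pat).findall(neir)  # match the page title with a regular expression
print(title)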
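The loop above scrolls a fixed 20 times whether or not the page has more to load. A possible variation, sketched here under assumptions, is to keep scrolling until the document height stops growing. scroll_to_bottom is an illustrative name; it only needs an object with an execute_script() method, such as the llqdx browser object above.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down until the page height stops growing or max_rounds is hit.

    Returns the number of scrolls performed.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        rounds += 1
        time.sleep(pause)  # give the page time to request and render more items
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # the last scroll loaded nothing new
        last_height = new_height
    return rounds


# Hypothetical usage with the browser object from the script above:
# scroll_to_bottom(llqdx, pause=1.0)
```

Jumping straight to the bottom may skip sections that only load when they enter the viewport, so on some pages the fixed-step loop from the original script may still be the better fit.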

[Reposted from: http://www.lqkweb.com]
