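Preface
The text and images in this article were collected from the internet and are provided for study and discussion only, with no commercial purpose. Copyright belongs to the original author; if there is any problem, please contact us promptly so that we can deal with it.
Author: Star_Zhao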
This crawl uses the following libraries:
- selenium
- pymysql
- pyquery
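Main text
Analyzing the target site
- Open the home page of the shopping site (referred to here only as 某寶), type "男裝" (menswear) into the search box and click "Search". The browser jumps to the search results page for "男裝".
- Right-click a blank area of the page, choose "Inspect" to open the developer tools, and switch to the "Network" tab.
1) Locate the request for the results page. The parameters in its URL are exactly the ones listed under Query String Parameters, and the request method is GET.
2) Requesting that URL returns what the "Response" tab shows, so click the tab to check the content.
3) Scrolling down, the keyword "男裝" appears, but further down there is no product information for "男裝".
4) Copy the text of any product on the page, right-click a blank area, choose "View page source", and search the source for that text. The product's information is there.
5) Comparing the page source with the "Response" content shows that the product information has been replaced in the response, which indicates that the site uses JavaScript to encode this data.
6) Requesting the URL directly therefore returns only the encoded data. Selenium can instead drive a real browser, let the page's JavaScript run, and read the product information from the rendered page.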
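Step 1) observes that the search URL simply carries the Query String Parameters of a GET request. As a small illustration, the sketch below builds such a URL and decodes it again with the standard library. The host is the article's placeholder, and the `/search` path and the parameter names `q` and `page` are assumptions made for this example, not the site's confirmed interface.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://www.xxxxx.com/search"  # placeholder host; the path is hypothetical


def build_search_url(keyword, page=1):
    """Return a GET URL whose query string carries the search parameters."""
    params = {"q": keyword, "page": page}  # hypothetical parameter names
    return BASE_URL + "?" + urlencode(params)


def read_query_params(url):
    """Decode the query string of a URL back into a dict of single values."""
    query = urlsplit(url).query
    return {key: values[0] for key, values in parse_qs(query).items()}


url = build_search_url("男裝", page=2)
params = read_query_params(url)
```

Decoding a URL copied from the Network tab in the same way is a quick check on which parameters the page actually sends.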
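Fetching a single results page
- Request the site
# -*- coding: utf-8 -*-
from selenium import webdriver  # import the browser driver from selenium

browser = webdriver.Chrome()  # create the driver object, i.e. a Chrome browser


def get_one_page():
    '''Fetch a single page'''
    browser.get("https://www.xxxxx.com")  # request the site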
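- Type "男裝". Before typing, check that the input box exists: if it does, type "男裝"; if not, wait until it has loaded.
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By  # element locator strategies
from selenium.webdriver.support.ui import WebDriverWait  # explicit waits
from selenium.webdriver.support import expected_conditions as EC  # wait conditions

browser = webdriver.Chrome()


def get_one_page():
    '''Fetch a single page'''
    browser.get("https://www.xxxxx.com")
    # Named search_box rather than input to avoid shadowing the built-in input()
    search_box = WebDriverWait(browser, 10).until(  # explicit wait, up to 10 s
        EC.presence_of_element_located((By.CSS_SELECTOR, "#q")))  # returns the input box once it is in the DOM, otherwise keeps waiting
    search_box.send_keys("男裝")  # type the product name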
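The explicit wait used above follows a simple pattern: evaluate a condition repeatedly until it returns something truthy or a deadline passes. The sketch below models that pattern in plain Python so it can be tried without a browser. `wait_until`, `WaitTimeout` and `make_delayed_element` are hypothetical names for this illustration; it is a simplified model, not Selenium's actual implementation.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll=0.05):
    """Call condition() until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.2fs" % timeout)
        time.sleep(poll)


def make_delayed_element(ready_after_calls):
    """Simulate an element that only 'appears' after a few polls."""
    calls = {"n": 0}

    def condition():
        calls["n"] += 1
        return "<input id='q'>" if calls["n"] >= ready_after_calls else None

    return condition


element = wait_until(make_delayed_element(3), timeout=1.0, poll=0.01)
```

The simulated element is returned on the third poll. A condition that never succeeds raises `WaitTimeout` once the deadline passes, which mirrors how the Selenium wait ends in a `TimeoutException`.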
- The next step is to click the "Search" button. A button can only be used once it is clickable, so wait on that condition.
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()


def get_one_page():
    '''Fetch a single page'''
    browser.get("https://www.xxxxx.com")
    search_box = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "#q")))
    search_box.send_keys("男裝")
    button = WebDriverWait(browser, 10).until(  # explicit wait
        EC.element_to_be_clickable(
            (By.CSS_SELECTOR, "#J_TSearchForm > div.search-button > button")))  # returns the button once it is clickable, otherwise keeps waiting
    button.click()  # click the button
- Get the total number of pages, again with an explicit wait.
# -*- coding: utf-8 -*-
import re
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()


def get_one_page():
    '''Fetch a single page'''
    browser.get("https://www.xxxxx.com")
    search_box = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "#q")))
    search_box.send_keys("男裝")
    button = WebDriverWait(browser, 10).until(
        EC.element_to_be_clickable(
            (By.CSS_SELECTOR, "#J_TSearchForm > div.search-button > button")))
    button.click()
    pages = WebDriverWait(browser, 10).until(  # explicit wait
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > div.total")))  # returns the total-pages element once loaded, otherwise keeps waiting
    return pages.text


def main():
    pages = get_one_page()
    print(pages)


if __name__ == '__main__':
    main()
- The printed text is not yet the number we want, so extract it with a regular expression, and finally catch timeouts with try...except.
# -*- coding: utf-8 -*-
import re
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()


def get_one_page():
    '''Fetch a single page'''
    try:
        browser.get("https://www.xxxxx.com")
        search_box = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "#q")))
        search_box.send_keys("男裝")
        button = WebDriverWait(browser, 10).until(
            EC.element_to_be_clickable(
                (By.CSS_SELECTOR, "#J_TSearchForm > div.search-button > button")))
        button.click()
        pages = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > div.total")))
        return pages.text
    except TimeoutException:
        return get_one_page()  # on timeout, try again (note: retries without limit)


def main():
    pages = get_one_page()
    # Raw string so that \d is not treated as a string escape
    pages = int(re.search(r"\d+", pages).group())  # extract the total page count from the text
    print(pages)


if __name__ == '__main__':
    main()
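The pager element's text contains more than the number itself. As a sketch of the extraction step on its own, assuming the text looks something like "共 100 頁," (the sample strings are assumptions, not captured output), a helper can return the first run of digits and fail clearly when none is present. `parse_total_pages` is a hypothetical name.

```python
import re


def parse_total_pages(pager_text):
    """Return the first integer in the pager text, or raise ValueError."""
    match = re.search(r"\d+", pager_text)
    if match is None:
        raise ValueError("no page count found in %r" % pager_text)
    return int(match.group())


total = parse_total_pages("共 100 頁,")
```

Raising `ValueError` with the offending text makes a changed page layout easier to diagnose than the `IndexError` or `AttributeError` an unchecked lookup would produce.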
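Fetching multiple results pages
To switch pages, type the page number into the "到第 頁" (go to page) input box, again using explicit waits.
Note that a final check is needed: wait until the highlighted page number matches the requested page.
def get_next_page(page):
    try:
        page_box = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > div.form > input")))  # returns the input box once loaded, otherwise keeps waiting
        page_box.clear()  # remove the page number already in the box
        page_box.send_keys(str(page))  # type the page number
        button = WebDriverWait(browser, 10).until(
            EC.element_to_be_clickable(
                (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > div.form > span.btn.J_Submit")))  # returns the button once clickable, otherwise keeps waiting
        button.click()  # click the button
        WebDriverWait(browser, 10).until(
            EC.text_to_be_present_in_element(
                (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > ul > li.item.active > span"),
                str(page)))  # check that the highlighted page is the requested one
    except TimeoutException:  # on timeout, request again
        return get_next_page(page)


def main():
    pages = get_one_page()
    pages = int(re.search(r"\d+", pages).group())
    for page in range(2, pages + 1):  # page 1 is already loaded by get_one_page
        get_next_page(page)


if __name__ == '__main__':
    main()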
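Retrying by calling the function again from its own except block never gives up: if a page keeps timing out, the recursion only stops when Python's recursion limit is reached. One alternative is a loop with a fixed number of attempts. The sketch below uses stand-in names throughout: `retry`, `FakeTimeout` and `make_flaky_fetch` are hypothetical, and `FakeTimeout` stands in for Selenium's `TimeoutException` so that the example runs without a browser.

```python
class FakeTimeout(Exception):
    """Stand-in for selenium.common.exceptions.TimeoutException."""


def retry(func, attempts=3, exceptions=(FakeTimeout,)):
    """Call func() up to `attempts` times; re-raise the last failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except exceptions as error:
            last_error = error  # remember the failure and try again
    raise last_error


def make_flaky_fetch(failures):
    """Build a function that times out `failures` times, then succeeds."""
    state = {"calls": 0}

    def flaky_fetch():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise FakeTimeout("attempt %d timed out" % state["calls"])
        return "page loaded after %d calls" % state["calls"]

    return flaky_fetch


result = retry(make_flaky_fetch(failures=2), attempts=3)
```

In the crawler this would look like `retry(lambda: get_next_page(page), exceptions=(TimeoutException,))`, with the recursive call removed from the except block.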
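Extracting product information
First wait until the product list has loaded, then take the page source, load it into pyquery, and parse it.
Note that get_info only runs once it is called from inside "get_one_page" and "get_next_page".
from pyquery import PyQuery as pq  # needed for pq() below


def get_info():
    """Extract product details"""
    WebDriverWait(browser, 20).until(EC.presence_of_element_located((
        By.CSS_SELECTOR, "#mainsrp-itemlist .items .item")))  # wait until the product list has loaded
    text = browser.page_source  # rendered page source
    html = pq(text)  # parse the source with pyquery
    items = html('#mainsrp-itemlist .items .item').items()  # items() returns a generator of nodes
    for item in items:  # iterate over each product node
        image = item.find(".pic .img").attr("data-src")  # find() searches descendants; attr() reads an attribute value
        price = item.find(".price").text().strip().replace("\n", "")  # text() returns the text; strip() removes leading and trailing whitespace by default
        deal = item.find(".deal-cnt").text()[:-2]  # drop the last two characters of the deal text
        title = item.find(".title").text().strip()
        shop = item.find(".shop").text().strip()
        location = item.find(".location").text()
        data = [shop, location, title, price, deal, image]  # one row per product
        print(data)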
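The raw strings returned by text() usually need a little normalization before they are stored. The sketch below shows one way to do that with the standard library. The sample values are assumed shapes for the price and deal text, not strings captured from the site, and `clean_price` and `parse_deal_count` are hypothetical helpers.

```python
import re


def clean_price(raw):
    """Remove all whitespace, including embedded newlines, from a price string."""
    return re.sub(r"\s+", "", raw)


def parse_deal_count(raw):
    """Return the first number in the deal text as an int, or None if absent."""
    match = re.search(r"\d+", raw)
    return int(match.group()) if match else None


price = clean_price(" ¥\n128.00 ")
deal = parse_deal_count("1234人付款")
```

Matching the digits does not depend on how many characters follow the number, unlike slicing with `[:-2]`. Abbreviated counts (for example a value written with a decimal point and a unit character) would need extra handling that this sketch does not attempt.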
Saving to a MySQL database
import pymysql


def save_to_mysql(data):
    """Store one row in the database"""
    # Create the database connection
    db = pymysql.connect(host="localhost", user="root", password="password",
                         port=3306, database="spiders", charset="utf8")
    # Get a cursor
    cursor = db.cursor()
    # Create the table if it does not exist yet
    cursor.execute("CREATE TABLE IF NOT EXISTS {0}(shop VARCHAR(20),location VARCHAR(10),title VARCHAR(255),price VARCHAR(20),deal VARCHAR(20), image VARCHAR(255))".format("男裝"))
    # SQL statement; the row values are passed separately as parameters
    sql = "INSERT INTO {0} values(%s,%s,%s,%s,%s,%s)".format("男裝")
    try:
        # Pass in the statement and the row
        if cursor.execute(sql, data):
            # Commit the insert
            db.commit()
            print("******** saved **********")
    except pymysql.MySQLError:
        print("######### save failed #########")
        # Roll back, as if nothing had happened
        db.rollback()
    finally:
        # Close the connection
        db.close()
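Opening a connection, issuing CREATE TABLE and closing again for every row works, but it repeats a lot of work. The sketch below shows an alternative shape: connect once, create the table once, insert all rows in one transaction, and roll back if anything fails. To stay runnable without a MySQL server it uses the standard library's sqlite3 as a stand-in, so the placeholder is `?` rather than pymysql's `%s`. The table name `items` and the sample rows are made up for the example.

```python
import sqlite3

COLUMNS = ("shop", "location", "title", "price", "deal", "image")


def save_rows(conn, rows):
    """Insert all rows in one transaction; return True on success."""
    sql = "INSERT INTO items VALUES (?, ?, ?, ?, ?, ?)"
    try:
        conn.executemany(sql, rows)
        conn.commit()
        return True
    except sqlite3.Error:
        conn.rollback()  # undo any rows inserted before the failure
        return False


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
conn.execute("CREATE TABLE IF NOT EXISTS items (%s)" % ", ".join(
    "%s TEXT" % column for column in COLUMNS))

rows = [
    ("shop A", "杭州", "shirt", "¥128.00", "1234", "//img.example/a.jpg"),
    ("shop B", "廣州", "jacket", "¥299.00", "56", "//img.example/b.jpg"),
]
saved = save_rows(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

With pymysql the same structure would apply using `cursor.executemany(sql, rows)` and `%s` placeholders, with the connection created once in `main` and passed down.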
Complete code
# -*- coding: utf-8 -*-
import re
import pymysql
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from pyquery import PyQuery as pq

browser = webdriver.Chrome()


def get_one_page(name):
    '''Search for name, parse the first results page, and return the pager text'''
    print("---------- fetching page 1 ----------")
    try:
        browser.get("https://www.xxxxx.com")
        search_box = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "#q")))
        search_box.send_keys(name)
        button = WebDriverWait(browser, 10).until(
            EC.element_to_be_clickable(
                (By.CSS_SELECTOR, "#J_TSearchForm > div.search-button > button")))
        button.click()
        pages = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > div.total")))
        print("---- about to parse page 1 ----")
        get_info(name)
        print("---- page 1 parsed ----")
        return pages.text
    except TimeoutException:
        return get_one_page(name)  # on timeout, try again


def get_next_page(page, name):
    """Jump to the given page and parse it"""
    print("---------- fetching page {0} ----------".format(page))
    try:
        page_box = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > div.form > input")))
        page_box.clear()  # remove the page number already in the box
        page_box.send_keys(str(page))
        button = WebDriverWait(browser, 10).until(
            EC.element_to_be_clickable(
                (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > div.form > span.btn.J_Submit")))
        button.click()
        WebDriverWait(browser, 10).until(
            EC.text_to_be_present_in_element(
                (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > ul > li.item.active > span"),
                str(page)))  # the highlighted page must be the requested one
        print("----- about to parse page {0} -----".format(page))
        get_info(name)
        print("----- page {0} parsed -----".format(page))
    except TimeoutException:
        return get_next_page(page, name)  # on timeout, try again


def get_info(name):
    """Extract product details from the current page and store them"""
    WebDriverWait(browser, 20).until(EC.presence_of_element_located((
        By.CSS_SELECTOR, "#mainsrp-itemlist .items .item")))
    text = browser.page_source
    html = pq(text)
    items = html('#mainsrp-itemlist .items .item').items()
    for item in items:
        image = item.find(".pic .img").attr("data-src")
        price = item.find(".price").text().strip().replace("\n", "")
        deal = item.find(".deal-cnt").text()[:-2]
        title = item.find(".title").text().strip()
        shop = item.find(".shop").text().strip()
        location = item.find(".location").text()
        save_to_mysql([shop, location, title, price, deal, image], name)


def save_to_mysql(data, name):
    """Store one row in the table called name"""
    db = pymysql.connect(host="localhost", user="root", password="password",
                         port=3306, database="spiders", charset="utf8")
    cursor = db.cursor()
    cursor.execute("CREATE TABLE IF NOT EXISTS {0}(shop VARCHAR(20),location VARCHAR(10),title VARCHAR(255),price VARCHAR(20),deal VARCHAR(20), image VARCHAR(255))".format(name))
    sql = "INSERT INTO {0} values(%s,%s,%s,%s,%s,%s)".format(name)
    try:
        if cursor.execute(sql, data):
            db.commit()
            print("******** saved **********")
    except pymysql.MySQLError:
        print("######### save failed #########")
        db.rollback()
    finally:
        db.close()


def main(name):
    pages = get_one_page(name)
    pages = int(re.search(r"\d+", pages).group())
    for page in range(2, pages + 1):  # page 1 was already parsed by get_one_page
        get_next_page(page, name)
    browser.quit()  # close the browser when finished


if __name__ == '__main__':
    name = "男裝"
    main(name)
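Because the table name is inserted into the SQL text with format(), it cannot be passed as a query parameter the way the row values are. If `name` might ever come from user input, one option is to validate it before building the statement. A minimal sketch follows; `safe_table_name` is a hypothetical helper, and the allowed character set (ASCII letters, digits, underscore and CJK ideographs, at most 64 characters) is an assumption chosen for this example.

```python
import re

# ASCII letters, digits, underscore, and CJK unified ideographs (assumed whitelist).
_TABLE_NAME = re.compile(r"[A-Za-z0-9_\u4e00-\u9fff]{1,64}")


def safe_table_name(name):
    """Return name wrapped in backticks if it passes the whitelist."""
    if not _TABLE_NAME.fullmatch(name):
        raise ValueError("unsafe table name: %r" % name)
    return "`%s`" % name


sql = "INSERT INTO {0} VALUES (%s,%s,%s,%s,%s,%s)".format(safe_table_name("男裝"))
```

The `%s` markers are left untouched by format() and are still filled in by the database driver, so only the identifier goes through the whitelist.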