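PyQuery lets you run jQuery-style queries on XML documents. Its API is kept as close to jQuery's as possible, and it uses lxml for fast XML and HTML manipulation.
1. Introduction to PyQuery
(1) Initializing a PyQuery object: from a string, from a URL, or from a file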
# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_film.py
# @Project: Python Notes
# @CreateTime : 2020/5/12 8:56:30
# import requests
from pyquery import PyQuery as pq
html_doc = """<li class="subject-item"> <div class="pic"> <a
class="nbg" href="https://book.douban.com/subject/1084336/" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width
="90"> </a> </div> <div class="info"> <h2 class=""> <a
href="https://book.douban.com/subject/1084336/" title="小王子" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a
href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> </li>"""
# Initialize from a string
string_doc = pq(html_doc)
print(string_doc('div'))
# Initialize from a URL
# Roughly equivalent to: url_doc = pq(requests.get('http://news.baidu.com/').text)
url_doc = pq(url='http://news.baidu.com/')
print(url_doc('title'))
# Initialize from a file
txt_doc = pq(filename='test.html')
print(txt_doc('title'))
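(2) CSS selectors: https://www.w3school.com.cn/cssref/css_selectors.asp
In CSS, a selector is a pattern that picks out the elements to be styled. The example below selects every div that is a descendant of the element whose class is "subject-item" in html_doc.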
# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_film.py
# @Project: Python Notes
# @CreateTime : 2020/5/12 8:56:30
from pyquery import PyQuery as pq
import requests
html_doc = """<li class="subject-item"> <div class="pic"> <a
class="nbg" href="https://book.douban.com/subject/1084336/" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width
="90"> </a> </div> <div class="info"> <h2 class=""> <a
href="https://book.douban.com/subject/1084336/" title="小王子" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a
href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> </li>"""
# Initialize from a string
string_doc = pq(html_doc)
print(string_doc('.subject-item div'))
Output:
<div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div> <div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div>
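(3) Finding nodes
- Child nodes: call find() with a CSS selector to match descendants at any depth below the selected nodes; the example below finds the img node inside .nbg. To match direct children only, use children().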
print(string_doc('.nbg').find('img'))
print(type(string_doc('.nbg').find('img')))
# Output:
<img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/>
<class 'pyquery.pyquery.PyQuery'>
# ------------------------------------------
print(string_doc('.cart-actions').children())
# Output:
<span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span>
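- Parent nodes: parent() returns the direct parent and parents() returns all ancestors; both accept a CSS selector. Below, .buy-info selects the node whose class is buy-info, parent() returns its direct parent, and parents() returns every ancestor. To pick out a particular ancestor, pass a CSS selector to parents(); the last call keeps only the ancestor whose class is cart-actions.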
print(string_doc('.buy-info').parent())
# Output:
<div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div>
print(string_doc('.buy-info').parents())
# Output:
<html><body><li class="subject-item"> <div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div> <div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> </li></body></html><body><li class="subject-item"> <div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div> <div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> </li></body><li class="subject-item"> <div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div> <div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> </li><div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> <div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div>
print(string_doc('.buy-info').parents('.cart-actions'))
# Output:
<div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div>
- Sibling nodes: siblings() returns the sibling nodes; you can likewise pass a CSS selector to keep only specific siblings.
print(string_doc('.cart-actions').siblings())
# Output:
<div class="collect-info"> </div>
(4) Iteration: a selection that contains several nodes is traversed with items(). Below, string_doc('span').items() iterates over all span elements.
# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_film.py
# @Project: Python Notes
# @CreateTime : 2020/5/12 8:56:30
from pyquery import PyQuery as pq
import requests
html_doc = """<li class="subject-item"> <div class="pic"> <a
class="nbg" href="https://book.douban.com/subject/1084336/" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width
="90"> </a> </div> <div class="info"> <h2 class=""> <a
href="https://book.douban.com/subject/1084336/" title="小王子" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a
href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> </li>"""
# Initialize from a string
string_doc = pq(html_doc)
spans = string_doc('span').items()
# print(spans)
print(type(spans))
for span in spans:
    print(span)
Output:
<class 'generator'>
<span class="allstar45"/>
<span class="rating_nums">9.0</span>
<span class="pl"> (
561845人評價) </span>
<span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span>
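(5) Getting attributes: call attr() to read an attribute. When the selection contains several nodes, attr() returns only the first node's attribute, so use a for loop over items() to read the attribute from every node.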
# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_film.py
# @Project: Python Notes
# @CreateTime : 2020/5/12 8:56:30
from pyquery import PyQuery as pq
import requests
html_doc = """<li class="subject-item"> <div class="pic"> <a
class="nbg" href="https://book.douban.com/subject/1084336/" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width
="90"> </a> </div> <div class="info"> <h2 class=""> <a
href="https://book.douban.com/subject/1084336/" title="小王子" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a
href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> </li>"""
# Initialize from a string
string_doc = pq(html_doc)
a = string_doc('div')
# print(a, type(a))
for item in a.items():
    print(item.attr('class'))
Output:
pic
info
pub
star clearfix
ft
collect-info
cart-actions
(6) Getting text: call text(). It returns the text inside all selected nodes without any explicit iteration, as follows:
# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_film.py
# @Project: Python Notes
# @CreateTime : 2020/5/12 8:56:30
from pyquery import PyQuery as pq
import requests
html_doc = """<li class="subject-item"> <div class="pic"> <a
class="nbg" href="https://book.douban.com/subject/1084336/" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width
="90"> </a> </div> <div class="info"> <h2 class=""> <a
href="https://book.douban.com/subject/1084336/" title="小王子" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a
href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> </li>"""
# Initialize from a string
string_doc = pq(html_doc)
print(string_doc.text())
Output:
小王子
[法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元
9.0 ( 561845人評價)
小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...
紙質版47.30元起
You can of course use a CSS selector first and print only the text of the selected nodes:
a = string_doc('span')
print(a.text())
Output: 9.0 ( 561845人評價) 紙質版47.30元起
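(7) Node manipulation: addClass() adds a class to a node and removeClass() removes one; the code below calls them through their snake_case aliases add_class() and remove_class().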
# Initialize from a string
string_doc = pq(html_doc)
span = string_doc('.buy-info')
print(span)
span_mv = span.remove_class('buy-info')
print(span_mv)
span_add = span_mv.add_class('buy-info')
print(span_add)
# Output:
<span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span>
<span class=""> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span>
<span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span>
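attr(), text(), and html() can also set an attribute value, the text content, and the inner HTML:
# Initialize from a string
string_doc = pq(html_doc)
span = string_doc('.buy-info')
# Change the class attribute from buy-info to price
print(span.attr('class', 'price'))
# Replace the text inside the span with "價格:47.30" (Price: 47.30)
print(span.text('價格:47.30'))
# Replace the inner HTML of the span with "<a>價格:47.30</a>"
print(span.html('<a>價格:47.30</a>'))
# Output: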
<span class="price"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span>
<span class="price">價格:47.30</span>
<span class="price"><a>價格:47.30</a></span>
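For other methods and how to use them, see https://pyquery.readthedocs.io/en/latest/api.html
(8) Pseudo-class selectors: for CSS3 pseudo-class selectors see https://www.w3school.com.cn/css/css_pseudo_classes.asp; the list at https://www.runoob.com/css/css-pseudo-classes.html is somewhat more detailed.
- Pseudo-class syntax: selector:pseudo-class {property: value}
- Combining a CSS class with a pseudo-class: selector.class:pseudo-class {property: value}
- first-child: matches an element only when it is the first child of its parent.
- last-child: matches an element that is the last child of its parent.
- only-child: matches an element that is the only child of its parent.
- nth-child(n): matches an element that is the n-th child of its parent.
- nth-last-child(n): matches an element that is the n-th child of its parent, counting from the last child.
# Initialize from a string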
string_doc = pq(html_doc)
div1 = string_doc('div:first-child')
print(div1)
div2 = string_doc('div:last-child')
print(div2)
div3 = string_doc('div:only-child')
print(div3)
div4 = string_doc('div:nth-child(3)')
print(div4)
div5 = string_doc('div:nth-last-child(4)')
print(div5)
# Output (div:only-child matches nothing in this document, so its print is empty):
<div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div> <div class="collect-info"> </div>
<div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> <div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div>
<div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div>
<div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div>
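2. Basic PyQuery usage
Create the tb_movie_comments table to store the scraped comments:
CREATE TABLE `tb_movie_comments` (
  `cid` int(11) NOT NULL AUTO_INCREMENT COMMENT 'ID',
  `commentator` varchar(100) DEFAULT NULL COMMENT 'Commenter',
  `comments` varchar(2000) DEFAULT NULL COMMENT 'Comment text',
  `votes` varchar(20) DEFAULT NULL COMMENT 'Number of votes',
  `createdate` datetime DEFAULT CURRENT_TIMESTAMP COMMENT 'Creation time',
  `ctype` char(2) DEFAULT NULL COMMENT 'Comment type: 1. positive, 2. neutral, 3. negative',
  PRIMARY KEY (`cid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Next, scrape the short comments on the film 《少年的你》 (Better Days). The plan was to store everything in the database, run some big-data analysis to dig out something valuable, and ideally produce something flashy like a word cloud. In reality the anti-scraping measures brought me back to earth: it was game over after ten pages. Still, as my first project with PyQuery, here it is as a souvenir: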
# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_comment.py
# @Project: Python Notes
# @CreateTime : 2020/5/15 14:52:37
from pyquery import PyQuery as pq
import requests
import pymysql
import random
import time


def login(url):
    """Fetch url with a logged-in session cookie and a browser User-Agent."""
    user_agents = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/81.0.4044.122 Safari/537.36']
    # For the header values, open the URL in a browser, then F12 → Network
    headers = {
        'Cookie': 'gr_user_id=36d2fea0-91b3-4445-b0c4-2f1eec5e681e; bid=3VpjSZO1pLI; douban-fav-remind=1; '
                  '__yadk_uid=AwRZnSg2z94qiZ0ziZx8rRTJx0GARPvJ; '
                  'trc_cookie_storage=taboola%2520global%253Auser-id%3D54ee53eb-ce52-4f1e-b503-f2b4ba820774'
                  '-tuct2359b57; __gads=ID=953ce3860eb89d60:T=1571272451:S=ALNI_MYayAKeBBq7vr_NBvFfsaRTVepXaw; '
                  '_vwo_uuid_v2=D2CFD349D628C78D38815D8765A3EB401|d8942a02c6249450bd209b499e64d81c; ll="118297"; '
                  'douban-profile-remind=1; _ga=GA1.2.2128425525.1488504434; push_doumail_num=0; push_noty_num=0; '
                  '__utmv=30149280.19762; ct=y; __utmc=30149280; '
                  '_pk_ref.100001.8cb4=%5B%22%22%2C%22%22%2C1589252865%2C%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl'
                  '%3DjxWgT7kJtprsF-uyr7ziX2Rid2J_n9ZVC9_Qu-JHCj9InQNIG3Ew5bcMZK8paZow%26wd%3D%26eqid'
                  '%3Dae210c5000009ced000000065eba12fc%22%5D; '
                  '_pk_id.100001.8cb4=a9140f060c7b64ae.1488504433.95.1589252865.1589247679.; '
                  'viewed="25811418_25904568_4849666_27069880_27608412_2086633_11535042_33413575_34430051_1469051"; '
                  'dbcl2="77249558:xmnxDXaS+r8"; ck=h_ZU; '
                  '__utma=30149280.2128425525.1488504434.1589768146.1589771266.143; '
                  '__utmz=30149280.1589771266.143.73.utmcsr=accounts.douban.com|utmccn=('
                  'referral)|utmcmd=referral|utmcct=/passport/login',
        'User-Agent': random.choice(user_agents),
        'Referer': 'https://accounts.douban.com/passport/login',
        'Connection': 'keep-alive'
    }
    req = requests.get(url, headers=headers)
    return req
# Scrape one page of comments; ctype is the comment type and page is the zero-based page number
def comment(ctype, page):
    num = page * 20
    url = ('https://movie.douban.com/subject/30166972/comments?start=' + str(num)
           + '&limit=20&sort=new_score&status=P&percent_type=' + ctype)
    html = login(url)
    html_doc = pq(html.text)
    data_all = html_doc('.comment-item').items()
    # Store the data in the database; one connection per page instead of one per comment.
    # utf8mb4 matches the table charset so that emoji in comments can be stored.
    db = pymysql.connect(host='192.183.3.***', port=3306, user='nn',
                         password='******', database='nntest', charset='utf8mb4')
    cur = db.cursor()
    sql = ('INSERT INTO tb_movie_comments(commentator, comments, votes, createdate, ctype) '
           'VALUES(%s, %s, %s, %s, %s)')
    try:
        for data in data_all:
            commentator = data('.comment-info a').text()
            comments = data('.short').text()
            votes = data('.votes').text()
            createdate = data('.comment-time').text()
            # print(commentator)
            try:
                cur.execute(sql, (commentator, comments, votes, createdate, ctype))
                db.commit()
                print('Insert Successful!')
            except pymysql.MySQLError as exc:
                print('Sorry, Failed!', exc)
                db.rollback()
    finally:
        cur.close()
        db.close()
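# To scrape in bulk and store everything in the database, use the following:
ctypes = ['h', 'm', 'l']
for ctype in ctypes:
    # Because of the anti-scraping measures, 10 pages is enough; page starts at 0
    for page in range(0, 10, 1):
        try:
            comment(ctype, page)
            print(ctype + ' page ' + str(page) + ': scraped and stored in the database')
        except Exception as exc:
            print(ctype + ' page ' + str(page) + ': scraping or storing failed:', exc)
        time.sleep(10)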
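Whoever holds the data rules the world. One last thing, important enough to say three times: scrape with a throwaway account! A throwaway account! A throwaway account!!! If you don't court trouble, you won't find it, and a thief cannot work in broad daylight. The price this little data thief paid, one that no ROLLBACK can undo, fell on my real account (QAQ).
Simulated login, proxies, and the like are next on the list.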