Python Notes 11: Scraping Douban Movie Reviews with PyQuery

Contents

1. Introduction to PyQuery

2. A Simple PyQuery Example


PyQuery lets you run jQuery-style queries on XML documents. Its API mirrors jQuery as closely as possible, and it uses lxml for fast XML and HTML manipulation.

1. Introduction to PyQuery

(1) Initializing a PyQuery object. There are three ways: from a string, from a URL, or from a file.

# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_film.py
# @Project: Python Notes
# @CreateTime : 2020/5/12 8:56:30

# import requests
from pyquery import PyQuery as pq

html_doc = """<li class="subject-item"> <div class="pic"> <a
class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width
="90"> </a> </div> <div class="info"> <h2 class=""> <a
href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a
href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> </li>"""
# Initialize from a string
string_doc = pq(html_doc)
print(string_doc('div'))
# Initialize from a URL
# Roughly equivalent to: url_doc = pq(requests.get('http://news.baidu.com/').text)
url_doc = pq(url='http://news.baidu.com/')
print(url_doc('title'))
# Initialize from a file
txt_doc = pq(filename='test.html')
print(txt_doc('title'))

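(2) CSS selectors: https://www.w3school.com.cn/cssref/css_selectors.asp

In CSS, a selector is a pattern that picks out the elements to be styled. The example below selects every div node that is a descendant of the node whose class is "subject-item" in html_doc.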

# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_film.py
# @Project: Python Notes
# @CreateTime : 2020/5/12 8:56:30

from pyquery import PyQuery as pq
import requests
html_doc = """<li class="subject-item"> <div class="pic"> <a
class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width
="90"> </a> </div> <div class="info"> <h2 class=""> <a
href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a
href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> </li>"""
# Initialize from a string
string_doc = pq(html_doc)
print(string_doc('.subject-item div'))

Run result:

<div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div> <div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> 

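(3) Finding nodes

  • Child nodes: call find() with a CSS selector to select all matching descendants (here, the img nodes inside .nbg); use children() to restrict the search to direct children.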

print(string_doc('.nbg').find('img'))
print(type(string_doc('.nbg').find('img')))
# Output:
<img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> 
<class 'pyquery.pyquery.PyQuery'>
# ------------------------------------------
print(string_doc('.cart-actions').children())
# Output:
<span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> 
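  • Parent nodes: parent() returns the direct parent, and parents() returns all ancestors. Below, .buy-info selects the node whose class is buy-info; parent() then gives its direct parent, and parents() gives every ancestor. To pick out one particular ancestor, pass a CSS selector to parents(), as in the last call, which keeps only the ancestor whose class is cart-actions.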

print(string_doc('.buy-info').parent())
# Output:
<div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> 

print(string_doc('.buy-info').parents())
# Output:
<html><body><li class="subject-item"> <div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div> <div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> </li></body></html><body><li class="subject-item"> <div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div> <div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> </li></body><li class="subject-item"> <div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div> <div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> </li><div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> <div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> 

print(string_doc('.buy-info').parents('.cart-actions'))
# Output:
<div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> 
  • Sibling nodes: siblings() returns the sibling nodes; likewise, you can pass a CSS selector to keep only specific siblings.

print(string_doc('.cart-actions').siblings())
# Output:
<div class="collect-info"> </div> 

(4) Traversal: when a query matches several nodes, call items() to iterate over them. Below, string_doc('span').items() iterates over all span elements.

# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_film.py
# @Project: Python Notes
# @CreateTime : 2020/5/12 8:56:30

from pyquery import PyQuery as pq
import requests

html_doc = """<li class="subject-item"> <div class="pic"> <a
class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width
="90"> </a> </div> <div class="info"> <h2 class=""> <a
href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a
href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> </li>"""
# Initialize from a string
string_doc = pq(html_doc)
spans = string_doc('span').items()
# print(spans)
print(type(spans))
for span in spans:
    print(span)

Output:

<class 'generator'>
<span class="allstar45"/> 
<span class="rating_nums">9.0</span> 
<span class="pl"> (
561845人評價) </span> 
<span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> 

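(5) Getting attributes: call attr() to read an attribute. When the result contains several nodes, attr() returns only the first node's attribute, so iterate with a for loop to read it from every node.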

# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_film.py
# @Project: Python Notes
# @CreateTime : 2020/5/12 8:56:30

from pyquery import PyQuery as pq
import requests

html_doc = """<li class="subject-item"> <div class="pic"> <a
class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width
="90"> </a> </div> <div class="info"> <h2 class=""> <a
href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a
href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> </li>"""
# Initialize from a string
string_doc = pq(html_doc)
a = string_doc('div')
# print(a, type(a))
for item in a.items():
    print(item.attr('class'))

Run result:

pic
info
pub
star clearfix
ft
collect-info
cart-actions

(6) Getting text: call text(). No iteration is needed; it returns the text inside all matched nodes, as follows:

# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_film.py
# @Project: Python Notes
# @CreateTime : 2020/5/12 8:56:30

from pyquery import PyQuery as pq
import requests

html_doc = """<li class="subject-item"> <div class="pic"> <a
class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width
="90"> </a> </div> <div class="info"> <h2 class=""> <a
href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a
href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> </li>"""
# Initialize from a string
string_doc = pq(html_doc)
print(string_doc.text())

Run result:

小王子
[法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元
9.0 ( 561845人評價)
小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...
紙質版47.30元起

You can also use a CSS selector to output the text of specific nodes only:

    a = string_doc('span')
    print(a.text())

Output: 9.0 ( 561845人評價) 紙質版47.30元起

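(7) Node manipulation: add_class() adds a class to a node, and remove_class() removes one. PyQuery also accepts the jQuery-style names addClass() and removeClass(). Both methods modify the node in place and return it.

# Initialize from a string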
string_doc = pq(html_doc)
span = string_doc('.buy-info')
print(span)
span_mv = span.remove_class('buy-info')
print(span_mv)
span_add = span_mv.add_class('buy-info')
print(span_add)
# Output:
<span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> 
<span class=""> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> 
<span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> 

attr(), text(), and html() can also set an attribute value, the text content, and the inner HTML:

# Initialize from a string
string_doc = pq(html_doc)
span = string_doc('.buy-info')
# Change the class attribute from buy-info to price
print(span.attr('class', 'price'))
# Change the text inside the span to "Price: 47.30"
print(span.text('Price: 47.30'))
# Change the inner HTML of the span to "<a>Price: 47.30</a>"
print(span.html('<a>Price: 47.30</a>'))
# Output:
<span class="price"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> 
<span class="price">Price: 47.30</span> 
<span class="price"><a>Price: 47.30</a></span> 

For other methods and their usage, see: https://pyquery.readthedocs.io/en/latest/api.html

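(8) Pseudo-class selectors: for CSS3 pseudo-class selectors, see https://www.w3school.com.cn/css/css_pseudo_classes.asp; the list at https://www.runoob.com/css/css-pseudo-classes.html is somewhat more detailed.

  • Pseudo-class syntax: selector:pseudo-class {property: value}
  • Combining a CSS class with a pseudo-class: selector.class:pseudo-class {property: value}
  • first-child: matches an element only if it is the first child of its parent.
  • last-child: matches an element only if it is the last child of its parent.
  • only-child: matches an element only if it is the sole child of its parent.
  • nth-child(n): matches an element that is the n-th child of its parent.
  • nth-last-child(n): matches an element that is the n-th child of its parent, counting from the last child.

# Initialize from a string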
string_doc = pq(html_doc)
div1 = string_doc('div:first-child')
print(div1)
div2 = string_doc('div:last-child')
print(div2)
div3 = string_doc('div:only-child')
print(div3)
div4 = string_doc('div:nth-child(3)')
print(div4)
div5 = string_doc('div:nth-last-child(4)')
print(div5)
# Output:
<div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div> <div class="collect-info"> </div> 
<div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" title="小王子" οnclick="moreurl(this,{i:'0',query:'',subject_id:'1084336',&#10;from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童,他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> <div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> 

<div class="star clearfix"> <span class="allstar45"/> <span class="rating_nums">9.0</span> <span class="pl"> (
561845人評價) </span> </div> 
<div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> 

2. A Simple PyQuery Example

 

Create a table named tb_movie_comments to store the scraped comments:

CREATE TABLE `tb_movie_comments` (
  `cid` int(11) NOT NULL AUTO_INCREMENT COMMENT 'ID',
  `commentator` varchar(100) DEFAULT NULL COMMENT 'commenter',
  `comments` varchar(2000) DEFAULT NULL COMMENT 'comment text',
  `votes` varchar(20) DEFAULT NULL COMMENT 'number of upvotes',
  `createdate` datetime DEFAULT CURRENT_TIMESTAMP COMMENT 'creation time',
  `ctype` char(2) DEFAULT NULL COMMENT 'comment type: h = positive, m = neutral, l = negative',
  PRIMARY KEY (`cid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

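The code below scrapes the short reviews of 《少年的你》 (Better Days). The ideal was to store everything in the database, run some big-data analysis to extract something of value, and maybe even produce a flashy word cloud. In reality, the anti-scraping mechanism brought me back to earth: it was game over after ten pages. Still, as my first real use of PyQuery, it is kept here as a memento: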

# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : pyquery_comment.py
# @Project: Python Notes
# @CreateTime : 2020/5/15 14:52:37

from pyquery import PyQuery as pq
import requests
import pymysql
import random
import time


def login(url):
    # Despite its name, this function only sends a GET request that carries a logged-in Cookie.
    user_agents = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/81.0.4044.122 Safari/537.36']
    # For the header values, open the URL, press F12, and look at the Network tab
    headers = {
        'Cookie': 'gr_user_id=36d2fea0-91b3-4445-b0c4-2f1eec5e681e; bid=3VpjSZO1pLI; douban-fav-remind=1; '
                  '__yadk_uid=AwRZnSg2z94qiZ0ziZx8rRTJx0GARPvJ; '
                  'trc_cookie_storage=taboola%2520global%253Auser-id%3D54ee53eb-ce52-4f1e-b503-f2b4ba820774'
                  '-tuct2359b57; __gads=ID=953ce3860eb89d60:T=1571272451:S=ALNI_MYayAKeBBq7vr_NBvFfsaRTVepXaw; '
                  '_vwo_uuid_v2=D2CFD349D628C78D38815D8765A3EB401|d8942a02c6249450bd209b499e64d81c; ll="118297"; '
                  'douban-profile-remind=1; _ga=GA1.2.2128425525.1488504434; push_doumail_num=0; push_noty_num=0; '
                  '__utmv=30149280.19762; ct=y; __utmc=30149280; '
                  '_pk_ref.100001.8cb4=%5B%22%22%2C%22%22%2C1589252865%2C%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl'
                  '%3DjxWgT7kJtprsF-uyr7ziX2Rid2J_n9ZVC9_Qu-JHCj9InQNIG3Ew5bcMZK8paZow%26wd%3D%26eqid'
                  '%3Dae210c5000009ced000000065eba12fc%22%5D; '
                  '_pk_id.100001.8cb4=a9140f060c7b64ae.1488504433.95.1589252865.1589247679.; '
                  'viewed="25811418_25904568_4849666_27069880_27608412_2086633_11535042_33413575_34430051_1469051"; '
                  'dbcl2="77249558:xmnxDXaS+r8"; ck=h_ZU; '
                  '__utma=30149280.2128425525.1488504434.1589768146.1589771266.143; '
                  '__utmz=30149280.1589771266.143.73.utmcsr=accounts.douban.com|utmccn=('
                  'referral)|utmcmd=referral|utmcct=/passport/login',
        'User-Agent': str(random.choice(user_agents)),
        'Referer': 'https://accounts.douban.com/passport/login',
        'Connection': 'keep-alive'
    }
    req = requests.get(url, headers=headers)
    return req


# Takes the comment type and the page number used to build the URL
def comment(ctype, page):
    num = page * 20
    url = ('https://movie.douban.com/subject/30166972/comments?start=' + str(num)
           + '&limit=20&sort=new_score&status=P&percent_type=' + ctype)
    html = login(url)
    html_doc = pq(html.text)
    # Open one connection per page instead of one per comment
    db = pymysql.connect(host='192.183.3.***', port=3306, user='nn',
                         password='******', database='nntest', charset='utf8mb4')
    cur = db.cursor()
    sql = 'INSERT INTO tb_movie_comments(commentator, comments, votes, createdate, ctype) ' \
          'VALUES(%s, %s, %s, %s, %s)'
    for data in html_doc('.comment-item').items():
        commentator = data('.comment-info a').text()
        comments = data('.short').text()
        votes = data('.votes').text()
        createdate = data('.comment-time').text()
        # Store the row in the database
        try:
            cur.execute(sql, (commentator, comments, votes, createdate, ctype))
            db.commit()
            print('Insert Successful!')
        except pymysql.MySQLError:
            print('Sorry,Failed!')
            db.rollback()
    cur.close()
    db.close()


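# To scrape in bulk and store everything in the database:
ctypes = ['h', 'm', 'l']
for ctype in ctypes:
    # Because of the anti-scraping limits, stop at 10 pages; page starts at 0
    for page in range(10):
        try:
            comment(ctype, page)
            print(ctype + ' page ' + str(page) + ': scraped and stored')
        except Exception:
            print(ctype + ' page ' + str(page) + ': scraping or storing failed')
    # Pause before moving on to the next comment type
    time.sleep(10)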

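Whoever holds the data holds the world. Last of all, the important thing said three times: use a throwaway account for scraping! A throwaway account! A throwaway account!!! If you don't court trouble, you won't find it; a thief cannot work in broad daylight, and this little data thief paid a price that no rollback can undo, on my real account (QAQ).

Simulated login, proxies, and the like are now on the to-do list.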

 
