I recently helped a classmate write crawlers that scrape tourist-attraction reviews from Weibo and from Ctrip. The examples below all use the Fuzimiao (Confucius Temple) attraction.
Weibo crawler
Searching Weibo for the keyword 夫子廟 gives us the page URL. Inspecting it with the browser's developer tools shows the content is loaded via Ajax, so we capture the request it sends and examine its parameters: the only one that changes between requests is page.
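Concretely, each Ajax request URL is just the base endpoint plus the URL-encoded parameters. The containerid below is the one for the Fuzimiao page, copied from the captured request; only page varies:

```python
from urllib.parse import urlencode

baseurl = 'https://m.weibo.cn/api/container/getIndex?'
params = {
    # containerid identifies the Fuzimiao location feed (from the captured request)
    'containerid': '100101B2094655D764AAFC4799_-_weibofeed',
    'page': 2,
}
url = baseurl + urlencode(params)
print(url)
```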
Now we can scrape the Weibo posts tagged at Fuzimiao. Note that the first page is structured a little differently, so I start crawling from page 2. Weibo also has an anti-crawling limit: you can fetch at most 150 pages.
The output here is a CSV file, which can then be converted to Excel by hand: right-click the CSV and open it with Notepad, save it as a .txt file, create a new Excel workbook, drag the .txt file in, tidy up the formatting (word wrap and so on), and finally save it as an Excel file.
import csv
import time
from urllib.parse import urlencode

import requests
from pyquery import PyQuery as pq

baseurl = 'https://m.weibo.cn/api/container/getIndex?'
headers = {
    'Host': 'm.weibo.cn',
    'Referer': 'https://m.weibo.com/p/100101B2094655D764AAFC4799',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

def get_page(page):
    # Only the page parameter changes between requests.
    params = {
        'containerid': '100101B2094655D764AAFC4799_-_weibofeed',
        'page': page,
    }
    url = baseurl + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('ERROR', e.args)

def parse_page(data, page: int):
    if not data:
        return
    items = data.get('data').get('cards')[0].get('card_group')
    for item in items:
        mblog = item.get('mblog')
        weibo_id = mblog.get('id')
        created_at = mblog.get('created_at')
        text = pq(mblog.get('text')).text()  # strip the HTML markup from the post body
        with open('fuzimiao.csv', 'a', encoding='utf-8', newline='') as csvfile:
            fieldnames = ['id', 'created_at', 'text']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writerow({'id': weibo_id, 'created_at': created_at, 'text': text})

max_page = 150

if __name__ == '__main__':
    # Page 1 is structured differently, so start from page 2;
    # Weibo stops serving results after about 150 pages.
    for page in range(2, max_page + 1):
        print(page)
        data = get_page(page)
        parse_page(data, page)
        time.sleep(2)
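For reference, here is a minimal stand-in for the JSON the getIndex endpoint returns. The field values are invented, but the nesting matches the path parse_page walks (the real response carries many more fields, and the text field contains raw HTML, which is why the crawler runs it through pyquery):

```python
# Invented stand-in for one getIndex response, trimmed to the fields used above.
data = {
    'data': {
        'cards': [{
            'card_group': [{
                'mblog': {
                    'id': '4001234567890123',
                    'created_at': '07-15',
                    'text': '夫子廟真漂亮<span class="url-icon"></span>',  # raw HTML
                }
            }]
        }]
    }
}

# The same traversal parse_page performs:
items = data.get('data').get('cards')[0].get('card_group')
mblog = items[0].get('mblog')
print(mblog.get('id'), mblog.get('created_at'))
```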
Ctrip crawler
I adapted this Ctrip crawler from code someone posted online, with a few changes. The poiID parameter in the URL identifies the attraction; you have to get it yourself by inspecting the corresponding attraction page with the developer tools. The 502 is just an arbitrary maximum page count and can be changed as needed.
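To make the per-attraction parameters easier to swap out, the query string can be assembled with urlencode instead of string formatting. The default values below are the ones from the URL in the script; the helper name is my own:

```python
from urllib.parse import urlencode

BASE = 'http://you.ctrip.com/destinationsite/TTDSecond/SharedView/AsynCommentView?'

def comment_page_url(pagenow, poiID=75702, districtId=702,
                     districtEName='Yangshuo', resourceId=22079):
    """Build the review-page URL; replace the defaults with your attraction's IDs."""
    params = {
        'poiID': poiID,
        'districtId': districtId,
        'districtEName': districtEName,
        'pagenow': pagenow,
        'order': '3.0',      # sort order, as observed in the captured request
        'star': '0.0',       # 0.0 appears to mean "no star filter"
        'tourist': '0.0',
        'resourceId': resourceId,
        'resourcetype': 2,
    }
    return BASE + urlencode(params)

print(comment_page_url(1))
```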
import csv

import requests
from bs4 import BeautifulSoup

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "Connection": "keep-alive",
    "Host": "you.ctrip.com",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
}

for i in range(1, 502):
    try:
        print(i)
        # Replace poiID (and the district/resource parameters) with the values
        # for your own attraction, taken from its page via the developer tools.
        url = ("http://you.ctrip.com/destinationsite/TTDSecond/SharedView/AsynCommentView"
               "?poiID=75702&districtId=702&districtEName=Yangshuo"
               "&pagenow=%d&order=3.0&star=0.0&tourist=0.0"
               "&resourceId=22079&resourcetype=2" % i)
        html = requests.get(url, headers=headers)
        html.encoding = "utf-8"
        soup = BeautifulSoup(html.content, "html.parser")  # name the parser explicitly
        for block in soup.find_all(class_="comment_single"):
            text = block.find(class_="heightbox").text
            time = block.find(class_="time_line").text
            with open("fuzimiao_xiecheng.csv", "a", encoding="utf-8", newline="") as csvfile:
                writer = csv.DictWriter(csvfile, fieldnames=["time", "text"])
                writer.writerow({"time": time, "text": text})
    except Exception as e:
        # A bare "except: pass" would hide every failure; at least report the page.
        print("page %d failed: %s" % (i, e))
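The extraction step can be checked offline against a fragment using the same class names the script selects on. The HTML below is a hand-written stand-in, not a real Ctrip response:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in mimicking one review block on the Ctrip page.
sample = '''
<div class="comment_single">
  <span class="heightbox">夫子廟夜景很美，值得一去。</span>
  <em class="time_line">2018-07-15</em>
</div>
'''

soup = BeautifulSoup(sample, 'html.parser')
for block in soup.find_all(class_='comment_single'):
    text = block.find(class_='heightbox').text   # review body
    time = block.find(class_='time_line').text   # review date
    print(time, text)
```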