Scraping Scenic-Spot Reviews: A Weibo Scraper and a Ctrip Scraper

I recently wrote a couple of scrapers for a classmate to collect scenic-spot reviews from Weibo and from Ctrip. All the examples below use the Fuzimiao (Confucius Temple) scenic spot.

Weibo Scraper

Search Weibo for the keyword Fuzimiao and open the results page. Inspecting it with the browser's developer tools shows that the content is loaded via Ajax, so we can grab the request URL and headers from the network panel; analyzing the query string, the only parameter that varies is page.
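For reference, the JSON this endpoint returns is nested several levels deep. The sketch below is a trimmed illustration of just the fields the scraper relies on, reconstructed from the parsing code rather than copied from a full captured payload:

{
    "data": {
        "cards": [
            {
                "card_group": [
                    {"mblog": {"id": "...", "created_at": "...", "text": "..."}}
                ]
            }
        ]
    }
}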

With that, we can scrape the Weibo posts published at the Fuzimiao location. Note that the first page is laid out a little differently, so I start scraping from the second page. Also, Weibo has an anti-scraping mechanism: you can fetch at most 150 pages.

The script produces a CSV file, which you can convert to an Excel file by hand afterwards. Specifically: right-click the CSV file and open it in Notepad, save it as a .txt file, create a new Excel workbook, drag the .txt file in, tidy up the formatting (word wrap and so on), and finally save it as an Excel file.
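If you would rather skip the manual steps, the same conversion takes a few lines of pandas. This is just a convenience sketch, not part of the original workflow; it assumes pandas and openpyxl are installed, and that the CSV contains the three columns the script below writes:

import pandas as pd

# The scraper appends rows without a header, so name the columns explicitly.
df = pd.read_csv("fuzimiao.csv", names=["id", "created_at", "text"])
df.to_excel("fuzimiao.xlsx", index=False)  # .xlsx output needs openpyxl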

import time
import requests
from urllib.parse import urlencode
from pyquery import PyQuery as pq
import csv

baseurl = 'https://m.weibo.cn/api/container/getIndex?'
headers = {
    'Host': 'm.weibo.cn',
    'Referer': 'https://m.weibo.cn/p/100101B2094655D764AAFC4799',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}


def get_page(page):
    # Request one page of the location feed and return the parsed JSON.
    params = {
        'containerid': '100101B2094655D764AAFC4799_-_weibofeed',
        'page': page
    }
    url = baseurl + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('ERROR', e.args)


def parse_page(json, page: int):
    # Each post sits at data -> cards -> card_group -> mblog in the response.
    if json:
        items = json.get('data').get('cards')[0].get('card_group')
        with open("fuzimiao.csv", "a", encoding='utf-8', newline='') as csvfile:
            fieldnames = ['id', 'created_at', 'text']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            for item in items:
                if page == 1:
                    # The first page is laid out differently, so skip it.
                    continue
                mblog = item.get('mblog')
                weibo_id = mblog.get('id')  # avoid shadowing the id() builtin
                created_at = mblog.get('created_at')
                text = pq(mblog.get('text')).text()  # strip HTML tags from the body
                writer.writerow({'id': weibo_id, 'created_at': created_at, 'text': text})

max_page = 150  # Weibo serves at most about 150 pages of this feed

if __name__ == '__main__':
    # Page 1 is laid out differently, so start from page 2.
    for page in range(2, max_page + 1):
        print(page)
        json = get_page(page)
        parse_page(json, page)
        time.sleep(2)  # pause between requests to stay polite
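Note that the script only ever appends rows, so fuzimiao.csv ends up without a header line. If you want column names in the file (which also makes the Excel step cleaner), one option, not in the original script, is to write the header once to a fresh file before the loop runs:

import csv

# Run once on a fresh file; "w" truncates anything already there.
with open("fuzimiao.csv", "w", encoding='utf-8', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['id', 'created_at', 'text'])
    writer.writeheader()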


Ctrip Scraper

I adapted this Ctrip scraper from code another author posted online and made a few modifications. The poiID parameter in the URL identifies the scenic spot; you have to look it up yourself by inspecting elements on the corresponding attraction page (the districtId, districtEName, and resourceId values in the sample URL come from the same inspection). The 502 is just an arbitrarily chosen maximum page count and can be changed as needed.

import csv
import requests
from bs4 import BeautifulSoup

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "Connection": "keep-alive",
    "Host": "you.ctrip.com",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
}

for i in range(1, 502):  # 502 is an arbitrary upper bound on the page count
    try:
        print(i)
        # poiID and the other IDs in the query string identify the scenic spot.
        url = "http://you.ctrip.com/destinationsite/TTDSecond/SharedView/AsynCommentView?poiID=75702&districtId=702&districtEName=Yangshuo&pagenow=%d&order=3.0&star=0.0&tourist=0.0&resourceId=22079&resourcetype=2" % i
        html = requests.get(url, headers=headers)
        html.encoding = "utf-8"
        soup = BeautifulSoup(html.text, "html.parser")
        block = soup.find_all(class_="comment_single")
        with open("fuzimiao_xiecheng.csv", "a", encoding='utf-8', newline='') as csvfile:
            fieldnames = ['time', 'text']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            for j in block:
                text = j.find(class_="heightbox").text
                comment_time = j.find(class_="time_line").text
                writer.writerow({'time': comment_time, 'text': text})
    except Exception as e:
        # Skip pages that fail to download or parse, but log the reason.
        print('ERROR', e)
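Instead of hardcoding 502 as the page limit, the loop can also stop itself once a page comes back with no comments. Below is a minimal sketch of that variant, reusing the headers dict from the script above; it assumes an empty comment_single list really does mean the last page has been passed:

import requests
from bs4 import BeautifulSoup

url_template = "http://you.ctrip.com/destinationsite/TTDSecond/SharedView/AsynCommentView?poiID=75702&districtId=702&districtEName=Yangshuo&pagenow=%d&order=3.0&star=0.0&tourist=0.0&resourceId=22079&resourcetype=2"

page = 1
while True:
    html = requests.get(url_template % page, headers=headers)
    html.encoding = "utf-8"
    soup = BeautifulSoup(html.text, "html.parser")
    if not soup.find_all(class_="comment_single"):
        # No comments on this page: assume we have walked past the last one.
        break
    # ... extract and save each comment exactly as in the loop above ...
    page += 1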



 
