Scraping Attraction Reviews: a Weibo Crawler and a Ctrip Crawler

I recently wrote a couple of crawlers for a classmate to collect attraction reviews from Weibo and from Ctrip. The examples below all use the Fuzimiao (Confucius Temple) scenic area.

Weibo crawler

Searching for the keyword Fuzimiao on Weibo gives a results page. Inspecting it with the browser's developer tools shows that the content is loaded via Ajax, so we can grab the underlying request from the Network panel. Looking at its query string, the only parameter that changes from request to request is page.
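Before writing the parser, it helps to sanity-check the JSON the endpoint returns. Below is a minimal probe, assuming the same containerid as in the script further down; whether the endpoint still responds in this shape is an assumption.

import requests
from urllib.parse import urlencode

# Quick probe of the Weibo container API (containerid taken from the script below)
params = {
    'containerid': '100101B2094655D764AAFC4799_-_weibofeed',
    'page': 2,
}
url = 'https://m.weibo.cn/api/container/getIndex?' + urlencode(params)
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0', 'X-Requested-With': 'XMLHttpRequest'})
data = resp.json()
print(data.keys())                         # expect something like dict_keys(['ok', 'data'])
cards = data.get('data', {}).get('cards', [])
if cards:
    print(cards[0].keys())                 # the card_group with the actual posts lives in here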

With that, we can scrape the Weibo posts published at the Fuzimiao location. Note that the first page is laid out a little differently, so I start crawling from page 2. Weibo also has an anti-scraping limit: you can only fetch about 150 pages at most.

The output is a CSV file, which can later be converted to Excel by hand. One way to do it: right-click the CSV file and open it with Notepad, save it as a TXT file, create a new Excel workbook, drag the TXT file in, tidy up the formatting (word wrap and so on), and finally save it as an Excel file.
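If you would rather skip the manual route, a couple of lines of pandas can do the conversion directly. This is just a sketch, assuming pandas and openpyxl are installed; the columns are named explicitly because the script below never writes a header row.

import pandas as pd

# Read the crawler output (no header row in the file, so name the columns ourselves)
df = pd.read_csv('fuzimiao.csv', names=['id', 'created_at', 'text'], encoding='utf-8')
df.to_excel('fuzimiao.xlsx', index=False)  # writing .xlsx requires openpyxl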

import time
import requests
from urllib.parse import urlencode
from pyquery import PyQuery as pq
import csv

baseurl = 'https://m.weibo.cn/api/container/getIndex?'
headers = {
    'Host': 'm.weibo.cn',
    'Referer': 'https://m.weibo.cn/p/100101B2094655D764AAFC4799',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}


def get_page(page):
    """Request one page of the Weibo container API and return the parsed JSON."""
    params = {
        'containerid': '100101B2094655D764AAFC4799_-_weibofeed',
        'page': page
    }
    url = baseurl + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('ERROR', e.args)


def parse_page(json, page: int):
    """Pull the mblog entries out of one page of JSON and append them to the CSV."""
    if json:
        items = json.get('data').get('cards')[0].get('card_group')
        for index, item in enumerate(items):
            if page == 1:
                # the first page is structured differently, so skip it
                continue
            else:
                item = item.get('mblog')
                weibo_id = item.get('id')
                created_at = item.get('created_at')
                # strip the HTML out of the post body
                text = pq(item.get('text')).text()
                with open("fuzimiao.csv", "a", encoding='utf-8') as csvfile:
                    fieldnames = ['id', 'created_at', 'text']
                    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
                    writer.writerow({'id': weibo_id, 'created_at': created_at, 'text': text})

max_page = 150

if __name__ == '__main__':
    # the first page is handled differently, so start from page 2 (Weibo caps results at ~150 pages)
    for page in range(2, max_page + 1):
        print(page)
        json = get_page(page)
        parse_page(json, page)
        time.sleep(2)
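One small gap in the script above is that DictWriter never writes a header row, so the CSV starts straight with data. If you want column names in the file, one option is to write the header once before starting the crawl; a sketch (assuming the same file name as above):

import csv
import os

# Write the header only once, before the first append
if not os.path.exists("fuzimiao.csv"):
    with open("fuzimiao.csv", "w", encoding="utf-8", newline="") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=['id', 'created_at', 'text'])
        writer.writeheader()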


Ctrip crawler

The Ctrip crawler is adapted from code someone else posted online, with a few changes of my own. The poiID parameter in the URL identifies the attraction; you need to find it yourself by inspecting the page of the attraction you want. The 502 in the loop is just an arbitrary maximum page count and can be changed as needed (or replaced by an automatic stopping condition, see the sketch after the code).

import csv
import requests
from bs4 import BeautifulSoup

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "Connection": "keep-alive",
    "Host": "you.ctrip.com",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
}

for i in range(1, 502):
    try:
        print(i)
        url = "http://you.ctrip.com/destinationsite/TTDSecond/SharedView/AsynCommentView?poiID=75702&districtId=702&districtEName=Yangshuo&pagenow=%d&order=3.0&star=0.0&tourist=0.0&resourceId=22079&resourcetype=2" % (i)
        html = requests.get(url, headers=headers)
        html.encoding = "utf-8"
        soup = BeautifulSoup(html.text, "html.parser")
        # each review sits in a comment_single block
        block = soup.find_all(class_="comment_single")
        for j in block:
            text = j.find(class_="heightbox").text
            time = j.find(class_="time_line").text
            with open("fuzimiao_xiecheng.csv", "a", encoding='utf-8') as csvfile:
                fieldnames = ['time', 'text']
                writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
                writer.writerow({'time': time, 'text': text})
    except Exception as e:
        # don't let a single bad page kill the whole crawl
        print('ERROR', e)
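Rather than guessing a maximum page number like 502, one option is to stop as soon as a page comes back with no comment_single blocks. A rough sketch of that stopping condition, reusing the headers dict and URL from above and only outlining the parsing step:

import requests
from bs4 import BeautifulSoup

# headers: reuse the dict defined at the top of the Ctrip script
page = 1
while True:
    url = "http://you.ctrip.com/destinationsite/TTDSecond/SharedView/AsynCommentView?poiID=75702&districtId=702&districtEName=Yangshuo&pagenow=%d&order=3.0&star=0.0&tourist=0.0&resourceId=22079&resourcetype=2" % page
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, "html.parser")
    block = soup.find_all(class_="comment_single")
    if not block:
        break              # an empty page means there are no more reviews
    # ... parse and write the rows exactly as in the loop above ...
    page += 1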



 
