Instagram Crawler with Breakpoint Resume: An AJAX Asynchronous-Loading Crawler

Task Description

For a given list of Instagram accounts, we need to crawl all of their posts. Each post should include:

  • timestamp
  • caption
  • image(s)
  • like count
  • comment count

If the post is a video, we also need:

  • the video
  • view count

Analyzing Instagram's Site Structure

Instagram serves post data as JSON: each JSON response holds 12 posts and supplies the cursor for querying the next batch.

Because the GraphQL endpoint identifies each query by a hash (query_hash), a typical query URL looks like this:

https://www.instagram.com/graphql/query/?query_hash=a5164aed103f24b03e7b7747a2d94e3c&variables=%7B%22id%22%3A%22{user_id}%22%2C%22first%22%3A12%2C%22after%22%3A%22{cursor}%22%7D

In other words, fetching each batch of posts requires two parameters: user_id and cursor. Substitute them into the URL and send a request to get the desired JSON. In fact, the filled-in URL returns the same JSON when opened directly in a browser.
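
The variables part of that link is just URL-encoded JSON, so the query URL can also be assembled explicitly. A minimal sketch, where build_query_url is a hypothetical helper and the query_hash value is the one from the URL above:

import json
from urllib.parse import urlencode

QUERY_HASH = 'a5164aed103f24b03e7b7747a2d94e3c'

def build_query_url(user_id, cursor, first=12):
    # variables decodes to {"id": "...", "first": 12, "after": "..."}
    variables = json.dumps({'id': user_id, 'first': first, 'after': cursor},
                           separators=(',', ':'))
    return ('https://www.instagram.com/graphql/query/?'
            + urlencode({'query_hash': QUERY_HASH, 'variables': variables}))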

Clearly, the user_id and end_cursor in the query URL above are carried over from the previous page, but the first page of each account has nothing to inherit from, so it must be crawled separately. Fortunately, the first-page URL of every account is fixed:

https://www.instagram.com/{account}/

Parsing the first page yields the user_id and the cursor for the second page.
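
As a minimal sketch of that parsing step (first_page_params is a hypothetical helper; it assumes the profile HTML still embeds window._sharedData the way the full program below expects):

import re
import json

def first_page_params(html):
    # user_id is embedded in the profile HTML as "profilePage_<id>"
    user_id = re.findall('"profilePage_([0-9]+)"', html, re.S)[0]
    # window._sharedData carries the first 12 posts plus the paging info
    shared = re.search(r'window\._sharedData\s*=\s*(\{.*?\});', html, re.S).group(1)
    media = json.loads(shared)['entry_data']['ProfilePage'][0]['graphql']['user'][
        'edge_owner_to_timeline_media']
    return user_id, media['page_info']['end_cursor'], media['page_info']['has_next_page']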

Data Storage and Visualization Strategy

To store the data efficiently, a two-level structure is used: the outer level is a list, the inner level is a dict. The dict has one key for each variable to be crawled, and a new dict is appended for every post. The result looks like this:

[{'img_url': 'https://scontent-frt3-1.cdninstagram.com/vp/2de4fe0ca443d27ead0601306a4d2d9f/5E65FD4A/t51.2885-15/e35/73497417_178613769947279_3483644294279361168_n.jpg?_nc_ht=scontent-frt3-1.cdninstagram.com&_nc_cat=1', 'comment_count': 9042, 'like_count': 565657, 'text': 'Meet today’s #WeeklyFluff, Albert (@pompous.albert), a Selkirk Rex cat who might look faux... but is keeping it real. 😻\\u2063\\n\\u2063\\nPhoto by @pompous.albert'}, 
{'img_url': 'https://scontent-frt3-1.cdninstagram.com/vp/ff83ef12404713e3584ba07441a23913/5E856EC0/t51.2885-15/e35/p1080x1080/72783038_1207153232810009_5652648210556063310_n.jpg?_nc_ht=scontent-frt3-1.cdninstagram.com&_nc_cat=1', 'comment_count': 5506, 'like_count': 637442, 'text': 'For Colombian singer-songwriter Camilo (@camilomusica), the #LatinGRAMMY Awards are a big party of close friends, who just happen to be some of the biggest artists in the world right now. 🔥🌎\\u2063\\n\\u2063\\nSee who Camilo runs into and guess who he’s going to collab with next. It’s #GameOn at the @latingrammys, right now on our story.'}]

To make the crawl results easier to read, a helper function converts this structure into a CSV table:

import csv

def nestedlist2csv(rows, out_file):
    # Write the header (keys of the first dict), then one row of values per post
    with open(out_file, 'w') as f:
        w = csv.writer(f)
        w.writerow(rows[0].keys())
        for row in rows:
            w.writerow(row.values())
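
Since pandas is imported in the main program anyway, pd.DataFrame(samples).to_csv(out_file, index=False) is a one-line equivalent.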

The final result is one CSV per account, with a row for each post.

Main Program

import re
import json
import time
import random
import requests
from pyquery import PyQuery as pq
import pandas as pd
import csv
from datetime import datetime
import math

def baseurl(acc):
    # Fixed profile-page URL for a given account name
    return 'https://www.instagram.com/%s/' % acc

uri = 'https://www.instagram.com/graphql/query/?query_hash=a5164aed103f24b03e7b7747a2d94e3c&variables=%7B%22id%22%3A%22{user_id}%22%2C%22first%22%3A12%2C%22after%22%3A%22{cursor}%22%7D'

idlist = pd.read_table('accidlist.txt',header=0,encoding='gb18030',delim_whitespace=True)
idlist.columns=['acc','id','postno']

headers = {
    "Origin": "https://www.instagram.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/58.0.3029.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, sdch, br",
    "accept-language": "zh-CN,zh;q=0.8",
    "X-Instragram-AJAX": "1",
    "X-Requested-With": "XMLHttpRequest",
    "Upgrade-Insecure-Requests": "1",
}

def get_html(url):
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        else:
            print('Error fetching page HTML, status code:', response.status_code)
    except Exception as e:
        print(e)
        return None


def get_json(headers, url):
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.json()
        else:
            print('Error fetching JSON, status code:', response.status_code)
    except Exception as e:
        print(e)
        # Back off for 60-100 seconds, then retry (retries are unbounded)
        time.sleep(60 + float(random.randint(1, 4000)) / 100)
        return get_json(headers, url)


def get_pics(picurl,picname):
    picresp = requests.get(picurl, headers=headers, timeout=10)
    with open('%s.png'%picname, 'wb') as f:
        f.write(picresp.content)

def nestedlist2csv(rows, out_file):
    # Write the header (keys of the first dict), then one row of values per post
    with open(out_file, 'w') as f:
        w = csv.writer(f)
        w.writerow(rows[0].keys())
        for row in rows:
            w.writerow(row.values())

def get_date(timestamp):
    local_str_time = datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d')
    return local_str_time

def get_samples(html,acc):
    samples = []
    page = 0
    user_id = re.findall('"profilePage_([0-9]+)"', html, re.S)[0]
    print("The user id is %s"%user_id)

    doc = pq(html)
    items = doc('script[type="text/javascript"]').items()
    for item in items:
        if item.text().strip().startswith('window._sharedData'):
            # Strip the "window._sharedData = " prefix and the trailing ";"
            js_data = json.loads(item.text()[21:-1])

            edges = js_data["entry_data"]["ProfilePage"][0]["graphql"]["user"]["edge_owner_to_timeline_media"]["edges"]
            totalpost = js_data["entry_data"]["ProfilePage"][0]["graphql"]["user"]["edge_owner_to_timeline_media"]["count"]
            totalpage = math.ceil(totalpost/12)
            page_info = js_data["entry_data"]["ProfilePage"][0]["graphql"]["user"]["edge_owner_to_timeline_media"][
                'page_info']
            cursor = page_info['end_cursor']
            flag = page_info['has_next_page']

            for edge in edges:
                sample = {}

                if edge['node']['display_url']:
                    sample["Influencer"] = acc
                    timestamp = edge['node']['taken_at_timestamp']
                    sample["date"] = get_date(timestamp)
                    sample["comment_count"] = edge['node']['edge_media_to_comment']["count"]
                    sample["like_count"] = edge['node']['edge_liked_by']["count"]

                if edge['node']['shortcode']:
                    shortcode = edge['node']['shortcode']
                    sample['postlink'] = 'https://www.instagram.com/p/%s/'%(shortcode)
                    textUrl = 'https://www.instagram.com/p/' + shortcode + '/?__a=1'
                    textResponse = get_json(headers, textUrl)
                    try:
                        # The caption text lives at edges[0]['node']['text']
                        sample["caption"] = textResponse['graphql']['shortcode_media'][
                            'edge_media_to_caption']['edges'][0]['node']['text']
                    except Exception:
                        sample["caption"] = ""
                    children = textResponse["graphql"]["shortcode_media"].get('edge_sidecar_to_children')
                    if children:
                        # Multi-image post: join all display_urls with commas
                        sample['multipic'] = 'True'
                        picurls = ""
                        for child in children['edges']:
                            picurls = picurls + child['node']['display_url'] + ','
                        sample['img_urls'] = picurls
                    else:
                        sample['multipic'] = 'False'
                        sample['img_urls'] = textResponse['graphql']['shortcode_media']['display_url']
                    isvideo = textResponse["graphql"]["shortcode_media"].get('is_video')
                    if isvideo:
                        sample['video_url'] = textResponse["graphql"]["shortcode_media"].get('video_url')
                        sample['video_view_count'] = textResponse["graphql"]["shortcode_media"].get('video_view_count')
                    else:
                        sample['video_url'] = ""
                        sample['video_view_count'] = ""

                samples.append(sample)
                time.sleep(float(random.randint(1, 3)))
            nestedlist2csv(samples,'%s_postlist.csv'%acc)
            page += 1
            print("Finish the %s page of %s, the total page number is %s"%(page,acc,totalpage))

    while flag:
        url = uri.format(user_id=user_id, cursor=cursor)
        print([user_id, cursor])
        js_data = get_json(headers, url)
        infos = js_data['data']['user']['edge_owner_to_timeline_media']['edges']
        cursor = js_data['data']['user']['edge_owner_to_timeline_media']['page_info']['end_cursor']
        flag = js_data['data']['user']['edge_owner_to_timeline_media']['page_info']['has_next_page']
        for info in infos:
            sample = {}
            sample["Influencer"] = acc
            timestamp = info['node']['taken_at_timestamp']
            sample["date"] = get_date(timestamp)
            sample["comment_count"] = info['node']['edge_media_to_comment']["count"]
            sample["like_count"] = info['node']['edge_media_preview_like']["count"]

            if info['node']['shortcode']:
                time.sleep(1)
                shortcode = info['node']['shortcode']
                sample['postlink'] = 'https://www.instagram.com/p/%s/' % (shortcode)
                textUrl = 'https://www.instagram.com/p/' + shortcode + '/?__a=1'
                textResponse = get_json(headers, textUrl)

                try:
                    # The caption text lives at edges[0]['node']['text']
                    sample["caption"] = textResponse['graphql']['shortcode_media'][
                        'edge_media_to_caption']['edges'][0]['node']['text']
                except Exception:
                    sample["caption"] = ""

                children = textResponse["graphql"]["shortcode_media"].get('edge_sidecar_to_children')
                if children:
                    sample['multipic'] = 'True'
                    picurls = ""
                    for child in children['edges']:
                        picurls = picurls + child['node']['display_url'] + ','
                    sample['img_urls'] = picurls
                else:
                    sample['multipic'] = 'False'
                    sample['img_urls'] = textResponse['graphql']['shortcode_media']['display_url']
                isvideo = textResponse["graphql"]["shortcode_media"].get('is_video')
                if isvideo:
                    sample['video_url'] = textResponse["graphql"]["shortcode_media"].get('video_url')
                    sample['video_view_count'] = textResponse["graphql"]["shortcode_media"].get('video_view_count')
                else:
                    sample['video_url'] = ""
                    sample['video_view_count'] = ""
            samples.append(sample)
            time.sleep(float(random.randint(1, 3)))
        nestedlist2csv(samples,'%s_postlist.csv'%acc)
        page += 1
        print("Finish the %s page of %s, the total page number is %s" % (page, acc, totalpage))


def main():
    for i in range(len(idlist)):
        acc = idlist.loc[i, 'acc']
        url = baseurl(acc)
        print(url)
        html = get_html(url)
        ticks = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        print("Start processing account %s, current time: %s" % (acc, ticks))
        try:
            get_samples(html, acc)
        except Exception as e:
            print(e)
            print("Program interrupted at %s" % datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
            break
        print("Finished account %s, current time: %s" % (acc, datetime.now().strftime('%Y-%m-%d %H:%M:%S')))
        time.sleep(float(random.randint(1, 4000) / 10))

if __name__ == '__main__':
    start = time.time()
    main()

Breakpoint Resume
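
The main crawler prints the current [user_id, cursor] pair before requesting each page. If a crawl is interrupted, copy the last printed pair, the page count, and the account name into the __main__ block at the bottom of the script below; it then resumes paging from that cursor and appends to the same per-account CSV (nestedlist2csv opens the file in append mode here, so the header is not rewritten).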

import time
import random
import requests
import pandas as pd
import csv
from datetime import datetime

uri = 'https://www.instagram.com/graphql/query/?query_hash=a5164aed103f24b03e7b7747a2d94e3c&variables=%7B%22id%22%3A%22{user_id}%22%2C%22first%22%3A12%2C%22after%22%3A%22{cursor}%22%7D'

idlist = pd.read_table('accidlist.txt',header=0,encoding='gb18030',delim_whitespace=True)
idlist.columns=['acc','id','postno']

headers = {
    "Origin": "https://www.instagram.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/58.0.3029.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, sdch, br",
    "accept-language": "zh-CN,zh;q=0.8",
    "X-Instragram-AJAX": "1",
    "X-Requested-With": "XMLHttpRequest",
    "Upgrade-Insecure-Requests": "1",
}

def get_html(url):
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        else:
            print('Error fetching page HTML, status code:', response.status_code)
    except Exception as e:
        print(e)
        return None


def get_json(headers, url):
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.json()
        else:
            print('Error fetching JSON, status code:', response.status_code)
    except Exception as e:
        print(e)
        # Back off for 60-100 seconds, then retry (retries are unbounded)
        time.sleep(60 + float(random.randint(1, 4000)) / 100)
        return get_json(headers, url)


def get_pics(picurl,picname):
    picresp = requests.get(picurl, headers=headers, timeout=10)
    with open('%s.png'%picname, 'wb') as f:
        f.write(picresp.content)

def nestedlist2csv(rows, out_file):
    # Append mode: the header row was already written by the main crawl
    with open(out_file, 'a') as f:
        w = csv.writer(f)
        for row in rows:
            w.writerow(row.values())

def get_date(timestamp):
    local_str_time = datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d')
    return local_str_time

def get_breakpoint(breakpage, user_id, cursor, acc, flag):
    page = breakpage
    while flag:
        samples = []
        url = uri.format(user_id=user_id, cursor=cursor)
        print([user_id, cursor])
        js_data = get_json(headers, url)
        infos = js_data['data']['user']['edge_owner_to_timeline_media']['edges']
        cursor = js_data['data']['user']['edge_owner_to_timeline_media']['page_info']['end_cursor']
        flag = js_data['data']['user']['edge_owner_to_timeline_media']['page_info']['has_next_page']
        for info in infos:
            sample = {}
            sample["Influencer"] = acc
            timestamp = info['node']['taken_at_timestamp']
            sample["date"] = get_date(timestamp)
            sample["comment_count"] = info['node']['edge_media_to_comment']["count"]
            sample["like_count"] = info['node']['edge_media_preview_like']["count"]

            if info['node']['shortcode']:
                time.sleep(1)
                shortcode = info['node']['shortcode']
                sample['postlink'] = 'https://www.instagram.com/p/%s/' % (shortcode)
                textUrl = 'https://www.instagram.com/p/' + shortcode + '/?__a=1'
                textResponse = get_json(headers, textUrl)

                try:
                    # The caption text lives at edges[0]['node']['text']
                    sample["caption"] = textResponse['graphql']['shortcode_media'][
                        'edge_media_to_caption']['edges'][0]['node']['text']
                except Exception:
                    sample["caption"] = ""

                children = textResponse["graphql"]["shortcode_media"].get('edge_sidecar_to_children')
                if children:
                    sample['multipic'] = 'True'
                    picurls = ""
                    for child in children['edges']:
                        picurls = picurls + child['node']['display_url'] + ','
                    sample['img_urls'] = picurls
                else:
                    sample['multipic'] = 'False'
                    sample['img_urls'] = textResponse['graphql']['shortcode_media']['display_url']
                isvideo = textResponse["graphql"]["shortcode_media"].get('is_video')
                if isvideo:
                    sample['video_url'] = textResponse["graphql"]["shortcode_media"].get('video_url')
                    sample['video_view_count'] = textResponse["graphql"]["shortcode_media"].get('video_view_count')
                else:
                    sample['video_url'] = ""
                    sample['video_view_count'] = ""
            samples.append(sample)
            time.sleep(float(random.randint(1, 3)))
        nestedlist2csv(samples, '%s_postlist.csv' % acc)
        page += 1
        print("Finish the %s page of %s" % (page, acc))

if __name__ == '__main__':
    # Fill in the last [user_id, cursor] pair, page count, and account name
    # printed before the interruption
    breakpage = 58
    user_id = "89899"
    cursor = "QVFBajF5bVdqV0otYUhfSGJHTFZOdDhULTQ3X19kU0J3ZXd5cXJ2UnNkblNkQW5sU3A0UHFNeU1YbjU1Sm5UZ3pkaUphTC1xZVVyeTRaLXFFdDRyc0lXNw=="
    acc = "oliviermorisse"
    flag = True
    ticks = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    print("Restart time is %s" % ticks)
    get_breakpoint(breakpage, user_id, cursor, acc, flag)

Crawling All the Images
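
A final standalone script walks through the post lists produced above and downloads every image: single-image posts use img_urls directly, while multi-image posts split the comma-separated img_urls field and save one file per picture.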

import requests
import pandas as pd
import random
import os
import time
from datetime import datetime


picpath = '/Users/mengjiexu/Googledrive/Influencers_pic/'
postpath = '/Users/mengjiexu/Googledrive/Influencers_post/'

headers = {
    "Origin": "https://www.instagram.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/58.0.3029.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, sdch, br",
    "accept-language": "zh-CN,zh;q=0.8",
    "X-Instragram-AJAX": "1",
    "X-Requested-With": "XMLHttpRequest",
    "Upgrade-Insecure-Requests": "1",
}

def parsepics(filename):
    # Post lists are read back as Excel files here (converted from the CSVs
    # written by the crawler); use pd.read_csv if they are still plain CSVs
    data = pd.read_excel(postpath + filename)
    influencer = filename.split('_postlist')[0]
    print(influencer)
    inpicpath = '%s%s_pic/'%(picpath,influencer)
    print(inpicpath)
    os.makedirs(inpicpath, exist_ok=True)
    os.chdir(inpicpath)
    postlink = data['postlink']
    piclink = data['img_urls']
    multipic = data['multipic']
    for i in range(len(postlink)):
        postid = postlink[i].split('p/')[-1].split('/')[0]
        postindex = len(postlink) - i
        print('This is the %s post of %s' % (i, influencer))
        # Multi-image post: img_urls is a comma-separated list of URLs
        # (multipic must be a real boolean here; the string 'False' is truthy)
        if multipic[i]:
            pics = piclink[i].split(',')[:-1]
            for j in range(len(pics)):
                try:
                    picresp = requests.get(pics[j], headers=headers, timeout=10)
                    with open('%s%s_%s_%s_%s.jpeg' % (inpicpath, influencer, postindex, j, postid), 'wb') as f:
                        f.write(picresp.content)
                    time.sleep(float(random.randint(0, 2)))
                except:
                    pass
        else:
            try:
                picresp = requests.get(piclink[i], headers=headers, timeout=10)
                with open('%s%s_%s_%s.jpeg' % (inpicpath, influencer, postindex, postid), 'wb') as f:
                    f.write(picresp.content)
                time.sleep(float(random.randint(0, 1)))
            except:
                pass


for filename in os.listdir(postpath):
    with open(picpath + 'Processinghistory.txt', 'a') as f:
        ticks = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        f.write('Start to process %s, starting time is %s' % (filename, ticks) + '\r')
        parsepics(filename)
        # Recompute the timestamp so the logged end time is not the start time
        ticks = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        f.write('End processing %s, ending time is %s' % (filename, ticks) + '\r')
        time.sleep(float(random.randint(0, 1)))

Next Steps

  • Run the crawler on a server or on Colab
  • Crawl each account's followers

Main Reference

https://blog.csdn.net/qq_27297393/article/details/82915102
