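Background
I got onto Weibo late, and fewer than 40 of my classmates are among my friends there. To dig out the ones I have not found yet, I tried writing a crawler with scrapy.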
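Approach
Follow lists and fan lists on Weibo are public data, so a crawler can fetch them. People in the same circle of friends also tend to follow each other at a fairly high rate, and that is the way in for finding hidden friends. The plan is:
- Start from my own account and crawl the people I follow and my fans (level-0 friends)
- Starting from that first batch, crawl the follows and fans of the level-0 friends
- Analyze the network of relationships in the crawled data and pick out the people who are likely to be my friends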
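The three steps above can be sketched on an in-memory edge list. This is only an illustration, not the code used in the project: the function names `build_neighbors` and `expand_levels` and the toy edges are made up, and each edge is assumed to be a `(fan_id, followed_id)` pair like the relationship records crawled later.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Map each uid to every account it follows or is followed by."""
    neighbors = defaultdict(set)
    for fan_id, followed_id in edges:
        neighbors[fan_id].add(followed_id)
        neighbors[followed_id].add(fan_id)
    return neighbors


def expand_levels(edges, me):
    """Return (level-0 friends, level-1 accounts) around `me`."""
    neighbors = build_neighbors(edges)
    # Level 0: my follows plus my fans
    level0 = set(neighbors[me])
    # Level 1: follows and fans of level-0 friends, minus what is already known
    level1 = set()
    for uid in level0:
        level1 |= neighbors[uid]
    level1 -= level0 | {me}
    return level0, level1


# Toy data (hypothetical): each pair is (fan_id, followed_id)
edges = [("me", "a"), ("b", "me"), ("a", "c"), ("c", "b"), ("d", "a")]
level0, level1 = expand_levels(edges, "me")
print(sorted(level0), sorted(level1))
```

The level-1 set is where hidden friends would show up; the analysis step then has to decide which of them are connected strongly enough to count.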
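Problems encountered
- The amount of data crawled has to be kept under control: the number of users explodes with every level of iteration, so some filtering is needed
- Crawl speed is limited: on my clunky computer I measured about 17 pages/min
- Watch out for the big-V (verified celebrity) account trap: when crawling relationships, big-V accounts must be excluded! A fan page shows only about 10 people, and some big-V accounts have millions of fans on their own, so the crawler would get stuck inside that one account and never come out.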
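To put numbers on the big-V trap, here is a rough estimate using the figures above (about 10 fans per page, about 17 pages per minute). The helper names `fan_list_minutes` and `drop_big_accounts` are made up for illustration; the 500-fan cutoff is the threshold used later in this post.

```python
import math

FANS_PER_PAGE = 10     # from the post: roughly 10 fans shown per page
PAGES_PER_MINUTE = 17  # from the post: measured crawl speed


def fan_list_minutes(fans_num, per_page=FANS_PER_PAGE, pages_per_minute=PAGES_PER_MINUTE):
    """Estimate the minutes needed to page through one account's fan list."""
    pages = math.ceil(fans_num / per_page)
    return pages / pages_per_minute


def drop_big_accounts(fans_by_uid, max_fans=500):
    """Keep only the uids whose fan count does not exceed max_fans."""
    return [uid for uid, fans in fans_by_uid.items() if fans <= max_fans]


hours = fan_list_minutes(1_000_000) / 60
print(round(hours, 1))  # 98.0
```

At that speed, a single account with a million fans would take about four days of crawling by itself, which is why such accounts are filtered out before crawling relationships.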
With that analysis done, time to get started.
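Implementation
Very luckily, I found a framework for crawling Weibo data that someone else had already built, which I could take and modify. The framework covers crawling posts, comments, user info, and user relationships. I do not need posts or comments, so I cut it down to crawl only user info and relationships.
My main changes are:
- Remove the parts that crawl comments and posts
- Read the starting ids from a JSON file (so the crawl range can be set dynamically)
- Add a new spider that crawls only user info, for efficiency (the original spider is dedicated to crawling user relationships)
The overall project structure: scrapy is the crawling framework, with two spiders. weibo_spider.py crawls user relationships and weibo_userinfo.py crawls user profiles. The crawled data is stored in MongoDB. A separate program, spider_control.py, decides which ids to crawl and writes them to start_uids.json; both spiders read their crawl range from that JSON file.
The code of weibo_spider.py, which crawls user relationships, is as follows: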
#!/usr/bin/env python
# encoding: utf-8
import json
import re
import time

from scrapy import Spider
from scrapy.crawler import CrawlerProcess
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.utils.project import get_project_settings

from sina.items import InformationItem, RelationshipsItem


class WeiboSpider(Spider):
    name = "weibo_spider"
    base_url = "https://weibo.cn"

    def start_requests(self):
        # Read the uids to crawl from the JSON file written by spider_control.py
        with open('start_uids.json', 'r') as f:
            data = json.load(f)
        start_uids = data["start_uids"]
        for uid in start_uids:
            yield Request(url="https://weibo.cn/%s/info" % uid, callback=self.parse_information)

    def parse_information(self, response):
        """Crawl the profile page of one user."""
        information_item = InformationItem()
        information_item['crawl_time'] = int(time.time())
        selector = Selector(response)
        information_item['_id'] = re.findall(r'(\d+)/info', response.url)[0]
        # Join all text nodes of the profile blocks into one string
        text1 = ";".join(selector.xpath('body/div[@class="c"]//text()').extract())
        nick_name = re.findall('暱稱;?[::]?(.*?);', text1)
        gender = re.findall('性別;?[::]?(.*?);', text1)
        place = re.findall('地區;?[::]?(.*?);', text1)
        briefIntroduction = re.findall('簡介;?[::]?(.*?);', text1)
        birthday = re.findall('生日;?[::]?(.*?);', text1)
        sex_orientation = re.findall('性取向;?[::]?(.*?);', text1)
        sentiment = re.findall('感情狀況;?[::]?(.*?);', text1)
        vip_level = re.findall('會員等級;?[::]?(.*?);', text1)
        authentication = re.findall('認證;?[::]?(.*?);', text1)
        labels = re.findall('標籤;?[::]?(.*?)更多>>', text1)
        if nick_name and nick_name[0]:
            information_item["nick_name"] = nick_name[0].replace(u"\xa0", "")
        if gender and gender[0]:
            information_item["gender"] = gender[0].replace(u"\xa0", "")
        if place and place[0]:
            place = place[0].replace(u"\xa0", "").split(" ")
            information_item["province"] = place[0]
            if len(place) > 1:
                information_item["city"] = place[1]
        if briefIntroduction and briefIntroduction[0]:
            information_item["brief_introduction"] = briefIntroduction[0].replace(u"\xa0", "")
        if birthday and birthday[0]:
            information_item['birthday'] = birthday[0]
        if sex_orientation and sex_orientation[0]:
            # Guard against a missing gender field before comparing
            if gender and sex_orientation[0].replace(u"\xa0", "") == gender[0]:
                information_item["sex_orientation"] = "同性戀"
            else:
                information_item["sex_orientation"] = "異性戀"
        if sentiment and sentiment[0]:
            information_item["sentiment"] = sentiment[0].replace(u"\xa0", "")
        if vip_level and vip_level[0]:
            information_item["vip_level"] = vip_level[0].replace(u"\xa0", "")
        if authentication and authentication[0]:
            information_item["authentication"] = authentication[0].replace(u"\xa0", "")
        if labels and labels[0]:
            information_item["labels"] = labels[0].replace(u"\xa0", ",").replace(';', '').strip(',')
        # Pass the partly filled item on to the home page request
        request_meta = response.meta
        request_meta['item'] = information_item
        yield Request(self.base_url + '/u/{}'.format(information_item['_id']),
                      callback=self.parse_further_information,
                      meta=request_meta, dont_filter=True, priority=1)

    def parse_further_information(self, response):
        """Read the post, follow and fan counts, then queue the relationship pages."""
        text = response.text
        information_item = response.meta['item']
        tweets_num = re.findall(r'微博\[(\d+)\]', text)
        if tweets_num:
            information_item['tweets_num'] = int(tweets_num[0])
        follows_num = re.findall(r'關注\[(\d+)\]', text)
        if follows_num:
            information_item['follows_num'] = int(follows_num[0])
        fans_num = re.findall(r'粉絲\[(\d+)\]', text)
        if fans_num:
            information_item['fans_num'] = int(fans_num[0])
        request_meta = response.meta
        request_meta['item'] = information_item
        yield information_item

        # Fetch the follow list
        yield Request(url=self.base_url + '/{}/follow?page=1'.format(information_item['_id']),
                      callback=self.parse_follow,
                      meta=request_meta,
                      dont_filter=True)
        # Fetch the fan list
        yield Request(url=self.base_url + '/{}/fans?page=1'.format(information_item['_id']),
                      callback=self.parse_fans,
                      meta=request_meta,
                      dont_filter=True)

    def parse_follow(self, response):
        """Crawl the follow list."""
        # On page 1, queue all the remaining pages at once
        if response.url.endswith('page=1'):
            all_page = re.search(r'/> 1/(\d+)頁</div>', response.text)
            if all_page:
                all_page = all_page.group(1)
                all_page = int(all_page)
                for page_num in range(2, all_page + 1):
                    page_url = response.url.replace('page=1', 'page={}'.format(page_num))
                    yield Request(page_url, self.parse_follow, dont_filter=True, meta=response.meta)
        selector = Selector(response)
        id_username_pair = re.findall(r'<a href="https://weibo.cn/u/(\d+)">(.{0,30})</a>', response.text)
        # Earlier extraction via the follow/unfollow links, kept for reference:
        # urls = selector.xpath('//a[text()="關注他" or text()="關注她" or text()="取消關注"]/@href').extract()
        # uids = re.findall('uid=(\d+)', ";".join(urls), re.S)
        uids = [item[0] for item in id_username_pair]
        followed_names = [item[1] for item in id_username_pair]
        ID = re.findall(r'(\d+)/follow', response.url)[0]
        for uid, followed_name in zip(uids, followed_names):
            relationships_item = RelationshipsItem()
            relationships_item['crawl_time'] = int(time.time())
            relationships_item["fan_id"] = ID
            relationships_item["followed_id"] = uid
            relationships_item["_id"] = ID + '-' + uid
            # .get avoids a KeyError when the profile page had no nickname
            relationships_item['fan_name'] = response.meta['item'].get('nick_name')
            relationships_item['followed_name'] = followed_name
            yield relationships_item

    def parse_fans(self, response):
        """Crawl the fan list."""
        # On page 1, queue all the remaining pages at once
        if response.url.endswith('page=1'):
            all_page = re.search(r'/> 1/(\d+)頁</div>', response.text)
            if all_page:
                all_page = all_page.group(1)
                all_page = int(all_page)
                for page_num in range(2, all_page + 1):
                    page_url = response.url.replace('page=1', 'page={}'.format(page_num))
                    yield Request(page_url, self.parse_fans, dont_filter=True, meta=response.meta)
        selector = Selector(response)
        id_username_pair = re.findall(r'<a href="https://weibo.cn/u/(\d+)">(.{0,30})</a>', response.text)
        # Earlier extraction via the follow/unfollow links, kept for reference:
        # urls = selector.xpath('//a[text()="關注他" or text()="關注她" or text()="取消關注"]/@href').extract()
        # uids = re.findall('uid=(\d+)', ";".join(urls), re.S)
        uids = [item[0] for item in id_username_pair]
        fan_names = [item[1] for item in id_username_pair]
        ID = re.findall(r'(\d+)/fans', response.url)[0]
        for uid, fan_name in zip(uids, fan_names):
            relationships_item = RelationshipsItem()
            relationships_item['crawl_time'] = int(time.time())
            relationships_item["fan_id"] = uid
            relationships_item["followed_id"] = ID
            relationships_item["_id"] = uid + '-' + ID
            relationships_item['fan_name'] = fan_name
            # .get avoids a KeyError when the profile page had no nickname
            relationships_item['followed_name'] = response.meta['item'].get('nick_name')
            yield relationships_item


if __name__ == "__main__":
    process = CrawlerProcess(get_project_settings())
    process.crawl('weibo_spider')
    process.start()
The code of weibo_userinfo.py, which crawls only user profiles, is as follows:
# -*- coding: utf-8 -*-
import json
import re
import time

from scrapy import Spider
from scrapy.crawler import CrawlerProcess
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.utils.project import get_project_settings

from sina.items import InformationItem


class WeiboUserinfoSpider(Spider):
    name = "weibo_userinfo"
    base_url = "https://weibo.cn"

    def start_requests(self):
        # Read the uids to crawl from the JSON file written by spider_control.py
        with open('start_uids.json', 'r') as f:
            data = json.load(f)
        start_uids = data["start_uids"]
        # start_uids = [
        #     '6505979820',  # me
        #     # '1699432410'  # Xinhua News Agency
        # ]
        for uid in start_uids:
            yield Request(url="https://weibo.cn/%s/info" % uid, callback=self.parse_information)

    def parse_information(self, response):
        """Crawl the profile page of one user."""
        information_item = InformationItem()
        information_item['crawl_time'] = int(time.time())
        selector = Selector(response)
        information_item['_id'] = re.findall(r'(\d+)/info', response.url)[0]
        # Join all text nodes of the profile blocks into one string
        text1 = ";".join(selector.xpath('body/div[@class="c"]//text()').extract())
        nick_name = re.findall('暱稱;?[::]?(.*?);', text1)
        gender = re.findall('性別;?[::]?(.*?);', text1)
        place = re.findall('地區;?[::]?(.*?);', text1)
        briefIntroduction = re.findall('簡介;?[::]?(.*?);', text1)
        birthday = re.findall('生日;?[::]?(.*?);', text1)
        sex_orientation = re.findall('性取向;?[::]?(.*?);', text1)
        sentiment = re.findall('感情狀況;?[::]?(.*?);', text1)
        vip_level = re.findall('會員等級;?[::]?(.*?);', text1)
        authentication = re.findall('認證;?[::]?(.*?);', text1)
        labels = re.findall('標籤;?[::]?(.*?)更多>>', text1)
        if nick_name and nick_name[0]:
            information_item["nick_name"] = nick_name[0].replace(u"\xa0", "")
        if gender and gender[0]:
            information_item["gender"] = gender[0].replace(u"\xa0", "")
        if place and place[0]:
            place = place[0].replace(u"\xa0", "").split(" ")
            information_item["province"] = place[0]
            if len(place) > 1:
                information_item["city"] = place[1]
        if briefIntroduction and briefIntroduction[0]:
            information_item["brief_introduction"] = briefIntroduction[0].replace(u"\xa0", "")
        if birthday and birthday[0]:
            information_item['birthday'] = birthday[0]
        if sex_orientation and sex_orientation[0]:
            # Guard against a missing gender field before comparing
            if gender and sex_orientation[0].replace(u"\xa0", "") == gender[0]:
                information_item["sex_orientation"] = "同性戀"
            else:
                information_item["sex_orientation"] = "異性戀"
        if sentiment and sentiment[0]:
            information_item["sentiment"] = sentiment[0].replace(u"\xa0", "")
        if vip_level and vip_level[0]:
            information_item["vip_level"] = vip_level[0].replace(u"\xa0", "")
        if authentication and authentication[0]:
            information_item["authentication"] = authentication[0].replace(u"\xa0", "")
        if labels and labels[0]:
            information_item["labels"] = labels[0].replace(u"\xa0", ",").replace(';', '').strip(',')
        # Pass the partly filled item on to the home page request
        request_meta = response.meta
        request_meta['item'] = information_item
        yield Request(self.base_url + '/u/{}'.format(information_item['_id']),
                      callback=self.parse_further_information,
                      meta=request_meta, dont_filter=True, priority=1)

    def parse_further_information(self, response):
        """Read the post, follow and fan counts and emit the finished item."""
        text = response.text
        information_item = response.meta['item']
        tweets_num = re.findall(r'微博\[(\d+)\]', text)
        if tweets_num:
            information_item['tweets_num'] = int(tweets_num[0])
        follows_num = re.findall(r'關注\[(\d+)\]', text)
        if follows_num:
            information_item['follows_num'] = int(follows_num[0])
        fans_num = re.findall(r'粉絲\[(\d+)\]', text)
        if fans_num:
            information_item['fans_num'] = int(fans_num[0])
        request_meta = response.meta
        request_meta['item'] = information_item
        yield information_item


if __name__ == "__main__":
    process = CrawlerProcess(get_project_settings())
    process.crawl('weibo_userinfo')
    process.start()
In addition, a separate program (spider_control.py) specifies the crawl range:
import os
import json
import pymongo


class SpiderControl:
    def __init__(self, init_uid):
        self.init_uid = init_uid
        # Start with only the central account in start_uids.json
        self.set_uid([self.init_uid])
        self.dbclient = pymongo.MongoClient(host='localhost', port=27017)
        self.db = self.dbclient['Sina']
        self.info_collection = self.db['Information']
        self.relation_collection = self.db['Relationships']
        self.label_collection = self.db['Label']

    def set_uid(self, uid_list):
        """Write the ids that the spiders should crawl next."""
        uid_data = {'start_uids': uid_list}
        with open('start_uids.json', 'w') as f:
            json.dump(uid_data, f)


def add_filter(list1, list2):
    """Merge two id lists and drop duplicates."""
    return list(set(list1 + list2))


if __name__ == '__main__':
    init_id = '6505979820'
    sc = SpiderControl(init_id)
    try:
        # insert_one replaces the deprecated Collection.insert
        sc.label_collection.insert_one({'_id': init_id, 'label': 'center'})
    except pymongo.errors.DuplicateKeyError:
        print('Record already exists')

    # Manual step: run weibo_spider here for the first crawl of fans and follows
    info_filter = [init_id]  # ids whose profile has already been crawled
    relation_filter = [init_id]  # ids whose relationships are crawled or should be skipped
    condition_fan0 = {'followed_id': init_id}  # fans of the central account
    condition_follow0 = {'fan_id': init_id}  # accounts the central account follows
    fan0_id = [r['fan_id'] for r in sc.relation_collection.find(condition_fan0)]
    follow0_id = [r['followed_id'] for r in sc.relation_collection.find(condition_follow0)]
    # Label the level-0 accounts; upsert inserts the label if missing, otherwise updates it
    for uid in follow0_id:
        sc.label_collection.update_one({'_id': uid}, {'$set': {'label': 'follow0'}}, upsert=True)
    for uid in fan0_id:
        sc.label_collection.update_one({'_id': uid}, {'$set': {'label': 'fan0'}}, upsert=True)
    fans_follows0 = list(set(fan0_id) | set(follow0_id))  # union of follows and fans

    # Manual step: run weibo_userinfo to get the profiles of these fans and follows
    info_filter = add_filter(info_filter, fans_follows0)
    condition_famous = {'fans_num': {'$gt': 500}}  # accounts with more than 500 fans count as big-V
    famous_id0 = [r['_id'] for r in sc.info_collection.find(condition_famous)]
    relation_filter = add_filter(relation_filter, famous_id0)
    uid_set = list(set(fans_follows0) - set(relation_filter))  # level-0 friends minus big-V accounts
    print('number of ids: {}'.format(len(uid_set)))
    sc.set_uid(uid_set)

    # Manual step: run weibo_spider to get the fan/follow network of these people
    condition_fan1 = {'followed_id': {'$in': fans_follows0}}  # fans of level-0 friends
    condition_follow1 = {'fan_id': {'$in': fans_follows0}}  # accounts followed by level-0 friends
    fan1_id = [r['fan_id'] for r in sc.relation_collection.find(condition_fan1)]
    follow1_id = [r['followed_id'] for r in sc.relation_collection.find(condition_follow1)]
    fan1_id = list(set(fan1_id) - set(fans_follows0))  # remove level-0 friends
    follow1_id = list(set(follow1_id) - set(fans_follows0))  # remove level-0 friends
    # Label the level-1 accounts; upsert inserts the label if missing, otherwise updates it
    for uid in follow1_id:
        sc.label_collection.update_one({'_id': uid}, {'$set': {'label': 'follow1'}}, upsert=True)
    for uid in fan1_id:
        sc.label_collection.update_one({'_id': uid}, {'$set': {'label': 'fan1'}}, upsert=True)
    fans_follows1 = list(set(fan1_id) & set(follow1_id))  # intersection of follows and fans
    uid_set = list(set(fans_follows1) - set(info_filter))
    print('number of ids: {}'.format(len(uid_set)))
    sc.set_uid(uid_set)

    # Manual step: run weibo_userinfo
    info_filter = add_filter(info_filter, fans_follows1)
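Results
In the first crawl, after filtering out big-V accounts (more than 500 fans), 60 accounts remained. Running the relationship crawl again on these 60, they have about 3,800 fans and about 4,300 followed accounts in total, roughly 6,400 accounts in the union. I started by analyzing these.
The visualization uses pyecharts.
First, a plot:
??!! Whoa, what is this dense mess? The browser nearly froze.
Quickly shrink the data: taking the intersection of follows and fans cuts it down to 800.
Much better this time. The different colors stand for different categories (my own classification).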
The colors correspond to:
- Me (red)
- My follows
- My fans
- Fans and follows of my friends (my follows + my fans)
- Big-V accounts
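As a rough sketch of how these categories could be turned into graph data, the snippet below builds plain lists of dicts in the ECharts-style shape (`name`/`category` for nodes, `source`/`target` for links). I believe lists like these can be handed to pyecharts' `Graph`, but that depends on the pyecharts version, so treat it as an assumption. The function name `build_graph_data` and the category names `level1` and `big_v` are made up here; `center`, `follow0` and `fan0` follow the labels used in the control script.

```python
# Category order fixes the index stored on each node
CATEGORIES = ["center", "follow0", "fan0", "level1", "big_v"]


def build_graph_data(labels, edges):
    """labels: uid -> category name; edges: (fan_id, followed_id) pairs."""
    nodes = [
        {"name": uid, "category": CATEGORIES.index(label)}
        for uid, label in labels.items()
    ]
    known = set(labels)
    # Drop links whose endpoints are not among the labelled nodes
    links = [
        {"source": fan_id, "target": followed_id}
        for fan_id, followed_id in edges
        if fan_id in known and followed_id in known
    ]
    categories = [{"name": name} for name in CATEGORIES]
    return nodes, links, categories


# Toy data (hypothetical)
labels = {"me": "center", "a": "follow0", "v": "big_v"}
edges = [("me", "a"), ("a", "x"), ("me", "v")]
nodes, links, categories = build_graph_data(labels, edges)
print(len(nodes), len(links), len(categories))
```

Filtering the links to labelled nodes is also one simple way to keep the plot small enough for the browser.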
By looking for accounts that have many links to my friends, I can find the hidden friends.
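I did this by eye on the plot, but the same idea can be written as a small ranking. The sketch below is an assumption about how one might do it, not code from the project: `rank_candidates` is a made-up name, edges are `(fan_id, followed_id)` pairs, and a candidate's score is the number of distinct level-0 friends it is linked to in either direction.

```python
from collections import defaultdict


def rank_candidates(edges, me, top_n=10):
    """Rank non-friend accounts by how many of my friends they are linked to."""
    # Level-0 friends: everyone I follow plus everyone who follows me
    friends = set()
    for fan_id, followed_id in edges:
        if fan_id == me:
            friends.add(followed_id)
        elif followed_id == me:
            friends.add(fan_id)
    # For every other account, collect the distinct friends it touches
    linked = defaultdict(set)
    for fan_id, followed_id in edges:
        for candidate, other in ((fan_id, followed_id), (followed_id, fan_id)):
            if other in friends and candidate not in friends and candidate != me:
                linked[candidate].add(other)
    # Most linked first; ties broken by uid for a stable order
    ranked = sorted(linked.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(uid, len(found)) for uid, found in ranked[:top_n]]


# Toy data (hypothetical)
edges = [("me", "a"), ("b", "me"), ("c", "me"),
         ("x", "a"), ("x", "b"), ("b", "x"), ("y", "a"), ("c", "x")]
print(rank_candidates(edges, "me"))  # [('x', 3), ('y', 1)]
```

Counting distinct friends rather than raw links keeps a mutual follow from being counted twice.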
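For example:
The selected node here is qqd, which clearly has a lot of links to my friends. The accounts connected to these bubbles are very likely my classmates, for example this one:
Sure enough, this turned up quite a few of my classmates, hahaha.
The code is a real mess, but it runs (doge, please go easy on me). I put it on GitHub: https://github.com/buaalzm/WeiboSpider/tree/relation_spider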