I had been toying with the idea of writing a crawler for a while but never got around to it last semester; in the end it took about a day to put this together.
The program itself was debugged within half a day; the trouble started when saving the data into a MySQL database.
Chinese text encoding in Python is genuinely painful, and on top of that some users' bios contain special symbols such as ® or smiley faces, so I was stuck there for quite a while. It worked out in the end (in effect, I simply filtered those special symbols out).
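The underlying issue is that MySQL's legacy `utf8` charset stores at most three bytes per character, so four-byte UTF-8 characters (most emoji and other symbols outside the Basic Multilingual Plane) make inserts fail. As a minimal sketch (written for Python 3; the code below instead round-trips through gbk, but the goal is the same), one way to drop such characters is:

```python
import re

# Characters above U+FFFF need 4 bytes in UTF-8, which MySQL's legacy
# "utf8" charset cannot store; strip them before inserting.
_NON_BMP = re.compile('[\U00010000-\U0010FFFF]')

def strip_non_bmp(text):
    """Drop characters outside the BMP; BMP symbols like ® survive."""
    return _NON_BMP.sub('', text)

print(strip_non_bmp('hi \U0001F600 there'))  # emoji removed: 'hi  there'
print(strip_non_bmp('caf\u00e9 \u00ae'))     # unchanged: 'café ®'
```

Upgrading the table to the `utf8mb4` charset would avoid the filtering altogether, at the cost of a schema change.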
As for throughput: at first it collected roughly 14,000 Weibo user profiles per hour, but since I harvest from each user's follow list, the crawler soon keeps hitting users it has already crawled, so overall efficiency is not great. No wonder that "中國爬盟" project mobilizes crowds of volunteers to do the crawling.
I was also somewhat worried the Weibo account would get banned if I crawled for too long, so I did not push it: in the end I collected 50,000 user records and 80,000 relation records. I have no particular use for the data at the moment, so that is where it stands.
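The "keeps hitting users already crawled" problem is the classic frontier problem in graph crawling; the setup.ini pointers below implement exactly this queue on disk. A tiny self-contained sketch of the same idea, breadth-first expansion over a follow graph with a visited set (the uids and graph here are made up for illustration):

```python
from collections import deque

def bfs_crawl(seed_uids, get_follows, limit=100):
    """Breadth-first expansion over the follow graph, skipping seen uids."""
    seen = set(seed_uids)
    queue = deque(seed_uids)
    order = []
    while queue and len(order) < limit:
        uid = queue.popleft()
        order.append(uid)
        for f in get_follows(uid):
            if f not in seen:  # the dedup step that saves re-crawling
                seen.add(f)
                queue.append(f)
    return order

# Toy follow graph with hypothetical uids:
graph = {'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['d'], 'd': []}
print(bfs_crawl(['a'], graph.get))  # ['a', 'b', 'c', 'd']
```

The dedup only prevents re-enqueueing, not the wasted page fetches for users whose follow lists overlap heavily, which is why throughput still degrades as the crawl widens.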
I have not been using Python for long and the code has some redundancy; the core is really just three functions: save_user(), creepy_myself(), and creepy_others().
See the comments in the code for details. The download link is the same as below. (The redundancy exists because each user's follow count has to be scraped first to work out how many pages there are.)
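Turning a follow count into a page count is a ceiling division; plain integer division rounds down and silently drops a final partial page. A small sketch (30 per page matches the myfollow pages; the follow pages use 20):

```python
def page_count(total, per_page=30):
    # Ceiling division: 61 items at 30 per page span 3 pages, not 2.
    return (total + per_page - 1) // per_page

print(page_count(60))  # 2
print(page_count(61))  # 3
```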
#coding=utf8
import urllib2
import re
from BeautifulSoup import *
import MySQLdb
import sys

"""
Log in to Sina Weibo with a cookie.
setdefaultencoding is used to cope with Chinese text encoding.
"""
reload(sys)
sys.setdefaultencoding('utf8')

COOKIE = 'your cookie'
HEADERS = {'cookie': COOKIE}
UID = COOKIE[COOKIE.find('uid')+4:COOKIE.find('uid')+14]

'''
Try to connect to the database where the data will be saved.
'''
try:
    conn = MySQLdb.connect(host='127.0.0.1', user='root', passwd='root',
                           db='weibodata', port=3309, charset='utf8',
                           use_unicode=False)
    cur = conn.cursor()
except MySQLdb.Error, e:
    print "Mysql Error %d: %s" % (e.args[0], e.args[1])
def save_user(uuid, uid, name, common):
    '''
    save_user(uuid, uid, name, common)
    Saves one relation and one user record: uuid -> uid means uuid follows uid.
    uid, name and common are the user fields to store.
    setup.ini holds two numbers:
      the first, now, is the id assigned to the most recently stored user;
      the second, point, is the id of the user currently being scanned.
    Think of them as the two pointers of a queue.
    '''
    fileHandle = open('setup.ini', 'r+')
    now = int(fileHandle.readline()) + 1
    point = int(fileHandle.readline())
    print now
    #print uuid, uid, name, common
    # Save the relation, unless it is already recorded.
    count = cur.execute('select * from relations where uid1=\'' + str(uuid) +
                        '\' and uid2=\'' + str(uid) + '\'')
    if count == 0:
        cur.execute('insert into relations(uid1,uid2)values(\'' +
                    str(uuid) + '\',\'' + str(uid) + '\')')
        conn.commit()
    count = cur.execute('select * from users where uid=\'' + str(uid) + '\'')
    # Save the user record, unless it already exists.
    if count == 0:
        # Round-trip through gbk to drop symbols MySQL's utf8 cannot store.
        cs = common.encode('gbk', 'ignore').decode('gbk', 'ignore').encode('utf-8', 'ignore')
        #print cs
        cur.execute('insert into users(id,uid,name,common)values(\'' +
                    str(now) + '\',\'' + str(uid) + '\',\'' + str(name) + '\',\"' +
                    cs + '\")')
        conn.commit()
    fileHandle.close()
    fileHandle = open('setup.ini', 'w')
    fileHandle.write(str(now) + '\n' + str(point))
    fileHandle.close()
def creepy_myself():
    '''
    Scans your own follow list.
    The queue needs a seed, so call this once on first use to enqueue
    some users before expanding outwards.
    '''
    uid = COOKIE[COOKIE.find('uid')+4:COOKIE.find('uid')+14]
    url = 'http://weibo.com/' + str(uid) + '/myfollow?t=1&page=1'
    mainurl = 'http://weibo.com/' + str(uid) + '/myfollow?t=1&page='
    req = urllib2.Request(url, headers=HEADERS)
    text = urllib2.urlopen(req).read()
    mainSoup = BeautifulSoup(text)
    # The follow count sits in parentheses inside the 'lev2' div.
    strs = str(mainSoup.find('div', 'lev2'))
    num = int(strs[strs.find('(')+1:strs.find(')')])
    lines = text.splitlines()
    for line in lines:
        if line.startswith('<script>STK && STK.pageletM && STK.pageletM.view({"pid":"pl_relation_myf'):
            n = line.find('html":"')
            if n > 0:
                j = line[n + 7: -12].replace("\\", "")
                soup = BeautifulSoup(j)
                follows = soup.findAll('div', 'myfollow_list S_line2 SW_fun')
                for follow in follows:
                    namess = follow.find('ul', 'info').find('a')['title']
                    temp_str = str(follow)
                    uiddd = temp_str[temp_str.find('uid')+4:temp_str.find('&')]
                    save_user(UID, uiddd, namess,
                              follow.find('div', 'intro S_txt2').contents[0][6:])
    # Remaining pages; ceiling division so a final partial page is not skipped.
    for i in range(2, (num + 29) / 30 + 1):
        url = mainurl + str(i)
        req = urllib2.Request(url, headers=HEADERS)
        text = urllib2.urlopen(req).read()
        lines = text.splitlines()
        for line in lines:
            if line.startswith('<script>STK && STK.pageletM && STK.pageletM.view({"pid":"pl_relation_myf'):
                n = line.find('html":"')
                if n > 0:
                    j = line[n + 7: -12].replace("\\", "")
                    soup = BeautifulSoup(j)
                    follows = soup.findAll('div', 'myfollow_list S_line2 SW_fun')
                    for follow in follows:
                        namess = follow.find('ul', 'info').find('a')['title']
                        temp_str = str(follow)
                        uiddd = temp_str[temp_str.find('uid')+4:temp_str.find('&')]
                        save_user(UID, uiddd, namess,
                                  follow.find('div', 'intro S_txt2').contents[0][6:])
def creepy_others(uid):
    '''
    Scans the follow list of the user with the specified uid.
    As above, the code is redundant: the user's follow count has to be
    read first to work out how many pages of data there are.
    '''
    url = "http://weibo.com/" + str(uid) + "/follow?page="
    req = urllib2.Request(url, headers=HEADERS)
    text = urllib2.urlopen(req).read()
    mainSoup = BeautifulSoup(text.strip())
    lines = text.splitlines()
    num = 1
    for line in lines:
        if line.startswith('<script>STK && STK.pageletM && STK.pageletM.view({"pid":"pl_relation_hisFollow'):
            n = line.find('html":"')
            if n > 0:
                j = line[n + 7: -12].replace("\\n", "")
                j = j.replace("\\t", "")
                j = j.replace("\\", "")
                soup = BeautifulSoup(j)
                # The follow count sits between '關注了' and '人' in the title div.
                strs = str(soup.find('div', 'patch_title'))
                num = int(strs[strs.find('關注了')+9:strs.find('人</div')])
                follows = soup.findAll('li', 'clearfix S_line1')
                for follow in follows:
                    temp_str = str(follow)
                    #print temp_str
                    temp_uid = temp_str[temp_str.find('uid'):temp_str.find('&')]
                    temp_soup = BeautifulSoup(temp_str)
                    temp_fnick = temp_soup.find('div').find('a')['title']
                    save_user(uid, temp_uid[4:], temp_fnick,
                              str(temp_soup.find('div', 'info'))[18:-6])
    #print num/20+2
    # Remaining pages; ceiling division so a final partial page is not skipped.
    for i in range(2, (num + 19) / 20 + 1):
        urls = "http://weibo.com/" + str(uid) + "/follow?page=" + str(i)
        req = urllib2.Request(urls, headers=HEADERS)
        text = urllib2.urlopen(req).read()
        lines = text.splitlines()
        for line in lines:
            if line.startswith('<script>STK && STK.pageletM && STK.pageletM.view({"pid":"pl_relation_hisFollow'):
                n = line.find('html":"')
                if n > 0:
                    j = line[n + 7: -12].replace("\\n", "")
                    j = j.replace("\\t", "")
                    j = j.replace("\\", "")
                    soup = BeautifulSoup(j)
                    strs = str(soup.find('div', 'patch_title'))
                    num = int(strs[strs.find('關注了')+9:strs.find('人</div')])
                    follows = soup.findAll('li', 'clearfix S_line1')
                    for follow in follows:
                        temp_str = str(follow)
                        #print temp_str
                        temp_uid = temp_str[temp_str.find('uid'):temp_str.find('&')]
                        temp_soup = BeautifulSoup(temp_str)
                        temp_fnick = temp_soup.find('div').find('a')['title']
                        save_user(uid, temp_uid[4:], temp_fnick,
                                  str(temp_soup.find('div', 'info'))[18:-6])
if __name__ == '__main__':
    #save_user('123','123','ads','212332231')
    #creepy_myself()
    '''
    Even with careful handling of the Chinese encoding, problems still
    surface every so often, so all exceptions are swallowed to keep the
    program from dying.
    '''
    while 1:
        '''
        First advance the tail pointer of the queue, point,
        then look up the corresponding uid in the database
        and expand it with creepy_others(uuid).
        '''
        fileHandle = open('setup.ini', 'r+')
        now = int(fileHandle.readline())
        point = int(fileHandle.readline()) + 1
        fileHandle.close()
        fileHandle = open('setup.ini', 'w')
        fileHandle.write(str(now) + '\n' + str(point))
        fileHandle.close()
        cur.execute('select uid from users where id=\'' + str(point) + '\'')
        uuid = cur.fetchone()[0]
        if len(uuid) == 10:
            try:
                creepy_others(uuid)
            except Exception, e:
                pass
    cur.close()
    conn.close()
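A closing note on the SQL: building statements by string concatenation is what made the quoting and special-symbol handling fragile in the first place, and it is also open to SQL injection from bio text. A self-contained sketch of the same insert using parameterized queries (demonstrated with sqlite3 so it runs anywhere; MySQLdb works the same way except its placeholder is %s, and the table layout here mirrors the users table above):

```python
import sqlite3

# In-memory database with the same columns as the users table above.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('create table users (id integer, uid text, name text, common text)')

def save_user_row(cur, now, uid, name, common):
    # The driver handles quoting and escaping; no manual \' or \" needed.
    cur.execute('insert into users (id, uid, name, common) values (?, ?, ?, ?)',
                (now, uid, name, common))

save_user_row(cur, 1, '2421424850', 'alice', 'bio with "quotes" and \u00ae')
cur.execute('select name from users where uid = ?', ('2421424850',))
print(cur.fetchone()[0])  # alice
```

With placeholders, bios containing quotes, backslashes or ® go in verbatim, so the only remaining concern is the charset of the target column.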