Automatically Fetching CSDN Blog Favorites with a Python Crawler

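I noticed CSDN's Python creative programming event on the day it opened, but I considered myself a beginner and planned to just watch from the sidelines. Then I saw that many of the entries were about crawlers, and an interest in crawlers is exactly why I taught myself Python in the first place. So I decided to take part after all.

I use CSDN a lot and have bookmarked many good articles, but I never organized my favorites, so finding something I saved earlier is a hassle. After looking into it, I decided to let a simple Python crawler, written with what I had taught myself, do the work.

On the home page, the favorites list turns out to be loaded asynchronously: you have to click "Show more" each time to see the next batch.

Luckily I know a little about asynchronous loading, so I pressed F12 to investigate. After loading some more of the content:

Double-click the link listed under Name to open it:

At first glance the response looks like gibberish. It is actually Unicode-escaped text (\uXXXX sequences), and turning it back into Chinese is easy; it can be done right from the command line.

There is also an even more convenient option, a handy web page: http://tool.chinaz.com/tools/unicode.aspx

It converts the escapes online: paste in the first record from that response, click the Unicode-to-Chinese button, and the Chinese text appears.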
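The same conversion can be done in a few lines of Python. This is only a minimal Python 3 sketch; the helper name `decode_unicode_escapes` is my own, and it assumes the input contains nothing but ASCII characters and \uXXXX escapes, as the response does.

```python
def decode_unicode_escapes(text):
    """Turn literal \\uXXXX sequences into the characters they stand for."""
    # encode() gives the ASCII bytes; the unicode_escape codec interprets the escapes
    return text.encode("ascii").decode("unicode_escape")


# Raw string, so Python keeps the backslashes exactly as the server sent them
raw = r"\u535a\u5ba2\u9891\u9053"
print(decode_unicode_escapes(raw))  # prints 博客频道
```

If the text already contains non-ASCII characters, `encode("ascii")` raises an error, so this helper is only meant for fully escaped input.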

Now let's take a closer look at that link:

http://my.csdn.net/my/favorite/get_favorite_list?pageno=2&pagesize=10&username=hurmishine

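Testing shows that pageno selects which page to start from and pagesize is the number of records returned per page.

So we can fetch everything by changing these parameters.

The link I settled on is:

http://my.csdn.net/my/favorite/get_favorite_list?pageno=0&pagesize=10000&username=hurmishine

Set pagesize generously; if you have fewer favorites than that, the response simply contains as many as there are.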
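To avoid editing the query string by hand, the link can be assembled from its three parameters. A small Python 3 sketch follows; the function name `build_favorites_url` and its defaults are my own choices, while the endpoint and parameter names are the ones from the link above.

```python
from urllib.parse import urlencode

BASE_URL = "http://my.csdn.net/my/favorite/get_favorite_list"


def build_favorites_url(username, pageno=0, pagesize=10000):
    """Build the favorites-list link for one user."""
    # urlencode keeps the dict's insertion order and escapes the values
    query = urlencode({"pageno": pageno, "pagesize": pagesize, "username": username})
    return BASE_URL + "?" + query


print(build_favorites_url("hurmishine"))
```

Calling it with `pageno=2, pagesize=10` reproduces the link that the browser requested originally.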

Each record looks like this:

{"id":"12653825","username":"hurmishine","url":"http:\/\/blog.csdn.net\/marksinoberg\/article\/details\/70946107","domain":"blog.csdn.net","title":"CSDN \u535a\u5ba2\u5907\u4efd\u5de5\u5177 - \u66f4\u4e0a\u4e00\u5c42\u697c\uff01 - \u535a\u5ba2\u9891\u9053 - CSDN.NET","description":"","share":"1","dateline":"1493451002","map_name":""},

Only url and title are useful to me, and a regular expression is enough to pull them out.
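Here is how that extraction might look on the sample record, as a Python 3 sketch. The regular expression is the one used in the full script; decoding each captured value with `json.loads` is my own substitution, chosen because the values are JSON string literals, so it handles both the `\/` and the \uXXXX escapes in one step. The name `extract_links` is hypothetical.

```python
import json
import re

# Same pattern as in the full script: capture the url, then the title that follows it
PAIR_RE = re.compile(r'"url":"(.*?)",.*?"title":"(.*?)"')


def extract_links(text):
    """Return a list of (url, title) tuples with JSON escapes decoded."""
    links = []
    for raw_url, raw_title in PAIR_RE.findall(text):
        # Re-wrap each captured value in quotes and let the JSON parser unescape it
        url = json.loads('"' + raw_url + '"')
        title = json.loads('"' + raw_title + '"')
        links.append((url, title))
    return links


# The record shown above, kept as a raw string so the backslashes survive
sample = r'{"id":"12653825","username":"hurmishine","url":"http:\/\/blog.csdn.net\/marksinoberg\/article\/details\/70946107","domain":"blog.csdn.net","title":"CSDN \u535a\u5ba2\u5907\u4efd\u5de5\u5177 - \u66f4\u4e0a\u4e00\u5c42\u697c\uff01 - \u535a\u5ba2\u9891\u9053 - CSDN.NET","description":"","share":"1","dateline":"1493451002","map_name":""},'

for url, title in extract_links(sample):
    print(url, title)
```

One caveat: the pattern stops at the first double quote, so a title containing an escaped quote (`\"`) would be cut short. Parsing the whole response as JSON would avoid that, but I have not checked the exact shape of the full response.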

The complete code is below; the details are left for you to work through:

# coding: utf-8
# Python 2 script: turn a hand-saved copy of the favorites response into index.html
import re

def saveByText():
    # html.html is the response saved to a local file
    with open("html.html") as f:
        html = f.read()
    # Each record contains fields such as:
    # "url":"http:\/\/blog.csdn.net\/zhangweiguo_717\/article\/details\/52716677",
    # "title":"Python\u6a21\u62df\u767b\u5f55CSDN - \u535a\u5ba2\u9891\u9053 - CSDN.NET",
    # Capture (url, title) pairs in a single pass
    links = re.findall(r'"url":"(.*?)",.*?"title":"(.*?)"', html)
    with open("index.html", "w") as f2:
        f2.write("<meta charset='utf-8'>\r\n")
        index = 0
        for link in links:
            # Decode the \uXXXX escapes, then re-encode as UTF-8 for the output file
            ans = link[1].decode('unicode-escape').encode('utf-8')
            # Drop the site suffix (written in the Simplified characters the site
            # returns, \u9891\u9053) and unescape the JSON "\/" sequences
            ans = ans.replace(' - 博客频道 - CSDN.NET', '').replace("\/", '/')
            url = link[0].replace("\/", '/')
            index += 1
            f2.write('<font size="5">' + ' ' * 10 + str(index) + "、</font>" + '\n<a href="' + url + '" target="_blank">' + '\n')
            f2.write('<font size="5">' + ans + "</font></a><br><br><br>\n\n")

if __name__ == '__main__':
    saveByText()
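The script above writes titles and links into the page as they are. If a title happens to contain characters such as `<` or `&`, the page could come out broken, so here is a hedged Python 3 sketch of the HTML-writing step that escapes them with the standard `html.escape`. The function name `render_index` and the simplified markup are my own, not part of the original script.

```python
import html


def render_index(links):
    """Return an HTML page listing (url, title) pairs as numbered links."""
    lines = ['<meta charset="utf-8">']
    for number, (url, title) in enumerate(links, start=1):
        lines.append(
            '<p>{n}. <a href="{href}" target="_blank">{text}</a></p>'.format(
                n=number,
                href=html.escape(url, quote=True),  # safe inside a quoted attribute
                text=html.escape(title),            # safe as element text
            )
        )
    return "\n".join(lines) + "\n"
```

The returned string can then be written out with `open("index.html", "w", encoding="utf-8")`.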

The result:

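PS: I had no idea I had bookmarked this many articles. Along the way I also found that some favorites appear more than once, presumably because of an earlier bug. And maybe the favorites folder is the root of the problem: you always assume you will read something later once it is saved... but do you really?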
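The duplicates are easy to filter out before writing the page. A small Python 3 sketch, assuming the favorites are held as a list of (url, title) tuples and that two entries with the same URL count as the same article; the name `dedupe_by_url` is my own.

```python
def dedupe_by_url(links):
    """Keep the first (url, title) pair for each URL, preserving order."""
    seen = set()
    unique = []
    for url, title in links:
        if url in seen:
            continue  # already kept an entry for this article
        seen.add(url)
        unique.append((url, title))
    return unique
```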

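At this point the job is basically done, but copying the response by hand every time is a bit tedious. It would be much nicer to simulate the login and automate the whole thing. I did not know how to do that, but after some persistent searching I found code online that successfully simulates a CSDN blog login, so I borrowed it.

Link: http://blog.csdn.net/zhangweiguo_717/article/details/52716677

The CSDN blog backup tool submitted to this event also involves simulated login: http://blog.csdn.net/Marksinoberg/article/details/70946107

Code that logs in automatically and fetches the favorites: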

# coding: utf-8
# Python 2 script: log in to CSDN, then fetch the favorites and write index.html
import urllib, urllib2, re, cookielib
import getpass  # read the password without echoing it

UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36'

def login(username, password):
    # Build an opener that keeps cookies across requests
    cookie = cookielib.CookieJar()
    cookieProc = urllib2.HTTPCookieProcessor(cookie)
    global opener
    opener = urllib2.build_opener(cookieProc)
    # The login page embeds "lt" and "execution" values that are sent back with the form
    h = opener.open('https://passport.csdn.net').read().decode("utf8")
    pattern1 = re.compile(r'name="lt" value="(.*?)"')
    pattern2 = re.compile(r'name="execution" value="(.*?)"')
    b1 = pattern1.findall(h)
    b2 = pattern2.findall(h)
    global postData
    postData = {
        'username': username,
        'password': password,
        'lt': b1[0],
        'execution': b2[0],
        '_eventId': 'submit',
    }
    postData = urllib.urlencode(postData)

    opener.addheaders = [('User-Agent', UA),
                         ('Referer', 'https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn')
                         ]
    # Submit the login form; the session cookies end up in the cookie jar
    response = opener.open('https://passport.csdn.net', data=postData)

def autoSave(username):
    # {} is filled in with the username
    url = 'http://my.csdn.net/my/favorite/get_favorite_list?pageno=0&pagesize=10000&username={}'
    # As in the original script, the urlencoded login form is sent as the request body
    html = opener.open(url.format(username), data=postData).read()
    links = re.findall(r'"url":"(.*?)",.*?"title":"(.*?)"', html)
    print len(links)
    with open("index.html", "w") as f2:
        f2.write("<meta charset='utf-8'>\r\n")
        index = 0
        for link in links:
            # Decode the \uXXXX escapes, then re-encode as UTF-8 for the output file
            ans = link[1].decode('unicode-escape').encode('utf-8')
            # The suffix uses the Simplified characters the site returns (\u9891\u9053)
            ans = ans.replace(' - 博客频道 - CSDN.NET', '').replace("\/", '/')
            url = link[0].replace("\/", '/')
            index += 1
            print ans, url
            f2.write('<font size="5">' + ' ' * 10 + str(index) + "、</font>" + '\n<a href="' + url + '" target="_blank">' + '\n')
            f2.write('<font size="5">' + ans + "</font></a><br><br><br>\n\n")

if __name__ == '__main__':
    username = raw_input('CSDN username: ')
    password = getpass.getpass('CSDN password: ')
    login(username, password)
    autoSave(username)
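For anyone on Python 3, where urllib2 and cookielib no longer exist under those names, the two building blocks of the login step can be sketched as follows. This is only an illustration under assumptions: the helper names `make_opener` and `build_login_form` are mine, the form field names are the ones used in the script above, and I have not tested it against the live site.

```python
import http.cookiejar
import re
import urllib.request
from urllib.parse import urlencode


def make_opener():
    """Create an opener that remembers cookies between requests."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def build_login_form(login_page_html, username, password):
    """Scrape the hidden lt/execution values and return the POST body as bytes."""
    hidden = {}
    for name in ("lt", "execution"):
        match = re.search(r'name="%s" value="(.*?)"' % name, login_page_html)
        if match is None:
            raise ValueError("hidden field %r not found in login page" % name)
        hidden[name] = match.group(1)
    form = {
        "username": username,
        "password": password,
        "lt": hidden["lt"],
        "execution": hidden["execution"],
        "_eventId": "submit",
    }
    # urllib.request expects the request body as bytes
    return urlencode(form).encode("ascii")
```

The idea is to fetch the login page with the opener, pass its HTML to `build_login_form`, and post the returned bytes back through the same opener so the session cookies are kept.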

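OK, that's it.

The version of the auto-fetch code I uploaded had a problem at line 47: I forgot to remove the commented-out part. The code above has been corrected.

Reference blogs: http://blog.csdn.net/haichao062/article/details/8107316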
http://blog.csdn.net/devil_2009/article/details/38796533

AC_Dreameng
