Web Scraping

Original

2019-02-22 16:00

### -*- coding: cp936 -*-
###<a href="http://home.51cto.com" target="_blank">家園</a>
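## First experiment (kept commented out): pull the URL out of a single <a> tag and save that page.
##import urllib.request
##str0='<a href="http://home.51cto.com" target="_blank">家園</a>'
##href=str0.find('<a href')
##print(href)
##com=str0.find('.com"')
##print(com)
##url=str0[href+9:com+4]
##print(url)
##content=urllib.request.urlopen(url).read()
###print(content)
##filename=url[-9:]
##print(filename)
##open(filename,'wb').write(content)
####_________________________________
import urllib.request

url = ['']*50
# Fetch the article list page; it is kept as bytes so no text encoding has to be guessed.
con = urllib.request.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()
i = 0
title = con.find(b'<a title=')
href = con.find(b'href=',title)
html = con.find(b'.html',href)

# Collect up to 50 article links: each link is the text between href=" and the end of .html
while title != -1 and href != -1 and html != -1 and i < 50:
    url[i] = con[href + 6:html + 5].decode('ascii')
    print(url[i])
    title = con.find(b'<a title=',html)
    href = con.find(b'href=',title)
    html = con.find(b'.html',href)
    i = i + 1
else:
    print('find end!')

# Download only the i links that were actually found; the 'hanhan' directory must already exist.
j = 0
while j < i:
    content = urllib.request.urlopen(url[j]).read()
    # The last 26 characters of the URL serve as the local file name.
    open('hanhan/'+url[j][-26:],'wb').write(content)
    j = j + 1
else:
    print("over")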
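
The script above locates links by searching for the substrings `<a title=`, `href=` and `.html`, which depends on the attributes appearing in exactly that order. As an alternative, here is a minimal Python 3 sketch that uses the standard library's `html.parser` to collect the same kind of links. The class name `TitleLinkParser`, the helper `extract_links` and the sample markup are hypothetical names and data chosen for illustration; they are not part of the original script, and the real page may be structured differently.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect href values from <a> tags that also carry a title attribute."""

    def __init__(self, limit=50):
        super().__init__()
        self.limit = limit   # same cap as the 50-slot list in the script
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a' or len(self.links) >= self.limit:
            return
        attr_map = dict(attrs)
        href = attr_map.get('href')
        # Mirror the script's conditions: a title attribute and a .html target.
        if 'title' in attr_map and href and href.endswith('.html'):
            self.links.append(href)


def extract_links(page_text, limit=50):
    """Return up to `limit` article links found in the given HTML text."""
    parser = TitleLinkParser(limit)
    parser.feed(page_text)
    return parser.links


# Hypothetical sample markup, not copied from the real page.
sample = (
    '<a title="first" target="_blank" '
    'href="http://blog.sina.com.cn/s/blog_aaa.html">first</a>'
    '<a href="http://example.com/no-title.html">skipped</a>'
    '<a title="second" href="http://blog.sina.com.cn/s/blog_bbb.html">second</a>'
)
print(extract_links(sample))
```

Because the parser looks at attributes by name, the link is found regardless of the order in which `title`, `target` and `href` appear inside the tag.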

##
##--------------------------------------------
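
The download loop writes into a `hanhan/` directory and takes the last 26 characters of each URL as the file name. The Python 3 sketch below shows one possible way to make that step a little more forgiving: it creates the directory when it is missing, skips empty entries, and derives the file name from the last path component rather than a fixed character count. The names `local_name`, `save_pages` and the `fetch` parameter are assumptions introduced for this example, and the sample URL in the usage note is made up.

```python
import os
import urllib.request


def local_name(url):
    """Return the last path component of the URL, used as the local file name."""
    return url.rstrip('/').rsplit('/', 1)[-1]


def save_pages(urls, out_dir='hanhan', fetch=None):
    """Download every non-empty URL into out_dir and return the saved paths."""
    if fetch is None:
        # Default fetcher: read the raw response bytes over HTTP.
        def fetch(u):
            with urllib.request.urlopen(u) as resp:
                return resp.read()

    os.makedirs(out_dir, exist_ok=True)  # create the target directory if missing
    saved = []
    for u in urls:
        if not u:  # skip unused slots such as ''
            continue
        path = os.path.join(out_dir, local_name(u))
        with open(path, 'wb') as f:  # binary mode: the page is written as raw bytes
            f.write(fetch(u))
        saved.append(path)
    return saved
```

A call such as `save_pages(['http://blog.sina.com.cn/s/blog_0123456789abcdef.html'])` would write `hanhan/blog_0123456789abcdef.html`. Passing a custom `fetch` function makes it possible to try the helper without network access.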
