Simple web scraping with the BeautifulSoup library: an example that collects all links

Original

aaa1111sss

2018-09-11 02:51

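Introduction:

Using BeautifulSoup's find_all method, we select every a tag whose href attribute begins with http and return the value of that attribute. These values are the page's first-level links, that is, the links that appear directly on the page itself. No deeper traversal of the linked pages is done here.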

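#!/opt/yrd_soft/bin/python
# Written for Python 3: urllib2 exists only in Python 2, so requests does the download.

import re

import requests
from bs4 import BeautifulSoup

url = 'http://www.baidu.com'

# Download the page and take the decoded body as text.
page = requests.get(url).text
# Parse with the lxml backend (the lxml package must be installed).
pagesoup = BeautifulSoup(page, 'lxml')
# Keep only <a> tags whose href starts with http: or https:.
for link in pagesoup.find_all(name='a', attrs={"href": re.compile(r'^https?:')}):
    print(link.get('href'))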
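
To try the same filter without a network request, the sketch below runs find_all on an inline HTML string. The sample markup, the helper name extract_http_links, the built-in html.parser backend, and the pattern that also accepts https are assumptions made for illustration; they are not part of the original script.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
SAMPLE_HTML = """
<html><body>
  <a href="http://example.com/a">absolute http</a>
  <a href="https://example.com/b">absolute https</a>
  <a href="/relative/path">relative</a>
  <a href="javascript:void(0)">script</a>
  <a name="anchor-without-href">no href</a>
</body></html>
"""


def extract_http_links(html):
    """Return the href of every <a> tag whose href starts with http: or https:."""
    # html.parser ships with Python, so no extra parser package is needed here.
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(r"^https?:")
    # Tags without an href, or with a non-matching href, are skipped by the filter.
    return [a["href"] for a in soup.find_all("a", href=pattern)]


if __name__ == "__main__":
    for href in extract_http_links(SAMPLE_HTML):
        print(href)
```

Relative links, javascript: links, and anchors with no href at all are dropped, which is probably the behaviour wanted when only absolute first-level links matter.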
