Notes on parsing an HTML file with bs4

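from bs4 import BeautifulSoup

# Read the saved page source (html.txt holds raw HTML despite its extension)
with open('html.txt', 'r', encoding='utf8') as f:
    htmlfile = f.read()
soup = BeautifulSoup(htmlfile, 'lxml')

# Titles live in <p class="article-list-item-txt">,
# links live in <div class="item-info-oper">
article_titles = soup.find_all('p', class_='article-list-item-txt')
article_links = soup.find_all('div', class_='item-info-oper')

# The title text is inside the <a> under each <p>
ats = [at.a.get_text() for at in article_titles]

# Keep the href of each direct child <a>, skipping javascript placeholders
als = []
for al in article_links:
    for l in al.children:
        href = l.get('href') if l.name == 'a' else None
        if href and href != 'javascript:void(0);':
            als.append(href)

# Every title must pair with exactly one link
assert len(ats) == len(als)

# Append the results to lala.md as Markdown links, opening the file only once
with open('lala.md', 'a', encoding='utf8') as f:
    for at, al in zip(ats, als):
        f.write('[%s](%s)\n' % (at, al))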

The program above parses the following HTML file:

<div class="article-list-item-mp"><div class="list-item-title"><!----><span class="article-list-item-tag">置頂</span><p class="article-list-item-txt "><!----><a href="/console/editor/html/104424751" class="" title="編輯">如果你重裝了服務器的centos系統</a></p></div><div class="article-list-item-info"><div class="item-info-left"><span> 原創 </span><span>2020年02月21日 11:33:06</span><span class="article-list-item-readComment"><img src="https://csdnimg.cn/release/mp/img/read.png" title="閱讀數" alt="" class="icon"> 20 </span><span class="article-list-item-readComment"><img src="https://csdnimg.cn/release/mp/img/comment.png" title="評論" alt="" class="icon"> 0 </span></div><div class="item-info-oper"><!----><a href="javascript:void(0);"><span class="useCard item-info-oper-text"> 使用推薦卡 </span></a><a href="https://blog.csdn.net/z1314520cz/article/details/104424751" target="_blank"><span class="item-info-oper-text">查看</span></a><div class="item-info-discuss"><div class="el-select"><!----><div class="el-input el-input--suffix"><!----><input type="text" readonly="readonly" autocomplete="off" placeholder="請選擇" class="el-input__inner"><!----><span class="el-input__suffix"><span class="el-input__suffix-inner"><i class="el-select__caret el-input__icon el-icon-arrow-up"></i><!----><!----><!----><!----><!----></span><!----></span><!----><!----></div><div class="el-select-dropdown el-popper" style="display: none; min-width: 90px;"><div class="el-scrollbar" style=""><div class="el-select-dropdown__wrap el-scrollbar__wrap" style="margin-bottom: -23px; margin-right: -23px;"><ul class="el-scrollbar__view el-select-dropdown__list"><!----><li class="el-select-dropdown__item selected"><span> 評論公開</span></li><li class="el-select-dropdown__item"><span>審覈後公開</span></li></ul></div><div class="el-scrollbar__bar is-horizontal"><div class="el-scrollbar__thumb" style="transform: translateX(0%);"></div></div><div class="el-scrollbar__bar is-vertical"><div class="el-scrollbar__thumb" style="transform: 
translateY(0%);"></div></div></div><!----></div></div></div><a href="javascript:void(0);"><span class="setTop item-info-oper-text">取消置頂</span></a><div class="item-info-inline"><a href="javascript:void(0);" class="item-right-border"><span class="del item-info-oper-text">刪除</span></a></div></div></div></div>

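Parse the HTML file into a BeautifulSoup object, locate the nodes that have distinctive features (here, their class attributes), and read the information from those nodes. A few points to note: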

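1. From a single node, a tag beneath it can be accessed by using the tag name as an attribute (for example, at.a returns the first <a> under at).
2. Child nodes are accessed through the children attribute, which yields direct children only.
3. A node's name attribute is its tag name, and attrs is its dictionary of attributes.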
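
The three points above can be tried on a small fragment. This is only a minimal sketch: the HTML string is a trimmed-down, hypothetical stand-in for one list item (the example.com URL and the English labels are made up), and it uses the built-in 'html.parser' instead of 'lxml' so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical, simplified stand-in for one article entry (not the real page)
html = (
    '<div class="item">'
    '<p class="article-list-item-txt"><a href="/edit/1" title="edit">First post</a></p>'
    '<div class="item-info-oper"><!-- buttons -->'
    '<a href="javascript:void(0);"><span>Use card</span></a>'
    '<a href="https://example.com/post/1" target="_blank"><span>View</span></a>'
    '<div class="item-info-inline">'
    '<a href="javascript:void(0);"><span>Delete</span></a>'
    '</div>'
    '</div>'
    '</div>'
)

soup = BeautifulSoup(html, 'html.parser')

# Point 1: p.a is the first <a> found under the <p>
p = soup.find('p', class_='article-list-item-txt')
title = p.a.get_text()

# Point 2: .children yields direct children only; some of them are not tags
oper = soup.find('div', class_='item-info-oper')
children = list(oper.children)
non_tags = [c for c in children if not isinstance(c, Tag)]  # the comment node
child_names = [c.name for c in children if isinstance(c, Tag)]

# Point 3: .name is the tag name, .attrs is a dict of the tag's attributes
links = [
    c.attrs['href']
    for c in children
    if isinstance(c, Tag) and c.name == 'a'
    and c.attrs.get('href') != 'javascript:void(0);'
]

print(title)
print(child_names)
print(links)
```

The "Delete" link sits inside a nested <div>, so it is not a direct child and children never reaches it. The comment node is a direct child but not a tag, which is why the loop checks the tag name before reading any attributes.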

