二十、python xpath介紹和新聞內容爬蟲
Xpath介紹
用xpath提取感興趣的內容
一個網頁文檔是一個半結構化的數據,其實html文檔就是一個樹形結構。根節點是html
用正則表達式也可以提取,但是不如xpath方便
1、路徑表示法
/ 從根節點定位;// 選取文檔中所有匹配的節點(不限位置)
/text(): 提取文本的內容
/@attr:提取屬性的內容
2、篩選條件
/div[@id]
/div[@id="content_id"]
/book[price>100] #按照節點的值
問題:需要先在 Windows 中安裝 lxml
前提是先裝好:pip
C:\Users\lyd>pip install lxml
Xpath使用
#獲取新聞列表
import requests
from lxml import etree
import datetime
def getNewsUrlList(baseUrl):
    """Yield (title, url, ptime) for every news item on the list page at baseUrl."""
    response = requests.get(baseUrl)
    # The site serves its pages in a gbk-family encoding, so decode explicitly.
    document = etree.HTML(response.content.decode('gbk'))
    # Only <li> elements that contain a <div> are real news entries.
    items = document.xpath('//div[@id="content_right"]/div[@class="content_list"]/ul/li[div]')
    for item in items:
        link = item.xpath('div/a/@href')[0]
        headline = item.xpath('div/a/text()')[0]
        published = item.xpath('div[@class="dd_time"]/text()')[0]
        yield (headline, link, published)
# Fetch the article body for each news entry.
def getNewsContent(newsUrlList):
    """Yield (title, url, ptime, news) for every entry in newsUrlList.

    newsUrlList: iterable of (title, url, ptime) tuples, as produced by
    getNewsUrlList.
    """
    # Bug fix: the original iterated the misspelled global 'newUrlList'
    # instead of the 'newsUrlList' parameter, raising NameError when the
    # caller's variable was named differently.
    for title, url, ptime in newsUrlList:
        x = requests.get(url)
        html = x.content.decode('gbk')  # page encoding is gb2312/gbk
        selector = etree.HTML(html)
        paragraphs = selector.xpath('//div[@class="left_zw"]/p/text()')
        # Join with CRLF so each <p> paragraph ends on its own line.
        news = '\r\n'.join(paragraphs)
        yield title, url, ptime, news
# Date helper for building list-page URLs.
def getYesterday(i):
    """Return the date i days before today, formatted as 'MMDD' (no year)."""
    target = datetime.date.today() - datetime.timedelta(days=i)
    return target.strftime("%m%d")
if __name__ == "__main__":
    # List-page URL pattern: .../mil/{year}/{month}{day}/news.shtml, e.g.
    #   http://www.chinanews.com/scroll-news/2017/0719/news.shtml
    #   http://www.chinanews.com/scroll-news/mil/2017/0717/news.shtml
    urlTemplate = 'http://www.chinanews.com/scroll-news/mil/{0}/{1}{2}/news.shtml'
    # Month/day must be zero-padded (the original '7', '20' built a broken URL).
    testurl = urlTemplate.format('2017', '07', '20')
    # print(testurl)

    # Homework: crawl the news from the last n days.
    for i in range(0, 10):
        # Bug fix: derive the full date here instead of hard-coding "2017/";
        # the original combined a fixed year with getYesterday()'s MMDD,
        # which broke in any other year and across year boundaries.
        day = datetime.date.today() - datetime.timedelta(days=i)
        urls = 'http://www.chinanews.com/scroll-news/mil/%s/news.shtml' % day.strftime('%Y/%m%d')
        newUrlList = getNewsUrlList(urls)
        for title, url, ptime in newUrlList:
            print(title, url, ptime)  # print() works on both Python 2 and 3
        # Uncomment to also fetch and persist the article bodies:
        # newsConstant = getNewsContent(newUrlList)
        # with open('news.txt', 'wb') as f:  # open for writing
        #     w = lambda x: f.write((x + u'\r\n').encode('utf-8'))
        #     for title, url, ptime, news in newsConstant:
        #         w(u'~' * 100)
        #         w(title)
        #         w(url)
        #         w(news)
-----------------------------------------------------------------------------------------------------------------------