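For the underlying principles, see the previous post.
Tools
XPath Helper, a Chrome extension (available from the Chrome Web Store)
Scraping news from the ifeng (Phoenix) home page
Using the extension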
To extract every item rather than a single one, just adjust the XPath expression.
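As a rough illustration of what adjusting the expression can mean: a path copied for one element usually carries an index on the list item (for example `li[1]`), and dropping that index makes the same path match every item. The sketch below runs against a made-up HTML fragment, not ifeng's real markup, and the variable names are only for illustration.

```python
from lxml import etree

# Hypothetical markup standing in for a news list; the real page differs.
SAMPLE_HTML = """
<div id="root">
  <ul>
    <li><a href="/a" title="First story">First</a></li>
    <li><a href="/b" title="Second story">Second</a></li>
    <li><a href="/c" title="Third story">Third</a></li>
  </ul>
</div>
"""

tree = etree.HTML(SAMPLE_HTML)

# With an index on li, the expression selects only one list item.
single = tree.xpath('//*[@id="root"]/ul/li[1]/a/@title')

# Without the index, the same expression selects every list item.
all_titles = tree.xpath('//*[@id="root"]/ul/li/a/@title')

print(single)
print(all_titles)
```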
How do you use it in Python?
The code is as follows:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import json

import requests
from lxml import etree
from lxml.html import tostring  # serializes an element node to a string


def getNews():
    url = 'https://news.ifeng.com/'
    response = requests.get(url=url)
    html = response.content.decode('utf-8')
    news_tree = etree.HTML(html)
    # xpath() returns a list; if the page has 20 items, each list has length 20
    titles = news_tree.xpath(
        '//*[@id="root"]/div[6]/div[1]/div[5]/ul/li/a/@title')
    hrefs = news_tree.xpath(
        '//*[@id="root"]/div[6]/div[1]/div[5]/ul/li/a/@href')
    imgs = news_tree.xpath(
        '//*[@id="root"]/div[6]/div[1]/div[5]/ul/li/a/img/@src')
    times = news_tree.xpath(
        '//*[@id="root"]/div[6]/div[1]/div[5]/ul/li/div/div/time')
    tags = news_tree.xpath(
        '//*[@id="root"]/div[6]/div[1]/div[5]/ul/li/div/div/span')
    # Walk the lists by index, build a dict for each news item,
    # collect the dicts in a list, and return the list as JSON
    array = []
    for count in range(len(titles)):
        title = titles[count]
        link = hrefs[count]
        img = imgs[count]
        time = times[count].text  # <time> and <span> are elements, so read .text
        tag = tags[count].text
        dic = {'title': title, 'href': link, 'img': img, 'time': time, 'tag': tag}
        array.append(dic)
    return json.dumps(array, ensure_ascii=False)


if __name__ == "__main__":
    jsonstring = getNews()
    print(jsonstring)
The printed output looks like this:
[{
"title": "綠地迴應被舉報高管貪腐問題:調查中 不會姑息",
"href": "//news.ifeng.com/c/7weTelvvWbY",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/CDB7AA8A2B55483B843DAF99CE559E11_w698_h392.png",
"time": "今天 12:05",
"tag": "中國新聞網"
}, {
"title": "美國抗議者在白宮外放裝屍袋辦“葬禮” 問責政府抗疫不力",
"href": "//news.ifeng.com/c/7weTH6IwesH",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/ucms/2020_21/0370BFA6C72EAB721A55DB02731CED811930349E_w698_h392.png",
"time": "今天 12:05",
"tag": "環球網"
}, {
"title": "張文宏:各地有偶發病例是大概率事件,應長期保持適當社交距離",
"href": "//news.ifeng.com/c/7weRdCzXkJc",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/1F2D720F73E54AF8956B39DB212606C6_w690_h387.jpg",
"time": "今天 11:37",
"tag": "張文宏醫生"
}, {
"title": "又美又有才,難道她就是特朗普的“完美”發言人?",
"href": "//news.ifeng.com/c/7weRLg43Viq",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/2CB1E314289C482395DE1CD313E0CCD2_w698_h392.jpg",
"time": "今天 11:33",
"tag": "冰汝看美國"
}, {
"title": "美國傳染病專家福奇兩週未接受採訪,美媒懷疑其被禁聲",
"href": "//news.ifeng.com/c/7weOmUniq6O",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/9DE3F1B1A36F4832BA4DF6D12267D80C_w698_h392.jpg",
"time": "今天 11:15",
"tag": "澎湃新聞"
}, {
"title": "酒駕致廣東援鄂醫生王爍殉職案開庭 被告曾以涉嫌交通肇事罪被批捕",
"href": "//news.ifeng.com/c/7wePraJu7kG",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/3968E83E17854629AC6BDEE647F8C3B4_w698_h392.png",
"time": "今天 11:10",
"tag": "南方都市報"
}, {
"title": "全國政協會議將爲抗疫犧牲烈士和逝世同胞默哀一分鐘",
"href": "//news.ifeng.com/c/7wePxtxWRZA",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/4F5C7FED0DA045EE96DDD311B4542436_w533_h299.jpg",
"time": "今天 11:09",
"tag": "工人日報"
}, {
"title": "全國人大代表姚勁波:降低公積金繳存比例,減輕企業經營負擔",
"href": "//news.ifeng.com/c/7wePkyXwfho",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/ucms/2020_21/740C78C4AE2878A548CAFB829EA511B7B5405646_w698_h392.jpg",
"time": "今天 11:08",
"tag": "澎湃新聞網"
}, {
"title": "人民日報:把“黑暴”趕出香港,得從根上拔除“毒瘤”",
"href": "//news.ifeng.com/c/7wePQZK5wUS",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/07991C78F2EA42DB85E525BE4E847C6F_w600_h336.jpg",
"time": "今天 11:05",
"tag": "人民日報"
}, {
"title": "華爲美國高管:美國斷供我們能挺過去,不過大量美國人會失業",
"href": "//news.ifeng.com/c/7wePHsPV6UC",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/01D0AD2B338D469286850CC5CD8F19AE_w569_h319.jpg",
"time": "今天 11:04",
"tag": "環球網"
}, {
"title": "人大代表建議:取消生育三孩以上的處罰政策 國家給予育兒補貼",
"href": "//news.ifeng.com/c/7weNTNgpLOi",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/ucms/2020_21/630A01F7A7A78464A6D06536A6A6873858EFD058_w698_h392.jpg",
"time": "今天 11:00",
"tag": "新京報"
}, {
"title": "瘋狂的頭盔:我10天賺了800萬",
"href": "//news.ifeng.com/c/7weOSjP6hN2",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/9CCED3FB9DFF4554B5C9F9FC4599B608_w512_h287.jpg",
"time": "今天 10:53",
"tag": "縱相新聞"
}, {
"title": "美國加州聯邦參議員提議案 譴責“中國病毒”等詞彙指稱新冠",
"href": "//news.ifeng.com/c/7weNkE8jFA0",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/FF274567E4C0492496E65C1ECF119A87_w698_h392.jpg",
"time": "今天 10:44",
"tag": "中國日報網"
}, {
"title": "王學坤委員:建議建立農民退休制度 讓65歲以上農民“洗腳上田,老有所養”",
"href": "//news.ifeng.com/c/7weNbtSo7v6",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/75DF69024398483AAA24F657C0EC764F_w602_h338.png",
"time": "今天 10:39",
"tag": "最高人民檢察院"
}, {
"title": "特朗普叫囂“中國有個瘋子”,評論區翻車",
"href": "//news.ifeng.com/c/7weN4fqF7BI",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/4969C19002F340678BCC48B9266B1D2C_w698_h392.jpg",
"time": "今天 10:32",
"tag": "觀察者網"
}, {
"title": "軍報頭版評論:“蓬佩奧們”邊喊抓賊邊做賊,下場註定可悲",
"href": "//news.ifeng.com/c/7weMvlBWShM",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/636D31AEB733412F96581882FCEFC64E_w698_h392.png",
"time": "今天 10:31",
"tag": "解放軍報"
}, {
"title": "特殊時期的中國兩會 外媒都在關注這些",
"href": "//news.ifeng.com/c/7weMghINhz6",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/D4270B52C0C14B72909202DBABECE1B6_w698_h392.jpg",
"time": "今天 10:28",
"tag": "央視新聞客戶端"
}, {
"title": "雷軍建議:進一步降低民營企業進入衛星互聯網門檻",
"href": "//news.ifeng.com/c/7weL0NllooO",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/ucms/2020_21/AB6E6070DCCBE7465220469B2578CD910EA67390_w698_h392.jpg",
"time": "今天 10:20",
"tag": "澎湃新聞"
}, {
"title": "北京15座王府14座被佔,政協委員:應設騰退協調機構",
"href": "//news.ifeng.com/c/7weKz9vBGm5",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/ucms/2020_21/429BCCAD3FC016669563C909F36859F71B506DE0_w698_h392.jpg",
"time": "今天 10:20",
"tag": "新京報"
}, {
"title": "荷蘭政府:水貂可能將新冠病毒傳給人 清查所有養殖場",
"href": "//news.ifeng.com/c/7weKeI1Yr6D",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/09FBF81BFE594527AAF2C36D2ED4EEDF_w519_h291.jpg",
"time": "今天 10:16",
"tag": "觀察者網"
}]
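One fragility worth noting in the list scraper: the five XPath queries run independently and are then combined by position, so if a single item has no image or no source tag, the lists end up with different lengths and the fields either misalign or raise an IndexError. A possible alternative, sketched here on a made-up fragment, is to select each `li` first and then run relative queries inside it. The helper names `first` and `parse_items` and the sample markup are assumptions for illustration, not part of the original code.

```python
import json
from lxml import etree

# Hypothetical fragment; the second item deliberately has no <img>.
SAMPLE_HTML = """
<ul>
  <li><a href="//example.com/1" title="Story one"><img src="//example.com/1.png"></a>
      <div><div><time>10:00</time><span>Source A</span></div></div></li>
  <li><a href="//example.com/2" title="Story two"></a>
      <div><div><time>10:05</time><span>Source B</span></div></div></li>
</ul>
"""


def first(results, default=None):
    """Return the first XPath result, or a default when nothing matched."""
    return results[0] if results else default


def parse_items(html):
    tree = etree.HTML(html)
    items = []
    # Select each <li> once, then query relative to it ("./...") so that
    # every field in a dict comes from the same list item.
    for li in tree.xpath('//ul/li'):
        items.append({
            'title': first(li.xpath('./a/@title')),
            'href': first(li.xpath('./a/@href')),
            'img': first(li.xpath('./a/img/@src')),
            'time': first(li.xpath('./div/div/time/text()')),
            'tag': first(li.xpath('./div/div/span/text()')),
        })
    return items


items = parse_items(SAMPLE_HTML)
print(json.dumps(items, ensure_ascii=False, indent=2))
```

With this shape, a missing field shows up as null in the JSON for that one item instead of shifting the values of every item after it.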
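What if you need the full article?
Option 1: return it as part of the list. Inside getNews(), first collect the links in hrefs, then loop over them and fetch and parse each linked page with lxml. This is not well suited to returning data directly to a client, for two reasons: the JSON payload becomes too large, and the wait time becomes too long.
Option 2: write a separate scraping function that takes the URL of the article page.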
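The href values in the list output are protocol-relative (they start with `//`), and requests will not accept a URL without a scheme, so the link most likely needs to be made absolute before it is passed to a detail function. A minimal sketch using the standard library follows; the helper name `absolutize` and the `BASE_URL` constant are hypothetical.

```python
from urllib.parse import urljoin

# Base page the links were scraped from (the URL used by getNews()).
BASE_URL = 'https://news.ifeng.com/'


def absolutize(link, base=BASE_URL):
    """Resolve a protocol-relative or relative link against the base page URL."""
    return urljoin(base, link)


print(absolutize('//news.ifeng.com/c/7weTelvvWbY'))
```

The same helper can be applied to the img values, which are protocol-relative as well.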
The code for fetching the article content is as follows:
def getNewsContent(url):
    response = requests.get(url=url)
    html = response.content.decode('utf-8')
    news_content_tree = etree.HTML(html)
    # This XPath is expected to match exactly one article-body element,
    # so take the first result
    content = news_content_tree.xpath(
        '//*[@id="root"]/div/div[3]/div[1]/div[1]/div[3]')[0]
    # tostring() returns bytes by default, and str() on bytes leaves a b'...'
    # wrapper that would have to be sliced off; encoding='unicode' returns
    # a str directly
    content_html_text = tostring(content, encoding='unicode')
    return content_html_text
The printed data looks like this:
<div class="main_content-LcrEruCc"><div><div class="text-3zQ3cZD4"><p>近日,21岁的冼嘉豪因暴动罪被香港法院判刑4年,他在求情信中说:“没有一天不后悔”。2019年6月至2020年4月15日,8001人被捕,1365人被起诉,566人被控暴动罪。个体的悲剧还在持续上演,数字的揪心让人持久难平,一场“修例风波”造就的暴力旋涡,已让多少香港年轻人命运脱轨、前途毁弃。</p><p>曾经拥有的东西因为参与非法暴力活动而丧失,一直拥有的生活因为暴力破坏而止步,狮子山下的纷乱伤害了多少逐梦路上的人。回望香港“修例风波”,正是因为反中乱港分子鼓吹暴力、煽惑暴力,被洗脑的年轻人迷信暴力、使用暴力,香港才结出了孩子有家难回、有梦难圆,市民有工难开、无工可开的苦果,让繁荣稳定的香港陷入危机困境。</p><p><img src="https://x0.ifengimg.com/ucms/2020_21/A1688E829DE205EEBC309384E3783FE8BA15437D_w1080_h1920.jpg"></p><p>这是香港市民想要的吗?最基本的安全被剥夺,出行怕有人又去砸地铁,营业怕黑衣人又来打砸,饭桌上有不同政见也不敢轻易发表,校园里竟成了“兵工厂”;人被贴上标签,店被贴上标签,被起底、被排斥、被攻击,在所谓“私了”和“装修”之下,黑色恐怖的利刃戳进市民的心,让人普遍变得焦虑、恐惧。因为暴徒,个人这小家被黑暗包裹,因为暴力,香港这个大家已满目疮痍,怎能不让人心痛、不让人愤慨,不让人期盼香港重归祥和安定!</p><p>在“修例风波”中,人们已经看尽暴力的危害、暴徒的凶残。特区政府警务处处长邓炳强此前表示,香港正面临本土恐怖主义的威胁,威胁到香港市民的人身安全,也在对国家安全造成冲击。反暴力,是因为暴力已渗透进香港市民的日常生活,危险近在咫尺;是因为暴力还有延续、扩散和升级的可能,要摧毁家园;是因为暴力不止,暴徒将更加猖狂,反中乱港分子将更加嚣张,香港要葬送掉一代代人辛苦建立的基业,辉煌篇章被恐怖主义湮灭。</p><p>通过香港警方严正执法,香港暴徒的气焰已被压制;由于香港市民拥护止暴制乱,香港暴力的土壤正被逐步铲除。但发生在香港的暴力并未绝迹,蠢蠢欲动的暴徒还在伺机而动。5月份前后,人们又看到了暴徒投掷的燃烧弹,看到了暴徒寄出的恐吓邮件。香港市民需要强化共识,一起向暴力说不;香港警方需要再接再厉,不给暴徒任何喘息之机。更需从根本上想办法,根治“黑暴”这个毒瘤。只有让暴徒、暴力成过街老鼠、众矢之的,纵暴、施暴的人付出沉重的代价,香港才有岁月静好,市民才能安心生活。</p></div><span></span><div class="end-37GBinZ_"></div></div></div>
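A note on the bytes handling in getNewsContent: the `b'` prefix and trailing `'` appear because `tostring()` returns bytes by default, and calling `str()` on bytes produces their repr rather than the text they encode. The sketch below shows that general Python behavior on hand-made UTF-8 bytes (an assumption for illustration, not real lxml output), and why decoding is usually safer than slicing the repr.

```python
# Hand-made UTF-8 bytes; the content is made up for this illustration.
raw = '<p>你好</p>'.encode('utf-8')

# str() on bytes yields the repr, including the b'...' wrapper and escapes.
as_repr = str(raw)

# Slicing removes the leading b' and the trailing ', but any escape
# sequences for non-ASCII bytes remain in the string.
sliced = as_repr[2:-1]

# Decoding converts the bytes to the text they actually encode.
decoded = raw.decode('utf-8')

print(as_repr)
print(sliced)
print(decoded)
```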