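Introduction to PyQuery
pyquery is essentially a Python implementation of jQuery and can be used to parse HTML pages, among other things. Its syntax is almost identical to jQuery's, so anyone who has used jQuery will find it familiar and easy to pick up.
In the author's own words:
"The API is as much as possible the similar to jquery."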
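Installation
Either pip or easy_install works.
Note: pyquery depends on lxml, so lxml must be installed first; otherwise the installation will fail.
- Install lxml: https://pypi.python.org/pypi/lxml/2.3/ (downloading the installer package directly is recommended as the quickest route);
- Install pyquery: easy_install pyquery or pip install pyquery;
- Verify: enter
import pyquery
at the Python prompt and press Enter. If no error is reported, the installation succeeded.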
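The verification step can also be scripted. The following is a minimal sketch using only the standard library; the helper name missing_packages is made up for this illustration and is not part of pyquery.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# pyquery needs lxml, so check both.
missing = missing_packages(["lxml", "pyquery"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("lxml and pyquery are both importable")
```

Unlike a bare import, this reports which of the two packages is absent instead of stopping at the first ImportError.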
Initialization
There are four ways to initialize PyQuery: by passing in a string, an lxml element, a file, or a URL.

    from pyquery import PyQuery as pq
    from lxml import etree

    d = pq("<html></html>")                    # from a string
    d = pq(etree.fromstring("<html></html>"))  # from an lxml element
    d = pq(url='http://google.com/')           # from a URL
    d = pq(filename=path_to_html_file)         # from a file

Now d behaves like $ in jQuery.
Initializing from a string

    html = '''
    <div>
        <ul>
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    print(doc('li'))

This finds all li tags. The output is:

    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
Initializing from a URL

    from pyquery import PyQuery as pq
    doc = pq(url='http://www.baidu.com')
    print(doc('head'))

This selects the head tag of the Baidu home page together with its contents.
The output is:

    <head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css"/><title>百度一下,你就知道</title></head>
Initializing from a file

    from pyquery import PyQuery as pq
    doc = pq(filename='demo.html')
    print(doc('li'))
Basic CSS selectors

    html = '''
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    print(doc('#container .list li'))

This selects the li tags inside the element with class list, which is itself inside the element with id container. Each space in the selector stands for one level of nesting (a descendant relationship).
The output is:

    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
Finding elements
Descendant elements

    html = '''
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('.list')
    print(type(items))
    print(items)
    lis = items.find('li')
    print(type(lis))
    print(lis)

find searches all descendants of the selection; here it returns every li tag.
The output is:

    <class 'pyquery.pyquery.PyQuery'>
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    <class 'pyquery.pyquery.PyQuery'>
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
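children
children returns only the direct child elements. Continuing from the previous example:

    lis = items.children('.active')
    print(lis)

This finds the direct children that have the class active.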
Parent element

    html = '''
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('.list')
    container = items.parent()
    print(type(container))
    print(container)

Output:

    <class 'pyquery.pyquery.PyQuery'>
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
Ancestor elements (parents)

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('.list')
    parents = items.parents()
    print(type(parents))
    print(parents)

Output (both ancestors are returned: the outer div with class wrap, then the div with id container):

    <class 'pyquery.pyquery.PyQuery'>
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div><div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
A selector can also be passed in to filter the ancestors.

    parent = items.parents('.wrap')
    print(parent)

This keeps only the ancestor with the class wrap.
Sibling elements:
Code: x.siblings()
Iteration

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    lis = doc('li').items()
    print(type(lis))
    for li in lis:
        print(li)

The .items() method returns a generator that yields each matched element as a PyQuery object.

    <class 'generator'>
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
Getting attributes

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    a = doc('.item-0.active a')
    print(a)
    print(a.attr('href'))
    print(a.attr.href)

This reads the value of the href attribute of the selected a tag. Both a.attr('href') and a.attr.href return the same value.

    <a href="link3.html"><span class="bold">third item</span></a>
    link3.html
    link3.html
Getting text

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    a = doc('.item-0.active a')
    print(a)
    print(a.text())

Output:

    <a href="link3.html"><span class="bold">third item</span></a>
    third item
Getting HTML

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    print(li.html())

Output (the first line is the li element itself, the second is its inner HTML):

    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <a href="link3.html"><span class="bold">third item</span></a>
DOM manipulation
addClass and removeClass: adding and removing classes

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    li = doc('.item-0.active')  # select the li element
    print(li)
    li.removeClass('active')    # remove the active class
    print(li)
    li.addClass('active')       # add the active class back
    print(li)

Output:

    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
Modifying attributes and CSS

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    li.attr('name', 'link')       # add name="link" to the li; an existing name attribute is overwritten
    print(li)
    li.css('font-size', '14px')   # set font-size to 14px in the style attribute
    print(li)

Output:

    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>
remove

    html = '''
    <div class="wrap">
        Hello, World
        <p>This is a paragraph.</p>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    wrap = doc('.wrap')
    print(wrap.text())
    wrap.find('p').remove()
    print(wrap.text())

To get only "Hello, World", first delete the p element with .remove() and then read the text.
Output:

    Hello, World This is a paragraph.
    Hello, World