Data Crawling (6): Basic Usage of PyQuery for Crawlers

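Introduction to PyQuery

pyquery is essentially a Python implementation of jQuery and can be used to parse HTML pages, among other things. Its syntax is almost identical to jQuery's, so anyone who has used jQuery will find it familiar and easy to pick up.

In the author's own words:

"The API is as much as possible the similar to jquery."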

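Installation

You can install pyquery with either pip or easy_install.
Note: pyquery depends on lxml, so install lxml first; otherwise the installation may fail.

  1. Install lxml: https://pypi.python.org/pypi/lxml/2.3/ (downloading the installer package directly is recommended as the quickest route);
  2. Install pyquery: easy_install pyquery or pip install pyquery;
  3. Verify: type import pyquery in a Python interpreter and press Enter; if no error is raised, the installation succeeded.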

Initialization

There are four ways to initialize PyQuery: you can pass in a string, an lxml element, a file, or a URL.

from pyquery import PyQuery as pq
from lxml import etree

d = pq("<html></html>")  # from a string
d = pq(etree.fromstring("<html></html>"))  # from an lxml element
d = pq(url='http://google.com/')  # from a URL
d = pq(filename=path_to_html_file)  # from a file (path_to_html_file is a placeholder)

Now d works just like $ in jQuery.

Initializing from a string

from pyquery import PyQuery as pq

html = '''
<div>
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>
'''
doc = pq(html)
print(doc('li'))

This selects all li tags. The output is as follows:

<li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>

Initializing from a URL

from pyquery import PyQuery as pq

doc = pq(url='http://www.baidu.com')
print(doc('head'))

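This selects the head tag of the Baidu home page, together with its contents.
The output is as follows. The title appears as numeric character references; the byte values are the UTF-8 encoding of 百度一下,你就知道, which suggests the response was decoded with the wrong character set: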

<head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css"/><title>&#231;&#153;&#190;&#229;&#186;&#166;&#228;&#184;&#128;&#228;&#184;&#139;&#239;&#188;&#140;&#228;&#189;&#160;&#229;&#176;&#177;&#231;&#159;&#165;&#233;&#129;&#147;</title></head>

 

Initializing from a file

from pyquery import PyQuery as pq

doc = pq(filename='demo.html')
print(doc('li'))

 

Basic CSS selectors

from pyquery import PyQuery as pq

html = '''
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>
'''
doc = pq(html)
print(doc('#container .list li'))

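This selects the li tags inside the element with class list, which is itself inside the element with id container. A space in the selector denotes one level of nesting (a descendant relationship).
The output is: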

<li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>

Finding elements

Child elements

from pyquery import PyQuery as pq

html = '''
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>
'''
doc = pq(html)
items = doc('.list')  # select the ul with class list
print(type(items))
print(items)
lis = items.find('li')  # search its descendants for li tags
print(type(lis))
print(lis)

find returns every li tag among the descendants of the selected element.
The output is:

<class 'pyquery.pyquery.PyQuery'>
<ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>

<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>

children

children returns only the direct child elements.

# items is the .list selection from the previous example
lis = items.children('.active')
print(lis)

This selects the direct children that have the class active.

Parent element

from pyquery import PyQuery as pq

html = '''
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>
'''
doc = pq(html)
items = doc('.list')
container = items.parent()  # the direct parent of the ul
print(type(container))
print(container)

Output:

<class 'pyquery.pyquery.PyQuery'>
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>

Ancestor elements (parents)

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
</div>
'''
doc = pq(html)
items = doc('.list')
parents = items.parents()  # all ancestors of the ul
print(type(parents))
print(parents)

Output:

<class 'pyquery.pyquery.PyQuery'>
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
</div><div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>

You can also pass a selector as an argument to filter the ancestors.

# items is the .list selection from the previous example
parent = items.parents('.wrap')
print(parent)

This selects the ancestor that has the class wrap.

Sibling elements

Code: x.siblings()

Iteration

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
</div>
'''
doc = pq(html)
lis = doc('li').items()
print(type(lis))
for li in lis:
    print(li)

The .items() method returns a generator that yields each matched element in turn. The output is:

<class 'generator'>
<li class="item-0">first item</li>

<li class="item-1"><a href="link2.html">second item</a></li>

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-1 active"><a href="link4.html">fourth item</a></li>

<li class="item-0"><a href="link5.html">fifth item</a></li>

 

Getting information

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
</div>
'''
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.attr('href'))  # method-call form
print(a.attr.href)     # attribute-access form

Both forms return the value of the href attribute of the selected a tag. The output is:

<a href="link3.html"><span class="bold">third item</span></a>
link3.html
link3.html

 

Getting text

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
</div>
'''
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.text())  # text content with the tags stripped

Output:

<a href="link3.html"><span class="bold">third item</span></a>
third item

Getting HTML

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
</div>
'''
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.html())  # inner HTML of the li

Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

<a href="link3.html"><span class="bold">third item</span></a>

DOM manipulation

addClass and removeClass: adding and removing classes

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
</div>
'''
doc = pq(html)
li = doc('.item-0.active')  # select the li element
print(li)
li.removeClass('active')  # remove the active class
print(li)
li.addClass('active')  # add the active class back
print(li)

Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

Modifying attributes and CSS

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
</div>
'''
doc = pq(html)
li = doc('.item-0.active')
print(li)
# Add the attribute name="link" to the li; if name already exists, its value is changed to link.
li.attr('name', 'link')
print(li)
li.css('font-size', '14px')  # set font-size to 14px
print(li)

Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>

remove

 

 

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    Hello, World
    <p>This is a paragraph.</p>
</div>
'''
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
wrap.find('p').remove()  # drop the p element from the tree
print(wrap.text())

If you only want the text "Hello, World", first remove the p element with .remove() and then read the text.
The result:

Hello, World This is a paragraph.
Hello, World

 
