Basic usage
from bs4 import BeautifulSoup

html = """
<div>test</div>
"""
bs = BeautifulSoup(html, 'lxml')
print(bs.prettify())
Extracting elements
from bs4 import BeautifulSoup

html = """
<tr>
<td>1</td>
<td>2</td>
</tr>
<tr class='even'>
<td>1</td>
<td>2</td>
</tr>
<a class='test' id='test' href="www.baidu.com">2</a>
<a href="www.baidu.com"></a>
"""
soup = BeautifulSoup(html, 'lxml')

# All tr tags
trs = soup.find_all('tr')
for tr in trs:
    print(tr)

# The second tr tag: fetch at most two, then take index 1
tr = soup.find_all('tr', limit=2)[1]

# All tr tags with class "even" (two equivalent forms)
trs = soup.find_all('tr', class_='even')
trs = soup.find_all('tr', attrs={'class': 'even'})
print(trs)

# All a tags with both id "test" and class "test" (two equivalent forms)
aList = soup.find_all('a', id='test', class_='test')
aList = soup.find_all('a', attrs={'id': 'test', 'class': 'test'})
print(aList)

# The href attribute of every a tag, read in two ways
aList = soup.find_all('a')
for a in aList:
    href = a['href']
    print(href)
    href = a.attrs['href']
    print(href)

# .string is None when a tag has more than one child, as these tr tags do
trs = soup.find_all('tr')
for tr in trs:
    print(tr)
    print(tr.string)

# stripped_strings yields every nested string with surrounding whitespace removed
trs = soup.find_all('tr')
for tr in trs:
    print(list(tr.stripped_strings))
find returns the first matching tag (or None if nothing matches); find_all returns all matching tags as a list.
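A small sketch of the difference between find and find_all. The sample HTML and variable names are made up for illustration, and 'html.parser' is used here instead of 'lxml' only so the snippet runs without an extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the notes above
html = """
<table>
<tr><td>1</td><td>2</td></tr>
<tr class="even"><td>3</td><td>4</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

first_tr = soup.find("tr")          # a single Tag: the first match
all_trs = soup.find_all("tr")       # a list-like result holding every match
missing = soup.find("span")         # no match: find gives None
none_found = soup.find_all("span")  # no match: find_all gives an empty result

print(first_tr)
print(len(all_trs))
print(missing)
print(len(none_found))
```

Because find can return None, it is usually worth checking the result before reading attributes from it.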
select
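from bs4 import BeautifulSoup

html = ''  # placeholder: put the HTML to parse here
soup = BeautifulSoup(html, 'lxml')
p = soup.select('p')             # by tag name
p = soup.select('.className')    # by class
p = soup.select('#idName')       # by id
p = soup.select('.box p')        # p tags anywhere inside .box
p = soup.select('.box>p')        # p tags that are direct children of .box
p = soup.select('a[name="a"]')   # by attribute value
p = soup.select('div.line')      # div tags with class "line"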
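Since the snippet above parses an empty string, every select call there returns an empty list. The sketch below runs the same kinds of selectors against a small made-up document (the markup and variable names are assumptions for illustration; 'html.parser' is used so nothing beyond bs4 is needed).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup chosen to exercise each selector
html = """
<div class="box">
  <p id="intro">direct child</p>
  <div class="line"><p>nested</p></div>
</div>
<a name="a" href="#top">anchor</a>
"""
soup = BeautifulSoup(html, "html.parser")

all_p = soup.select("p")                 # every p tag
descendants = soup.select(".box p")      # p tags anywhere inside .box
direct = soup.select(".box>p")           # p tags that are direct children of .box
by_id = soup.select("#intro")            # tag with id "intro"
by_attr = soup.select('a[name="a"]')     # a tags whose name attribute is "a"
by_tag_class = soup.select("div.line")   # div tags with class "line"

print(len(all_p), len(descendants), len(direct))
print(by_id[0].get_text())
print(by_attr[0]["href"])
print(len(by_tag_class))
```

select always returns a list, even for an id selector, so a single result is read with index 0.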
Four commonly used objects
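Tag: every tag in BeautifulSoup is a Tag, and the BeautifulSoup object itself is essentially a Tag too, so methods such as find() and find_all() are not defined by BeautifulSoup but by Tag.
NavigableString: inherits from Python's str and is used the same way as a Python str.
Comment: a subclass of NavigableString.
BeautifulSoup: inherits from Tag. It is used to build the BeautifulSoup tree.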
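The relationships between the four types can be checked with isinstance. This is a minimal sketch on a made-up one-line document; 'html.parser' is an assumption made so that it runs with bs4 alone.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup: one p tag holding a string and a comment
soup = BeautifulSoup("<p>hello<!-- note --></p>", "html.parser")
p = soup.p
text, comment = p.contents  # the p tag's two direct children

print(isinstance(soup, Tag))                 # the soup object is itself a Tag
print(type(p).__name__)                      # Tag
print(isinstance(text, str))                 # NavigableString behaves as a str
print(text.upper())                          # so str methods work on it
print(isinstance(comment, NavigableString))  # Comment is a NavigableString
print(isinstance(text, Comment))             # ordinary text is not a Comment
```

The last check is the usual way to tell comments apart from real text when walking a tree.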
Traversal
Both attributes return the direct children of a tag, including strings.
contents: returns a list
children: returns an iterator
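A short sketch comparing contents and children on a made-up div (the markup and names are illustrative assumptions; 'html.parser' is used so that only bs4 is required). Note that the nested b tag is not a direct child, so it appears in neither.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a div with one string child and two p children
soup = BeautifulSoup(
    "<div>text<p>one</p><p>two <b>bold</b></p></div>", "html.parser"
)
div = soup.div

as_list = div.contents      # a real list: supports len() and indexing
print(len(as_list))
print(repr(as_list[0]))

as_iter = div.children      # an iterator: consume it with a loop or list()
for child in as_iter:
    print(repr(child))
```

Use contents when an index or a length is needed, and children when the children are only looped over once.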