Python: scraping Baidu search result names and URLs

Original

TimerBin

2020-07-07 16:00

Python beginner's notes: given a search keyword, use Python to fetch the name and URL of each search result.

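1. Baidu search parameters

pn: offset of the first result to return (0 starts at the first page; with 10 results per page, 10 starts at the second page)

cl: search type; 3 is web search, 2 is news search

wd: search keyword

rn: number of results to return

For details, see: http://blog.sina.com.cn/s/blog_3e28c8a50102v0ck.html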
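As a small illustration of how these parameters combine into a request URL, here is a sketch using `urllib.parse.urlencode`. The helper name `build_baidu_url` and its argument names are made up for this example, and treating `pn` as a result offset is an assumption about how Baidu interprets it.

```python
from urllib.parse import urlencode

def build_baidu_url(keyword, offset=0, count=10, search_type=3):
    """Return a Baidu search URL (hypothetical helper, not from the article)."""
    params = {
        "pn": offset,       # assumed to be the offset of the first result
        "cl": search_type,  # 3 = web search, as listed above
        "rn": count,        # number of results requested
        "wd": keyword,      # keyword; urlencode percent-encodes non-ASCII text
    }
    return "http://www.baidu.com/s?" + urlencode(params)

print(build_baidu_url("timerbin", count=1))
# http://www.baidu.com/s?pn=0&cl=3&rn=1&wd=timerbin
```

Building the query from a dictionary keeps each parameter in one place and handles the encoding of Chinese keywords automatically.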

2. Define the Python function

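# coding: UTF-8
import urllib.request
import urllib.parse
import re

# searchName: search keyword; number: how many results to request
def findBaiduUrlList(searchName, number):
    # Build the Baidu search request URL
    url = "http://www.baidu.com/s?pn=0&cl=3&rn=" + str(number)
    # URL-encode the keyword (needed for Chinese and other non-ASCII text)
    url = url + "&wd=" + urllib.parse.quote(searchName)
    # Send the request and get the response
    response = urllib.request.urlopen(url)
    # Decode the response body as UTF-8
    html = response.read().decode('utf-8')
    # Split the page at each opening <h3 ...> tag; each result title follows one
    splitPattern = re.compile(r'<h3 \D*">')
    requestList = re.split(splitPattern, html)

    # Compile the search patterns once, outside the loop
    urlPattern = re.compile(r'http://www\.baidu\.com/link.{0,300}target="_blank"')
    namePattern = re.compile(r'target="_blank"\s*>.{0,40}</a>')

    myUrl = []
    for c in requestList:
        # Remove line breaks and surrounding whitespace
        c = trim(c)
        if c.startswith('<a'):
            # One result entry: [link name, link URL]
            urlObj = ['', '']
            urlsMatch = urlPattern.search(c)
            if urlsMatch:
                urlObj[1] = trims(urlsMatch.group())

            nameMatch = namePattern.search(c)
            if nameMatch:
                urlObj[0] = trims(nameMatch.group())

            myUrl.append(urlObj)
    return myUrl

# Remove leftover markup from a matched fragment
def trims(text):
    # Use replace(): strip('target="_blank"') would treat the argument as a
    # set of characters and could eat the ends of the URL or name
    text = text.replace('target="_blank"', '')
    text = text.replace('<em>', '').replace('</em>', '').replace('</a>', '')
    # Note: this also removes spaces inside the link name
    text = text.replace('>', '').replace('"', '').replace(' ', '').replace('\t', '')
    return trim(text)

# Remove line breaks and surrounding whitespace
def trim(text):
    text = text.replace('\n', '').strip()
    return text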

For Python regular expressions, see: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
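To see how the split-then-search approach behaves without sending a request, the sketch below runs it on a made-up HTML fragment. It is a variation rather than the function above: it uses regex capture groups instead of cleaning up the match afterwards, and it assumes the link sits in an `href` attribute. `SAMPLE_HTML` and `extract_results` are hypothetical names, and real Baidu markup may differ.

```python
import re

# Made-up fragment imitating the markup the function expects
SAMPLE_HTML = (
    '<div><h3 class="t"><a href="http://www.baidu.com/link?url=abc123" '
    'target="_blank">Timer<em>Bin</em> blog</a></h3></div>'
)

SPLIT_PATTERN = re.compile(r'<h3 \D*">')  # same split pattern as the article
URL_PATTERN = re.compile(r'href="(http://www\.baidu\.com/link[^"]*)"')
NAME_PATTERN = re.compile(r'target="_blank"\s*>(.*?)</a>', re.S)

def extract_results(html):
    """Return [name, url] pairs found after each <h3 ...> tag."""
    results = []
    for chunk in SPLIT_PATTERN.split(html):
        chunk = chunk.strip()
        if not chunk.startswith('<a'):
            continue  # text before the first <h3> is not a result
        url_match = URL_PATTERN.search(chunk)
        name_match = NAME_PATTERN.search(chunk)
        url = url_match.group(1) if url_match else ''
        # Drop the <em> highlighting tags but keep spaces inside the name
        name = re.sub(r'</?em>', '', name_match.group(1)).strip() if name_match else ''
        results.append([name, url])
    return results

print(extract_results(SAMPLE_HTML))
# [['TimerBin blog', 'http://www.baidu.com/link?url=abc123']]
```

Capture groups return only the part of the match that is wanted, so no follow-up string cleanup is needed for the quotes and attribute text.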

3. Calling the function

myUrl = findBaiduUrlList('timerbin',1)

for c in myUrl:
    print(c)
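If the call returns an empty list, one thing worth checking is whether the site serves different markup to urllib's default client. That is an assumption, not something stated in these notes. The sketch below builds a request that carries a browser-like `User-Agent` header without sending it; `build_request` and the header value are hypothetical.

```python
import urllib.parse
import urllib.request

def build_request(search_name, number):
    """Build (but do not send) a search request with a custom User-Agent."""
    url = ("http://www.baidu.com/s?pn=0&cl=3&rn=" + str(number)
           + "&wd=" + urllib.parse.quote(search_name))
    # Example header value only; any real browser string could be used instead
    headers = {"User-Agent": "Mozilla/5.0 (compatible; example-script)"}
    return urllib.request.Request(url, headers=headers)

req = build_request("timerbin", 1)
print(req.full_url)
print(req.get_header("User-agent"))  # Request stores header names capitalized
```

The resulting object can be passed to `urllib.request.urlopen(req, timeout=10)` in place of the plain URL string.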

4. Output

['link name', 'link URL']

['TimerBin的博客-ITeye技術網站', 'http://www.baidu.com/link?url=Rvj1VAmkb6527AEXIMQnSKSRFvy4jT0BAYnHjw3Gu4npAccEysMnyRi0fj3Ziwqr']
