Scraping Baidu Search Results with Python

Original

alihonglong

2020-02-20 14:23

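A Python learning exercise: scrape the first eight pages of Baidu search results for a keyword and save the title, link, and description of each entry.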

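Environment:

1. python-3.3.2, with the environment encoding set to utf-8

2. beautifulsoup4-4.1.0

Notes:

1. Put the keywords to search for in searchfile.txt, in the same directory as the script file, one keyword per line.

2. The search results are written to the data folder in the same directory, one file per keyword.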
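The input and output convention described above can be sketched on its own, without any network access. This is a minimal illustration, not part of the original script: the helper names `read_keywords` and `output_path` are hypothetical, and skipping blank lines is an assumption about how a stray empty line in searchfile.txt should be treated.

```python
import os


def read_keywords(path, encoding="utf-8"):
    """Return the non-empty lines of the keyword file, with whitespace stripped."""
    with open(path, encoding=encoding) as fp:
        return [line.strip() for line in fp if line.strip()]


def output_path(keyword, out_dir="data"):
    """Return the result file path for one keyword: <out_dir>/<keyword>.txt."""
    return os.path.join(out_dir, keyword + ".txt")
```

Calling `read_keywords("searchfile.txt")` gives the list of keywords, and `output_path(keyword)` gives the file each keyword's results would go to. Using `os.path.join` avoids hard-coding a Windows path separator.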

Script:

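#coding:utf-8
import os
import time
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup  # older versions: from BeautifulSoup import BeautifulSoup

PAGE_COUNT = 8          # number of result pages to fetch per keyword
RESULTS_PER_PAGE = 10   # the pn parameter is a result offset (0, 10, 20, ...), not a page number

# Function 1: fetch one page of search results for a keyword
def baidu_search(key_words, pagenum):
    # Percent-encode the keyword so that non-ASCII keywords produce a valid URL
    url = ('http://www.baidu.com/s?wd=' + urllib.parse.quote(key_words)
           + '&pn=' + str(pagenum * RESULTS_PER_PAGE))
    html = urllib.request.urlopen(url).read()
    return html

# Function 2: process one search keyword
def deal_key(key_words):
    if not os.path.exists('data'):
        os.mkdir('data')
    filename = os.path.join('data', key_words + '.txt')
    # Opened in binary mode: with 'w' the writes below would need str values,
    # the page text would need an encoding conversion, and some irregular
    # whitespace characters still caused errors.
    try:
        fp = open(filename, 'wb')
    except OSError:
        print('Failed to open file: ' + filename)
        return
    with fp:
        for x in range(PAGE_COUNT):
            htmlpage = baidu_search(key_words, x)
            soup = BeautifulSoup(htmlpage, 'html.parser')
            # These selectors follow Baidu's result page layout
            for item in soup.find_all("div", {"class": "result"}):
                a_click = item.find('a')
                if a_click:
                    fp.write(a_click.get_text().encode('utf-8'))           # title
                fp.write(b'#')
                if a_click:
                    fp.write((a_click.get("href") or '').encode('utf-8'))  # link
                fp.write(b'#')
                c_abstract = item.find("div", {"class": "c-abstract"})
                if c_abstract:
                    fp.write(c_abstract.get_text().encode('utf-8'))        # description
                fp.write(b'#')
            fp.write(b'\n')  # end the line after each page of results

# Function 3: read the search file and process each keyword in turn
def search_file():
    with open('searchfile.txt', encoding='utf-8') as fp:
        i = 0
        for line in fp:
            keyword = line.strip()
            if not keyword:
                continue  # skip blank lines
            i = i + 1
            if i == 5:
                # Pause before every fifth keyword to avoid hitting the site too quickly
                print('sleep...')
                time.sleep(15)
                print('end...')
                i = 0
            deal_key(keyword)

# Script entry point
if __name__ == '__main__':
    print('Start:')
    search_file()
    print('End!')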
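The request URL is the part of the script most likely to go wrong, so here is a small standalone sketch of building it with the standard library. It rests on two assumptions that are not stated in the post: that `pn` is a result offset advancing in steps of 10 per page, and that keywords may contain spaces or non-ASCII characters that need percent-encoding. The name `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode

RESULTS_PER_PAGE = 10  # assumption: each result page holds 10 entries


def build_search_url(keyword, page_index):
    """Build the search URL for a zero-based page index."""
    # urlencode percent-encodes the keyword as UTF-8 and joins the parameters with '&'
    query = urlencode([("wd", keyword), ("pn", page_index * RESULTS_PER_PAGE)])
    return "http://www.baidu.com/s?" + query


# URLs for the first eight pages of one keyword
urls = [build_search_url("python 爬虫", page) for page in range(8)]
```

Passing the parameters as a list of pairs keeps `wd` before `pn` on every Python 3 version, including ones where dictionary order is not guaranteed.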
