Python Web Scraping Examples with the bs4 Library

Original

2020-06-28 01:48

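Exercise: from http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html, extract the university ranking data: rank, school name, province/city, total score, and every indicator score, namely student quality (freshman gaokao score), training outcome (graduate employment rate), social reputation (donation income, thousand yuan), research scale (paper count), research quality (paper quality, FWCI), top results (highly cited papers), top talent (highly cited scholars), technology services (industry research funding, thousand yuan), knowledge transfer (technology-transfer income, thousand yuan), and student internationalization (share of international students). Save the scraped data to "大學排名.csv" in the current directory.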

import csv

import numpy
import requests
from bs4 import BeautifulSoup

url1 = "http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html"
# bytes.decode() defaults to UTF-8
html1 = requests.get(url1).content.decode()

# The 'lxml' parser needs the lxml package to be installed
soup = BeautifulSoup(html1, 'lxml')
tag = soup.find(class_='table table-small-font table-bordered table-striped')
text1 = tag.find_all('th')[0:4]  # the first four header cells
text2 = tag.find_all('option')   # indicator names, taken from the <option> tags
text3 = tag.find_all('td')       # all data cells, in document order

# Header row: four fixed columns plus one column per indicator
th = [a.string for a in text1 + text2]
# Regroup the flat cell list into rows of 14 columns
td = [a.string for a in text3]
td = numpy.array(td).reshape(-1, 14)

with open('大學排名.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(th)
    for a in td:
        print(a)
        writer.writerow(a)
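
The reshape step above turns one flat list of cell texts into rows of a fixed width. As a minimal sketch of the same idea without numpy, the hypothetical helper chunk_rows below slices the list directly and fails loudly when the cell count does not divide evenly. The cell values and column names are made up for illustration and are not taken from the site.

```python
import csv
import io


def chunk_rows(cells, width):
    """Split a flat list of cell values into rows of `width` cells each."""
    if len(cells) % width != 0:
        raise ValueError(f"{len(cells)} cells do not divide into rows of {width}")
    return [cells[i:i + width] for i in range(0, len(cells), width)]


# Made-up values standing in for the text of six <td> cells (two rows of three).
cells = ["1", "School A", "94.6", "2", "School B", "76.5"]
rows = chunk_rows(cells, 3)

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["rank", "name", "score"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Checking the length first surfaces a changed page layout immediately, rather than as a shifted column in the CSV.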

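Exercise: from https://www.dxsbb.com/news/5463.html, extract the university ranking data: rank, school name, overall score, star rating, and school tier. Save the scraped data to "大學排名校友會版.csv" in the current directory.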

import csv

import numpy
import requests
from bs4 import BeautifulSoup

url1 = "https://www.dxsbb.com/news/5463.html"
# This page is decoded as GBK rather than the UTF-8 default
html1 = requests.get(url1).content.decode('gbk')

soup = BeautifulSoup(html1, 'html.parser')
# The ranking cells are read from the second <tbody> on the page
text1 = soup.find_all('tbody')[1].find_all('td')

# Regroup the flat cell list into rows of 5 columns
td = [a.text for a in text1]
td = numpy.array(td).reshape(-1, 5)

with open('大學排名校友會版.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for a in td:
        print(a)
        writer.writerow(a)
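
The script above hard-codes GBK because that is what this page happens to use. When the encoding is not known in advance, one possible approach is to try a short list of candidates in order. The helper decode_page below is hypothetical and only a sketch: it assumes that GBK-encoded Chinese text is not also valid UTF-8, which is usually but not always true.

```python
def decode_page(raw, encodings=("utf-8", "gbk")):
    """Return `raw` decoded with the first encoding that accepts it."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing matched: fall back to replacement characters instead of failing.
    return raw.decode(encodings[0], errors="replace")


# These byte strings stand in for requests.get(url).content.
gbk_bytes = "大学排名".encode("gbk")
utf8_bytes = "大学排名".encode("utf-8")

from_gbk = decode_page(gbk_bytes)    # UTF-8 fails, GBK succeeds
from_utf8 = decode_page(utf8_bytes)  # UTF-8 succeeds on the first try
```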

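Exercise:

(1) Open http://dianying.2345.com/list/----2019---.html, click "next page" at the bottom of the page, and note how the URL changes.

(2) Scrape the title, cast, and score of every film on the first page.

(3) Turn the pattern from step (2) into a function and use a loop to scrape the film data from every page.

(4) Save the scraped data to "最新電影信息.csv".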
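
The URL change that step (1) asks about can be captured in a small helper. This is a sketch only: page_urls is a hypothetical name, and the prefix and suffix are copied from the script in this post rather than verified against the live site.

```python
def page_urls(first_page, last_page,
              prefix="http://dianying.2345.com/list/----2019---",
              suffix=".html"):
    """Build the list-page URLs for first_page..last_page, inclusive."""
    return [f"{prefix}{page}{suffix}" for page in range(first_page, last_page + 1)]


urls = page_urls(1, 3)
for url in urls:
    print(url)
```

Making both bounds inclusive avoids the off-by-one that range() invites: range(1, 30) stops at page 29.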

import csv

import requests
from bs4 import BeautifulSoup

film_list = []
for page in range(1, 30):  # pages 1 to 29
    url = "http://dianying.2345.com/list/----2019---" + str(page) + ".html"
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')

    filename_tag = soup.find_all('em', class_='emTit')            # film titles
    score_tag = soup.find_all('span', {"class": "pRightBottom"})  # scores
    star_tag = soup.find_all('span', {"class": "sDes"})           # cast lines

    # A separate index keeps the page counter from being shadowed
    for i in range(len(filename_tag)):
        tag = star_tag[i]
        if tag.em is not None:
            # Drop the label before the full-width colon, then split the names
            temp = tag.text.strip().split("：")[1].split("\xa0\xa0\xa0")
        else:
            temp = ['無']  # '無' ("none") marks a film with no cast listed
        film_list += [[filename_tag[i].text, score_tag[i].em.text] + temp]

# File name as required by step (4)
with open('最新電影信息.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    # Columns: title, score, then one column per cast member
    for a in film_list:
        print(a)
        writer.writerow(a)
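
The cast line is the most fragile part of the script above: split("：")[1] raises IndexError if a line has no full-width colon. A more defensive version might look like the sketch below. The helper parse_actors and the sample strings are hypothetical; the separator of three non-breaking spaces is the one the script already uses.

```python
def parse_actors(text, sep="\xa0\xa0\xa0", missing="無"):
    """Split a 'label：name   name' line into a list of names."""
    # partition() never raises; `colon` is empty when no colon is present.
    _, colon, names = text.strip().partition("：")
    actors = [name.strip() for name in names.split(sep) if name.strip()]
    if not colon or not actors:
        return [missing]
    return actors


two_names = parse_actors("主演：甲\xa0\xa0\xa0乙")
no_names = parse_actors("主演：")
no_colon = parse_actors("暂无")
```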

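Exercise:

(1) Open http://www.zhcw.com/ssq/kaijiangshuju/index.shtml?type=0 and use the browser's "Inspect" option to work out where the page loads its data from.

(2) Scrape pages 1-150 and save the draw date, issue number, winning numbers, sales, first prize, and second prize of every draw to a CSV file.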
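
Fetching 150 pages in a row is where a single failed request or an impatient loop starts to matter. The sketch below shows one way to pause between requests and record failures instead of stopping. The function crawl and the stand-in fake_fetch are hypothetical; in a real run the fetch argument could be something like lambda u: requests.get(u, timeout=10).text.

```python
import time


def crawl(urls, fetch, delay=0.0):
    """Fetch each URL in order; return (pages, failed_urls)."""
    pages, failed = [], []
    for index, url in enumerate(urls):
        if index and delay:
            time.sleep(delay)  # pause between requests to go easy on the server
        try:
            pages.append(fetch(url))
        except Exception:  # a real script would catch requests.RequestException
            failed.append(url)
    return pages, failed


def fake_fetch(url):
    """Stand-in for a network call; the second page fails on purpose."""
    if url.endswith("list_2.html"):
        raise IOError("simulated network error")
    return f"<html>{url}</html>"


urls = ["http://kaijiang.zhcw.com/zhcw/html/ssq/list_%s.html" % i for i in range(1, 4)]
pages, failed = crawl(urls, fake_fetch)
print(len(pages), failed)
```

Passing the fetch function in as an argument keeps the loop testable without touching the network.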

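import csv

import requests
from bs4 import BeautifulSoup

form = []
for i in range(1, 151):  # pages 1 to 150, as the exercise requires
    url1 = "http://kaijiang.zhcw.com/zhcw/html/ssq/list_%s.html" % i
    html1 = requests.get(url1).text
    soup = BeautifulSoup(html1, 'html.parser')
    tag = soup.find_all('tr')
    # Skip the first two rows and the last row, which are not draw records
    for a in tag[2:-1]:
        temp = []
        # The first 12 child nodes cover the six wanted cells and the newlines between them
        for b in a.contents[0:12]:
            if b != '\n':
                # Trim the cell text and collapse its internal line breaks
                temp += [b.text.strip().replace('\r\n', '').replace(' ', '').replace('\n', ' ')]
        form.append(temp)

with open('雙色球中獎信息.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    # Header: draw date, issue number, winning numbers, sales (yuan), first prize, second prize
    writer.writerow(['開獎日期', '期號', '中獎號碼', '銷售額(元)', '一等獎', '二等獎'])
    for a in form:
        print(a)
        writer.writerow(a)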
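
The chain of strip() and replace() calls in the script above is easier to follow on a concrete string. The sketch below wraps the same chain in a hypothetical helper, clean_cell, and applies it to a made-up cell whose text is spread over several indented lines.

```python
def clean_cell(text):
    """Apply the same cleanup chain the script uses on each table cell."""
    return (
        text.strip()             # drop leading and trailing whitespace
        .replace("\r\n", "")     # remove Windows line breaks entirely
        .replace(" ", "")        # remove indentation and other spaces
        .replace("\n", " ")      # turn remaining line breaks into single spaces
    )


# Made-up cell text: three numbers on separate lines, padded with whitespace.
numbers = clean_cell("\r\n  01\n02\n03\r\n ")
amount = clean_cell("  1,234,567  ")
print(numbers)
print(amount)
```

The order of the calls matters: spaces are removed before newlines become spaces, so the only spaces left in the result are the ones that separate values.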
