Scraping Nowcoder Problems with a Python Web Crawler

Original

Curren.wong

2020-05-18 15:01

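Contents

1. Background
2. Preparation
3. Fetching the page content
4. Processing the content
4.1. Limit
4.2. Problem Description
4.3. Input
4.4. Output
4.5. Sample Input & Output
4.6. Note
4.7. Source
5. Output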

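1. Background

I have been writing solution write-ups recently, and copying problems from Nowcoder (牛客網) by hand is tedious because the math formulas do not paste cleanly. So I scraped the problem information with Python's selenium, urllib.request and BeautifulSoup4 libraries, which saved a lot of time when writing the solutions.

Others may run into the same problem, so here are my notes.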

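2. Preparation

Install selenium and BeautifulSoup, plus the lxml parser that BeautifulSoup is asked to use later. urllib is part of the Python standard library and does not need to be installed with pip.

pip3 install selenium
pip3 install beautifulsoup4
pip3 install lxml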
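
Before going further, it can help to confirm that the packages are importable. The following is a minimal sketch; the helper name `missing_modules` is made up for this example. Note that the import name can differ from the pip name: beautifulsoup4 is imported as bs4.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Import names, not pip names: beautifulsoup4 is imported as bs4.
required = ["urllib", "selenium", "bs4", "lxml"]
# An empty list means everything needed by this article is available.
print(missing_modules(required))
```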

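3. Fetching the page content

Take Nowcoder problem NC204552 咪咪遊戲 as the example.

# Import libraries
import urllib.request
import bs4
import time
from bs4 import BeautifulSoup
from selenium import webdriver

# Problem attributes
problemId = "204552"
# Open a browser so that we can simulate logging in
# Chrome is used here; if it is not installed, substitute another supported browser
driver = webdriver.Chrome()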
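
The login step needs an account name and a password. Rather than typing them into the script, one option is to read them from environment variables. This is only a sketch: the variable names `NOWCODER_USER` and `NOWCODER_PASSWORD` and the helper `load_credentials` are hypothetical names chosen for this example.

```python
import os


def load_credentials(env=os.environ):
    """Read the account name and password from a mapping of environment variables.

    NOWCODER_USER and NOWCODER_PASSWORD are hypothetical names used in this sketch.
    """
    user = env.get("NOWCODER_USER")
    password = env.get("NOWCODER_PASSWORD")
    if not user or not password:
        raise RuntimeError("set NOWCODER_USER and NOWCODER_PASSWORD first")
    return user, password


# Demonstration with a plain dict standing in for the real environment
demo_env = {"NOWCODER_USER": "alice", "NOWCODER_PASSWORD": "secret"}
print(load_credentials(demo_env))
```

The returned values can then be passed to `send_keys` in place of the placeholder strings.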

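Fetch the page content

# Fetch the page content
# find_element(By...) is used here because the older find_element_by_* helpers
# were removed in recent Selenium 4 releases
from selenium.webdriver.common.by import By

# Problem URL
url = f"https://ac.nowcoder.com/acm/problem/{problemId}"
# Open the page
driver.get(url)
# Wait for the page to load
time.sleep(3)
# Find the username and password inputs and fill them in
username = driver.find_element(By.ID, 'jsEmailIpt')
# Enter the account name; replace xxx with your own account name
username.send_keys('xxx')

time.sleep(1)
password = driver.find_element(By.ID, 'jsPasswordIpt')
# Enter the password; replace xxx with your own password
password.send_keys('xxx')

time.sleep(1)
# Inspect the page to locate the login button
login = driver.find_elements(By.CSS_SELECTOR, 'div[class=col-input-login] a')[0]
# Click the button
login.click()

time.sleep(3)
# Parse the page source
soup = BeautifulSoup(driver.page_source,'lxml')
# Quit the browser
driver.quit()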
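
The fixed `time.sleep` calls above either wait longer than necessary or not long enough on a slow connection. A common alternative is to poll for a condition until it holds. Selenium provides `WebDriverWait` in `selenium.webdriver.support.ui` for this; the sketch below shows the same idea in plain Python so that it runs without a browser. The helper `wait_until` and the fake condition `form_is_ready` are made up for this illustration.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for a check such as "the login form is present":
# this fake condition becomes true on the third call.
calls = {"count": 0}


def form_is_ready():
    calls["count"] += 1
    return calls["count"] >= 3


wait_until(form_is_ready, timeout=2.0, interval=0.01)
print(calls["count"])  # 3
```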

Storage and preprocessing

# Storage
data_dict = {}
# Find the main content
mainContent = soup.find_all(name="div", attrs={"class" :"terminal-topic"})[0]

# Remove the duplicated HTML elements of each formula
for each in mainContent.find_all('mrow'):
    each.decompose()
for each in mainContent.find_all(name="span", attrs={"class" :"katex-html"}):
    each.decompose()
# Replace <br> tags with line breaks
for each in mainContent.find_all('br'):
    each.replace_with("\n\n")
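
To see why the duplicated formula elements are removed, here is a small self-contained sketch. The HTML string is a simplified, hand-written imitation of KaTeX output (an assumption, not markup copied from Nowcoder), and it uses the built-in `html.parser` so that it runs without lxml. The same formula text appears three times before the cleanup and once after it.

```python
from bs4 import BeautifulSoup

# Simplified imitation of a KaTeX-rendered formula "s" inside a sentence
html = (
    '<p>string <span class="katex">'
    '<span class="katex-mathml"><math><semantics>'
    '<mrow><mi>s</mi></mrow>'
    '<annotation encoding="application/x-tex">s</annotation>'
    '</semantics></math></span>'
    '<span class="katex-html" aria-hidden="true"><span class="mord">s</span></span>'
    '</span></p>'
)
soup = BeautifulSoup(html, "html.parser")
before = soup.get_text()

# Same cleanup as above: drop the MathML rows and the visual rendering,
# keeping only the TeX source stored in <annotation>
for tag in soup.find_all("mrow"):
    tag.decompose()
for tag in soup.find_all(name="span", attrs={"class": "katex-html"}):
    tag.decompose()
after = soup.get_text()

print(repr(before))  # 'string sss'
print(repr(after))   # 'string s'
```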

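4. Processing the content

4.1. Limit

Start with the simpler information: the problem title, the time limit and the memory limit.

# Limit
# Find the problem title, time limit and memory limit
div = mainContent.find_all(name="div", attrs={"class":"subject-item-wrap"})[0].find_all("span")
# Store them in the dictionary
data_dict['Title'] = f"牛客網 NC{problemId} " + soup.title.contents[0]
# Time Limit (the text after the full-width colon)
data_dict['Time Limit'] = div[0].contents[0].split('：')[1]
# Memory Limit (the text after the full-width colon)
data_dict['Memory Limit'] = div[1].contents[0].split('：')[1]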
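
The `split('：')[1]` calls above raise an `IndexError` if a label ever lacks the full-width colon. A slightly more forgiving sketch uses `str.partition`. The helper name `limit_value` and the sample label text are assumptions made for this example; the real label on the page may differ.

```python
def limit_value(label_text: str) -> str:
    """Return the part after the first full-width colon, or the whole text if there is none."""
    _, sep, value = label_text.partition("：")
    return value.strip() if sep else label_text.strip()


# Assumed example of a label as it might appear on the page
print(limit_value("時間限制：C/C++ 1秒，其他語言2秒"))  # C/C++ 1秒，其他語言2秒
# Without a colon, the text is returned unchanged instead of raising
print(limit_value("no colon here"))  # no colon here
```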

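Define a function that cleans up the odd whitespace and the formula delimiters in the main content.

def divTextProcess(div):
    """
    Process the text content of a <div> tag
    """
    # Get the text
    strBuffer = div.get_text()
    # Replace the formula markers with $...$
    strBuffer = strBuffer.replace("{", " $").replace("}", "$ ")
    # Remove runs of two spaces
    strBuffer = strBuffer.replace("  ", "")
    # Reduce three consecutive line breaks to one
    strBuffer = strBuffer.replace("\n\n\n", "\n")
    # Remove the spaces that the page encodes as \xa0
    strBuffer = strBuffer.replace("\xa0", "")
    # Strip leading and trailing whitespace
    strBuffer = strBuffer.strip()
    # Return the result
    return strBuffer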
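
To try the same kind of cleanup without a browser, the sketch below works on a plain string. It is a variant rather than a copy of `divTextProcess`: it uses regular expressions to collapse runs of spaces into a single space and runs of three or more line breaks into one blank line. The function name `normalize_text` and the sample string are made up for this example.

```python
import re


def normalize_text(text: str) -> str:
    """Regex-based variant of the cleanup, operating on a plain string."""
    # Turn the braces around formulas into $...$ markers
    text = text.replace("{", " $").replace("}", "$ ")
    # Drop non-breaking spaces
    text = text.replace("\xa0", "")
    # Collapse runs of two or more spaces into a single space
    text = re.sub(r" {2,}", " ", text)
    # Collapse runs of three or more line breaks into one blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "For example {mqmq}  is good,\xa0{mqmqm} is not.\n\n\n\nOutput {Yes} or {No}"
print(normalize_text(sample))
```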

4.2. Problem Description

Get the problem description.

# Process the problem description
div = mainContent.find_all(name="div", attrs={"class": "subject-question"})[0]
data_dict['Problem Description'] = divTextProcess(div)

4.3. Input

The input description is the first <pre> element.

div = mainContent.find_all(name="pre")[0]
data_dict['Input'] = divTextProcess(div)

4.4. Output

The output description is the second <pre> element.

div = mainContent.find_all(name="pre")[1]
data_dict['Output'] = divTextProcess(div)
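
Indexing the result of `find_all` directly raises an `IndexError` when a problem page has fewer matching elements than expected. A small guard like the following sketch avoids that; `nth_or_default` is a hypothetical helper, and the list below merely stands in for the result of `find_all("pre")`.

```python
def nth_or_default(items, index, default=None):
    """Return items[index], or default when the index is out of range."""
    try:
        return items[index]
    except IndexError:
        return default


# Stand-in for the list returned by mainContent.find_all(name="pre")
pre_blocks = ["input description", "output description"]
print(nth_or_default(pre_blocks, 1))                       # output description
print(nth_or_default(pre_blocks, 4, default="(missing)"))  # (missing)
```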

4.5. Sample Input & Output

The sample input and sample output are each wrapped in a code fence.

# Input
div = mainContent.find_all(name="div", attrs={"class":"question-oi-cont"})[0]
data_dict['Sample Input'] = "```cpp" + div.get_text() + '```'
# Output
div = mainContent.find_all(name="div", attrs={"class":"question-oi-cont"})[1]
data_dict['Sample Output'] = "```cpp" + div.get_text() + '```'
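
The concatenation above only yields a valid fence when the scraped text happens to begin and end with a line break. The sketch below makes that explicit by trimming the text and adding the line breaks itself. The helper name `fenced` is an assumption for this example, and the fence string is built with `"`" * 3` only to keep three literal backticks out of this listing.

```python
FENCE = "`" * 3  # three backticks


def fenced(text: str, lang: str = "cpp") -> str:
    """Wrap text in a Markdown code fence, with the fence markers on their own lines."""
    body = text.strip("\n")
    return f"{FENCE}{lang}\n{body}\n{FENCE}"


print(fenced("\n4\nmqmq\nmqmqm\n"))
```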

4.6. Note

The note, if the problem has one.

# If there is a note
if len(mainContent.find_all(name="pre")) >= 5:
    div = mainContent.find_all(name="pre")[-1]
    data_dict['Note'] = divTextProcess(div)

4.7. Source

A link to the problem.

data_dict['Source'] = '[' + data_dict['Title'] + ']' + '(' + url + ')'

5. Output

for each in data_dict.keys():
    print('### ' + each + '\n')
    print(data_dict[each].replace("\n\n**", "**").replace("**\n\n", "**") + '\n')
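
Instead of printing, the sections can also be assembled into one Markdown string and saved. This is a minimal sketch under stated assumptions: `render_markdown` is a hypothetical helper, the dictionary below is a shortened stand-in for `data_dict`, and the file name in the final comment is only an example.

```python
def render_markdown(data: dict) -> str:
    """Render each key as a level-3 heading followed by its text."""
    sections = []
    for heading, body in data.items():
        sections.append(f"### {heading}\n\n{body.strip()}\n")
    return "\n".join(sections)


# Shortened stand-in for data_dict
example = {"Title": "NC204552", "Time Limit": "C/C++ 1 second"}
markdown = render_markdown(example)
print(markdown)
# To save it, for example:
# with open("problem.md", "w", encoding="utf-8") as f:
#     f.write(markdown)
```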

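Below is the final output.

### Title

牛客網 NC204552 咪咪遊戲

### Time Limit

C/C++ 1 second, other languages 2 seconds

### Memory Limit

C/C++ 524288K, other languages 1048576K

### Problem Description

Niuniu has recently become fond of the 咪咪 game, so he wrote a program and made up a game for Niumei to play. The game goes like this:

Niuniu has a long string (containing only the 26 lowercase letters), and he wants Niumei to determine whether the string is good.

A string is defined as good if it is formed by concatenating consecutive copies of mq.

For example, $mqmq$ is good, while $mqmqm$ and $mqmqx$ are not. Now Niuniu wants to ask Niumei whether the string is good: output $Yes$ if it is, otherwise output $No$

### Input

The first line contains an integer Q, the number of queries

The next Q lines each contain a string $s$

### Output

Q lines, each containing $Yes$ or $No$

### Sample Input

// A ```cpp code fence appears here; it has been removed to make the output easier to display
4
mqmq
mqmqm
mqakioi
mqqmmq

### Sample Output

Yes
No
No
No

### Note

For $60\%$ of the data: $|s|<=10,Q<=10$, and only the characters m and q appear

For $100\%$ of the data: $|s| <=10^5,Q<=10$

For all data, only the 26 lowercase English letters appear

### Source

[牛客網 NC204552 咪咪遊戲](https://ac.nowcoder.com/acm/problem/204552)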

Contact email: [email protected]

Github：https://github.com/CurrenWong

Reposts, Stars and Forks are welcome. If you have questions, feel free to reach me by email.
