Scraping CodeForces Problems with a Python Web Crawler

1. Background

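I have recently been writing solution write-ups while practicing problems. Copying CodeForces problem statements by hand is tedious because the math formulas need cleanup, so I scraped the problem information with Python's urllib.request and the BeautifulSoup4 library. This saved a lot of time when writing solutions.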

Others may run into the same problem, so I wrote this note to share the approach.

2. Preparation

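Install the BeautifulSoup4 library and the lxml parser used below. urllib is part of the Python standard library and needs no installation.

pip3 install beautifulsoup4
pip3 install lxml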

3. Fetching the Page

We use CodeForces 1353 B. Two Arrays And Swaps as the example.

# Import the libraries
import urllib.request
import bs4
from bs4 import BeautifulSoup

# Problem attributes
problemSet = "1353"
problemId = "B"

# Problem URL
url = f"https://codeforces.com/problemset/problem/{problemSet}/{problemId}"
# Fetch the page content
html = urllib.request.urlopen(url).read()
# Parse the HTML
soup = BeautifulSoup(html, 'lxml')

# Storage for the extracted fields
data_dict = {}
# Find the main content
mainContent = soup.find_all(name="div", attrs={"class" :"problem-statement"})[0]
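The fetch step above can be wrapped in a small helper. The following is a minimal sketch, not the author's code: `build_problem_url`, `build_request`, and `fetch_html` are hypothetical names, and the User-Agent string and the 10-second timeout are assumed values. Sending an explicit User-Agent is a common precaution because some sites reject the default urllib one; this article does not say whether CodeForces does.

```python
import urllib.request

BASE_URL = "https://codeforces.com/problemset/problem"


def build_problem_url(problem_set: str, problem_id: str) -> str:
    """Build the problem page URL from the contest number and problem letter."""
    return f"{BASE_URL}/{problem_set}/{problem_id}"


def build_request(url: str) -> urllib.request.Request:
    """Create a request carrying an explicit User-Agent header (assumed value)."""
    return urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})


def fetch_html(url: str, timeout: float = 10.0) -> bytes:
    """Download the page and return its raw bytes; network errors propagate."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

With these helpers, `fetch_html(build_problem_url("1353", "B"))` would stand in for the `urllib.request.urlopen(url).read()` call, and the timeout keeps the script from hanging on a stalled connection.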

4. Processing the Content

4.1. Limit

Start with the simpler pieces of information: the problem title, the time limit, and the memory limit.

# Limit
# Find the problem title, time limit, and memory limit
# Title
data_dict['Title'] = f"CodeForces {problemSet} " + mainContent.find_all(name="div", attrs={"class":"title"})[0].contents[-1]
# Time Limit
data_dict['Time Limit'] = mainContent.find_all(name="div", attrs={"class":"time-limit"})[0].contents[-1]
# Memory Limit
data_dict['Memory Limit'] = mainContent.find_all(name="div", attrs={"class":"memory-limit"})[0].contents[-1]




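Define a function that cleans up the odd runs of spaces in the main content and the triple dollar signs ($$$) that delimit formulas.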

def divTextProcess(div):
    """
    Process the text content of the <p> tags inside a <div> tag
    """
    strBuffer = ''
    # Process each <p> tag in turn
    for each in div.find_all("p"):
        for content in each.contents:
            # If this is not the first piece, add a blank line
            if (strBuffer != ''):
                strBuffer += '\n\n'
            # Process the piece
            if (type(content) != bs4.element.Tag):
                # Plain text: append it to the string buffer
                strBuffer += content.replace("       ", " ").replace("$$$", "$")
            else:
                # HTML element such as <span>: wrap its text in bold markers
                strBuffer += "**" + content.contents[0].replace("       ", " ").replace("$$$", "$") + "**"
    # Return the result
    return strBuffer
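To make the cleanup concrete, here is a minimal sketch of the same two replacements applied to a plain string, independent of BeautifulSoup. `clean_text` is a hypothetical helper that is not part of the original script. Unlike the fixed seven-space replacement above, it collapses any run of whitespace with a regular expression, which is an assumption about what the page may contain.

```python
import re


def clean_text(text: str) -> str:
    """Collapse whitespace runs and turn $$$...$$$ delimiters into $...$."""
    collapsed = re.sub(r"\s+", " ", text)
    return collapsed.replace("$$$", "$")


raw = r"You are given   an integer $$$k$$$ ($$$1 \le k \le n$$$)."
print(clean_text(raw))
# prints: You are given an integer $k$ ($1 \le k \le n$).
```

The regular expression also collapses newlines inside a text node, so this variant suits paragraph text but not preformatted sample data.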


4.2. Problem Description

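Get the problem description. Its <div> tag has no id or class attribute, so it is selected by position, as the element at index 10 in the list of all <div> tags.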

# Process the problem description
data_dict['Problem Description'] = divTextProcess(mainContent.find_all("div")[10])

4.3. Input

Input specification.

div = mainContent.find_all(name="div", attrs={"class":"input-specification"})[0]
data_dict['Input'] = divTextProcess(div)

4.4. Output

Output specification.

div = mainContent.find_all(name="div", attrs={"class":"output-specification"})[0]
data_dict['Output'] = divTextProcess(div)

4.5. Sample Input & Output

Sample input and output, each wrapped in a code fence.

# Input
div = mainContent.find_all(name="div", attrs={"class":"input"})[0]
data_dict['Sample Input'] = "```cpp" + div.find_all("pre")[0].contents[0] + '```'
# Output
div = mainContent.find_all(name="div", attrs={"class":"output"})[0]
data_dict['Sample Onput'] = "```cpp" + div.find_all("pre")[0].contents[0] + '```'
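The concatenation above works cleanly only if the <pre> text already starts and ends with a newline; if it does not, the first sample line is glued to the opening fence marker. A minimal sketch of a wrapper that normalizes this is shown below. `wrap_in_fence` is a hypothetical helper, and the default `cpp` label simply mirrors the script.

```python
FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def wrap_in_fence(text: str, lang: str = "cpp") -> str:
    """Wrap sample text in a Markdown code fence with one newline on each side."""
    body = text.strip("\n")
    return f"{FENCE}{lang}\n{body}\n{FENCE}"


print(wrap_in_fence("5\n2 1\n1 2\n3 4\n"))
```

Stripping only newline characters keeps any leading spaces inside the sample intact, which matters for problems where whitespace is part of the data.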


4.6. Note

Notes on the samples.

# Only if the problem has a note section
if(len(mainContent.find_all(name="div", attrs={"class":"note"})) > 0):
    div = mainContent.find_all(name="div", attrs={"class":"note"})[0]
    data_dict['Note'] = divTextProcess(div)
    

4.7. Source

Link to the problem.

data_dict['Source'] = '[' + data_dict['Title'] + ']' + '(' + url + ')'

5. Output

for each in data_dict.keys():
    print('### ' + each + '\n')
    print(data_dict[each].replace("\n\n**", "**").replace("**\n\n", "**") + '\n')
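If the result should go into a file rather than the terminal, the same loop can build a single string. This is a minimal sketch under that assumption: `render_markdown` is a hypothetical function, and the `sample` dictionary is a shortened stand-in for `data_dict`.

```python
def render_markdown(data: dict) -> str:
    """Render each key as a level-3 heading followed by its text."""
    sections = []
    for key, value in data.items():
        # Same post-processing as the print loop: keep bold text inline.
        text = value.replace("\n\n**", "**").replace("**\n\n", "**")
        sections.append(f"### {key}\n\n{text}\n")
    return "\n".join(sections)


sample = {"Title": "CodeForces 1353 B. Two Arrays And Swaps", "Time Limit": "1 second"}
markdown = render_markdown(sample)
print(markdown)
```

The returned string can then be saved, for example with `pathlib.Path("problem.md").write_text(markdown, encoding="utf-8")`, where the file name is only an illustration.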
    

The final output is shown below.

### Title

CodeForces 1353 B. Two Arrays And Swaps

### Time Limit

1 second

### Memory Limit

256 megabytes

### Problem Description

You are given two arrays $a$ and $b$ both consisting of $n$ positive (greater than zero) integers. You are also given an integer $k$.

In one move, you can choose two indices $i$ and $j$ ($1 \le i, j \le n$) and swap $a_i$ and $b_j$ (i.e. $a_i$ becomes $b_j$ and vice versa). Note that $i$ and $j$ can be equal or different (in particular, swap $a_2$ with $b_2$ or swap $a_3$ and $b_9$ both are acceptable moves).

Your task is to find the **maximum** possible sum you can obtain in the array $a$ if you can do no more than (i.e. at most) $k$ such moves (swaps).

You have to answer $t$ independent test cases.

### Input

The first line of the input contains one integer $t$ ($1 \le t \le 200$) — the number of test cases. Then $t$ test cases follow.

The first line of the test case contains two integers $n$ and $k$ ($1 \le n \le 30; 0 \le k \le n$) — the number of elements in $a$ and $b$ and the maximum number of moves you can do. The second line of the test case contains $n$ integers $a_1, a_2, \dots, a_n$ ($1 \le a_i \le 30$), where $a_i$ is the $i$-th element of $a$. The third line of the test case contains $n$ integers $b_1, b_2, \dots, b_n$ ($1 \le b_i \le 30$), where $b_i$ is the $i$-th element of $b$.

### Output

For each test case, print the answer — the **maximum** possible sum you can obtain in the array $a$ if you can do no more than (i.e. at most) $k$ swaps.

### Sample Input

// The real output wraps this sample in a cpp code fence; it is omitted here for readability
5
2 1
1 2
3 4
5 5
5 5 6 6 5
1 2 5 4 3
5 3
1 2 3 4 5
10 9 10 10 9
4 0
2 2 4 3
2 4 2 3
4 4
1 2 2 1
4 4 5 4

### Sample Onput

6
27
39
11
17

### Note

In the first test case of the example, you can swap $a_1 = 1$ and $b_2 = 4$, so $a=[4, 2]$ and $b=[3, 1]$.

In the second test case of the example, you don't need to swap anything.

In the third test case of the example, you can swap $a_1 = 1$ and $b_1 = 10$, $a_3 = 3$ and $b_3 = 10$ and $a_2 = 2$ and $b_4 = 10$, so $a=[10, 10, 10, 4, 5]$ and $b=[1, 9, 3, 2, 9]$.

In the fourth test case of the example, you cannot swap anything.

In the fifth test case of the example, you can swap arrays $a$ and $b$, so $a=[4, 4, 5, 4]$ and $b=[1, 2, 2, 1]$.

### Source

[CodeForces 1353 B. Two Arrays And Swaps](https://codeforces.com/problemset/problem/1353/B)




















Contact email: [email protected]

GitHub: https://github.com/CurrenWong

Feel free to repost, star, or fork. If you have questions, reach out by email.
