Task description
- Download GOCI satellite Level-2 product data (Chla, )
- Time span: April 2011 to November 2019
Implementation
1. Get the download links
import requests
from bs4 import BeautifulSoup
import time
import random

def find_data(url):
    ip = "http://222.236.46.45"
    time.sleep(random.uniform(3, 5))  # random 3-5 s pause; requesting too fast gets the IP banned
    res = requests.get(url=url)  # fetch the page with requests
    html = BeautifulSoup(res.text, "html.parser")  # parse it with BeautifulSoup
    for link in html.find_all('a')[1:]:  # all <a> tags, skipping the first one (parent directory)
        full_link = ip + link.get('href')  # build the full URL from the href
        if ".zip" in full_link:  # archive link: hand it to the filter
            find_chl2_rc2(full_link)
        else:
            find_data(full_link)  # otherwise recurse into the subdirectory
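The core of each crawl step, pulling the hrefs out of a directory-listing page, can be illustrated without touching the network. A minimal sketch using only the standard library's html.parser (the real script uses BeautifulSoup; the sample HTML below is invented to resemble the server's index pages):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects every href attribute found in <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Invented directory listing for illustration only
sample = """
<a href="/nfsdb/COMS/GOCI/2.0/2019/">Parent Directory</a>
<a href="/nfsdb/COMS/GOCI/2.0/2019/04/">04/</a>
<a href="/nfsdb/COMS/GOCI/2.0/2019/04/file_02.zip">file_02.zip</a>
"""

parser = LinkCollector()
parser.feed(sample)
hrefs = parser.links[1:]  # skip the first link (parent directory), as find_data does
print(hrefs)
```

The `[1:]` slice mirrors the trick in find_data: the first anchor on these index pages points back to the parent directory, so it is dropped before recursing.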
2. Filtering
My senior labmate only needed the first and third Chla scene of each day (the midday scenes have the best data quality), so I added a filtering step.
def find_chl2_rc2(full_link):
    if "CHL" in full_link:
        if full_link[80:82] == "02":  # the two URL characters at positions 80-81 pick out the scene
            downlist.append(full_link)
            print(full_link)
    elif "CDOM" in full_link:
        pass  # skip CDOM products
    elif "TSS" in full_link:
        pass  # skip TSS products
    elif "RRS" in full_link:
        pass  # skip RRS products
    else:
        if full_link[80:82] == "02":
            downlist.append(full_link)
            print(full_link)
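The positional slice full_link[80:82] is brittle: it silently breaks if any path component changes length. A sketch of a regex alternative, assuming (hypothetically) that the two-digit scene number sits just before ".zip" in the filename; the sample URLs below are invented, so the pattern would need adjusting to the server's real naming scheme:

```python
import re

# Assumed filename layout: "...CHL..._NN.zip" where NN is the scene slot.
SLOT_RE = re.compile(r"CHL.*_(\d{2})\.zip$")

def wanted(full_link, slots=("02",)):
    """Keep CHL links whose scene slot is in `slots`."""
    m = SLOT_RE.search(full_link)
    return bool(m) and m.group(1) in slots

links = [
    "http://222.236.46.45/.../GOCI_L2_CHL_00.zip",
    "http://222.236.46.45/.../GOCI_L2_CHL_02.zip",
    "http://222.236.46.45/.../GOCI_L2_CHL_05.zip",
    "http://222.236.46.45/.../GOCI_L2_TSS_02.zip",
]
kept = [u for u in links if wanted(u)]
print(kept)
```

Passing slots=("00", "02") would keep both the first and the third scene of the day, matching the stated requirement rather than the single "02" check in the script.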
3. Download
Download with IDM
Import the collected download links into IDM.
Choose a download directory, click "Select All" -> "OK", and the downloads start.
Download with wget
Use wget to download every link in downlist.txt; -nc skips files that already exist, and -c resumes interrupted downloads.
wget --input-file=downlist.txt -nc -c
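wget can also apply the same politeness the crawler uses between requests. A sketch with throttling flags added (the output directory name is my own choice, not from the original workflow):

```shell
# -i / --input-file: read URLs from a file
# -nc: no-clobber, skip files that already exist
# -c: resume partially downloaded files
# --wait=3 --random-wait: pause about 3 s (randomized) between files to avoid an IP ban
# -P: directory to save into (name chosen here for illustration)
wget -i downlist.txt -nc -c --wait=3 --random-wait -P GOCI_L2
```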
Limitations and future work
- requests fires too quickly and risks an IP ban; consider using an IP (proxy) pool later
- The link-filtering logic looks clumsy; consider rewriting it with regular expressions
- I was short on time, so instead of multithreaded downloading I simply handed the links to IDM
Full script
import requests
from bs4 import BeautifulSoup
import time
import random

def find_chl2_rc2(full_link):
    if "CHL" in full_link:
        if full_link[80:82] == "02":
            downlist.append(full_link)
            print(full_link)
    elif "CDOM" in full_link:
        pass  # skip CDOM products
    elif "TSS" in full_link:
        pass  # skip TSS products
    elif "RRS" in full_link:
        pass  # skip RRS products
    else:
        if full_link[80:82] == "02":
            downlist.append(full_link)
            print(full_link)

def find_data(url):
    ip = "http://222.236.46.45"
    time.sleep(random.uniform(3, 5))  # random 3-5 s pause to avoid an IP ban
    res = requests.get(url=url)
    html = BeautifulSoup(res.text, "html.parser")
    for link in html.find_all('a')[1:]:  # skip the first link (parent directory)
        full_link = ip + link.get('href')
        if ".zip" in full_link:
            find_chl2_rc2(full_link)
        else:
            find_data(full_link)

if __name__ == '__main__':
    BASE_URL = 'http://222.236.46.45/nfsdb/COMS/GOCI/2.0/2019'
    downlist = []
    find_data(BASE_URL)
    with open("downlist.txt", 'w') as f:
        for line in downlist:
            f.write(line + '\n')