[python] Scraping historical weather data for cities across China


  • Scrapes daily weather data for every Chinese city from 2011 through 2020
  • Fetches and parses pages with requests + BeautifulSoup
  • Crawls with multiple threads
  • Scrapes city by city, then saves the results as .xls files grouped by province
  • Building the dict that maps city names to pinyin runs into cities whose pinyin collide (see the sketch after this list)
  • The site also gets the pinyin of some cities wrong, so their data cannot be fetched
  • Builds a province-to-cities dict and creates one folder per province for archiving
  • Data source: the tianqihoubao (天氣後報) website
  • Full code and data: project repository
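
Building the pinyin keys is the fragile step mentioned above. Below is a minimal sketch of how the city dict could be assembled with the third-party pypinyin library; the province_city sample data is an illustrative assumption, not the project's full city list.

import os
from pypinyin import lazy_pinyin  # Third-party: pip install pypinyin

# Illustrative sample only; the real project covers all provinces and cities
province_city = {
    '福建': ['福州', '泉州'],
    '江西': ['撫州', '南昌'],
}

city_dict = {}  # pinyin -> "province/city", consumed by the crawler threads
for province, cities in province_city.items():
    os.makedirs('./weather_data/' + province, exist_ok=True)  # One folder per province
    for city in cities:
        pinyin = ''.join(lazy_pinyin(city))
        if pinyin in city_dict:
            # Different cities can romanize identically (e.g. 福州 and 撫州
            # are both "fuzhou"), so clashes must be resolved by hand
            print('pinyin collision:', pinyin, city_dict[pinyin], city)
        city_dict[pinyin] = province + '/' + city  # So files land in the province folder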
Analyzing the resource URL

http://www.tianqihoubao.com/lishi/beijing/month/201101.html

The pattern is easy to see: http://www.tianqihoubao.com/lishi/{city pinyin}/month/{YYYYMM}.html
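
A quick sanity check of the pattern (one city, one month) before launching the full crawl:

import requests

url = "http://www.tianqihoubao.com/lishi/{}/month/{}.html".format("beijing", "201101")
r = requests.get(url, timeout=30)
print(r.status_code)  # 200 if the pinyin and month resolve to a real page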

Main code

import threading
import time

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup


class Crawler(threading.Thread):

    def run(self):
        print("%s is running" % threading.current_thread())
        while True:
            # Lock before touching the shared city dict
            gLock.acquire()
            if len(city_dict) == 0:
                # No cities left: release the lock and stop this thread
                gLock.release()
                break
            else:
                item = city_dict.popitem()  # (pinyin, city name)
                gLock.release()
                data_ = list()
                urls = self.get_urls(item[0])
                for url in urls:
                    try:
                        # Merge lists: collect every month of this city's weather into data_
                        data_.extend(self.get_data(url))
                    except Exception as e:
                        print(e)
                self.saveTocsv(data_, item[1])  # Save as an .xls file
                if len(city_dict) == 0:
                    end = time.time()
                    print("Elapsed time:", (end - start))
                    return

    # Build the list of monthly history URLs for one city
    def get_urls(self, city_pinyin):
        urls = []
        for year in target_year_list:
            for month in target_month_list:
                date = year + month
                # e.g. url = "http://www.tianqihoubao.com/lishi/beijing/month/201812.html"
                urls.append("http://www.tianqihoubao.com/lishi/{}/month/{}.html".format(city_pinyin, date))
        return urls

    def get_soup(self, url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()  # Raise HTTPError if the request failed
            soup = BeautifulSoup(r.text, "html.parser")
            return soup
        except Exception as e:
            print(e)  # Falls through to return None; callers handle the failure

    # Save one city's weather data to an .xls file
    # (writing .xls needs the xlwt package; recent pandas versions only write .xlsx)
    def saveTocsv(self, data, city):
        fileName = './weather_data/' + city + '天氣.xls'
        result_weather = pd.DataFrame(data, columns=['日期', '天氣狀況', '氣溫', '風力風向'])
        result_weather.to_excel(fileName, index=False)
        print('Save all weather success!')
        print('remaining: {}'.format(len(city_dict)))

    # Parse one month's page into rows of (date, weather, temperature, wind)
    def get_data(self, url):
        print(url)
        try:
            soup = self.get_soup(url)
            all_weather = soup.find('div', class_="wdetail").find('table').find_all("tr")
            data = list()
            for tr in all_weather[1:]:  # Skip the table header row
                td_li = tr.find_all("td")
                for td in td_li:
                    s = td.get_text()
                    data.append("".join(s.split()))  # Collapse all whitespace inside a cell
            res = np.array(data).reshape(-1, 4)  # Four cells per day
            return res
        except Exception as e:
            print(e)
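
The class relies on a few module-level names defined elsewhere in the project (gLock, city_dict, start, and the target date lists). A minimal sketch that wires them up and launches the crawl, assuming city_dict is built as in the earlier snippet; the five-thread worker count is an arbitrary choice, not from the original:

gLock = threading.Lock()
target_year_list = [str(y) for y in range(2011, 2021)]           # '2011'..'2020'
target_month_list = ['{:02d}'.format(m) for m in range(1, 13)]   # '01'..'12'

start = time.time()
threads = [Crawler() for _ in range(5)]  # Number of workers is an assumption
for t in threads:
    t.start()
for t in threads:
    t.join()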