Python Web Scraping in Practice: Crawling a Static Page

Original

2020-06-19 06:55

http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html

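This script scrapes the university ranking data from the page above and stores it in a database. Start the MySQL server first, otherwise Python cannot connect to it. The only code you need to change is the pymysql.connect call in the mysql method: use your own password. You also need a database named university that contains a table named university_level. The values written are rankings (int), schools, provinces, totalscores (float) and indexscores (float); the insert statement stores them in the columns ranking, school, province, totalscore and indexscore.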
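The table described above could be created with something like the following. This is only a sketch: the int and float types and the column names come from this post, and the primary key on ranking follows the comment in data_handle, but the VARCHAR lengths, the table charset and the helper name create_table are assumptions made for illustration.

```python
# Column names and the int/float types come from the post; the VARCHAR
# lengths and the utf8 table charset are assumptions made for this sketch.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS university_level (
    ranking    INT PRIMARY KEY,
    school     VARCHAR(64),
    province   VARCHAR(32),
    totalscore FLOAT,
    indexscore FLOAT
) DEFAULT CHARSET = utf8
"""


def create_table(conn):
    """Run the DDL on an already-open DB-API connection (for example pymysql)."""
    cursor = conn.cursor()
    try:
        cursor.execute(CREATE_TABLE_SQL)
        conn.commit()
    finally:
        cursor.close()
```

To use it, open a connection to the university database the same way the script does and pass that connection to create_table once, before running the scraper.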

import requests
import pymysql
from pymysql.cursors import DictCursor
from bs4 import BeautifulSoup
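class Mysql(object):
    def mysql(self,rankings,schools,provinces,totalscores,indexscores):
        # Start as None so the cleanup below is safe even if connecting fails
        conn = None
        cursor = None
        try:
            # Connect to MySQL; replace the password with your own
            conn = pymysql.connect(
                host="localhost", user="root", password="taistrive0900",
                database="university", port=3306, charset='utf8'
            )
            cursor = conn.cursor(DictCursor)
            # Parameterized INSERT; pymysql uses %s as the placeholder for every type
            sql = "insert into university_level(ranking,school,province,totalscore,indexscore) values(%s,%s,%s,%s,%s)"
            cursor.execute(sql,(rankings,schools,provinces,totalscores,indexscores))
            conn.commit()
        except Exception as e:
            print(e)
            # Roll back only if the connection was actually opened
            if conn is not None:
                conn.rollback()
        finally:
            # Close only what was opened
            if cursor is not None:
                cursor.close()
            if conn is not None:
                conn.close()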

class Spider(object):
    url = "http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html"  # page to scrape
    def __getContent(self):
        try:
            r = requests.get(Spider.url)    # the Response object holds what the server returned
            r.raise_for_status()    # raise an HTTPError for 4xx/5xx status codes
            # r.encoding is guessed from the HTTP headers; r.apparent_encoding is
            # detected from the body itself, so using it makes garbled text less likely
            r.encoding = r.apparent_encoding
            return r.text   # the HTML as text
        except Exception as e:
            print(e)
            print("error")
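    def show_univerList(self,unList):
        html = self.__getContent()
        if html is None:    # the request failed, so there is nothing to parse
            return
        soup = BeautifulSoup(html, 'html.parser')   # parse the HTML into a tree
        #print(soup.prettify())     # pretty-print the parsed HTML
        university = soup.find_all('td')    # collect every <td> cell
        #print(university)
        i = 1
        # Drop the redundant cells: treat every 14 consecutive cells as one row
        # and keep only the first five of each row
        for item in university:
            if i % 14 in (1, 2, 3, 4, 5):
                unList.append(item.string)  # append the text inside the <td>
            i+=1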

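    def data_handle(self, unList):
        # Convert each value to the Python type that matches its MySQL column
        i = len(unList)     # number of values in the list
        j = 1
        for item in range(0, i):
            if item % 5 == 0:
                # Tied schools share a rank on the page, but the primary key in the
                # database must be unique, so use a running counter instead
                unList[item] = int(j)
                j += 1
            elif item % 5 == 3 or item % 5 == 4:
                unList[item] = float(unList[item])  # scores become floats
            else:
                unList[item] = str(unList[item])  # school and province become strings
        # Insert one row per group of five values; msql is the module-level
        # Mysql instance created in the __main__ block
        for item in range(0, i - 4, 5):
            msql.mysql(unList[item], unList[item + 1], unList[item + 2],
                         unList[item + 3], unList[item + 4])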
if __name__=="__main__":
    sql_univeiList = []
    spider = Spider()
    msql = Mysql()
    spider.show_univerList(sql_univeiList)
    spider.data_handle(sql_univeiList)
    print("finished")
    # Print the structured rows to the console
    # for item in range(0, len(sql_univeiList) - 4, 5):
    #     print(sql_univeiList[item], sql_univeiList[item + 1], sql_univeiList[item + 2], sql_univeiList[item + 3], sql_univeiList[item + 4])
    #     print(str(item) + " " + str(item + 1) + " " + str(item + 2) + " " + str(item + 3) + " " + str(item + 4))
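
The row handling in show_univerList and data_handle can be tried without the network or a database. The sketch below repeats the same idea in plain Python: take the first five of every 14 cells, replace the rank with a running counter so tied ranks stay unique, and convert the two scores to float. The function name rows_from_cells and the sample cells are made up for illustration and are not real ranking data. Unlike the script, it skips an incomplete trailing row.

```python
def rows_from_cells(cells, cells_per_row=14, keep=5):
    """Group a flat list of cell texts into rows and convert the kept columns."""
    rows = []
    for start in range(0, len(cells) - cells_per_row + 1, cells_per_row):
        _rank_text, school, province, total, index = cells[start:start + keep]
        rows.append((
            len(rows) + 1,   # running counter instead of the page's rank, so ties stay unique
            str(school),
            str(province),
            float(total),
            float(index),
        ))
    return rows


# Made-up placeholder cells: two rows of 14 cells each, both with rank "1"
sample = (["1", "School A", "Province X", "90.5", "80.0"] + ["-"] * 9 +
          ["1", "School B", "Province Y", "70.25", "60.0"] + ["-"] * 9)
rows = rows_from_cells(sample)
print(rows)
```

Each tuple has the same order as the columns in the insert statement, so it could be passed straight to cursor.execute as the parameter tuple.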

Check that the database is running
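
One quick way to do this check from Python is to see whether anything accepts TCP connections on the MySQL port. This is a rough sketch, not the method used in this post: the helper name mysql_port_open is made up, and the host and port are simply the values from the connect call in the script. It only shows that the port is reachable; it does not verify the password or that the university database exists.

```python
import socket


def mysql_port_open(host="localhost", port=3306, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable or timed out: treat all of these as "not running"
        return False


if __name__ == "__main__":
    print("MySQL reachable:", mysql_port_open())
```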

The scraped data stored in the database, shown as a table in Navicat

Terminal output
