《Python網絡數據採集》 (Web Scraping with Python), Chapter 5 (Code Reading Notes)

Sharing code typed out from books while learning Python.

The first book: 《Byte Of Python》; the code notes are linked at "ByteOfPython notes code", and that link also contains a PDF copy of the book.

The second book: 《Python網絡數據採集》; a PDF copy of the book is available at: https://pan.baidu.com/s/1eSq6x5g password: a46q

This post contains the code notes for Chapter 5 of 《Python網絡數據採集》, "Storing Data", for reference.


Chapter 5: Storing Data

# -*- coding: utf-8 -*-

# ############# Storing data

from urllib.error import HTTPError
from urllib.request import urlopen
from urllib.request import urlretrieve
from bs4 import BeautifulSoup

import os
import re
import random
import csv
# ## urllib.request.urlretrieve can download a file given its URL
html = urlopen("https://github.com")
bshtml = BeautifulSoup(html, "html.parser")
logo = bshtml.find("link", {"rel": "fluid-icon"}).attrs["href"]
urlretrieve(logo, "gitCat.jpg")
print(logo)
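
# The HTTPError import above goes unused; a minimal sketch of the same download
# with error handling, assuming GitHub still serves a "fluid-icon" link tag:
try:
    html = urlopen("https://github.com")
except HTTPError as e:
    print("Request failed: {0}".format(e))
else:
    icon = BeautifulSoup(html, "html.parser").find("link", {"rel": "fluid-icon"})
    if icon is not None:  # guard against markup changes
        urlretrieve(icon.attrs["href"], "gitCat.jpg")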

# ### Download every file linked from a "src" attribute

downLoadDirectory = "downloaded"
baseUrl = "http://pythonscraping.com"

def getAllSrcFile(baseUrl, source):
    if source.startswith("http://www."):
        url = "http://" + source[11:]
    elif source.startswith("http://"):
        url = source
    elif source.startswith("www."):
        url = "http://" + source[4:]
    else:
        url = baseUrl + "/" + source
    # Test the assembled url, not the raw source, or relative links are dropped
    if baseUrl not in url:
        return None
    return url
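
# For comparison, urllib.parse.urljoin from the standard library does the same
# relative-to-absolute resolution more robustly (not the book's approach; the
# path below is made up for illustration):
from urllib.parse import urljoin
print(urljoin("http://pythonscraping.com", "img/logo.jpg"))
# -> http://pythonscraping.com/img/logo.jpg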

def getDownloadPath(baseUrl, absoluteUrl, downLoadDirectory):
    print(absoluteUrl)
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = downLoadDirectory + path
    print("path:{0}".format(path))
    directory = os.path.dirname(path)
    if not os.path.exists(directory):
        print(directory)
        os.makedirs(directory)
    return path
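
# Worked example (file name made up for illustration):
# getDownloadPath("http://pythonscraping.com",
#                 "http://pythonscraping.com/img/lrg.jpg", "downloaded")
# returns "downloaded/img/lrg.jpg" and creates "downloaded/img" if missing.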


html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html,"html.parser")
print(bsObj)
downloadList = bsObj.findAll(src=True)
for download in downloadList:
    fileUrl = getAllSrcFile(baseUrl, download["src"])
    if fileUrl is not None:
        print(fileUrl)
        urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downLoadDirectory))
    else:
        print("None")

# ## Fetch an HTML table and write it to a CSV file
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://en.wikipedia.org/wiki/Comparison_of_text_editors")
bsObj = BeautifulSoup(html, "html.parser")
# # The main comparison table is the first table on the page
table = bsObj.findAll("table", {"class": "wikitable"})[0]
rows = table.findAll("tr")
os.makedirs("files", exist_ok=True)  # the target directory must exist before open()
csvFile = open("files/editors.csv", 'wt', newline='', encoding='utf-8')
writer = csv.writer(csvFile)
try:
    for row in rows:
        csvRow = []
        for cell in row.findAll(['td', 'th']):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)
finally:
    csvFile.close()
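
# The same write reads a little cleaner with a context manager, which closes
# the file even if an exception is raised; a minimal equivalent sketch:
with open("files/editors.csv", "wt", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow([cell.get_text() for cell in row.findAll(["td", "th"])])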


# ## Connect to a MySQL database with pymysql
import pymysql
conn = pymysql.connect(host="localhost", user="root", passwd="123456", db="pymysql", charset='utf8')
conn2 = pymysql.connect(host="127.0.0.1", user="root", passwd="123456", db="pymysql", charset='utf8')

try:
    cur = conn.cursor()
    cur2 = conn2.cursor()
    # execute() runs a SQL statement and returns the number of affected rows
    reCount = cur.execute('select * from user;')
    reCount2 = cur2.execute('select name,phone from user where id in(1,2)')
    # fetchone() returns a single row
    # data = cur.fetchone()

    # fetchall() returns all remaining rows
    data = cur.fetchall()
    data2 = cur2.fetchall()
    # Commit the transactions
    conn.commit()
    conn2.commit()
except Exception:
    # Roll back on error
    conn.rollback()
    conn2.rollback()
finally:
    # Close the cursors
    cur.close()
    cur2.close()
    # Close the database connections
    conn.close()
    conn2.close()


print(reCount)
print(data)
print(data2)
for user in data2:
    print("{0}'s phone number is: {1}".format(user[0], user[1]))


# ## Scrape web data and store it in MySQL

import pymysql
def getConnect(user, passwd, host="localhost", db="pymysql"):
    conn = pymysql.connect(user=user, passwd=passwd, host=host, db=db, charset="utf8")
    return conn
conn = getConnect("root", "123456")
cur = conn.cursor()

def insertSql(value1, value2, table="user", avg1="name", avg2="phone"):
    try:
        # Bind the values with %s placeholders so quotes in scraped text
        # cannot break the SQL (table and column names cannot be bound)
        count = cur.execute("insert into {0} ({1},{2}) VALUES (%s,%s)".format(table, avg1, avg2), (value1, value2))
        conn.commit()
        print("Stored successfully")
    except Exception:
        conn.rollback()
        print("Store failed")


def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org{0}".format(articleUrl))
    bshtml = BeautifulSoup(html, "html.parser")
    title = bshtml.find("h1").get_text()
    # First paragraph of the article body
    content = bshtml.find("div", {"id": "mw-content-text"}).find("p").get_text()
    insertSql(title, content, "wiki_pages", "title", "content")
    # Internal article links: hrefs that start with /wiki/ and contain no ":" or "%"
    return bshtml.find("div", {"id": "mw-content-text"}).findAll("a", href=re.compile("^(/wiki/)((?!:)(?!%).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
try:
    while len(links) > 0:
        # Follow a random internal link, storing every page visited
        newArticleUrl = links[random.randint(0, len(links) - 1)].attrs["href"]
        print(newArticleUrl)
        links = getLinks(newArticleUrl)
except Exception:
    print("_ _ _ _ _ failed")
finally:
    cur.close()
    conn.close()
    print("Cursor closed_ _ _   connection closed_ _ _")

Chapter 6 code notes: Reading Documents
