Hello everyone, I'm 皮皮.
1. Preface
This is the same web-scraping question from yesterday: could someone take a look at how to fix this crawler code? The fan said he wasn't familiar with pandas. The pandas-based crawler was concise, but he wasn't comfortable with it and wanted to modify his own code instead. The data-fetching part was already written; the only missing piece was saving the results to a csv file.
His original code is shown below:
import requests
from lxml import etree
import csv
import time
import pandas as pd

def gdpData(page):
    url = f'https://www.hongheiku.com/category/gdjsgdp/page/{page}'
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}
    resp = requests.get(url, headers=headers)
    # print(resp.text)
    data(resp.text)

file = open('data.csv', mode='a', encoding='utf-8', newline='')
csv_write = csv.DictWriter(file, fieldnames=['排名', '地區', 'GDP', '年份'])
csv_write.writeheader()

def data(text):
    e = etree.HTML(text)
    lst = e.xpath('//*[@id="tablepress-48"]/tbody/tr[@class="even"]')
    for l in lst:
        no = l.xpath('./td[1]/center/span/text()')
        name = l.xpath('./td[2]/a/center/text()')
        team = l.xpath('./td[3]/center/text()')
        year = l.xpath('./td[4]/center/text()')
        data_dict = {
            '排名': no,
            '地區': name,
            'GDP': team,
            '年份': year
        }
        print(f'排名:{no} 地區:{name} GDP:{team} 年份:{year} ')
        csv_write.writerow(data_dict)
    file.close()

url = 'https://www.hongheiku.com/category/gdjsgdp'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}
resp = requests.get(url, headers=headers)
# print(resp.text)
data(resp.text)
e = etree.HTML(resp.text)
# //*[@id="tablepress-48"]/tbody/tr[192]/td[3]/center
count = e.xpath('//div[@class="pagination pagination-multi"][last()]/ul/li[last()]/span/text()')[0].split(' ')[1]
for index in range(int(count) - 1):
    gdpData(index + 2)
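The core flaw in the code above: `file` and `csv_write` are created once at module level, and `data()` closes the shared file at the end, so the first page is written fine but every later `gdpData()` call tries to write to a closed handle and raises `ValueError: I/O operation on closed file`. A minimal sketch of the safer pattern (with hypothetical English column names and file path) is to open the file inside the function with `with`:

```python
import csv
import os

FIELDS = ['rank', 'region', 'gdp', 'year']  # hypothetical column names

def save_rows(rows, path):
    # Open the file for this call only; the `with` block closes it
    # automatically, so repeated calls never hit a closed handle.
    with open(path, mode='a', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if f.tell() == 0:  # file still empty -> write the header exactly once
            writer.writeheader()
        writer.writerows(rows)

path = 'demo_gdp.csv'
if os.path.exists(path):  # start the demo from a clean file
    os.remove(path)
save_rows([{'rank': '1', 'region': 'A', 'gdp': '100', 'year': '2021'}], path)
save_rows([{'rank': '2', 'region': 'B', 'gdp': '90', 'year': '2021'}], path)
```

Each call appends its rows and only the very first call writes the header line.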
2. Implementation
Here the fan put up a bottle of iced black tea as the bounty, and a warm-hearted community member (熱心市民) contributed a fix built on the fan's own code, as follows:
import requests
from lxml import etree
import csv

def gdpData(page):
    url = f'https://www.hongheiku.com/category/gdjsgdp/page/{page}'
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}
    resp = requests.get(url, headers=headers)
    # print(resp.text)
    data(resp.text)

def data(text):
    # Open (and later close) the file inside the function, so every page
    # gets a fresh, valid handle.
    file = open('data.csv', mode='a', encoding='utf-8', newline='')
    csv_write = csv.DictWriter(file, fieldnames=['排名', '地區', 'GDP', '年份'])
    if file.tell() == 0:  # only write the header when the file is still empty
        csv_write.writeheader()
    e = etree.HTML(text)
    lst = e.xpath('//*[@id="tablepress-48"]/tbody/tr[@class="even"]')
    for l in lst:
        # ''.join() over the whole result list flattens it to a string and,
        # unlike indexing [0], does not raise IndexError on an empty match.
        no = ''.join(l.xpath('./td[1]/center/span/text()'))
        name = ''.join(l.xpath('./td[2]/a/center/text()'))
        team = ''.join(l.xpath('./td[3]/center/text()'))
        year = ''.join(l.xpath('./td[4]/center/text()'))
        data_dict = {
            '排名': no,
            '地區': name,
            'GDP': team,
            '年份': year
        }
        print(f'排名:{no} 地區:{name} GDP:{team} 年份:{year} ')
        csv_write.writerow(data_dict)
    file.close()

url = 'https://www.hongheiku.com/category/gdjsgdp'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}
resp = requests.get(url, headers=headers)
# print(resp.text)
data(resp.text)
e = etree.HTML(resp.text)
# //*[@id="tablepress-48"]/tbody/tr[192]/td[3]/center
count = e.xpath('//div[@class="pagination pagination-multi"][last()]/ul/li[last()]/span/text()')[0].split(' ')[1]
for index in range(int(count) - 1):
    gdpData(index + 2)
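A note on the page-count line: it grabs the text of the last pagination `<span>` and takes `split(' ')[1]`. Assuming that label reads something like `共 16 页` (the exact wording on the site is an assumption here), the split yields the number between the two spaces:

```python
# Hypothetical pagination label -- the real text on the site may differ.
label = '共 16 页'
count = label.split(' ')[1]  # the token between the two spaces: '16'
total_pages = int(count)
```

If the site ever changes that label's spacing, this index-based split would break, so it is worth a quick sanity check when reusing the code.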
After running the code, the data is saved to the csv file.
The fan's problem was solved!
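For reference, the pandas route the fan turned down handles the storage step in a couple of lines: collect the scraped rows as dicts, build a `DataFrame`, and append it with `to_csv`, writing the header only when the file does not exist yet. A sketch (column names and file path are placeholders, not the fan's actual ones):

```python
import os
import pandas as pd

def save_with_pandas(rows, path):
    # Append the scraped rows; emit the header only on the first write.
    df = pd.DataFrame(rows)
    df.to_csv(path, mode='a', index=False,
              header=not os.path.exists(path), encoding='utf-8')

path = 'demo_pd.csv'
if os.path.exists(path):  # start the demo from a clean file
    os.remove(path)
save_with_pandas([{'rank': 1, 'region': 'A', 'gdp': 100, 'year': 2021}], path)
save_with_pandas([{'rank': 2, 'region': 'B', 'gdp': 90, 'year': 2021}], path)
```

Whether this is "simpler" is a matter of taste, which is exactly why the fan preferred the plain `csv` version.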
3. Summary
Hello everyone, I'm 皮皮. This article reviewed a question about storing data to csv after a Python web scrape; it walked through the analysis and the code that resolved it, helping the fan solve the problem.
Finally, thanks to fan 【藍桉】 for raising the question, to 【熱心市民】 for the idea and the code walkthrough, and to 【eric】 and others for joining the discussion.