
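With the growth of China's industry and technology, air quality in some of the country's developed cities has become an increasingly serious problem, and the worst of it is the pollution caused by PM2.5.
This article crawls publicly available air quality data from the web, focusing on cities with relatively poor air such as Tianjin, Beijing and Guangzhou, and compares and analyzes the Tianjin data in particular. From that analysis it describes how air quality changed over the year and offers some suggestions. It then uses a machine learning algorithm to predict air quality from the data, following the typical analyze-then-predict pattern of big data analysis.
The overall analysis workflow is shown in the flowchart below: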

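Preparation before the experiment

1.1 Data acquisition
The data used here is publicly available air quality data taken from the "天氣後報" website, at http://www.tianqihoubao.com/aqi/tianjin.html. The page looks as shown in the figure below: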

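Figure 1-1 Data on the web page

All of the data is crawled with Python. The steps are as follows:
(1) Import the libraries the crawler needs:

The code is in air_tianjin_2019.py.
Requests is an HTTP library written in Python, built on urllib3 and released under the Apache2 License. It is more convenient than urllib, saves a great deal of work, and fully meets the needs of HTTP testing.
BeautifulSoup is a flexible and convenient web page parsing library that is efficient and supports several parsers. With it, page content can be extracted without writing regular expressions.
The corresponding code is as follows:
import requests
from bs4 import BeautifulSoup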

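(2) To avoid triggering the site's anti-crawler mechanism, we send a browser-like User-Agent header so that the requests look like they come from a browser:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
(3) Then fetch the air quality data for the whole of 2019: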
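The crawl loop itself is not reproduced in this article. As a minimal sketch of one part of it, the snippet below builds the twelve monthly page addresses. It assumes the site serves one page per month under a `tianjin-YYYYMM.html` pattern, which is an assumption and not something stated above; `month_urls` is a hypothetical helper name.

```python
def month_urls(city, year):
    """Build one URL per month (the URL pattern is an assumption, not confirmed by the article)."""
    base = "http://www.tianqihoubao.com/aqi/{city}-{year}{month:02d}.html"
    return [base.format(city=city, year=year, month=m) for m in range(1, 13)]


urls = month_urls("tianjin", 2019)
for url in urls[:2]:
    print(url)
```

Each address could then be requested with the `headers` dictionary defined in step (2) and parsed with BeautifulSoup.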

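1.2 Data preprocessing
The raw pages contain HTML tags and other noise around the values, so we strip these out and keep only the text of each cell:
# tr holds the table rows (<tr> elements) of one page
for j in tr[1:]:  # skip the header row
    td = j.find_all('td')
    Date = td[0].get_text().strip()
    Quality_grade = td[1].get_text().strip()
    AQI = td[2].get_text().strip()
    AQI_rank = td[3].get_text().strip()
    PM = td[4].get_text().strip()
    # Append one CSV line per day
    with open('air_tianjin_2019.csv', 'a+', encoding='utf-8-sig') as f:
        f.write(Date + ',' + Quality_grade + ',' + AQI + ',' + AQI_rank + ',' + PM + '\n')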

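A sample of the crawled data is shown below:
Table 1-1 Sample of the crawled Tianjin data

The numeric columns are, in order, the AQI value, the AQI rank for that day, and the PM2.5 value.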
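Joining the fields by hand with ',' works as long as no value contains a comma. As a hedged alternative, the sketch below cleans each cell and lets the standard `csv` module do the writing. The sample rows are made-up values for illustration only, `clean_row` is a hypothetical helper, and `io.StringIO` stands in for the real CSV file.

```python
import csv
import io


def clean_row(cells):
    """Strip surrounding whitespace (including line breaks) from every cell."""
    return [c.strip() for c in cells]


# Hypothetical raw cell texts, as they might come out of td.get_text()
raw_rows = [
    ["\r\n 2019-01-01 ", " 良 ", "78", "150", " 52 "],
    ["\r\n 2019-01-02 ", " 輕度污染 ", "120", "260", " 90 "],
]

buffer = io.StringIO()  # stands in for the CSV file
writer = csv.writer(buffer, lineterminator="\n")
for cells in raw_rows:
    writer.writerow(clean_row(cells))

csv_text = buffer.getvalue()
print(csv_text)
```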

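Data analysis

The analysis here relies mainly on visualization: we plot the data and read the results off the charts.
(1) Tianjin AQI trend over the full year
The code is in air_tianjin_2019_AQI.py.
The trend chart is drawn with the pyecharts library.
First, read the data that was crawled earlier:
import pandas as pd
from pyecharts import Line  # pyecharts 0.5.x API

df = pd.read_csv('air_tianjin_2019.csv', header=None, names=["Date", "Quality_grade", "AQI", "AQI_rank", "PM"])
Then take the date and AQI columns and store them in variables for plotting:
attr = df['Date']
v1 = df['AQI']

Next, set the title, draw the curve, and save it as a web page:
line = Line("2019年天津AQI全年走勢圖", title_pos='center', title_top='18', width=800, height=400)
line.add("", attr, v1, mark_line=['average'], is_fill=True, area_color="#000", area_opacity=0.3, mark_point=["max", "min"], mark_point_symbol="circle", mark_point_symbolsize=25)
line.render("2019年天津AQI全年走勢圖.html")
The resulting chart is shown below: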

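Figure 2-2 Tianjin AQI trend over the full year, 2019

Figure 2-2 shows that in 2019 Tianjin's AQI peaked (that is, air quality was worst) in January, February, November and December, mainly in winter and early spring. The likely causes are poorer air circulation in those months and the larger number of holidays, when fireworks and heavy vehicle and passenger traffic make air quality worse.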

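(2) Tianjin monthly average AQI trend
The code is in air_tianjin_2019_AQI_month.py.
To show how average air quality changes from month to month, we plot a monthly average trend chart.
As before, start by reading the data:
import numpy as np
import pandas as pd
from pyecharts import Line  # pyecharts 0.5.x API

df = pd.read_csv('air_tianjin_2019.csv', header=None, names=["Date", "Quality_grade", "AQI", "AQI_rank", "PM"])
Next, take the date and AQI columns and split each date on "-" to extract the month:
dom = df[['Date', 'AQI']]
list1 = []
for j in dom['Date']:
    time = j.split('-')[1]  # the second field of the date is the month
    list1.append(time)

df['month'] = list1
Then compute the mean AQI for each month:
month_message = df.groupby(['month'])
month_com = month_message['AQI'].agg(['mean'])
month_com.reset_index(inplace=True)
month_com_last = month_com.sort_index()
attr = ["{}".format(str(i) + '月') for i in range(1, 13)]  # x-axis labels, '1月' to '12月'
v1 = np.array(month_com_last['mean'])
v1 = ["{}".format(int(i)) for i in v1]  # truncate the means to whole numbers
Then draw the trend chart:
line = Line("2019年天津月均AQI走勢圖", title_pos='center', title_top='18', width=800, height=400)
line.add("", attr, v1, mark_point=["max", "min"])
line.render("2019年天津月均AQI走勢圖.html")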

The resulting chart is shown below:

Figure 2-3 Tianjin monthly average AQI trend, 2019
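The monthly averaging step can also be written without pandas, which makes the grouping logic easier to see. This is only an illustrative sketch: `monthly_mean_aqi` is a hypothetical helper and the sample records are invented values, not crawled data.

```python
from collections import defaultdict
from statistics import mean


def monthly_mean_aqi(records):
    """Group (date, aqi) pairs by the month field of the date and average each group."""
    by_month = defaultdict(list)
    for date, aqi in records:
        month = date.split("-")[1]  # second field of the date string
        by_month[month].append(aqi)
    return {month: mean(values) for month, values in sorted(by_month.items())}


# Hypothetical readings, for illustration only
sample = [("2019-01-01", 80), ("2019-01-02", 120), ("2019-02-01", 60)]
result = monthly_mean_aqi(sample)
print(result)
```

Because the month keys are zero-padded strings, sorting them as text gives the same order as sorting them as numbers.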

(3) Tianjin quarterly AQI box plot
The code is in air_tianjin_2019_AQI_season.py.
The quarterly air quality box plot for Tianjin is drawn in the following steps.
Read the crawled data:
import pandas as pd
from pyecharts import Boxplot  # pyecharts 0.5.x API

df = pd.read_csv('air_tianjin_2019.csv', header=None, names=["Date", "Quality_grade", "AQI", "AQI_rank", "PM"])
Then group the months into four quarters:
dom = df[['Date', 'AQI']]
data = [[], [], [], []]
dom1, dom2, dom3, dom4 = data
for i, j in zip(dom['Date'], dom['AQI']):
    time = i.split('-')[1]  # month field of the date
    if time in ['01', '02', '03']:
        dom1.append(j)
    elif time in ['04', '05', '06']:
        dom2.append(j)
    elif time in ['07', '08', '09']:
        dom3.append(j)
    else:
        dom4.append(j)

Then set the box plot's title and axes and draw it:
boxplot = Boxplot("2019年天津季度AQI箱形圖", title_pos='center', title_top='18', width=800, height=400)
x_axis = ['第一季度', '第二季度', '第三季度', '第四季度']
y_axis = [dom1, dom2, dom3, dom4]
_yaxis = boxplot.prepare_data(y_axis)
boxplot.add("", x_axis, _yaxis)
boxplot.render("2019年天津季度AQI箱形圖.html")
The resulting box plot is shown below:

Figure 2-4 Tianjin quarterly AQI box plot, 2019
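A box plot summarizes each quarter by its minimum, lower quartile, median, upper quartile and maximum. The sketch below computes those five numbers with the standard library so the values behind a box can be checked by hand. It is an illustration under assumptions: the helper names and sample readings are hypothetical, and pyecharts may use a different quartile interpolation rule, so its boxes need not match these numbers exactly.

```python
from statistics import quantiles


def quarter_index(date):
    """Map a date string with a month in its second '-' field to a quarter index 0..3."""
    month = int(date.split("-")[1])
    return (month - 1) // 3


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); needs at least two values."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical readings, for illustration only
sample = [
    ("2019-01-05", 60), ("2019-02-10", 80), ("2019-03-01", 100),
    ("2019-03-15", 120), ("2019-03-30", 140), ("2019-04-02", 70),
]
by_quarter = [[], [], [], []]
for date, aqi in sample:
    by_quarter[quarter_index(date)].append(aqi)

summary_q1 = five_number_summary(by_quarter[0])
print(summary_q1)
```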

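Prediction with the KNN algorithm

The code has two parts: test.py converts the CSV file into a TXT file in the format the classifier expects, and the second part classifies the data with the k-nearest neighbors (KNN) algorithm.
(1) Generating the TXT data
The code is in test.py.
First read the data and store it in the lists X and Y. Because the values of Y are Chinese quality grades, they have to be converted to numbers: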

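import numpy as np
import pandas as pd

# File name
FILENAME1 = "air_tianjin_2019.csv"

# Disable scientific notation
pd.set_option('display.float_format', lambda x: '%.3f' % x)
np.set_printoptions(threshold=np.inf)

# Read the data (the CSV written earlier has no header row)
data = pd.read_csv(FILENAME1, header=None)
rows, clos = data.shape

# Convert the DataFrame to an array
DataArray = data.values
# Map each quality grade to a numeric label
GRADES = {"良": 0, "輕度污染": 1, "優": 2, "嚴重污染": 3, "重度污染": 4}
X = []
Y = []
for row in DataArray:
    # Keep only rows whose grade is listed above, so that X and Y stay aligned
    if row[1] in GRADES:
        X.append(row[2:5])  # AQI, AQI rank, PM2.5
        Y.append(GRADES[row[1]])

print(Y)
print(len(Y))
print(X[1])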
Then write the stored data to the TXT file, taking care to add the "," separators and the line breaks:
with open("data.txt", "a+") as f:
    for i in range(len(Y)):
        for j in range(3):
            f.write(str(X[i][j]) + ",")
        f.write(str(Y[i]) + "\n")
print("data.txt數據生成")
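(2) KNN classification
The code is in KNearestNeighbor.py.
First, load the data:
import csv
import math
import operator
import random

def loadDataset(self, filename, split, trainingSet, testSet):  # load the data set; split is the probability that a row goes to the training set
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)  # read all lines
        dataset = list(lines)  # convert to a list
        for x in range(len(dataset)):
            if not dataset[x]:  # skip blank lines
                continue
            for y in range(3):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:  # randomly assign each row to train or test
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])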

Define the function that computes the distance:
def calculateDistance(self, testdata, traindata, length):  # Euclidean distance
    distance = 0  # length is the number of feature dimensions
    for x in range(length):
        distance += pow((testdata[x] - traindata[x]), 2)
    return math.sqrt(distance)
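As a quick sanity check on the distance function, here is a standalone version applied to two made-up rows. The rows and the name `euclidean_distance` are illustrative assumptions; the last element of each row plays the role of the class label and is excluded by `length`.

```python
import math


def euclidean_distance(a, b, length):
    """Euclidean distance over the first `length` entries of two rows."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(length)))


# Hypothetical rows: three features followed by a class label
test_row = [3.0, 4.0, 0.0, "1"]
train_row = [0.0, 0.0, 0.0, "0"]
d = euclidean_distance(test_row, train_row, 3)
print(d)
```

Note that the three features are on different scales, so the one with the largest numeric range tends to dominate the distance; rescaling the features first is a common remedy.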

For each test instance, compute its distance to every training sample and return the k nearest samples:
def getNeighbors(self, trainingSet, testInstance, k):  # return the k nearest neighbors
    distances = []
    length = len(testInstance) - 1  # the last column is the class label
    for x in range(len(trainingSet)):  # distance from the test instance to every training sample
        dist = self.calculateDistance(testInstance, trainingSet[x], length)
        print('訓練集:{}-距離:{}'.format(trainingSet[x], dist))
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))  # sort by distance, ascending
    print(distances)
    neighbors = []
    for x in range(k):  # after sorting, take the first k samples
        neighbors.append(distances[x][0])
    print(neighbors)
    return neighbors

The decision function picks the class by majority vote among the neighbors:
def getResponse(self, neighbors):  # majority vote decides the class
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]  # count the votes for each class
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    print(classVotes.items())
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)  # sort in descending order of votes
    return sortedVotes[0][0]
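The same majority vote can be expressed more compactly with `collections.Counter`. This is an equivalent sketch rather than the code used in KNearestNeighbor.py; `majority_vote` is a hypothetical name and the neighbor rows are made-up values.

```python
from collections import Counter


def majority_vote(neighbors):
    """Return the most common class label (last column) among the neighbors."""
    votes = Counter(row[-1] for row in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical neighbors: three features followed by a class label
neighbors = [
    [80.0, 150.0, 52.0, "0"],
    [95.0, 180.0, 60.0, "0"],
    [120.0, 260.0, 90.0, "1"],
]
winner = majority_vote(neighbors)
print(winner)
```

Choosing an odd k reduces the chance of a tie when there are two classes, although ties remain possible with more classes.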

Compute the model's accuracy:
def getAccuracy(self, testSet, predictions):  # accuracy as a percentage
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:  # compare each prediction with the actual label
            correct += 1
    print('共有{}個預測正確，共有{}個測試數據'.format(correct, len(testSet)))
    return (correct / float(len(testSet))) * 100.0
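A small worked example of the accuracy calculation, using invented labels: three of four predictions match, which gives 75%. The function name `accuracy_percent` is a hypothetical stand-in for the method above.

```python
def accuracy_percent(actual, predicted):
    """Share of predictions that equal the actual labels, as a percentage."""
    if not actual:
        raise ValueError("the test set is empty")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual) * 100.0


# Hypothetical labels, for illustration only
actual = ["0", "1", "0", "2"]
predicted = ["0", "1", "1", "2"]
acc = accuracy_percent(actual, predicted)
print(acc)
```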

Finally, the Run method ties everything together: it splits the data, sets k, and evaluates the predictions:
def Run(self):
    trainingSet = []
    testSet = []
    split = 0.75
    self.loadDataset(r'data.txt', split, trainingSet, testSet)  # split the data
    print('Train set: ' + str(len(trainingSet)))
    print('Test set: ' + str(len(testSet)))
    # generate predictions
    predictions = []
    k = 5  # use the 5 nearest samples
    for x in range(len(testSet)):  # classify every row in the test set
        neighbors = self.getNeighbors(trainingSet, testSet[x], k)  # find the 5 nearest neighbors
        result = self.getResponse(neighbors)  # decide which class those neighbors vote for
        predictions.append(result)
        # print('predictions: ' + repr(predictions))
        # print('>predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = self.getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')

The final model reaches an accuracy of 90%.

Figure 2-10 Model run results
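Because loadDataset assigns rows to the training and test sets at random, the measured accuracy can differ from run to run. One way to make a run repeatable is to draw from a seeded generator, sketched below. The helper `split_rows` and the seed value are assumptions for illustration, not part of the original program.

```python
import random


def split_rows(rows, ratio, seed):
    """Randomly split rows into (train, test) using a seeded generator."""
    rng = random.Random(seed)  # private generator, so the global state is untouched
    train, test = [], []
    for row in rows:
        (train if rng.random() < ratio else test).append(row)
    return train, test


rows = list(range(100))  # stand-in for the rows of data.txt
train_a, test_a = split_rows(rows, 0.75, seed=42)
train_b, test_b = split_rows(rows, 0.75, seed=42)
print(len(train_a), len(test_a))
```

With the same seed the two calls return identical splits, so an accuracy figure can be reproduced exactly.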

Source code: https://pan.baidu.com/s/1Vcc_bHQMHmQpe-F6A-mFdQ


Originally published: 2020-07-06
Author: 李秋鍵
Source: "csdn"; follow "csdn" for related information

Hands-on: how to analyze and predict urban air quality with Python and the KNN algorithm
