Analyzing Popular 2019 National Day Attractions with Python

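I previously scraped data on popular 2019 National Day attractions (see the earlier post "Scraping Popular 2019 National Day Attractions with Python, Part 1: Scraping and Saving the Data"); this post briefly analyzes that data. The idea comes from the WeChat official account 【裸睡的豬】 (the author is really impressive and I am learning from him; his CSDN account has the same name). The source code in that article was published as images, so to make it stick I rewrote all of it myself, and along the way I changed the parts of the original author's code that did not fit my data.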

1. Data cleaning

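The Excel data I got from scraping was fairly messy. The main problems were:

price and month_sales are stored as text and must be converted to numbers
While scraping, I filled every missing value with "暫無" (not available), but for processing these values have to be replaced with a number, usually 0; attractions with no rating were all set to 1A (not really justified)
The scraped region has three levels, and the province now needs to be extracted from it
Here is the code: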

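# Data cleaning
import os
import pandas as pd

def new_excel():
    """
    Clean the scraped data and save it to a new Excel file.
    :return: None
    """
    # 1. Convert monthly sales to int, mapping "暫無" (not available) to 0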
    place_path = "qunar.xlsx"
    df = pd.read_excel(place_path)
    raw_sales=df["month_sales"].values
    new_sales=[]
    for sales in raw_sales:
        if "暫無"==sales:
            sales=int(sales.replace("暫無",'0'))
        new_sales.append(int(sales))
    df["month_sales"]=new_sales

    # 2. Keep only the province from the region (its first two characters)
    district=df["district"].values
    new_district=[]
    for dist in district:
        dist=dist[0:2]
        new_district.append(dist)
    df["district"]=new_district

    # 3. Clean the price column
    price=df["price"].values
    new_price=[]
    for pre in price:
        if "暫無"==pre:
            pre = int(pre.replace("暫無", '0'))
        # A string holding a decimal cannot go straight to int(); convert to float first
        new_price.append(int(float(pre)))
    df["price"]=new_price

    # 4. Clean the popularity column (keep the slice [3:7] of the raw string)
    hotsum=df["hotsum"].values
    new_hotsum=[]
    for hsm in hotsum:
        hsm=hsm[3:7]
        new_hotsum.append(hsm)
    df["hotsum"]=new_hotsum

    # 5. Clean the star rating; missing ratings become 1A
    star=df["star"].values
    new_star=[]
    for sr in star:
        if "暫無"==sr:
            sr=sr.replace("暫無","1A")
        sr=sr[0:2]
        new_star.append(sr)
    df["star"]=new_star

    # Save to a new file so the original data stays untouched
    if os.path.exists("qunar_new.xlsx"):
        os.remove("qunar_new.xlsx")
    # The context manager saves and closes the workbook on exit
    with pd.ExcelWriter("qunar_new.xlsx") as writer:
        df.to_excel(excel_writer=writer,columns=["name","star","info","price","month_sales","district","hotsum"],index=False,
                    sheet_name="去哪兒國慶熱門景點")

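The code is fairly simple, but I did run into problems when writing it myself. The main ones:

A string holding a decimal number cannot be passed straight to int(); convert it to float first, i.e. int(float())
Both arguments to replace() must be strings, and the result is also a string, so remember to convert it
The raw data is very important and must be kept safe, so where possible write the cleaned result to a new Excel file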
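
The same cleaning rules can also be expressed with vectorized pandas calls instead of explicit loops. The following is only a minimal sketch on made-up rows: the helper name clean_numeric and the sample values (including the "·" separator in district) are assumptions for illustration, not taken from the scraped file.

```python
import pandas as pd


def clean_numeric(column: pd.Series, placeholder: str = "暫無") -> pd.Series:
    """Turn a text column into integers, mapping the placeholder to 0."""
    # Swap the placeholder for "0", then parse; decimals such as "45.5" parse as floats
    parsed = pd.to_numeric(column.replace(placeholder, "0"))
    # Truncate to int, the same result as int(float(x)) for each value
    return parsed.astype(int)


# Hypothetical sample rows standing in for qunar.xlsx
raw = pd.DataFrame({
    "price": ["190", "暫無", "45.5"],
    "month_sales": ["1200", "暫無", "30"],
    "district": ["陝西·西安·臨潼區", "北京·北京·東城區", "四川·成都·都江堰"],
})

cleaned = raw.assign(
    price=clean_numeric(raw["price"]),
    month_sales=clean_numeric(raw["month_sales"]),
    # First two characters, as in the loop version
    district=raw["district"].str[:2],
)
print(cleaned)
```

Note that taking the first two characters, here as in the original loop, would cut short three-character names such as 黑龍江 or 內蒙古.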

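2. Top 20 attractions by sales volume

First import the required package, pyecharts: Bar builds bar charts, Page combines several charts into a single HTML file, and options configures the charts.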

from pyecharts.charts import Bar,Page
import pandas as pd
import os
import numpy as np
from pyecharts import options as opts

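# Top 20 attractions by sales volume
def sale_rank():
    '''
    Top 20 attractions by monthly sales
    :return: None
    '''
    # df is a module-level DataFrame that this post does not define;
    # presumably it is the cleaned data read from qunar_new.xlsx
    global  df
    place_sale = pd.pivot_table(df, index="name", values="month_sales", aggfunc="sum")
    place_sale.sort_values(by=["month_sales"], axis=0, inplace=True, ascending=True)
    # Build the bar chart from the 20 largest values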
    place_sale_bar = (
        Bar()
            .add_xaxis(place_sale.index.tolist()[-20:])
            .add_yaxis("", list(map(int, np.ravel(place_sale)))[-20:])
            .reversal_axis()
            .set_series_opts(label_opts=opts.LabelOpts(position="right"))
            .set_global_opts(title_opts=opts.TitleOpts(title="2019國慶熱門景點銷量排行top20"),
                             yaxis_opts=opts.AxisOpts(name="景點名稱"),
                             xaxis_opts=opts.AxisOpts(name="銷量")
                             )
    )
    place_sale_bar.render('place_sale_bar.html')

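The code above uses a few important and powerful functions:

pivot_table(): the pandas pivot table, comparable to a pivot table in Excel or GROUP BY in SQL; it is concise and powerful, and the pandas documentation covers the details
sort_values(): the pandas function for sorting
In the bar chart above, Bar() is followed by several methods joined with dots. This is method chaining in Python, and it keeps the code concise. The pyecharts chart methods used here all support chaining; if you prefer not to chain, you can call them one at a time instead
In pyecharts, a finished chart is written to an HTML file with render(); you can pass an output path to render() to save the file wherever you like.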
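
To make the pivot_table() and GROUP BY comparison concrete, here is a small sketch on made-up data (the names A, B, C and the numbers are hypothetical). It aggregates the same column both ways and then takes the largest entries, which is the pattern the ranking function relies on.

```python
import pandas as pd

# Hypothetical rows: attraction "A" appears twice, so its sales are summed
toy = pd.DataFrame({
    "name": ["A", "B", "A", "C"],
    "month_sales": [10, 5, 7, 20],
})

# Pivot-table form, as used in sale_rank()
by_pivot = pd.pivot_table(toy, index="name", values="month_sales", aggfunc="sum")

# Equivalent groupby form
by_group = toy.groupby("name")[["month_sales"]].sum()

# Sort ascending and keep the last rows, i.e. the largest totals
top2 = by_pivot.sort_values(by="month_sales", ascending=True).tail(2)
print(top2)
```

Sorting in ascending order and slicing from the end keeps the largest bar last, which places it at the top of the chart once the axes are reversed.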

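The chart shows that the Terracotta Army (兵馬俑) has the highest monthly sales.

3. Top 20 attractions by revenue

Revenue = unit price * sales volume, which to some extent reflects how popular an attraction is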

# Top 20 attractions by revenue
def amount_rank():
    '''
    Top 20 attractions by revenue (price * monthly sales)
    :return: None
    '''
    global  df
    amount_list=[]
    # Iterate over the rows of df
    for index,row in df.iterrows():
        try:
            amount=row["price"]*row["month_sales"]
        except Exception as e:
            amount=0
        amount_list.append(amount)
    df["amount"]=amount_list
    place_amount = pd.pivot_table(df, index="name", values="amount", aggfunc="sum")
    place_amount.sort_values(by=["amount"], axis=0, inplace=True, ascending=True)
    # Build the bar chart from the 20 largest values
    place_amount_bar = (
        Bar()
            .add_xaxis(place_amount.index.tolist()[-20:])
            .add_yaxis("", list(map(int, np.ravel(place_amount)))[-20:])
            .reversal_axis()
            .set_series_opts(label_opts=opts.LabelOpts(position="right"))
            .set_global_opts(title_opts=opts.TitleOpts(title="2019國慶熱門景點銷售額排行top20"),
                             yaxis_opts=opts.AxisOpts(name="景點名稱"),
                             xaxis_opts=opts.AxisOpts(name="銷售額")
                             )
    )
    place_amount_bar.render('place_amount_bar.html')

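The core code is similar to the sales ranking. The difference is that computing revenue adds a for loop over the DataFrame with df.iterrows(), multiplying each attraction's price by its monthly sales. Running it produces the revenue bar chart.

The chart shows that the Terracotta Army has the highest revenue, mainly because its monthly sales are high and its tickets are not cheap, at 190 RMB per person.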
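
Once price and month_sales are already numeric, the revenue column can also be computed without a row loop. This is a sketch of that alternative on made-up rows (the names and numbers are hypothetical), not the code used to produce the chart.

```python
import pandas as pd

# Hypothetical cleaned rows; both numeric columns are already integers
toy = pd.DataFrame({
    "name": ["X", "Y", "Z"],
    "price": [190, 0, 40],
    "month_sales": [100, 50, 10],
})

# Column-wise multiplication replaces the iterrows() loop
toy["amount"] = toy["price"] * toy["month_sales"]

# Total revenue per attraction, largest first
ranking = toy.groupby("name")["amount"].sum().sort_values(ascending=False)
print(ranking)
```

The try/except in the loop version guards against non-numeric cells; the vectorized form assumes the cleaning step has already removed them.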

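4. Number of attractions per province

I originally wanted to count attractions per province for each star level, but after repeatedly tweaking the pivot_table() arguments I still could not get the result I wanted, so in the end I simply counted attractions per province. Ignoring the star level makes it fairly simple. (If anyone reading this wants to give it a try, please leave a comment telling me how to count attractions per province by star level; I would be very grateful. I will also try again myself.)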

# Number of attractions per province
def star_rank():
    star_amount = pd.pivot_table(df,index=["district"],values=["star"],aggfunc="count")
    star_amount.sort_values(by=["star"], axis=0, inplace=True, ascending=True)
    # Build the bar chart from the 20 largest values
    star_amount_bar = (
        Bar()
            .add_xaxis(star_amount.index.tolist()[-20:])
            .add_yaxis("", list(map(int, np.ravel(star_amount)))[-20:])
            .reversal_axis()
            .set_series_opts(label_opts=opts.LabelOpts(position="right"))
            .set_global_opts(title_opts=opts.TitleOpts(title="2019各省國慶熱門景點數排行top20"),
                             yaxis_opts=opts.AxisOpts(name="省份"),
                             xaxis_opts=opts.AxisOpts(name="景點數")
                             )
    )
    # star_amount_bar.render('star_amount_bar.html')
    return star_amount_bar
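
For the per-province, per-star count that this section could not get out of pivot_table(), one possible approach is pd.crosstab, which counts how often each pair of values occurs. The sketch below uses made-up rows and assumes the cleaned district and star columns; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical cleaned rows: one row per attraction
toy = pd.DataFrame({
    "district": ["陝西", "陝西", "北京", "北京", "北京"],
    "star": ["5A", "4A", "5A", "5A", "1A"],
})

# Rows are provinces, columns are star levels, cells are attraction counts
star_by_province = pd.crosstab(toy["district"], toy["star"])

# Per-province total across all star levels, computed separately
totals = star_by_province.sum(axis=1)

print(star_by_province)
print(totals)
```

Combinations that never occur are filled with 0, so every province has a value for every star level.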

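To an expert all of this is trivial, but as a beginner I used the pyecharts library for the first time today and really like how the charts look. Next I will try other chart types and learn how to show several charts in one HTML file and control their layout. Keep going!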
