The Advertising.csv dataset: analysing the relationship between advertising media spend and sales

2019-05-01 00:51

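We are given the dataset Advertising.csv. It contains product sales figures for 200 different markets, and each sales figure comes with the advertising spend on three media: TV, radio, and newspaper. If we can work out how advertising spend relates to sales, we can allocate the advertising budget better and maximise sales.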

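The following experiments are required:

1. Read the dataset with pandas to obtain the corresponding data matrix. Use matplotlib to draw a scatter plot of TV, Radio, and Newspaper spend against product sales.

Requirements:

The result is a single figure, with TV, radio, and newspaper drawn using differently shaped points.
The X axis is advertising spend and the Y axis is sales.
Dashed grid reference lines must be drawn.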

import matplotlib.pyplot as plt

def graph1(data):
    """Scatter TV, Radio and Newspaper spend against Sales on one figure."""
    TV = data.TV
    Radio = data.Radio
    Newspaper = data.Newspaper
    Sales = data.Sales

    # One marker shape (and colour) per medium so the three series can be told apart.
    plt.scatter(TV, Sales, c='r', marker='o', label='TV')
    plt.scatter(Radio, Sales, c='b', marker='x', label='Radio')
    plt.scatter(Newspaper, Sales, c='y', marker='d', label='Newspaper')
    plt.legend()
    plt.ylabel('Sales')
    plt.xlabel('Advertising spend')
    plt.grid(linestyle='--')  # dashed grid lines
    # The output directory must already exist.
    plt.savefig('D://Ml_lab_result/ProblemA_1.png')
    plt.show()
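
The task asks for the data to be read with pandas, but the loading step is not shown in the post. A minimal sketch of it follows, assuming the CSV header uses the same column names that the plotting code accesses (`TV`, `Radio`, `Newspaper`, `Sales`). The helper name `load_advertising` and the constant `EXPECTED_COLUMNS` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Column names the plotting code relies on (assumed to match the CSV header).
EXPECTED_COLUMNS = ["TV", "Radio", "Newspaper", "Sales"]


def load_advertising(path):
    """Read the advertising CSV and return only the four columns used here.

    Hypothetical helper: raises ValueError if an expected column is absent,
    so a header mismatch is reported before any plotting starts.
    """
    data = pd.read_csv(path)
    missing = [col for col in EXPECTED_COLUMNS if col not in data.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Selecting by name also drops any extra column, such as a row index.
    return data[EXPECTED_COLUMNS]
```

With the file in the working directory (an assumed location), `data = load_advertising("Advertising.csv")` followed by `graph1(data)` would produce the figure. Some copies of this dataset may capitalise the headers differently; in that case the columns need renaming first.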

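2. Use matplotlib again to draw three separate scatter plots: TV against product sales, Radio against product sales, and Newspaper against product sales.

Requirements:

The result is one large figure made up of three subplots arranged in three rows. From top to bottom they are: TV against product sales, Radio against product sales, and Newspaper against product sales.
The X axis is advertising spend and the Y axis is sales.
Dashed grid reference lines must be drawn.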

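import matplotlib.pyplot as plt

def graph2(data):
    """Draw TV, Radio and Newspaper against Sales as three stacked subplots."""
    TV = data.TV
    Radio = data.Radio
    Newspaper = data.Newspaper
    Sales = data.Sales
    plt.figure()
    # Top row: TV against Sales.
    plt.subplot(311)
    plt.scatter(TV, Sales, c='r', marker='o')
    plt.grid(linestyle='--')
    # Middle row: Radio against Sales. The y-axis label is set here, after the
    # figure exists, so that it lands on this figure.
    plt.subplot(312)
    plt.scatter(Radio, Sales, c='b', marker='x')
    plt.ylabel('Sales')
    plt.grid(linestyle='--')
    # Bottom row: Newspaper against Sales, carrying the x-axis label.
    plt.subplot(313)
    plt.scatter(Newspaper, Sales, c='y', marker='d')
    plt.xlabel('Advertising spend')
    plt.grid(linestyle='--')
    plt.savefig('D://Ml_lab_result/ProblemA_2.png')
    plt.show()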

The plots suggest that, of the three media, Newspaper spend has the weakest relationship with product sales.
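
The observation about Newspaper is read off the plots by eye. One way to put a number on it, which is not part of the original lab, is the Pearson correlation of each medium with sales. The sketch below assumes the DataFrame has a `Sales` column; the helper name `sales_correlations` is hypothetical.

```python
import pandas as pd


def sales_correlations(data, target="Sales"):
    """Pearson correlation of every other numeric column with `target`.

    The result is ordered from the strongest to the weakest absolute correlation.
    """
    numeric = data.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)
```

Calling `sales_correlations(data)` on the loaded dataset returns one value per medium. Pearson correlation only measures linear association, so it complements the scatter plots rather than replacing them.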

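3. Standardise the data first, then build a polynomial-fitting linear regression model and train it with polynomial degrees 1 through 9. Finally, plot the predictions against the true values.

Requirements:

Use 20% of the data as the test set and 80% as the training set. Because there are three features (TV, Radio, Newspaper), a two-dimensional plot of features against predictions cannot be drawn. The X axis is therefore the index of the test sample, and the Y axis is product sales.
Draw nine figures, one per degree. In each, a green line shows the model's predicted sales on the test set and a red line shows the actual sales for the same samples. The figure title states the polynomial degree of the linear model.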

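import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

def mlr(data):
    """Fit polynomial regressions of degree 1 to 9 and plot test predictions."""
    X = data[['TV', 'Radio', 'Newspaper']]
    y = data['Sales']
    # Hold out 20% of the rows as the test set.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

    for degree in range(1, 10):
        # Standardise the features, expand them to polynomial terms, then fit least squares.
        model = make_pipeline(StandardScaler(), PolynomialFeatures(degree), LinearRegression())
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        index = range(len(y_pred))  # x-axis: position of each test sample
        plt.figure()
        plt.plot(index, y_pred, c='g', label="predict")
        plt.plot(index, y_test, c='r', label="true")
        plt.grid()
        plt.title("degree %d" % degree)
        plt.legend()
        path_img = 'D://Ml_lab_result/degree_%d.png' % degree
        plt.savefig(path_img)
        plt.show()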
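
The nine figures compare predicted and actual sales only visually. As a possible complement, which the original lab does not ask for, the sketch below reports the root mean squared error (RMSE) on the training and test sets for each degree, using the same 80/20 split and a standardise, expand, fit pipeline. The helper name `rmse_by_degree` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def rmse_by_degree(X, y, degrees=range(1, 10), test_size=0.2, random_state=1):
    """Return {degree: (train_rmse, test_rmse)} for polynomial regressions."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    scores = {}
    for degree in degrees:
        # Standardise, expand to polynomial terms, then fit ordinary least squares.
        model = make_pipeline(
            StandardScaler(), PolynomialFeatures(degree), LinearRegression()
        )
        model.fit(X_train, y_train)
        train_rmse = float(np.sqrt(mean_squared_error(y_train, model.predict(X_train))))
        test_rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
        scores[degree] = (train_rmse, test_rmse)
    return scores
```

For example, `rmse_by_degree(data[['TV', 'Radio', 'Newspaper']], data['Sales'])` gives one pair of errors per degree. A test RMSE that climbs while the training RMSE keeps shrinking is the usual sign that a degree is overfitting.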