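The previous article dealt with whether a Steam game is discounted at all. This one asks how deep the discount is, so a classification model no longer fits and we need a regression model instead.

Main goal

In this project, I try to find out which factors affect the discount rate on Steam and build a linear regression model to predict it.

Data

The data is scraped directly from the official Steam website.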

https://store.steampowered.com/tags/en/Strategy/

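We write the scraper in Python, using the following libraries:

"re": regular expressions, used for pattern matching.

"csv": writes the scraped data to a .csv file, which is then processed with pandas.

"requests": sends HTTP/HTTPS requests to the Steam website.

"BeautifulSoup": locates HTML elements/tags.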
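
The parse-and-write part of such a scraper might look roughly like the sketch below. It is only an illustration: the HTML fragment, its class names, and the helper names `parse_rows` and `write_csv` are invented here and are not Steam's real page structure or the original scraper. To stay self-contained it uses only `re` and `csv`; in a real run, `requests` would fetch the page and `BeautifulSoup` would locate the elements.

```python
import csv
import io
import re

# Invented fragment for illustration only; this is NOT Steam's real markup.
SAMPLE_HTML = (
    '<div class="game"><span class="title">Example Game A</span>'
    '<span class="discount">-50%</span><span class="price">$19.99</span></div>'
    '<div class="game"><span class="title">Example Game B</span>'
    '<span class="discount">-75%</span><span class="price">$39.99</span></div>'
)

# One regex per game row, with named groups for the fields we want to keep.
ROW_PATTERN = re.compile(
    r'<span class="title">(?P<title>[^<]+)</span>'
    r'<span class="discount">-(?P<discount>\d+)%</span>'
    r'<span class="price">\$(?P<price>[\d.]+)</span>'
)

FIELDS = ["Title", "DiscountPercentage", "OriginalPrice"]


def parse_rows(html):
    """Extract one dict per game from the (hypothetical) HTML fragment."""
    return [
        {
            "Title": match["title"],
            "DiscountPercentage": int(match["discount"]),
            "OriginalPrice": float(match["price"]),
        }
        for match in ROW_PATTERN.finditer(html)
    ]


def write_csv(rows, handle):
    """Write the parsed rows, with a header line, to any file-like object."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


rows = parse_rows(SAMPLE_HTML)
buffer = io.StringIO()  # stands in for open("games.csv", "w", newline="")
write_csv(rows, buffer)
```

Writing to a file-like object keeps the sketch free of side effects; swapping the buffer for a real file handle would produce the .csv that pandas later reads.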

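Once the data is loaded into pandas, it looks roughly like this:

The target the model is trained to predict is the discount percentage, stored in the DiscountPercentage column.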

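Data cleaning

The raw scraped data contains a lot that we do not need:

1. Free games with no price, including demos and upcoming releases.

2. Games that are not discounted.

3. Non-numeric data.

While cleaning these out, we can also do some feature engineering.

Later sections cover all the feature engineering done during modeling and testing. For the baseline model, the following is enough.

Add a "Season" column that records the season in which the game was released: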
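
One way such a column might be derived is sketched below. The column name `ReleaseDate`, the helper `month_to_season`, and the choice of Northern Hemisphere meteorological seasons are assumptions made for illustration; the article does not say how the season was computed or whether it was stored as text or as a number.

```python
import pandas as pd


def month_to_season(month: int) -> str:
    """Map a month number (1-12) to a Northern Hemisphere meteorological season."""
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    if month in (9, 10, 11):
        return "Autumn"
    return "Winter"  # December, January, February


# Toy data; "ReleaseDate" is a hypothetical column name.
games = pd.DataFrame({"ReleaseDate": ["2019-01-15", "2018-07-04", "2017-10-31"]})

# Parse the dates, take the month, and map each month to its season label.
release_month = pd.to_datetime(games["ReleaseDate"]).dt.month
games["Season"] = release_month.map(month_to_season)
```

A text label like this would still have to be encoded numerically (for example with dummy variables) before a linear model can use it.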

Once that is done, we can drop all string-based columns from the dataframe:
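
A minimal sketch of that step, assuming the goal is simply to keep the numeric columns. The toy frame and its values are made up, and `select_dtypes` is one possible way to do this rather than necessarily what the original code did.

```python
import pandas as pd

# Toy frame mixing text and numeric columns; the values are invented.
games = pd.DataFrame(
    {
        "Title": ["Example Game A", "Example Game B"],
        "Main_Tag": ["Strategy", "Indie"],
        "OriginalPrice": [19.99, 39.99],
        "Reviews": [120, 45],
        "DiscountPercentage": [50, 75],
    }
)

# Keep only numeric columns; every text column is dropped in one call.
numeric_games = games.select_dtypes(include="number")
```

Note that a column holding numbers as text would be dropped as well, so it needs converting (for example with `astype(float)`) before this step.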

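Cleaning also shrinks the data from 14,806 entries with 12 features to 370 entries with 7 features.

The bad news is that with such a small sample, the model is prone to error.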

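Data analysis

The analysis consists of three steps:

Exploratory data analysis (EDA)
Feature engineering (FE)
Modeling

The general workflow looks like this:

EDA to find feature-target relationships (via pair plots/heat maps, Lasso coefficients, and so on).

Feature engineering based on the information available (mathematical transformations, binning, dummy variables).

Modeling and scoring using R² and/or other metrics (RMSE, MAE, and so on).

Rinse and repeat until every feature engineering idea has been tried or an acceptable score (for example R²) is reached.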
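
For reference, the three metrics named in the workflow can be computed by hand as in the sketch below. The helper name `regression_scores` and the sample numbers are made up for illustration; scikit-learn provides equivalent functions.

```python
import numpy as np


def regression_scores(y_true, y_pred):
    """Return R^2, RMSE and MAE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean

    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
    }


# Invented discount percentages and predictions.
scores = regression_scores([10, 20, 30, 40], [12, 18, 33, 39])
```

R² measures the share of variance explained, while RMSE and MAE are in the same unit as the target (percentage points of discount), with RMSE penalizing large misses more heavily.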

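import pprint
from math import sqrt

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LassoCV, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Plot a correlation heat map as part of EDA
sns.heatmap(df.corr(),cmap="YlGnBu",annot=True)
plt.show()

# Helper that splits the data, fits and scores four models, and prints the Lasso coefficients to show the most relevant features
def RSquare(df, col):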
    X, y = df.drop(col,axis=1), df[col]

    X, X_test, y, y_test = train_test_split(X, y, test_size=.2, random_state=10) #hold out 20% of the data for final testing

    #this helps with the way kf will generate indices below
    X, y = np.array(X), np.array(y)
    
    kf = KFold(n_splits=5, shuffle=True, random_state = 50)
    cv_lm_r2s, cv_lm_reg_r2s, cv_lm_poly_r2s, cv_lasso_r2s = [], [], [], [] #collect the validation results for all four models

    for train_ind, val_ind in kf.split(X,y):

        X_train, y_train = X[train_ind], y[train_ind]
        X_val, y_val = X[val_ind], y[val_ind] 

        #simple linear regression
        lm = LinearRegression()
        lm_reg = Ridge(alpha=1)
        lm_poly = LinearRegression()

        lm.fit(X_train, y_train)
        cv_lm_r2s.append(lm.score(X_val, y_val))

        #ridge with feature scaling
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_val_scaled = scaler.transform(X_val)
        lm_reg.fit(X_train_scaled, y_train)
        cv_lm_reg_r2s.append(lm_reg.score(X_val_scaled, y_val))

        poly = PolynomialFeatures(degree=2) 
        X_train_poly = poly.fit_transform(X_train)
        X_val_poly = poly.transform(X_val)  # reuse the transformer fitted on the training fold
        lm_poly.fit(X_train_poly, y_train)
        cv_lm_poly_r2s.append(lm_poly.score(X_val_poly, y_val))
        
        #Lasso
        std = StandardScaler()
        std.fit(X_train)
        
        X_tr = std.transform(X_train)
        X_te = std.transform(X_test)
        
        X_val_lasso = std.transform(X_val)
        
        alphavec = 10**np.linspace(-10,10,1000)

        lasso_model = LassoCV(alphas = alphavec, cv=5)
        lasso_model.fit(X_tr, y_train)
        cv_lasso_r2s.append(lasso_model.score(X_val_lasso, y_val))
        
        test_set_pred = lasso_model.predict(X_te)
        
        column = df.drop(col,axis=1)
        to_print = list(zip(column.columns, lasso_model.coef_))
        pp = pprint.PrettyPrinter(indent = 1)
    
        rms = sqrt(mean_squared_error(y_test, test_set_pred))
        
    print('Simple regression scores: ', cv_lm_r2s, '\n')
    print('Ridge scores: ', cv_lm_reg_r2s, '\n')
    print('Poly scores: ', cv_lm_poly_r2s, '\n')
    print('Lasso scores: ', cv_lasso_r2s, '\n')

    print(f'Simple mean cv r^2: {np.mean(cv_lm_r2s):.3f} +- {np.std(cv_lm_r2s):.3f}')
    print(f'Ridge mean cv r^2: {np.mean(cv_lm_reg_r2s):.3f} +- {np.std(cv_lm_reg_r2s):.3f}')
    print(f'Poly mean cv r^2: {np.mean(cv_lm_poly_r2s):.3f} +- {np.std(cv_lm_poly_r2s):.3f}', '\n')
    
    print('lasso_model.alpha_:', lasso_model.alpha_)
    print(f'Lasso mean cv r^2: {np.mean(cv_lasso_r2s):.3f} +- {np.std(cv_lasso_r2s):.3f}')
    print(f'Lasso test r^2: {r2_score(y_test, test_set_pred):.3f}', '\n')
    
    print(f'MAE: {np.mean(np.abs(test_set_pred - y_test))}', '\n')
    print('RMSE:', rms, '\n')
    
    print('Lasso Coef:')
    pp.pprint (to_print)

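The code comes first; the attempts below explain how it was used.

First attempt: baseline model, keeping only games with more than 30 reviews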

# Setting a floor limit of 30
df1 = df1[df1.Reviews > 30]

Best Model: Lasso
Score: 0.419 +- 0.073

Second attempt: log-transform "Reviews" and "OriginalPrice"

df2.Reviews = np.log(df2.Reviews)
df2.OriginalPrice = df2.OriginalPrice.astype(float)
df2.OriginalPrice = np.log(df2.OriginalPrice)

Best Model: Lasso
Score: 0.437 +- 0.104
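
One caveat with this transform: `np.log` is undefined at zero, so any row with a value of 0 turns into negative infinity. The sketch below, on made-up review counts, shows `np.log1p` as a possible alternative that stays finite; it is a suggestion, not what the code above uses.

```python
import numpy as np

# Invented review counts, including a zero to show the edge case.
reviews = np.array([0, 31, 1_000, 100_000], dtype=float)

# np.log(0) evaluates to -inf; the warning is silenced here on purpose.
with np.errstate(divide="ignore"):
    plain_log = np.log(reviews)

# log1p computes log(1 + x), which maps 0 to 0 and stays finite.
safe_log = np.log1p(reviews)

# expm1 is the exact inverse, so the original scale can be recovered.
recovered = np.expm1(safe_log)
```

If zero-valued rows have already been filtered out, the plain `np.log` is fine as it is.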

Third attempt: one-hot encode Main_Tag

# Checking to make sure the dummies are separated correctly
pd.get_dummies(df3.Main_Tag).head(5)

# Adding dummy categories into the dataframe
df3 = pd.concat([df3, pd.get_dummies(df3.Main_Tag).astype(int)], axis = 1)

# Drop original string based column to avoid conflict in linear regression
df3.drop('Main_Tag', axis = 1, inplace=True)

Best Model: Lasso
Score: 0.330 +- 0.073

Fourth attempt: try one-hot encoding all of the non-numeric data

# we can get dummies for each tag listed separated by comma
split_tag = df4.All_Tags.astype(str).str.strip('[]').str.get_dummies(', ')

# Now merge the dummies into the data frame to start EDA
df4= pd.concat([df4, split_tag], axis=1)

# Remove any column that only has value of 0 as precaution
df4 = df4.loc[:, (df4 != 0).any(axis=0)]

Best Model: Lasso
Score: 0.359 +- 0.080
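
To make the tag splitting used in this attempt concrete, the sketch below runs the same `strip` and `get_dummies` chain on three made-up rows. It assumes each All_Tags value looks like `[Strategy, Indie]` once converted to a string; the real column may be formatted differently (for example with quotes around each tag, which would then end up in the column names).

```python
import pandas as pd

# Invented rows; the exact text format of All_Tags is an assumption.
all_tags = pd.Series(["[Strategy, Indie]", "[Strategy, RPG]", "[Indie]"])

# Strip the surrounding brackets, then create one 0/1 column per distinct tag.
tag_dummies = all_tags.str.strip("[]").str.get_dummies(sep=", ")
```

Each game ends up with a 1 in every tag column it carries, so a game with several tags has several 1s, unlike the single-tag encoding from the third attempt.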

Fifth attempt: combine attempts 2 and 4

# Dummy all top 5 tags
split_tag = df5.All_Tags.astype(str).str.strip('[]').str.get_dummies(', ')
df5= pd.concat([df5, split_tag], axis=1)

# Log transform Review due to skewed pairplot graphs
df5['Log_Review'] = np.log(df5['Reviews'])

Best Model: Lasso
Score: 0.359 +- 0.080

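The score is identical to the fourth attempt, which suggests that the log-transformed "Reviews" feature adds nothing to the prediction of the discount percentage. This step can therefore be skipped without changing the result.

Sixth attempt: bin "Reviews" and "Days_Since_Release"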

# Bin reviews (highly correlated with popularity) using the 25th and 75th percentiles as cut points
df6.loc[df6['Reviews'] < 33, 'low_pop'] = 1
df6.loc[(df6.Reviews >= 33) & (df6.Reviews < 381), 'mid_pop'] = 1
df6.loc[df6['Reviews'] >= 381, 'high_pop'] = 1

# Bin Days_Since_Release using the 25th and 75th percentiles as cut points
df6.loc[df6['Days_Since_Release'] < 418, 'new_game'] = 1
df6.loc[(df6.Days_Since_Release >= 418) & (df6.Days_Since_Release < 1716), 'established_game'] = 1
df6.loc[df6['Days_Since_Release'] >= 1716, 'old_game'] = 1

# Fill all the NaN's
df6.fillna(0, inplace = True)

# Drop the old columns to avoid multicollinearity
df6.drop(['Reviews', 'Days_Since_Release'], axis=1, inplace = True)

Each of the two columns is split into three features.

Best Model: Ridge
Score: 0.273 +- 0.044
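
The cut points in this attempt are hard-coded. A sketch of how they could be derived from the data instead is shown below, on made-up review counts; the flag names follow the code above, but computing the thresholds with `quantile` is an assumption about how the numbers were obtained.

```python
import pandas as pd

# Invented review counts.
games = pd.DataFrame({"Reviews": [5, 20, 40, 100, 300, 500, 1000, 5000]})

# Compute the 25th and 75th percentiles instead of typing them in.
q25, q75 = games["Reviews"].quantile([0.25, 0.75])

# Build the three 0/1 flags directly from boolean masks, so no NaN filling is needed.
games["low_pop"] = (games["Reviews"] < q25).astype(int)
games["mid_pop"] = ((games["Reviews"] >= q25) & (games["Reviews"] < q75)).astype(int)
games["high_pop"] = (games["Reviews"] >= q75).astype(int)
```

Because the masks are mutually exclusive and cover every value, each row gets exactly one flag set. With an intercept in the model, the three flags are perfectly collinear, so dropping one of them is a common choice.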

Seventh attempt: split the features at the median instead and drop games with 30 or fewer reviews.

# Set a floor of 30 reviews for the data frame to remove some outliers
df7 = df7[df7.Reviews > 30].copy()

# Binning based on 50%
df7.loc[df7['Reviews'] < 271, 'low_pop'] = 1
df7.loc[df7['Reviews'] >= 271, 'high_pop'] = 1

df7.loc[df7['Days_Since_Release'] < 1167, 'new_game'] = 1
df7.loc[df7['Days_Since_Release'] >= 1167, 'old_game'] = 1

# Fill all NaN's
df7.fillna(0, inplace= True)

# Drop old columns to avoid multicollinearity
df7.drop(['Reviews', 'Days_Since_Release'], axis=1, inplace = True)