周志華《機器學習》課後習題第三章解答：Ch3.3 - 編程實現對率迴歸

這裏採用Python-sklearn的方式，環境搭建可參考數據挖掘入門：Python開發環境搭建（eclipse-pydev模式）.

相關答案和源代碼託管在我的Github上：PY131/Machine-Learning_ZhouZhihua.

思路概要

編程實現對率迴歸：
* 採用sklearn邏輯斯蒂迴歸庫函數實現，通過查看混淆矩陣，繪製決策區域來查看模型分類效果；
* 自己編程實現，從極大化似然函數出發，採用梯度下降法得到最優參數，然後嘗試了隨機梯度下降法來優化過程。

3.3 編程實現對率迴歸

所使用的數據集如下：

本題是本書的第一個編程練習，採用了自己編程實現和調用sklearn庫函數兩種不同的方式，詳細解答和編碼過程：（查看完整代碼）：

1.獲取數據、查看數據、預處理：

觀察數據可知，X包含（密度、含糖量）兩個變量，y爲西瓜是否好瓜分類（二分），由此生成.csv數據文件，在Python中用Numpy讀取數據並採用matplotlib庫可視化數據：

樣例代碼：

'''
data importion
'''
import numpy as np
import matplotlib.pyplot as plt  

# load the CSV file as a numpy matrix
dataset = np.loadtxt('../data/watermelon_3a.csv', delimiter=",")

# separate the data from the target attributes
X = dataset[:,1:3]
y = dataset[:,3]

# draw scatter diagram to show the raw data
f1 = plt.figure(1)   
plt.title('watermelon_3a')  
plt.xlabel('density')  
plt.ylabel('ratio_sugar')  
plt.scatter(X[y == 0,0], X[y == 0,1], marker = 'o', color = 'k', s=100, label = 'bad')
plt.scatter(X[y == 1,0], X[y == 1,1], marker = 'o', color = 'g', s=100, label = 'good')
plt.legend(loc = 'upper right')  
plt.show()

數據散點圖：

2.採用sklearn邏輯迴歸庫函數直接擬合：

雖然樣本量很少，這裏還是先劃分訓練集和測試集，採用sklearn.model_selection.train_test_split()實現，然後採用sklearn.linear_model.LogisticRegression，基於訓練集直接擬合出邏輯迴歸模型，然後在測試集上評估模型（查看混淆矩陣和F1值）。

樣例代碼：

''' 
using sklearn lib for logistic regression
'''
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn import metrics


# generalization of test and train set
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.5, random_state=0)

# model training
log_model = LogisticRegression() 
log_model.fit(X_train, y_train) 

# model testing
y_pred = log_model.predict(X_test)

# summarize the accuracy of fitting
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))

得出混淆矩陣和相關度量（查準率（準確率）、查全率（召回率），F1值）結果如下：

[[4 1]
 [1 3]]
             precision    recall  f1-score   support

        0.0       0.80      0.80      0.80         5
        1.0       0.75      0.75      0.75         4

avg / total       0.78      0.78      0.78         9

由混淆矩陣可以看到，由於樣本本身數量較少，模型擬合效果一般，總體預測精度約爲0.78。爲提升精度，可以採用自助法進行重抽樣擴充數據集，或是採用交叉驗證選擇最優模型。

下圖是採用matplotlib.contourf繪製的決策區域和邊界，可以看出對率迴歸分類器還是成功的分出了絕大多數類：

3.自己編程實現邏輯斯蒂迴歸

編程實現邏輯迴歸的主要工作是求取參數w和b（見書p59），最常用的參數估計方法是極大似然法，由於題3.1已經證得對數似然函數（見書3.27）是凸函數，存在最優解，這裏考慮採用梯度下降法來迭代尋優。

回顧一下Sigmoid函數，即邏輯斯蒂迴歸分類器的基礎模型：

目的是基於數據集求出最優參數w和b，最常採用的是極大似然法，參數的似然函數爲：

根據書p59，最大化上式等價於最小化下式：

題3.2已證上式爲凸函數，一定存在最小值，但按照導數爲零的解析求解方式較爲困難，於是考慮採用梯度下降法來求解上式最小值時對應的參數。

注：梯度下降法基本知識可參考書中附錄p409頁，也可直接採用書中p60式3.30偏導數公式。書中關於參數迭代改變式子如下：

對於迭代，可每次先根據(B.16)計算出梯度▽f(β)，然後由(B.17)更新得出下一步的Δβ。

接下來編程實現基本的梯度下降法：

(1)首先編程實現對象式3.27：

def likelihood_sub(x, y, beta):
    '''
    @param x: one sample variables
    @param y: one sample label
    @param beta: the parameter vector in 3.27
    @return: the sub_log-likelihood of 3.27  
    ''' 
    return -y * np.dot(beta, x.T) + np.math.log(1 + np.math.exp(np.dot(beta, x.T)))   

def likelihood(X, y, beta):
    '''
    @param X: the sample variables matrix
    @param y: the sample label matrix
    @param beta: the parameter vector in 3.27
    @return: the log-likelihood of 3.27  
    '''
    sum = 0
    m,n = np.shape(X)  

    for i in range(m):
        sum += likelihood_sub(X[i], y[i], beta)

    return sum

(2)然後基於訓練集（注意x->[x,1]），給出基於3.27似然函數的定步長梯度下降法，注意這裏的偏梯度實現技巧：

'''
@param X: X is the variable matrix 
@param y: y is the label array
@return: the best parameter estimate of 3.27
'''
def gradDscent_1(X, y):  #implementation of basic gradDscent algorithms

    h = 0.1  # step length of iteration
    max_times= 500  # give the iterative times limit
    m, n = np.shape(X)

    beta = np.zeros(n)  # parameter and initial to 0
    delta_beta = np.ones(n)*h  # delta parameter and initial to h
    llh = 0
    llh_temp = 0

    for i in range(max_times):
        beta_temp = beta.copy()

        # for partial derivative 
        for j in range(n): 
            beta[j] += delta_beta[j]
            llh_tmp = likelihood(X, y, beta)
            delta_beta[j] = -h * (llh_tmp - llh) / delta_beta[j]                
            beta[j] = beta_temp[j]

        beta += delta_beta            
        llh = likelihood(X, y, beta)

    return beta

通過追蹤參數，查看其收斂曲線，然後來調節相關參數（步長h，迭代次數max_times）。下圖是在當前參數取值下的beta曲線，可以看到其收斂良好：

(3)最後建立Sigmoid預測函數，對測試集數據進預測，得到混淆矩陣如下：

[[ 4.  1.]
 [ 1.  3.]]

可以看出其總體預測精度（7/9 ≈ 0.78）與調用sklearn庫得出的結果相當。

(4)採用隨機梯度下降法來優化：上面採用的是全局定步長梯度下降法（稱之爲批量梯度下降），這種方法在可能會面臨收斂過慢和收斂曲線波動情況的同時，每次迭代需要全局計算，計算量隨數據量增大而急劇增大。所以嘗試採用隨機梯度下降來改善參數迭代尋優過程。

隨機梯度下降法的核心思想是增量學習：一次只用一個新樣本來更新迴歸係數，從而形成在線流式處理。

同時爲了加快收斂，採用變步長的策略，h隨着迭代次數逐漸減小。

給出變步長隨機梯度下降法的代碼如下：

def gradDscent_2(X, y):  #implementation of stochastic gradDscent algorithms
  '''
   @param X: X is the variable matrix 
   @param y: y is the label array
   @return: the best parameter estimate of 3.27
   '''
   import matplotlib.pyplot as plt  

   m, n = np.shape(X)
   h = 0.5  #  step length of iterator and initial
   beta = np.zeros(n)  # parameter and initial
   delta_beta = np.ones(n) * h
   llh = 0
   llh_temp = 0

   for i in range(m):
       beta_temp = beta.copy()  # for partial derivative 

       for j in range(n): 
           h = 0.5 * 1 / (1 + i + j)  # change step length of iterator 
           beta[j] += delta_beta[j]
           llh_tmp = likelihood_sub(X[i], y[i], beta)
           delta_beta[j] = -h * (llh_tmp - llh) / delta_beta[j]   
           beta[j] = beta_temp[j]  

       beta += delta_beta    
       llh = likelihood_sub(X[i], y[i], beta)

   return beta

得出混淆矩陣：

[[ 3.  2.]
 [ 0.  4.]]

從結果看到的是：由於這裏的西瓜數據集並不大，所以隨機梯度下降法採用一次遍歷所得的結果不太好，參數也沒有完成收斂。這裏只是給出隨機梯度下降法的實現樣例，這種方法在大數據集下相比批量梯度法應會有明顯的優勢。

參考鏈接：
由於這是本書第一個編程，索引資料較多，擇其重要的一些列出如下：

轉自https://blog.csdn.net/snoopy_yuan/article/details/63684219

周志華《機器學習》課後習題第三章解答：Ch3.3 - 編程實現對率迴歸

思路概要

3.3 編程實現對率迴歸

1.獲取數據、查看數據、預處理：

2.採用sklearn邏輯迴歸庫函數直接擬合：

3.自己編程實現邏輯斯蒂迴歸

lightdb hash index的性能和限制

MIT Python學習基礎系列（二） - dictionary

傳智播客學面向對象的java，配置環境/常見錯誤/命名筆記

傳智播客學習java：運算符

傳智學習java系列：if & switch

傳智博客學習面向對象的java：常量/變量

Mac下配置sublime實現LaTeX

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結