BAT Interview Question 5: About LR (Logistic Regression)

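From now on, one BAT interview question a day will keep you company. If you stick with it, the knowledge accumulates day by day, and before you notice you will have stepped through the door of machine learning and kept going. It can also help you land an offer. Learn to applaud yourself; the same applause goes to all of you who keep striving.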

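The most direct way to understand LR thoroughly is to work through an example and implement LR from source yourself, rather than calling a library. So let's start implementing it right away.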

Step1 Generate a simulated dataset

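To simulate a binary classification task in code, the first step is to generate a dataset for testing. First, look at what the generated dataset looks like: it has two features, w1 and w2, and 200 sample points in total. The task is to classify this dataset.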

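The simulated data consist of points with two features, distributed uniformly according to a fixed rule. `data` denotes the dataset made up of all the sample points and their labels. Its first 10 rows are: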

           w1           w2          y
array([[ 0.78863156,  0.45879449,  1.        ],
       [ 0.70291388,  0.03437041,  1.        ],
       [ 0.89775764,  0.24842968,  1.        ],
       [ 0.92674416,  0.13579184,  1.        ],
       [ 0.80332783,  0.71211063,  1.        ],
       [ 0.7208047 ,  0.48432214,  1.        ],
       [ 0.8523947 ,  0.06768344,  1.        ],
       [ 0.49226351,  0.24969169,  1.        ],
       [ 0.85094261,  0.79031018,  1.        ],
       [ 0.76426901,  0.07703571,  1.        ]])
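
The generation code is not reproduced here, so the following is only a minimal sketch of how a dataset of this shape might be produced. It assumes the two features are drawn uniformly from [0, 1) and that a point is labeled 1 when w1 > w2. That rule is a guess based on the rows shown, not something the source confirms, and the names `make_dataset` and `seed` are hypothetical.

```python
import numpy as np

def make_dataset(n_samples=200, seed=0):
    """Return an (n_samples, 3) array with columns w1, w2, y (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Two features drawn uniformly from [0, 1)
    features = rng.uniform(0.0, 1.0, size=(n_samples, 2))
    # Assumed labeling rule: class 1 when w1 > w2, otherwise class 0
    labels = (features[:, 0] > features[:, 1]).astype(float)
    return np.column_stack((features, labels))

data = make_dataset()
print(data[:10])  # first 10 rows: w1, w2, y
```

Unlike the rows shown, this sketch does not sort or group the samples by class, so its first 10 rows will usually mix both labels.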

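Next, we show how to use gradient descent to find the weights for the two features, so that when a new sample arrives the model can predict whether it belongs to class 0 or class 1.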

Step2 Solve for the weights with gradient descent

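Set a learning rate for the iterations. When the difference between the cost at the previous step and the cost at the current step falls below a threshold, the computation stops. We end up with 3 weights: one for each of the two features, plus one for the bias term.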

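Assuming the model's decision boundary is linear, the basic idea of solving for the logistic regression weights with gradient descent, together with its four formulas, is as follows: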

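'model': the logistic regression model, including the Sigmoid mapping
'cost': the cost function
'gradient': the gradient formula
'theta update': the parameter update formula
'stop strategy': the stopping rule for the iterations, which stops when the decrease in the cost function falls below a threshold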
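
The formulas themselves are not written out here. Assuming the author means the standard formulation of logistic regression, the five items can be written as below, where m is the number of samples, x^{(i)} is the i-th sample with a leading 1 for the bias, y^{(i)} is its label, and σ is the learning rate (named `sigma` in the code). The symbols are chosen for illustration.

```latex
% model: linear score passed through the Sigmoid mapping
h_\theta(x^{(i)}) = \frac{1}{1 + e^{-\theta^{T} x^{(i)}}}

% cost: mean cross-entropy over the m samples
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
  \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]

% gradient: partial derivative with respect to each weight
\frac{\partial J(\theta)}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

% theta update: one gradient descent step with learning rate \sigma
\theta_j := \theta_j - \sigma \, \frac{\partial J(\theta)}{\partial \theta_j}

% stop strategy: stop when the decrease in cost is below the threshold
J(\theta^{(t)}) - J(\theta^{(t+1)}) < \text{threshold}
```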

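Do not forget to initialize a bias column: combine an offset with the 2 features, so that this connects with the theory covered in earlier posts. The code is as follows: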

import time
import numpy as np

# Bias column b, shape = (200, 1); it must be 2-D so that hstack can join it with X
b = np.ones((200, 1))
# X is assumed to hold the two feature columns, shape (200, 2).
# Prepend the bias column, giving shape = (200, 3)
X = np.hstack((b, X))

# model
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def model(theta, X):
    theta = np.array(theta)
    return sigmoid(X.dot(theta))

# cost: mean cross-entropy over the m samples
def cost(m, theta, X, y):
    ele = y * np.log(model(theta, X)) + (1 - y) * np.log(1 - model(theta, X))
    item_sum = np.sum(ele)
    return -item_sum / m

# gradient: partial derivative of the cost with respect to each theta[j]
def gradient(m, theta, X, y, cols):
    grad_theta = []
    for j in range(cols):
        grad = (model(theta, X) - y).dot(X[:, j])
        grad_sum = np.sum(grad)
        grad_theta.append(grad_sum / m)
    return np.array(grad_theta)

# theta update (sigma is the learning rate)
def theta_update(grad_theta, theta, sigma):
    return theta - sigma * grad_theta

# stop strategy: stop when the cost decreased by less than the threshold
def stop_stratege(cost, cost_update, threshold):
    return cost - cost_update < threshold

# logistic regression algorithm
def LogicRegression(X, y, threshold, m, xcols):
    # time.clock() was removed in Python 3.8; perf_counter() replaces it
    start = time.perf_counter()
    # initial values of the weights
    theta = np.zeros(xcols)
    # number of iterations
    iters = 0
    # record of the cost function values
    cost_record = []
    # learning rate
    sigma = 0.01
    cost_val = cost(m, theta, X, y)
    cost_record.append(cost_val)
    while True:
        grad = gradient(m, theta, X, y, xcols)
        # parameter update
        theta = theta_update(grad, theta, sigma)
        cost_update = cost(m, theta, X, y)
        if stop_stratege(cost_val, cost_update, threshold):
            break
        iters = iters + 1
        cost_val = cost_update
        print("cost_val:%f" % cost_val)
        cost_record.append(cost_val)
    end = time.perf_counter()
    print("LogicRegression convergence duration: %f s" % (end - start))
    return cost_record, iters, theta
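
One way to gain confidence in a hand-written gradient is to compare it against a finite-difference estimate of the cost. The sketch below is self-contained and does not reuse the functions above. It uses a vectorized form of the same gradient, X^T (h - y) / m, and checks it on a small random dataset. The data, the test point `theta`, and the helper `numerical_gradient` are all made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # Mean cross-entropy, the same quantity the training loop minimizes
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    # Vectorized form of the per-column loop: X^T (h - y) / m
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def numerical_gradient(theta, X, y, eps=1e-6):
    # Central differences, one coordinate of theta at a time
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad

# Small made-up dataset: bias column plus two uniform features
rng = np.random.default_rng(0)
X = np.hstack((np.ones((50, 1)), rng.uniform(size=(50, 2))))
y = (X[:, 1] > X[:, 2]).astype(float)
theta = np.array([0.1, -0.2, 0.3])

max_diff = np.max(np.abs(gradient(theta, X, y) - numerical_gradient(theta, X, y)))
print(max_diff)  # should be tiny if the analytic gradient is right
```

If the two gradients disagree by more than a small tolerance, the analytic formula or its implementation is the first place to look.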

Step3 Analyze the results

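Call the logistic regression function: LogicRegression(data[:,[0,1,2]],data[:,3],0.00001,200,3). In this call `data` is the array with the bias column prepended, so columns 0 to 2 hold the bias and the two features, and column 3 holds the label.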

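The run shows that gradient descent reached initial convergence in the following time: LogicRegression convergence duration: 18.076398 s, after 56172 iterations (more than fifty thousand). The cost function was evaluated at every iteration, as shown in the figure below: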

At convergence, the weights are:

array([ 0.48528656,  9.48593954, -9.42256868])

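Meaning of the parameters: the first weight is the bias term. The second and third weights are of similar magnitude and differ only in the direction of their contribution.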
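
To make the meaning of these weights concrete, here is a minimal sketch that uses them to classify a point. It assumes the usual 0.5 cutoff on the Sigmoid output, which is not stated in the original. The helper names `predict_proba` and `predict` and the second test point are hypothetical.

```python
import numpy as np

# Weights reported at convergence: [bias, weight of w1, weight of w2]
theta = np.array([0.48528656, 9.48593954, -9.42256868])

def predict_proba(w1, w2):
    # Probability of class 1: Sigmoid of the linear score
    z = theta[0] + theta[1] * w1 + theta[2] * w2
    return 1.0 / (1.0 + np.exp(-z))

def predict(w1, w2, cutoff=0.5):
    # Assumed decision rule: class 1 when the probability reaches the cutoff
    return int(predict_proba(w1, w2) >= cutoff)

first_row_class = predict(0.78863156, 0.45879449)  # first row of the data head
other_class = predict(0.2, 0.8)                    # hypothetical point with w1 < w2
print(first_row_class, other_class)  # prints: 1 0
```

Because the two feature weights nearly cancel, the prediction depends almost entirely on the sign of w1 - w2, shifted slightly by the bias.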

Next, plot the decision boundary of the binary classifier:

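import matplotlib.pyplot as plt

# Positive samples (x1_pos, x2_pos) and negative samples (x1_neg, x2_neg)
plt.scatter(x1_pos, x2_pos)
plt.scatter(x1_neg, x2_neg)
# Decision boundary: theta[0] + theta[1]*w1 + theta[2]*w2 = 0, solved for w2
wp = np.linspace(0.0, 1.0, 200)
plt.plot(wp, -(theta[0] + theta[1] * wp) / theta[2], color='g')
plt.show()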

As the plot shows, the fitted line separates the two classes very well.
