Gradient descent is commonly used in three variants:
- Batch gradient descent (BGD)
- Stochastic gradient descent (SGD)
- Mini-batch gradient descent (MBGD)
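The three variants share one update rule and differ only in how many samples feed each gradient estimate. As a sketch, write $\ell_i(\theta)$ for the loss on sample $i$, $m$ for the dataset size, $\eta$ for the learning rate, and $B$ for the set of sample indices used in one step (this notation is introduced here for illustration and is not from the original text):

```latex
\theta \leftarrow \theta - \eta \, \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \ell_i(\theta),
\qquad
B =
\begin{cases}
\{1, \dots, m\} & \text{batch gradient descent} \\
\{i\},\ i \text{ drawn at random} & \text{stochastic gradient descent} \\
\text{a subset of size } b,\ 1 < b < m & \text{mini-batch gradient descent}
\end{cases}
```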
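A simple worked example
Data
import numpy as np

x = np.random.uniform(-3, 3, 100)  # 100 inputs drawn uniformly from [-3, 3)
X = x.reshape(-1, 1)  # column vector of shape (100, 1)
y = x * 2 + 5 + np.random.normal(0, 1, 100)  # true slope 2, true intercept 5, plus Gaussian noise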
BGD
A simple implementation of batch gradient descent:
def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    def J(theta):
        # Mean squared error over the full dataset
        return np.mean((X_b.dot(theta) - y) ** 2)

    def dj(theta):
        # Gradient of J with respect to theta, using every sample
        return X_b.T.dot(X_b.dot(theta) - y) * (2 / len(y))

    theta = initial_theta
    for i in range(int(n_iters)):
        gradient = dj(theta)  # compute the gradient
        last_theta = theta
        theta = theta - eta * gradient  # take one gradient step
        if np.absolute(J(theta) - J(last_theta)) < epsilon:
            break  # stop once the loss has stopped changing
    return theta
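The functions J and dj above correspond to the mean squared error and its gradient. A short derivation, assuming $m$ denotes the number of samples (len(y) in the code) and $X_b$ is the design matrix with a leading column of ones:

```latex
J(\theta) = \frac{1}{m} \lVert X_b \theta - y \rVert^2
          = \frac{1}{m} \sum_{i=1}^{m} \bigl( X_b^{(i)} \theta - y^{(i)} \bigr)^2,
\qquad
\nabla_\theta J(\theta) = \frac{2}{m} X_b^{\top} ( X_b \theta - y ),
\qquad
\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)
```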
Result:
X_b = np.hstack([np.ones((len(y), 1)), X])
initial_theta = np.ones(X_b.shape[1])
eta = 0.1
%time gradient_descent(X_b, y, initial_theta, eta)
## the result should be close to [5, 2], the intercept and slope used to generate the data
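One way to sanity-check a batch gradient descent result is to compare it with the closed-form least-squares fit. The sketch below is an illustration under assumptions: it generates its own data with a fixed seed (so its numbers will differ from the run above), runs a fixed number of full-batch steps without the epsilon stopping rule, and uses np.linalg.lstsq as the reference. The names theta_ls and theta_gd are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility only
x = rng.uniform(-3, 3, 100)
y = 2 * x + 5 + rng.normal(0, 1, 100)
X_b = np.column_stack([np.ones(len(y)), x])  # leading column of ones for the intercept

# Reference: closed-form least-squares solution
theta_ls, *_ = np.linalg.lstsq(X_b, y, rcond=None)

# Plain full-batch gradient descent for a fixed number of steps
theta_gd = np.ones(X_b.shape[1])
eta = 0.1
for _ in range(1000):
    gradient = X_b.T.dot(X_b.dot(theta_gd) - y) * (2 / len(y))
    theta_gd = theta_gd - eta * gradient

print(theta_ls, theta_gd)
```

If the two vectors agree to several decimal places, the descent loop has converged to the least-squares solution; both will sit near, but not exactly at, [5, 2] because of the noise in y.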
SGD
Here n_iters is the number of passes (epochs) over the full dataset.
def s_gradient_descent(X_b, y, initial_theta, eta, batch_size=10, n_iters=10, epsilon=1e-8):
    def J(theta):
        # Mean squared error over the full dataset, used only for the stopping test
        return np.mean((X_b.dot(theta) - y) ** 2)

    # Stochastic version: gradient of the squared error for a single sample
    def dj_sgd(X_b_i, y_i, theta):
        return 2 * X_b_i * (X_b_i.dot(theta) - y_i)

    theta = initial_theta
    for i in range(int(n_iters)):  # one pass per iteration
        # Visits every batch_size-th sample (indices batch_size, 2 * batch_size, ...)
        for j in range(batch_size, len(y), batch_size):
            gradient = dj_sgd(X_b[j, :], y[j], theta)
            last_theta = theta
            theta = theta - eta * gradient  # take one gradient step
            if np.absolute(J(theta) - J(last_theta)) < epsilon:
                break  # loss has stopped changing: end this pass early
    return theta
Result:
X_b = np.hstack([np.ones((len(y), 1)), X])
initial_theta = np.ones(X_b.shape[1])
eta = 0.1
%time s_gradient_descent(X_b, y, initial_theta, eta, n_iters=1)
## array([4.72619109, 3.08239321])
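The function above steps through the samples at fixed indices rather than drawing them at random. A more conventional form of SGD shuffles the data each epoch and updates on every sample once. The following is a minimal sketch under assumptions, not the code that produced the result above: sgd_shuffled is a hypothetical helper, the data is regenerated with a fixed seed, and the learning rate is lowered to 0.01 because single-sample steps are noisy.

```python
import numpy as np

def sgd_shuffled(X_b, y, initial_theta, eta, n_epochs, rng):
    """Hypothetical helper: visit every sample once per epoch, in random order."""
    theta = initial_theta.copy()
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):  # fresh random order each epoch
            residual = X_b[i].dot(theta) - y[i]
            theta = theta - eta * 2 * X_b[i] * residual  # single-sample gradient step
    return theta

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility only
x = rng.uniform(-3, 3, 100)
y = 2 * x + 5 + rng.normal(0, 1, 100)
X_b = np.column_stack([np.ones(len(y)), x])

theta_sgd = sgd_shuffled(X_b, y, np.ones(2), eta=0.01, n_epochs=50, rng=rng)
print(theta_sgd)
```

With a constant learning rate the estimate keeps jittering around the least-squares solution instead of settling on it, which is why a smaller eta (or a decaying schedule) is a common choice for this variant.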
MBGD
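This version builds on the stochastic one with a small change to dj: batch_size sets the size of each batch, and every call to dj computes the gradient over batch_size samples and averages it.
In this run, with the same single pass over the data, mini-batch gradient descent is noticeably more accurate than stochastic gradient descent on the slope: 2.27 versus 3.08, against a true slope of 2. The intercept estimates are comparable (4.46 versus 4.73, against a true intercept of 5).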
def b_gradient_descent(X_b, y, initial_theta, eta, batch_size=10, n_iters=10, epsilon=1e-8):
    def J(theta):
        # Mean squared error over the full dataset, used only for the stopping test
        return np.mean((X_b.dot(theta) - y) ** 2)

    # Mini-batch version: gradient averaged over the samples in one batch
    def dj_bgd(X_b_b, y_b, theta):
        return X_b_b.T.dot(X_b_b.dot(theta) - y_b) * (2 / len(y_b))

    theta = initial_theta
    for i in range(int(n_iters)):  # one pass per iteration
        # j is the exclusive end index of each batch; because range stops
        # before len(y), the trailing samples are not visited
        for j in range(batch_size, len(y), batch_size):
            gradient = dj_bgd(X_b[j - batch_size:j, :], y[j - batch_size:j], theta)
            last_theta = theta
            theta = theta - eta * gradient  # take one gradient step
            if np.absolute(J(theta) - J(last_theta)) < epsilon:
                break  # loss has stopped changing: end this pass early
    return theta
Result:
X_b = np.hstack([np.ones((len(y), 1)), X])
initial_theta = np.ones(X_b.shape[1])
eta = 0.1
%time b_gradient_descent(X_b, y, initial_theta, eta, n_iters=1)
## array([4.4649369 , 2.27164876])
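The loop in b_gradient_descent walks the data in its stored order and never reaches the last batch. A common alternative shuffles the indices each epoch and slices them so that every sample, including a shorter final batch, is used. This is a minimal sketch under assumptions, not the code behind the result above: minibatch_gd_shuffled is a hypothetical helper, the data is regenerated with a fixed seed, and eta=0.05 and n_epochs=30 are illustrative choices.

```python
import numpy as np

def minibatch_gd_shuffled(X_b, y, initial_theta, eta, batch_size, n_epochs, rng):
    """Hypothetical helper: shuffled mini-batches that cover every sample each epoch."""
    theta = initial_theta.copy()
    m = len(y)
    for _ in range(n_epochs):
        order = rng.permutation(m)  # fresh random order each epoch
        for start in range(0, m, batch_size):  # includes a final, possibly shorter, batch
            idx = order[start:start + batch_size]
            X_batch, y_batch = X_b[idx], y[idx]
            # Gradient of the mean squared error, averaged over this batch only
            gradient = X_batch.T.dot(X_batch.dot(theta) - y_batch) * (2 / len(idx))
            theta = theta - eta * gradient
    return theta

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility only
x = rng.uniform(-3, 3, 100)
y = 2 * x + 5 + rng.normal(0, 1, 100)
X_b = np.column_stack([np.ones(len(y)), x])

theta_mb = minibatch_gd_shuffled(X_b, y, np.ones(2), eta=0.05,
                                 batch_size=10, n_epochs=30, rng=rng)
print(theta_mb)
```

Averaging over ten samples reduces the variance of each gradient estimate compared with a single sample, which lets this variant tolerate a larger learning rate than per-sample SGD.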