Geometry and Linear Algebraic Operations

We have already encountered the basics of linear algebra and seen how it can be used to express common operations for transforming our data. Linear algebra is one of the key mathematical pillars underlying deep learning and machine learning more broadly. While the basics contain enough machinery to communicate the mechanics of modern deep learning models, there is a lot more to the subject. In this section we go deeper, highlighting some geometric interpretations of linear algebra operations and introducing a few fundamental concepts, including eigenvalues and eigenvectors.

  1. Geometry of Vectors

To start, let us discuss the two common geometric interpretations of vectors, as either points or directions in space. Fundamentally, a vector is a list of numbers, such as the Python list below.

v = [1, 7, 0, 1]

Mathematicians most often write this as either a column vector or a row vector, which is to say either as

v = [1, 7, 0, 1]⊤ (a column vector) or v⊤ = [1, 7, 0, 1] (a row vector).
These often have different interpretations, where data examples are column vectors and weights used to form weighted sums are row vectors. However, it can be beneficial to stay flexible. Matrices are useful data structures: they allow us to organize data that have different modalities of variation. For example, rows in a matrix might correspond to different houses (data examples), while columns might correspond to different attributes. This should sound familiar if you have ever used spreadsheet software. Thus, although the default orientation of a single vector is a column vector, in a matrix that represents a tabular dataset it is more conventional to treat each data example as a row vector. And, as we will see in later chapters, this convention enables common deep learning practice: along the outermost axis of a tensor, we can access or enumerate minibatches of data examples, or just data examples if no minibatch exists.
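As a minimal standalone sketch of this convention (plain NumPy with made-up numbers, separate from anything used later in this section), a tabular dataset can be stored as a matrix whose rows are data examples, and a minibatch is simply a slice along the outermost axis:

import numpy as np

# Hypothetical tabular data: each row is one house (a data example),
# each column is an attribute (say area, bedrooms, age).
X = np.array([[120.0, 3.0, 10.0],
              [ 95.0, 2.0, 25.0],
              [150.0, 4.0,  5.0],
              [ 80.0, 2.0, 40.0]])

minibatch = X[0:2]    # slicing the outermost axis gives a minibatch of examples
first_example = X[0]  # a single data example, stored as a row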

Given a vector, the first interpretation that we should give it is as a point in space. In two or three dimensions, we can visualize these points by using the components of the vector to define their location in space relative to a fixed reference called the origin, as shown in Fig. 1.
Fig. 1. An illustration of visualizing vectors as points in the
plane. The first component of the vector gives the x-coordinate, the second component gives the y-coordinate. Higher dimensions are analogous,
although much harder to visualize.

This geometric point of view allows us to consider the problem on a more abstract level. No longer faced with some seemingly insurmountable problem like classifying pictures as cats or dogs, we can start to consider tasks abstractly as collections of points in space, and picture the task as discovering how to separate two distinct clusters of points.

In parallel, there is a second point of view that people often take of vectors: as directions in space. Not only can we think of the vector v = [2, 3]⊤ as the location 2 units to the right and 3 units up from the origin, we can also think of it as the direction itself: take 2 steps to the right and 3 steps up. In this way, we consider all the vectors in Fig. 2 the same.

Fig. 2. Any vector can be visualized as an arrow in the plane. In this case, every vector drawn is a representation of the vector (2,3).

One of the benefits of this shift in perspective is that we can make visual sense of the act of vector addition. In particular, we follow the directions given by one vector, and then follow the directions given by the other, as shown in Fig. 3.
Fig. 3. We can visualize vector addition by first following one
vector, and then another.

Vector subtraction has a similar interpretation. By considering the identity u = v + (u − v), we see that the vector u − v is the direction that takes us from the point v to the point u.
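The following standalone sketch (plain NumPy, vectors chosen arbitrarily for illustration) checks both facts numerically: adding vectors composes their directions, and v + (u − v) lands back at u:

import numpy as np

u = np.array([1.0, 4.0])
v = np.array([2.0, 3.0])

u + v              # follow v, then u: array([3., 7.])
direction = u - v  # the direction that takes us from the point v to the point u
v + direction      # arrives back at u: array([1., 4.])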

  2. Dot Products and Angles

If we take two column vectors u and v, we can form their dot product by computing

u⊤v = ∑ᵢ uᵢvᵢ.

Because this operation is symmetric, we will mirror the notation of classical multiplication and write

u⋅v = u⊤v = v⊤u,

to emphasize the fact that exchanging the order of the vectors yields the same answer.
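A quick standalone check of this symmetry (plain NumPy, arbitrary vectors):

import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

u.dot(v) == v.dot(u)  # True: exchanging the order gives the same value, 32.0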

Dot products also admit a geometric interpretation: they are closely related to the angle between two vectors. Consider the angle shown in Fig. 4.
Fig. 4. Between any two vectors in the plane there is a well-defined angle θ. We will see this angle is intimately tied to the dot product.

To start, let us consider two specific vectors:

v = (r, 0) and w = (s cos(θ), s sin(θ)).

The vector v is of length r and runs parallel to the x-axis, while the vector w is of length s and sits at angle θ to the x-axis. If we compute the dot product of these two vectors, we see that

v⋅w=rscos(θ)=∥v∥∥w∥cos(θ).

With some simple algebraic manipulation, we can rearrange terms to obtain

θ = arccos( v⋅w / (∥v∥∥w∥) ).

In short, for these two specific vectors, the combination of the dot product and the norms tells us the angle between them. The same fact is true in general. We will not derive the expression here, but if we consider writing ∥v − w∥² in two ways, one with the dot product and the other geometrically via the law of cosines, we can obtain the full relationship. Indeed, for any two vectors v and w, the angle between them is

θ = arccos( v⋅w / (∥v∥∥w∥) ).

This is a nice result since nothing in the computation references two dimensions. Indeed, we can use it in three or three million dimensions without issue.

As a simple example, let us see how to compute the angle between a pair of vectors:

%matplotlib inline
from d2l import mxnet as d2l
from IPython import display
from mxnet import gluon, np, npx
npx.set_np()

def angle(v, w):
    return np.arccos(v.dot(w) / (np.linalg.norm(v) * np.linalg.norm(w)))

angle(np.array([0, 1, 2]), np.array([2, 3, 4]))

array(0.41899002)

We will not use it right now, but it is useful to know that we will refer to vectors for which the angle is π/2 (or equivalently 90°) as being orthogonal. By examining the equation above, we see that this happens when θ = π/2, which is the same thing as cos(θ) = 0. The only way this can happen is if the dot product itself is zero, and two vectors are orthogonal if and only if v⋅w = 0. This will prove to be a helpful formula when understanding objects geometrically.
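For instance, a small standalone sketch (plain NumPy, with vectors picked by hand) confirming that a zero dot product corresponds to a right angle:

import numpy as np

v = np.array([1.0, 2.0])
w = np.array([-2.0, 1.0])

v.dot(w)  # 0.0, so v and w are orthogonal
np.arccos(v.dot(w) / (np.linalg.norm(v) * np.linalg.norm(w)))  # pi / 2 ~ 1.5708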

It is reasonable to ask: why is computing the angle useful? The answer
comes in the kind of invariance we expect data to have. Consider an image, and a duplicate image, where every pixel value is the same but 10% the brightness. The values of the individual pixels are in general far from the original values. Thus, if one computed the distance between the original image and the darker one, the distance can be large.

However, for most ML applications, the content is the same: it is still an image of a cat as far as a cat/dog classifier is concerned. However, if we consider the angle, it is not hard to see that for any vector v, the angle between v and 0.1⋅v is zero. This corresponds to the fact that scaling vectors keeps the same direction and just changes the length. The angle considers the darker image identical.
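A standalone sketch of this contrast (plain NumPy, with a randomly generated stand-in for an image): the Euclidean distance between v and 0.1⋅v is large, but the angle between them is numerically zero:

import numpy as np

v = np.random.rand(28 * 28) * 255.0  # a stand-in for a flattened image
darker = 0.1 * v                     # the same image at 10% brightness

np.linalg.norm(v - darker)  # the distance between the two is large
cos = v.dot(darker) / (np.linalg.norm(v) * np.linalg.norm(darker))
np.arccos(np.clip(cos, -1.0, 1.0))  # but the angle between them is ~0.0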

Examples like this are everywhere. In text, we might want the topic being discussed to not change if we write a document twice as long that says the same thing. For some encodings (such as counting the number of occurrences of words in some vocabulary), this corresponds to a doubling of the vector encoding the document, so again we can use the angle.

2.1. Cosine Similarity

In ML contexts where the angle is employed to measure the closeness of two vectors, practitioners adopt the term cosine similarity to refer to the portion
cos(θ) = v⋅w / (∥v∥∥w∥).
The cosine takes a maximum value of 1 when the two vectors point in the same direction, a minimum value of −1 when they point in opposite directions, and a value of 0 when the two vectors are orthogonal. Note that if the components of high-dimensional vectors are sampled randomly with mean 0, their cosine will nearly always be close to 0.
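A minimal standalone cosine-similarity helper (plain NumPy), together with an informal check of two of these claims: doubling a hypothetical word-count vector keeps the cosine at 1, while two random high-dimensional vectors have cosine near 0:

import numpy as np

def cosine_similarity(u, v):
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

doc = np.array([3.0, 0.0, 1.0, 2.0])  # a hypothetical word-count vector
cosine_similarity(doc, 2.0 * doc)     # 1.0: doubling keeps the same direction

u = np.random.randn(10000)  # components sampled randomly with mean 0
v = np.random.randn(10000)
cosine_similarity(u, v)     # typically close to 0 in high dimensions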

  3. Hyperplanes

In addition to working with vectors, another key object that you must understand to go far in linear algebra is the hyperplane, a generalization to higher dimensions of a line (two dimensions) or of a plane (three dimensions). In a d-dimensional vector space, a hyperplane has d−1 dimensions and divides the space into two half-spaces.
For example, suppose that we have a column vector w = [2, 1]⊤ and we ask: what are the points v with w⋅v = 1? Recalling the connection between dot products and angles above, this is equivalent to

∥v∥∥w∥cos(θ) = 1, or equivalently ∥v∥cos(θ) = 1/∥w∥ = 1/√5.
Fig. 5. Recalling trigonometry, we see the formula ∥v∥cos(θ) is the length of the projection of the vector v onto the direction of w.

If we consider the geometric meaning of this expression, we see that this is equivalent to saying that the length of the projection of v onto the direction of w is exactly 1/∥w∥, as shown in Fig. 5. The set of all points where this is true is a line at right angles to the vector w. If we wanted, we could find the equation for this line and see that it is 2x + y = 1, or equivalently y = 1 − 2x.

If we now look at what happens when we ask about the set of points with w⋅v > 1 or w⋅v < 1, we can see that these are cases where the projections are longer or shorter than 1/∥w∥, respectively. Thus, those two inequalities define either side of the line. In this way, we have found a way to cut our space into two halves, where all the points on one side have dot product below a threshold and the other side above, as we see in Fig. 6.
Fig. 6. If we now consider the inequality version of the expression, we see that our hyperplane (in this case: just a line) separates the space into two halves.
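A small standalone sketch of this (plain NumPy, points picked by hand): for w = [2, 1]⊤, points below the line 2x + y = 1 have dot product less than 1 and points above it have dot product greater than 1:

import numpy as np

w = np.array([2.0, 1.0])
points = np.array([[0.0, 0.0],    # below the line 2x + y = 1
                   [1.0, 1.0],    # above the line
                   [0.5, 0.0]])   # exactly on the line

points.dot(w)        # array([0., 3., 1.])
points.dot(w) > 1.0  # [False, True, False]: which half-space each point lies in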

The story in higher dimensions is much the same. If we now take w = [1, 2, 3]⊤ and ask about the points in three dimensions with w⋅v = 1, we obtain a plane at right angles to the given vector w. The two inequalities again define the two sides of the plane, as shown in Fig. 7.
Fig. 7. Hyperplanes in any dimension separate the space into two halves.

While our ability to visualize runs out at this point, nothing stops us from doing this in tens, hundreds, or billions of dimensions. This occurs often when thinking about machine learned models. For instance, we can understand linear classification models as methods to find hyperplanes that separate the different target classes. In this context, such hyperplanes are often referred to as decision planes. The majority of deep learned classification models end with a linear layer fed into a softmax, so one can interpret the role of the deep neural network as finding a non-linear embedding such that the target classes can be separated cleanly by hyperplanes.
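To make the last sentence concrete, here is a hedged standalone sketch (plain NumPy, with a made-up embedding and randomly initialized weights; not the model built below) of a final linear layer followed by a softmax. The hyperplanes that separate the classes live in the space of the embedding h:

import numpy as np

h = np.array([0.5, -1.2, 2.0])  # a made-up non-linear embedding of one input
W = np.random.randn(3, 2)       # final linear layer: 3 embedding dims -> 2 classes
b = np.zeros(2)

scores = h.dot(W) + b                          # one score per class
probs = np.exp(scores) / np.exp(scores).sum()  # softmax turns scores into probabilities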

To give a hand-built example, notice that we can produce a reasonable model to classify tiny images of t-shirts and trousers from the Fashion MNIST dataset by just taking the vector between their means to define the decision plane and eyeballing a crude threshold. First we will load the data and compute the averages.

# Load in the dataset
train = gluon.data.vision.FashionMNIST(train=True)
test = gluon.data.vision.FashionMNIST(train=False)

X_train_0 = np.stack([x[0] for x in train if x[1] == 0]).astype(float)
X_train_1 = np.stack([x[0] for x in train if x[1] == 1]).astype(float)
X_test = np.stack(
    [x[0] for x in test if x[1] == 0 or x[1] == 1]).astype(float)
y_test = np.stack(
    [x[1] for x in test if x[1] == 0 or x[1] == 1]).astype(float)

# Compute averages
ave_0 = np.mean(X_train_0, axis=0)
ave_1 = np.mean(X_train_1, axis=0)

It can be informative to examine these averages in
detail, so let us plot what they look like. In this case, we see that the
average indeed resembles a blurry image of a t-shirt.

# Plot average t-shirt
d2l.set_figsize()
d2l.plt.imshow(ave_0.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()
In the second case, we again see that the average resembles a blurry image of trousers.

# Plot average trousers
d2l.plt.imshow(ave_1.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()
In a fully machine learned solution, we would learn the threshold from the dataset. In this case, I simply eyeballed a threshold that looked good on the training data by hand.

# Print test set accuracy with eyeballed threshold
w = (ave_1 - ave_0).T
predictions = X_test.reshape(2000, -1).dot(w.flatten()) > -1500000

# Accuracy
np.mean(predictions.astype(y_test.dtype) == y_test, dtype=np.float64)

array(0.801, dtype=float64)
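As noted above, a fully machine learned solution would choose the threshold from the data rather than by eye. One simple possibility (an illustrative assumption, not the approach taken in the text) is to project the training examples of each class onto w and split at the midpoint of the two class means:

# Project each training class onto w and place the threshold at the midpoint
# of the two class means (an illustrative choice, not the author's).
proj_0 = X_train_0.reshape(X_train_0.shape[0], -1).dot(w.flatten())
proj_1 = X_train_1.reshape(X_train_1.shape[0], -1).dot(w.flatten())
threshold = (proj_0.mean() + proj_1.mean()) / 2

predictions = X_test.reshape(2000, -1).dot(w.flatten()) > threshold
np.mean(predictions.astype(y_test.dtype) == y_test, dtype=np.float64)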
