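Preface
Learn to load pretrained word vectors and measure similarity with cosine similarity (there is not enough time to train embeddings here, so for now we simply load them).
Use word embeddings to solve word analogy problems such as "man is to woman as king is to ____".
Modify word embeddings to reduce their gender bias.
Code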
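The functions in this post take a dictionary called word_to_vec_map, but the loading step itself is not shown. As a rough sketch, assuming the pretrained vectors are stored in the plain-text GloVe layout (one word per line, followed by its space-separated components), a loader could look like the following. The function name load_word_vectors and the example path are hypothetical and are not taken from the original notebook.

```python
import numpy as np

def load_word_vectors(path):
    """Read 'word v1 v2 ... vn' lines into a dict word -> np.ndarray (hypothetical helper)."""
    word_to_vec_map = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            word = parts[0]
            word_to_vec_map[word] = np.array([float(x) for x in parts[1:]])
    return word_to_vec_map

# Example call; the path is an assumption about where the file lives:
# word_to_vec_map = load_word_vectors("data/glove.6B.50d.txt")
```

Storing each vector as a NumPy array keeps the later dot products and norms to a single call each.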
import numpy as np

# GRADED FUNCTION: cosine_similarity
def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v,
                             i.e. dot(u, v) / (||u||_2 * ||v||_2)
    """
    ### START CODE HERE ###
    # Compute the dot product between u and v (≈1 line)
    dot = np.dot(u, v)
    # Compute the L2 norm of u (≈1 line)
    norm_u = np.linalg.norm(u)
    # Compute the L2 norm of v (≈1 line)
    norm_v = np.linalg.norm(v)
    # Compute the cosine similarity defined by formula (1) (≈1 line)
    cosine_similarity = dot / (norm_u * norm_v)
    ### END CODE HERE ###
    return cosine_similarity
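The quantity computed above is the cosine of the angle between the two vectors, $\cos(\theta) = \frac{u \cdot v}{||u||_2 \, ||v||_2}$, so it depends only on direction and not on length. A small check with made-up vectors (the values are illustrative and are not real embeddings) shows the three reference cases:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0, 3.0])
same_direction = 2.0 * a                 # scaled copy of a
opposite = -a                            # points the other way
orthogonal = np.array([3.0, 0.0, -1.0])  # dot product with a is 3 + 0 - 3 = 0

sim_same = cosine_similarity(a, same_direction)    # close to 1
sim_opposite = cosine_similarity(a, opposite)      # close to -1
sim_orthogonal = cosine_similarity(a, orthogonal)  # close to 0
print(sim_same, sim_opposite, sim_orthogonal)
```

Because scaling a vector leaves the result unchanged, embeddings of different magnitude can still be compared directly.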
Word analogy
# GRADED FUNCTION: complete_analogy
def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """
    Performs the word analogy task as explained above: a is to b as c is to ____.

    Arguments:
        word_a -- a word, string
        word_b -- a word, string
        word_c -- a word, string
        word_to_vec_map -- dictionary that maps words to their corresponding vectors.

    Returns:
        best_word -- the word such that v_b - v_a is close to v_best_word - v_c, as measured by cosine similarity
    """
    # convert words to lower case
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    ### START CODE HERE ###
    # Get the word embeddings e_a, e_b and e_c (≈1-3 lines)
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]
    ### END CODE HERE ###
    words = word_to_vec_map.keys()
    max_cosine_sim = -100  # Initialize max_cosine_sim to a large negative number
    best_word = None       # Tracks the best candidate word found so far
    # loop over the whole word vector set
    for w in words:
        # to avoid best_word being one of the input words, skip them.
        if w in [word_a, word_b, word_c]:
            continue
        ### START CODE HERE ###
        # Compute cosine similarity between the vector (e_b - e_a) and the vector ((w's vector representation) - e_c) (≈1 line)
        cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)
        # If the cosine_sim is more than the max_cosine_sim seen so far,
        # then: set the new max_cosine_sim to the current cosine_sim and the best_word to the current word (≈3 lines)
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
        ### END CODE HERE ###
    return best_word
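To see the search in action without loading real embeddings, here is a minimal sketch on a hand-made 2-dimensional vocabulary. The vectors in toy_map are invented for illustration, with the second coordinate playing the role of a "gender" direction, and the function is a condensed restatement of the one above so that the snippet runs on its own.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """Return the word w maximizing cos(e_b - e_a, e_w - e_c), skipping the inputs."""
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]
    best_word, max_cosine_sim = None, -100
    for w, e_w in word_to_vec_map.items():
        if w in (word_a, word_b, word_c):
            continue
        cosine_sim = cosine_similarity(e_b - e_a, e_w - e_c)
        if cosine_sim > max_cosine_sim:
            max_cosine_sim, best_word = cosine_sim, w
    return best_word

# Invented toy embeddings (not real GloVe vectors)
toy_map = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king": np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "prince": np.array([3.0, -1.0]),
    "apple": np.array([0.0, 3.0]),
}
answer = complete_analogy("man", "woman", "king", toy_map)
print(answer)  # queen
```

Here woman - man equals queen - king exactly, so "queen" reaches a cosine similarity of 1 and wins. The loop visits every word in the vocabulary, so the cost grows linearly with vocabulary size.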
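Debiasing non-gender-specific words
The figure below should help you visualize what neutralization does. If you are using a 50-dimensional word embedding, the 50-dimensional space can be split into two parts: the bias direction $g$ and the remaining 49 dimensions, which we will call $g_{\perp}$. In linear algebra, we say that the 49-dimensional $g_{\perp}$ is perpendicular (or "orthogonal") to $g$, meaning it is at 90 degrees to $g$. The neutralization step takes a vector such as $e_{receptionist}$ and zeros out its component along the direction of $g$, giving $e_{receptionist}^{debiased}$.
Even though $g_{\perp}$ is 49-dimensional, given the limits of what can be drawn on a screen, we illustrate it with a 1-dimensional axis below.
**Figure 2**: The word vector for "receptionist" before and after applying the neutralize operation.
Exercise: implement neutralize() to remove the bias of words such as "receptionist" or "scientist". Given an input embedding $e$, you can compute $e^{debiased}$ with the following formulas:
$$e^{bias\_component} = \frac{e \cdot g}{||g||_2^2} * g \tag{2}$$
$$e^{debiased} = e - e^{bias\_component} \tag{3}$$
If you are an expert in linear algebra, you may recognize $e^{bias\_component}$ as the projection of $e$ onto the direction $g$. If you are not, don't worry about this.
Reminder: a vector $u$ can be split into two parts, its projection onto a vector axis $v_B$ and its projection onto the axis orthogonal to $v_B$:
$$u = u_B + u_{\perp}$$
where $u_B = \frac{u \cdot v_B}{||v_B||_2^2} * v_B$ and $u_{\perp} = u - u_B$.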
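The decomposition in the reminder can be checked numerically. The sketch below uses two small made-up vectors (u and v_B are arbitrary illustrative values, not embeddings), splits u into its component along v_B and the remainder, and confirms that the remainder is orthogonal to v_B.

```python
import numpy as np

u = np.array([2.0, 1.0, 3.0])    # arbitrary vector to decompose
v_B = np.array([1.0, 1.0, 0.0])  # arbitrary axis to project onto

# Projection of u onto v_B: (u . v_B / ||v_B||^2) * v_B
u_B = np.dot(u, v_B) / np.dot(v_B, v_B) * v_B
# Whatever is left lies in the subspace orthogonal to v_B
u_perp = u - u_B

print(u_B)                  # component along v_B
print(u_perp)               # component orthogonal to v_B
print(np.dot(u_perp, v_B))  # zero up to floating-point error
```

Neutralization is this same split with $v_B = g$: the word vector keeps only its orthogonal part.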
def neutralize(word, g, word_to_vec_map):
    """
    Removes the bias of "word" by projecting it on the space orthogonal to the bias axis.
    This function ensures that gender neutral words are zero in the gender subspace.

    Arguments:
        word -- string indicating the word to debias
        g -- numpy-array of shape (50,), corresponding to the bias axis (such as gender)
        word_to_vec_map -- dictionary mapping words to their corresponding vectors.

    Returns:
        e_debiased -- neutralized word vector representation of the input "word"
    """
    ### START CODE HERE ###
    # Select word vector representation of "word". Use word_to_vec_map. (≈ 1 line)
    e = word_to_vec_map[word]
    # Compute e_biascomponent using the formula given above. (≈ 1 line)
    e_biascomponent = (np.dot(e, g) / np.square(np.linalg.norm(g))) * g
    # Neutralize e by subtracting e_biascomponent from it
    # e_debiased should be equal to its orthogonal projection. (≈ 1 line)
    e_debiased = e - e_biascomponent
    ### END CODE HERE ###
    return e_debiased
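A quick way to confirm that neutralization does what it claims is to compare the cosine similarity with the bias axis before and after. The sketch below assumes the bias axis is built as the difference between the vectors for "woman" and "man"; that construction is not shown in this post and is only one plausible choice. The 3-dimensional vectors in toy_map are invented.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def neutralize(word, g, word_to_vec_map):
    """Remove the component of the word's vector that lies along g."""
    e = word_to_vec_map[word]
    e_biascomponent = (np.dot(e, g) / np.square(np.linalg.norm(g))) * g
    return e - e_biascomponent

# Invented toy embeddings (not real GloVe vectors)
toy_map = {
    "woman": np.array([0.5, 1.0, 0.0]),
    "man": np.array([0.5, -1.0, 0.0]),
    "receptionist": np.array([0.3, 0.4, 0.8]),
}
# Assumed construction of the bias axis
g = toy_map["woman"] - toy_map["man"]

sim_before = cosine_similarity(toy_map["receptionist"], g)
e_debiased = neutralize("receptionist", g, toy_map)
sim_after = cosine_similarity(e_debiased, g)
print(sim_before, sim_after)  # sim_after is zero up to floating-point error
```

Only the coordinate along g changes; the rest of the vector, which carries the non-gender meaning, is left untouched.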
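Debiasing gender-specific words
Next, let's see how debiasing can also be applied to word pairs such as "actress" and "actor". Equalization is applied only to pairs of words that you want to differ solely through the gender property. As a concrete example, suppose "actress" is closer to "babysit" than "actor" is. By applying neutralization to "babysit", we can reduce the gender stereotype associated with babysitting. But this still does not guarantee that "actress" and "actor" are equidistant from "babysit". The equalization algorithm takes care of this.
The key idea behind equalization is to make sure that a particular pair of words is equidistant from the 49-dimensional $g_{\perp}$. The equalization step also ensures that the two equalized words are now the same distance from $e_{receptionist}^{debiased}$, or from any other word that has been neutralized. The figure shows how equalization works:
The linear algebra derivation for this is a bit more involved (see Bolukbasi et al., 2016 for details), but the key equations are:
$$\mu = \frac{e_{w1} + e_{w2}}{2} \tag{4}$$
$$\mu_{B} = \frac{\mu \cdot \text{bias\_axis}}{||\text{bias\_axis}||_2^2} * \text{bias\_axis} \tag{5}$$
$$\mu_{\perp} = \mu - \mu_{B} \tag{6}$$
$$e_{w1B} = \frac{e_{w1} \cdot \text{bias\_axis}}{||\text{bias\_axis}||_2^2} * \text{bias\_axis} \tag{7}$$
$$e_{w2B} = \frac{e_{w2} \cdot \text{bias\_axis}}{||\text{bias\_axis}||_2^2} * \text{bias\_axis} \tag{8}$$
$$e_{w1B}^{corrected} = \sqrt{\left|1 - ||\mu_{\perp}||_2^2\right|} * \frac{e_{w1B} - \mu_B}{||(e_{w1} - \mu_{\perp}) - \mu_B||_2} \tag{9}$$
$$e_{w2B}^{corrected} = \sqrt{\left|1 - ||\mu_{\perp}||_2^2\right|} * \frac{e_{w2B} - \mu_B}{||(e_{w2} - \mu_{\perp}) - \mu_B||_2} \tag{10}$$
$$e_1 = e_{w1B}^{corrected} + \mu_{\perp} \tag{11}$$
$$e_2 = e_{w2B}^{corrected} + \mu_{\perp} \tag{12}$$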
def equalize(pair, bias_axis, word_to_vec_map):
    """
    Debias gender specific words by following the equalize method described in the figure above.

    Arguments:
        pair -- pair of strings of gender specific words to debias, e.g. ("actress", "actor")
        bias_axis -- numpy-array of shape (50,), vector corresponding to the bias axis, e.g. gender
        word_to_vec_map -- dictionary mapping words to their corresponding vectors

    Returns:
        e1 -- equalized word vector corresponding to the first word
        e2 -- equalized word vector corresponding to the second word
    """
    ### START CODE HERE ###
    # Step 1: Select the word vector representation of each word in the pair. Use word_to_vec_map. (≈ 2 lines)
    w1, w2 = pair
    e_w1, e_w2 = word_to_vec_map[w1], word_to_vec_map[w2]
    # Step 2: Compute the mean of e_w1 and e_w2 (≈ 1 line)
    mu = (e_w1 + e_w2) / 2
    # Step 3: Compute the projections of mu over the bias axis and the orthogonal axis (≈ 2 lines)
    mu_B = np.dot(mu, bias_axis) / np.sum(bias_axis * bias_axis) * bias_axis
    mu_orth = mu - mu_B
    # Step 4: Use equations (7) and (8) to compute e_w1B and e_w2B (≈2 lines)
    e_w1B = np.dot(e_w1, bias_axis) / np.sum(bias_axis * bias_axis) * bias_axis
    e_w2B = np.dot(e_w2, bias_axis) / np.sum(bias_axis * bias_axis) * bias_axis
    # Step 5: Adjust the bias part of e_w1B and e_w2B using formulas (9) and (10) given above (≈2 lines)
    corrected_e_w1B = np.sqrt(np.abs(1 - np.sum(mu_orth * mu_orth))) * (e_w1B - mu_B) / np.linalg.norm(e_w1 - mu_orth - mu_B)
    corrected_e_w2B = np.sqrt(np.abs(1 - np.sum(mu_orth * mu_orth))) * (e_w2B - mu_B) / np.linalg.norm(e_w2 - mu_orth - mu_B)
    # Step 6: Debias by equalizing e1 and e2 to the sum of their corrected projections (≈2 lines)
    e1 = corrected_e_w1B + mu_orth
    e2 = corrected_e_w2B + mu_orth
    ### END CODE HERE ###
    return e1, e2
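The effect of equalization is easiest to see on a small example. The sketch below restates the function in condensed form so that it runs on its own, then applies it to invented 3-dimensional vectors with a hypothetical bias axis along the second coordinate. After equalization the two words share the same component orthogonal to the axis, carry equal and opposite components along it, and sit at the same distance from a vector with no bias component.

```python
import numpy as np

def project(e, axis):
    """Projection of e onto axis."""
    return np.dot(e, axis) / np.sum(axis * axis) * axis

def equalize(pair, bias_axis, word_to_vec_map):
    """Condensed restatement of the equalize steps above."""
    e_w1, e_w2 = word_to_vec_map[pair[0]], word_to_vec_map[pair[1]]
    mu = (e_w1 + e_w2) / 2
    mu_B = project(mu, bias_axis)
    mu_orth = mu - mu_B
    e_w1B, e_w2B = project(e_w1, bias_axis), project(e_w2, bias_axis)
    scale = np.sqrt(np.abs(1 - np.sum(mu_orth * mu_orth)))
    corrected_e_w1B = scale * (e_w1B - mu_B) / np.linalg.norm(e_w1 - mu_orth - mu_B)
    corrected_e_w2B = scale * (e_w2B - mu_B) / np.linalg.norm(e_w2 - mu_orth - mu_B)
    return corrected_e_w1B + mu_orth, corrected_e_w2B + mu_orth

# Invented toy embeddings and a hypothetical bias axis
toy_map = {
    "actress": np.array([0.5, 0.7, 0.1]),
    "actor": np.array([0.6, -0.3, 0.2]),
}
bias_axis = np.array([0.0, 1.0, 0.0])
neutral_word = np.array([0.3, 0.0, 0.8])  # stands in for an already neutralized vector

e1, e2 = equalize(("actress", "actor"), bias_axis, toy_map)
dist1 = np.linalg.norm(e1 - neutral_word)
dist2 = np.linalg.norm(e2 - neutral_word)
print(np.dot(e1, bias_axis), np.dot(e2, bias_axis))  # equal magnitude, opposite sign
print(dist1, dist2)                                  # equal distances
```

The scaling factor in the correction assumes the embeddings have roughly unit norm, which is why the absolute value guards the square root when that assumption does not hold exactly.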