Triplet SemiHard Loss: A Detailed Code Walkthrough

Introduction

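Triplet loss has been giving me a real headache lately.
Of course, I have also read a lot of good write-ups along the way,
such as

Triplet-Loss原理及其實現 (Triplet Loss: Principles and Implementation)
Tensorflow實現Triplet Loss (Implementing Triplet Loss in TensorFlow)
Both are very good explanations and easy to follow.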

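The most important part of triplet loss is arguably the semi-hard variant. The original paper also trained with semi-hard negatives, so I put the emphasis on this part in my own project too.
Triplet loss should, in principle, take three inputs: anchor, positive and negative. So why did I pick this implementation?
Because it computes the semi-hard loss directly from the labels and the embeddings, which makes it very easy to use. If you select the A, P and N samples yourself, you need a preprocessing step that picks three samples for every triplet.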

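So let's look at the TensorFlow code in the original framework:
tf.contrib.losses.metric_learning.triplet_semihard_loss
It has been supported since tf 0.8, and the code can be copied out and used on its own, since array_ops and the other basic packages are common across TensorFlow.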

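I took this code and used it as is, but we should still think about how it works under the hood. (Honestly, I just find tensorflow annoying and wanted to port it to torch, hahaha.)

So I wrote 2 implementations. Counting the one tensorflow provides, that makes 3 versions in total:

1 tensorflow version

2 numpy version

3 torch version
We start with the first one.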

TensorFlow version: core code

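from tensorflow.python.framework import dtypes
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import math_ops

# pairwise_distance, masked_minimum and masked_maximum are helpers defined
# in the same TensorFlow module as this function.


def triplet_semihard_loss(labels, embeddings, margin=1.0):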
    """Computes the triplet loss with semi-hard negative mining.
    The loss encourages the positive distances (between a pair of embeddings with
    the same labels) to be smaller than the minimum negative distance among
    which are at least greater than the positive distance plus the margin constant
    (called semi-hard negative) in the mini-batch. If no such negative exists,
    uses the largest negative distance instead.
    See: https://arxiv.org/abs/1503.03832.
    Args:
      labels: 1-D tf.int32 `Tensor` with shape [batch_size] of
        multiclass integer labels.
      embeddings: 2-D float `Tensor` of embedding vectors. Embeddings should
        be l2 normalized.
      margin: Float, margin term in the loss definition.
    Returns:
      triplet_loss: tf.float32 scalar.
    """
    # Reshape [batch_size] label tensor to a [batch_size, 1] label tensor.
    lshape = array_ops.shape(labels)
    assert len(lshape.shape) == 1
    labels = array_ops.reshape(labels, [lshape[0], 1])
    # Build pairwise squared distance matrix.
    pdist_matrix = pairwise_distance(embeddings, squared=True)
    # Build pairwise binary adjacency matrix.
    adjacency = math_ops.equal(labels, array_ops.transpose(labels))
    # Invert so we can select negatives only.
    adjacency_not = math_ops.logical_not(adjacency)

    batch_size = array_ops.size(labels)

    # Compute the mask.
    pdist_matrix_tile = array_ops.tile(pdist_matrix, [batch_size, 1])
    mask = math_ops.logical_and(
        array_ops.tile(adjacency_not, [batch_size, 1]),
        math_ops.greater(
            pdist_matrix_tile, array_ops.reshape(
                array_ops.transpose(pdist_matrix), [-1, 1])))

    mask_final = array_ops.reshape(
        math_ops.greater(
            math_ops.reduce_sum(
                math_ops.cast(mask, dtype=dtypes.float32), 1, keepdims=True),
            0.0), [batch_size, batch_size])
    mask_final = array_ops.transpose(mask_final)

    adjacency_not = math_ops.cast(adjacency_not, dtype=dtypes.float32)
    mask = math_ops.cast(mask, dtype=dtypes.float32)

    # negatives_outside: smallest D_an where D_an > D_ap.
    negatives_outside = array_ops.reshape(
        masked_minimum(pdist_matrix_tile, mask), [batch_size, batch_size])
    negatives_outside = array_ops.transpose(negatives_outside)

    # negatives_inside: largest D_an.
    negatives_inside = array_ops.tile(
        masked_maximum(pdist_matrix, adjacency_not), [1, batch_size])
    semi_hard_negatives = array_ops.where(
        mask_final, negatives_outside, negatives_inside)
    loss_mat = math_ops.add(margin, pdist_matrix - semi_hard_negatives)

    mask_positives = math_ops.cast(
        adjacency, dtype=dtypes.float32) - array_ops.diag(
        array_ops.ones([batch_size]))

    # In lifted-struct, the authors multiply 0.5 for upper triangular
    #   in semihard, they take all positive pairs except the diagonal.
    num_positives = math_ops.reduce_sum(mask_positives)

    triplet_loss = math_ops.truediv(
        math_ops.reduce_sum(
            math_ops.maximum(
                math_ops.multiply(loss_mat, mask_positives), 0.0)),
        num_positives,
        name='triplet_semihard_loss')

    return triplet_loss

pairwise_distance is simple enough to skip here. If it is unclear, see the articles linked above.
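For readers who would rather not follow the links, here is a minimal numpy sketch of the idea behind pairwise_distance with squared=True. The function name pairwise_squared_distance and the toy points are my own, chosen for illustration. It is not the TensorFlow helper itself, which also handles the non-squared case.

```python
import numpy as np


def pairwise_squared_distance(x):
    """Squared Euclidean distance between every pair of rows of x ([B, D])."""
    sq_norms = np.sum(np.square(x), axis=1, keepdims=True)  # shape [B, 1]
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b, broadcast to [B, B]
    dist = sq_norms + sq_norms.T - 2.0 * (x @ x.T)
    # Rounding can leave tiny negative values; clamp them to 0.
    return np.maximum(dist, 0.0)


# Hypothetical toy points, easy to check by hand.
points = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
dist = pairwise_squared_distance(points)
print(dist)  # dist[0, 1] is 3^2 + 4^2 = 25
```

The result is a symmetric [B, B] matrix with zeros on the diagonal, which is exactly the shape the rest of the loss works on.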

Numpy version

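Debugging tensorflow is too painful (if you know, you know),
so I wrote a numpy version of triplet semihard loss based on the tf implementation, and I use it here to walk through how the loss actually works. Feel free to use it yourself.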

import numpy as np
# Test example
labels = np.array([0,1,1,0,1])
embeddings = np.array([[0.20251631, 0.49964871, 0.31357543, 0.99332346, 0.40536699,
        0.05654062, 0.07307319, 0.2950833 , 0.5154805 , 0.43801481],
       [0.05170506, 0.92920793, 0.50820659, 0.80957615, 0.59039356,
        0.83899964, 0.3024558 , 0.29522561, 0.90828209, 0.7059259 ],
       [0.06045745, 0.73130719, 0.81192888, 0.37673241, 0.41282683,
        0.00261911, 0.54569239, 0.52696678, 0.94666249, 0.4798159 ],
       [0.9031102 , 0.09828223, 0.67050717, 0.77313736, 0.47979198,
        0.93205683, 0.30714715, 0.66625816, 0.11693463, 0.75662641],
       [0.13010331, 0.70302084, 0.29719897, 0.4037086 , 0.60219295,
        0.18917132, 0.0928293 , 0.70829784, 0.6350869 , 0.74187586]], dtype=np.float32)
margin = 1.0


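def pairwise_distance(feature):
    """Pairwise squared Euclidean distances between the rows of `feature`."""
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b
    pairwise_distances_squared = np.add(
        np.sum(np.square(feature), axis=1, keepdims=True),
        np.sum(np.square(np.transpose(feature)), axis=0, keepdims=True)
    ) - 2.0 * np.matmul(feature, np.transpose(feature))

    # Rounding can produce tiny negative values; zero out every entry
    # that is <= 0 (this also covers the diagonal).
    error_mask = np.less_equal(pairwise_distances_squared, 0.0)
    pairwise_distances = np.multiply(
        pairwise_distances_squared, np.logical_not(error_mask) + 0.0)
    return pairwise_distances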

def masked_maximum(data, mask, dim=1):
    axis_minimums = np.min(data, dim, keepdims=True)
    masked_maximums = np.max(np.multiply(data - axis_minimums, mask), dim, keepdims=True) + axis_minimums
    return masked_maximums


def masked_minimum(data, mask, dim=1):
    axis_maximums = np.max(data, dim, keepdims=True)
    masked_minimums = np.min(np.multiply(data - axis_maximums, mask), dim, keepdims=True) + axis_maximums
    return masked_minimums


lshape = np.shape(labels)
assert len(lshape) == 1
labels = np.reshape(labels, [lshape[0], 1])

pdist_matrix = pairwise_distance(embeddings)
adjacency = np.equal(labels, np.transpose(labels))
# only the instances with different labels should be trained.
adjacency_not = np.logical_not(adjacency)
batch_size = np.size(labels)


# compute the mask
pdist_matrix_tile = np.tile(pdist_matrix, [batch_size, 1])

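# Keep pairs with different labels whose distance is greater than the reference.
# Each of the B * B elements serves in turn as the reference for the comparison.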
mask = np.logical_and(np.tile(adjacency_not, [batch_size, 1]),
                      np.greater(pdist_matrix_tile, np.reshape(np.transpose(pdist_matrix), [-1, 1])))


mask_final = np.reshape(
    np.greater(np.sum(mask+0.0, 1, keepdims=True),0.0), [batch_size, batch_size])
mask_final = np.transpose(mask_final)

adjacency_not = adjacency_not + 0.0
mask = mask + 0.0


# negatives_outside: smallest D_an where D_an > D_ap.
negatives_outside = np.reshape(masked_minimum(pdist_matrix_tile, mask), [batch_size, batch_size])

negatives_outside = np.transpose(negatives_outside)


# negatives_inside: largest D_an.
negatives_inside = np.tile(
    masked_maximum(pdist_matrix, adjacency_not), [1, batch_size])
semi_hard_negatives = np.where(
    mask_final, negatives_outside, negatives_inside)


loss_mat = np.add(margin, pdist_matrix - semi_hard_negatives)

mask_positives = adjacency+0.0 - np.diag(np.ones([batch_size]))

num_positives = np.sum(mask_positives)

triplet_loss = np.true_divide(
        np.sum(np.maximum(np.multiply(loss_mat, mask_positives), 0.0)),
        num_positives)

Let's look at the key parts of the numpy code

adjacency = np.equal(labels, np.transpose(labels))
adjacency_not = np.logical_not(adjacency)

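Here labels has size [ $B$ , 1] and its transpose has size [1, $B$ ], so the result of the comparison has size [ $B$ , $B$ ].
$B$ is the batch size, and the row and column indices of each entry tell us whether the two embeddings at those indices share a label. This is how we decide whether a pair is P or N.
adjacency records whether two embeddings belong to the same label: it is True if they do.
adjacency_not is the opposite: it is True only when the two embeddings have different labels.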
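To make this concrete, here is a small sketch that builds both matrices for the labels from the test example. It only relies on numpy broadcasting.

```python
import numpy as np

labels = np.array([0, 1, 1, 0, 1]).reshape(-1, 1)  # [B, 1]
adjacency = np.equal(labels, labels.T)              # [B, B] via broadcasting
adjacency_not = np.logical_not(adjacency)

print(adjacency.astype(int))
# [[1 0 0 1 0]
#  [0 1 1 0 1]
#  [0 1 1 0 1]
#  [1 0 0 1 0]
#  [0 1 1 0 1]]
```

Row 0 says that sample 0 shares its label with sample 3 (and with itself), so those are its positive candidates. Everything that is 0 in that row is a negative.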

# compute the mask
pdist_matrix_tile = np.tile(pdist_matrix, [batch_size, 1])

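# Keep pairs with different labels whose distance is greater than the reference.
# Each of the B * B elements serves in turn as the reference for the comparison.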
mask = np.logical_and(np.tile(adjacency_not, [batch_size, 1]),
                      np.greater(pdist_matrix_tile, np.reshape(np.transpose(pdist_matrix), [-1, 1])))

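Here, pdist_matrix (the pairwise distance matrix, where each element is the distance between two embeddings) is first copied vertically $B$ times.
Just keep this in mind for now. I could not make sense of it at first either. This step prepares the comparison that follows.

pdist_matrix: (it holds $B$ * $B$ distances, although nearly half of them are duplicates because the matrix is symmetric)

Next we compute mask. What is this mask?
It contains adjacency_not, so only pairs with different labels count. (This is the first operand of the logical and.)
The second operand checks whether each entry of pdist_matrix_tile is greater than the corresponding reference distance.
Let's spell this out. Reshaping the transposed pdist_matrix to [-1, 1] gives a column vector of length $B*B$.
The comparison then looks like this: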
$\begin{array}{ccccc} d(e_1, e_1) & d(e_1, e_2) & d(e_1, e_3) & \cdots & d(e_1, e_B) \end{array}$ vs. $d(e_1, e_1)$

$\begin{array}{ccccc} d(e_2, e_1) & d(e_2, e_2) & d(e_2, e_3) & \cdots & d(e_2, e_B) \end{array}$ vs. $d(e_1, e_2)$

$\begin{array}{ccccc} d(e_3, e_1) & d(e_3, e_2) & d(e_3, e_3) & \cdots & d(e_3, e_B) \end{array}$ vs. $d(e_1, e_3)$
…
$\begin{array}{ccccc} d(e_B, e_1) & d(e_B, e_2) & d(e_B, e_3) & \cdots & d(e_B, e_B) \end{array}$ vs. $d(e_1, e_B)$

$\begin{array}{ccccc} d(e_1, e_1) & d(e_1, e_2) & d(e_1, e_3) & \cdots & d(e_1, e_B) \end{array}$ vs. $d(e_2, e_1)$
…

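In other words, each of the B * B elements serves in turn as the reference, and within its row we mark as True the distances that are greater than it, restricted to pairs with different labels. (Why? Because any of these reference distances may be the distance of an AP pair, and we find the negatives for it by looking for the distances that are larger.)
Since we already have adjacency_not, we know which distances belong to an anchor-positive pair, so we know which entries are AP distances and which distances in the row are larger than the AP distance. (Semi-hard mining asks for pairs whose AN distance is greater than the AP distance but not greater by more than the margin, which is why we look for larger distances here. The mask itself only checks $D_{an} > D_{ap}$. The upper bound $D_{ap} + \text{margin}$ takes effect later, when the loss is clamped at 0.)
So an entry only needs to satisfy these two conditions:

it belongs to a different label

its distance is greater than AP

to be a candidate AN for the semihard loss.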
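The tile-and-reshape trick is easier to read once you see it as a 3-D comparison. The sketch below is my own reformulation, not part of the TensorFlow code: it rebuilds the same mask with broadcasting, indexed [positive, anchor, negative], and checks that the two agree. The random embeddings are placeholder data.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.array([0, 1, 1, 0, 1]).reshape(-1, 1)
emb = rng.random((5, 4))                       # placeholder embeddings
B = labels.size

diff = emb[:, None, :] - emb[None, :, :]
pdist = np.sum(diff * diff, axis=-1)           # [B, B] squared distances
adjacency_not = labels != labels.T

# Tile-and-reshape version, as in the walkthrough.
mask_tiled = np.logical_and(
    np.tile(adjacency_not, [B, 1]),
    np.greater(np.tile(pdist, [B, 1]), np.reshape(pdist.T, [-1, 1])))

# Same mask as an explicit 3-D array indexed [positive, anchor, negative]:
# True where `negative` has a different label from `anchor` and
# D(anchor, negative) > D(anchor, positive).
mask_3d = adjacency_not[None, :, :] & (pdist[None, :, :] > pdist.T[:, :, None])

print(np.array_equal(mask_tiled, mask_3d.reshape(B * B, B)))  # True
```

So row i * B + j of the tiled mask answers the question "for anchor j and positive i, which samples are valid negatives that lie farther away than the positive?". This also explains the transposes that follow in the code: after reshaping to [B, B] the result is indexed [positive, anchor] and has to be flipped to [anchor, positive].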

def masked_minimum(data, mask, dim=1):
    axis_maximums = np.max(data, dim, keepdims=True)
    masked_minimums = np.min(np.multiply(data - axis_maximums, mask), dim, keepdims=True) + axis_maximums
    return masked_minimums

# negatives_outside: smallest D_an where D_an > D_ap.
negatives_outside = np.reshape(masked_minimum(pdist_matrix_tile, mask), [batch_size, batch_size])

negatives_outside = np.transpose(negatives_outside)

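Here we bring in the masked_minimum function.
Under the mask it finds the smallest distance. Note that the result is already a whole set of distances: for every pair treated as AP, it is the smallest distance that satisfies the condition. Look at the comment: the smallest $D_{an}$ such that $D_{an} > D_{ap}$. **This is exactly the semi-hard negative we need!** It is the closest negative that still satisfies the condition, and training then pushes this $D_{an}$ as far away as possible.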
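A quick way to convince yourself that the subtract-the-maximum trick works is to run it on a tiny hand-made matrix. The data and mask below are made up for illustration. The third row has an empty mask on purpose.

```python
import numpy as np


def masked_minimum(data, mask, dim=1):
    # Shift each row so every value is <= 0; masked-out entries become 0 after
    # the multiplication and can never win the minimum unless the mask is empty.
    axis_maximums = np.max(data, dim, keepdims=True)
    return np.min(np.multiply(data - axis_maximums, mask), dim, keepdims=True) + axis_maximums


data = np.array([[1.0, 5.0, 3.0],
                 [4.0, 2.0, 6.0],
                 [7.0, 8.0, 9.0]])
mask = np.array([[0.0, 1.0, 1.0],   # candidates: 5, 3 -> 3
                 [1.0, 0.0, 1.0],   # candidates: 4, 6 -> 4
                 [0.0, 0.0, 0.0]])  # no candidate   -> row maximum 9

result = masked_minimum(data, mask)
print(result.ravel())  # [3. 4. 9.]
```

Note the last row: with no valid candidate the function silently returns the row maximum, which is not a real answer. That is why the code needs a separate mask that records whether any candidate existed at all.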

So here is the question: if we have already found the semi-hard negatives, what are the remaining operations for?

def masked_maximum(data, mask, dim=1):
    axis_minimums = np.min(data, dim, keepdims=True)
    masked_maximums = np.max(np.multiply(data - axis_minimums, mask), dim, keepdims=True) + axis_minimums
    return masked_maximums

mask_final = np.reshape(
    np.greater(np.sum(mask+0.0, 1, keepdims=True),0.0), [batch_size, batch_size])
mask_final = np.transpose(mask_final)

# negatives_inside: largest D_an.
negatives_inside = np.tile(
    masked_maximum(pdist_matrix, adjacency_not), [1, batch_size])
semi_hard_negatives = np.where(
    mask_final, negatives_outside, negatives_inside)

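Here masked_maximum finds the largest distance under the mask. As the comment says, it is the largest $D_{an}$: for each anchor we also need the distance to its farthest negative.
The main reason:

Semi-hard mining has a significant limitation: it requires $D_{an} > D_{ap}$, yet some cases are worse than that. For some triplets $D_{an} < D_{ap}$, and those need optimizing even more. In this code it can happen that no negative satisfies $D_{an} > D_{ap}$ although negatives do exist. In that case we take the largest $D_{an}$ and optimize the gap between $D_{an}$ and $D_{ap}$. A triplet with $D_{an} < D_{ap}$ is called a hard triplet.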
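The usual terminology splits triplets into three kinds depending on where the negative sits relative to the positive. Here is a minimal sketch of that classification. The helper triplet_kind is hypothetical, written only for this post, and treating the boundary case $D_{an} = D_{ap}$ as semi-hard is my own choice.

```python
def triplet_kind(d_ap, d_an, margin=1.0):
    """Classify one triplet by its anchor-positive and anchor-negative distances."""
    if d_an < d_ap:
        return "hard"       # negative is closer than the positive
    if d_an < d_ap + margin:
        return "semi-hard"  # farther than the positive, but still inside the margin
    return "easy"           # already separated by the margin, loss is 0


# Fixed d_ap = 1.0 and margin = 1.0, with three made-up negative distances.
for d_an in (0.5, 1.4, 3.0):
    loss = max(1.0 - d_an + 1.0, 0.0)
    print(d_an, triplet_kind(1.0, d_an), loss)
```

Only the easy triplet ends up with a loss of 0. Both hard and semi-hard triplets still contribute, which is why falling back to some negative is better than dropping the pair when no semi-hard negative exists.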

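Finally we obtain mask_final, which marks the (anchor, positive) pairs for which at least one $D_{an} > D_{ap}$ exists. Only for those pairs do we use the semi-hard negative. That is this line: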

semi_hard_negatives = np.where(
    mask_final, negatives_outside, negatives_inside)

Where mask_final is True, negatives_outside is used. Everywhere else we fall back to negatives_inside.
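np.where here is a plain element-wise selection. A tiny sketch with made-up 2 x 2 values, indexed [anchor, positive], shows what ends up in semi_hard_negatives:

```python
import numpy as np

# Hypothetical values, not taken from the test example.
negatives_outside = np.array([[4.0, 9.0],
                              [7.0, 3.0]])
negatives_inside = np.array([[6.0, 6.0],
                             [8.0, 8.0]])
mask_final = np.array([[True, False],
                       [False, True]])

semi_hard_negatives = np.where(mask_final, negatives_outside, negatives_inside)
print(semi_hard_negatives)
# [[4. 6.]
#  [8. 3.]]
```

Entry [1, 0] is 8 even though negatives_outside holds a 7 there: the mask decides which source is used, and the two values are never compared with each other.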
The final loss matrix is then:

loss_mat = np.add(margin, pdist_matrix - semi_hard_negatives)

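Here the two are subtracted and the predefined margin is added, so that the gap between them respects the margin we defined in the embedding space. The margin is an important hyperparameter and worth tuning carefully.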

mask_positives = adjacency+0.0 - np.diag(np.ones([batch_size]))

num_positives = np.sum(mask_positives)

triplet_loss = np.true_divide(
        np.sum(np.maximum(np.multiply(loss_mat, mask_positives), 0.0)),
        num_positives)

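One last key point: we cannot use the distances on the diagonal, because they measure the distance between an anchor and itself. We want anchor-positive pairs, that is, the distance between two different embeddings with the same label, not the distance from an anchor to itself. That distance is always 0, so optimizing it is pointless.
So we set the diagonal of adjacency to 0, then sum and take the mean, which gives the triplet semihard loss.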
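When I want to double-check a vectorized implementation like this, I find it helpful to write the same logic as plain loops. The sketch below is my own reference version under a few assumptions: squared Euclidean distances, at least two different labels in the batch, and the hypothetical name semihard_loss_loops. It is slow, but every step maps onto the walkthrough above.

```python
import numpy as np


def semihard_loss_loops(labels, embeddings, margin=1.0):
    """Loop-based triplet semi-hard loss, assuming >= 2 classes in the batch."""
    labels = np.asarray(labels)
    emb = np.asarray(embeddings, dtype=np.float64)
    n = len(labels)
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.sum(diff * diff, axis=-1)        # squared distances, [n, n]

    total, num_positives = 0.0, 0
    for a in range(n):
        # Distances from the anchor to every sample with a different label.
        negatives = [dist[a, k] for k in range(n) if labels[k] != labels[a]]
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue                        # skip the diagonal and non-positives
            num_positives += 1
            outside = [d for d in negatives if d > dist[a, p]]
            # Smallest D_an beyond D_ap if one exists, else the largest D_an.
            d_an = min(outside) if outside else max(negatives)
            total += max(dist[a, p] - d_an + margin, 0.0)
    return total / num_positives


# Four 1-D points: class 0 at 0 and 1, class 1 at 3 and 7.
loss = semihard_loss_loops([0, 0, 1, 1], [[0.0], [1.0], [3.0], [7.0]], margin=1.0)
print(loss)  # 2.0
```

In this toy batch only the anchor at 3 contributes: its positive at 7 is at squared distance 16, no negative is farther than that, so the fallback picks the largest negative distance 9, giving 16 - 9 + 1 = 8. Averaged over the 4 positive pairs that is 2.0.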
