Dynamic Programming: Levenshtein Distance and Longest Common Subsequence

Original

seniusen

2020-06-17 19:31

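1. How to measure string similarity

How can we quantify the similarity between two strings? One option is edit distance: the minimum number of edits (inserting, deleting, or replacing a character) needed to turn one string into the other. The smaller the edit distance, the more similar the two strings.

Levenshtein distance allows insertion, deletion, and replacement, and measures how much two strings differ. The longest common subsequence (LCS) allows only insertion and deletion, and measures how much two strings have in common. For the example strings used in this article, mitcmu and mtacnu, the Levenshtein distance is 3 and the longest common subsequence has length 4.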

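2. Levenshtein distance

For two strings s and t, when $s[i] == t[j]$ we move on to s[i+1] and t[j+1]. When $s[i] != t[j]$, we have the following choices:

Delete s[i], or insert a character equal to s[i] into t, then move on to s[i+1] and t[j];
Delete t[j], or insert a character equal to t[j] into s, then move on to s[i] and t[j+1];
Replace s[i] with t[j], or t[j] with s[i], then move on to s[i+1] and t[j+1].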
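The choices above translate almost directly into a top-down recursion. The following is a minimal sketch, not the article's implementation: the names edit_distance_recursive and solve are made up for illustration, and memoization via functools.lru_cache is an added assumption to keep the recursion from recomputing the same (i, j) pairs.

```python
from functools import lru_cache


def edit_distance_recursive(s: str, t: str) -> int:
    """Top-down Levenshtein distance (illustrative helper, not from the article)."""

    @lru_cache(maxsize=None)
    def solve(i: int, j: int) -> int:
        # One string is exhausted: the remaining characters of the other
        # must all be inserted (or deleted).
        if i == len(s):
            return len(t) - j
        if j == len(t):
            return len(s) - i
        if s[i] == t[j]:
            # Characters match: move on to s[i+1] and t[j+1] at no cost.
            return solve(i + 1, j + 1)
        return 1 + min(
            solve(i + 1, j),      # delete s[i] (or insert it into t)
            solve(i, j + 1),      # delete t[j] (or insert it into s)
            solve(i + 1, j + 1),  # replace s[i] with t[j] (or vice versa)
        )

    return solve(0, 0)


print(edit_distance_recursive('mitcmu', 'mtacnu'))  # 3
```

Because the recursion depth grows with the string lengths, this form is only practical for short inputs; the table-based approach avoids that limit.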

Let edist[i][j] denote the minimum edit distance between the prefixes s[0..i] and t[0..j] (both ends inclusive). From the analysis above, edist[i][j] can be reached from three states: edist[i-1][j], edist[i][j-1], and edist[i-1][j-1].

When edist[i][j] comes from edist[i-1][j], the only way is to delete s[i] or insert a character equal to s[i] into t, so the edit distance increases by 1.
When edist[i][j] comes from edist[i][j-1], the only way is to delete t[j] or insert a character equal to t[j] into s, so the edit distance increases by 1.
When edist[i][j] comes from edist[i-1][j-1]: if $s[i] == t[j]$, the edit distance stays the same; if $s[i] != t[j]$, we must replace s[i] with t[j] or t[j] with s[i], so the edit distance increases by 1.

The final edist[i][j] is the smallest of the three, so we have:

$edist[i][j] = \begin{cases} \min(edist[i-1][j]+1, edist[i][j-1]+1, edist[i-1][j-1]) &\text{if } s[i] == t[j] \\ \min(edist[i-1][j]+1, edist[i][j-1]+1, edist[i-1][j-1]+1) &\text{if } s[i] != t[j] \end{cases}$
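The recurrence only applies when i ≥ 1 and j ≥ 1, so the first row and first column need their own values. Under the definition used here (0-indexed, inclusive prefixes), they can be written as follows; this is a restatement of what the initialization loops in the code compute, not an extra rule:

$edist[0][j] = \begin{cases} j &\text{if } s[0] \text{ occurs in } t[0..j] \\ j+1 &\text{otherwise} \end{cases} \qquad edist[i][0] = \begin{cases} i &\text{if } t[0] \text{ occurs in } s[0..i] \\ i+1 &\text{otherwise} \end{cases}$

The reasoning: t[0..j] has j+1 characters, so turning the single character s[0] into it takes at least j insertions. If s[0] already occurs somewhere in t[0..j], those j insertions suffice; otherwise one replacement is also needed.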

import numpy as np

s = 'mitcmu'
t = 'mtacnu'


def lwst_edit_dis(s, t):

    m = len(s)
    n = len(t)
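    # Plain int avoids uint8 overflow when a distance exceeds 255
    edist = np.ones((m, n), dtype=int)
    # Initialize the edit distance between s[0] and each prefix t[0..i]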
    for i in range(0, n):
        if s[0] == t[i]:
            edist[0][i] = i
        elif i != 0:
            edist[0][i] = edist[0][i-1] + 1
        else:
            edist[0][i] = 1
    # Initialize the edit distance between each prefix s[0..i] and t[0]
    for i in range(0, m):
        if s[i] == t[0]:
            edist[i][0] = i
        elif i != 0:
            edist[i][0] = edist[i-1][0] + 1
        else:
            edist[i][0] = 1

    for i in range(1, m):
        for j in range(1, n):
            temp = min(edist[i-1][j], edist[i][j-1]) + 1
            if s[i] == t[j]:
                edist[i][j] = min(temp, edist[i-1][j-1])
            else:
                edist[i][j] = min(temp, edist[i-1][j-1]+1)

    print(edist)
    return edist[m-1][n-1]


print(lwst_edit_dis(s, t))
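The function above reads s[0] and t[0], so it needs both strings to be non-empty. A common alternative formulation, sketched below as an assumption rather than as the author's method, indexes the table by prefix length instead of by last character position. That adds one extra row and column for the empty prefix and makes the initialization trivial. The name edit_distance_padded is hypothetical.

```python
def edit_distance_padded(s: str, t: str) -> int:
    """Levenshtein distance with a table indexed by prefix length (sketch)."""
    m, n = len(s), len(t)
    # dist[i][j]: distance between the first i chars of s and the first j of t.
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i  # delete all i characters of s
    for j in range(1, n + 1):
        dist[0][j] = j  # insert all j characters of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Replacement costs nothing when the characters already match.
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete s[i-1]
                dist[i][j - 1] + 1,         # insert t[j-1]
                dist[i - 1][j - 1] + cost,  # match or replace
            )
    return dist[m][n]


print(edit_distance_padded('mitcmu', 'mtacnu'))  # 3
print(edit_distance_padded('', 'abc'))           # 3
```

The trade-off is an off-by-one between table indices and string indices (dist[i][j] looks at s[i-1] and t[j-1]), in exchange for handling empty inputs without special cases.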

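3. Longest common subsequence

The longest common subsequence allows only two operations: insertion and deletion. For two strings s and t, when $s[i] == t[j]$ the LCS length increases by 1 and we move on to s[i+1] and t[j+1]. When $s[i] != t[j]$, the LCS length stays the same and we have the following choices:

Delete s[i], or insert a character equal to s[i] into t, then move on to s[i+1] and t[j];
Delete t[j], or insert a character equal to t[j] into s, then move on to s[i] and t[j+1].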
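These two choices can also be written as a short top-down recursion. This is an illustrative sketch only: lcs_length_recursive and solve are made-up names, and the lru_cache memoization is an added assumption.

```python
from functools import lru_cache


def lcs_length_recursive(s: str, t: str) -> int:
    """Top-down LCS length (illustrative helper, not from the article)."""

    @lru_cache(maxsize=None)
    def solve(i: int, j: int) -> int:
        # Nothing left in one of the strings: no further common characters.
        if i == len(s) or j == len(t):
            return 0
        if s[i] == t[j]:
            # Matching characters extend the common subsequence by 1.
            return 1 + solve(i + 1, j + 1)
        # Otherwise skip s[i] or skip t[j], whichever leaves more in common.
        return max(solve(i + 1, j), solve(i, j + 1))

    return solve(0, 0)


print(lcs_length_recursive('mitcmu', 'mtacnu'))  # 4
```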

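Let lcs[i][j] denote the length of the longest common subsequence of the prefixes s[0..i] and t[0..j] (both ends inclusive). From the analysis above, lcs[i][j] can be reached from three states: lcs[i-1][j], lcs[i][j-1], and lcs[i-1][j-1].

When lcs[i][j] comes from lcs[i-1][j], the only way is to delete s[i] or insert a character equal to s[i] into t, so the LCS length stays the same.
When lcs[i][j] comes from lcs[i][j-1], the only way is to delete t[j] or insert a character equal to t[j] into s, so the LCS length stays the same.
When lcs[i][j] comes from lcs[i-1][j-1]: if $s[i] == t[j]$, the LCS length increases by 1; if $s[i] != t[j]$, neither character extends the common subsequence (replacement is not an allowed operation here), so the LCS length stays the same.

The final lcs[i][j] is the largest of the three, so we have:

$lcs[i][j] = \begin{cases} \max(lcs[i-1][j], lcs[i][j-1], lcs[i-1][j-1]+1) &\text{if } s[i] == t[j] \\ \max(lcs[i-1][j], lcs[i][j-1], lcs[i-1][j-1]) &\text{if } s[i] != t[j] \end{cases}$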
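As with the edit distance, the recurrence needs i ≥ 1 and j ≥ 1. For the first row and first column, a single character can contribute at most 1 to the common subsequence, which gives the boundary values that the initialization loops in the code compute:

$lcs[0][j] = \begin{cases} 1 &\text{if } s[0] \text{ occurs in } t[0..j] \\ 0 &\text{otherwise} \end{cases} \qquad lcs[i][0] = \begin{cases} 1 &\text{if } t[0] \text{ occurs in } s[0..i] \\ 0 &\text{otherwise} \end{cases}$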

import numpy as np

s = 'mitcmu'
t = 'mtacnu'


def longest_commom_subsequence(s, t):

    m = len(s)
    n = len(t)
    lcs = np.ones((m, n), dtype=int)  # plain int avoids uint8 overflow past 255

    for i in range(0, n):
        if s[0] == t[i]:
            lcs[0][i] = 1
        elif i != 0:
            lcs[0][i] = lcs[0][i-1]
        else:
            lcs[0][i] = 0

    for i in range(0, m):
        if s[i] == t[0]:
            lcs[i][0] = 1
        elif i != 0:
            lcs[i][0] = lcs[i-1][0]
        else:
            lcs[i][0] = 0

    for i in range(1, m):
        for j in range(1, n):
            temp = max(lcs[i-1][j], lcs[i][j-1])
            if s[i] == t[j]:
                lcs[i][j] = max(temp, lcs[i-1][j-1]+1)
            else:
                lcs[i][j] = max(temp, lcs[i-1][j-1])

    print(lcs)
    return lcs[m-1][n-1]


print(longest_commom_subsequence(s, t))
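The function above returns only the length. If the subsequence itself is wanted, one common approach is to fill the table and then walk back from the bottom-right corner. The sketch below assumes a table indexed by prefix length (one extra row and column for the empty prefix), which differs from the indexing used above; lcs_string is a hypothetical name.

```python
def lcs_string(s: str, t: str) -> str:
    """Return one longest common subsequence of s and t (sketch)."""
    m, n = len(s), len(t)
    # length[i][j]: LCS length of the first i chars of s and the first j of t.
    length = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i - 1][j], length[i][j - 1])

    # Walk back from the bottom-right corner, collecting matched characters.
    chars = []
    i, j = m, n
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1]:
            chars.append(s[i - 1])
            i -= 1
            j -= 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1  # dropping s[i-1] loses nothing
        else:
            j -= 1  # dropping t[j-1] loses nothing
    return ''.join(reversed(chars))


common = lcs_string('mitcmu', 'mtacnu')
print(common, len(common))  # mtcu 4

# With only insertions and deletions allowed, the number of edits is
# len(s) + len(t) - 2 * len(LCS): every character outside the LCS is
# either deleted from s or inserted from t.
print(len('mitcmu') + len('mtacnu') - 2 * len(common))  # 4
```

When several subsequences share the maximum length, this walk returns just one of them; which one depends on the tie-breaking in the elif branch.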

