Minimum Edit Distance with the Edit Path as Output

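Edit distance, also known as Levenshtein distance, is the minimum number of edit operations needed to turn one string into another. The permitted operations are replacing one character with another, inserting a character, and deleting a character.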

For example, turning the word kitten into sitting takes three edits:

  1. sitten (k→s, replace)
  2. sittin (e→i, replace)
  3. sitting (→g, insert g at the end)

The Russian scientist Vladimir Levenshtein introduced the concept in 1965.

(The introduction above is taken from the Chinese Wikipedia article "編輯距離", http://zh.wikipedia.org/wiki/%E7%B7%A8%E8%BC%AF%E8%B7%9D%E9%9B%A2)


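Computing the minimum edit distance therefore means finding the smallest number of insertions, deletions and replacements that convert one string into the other.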


A common way to solve it is dynamic programming.



For the details of the calculation, please refer to the many articles on the topic; they are not repeated here.
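
As a brief reminder rather than a full derivation, the table filled in by the dynamic-programming approach follows the recurrence below. The notation is introduced here only for illustration: D(i, j) is the distance between the first i characters of s and the first j characters of t, and [s_i ≠ t_j] is 1 when the two characters differ and 0 otherwise.

$$
D(i,0) = i, \qquad D(0,j) = j, \qquad
D(i,j) = \min
\begin{cases}
D(i-1,j) + 1 & \text{(delete } s_i\text{)} \\
D(i,j-1) + 1 & \text{(insert } t_j\text{)} \\
D(i-1,j-1) + [s_i \neq t_j] & \text{(replace, or match at no cost)}
\end{cases}
$$

The edit distance is D(n, m), where n and m are the lengths of s and t.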

The O(mn)-space approach keeps every cell of the dynamic-programming matrix as it is computed.
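
To make the matrix concrete, the following minimal sketch fills the complete table for the kitten/sitting example and prints it row by row. The class name FullMatrixDemo and the method buildMatrix are invented for this illustration and do not come from any library.

```java
import java.util.Arrays;

public class FullMatrixDemo {

    // Fills the complete (n+1) x (m+1) table. Cell [i][j] holds the distance
    // between the first i characters of s and the first j characters of t.
    public static int[][] buildMatrix(CharSequence s, CharSequence t) {
        int n = s.length();
        int m = t.length();
        int[][] d = new int[n + 1][m + 1];

        for (int i = 0; i <= n; i++) {
            d[i][0] = i; // i deletions turn a prefix of s into the empty string
        }
        for (int j = 0; j <= m; j++) {
            d[0][j] = j; // j insertions build a prefix of t from the empty string
        }

        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        int[][] d = buildMatrix("kitten", "sitting");
        for (int[] row : d) {
            System.out.println(Arrays.toString(row));
        }
        // The bottom-right cell is the distance for the whole strings.
        System.out.println("distance = " + d[6][7]); // prints "distance = 3"
    }
}
```

Every cell stays available after the loops finish, which is what makes it possible to walk back through the table later.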

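The StringUtils class in Apache Commons Lang (the commons-lang package; commons-lang3-3.3.2 was the latest release at the time of writing) instead keeps only the previous row of the matrix. This saves space and avoids running out of memory on long strings. For details, see the source of org.apache.commons.lang3.StringUtils, specifically the method StringUtils.getLevenshteinDistance(CharSequence s, CharSequence t), reproduced here: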
// Misc
    //-----------------------------------------------------------------------
    /**
     * <p>Find the Levenshtein distance between two Strings.</p>
     *
     * <p>This is the number of changes needed to change one String into
     * another, where each change is a single character modification (deletion,
     * insertion or substitution).</p>
     *
     * <p>The previous implementation of the Levenshtein distance algorithm
     * was from <a href="http://www.merriampark.com/ld.htm">http://www.merriampark.com/ld.htm</a></p>
     *
     * <p>Chas Emerick has written an implementation in Java, which avoids an OutOfMemoryError
     * which can occur when my Java implementation is used with very large strings.<br>
     * This implementation of the Levenshtein distance algorithm
     * is from <a href="http://www.merriampark.com/ldjava.htm">http://www.merriampark.com/ldjava.htm</a></p>
     *
     * <pre>
     * StringUtils.getLevenshteinDistance(null, *)             = IllegalArgumentException
     * StringUtils.getLevenshteinDistance(*, null)             = IllegalArgumentException
     * StringUtils.getLevenshteinDistance("","")               = 0
     * StringUtils.getLevenshteinDistance("","a")              = 1
     * StringUtils.getLevenshteinDistance("aaapppp", "")       = 7
     * StringUtils.getLevenshteinDistance("frog", "fog")       = 1
     * StringUtils.getLevenshteinDistance("fly", "ant")        = 3
     * StringUtils.getLevenshteinDistance("elephant", "hippo") = 7
     * StringUtils.getLevenshteinDistance("hippo", "elephant") = 7
     * StringUtils.getLevenshteinDistance("hippo", "zzzzzzzz") = 8
     * StringUtils.getLevenshteinDistance("hello", "hallo")    = 1
     * </pre>
     *
     * @param s  the first String, must not be null
     * @param t  the second String, must not be null
     * @return result distance
     * @throws IllegalArgumentException if either String input {@code null}
     * @since 3.0 Changed signature from getLevenshteinDistance(String, String) to
     * getLevenshteinDistance(CharSequence, CharSequence)
     */
    public static int getLevenshteinDistance(CharSequence s, CharSequence t) {
        if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }

        /*
           The difference between this impl. and the previous is that, rather
           than creating and retaining a matrix of size s.length() + 1 by t.length() + 1,
           we maintain two single-dimensional arrays of length s.length() + 1.  The first, d,
           is the 'current working' distance array that maintains the newest distance cost
           counts as we iterate through the characters of String s.  Each time we increment
           the index of String t we are comparing, d is copied to p, the second int[].  Doing so
           allows us to retain the previous cost counts as required by the algorithm (taking
           the minimum of the cost count to the left, up one, and diagonally up and to the left
           of the current cost count being calculated).  (Note that the arrays aren't really
           copied anymore, just switched...this is clearly much better than cloning an array
           or doing a System.arraycopy() each time  through the outer loop.)

           Effectively, the difference between the two implementations is this one does not
           cause an out of memory condition when calculating the LD over two very large strings.
         */

        int n = s.length(); // length of s
        int m = t.length(); // length of t

        if (n == 0) {
            return m;
        } else if (m == 0) {
            return n;
        }

        if (n > m) {
            // swap the input strings to consume less memory
            final CharSequence tmp = s;
            s = t;
            t = tmp;
            n = m;
            m = t.length();
        }

        int p[] = new int[n + 1]; //'previous' cost array, horizontally
        int d[] = new int[n + 1]; // cost array, horizontally
        int _d[]; //placeholder to assist in swapping p and d

        // indexes into strings s and t
        int i; // iterates through s
        int j; // iterates through t

        char t_j; // jth character of t

        int cost; // cost

        for (i = 0; i <= n; i++) {
            p[i] = i;
        }

        for (j = 1; j <= m; j++) {
            t_j = t.charAt(j - 1);
            d[0] = j;

            for (i = 1; i <= n; i++) {
                cost = s.charAt(i - 1) == t_j ? 0 : 1;
                // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
            }

            // copy current distance counts to 'previous row' distance counts
            _d = p;
            p = d;
            d = _d;
        }

        // our last action in the above loop was to switch d and p, so p now
        // actually has the most recent cost counts
        return p[n];
    }


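If the goal is not only the Levenshtein distance but also the edit path, the straightforward approach is to keep the whole matrix and then backtrack through it to recover the path.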
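
As a compact illustration of the backtracking idea, here is a minimal, self-contained sketch that returns a readable edit script. It is not the implementation developed in the rest of this post: the class name EditPathSketch, the method editPath and the wording of the steps are all made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class EditPathSketch {

    // Returns one optimal edit script that turns s into t, in forward order.
    public static List<String> editPath(String s, String t) {
        int n = s.length();
        int m = t.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }

        // Walk back from d[n][m] to d[0][0]. At each cell, move to a
        // neighbour whose value explains the current one.
        List<String> ops = new ArrayList<>();
        int i = n;
        int j = m;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && s.charAt(i - 1) == t.charAt(j - 1)) {
                i--; // characters match: diagonal move, no edit recorded
                j--;
            } else if (i > 0 && j > 0 && d[i][j] == d[i - 1][j - 1] + 1) {
                ops.add("replace " + s.charAt(i - 1) + " with " + t.charAt(j - 1));
                i--;
                j--;
            } else if (i > 0 && d[i][j] == d[i - 1][j] + 1) {
                ops.add("delete " + s.charAt(i - 1));
                i--;
            } else {
                ops.add("insert " + t.charAt(j - 1));
                j--;
            }
        }
        Collections.reverse(ops); // backtracking finds the last edit first
        return ops;
    }

    public static void main(String[] args) {
        // prints [replace k with s, replace e with i, insert g]
        System.out.println(editPath("kitten", "sitting"));
    }
}
```

The number of recorded steps always equals the value in the bottom-right cell, because every recorded step moves to a cell whose value is exactly one smaller, while a match moves diagonally without changing the value.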

First define an operation class, OperateObj, which stores the position of each modification together with its replacement target.
package cn.com.sp.align.model;

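/*
 * Operation class that stores the details of a single edit. It has three members:
 *     index        the position of the operation
 *     targetStr    the replacement text (delete: the empty string ""; replace: the
 *                  target character; add: a string that contains the inserted character)
 *     operateType  the kind of operation, an OperateEnum value. The type could be
 *                  inferred from targetStr, but storing it avoids re-deriving it at every step.
 */
public class OperateObj {
	// The operation type is an enum with three values: add, delete, replace
	public enum OperateEnum {
		add, delete, replace;
	}
	
	// Position of the operation in the original string, before any edit is applied
	private int index = 0;
	
	// Replacement text for this operation
	private String targetStr = "";
	
	// Type of the operation
	private OperateEnum operateType;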
	

	public OperateObj(int index, String targetStr, OperateEnum operateType) {
		this.index = index;
		this.targetStr = targetStr;
		this.operateType = operateType;
	}

	public int getIndex() {
		return index;
	}

	public void setIndex(int index) {
		this.index = index;
	}

	public String getTargetStr() {
		return targetStr;
	}

	public void setTargetStr(String targetStr) {
		this.targetStr = targetStr;
	}

	public OperateEnum getOperateType() {
		return operateType;
	}

	public void setOperateType(OperateEnum operateType) {
		this.operateType = operateType;
	}
	
}
The class StringUtils_SP computes the Levenshtein distance and keeps every cell of the matrix along the way:

package cn.com.sp.align.levenshtein;

import java.util.ArrayList;
import java.util.List;

import cn.com.sp.align.model.OperateObj;
import cn.com.sp.align.model.OperateObj.OperateEnum;

public class StringUtils_SP {
	
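    /**
     * Computes the Levenshtein distance between s and t using the full matrix, and
     * appends one optimal sequence of edits to operateList. Operations are recorded
     * from the end of s toward its start, with indexes that refer to the original s.
     * If either string is empty the method returns early and records nothing.
     */
    public static int getLevenshteinDistance(CharSequence s, CharSequence t, List<OperateObj> operateList) {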
        
	if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }

        int n = s.length(); // length of s
        int m = t.length(); // length of t

        if (n == 0) {
            return m;
        } else if (m == 0) {
            return n;
        }

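        // distance[i][j] = edit distance between the first i chars of s and the first j chars of t
        int[][] distance = new int[n + 1][m + 1];

        // first column: delete all i characters of the prefix of s
        for (int i = 0; i <= n; ++i) {
            distance[i][0] = i;
        }

        // first row: insert all j characters of the prefix of t
        for (int j = 1; j <= m; ++j) {
            distance[0][j] = j;
        }

        for (int i = 1; i <= n; ++i) {
            for (int j = 1; j <= m; ++j) {
                // replacing costs nothing when the two characters already match
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                // cheaper of deletion (cell above) and insertion (cell to the left)
                int tempCost = Math.min(distance[i - 1][j] + 1, distance[i][j - 1] + 1);
                // compared with replacement or match (diagonal cell)
                distance[i][j] = Math.min(distance[i - 1][j - 1] + cost, tempCost);
            }
        }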
        
        
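        // Backtrack from the bottom-right cell to recover one optimal path.
        // Operations are emitted from the end of s toward its start, so applying them
        // in list order does not shift the index of any operation still to come.
        int i = n, j = m;
        int minDistance = distance[i][j];
        while (i > 0 && j > 0) {
            if (distance[i][j - 1] + 1 == minDistance) {
                // insertion: keep s[i-1] and put t[j-1] right after it
                operateList.add(new OperateObj(i - 1, s.charAt(i - 1) + "" + t.charAt(j - 1), OperateEnum.add));
                minDistance = distance[i][j - 1];
                j -= 1;
            } else if (distance[i - 1][j] + 1 == minDistance) {
                // deletion: drop s[i-1]
                operateList.add(new OperateObj(i - 1, "", OperateEnum.delete));
                minDistance = distance[i - 1][j];
                i -= 1;
            } else if (distance[i - 1][j - 1] + 1 == minDistance) {
                // replacement: s[i-1] becomes t[j-1]
                operateList.add(new OperateObj(i - 1, t.charAt(j - 1) + "", OperateEnum.replace));
                minDistance = distance[i - 1][j - 1];
                i -= 1;
                j -= 1;
            } else {
                // the characters match: move diagonally, the distance is unchanged
                i -= 1;
                j -= 1;
            }
        }

        // t is used up: whatever remains at the front of s is deleted
        while (i > 0) {
            operateList.add(new OperateObj(i - 1, "", OperateEnum.delete));
            i -= 1;
        }

        // s is used up (i == 0): the remaining characters of t go in front of s[0].
        // Each insertion replaces the character currently at position 0, which is s[0]
        // the first time and the previously inserted character afterwards. Reusing s[0]
        // every time would duplicate it when more than one character is inserted.
        char head = s.charAt(0);
        while (j > 0) {
            operateList.add(new OperateObj(0, t.charAt(j - 1) + "" + head, OperateEnum.add));
            head = t.charAt(j - 1);
            j -= 1;
        }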
        
        
        // the distance for the complete strings is in the bottom-right cell
        return distance[n][m];
   }

    public static void main(String[] args) {
        String s = "中華人民共和國";
        String t = "中化人名和國"; // s with two characters replaced and one removed

        ArrayList<OperateObj> operateList = new ArrayList<OperateObj>();

        System.out.println("Edit distance: " + StringUtils_SP.getLevenshteinDistance(s, t, operateList));

        // Replay the recorded operations on s, printing the string before each step
        String operateStr = s;
        for (int i = 0; i < operateList.size(); ++i) {
            OperateObj operateObj = operateList.get(i);

            System.out.println(operateStr);

            System.out.println(s.charAt(operateObj.getIndex()) + "(" + operateObj.getIndex() + "," + operateObj.getOperateType() + ") -> " + operateObj.getTargetStr());

            // replace the single character at index with the target string
            operateStr = operateStr.substring(0, operateObj.getIndex()) + operateObj.getTargetStr() + operateStr.substring(operateObj.getIndex() + 1);
        }

        System.out.println("");
        System.out.println(t);
    }
}

Running it produces:
Edit distance: 3
中華人民共和國
共(4,delete) -> 
中華人民和國
民(3,replace) -> 名
中華人名和國
華(1,replace) -> 化

中化人名和國


