Text Similarity Algorithm: Finding the Common Subsequence with Dynamic Programming

Original

2020-02-23 11:40

import java.text.NumberFormat;
import java.util.Locale;

public class Computeclass {
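    /*
     * Compute the similarity of two strings as a value between 0 and 1.
     */
    public static double SimilarDegree(String strA, String strB){
        String newStrA = removeSign(strA);
        String newStrB = removeSign(strB);
        // Denominator: length of the longer cleaned string.
        // Numerator: length of the common subsequence of the two cleaned strings.
        int temp = Math.max(newStrA.length(), newStrB.length());
        // Avoid 0/0 (NaN) when neither string contains a letter, digit, or Chinese character.
        if (temp == 0) return 0.0;
        int temp2 = longestCommonSubstring(newStrA, newStrB).length();
        return temp2 * 1.0 / temp;
    }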


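    /*
     * Remove every character that is not a Chinese character, a digit, or a letter,
     * and return the remaining characters as one string.
     */
    public static String removeSign(String str) {
        StringBuilder sb = new StringBuilder();
        // Walk through str and append each Chinese character, digit, or letter to sb.
        for (char item : str.toCharArray()) {
            if (charReg(item)) {
                sb.append(item);
            }
        }
        return sb.toString();
    }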


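    /*
     * Check whether a character is a Chinese character, a digit, or a letter.
     * Comparing punctuation and other symbols says nothing useful about similarity,
     * so symbols are left out of the comparison.
     */
    public static boolean charReg(char charValue) {
        return (charValue >= 0x4E00 && charValue <= 0x9FA5) || (charValue >= 'a' && charValue <= 'z')
                || (charValue >= 'A' && charValue <= 'Z') || (charValue >= '0' && charValue <= '9');
    }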


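    /*
     * Find the longest common subsequence using dynamic programming.
     * Despite the method name, the matched characters do not have to be
     * contiguous in either input string.
     */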
    public static String longestCommonSubstring(String strA, String strB) {     
        char[] chars_strA = strA.toCharArray();  
        char[] chars_strB = strB.toCharArray();   
        int m = chars_strA.length;     
        int n = chars_strB.length;   

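        /*
         * Fill the matrix. Row 0 and column 0 stay 0 (Java zero-initializes the array).
         * If chars_strA[i - 1] and chars_strB[j - 1] are equal, matrix[i][j] is the upper-left value plus 1;
         * otherwise matrix[i][j] is the larger of the value to the left and the value above.
         */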
        int[][] matrix = new int[m + 1][n + 1];     
        for (int i = 1; i <= m; i++) {    
            for (int j = 1; j <= n; j++) {      
                if (chars_strA[i - 1] == chars_strB[j - 1])     
                    matrix[i][j] = matrix[i - 1][j - 1] + 1;      
                else     
                    matrix[i][j] = Math.max(matrix[i][j - 1], matrix[i - 1][j]);     
            }     
        }  
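        /*
         * Walk back from matrix[m][n]. If the value differs from both matrix[m - 1][n]
         * and matrix[m][n - 1], the character at that position belongs to the common
         * subsequence and is stored in result, which is filled from the end.
         */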
        char[] result = new char[matrix[m][n]];      
        int currentIndex = result.length - 1;     
        while (matrix[m][n] != 0) {     
            if (matrix[m][n] == matrix[m][n - 1]) n--;
            else if (matrix[m][n] == matrix[m - 1][n]) m--;     
            else {     
                result[currentIndex] = chars_strA[m - 1];     
                currentIndex--;    
                n--;     
                m--;    
            }    
        }      
       return new String(result);     
    }    


    /*
     * Format the result as a percentage.
     */
    public static String similarityResult(double result){
        return NumberFormat.getPercentInstance(Locale.US).format(result);
    }

    public static void main(String[] args){
        double result = SimilarDegree("我喜歡看電影你呢 愛我自己我是愛自己 偏愛","我不喜歡看電影 我愛我自己我");
        System.out.println(result);
    }
}
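
SimilarDegree only uses the length of the common subsequence, so the full matrix and the walk back through it are not strictly needed for the score itself. Below is a minimal sketch of that idea, keeping just two rows of the table at a time. The class name LcsLengthSketch and its two methods are hypothetical and not part of the code above. The sketch also skips the removeSign step and assumes its inputs are already cleaned, and returning 0.0 for two empty inputs is my own assumption.

```java
// Hypothetical helper class for illustration; not part of the original post.
class LcsLengthSketch {

    // Length of the longest common subsequence of a and b.
    // Uses the same recurrence as the full matrix, but stores only
    // the previous row and the row currently being filled.
    static int lcsLength(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    curr[j] = prev[j - 1] + 1;                 // match: upper-left value plus 1
                } else {
                    curr[j] = Math.max(curr[j - 1], prev[j]);  // larger of left and above
                }
            }
            // Swap rows; index 0 is never written, so it stays 0 in both arrays.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Common subsequence length divided by the length of the longer input.
    // Assumption: two empty inputs give 0.0 rather than NaN.
    static double similarity(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 0.0;
        }
        return (double) lcsLength(a, b) / longer;
    }

    public static void main(String[] args) {
        // "ABCBDAB" and "BDCABA" share a common subsequence of length 4, e.g. "BCBA".
        System.out.println(lcsLength("ABCBDAB", "BDCABA")); // prints 4
        System.out.println(similarity("ABCBDAB", "BDCABA"));
    }
}
```

This uses memory proportional to the length of the second string instead of the product of both lengths. The trade-off is that the subsequence itself can no longer be recovered, so the full matrix is still the right choice when the matched characters are needed.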
