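I came across this problem today:
Find the most frequent two-character combination (bigram) in the text below.
(Example: in the string "1252336528952", the two-character combination "52" occurs 3 times, which is the highest frequency.)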
oneofthecentralresultsofairesearchinthe1970swasthattoachievegoodperformanceaisystemsmusthav
elargeamountsofknowledgeknowledgeispowertheslogangoeshumansclearlyusevastamountsofknowledge
andifaiistoachieveitslongtermgoalsaisystemsmustalsousevastamountssincehandcodinglargeamount
sofknowledgeintoasystemisslowtediousanderrorpronemachinelearningtechniqueshavebeendeveloped
toautomaticallyacquireknowledgeoftenintheformofifthenrulesproductionsunfortunatelythishasof
tenledtoautilityproblemminton1988bthelearninghascausedanoverallslowdowninthesystemforexampl
einmanysystemslearnedrulesareusedtoreducethenumberofbasicstepsthesystemtakesinordertosolvep
roblemsbypruningthesystemssearchspaceforinstancebutinordertodetermineateachstepwhichrulesar
eapplicablethesystemmustmatchthemagainstitscurrentsituationusingcurrenttechniquesthematcher
slowsdownasmoreandmorerulesareacquiredsoeachsteptakeslongerandlongerthisectcanoutweighthere
ductioninthenumberofstepstakensothatthenetresultisaslowdownthishasbeenobservedinseveralrece
ntsystemsminton1988aetzioni1990tambeetal1990cohen1990ofcoursetheproblemofslowdownfromincrea
singmatchcostisnotrestrictedtosystemsinwhichthepurposeofrulesistoreducethenumberofproblemso
lvingstepsasystemacquiringnewrulesforanypurposecanslowdowniftherulessignicantlyincreasethem
atchcostandintuitivelyoneexpectsthatthemoreproductionsthereareinasystemthehigherthetotalmat
chcostwillbethethesisofthisresearchisthatwecansolvethisprobleminabroadclassofsystemsbyimpro
vingthematchalgorithmtheyuseinessenceouraimistoenablethescalingupofthenumberofrulesinproduc
tionsystemsweadvancethestateoftheartinproductionmatchalgorithmsdevelopinganimprovedmatchalg
orithmwhoseperformancescaleswellonasignicantlybroaderclassofsystemsthanexistingalgorithmsfu
rthermorewedemonstratethatbyusingthisimprovedmatchalgorithmwecanreduceoravoidtheutilityprob
leminalargeclassofmachinelearningsystems
I found this problem quite interesting. I had implemented a solution in C before; this time I solve it in Java. Here is the code:
package com.company;

import java.io.*;
import java.util.*;

/**
 * @projectName test
 * @title Test3
 * @package com.company
 * @description Find the most frequent substrings in a string
 * @author IT_CREAT
 * @date 2020 2020/5/24/024 22:32
 * @version 1.0.0
 */
public class Test3 {
    /**
     * The string to test
     */
public static String testTtr = "oneofthecentralresultsofairesearchinthe1970swasthattoachievegoodper" +
"formanceaisystemsmusthavelargeamountsofknowledgeknowledgeispowertheslogangoeshumansclearly" +
"usevastamountsofknowledgeandifaiistoachieveitslongtermgoalsaisystemsmustalsousevastamounts" +
"sincehandcodinglargeamountsofknowledgeintoasystemisslowtediousanderrorpronemachinelearning" +
"techniqueshavebeendevelopedtoautomaticallyacquireknowledgeoftenintheformofifthenrulesprodu" +
"ctionsunfortunatelythishasoftenledtoautilityproblemminton1988bthelearninghascausedanoveral" +
"lslowdowninthesystemforexampleinmanysystemslearnedrulesareusedtoreducethenumberofbasicstep" +
"sthesystemtakesinordertosolveproblemsbypruningthesystemssearchspaceforinstancebutinorderto" +
"determineateachstepwhichrulesareapplicablethesystemmustmatchthemagainstitscurrentsituation" +
"usingcurrenttechniquesthematcherslowsdownasmoreandmorerulesareacquiredsoeachsteptakeslonge" +
"randlongerthisectcanoutweighthereductioninthenumberofstepstakensothatthenetresultisaslowdo" +
"wnthishasbeenobservedinseveralrecentsystemsminton1988aetzioni1990tambeetal1990cohen1990ofc" +
"oursetheproblemofslowdownfromincreasingmatchcostisnotrestrictedtosystemsinwhichthepurposeo" +
"frulesistoreducethenumberofproblemsolvingstepsasystemacquiringnewrulesforanypurposecanslow" +
"downiftherulessignicantlyincreasethematchcostandintuitivelyoneexpectsthatthemoreproduction" +
"sthereareinasystemthehigherthetotalmatchcostwillbethethesisofthisresearchisthatwecansolvet" +
"hisprobleminabroadclassofsystemsbyimprovingthematchalgorithmtheyuseinessenceouraimistoenab" +
"lethescalingupofthenumberofrulesinproductionsystemsweadvancethestateoftheartinproductionma" +
"tchalgorithmsdevelopinganimprovedmatchalgorithmwhoseperformancescaleswellonasignicantlybro" +
"aderclassofsystemsthanexistingalgorithmsfurthermorewedemonstratethatbyusingthisimprovedmat" +
"chalgorithmwecanreduceoravoidtheutilityprobleminalargeclassofmachinelearningsystems";
    /**
     * Keys of the returned map
     */
    public enum ReturnKey {
        COUNT, SUBSTRINGS
    }
    /**
     * Finds the most frequent substrings in a text file.
     *
     * @param chainNumber how many consecutive characters make up one substring, i.e. the substring length
     * @param filePath    path of the file to read
     * @return the most frequent substrings and their occurrence count
     */
    public static Map<ReturnKey, Object> searchMostSubstringsByFile(int chainNumber, String filePath) {
        List<String> mostSubstrings = new ArrayList<>();
        Map<ReturnKey, Object> returnMap = new LinkedHashMap<>(2);
        returnMap.put(ReturnKey.COUNT, 0);
        returnMap.put(ReturnKey.SUBSTRINGS, mostSubstrings);
        if (strIsEmpty(filePath)) {
            return returnMap;
        }
        File file = new File(filePath);
        if (file.exists()) {
            // try-with-resources closes the reader even if reading fails
            try (FileReader fileReader = new FileReader(file)) {
                char[] readChar = new char[1024];
                StringBuilder waitParsingStr = new StringBuilder();
                int readLength;
                while ((readLength = fileReader.read(readChar)) != -1) {
                    waitParsingStr.append(readChar, 0, readLength);
                }
                return searchMostSubstrings(chainNumber, waitParsingStr.toString());
            } catch (IOException e) {
                System.out.println(e.getMessage());
            }
        }
        return returnMap;
    }
    /**
     * Finds the most frequent substrings in a string.
     *
     * @param chainNumber    how many consecutive characters make up one substring, i.e. the substring length
     * @param waitParsingStr the string to parse
     * @return the most frequent substrings and their occurrence count
     */
    public static Map<ReturnKey, Object> searchMostSubstrings(int chainNumber, String waitParsingStr) {
        // List of the most frequent substrings found, to be returned
        List<String> mostSubstrings = new ArrayList<>();
        Map<ReturnKey, Object> returnMap = new LinkedHashMap<>(2);
        returnMap.put(ReturnKey.COUNT, 0);
        returnMap.put(ReturnKey.SUBSTRINGS, mostSubstrings);
        // Validate first: a null string would throw on length(), and a
        // non-positive chainNumber would yield empty substrings
        if (strIsEmpty(waitParsingStr) || chainNumber <= 0 || chainNumber > waitParsingStr.length()) {
            return returnMap;
        }
        // Length of the string to parse
        int waitParsingStrSize = waitParsingStr.length();
        System.out.println("Length of string to parse: " + waitParsingStrSize + ", content: " + waitParsingStr);
        // Highest occurrence count seen so far
        int mostSubstringCount = 0;
        // All distinct substrings already examined
        Set<String> substrings = new HashSet<>();
        // Slide over every start index that still leaves room for a full substring
        for (int i = 0; i + chainNumber <= waitParsingStrSize; i++) {
            String substr = waitParsingStr.substring(i, i + chainNumber);
            // Skip substrings that have already been counted
            if (substrings.contains(substr)) {
                continue;
            }
            substrings.add(substr);
            // Number of times this substring occurs in the string
            int substrCount = countStr(waitParsingStr, substr);
            if (substrCount > mostSubstringCount) {
                // New maximum: discard the previous winners and keep this substring
                mostSubstrings.clear();
                mostSubstrings.add(substr);
            } else if (substrCount == mostSubstringCount) {
                // Tie with the current maximum: keep this substring as well
                mostSubstrings.add(substr);
            }
            // Track the running maximum
            mostSubstringCount = Math.max(substrCount, mostSubstringCount);
        }
        returnMap.put(ReturnKey.COUNT, mostSubstringCount);
        return returnMap;
    }
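    /**
     * Counts non-overlapping occurrences of a substring, scanning left to right.
     *
     * @param str     the source string
     * @param sToFind the substring to look for
     * @return the number of non-overlapping occurrences of sToFind in str
     */
    private static int countStr(String str, String sToFind) {
        // An empty pattern matches everywhere and would never advance the scan
        if (sToFind.isEmpty()) {
            return 0;
        }
        int num = 0;
        int from = str.indexOf(sToFind);
        while (from != -1) {
            num++;
            // Resume right after the current match instead of copying the remainder
            from = str.indexOf(sToFind, from + sToFind.length());
        }
        return num;
    }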
    /**
     * Checks whether a string is null or empty.
     *
     * @param str the string to check
     * @return true if the string is null or empty, false otherwise
     */
    private static boolean strIsEmpty(String str) {
        return str == null || str.isEmpty();
    }
    public static void main(String[] args) {
        Map<ReturnKey, Object> returnKeyObjectMap1 = searchMostSubstrings(2, testTtr);
        System.out.println("Highest substring occurrence count: " + returnKeyObjectMap1.get(ReturnKey.COUNT));
        System.out.println("Most frequent substrings: " + returnKeyObjectMap1.get(ReturnKey.SUBSTRINGS));
        Map<ReturnKey, Object> returnKeyObjectMap2 = searchMostSubstringsByFile(2, "C:\\Users\\Administrator\\Desktop\\test\\src\\com\\company\\test.txt");
        System.out.println("Highest substring occurrence count: " + returnKeyObjectMap2.get(ReturnKey.COUNT));
        System.out.println("Most frequent substrings: " + returnKeyObjectMap2.get(ReturnKey.SUBSTRINGS));
    }
}
The output looks like this:
Length of string to parse: 1860, content: oneofthecentralresultsofairesearchinthe1970swasthattoachievegoodperformanceaisystemsmusthavelargeamountsofknowledgeknowledgeispowertheslogangoeshumansclearlyusevastamountsofknowledgeandifaiistoachieveitslongtermgoalsaisystemsmustalsousevastamountssincehandcodinglargeamountsofknowledgeintoasystemisslowtediousanderrorpronemachinelearningtechniqueshavebeendevelopedtoautomaticallyacquireknowledgeoftenintheformofifthenrulesproductionsunfortunatelythishasoftenledtoautilityproblemminton1988bthelearninghascausedanoverallslowdowninthesystemforexampleinmanysystemslearnedrulesareusedtoreducethenumberofbasicstepsthesystemtakesinordertosolveproblemsbypruningthesystemssearchspaceforinstancebutinordertodetermineateachstepwhichrulesareapplicablethesystemmustmatchthemagainstitscurrentsituationusingcurrenttechniquesthematcherslowsdownasmoreandmorerulesareacquiredsoeachsteptakeslongerandlongerthisectcanoutweighthereductioninthenumberofstepstakensothatthenetresultisaslowdownthishasbeenobservedinseveralrecentsystemsminton1988aetzioni1990tambeetal1990cohen1990ofcoursetheproblemofslowdownfromincreasingmatchcostisnotrestrictedtosystemsinwhichthepurposeofrulesistoreducethenumberofproblemsolvingstepsasystemacquiringnewrulesforanypurposecanslowdowniftherulessignicantlyincreasethematchcostandintuitivelyoneexpectsthatthemoreproductionsthereareinasystemthehigherthetotalmatchcostwillbethethesisofthisresearchisthatwecansolvethisprobleminabroadclassofsystemsbyimprovingthematchalgorithmtheyuseinessenceouraimistoenablethescalingupofthenumberofrulesinproductionsystemsweadvancethestateoftheartinproductionmatchalgorithmsdevelopinganimprovedmatchalgorithmwhoseperformancescaleswellonasignicantlybroaderclassofsystemsthanexistingalgorithmsfurthermorewedemonstratethatbyusingthisimprovedmatchalgorithmwecanreduceoravoidtheutilityprobleminalargeclassofmachinelearningsystems
Highest substring occurrence count: 53
Most frequent substrings: [th]
Length of string to parse: 1860, content: oneofthecentralresultsofairesearchinthe1970swasthattoachievegoodperformanceaisystemsmusthavelargeamountsofknowledgeknowledgeispowertheslogangoeshumansclearlyusevastamountsofknowledgeandifaiistoachieveitslongtermgoalsaisystemsmustalsousevastamountssincehandcodinglargeamountsofknowledgeintoasystemisslowtediousanderrorpronemachinelearningtechniqueshavebeendevelopedtoautomaticallyacquireknowledgeoftenintheformofifthenrulesproductionsunfortunatelythishasoftenledtoautilityproblemminton1988bthelearninghascausedanoverallslowdowninthesystemforexampleinmanysystemslearnedrulesareusedtoreducethenumberofbasicstepsthesystemtakesinordertosolveproblemsbypruningthesystemssearchspaceforinstancebutinordertodetermineateachstepwhichrulesareapplicablethesystemmustmatchthemagainstitscurrentsituationusingcurrenttechniquesthematcherslowsdownasmoreandmorerulesareacquiredsoeachsteptakeslongerandlongerthisectcanoutweighthereductioninthenumberofstepstakensothatthenetresultisaslowdownthishasbeenobservedinseveralrecentsystemsminton1988aetzioni1990tambeetal1990cohen1990ofcoursetheproblemofslowdownfromincreasingmatchcostisnotrestrictedtosystemsinwhichthepurposeofrulesistoreducethenumberofproblemsolvingstepsasystemacquiringnewrulesforanypurposecanslowdowniftherulessignicantlyincreasethematchcostandintuitivelyoneexpectsthatthemoreproductionsthereareinasystemthehigherthetotalmatchcostwillbethethesisofthisresearchisthatwecansolvethisprobleminabroadclassofsystemsbyimprovingthematchalgorithmtheyuseinessenceouraimistoenablethescalingupofthenumberofrulesinproductionsystemsweadvancethestateoftheartinproductionmatchalgorithmsdevelopinganimprovedmatchalgorithmwhoseperformancescaleswellonasignicantlybroaderclassofsystemsthanexistingalgorithmsfurthermorewedemonstratethatbyusingthisimprovedmatchalgorithmwecanreduceoravoidtheutilityprobleminalargeclassofmachinelearningsystems
Highest substring occurrence count: 53
Most frequent substrings: [th]
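The code takes the substring length as a parameter, so it handles the 2-character case from the problem as well as longer or shorter combinations, and it can also find the most frequent character combinations in a text file.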
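As an aside, the approach above rescans the whole string once for every distinct substring. The sketch below shows a possible alternative that counts every window in a single pass with a `HashMap`. It is not the code used for the output above: the class `NGramCounter`, the `Result` holder, and the method `mostFrequent` are hypothetical names introduced only for this illustration. It also counts overlapping occurrences (for example "aa" occurs twice in "aaa"), whereas `countStr` skips past each match; the two only differ for substrings that can overlap themselves.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-pass variant; names are illustrative only.
public class NGramCounter {

    /** Holds the highest count and every substring that reaches it. */
    public static final class Result {
        public final int count;
        public final List<String> substrings;

        Result(int count, List<String> substrings) {
            this.count = count;
            this.substrings = substrings;
        }
    }

    public static Result mostFrequent(int n, String text) {
        List<String> best = new ArrayList<>();
        // Same guards as the original: null/empty input or an unusable length
        if (text == null || n <= 0 || n > text.length()) {
            return new Result(0, best);
        }
        // One pass: count every window of length n, overlapping ones included
        Map<String, Integer> counts = new HashMap<>();
        int max = 0;
        for (int i = 0; i + n <= text.length(); i++) {
            int c = counts.merge(text.substring(i, i + n), 1, Integer::sum);
            max = Math.max(max, c);
        }
        // Collect every substring whose count equals the maximum
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() == max) {
                best.add(e.getKey());
            }
        }
        return new Result(max, best);
    }

    public static void main(String[] args) {
        // The example from the problem statement
        Result r = mostFrequent(2, "1252336528952");
        System.out.println(r.count + " " + r.substrings); // prints "3 [52]"
    }
}
```

A small typed holder is used here instead of `Map<ReturnKey, Object>` so that callers do not need casts; when several substrings tie, their order in the list follows `HashMap` iteration order and is therefore not guaranteed.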