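Some time ago I had a requirement: given a batch of project data, clean each record's title according to a set of rules, then drop the records whose cleaned titles are duplicates to get the final result.
The project data has four states: announcement (公告), pre-announcement (預告), result (結果), and change (變更). For this requirement, announcements and pre-announcements share one rule set, and results and changes share another. The rule keywords are as follows:
The specific rules are as follows:
From this it follows that I had to remove every combination of rule keywords from the target data.
The requirement was urgent at the time, so I immediately wrote a very rough program based on the rules.
The approach was as follows: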
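- I was worried that generating the keyword combinations at run time would be inconvenient, so I first used a few for loops to generate all the combinations, then copied the output into some constant arrays.
- Clean the data according to the rules.
- Insert the cleaned data into the result table. On each insert, check whether the table already holds a row with the same title and skip the insert if it does. This yields the deduplicated data.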
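That looked fine, but the very first step already failed, because far too many keyword combinations were generated. Assigning them to constant arrays ran out of space and the project could not start.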
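To get a feel for why the list grew so large: the number of combinations for a pattern is simply the product of the group sizes. The sketch below uses the group sizes of the keyword arrays listed later in this post (A=9, B=11, C=9, D=3, E=1, X=9). The class name `ComboCount` and its method are made up for illustration and are not part of the original program.

```java
import java.util.Map;

// Hypothetical helper, not part of the original program.
final class ComboCount {
    // Group sizes taken from the keyword arrays shown later in the post.
    private static final Map<Character, Integer> SIZES =
            Map.of('a', 9, 'b', 11, 'c', 9, 'd', 3, 'e', 1, 'x', 9);

    // Number of strings produced by concatenating one keyword per letter of the pattern.
    static long count(String pattern) {
        long total = 1;
        for (char group : pattern.toCharArray()) {
            total *= SIZES.get(group);
        }
        return total;
    }

    public static void main(String[] args) {
        for (String p : new String[] {"bd", "bbd", "bbbd", "abbcd", "bbbcd"}) {
            System.out.println(p + " -> " + count(p));
        }
    }
}
```

For example, the abbcd pattern alone yields 9 × 11 × 11 × 9 × 3 = 29,403 strings, so pasting every pattern into source code as literals adds up quickly.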
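So I improved the first step.
- Since the combinations could not be assigned to constant arrays, I stored them in database tables. At run time the program simply loads each table into a list.
- At this point the requester said the original title had to be kept, so I put the spare field spare4 to use for the cleaned title. Before the rules run, copy the title field into spare4, then clean spare4 only.
- I also found that the last step of the previous approach was unnecessary. Running a group by spare4, progid over the cleaned data is enough to get the final result.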
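What that group by achieves can be sketched in memory: keep the first row for each (spare4, progid) pair. This is only an illustration with a made-up `Row` class; in the post the grouping is done by the database.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration; the post does this step with SQL.
final class GroupByDedup {
    // Minimal stand-in for a table row.
    static final class Row {
        final long id;
        final int progid;
        final String spare4;

        Row(long id, int progid, String spare4) {
            this.id = id;
            this.progid = progid;
            this.spare4 = spare4;
        }
    }

    // Keep the first row seen for every (spare4, progid) pair, preserving input order.
    static List<Row> dedupe(List<Row> rows) {
        Map<String, Row> firstByKey = new LinkedHashMap<>();
        for (Row row : rows) {
            // progid is numeric, so the first '|' separates the two parts unambiguously.
            String key = row.progid + "|" + row.spare4;
            firstByKey.putIfAbsent(key, row);
        }
        return new ArrayList<>(firstByKey.values());
    }
}
```

Rows with the same cleaned title but a different progid are kept separately, which mirrors grouping on both columns.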
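After modifying the program I measured its speed and found it was not very fast, because the cleaning for all states lived in one function and had to run serially. Suppose there are one million rows: 500,000 announcements and pre-announcements, and 500,000 results and changes. I would have to wait for one state to finish cleaning before the next could start.
In that case, why not start several threads to process the different states?
So I started four threads, one for each of the four states, which together cover the two classes of data.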
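A minimal sketch of running one cleaning job per progid in parallel with an `ExecutorService`. The post's own code declares four `@Scheduled` methods instead; the class name and the `cleanProgid` placeholder here are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: one worker per progid value.
final class ParallelClean {
    // Placeholder for the real per-state cleaning; it only reports which progid it handled.
    static int cleanProgid(int progid) {
        return progid;
    }

    // Run the four states concurrently and wait for all of them to finish.
    static List<Integer> cleanAll() {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int progid = 0; progid <= 3; progid++) {
                final int id = progid;
                Callable<Integer> job = () -> cleanProgid(id);
                futures.add(pool.submit(job));
            }
            List<Integer> done = new ArrayList<>();
            for (Future<Integer> future : futures) {
                done.add(future.get()); // blocks until that state's job completes
            }
            return done;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("cleaning job failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

One thing worth checking in a Spring project: with the default single-threaded scheduler, `@Scheduled` methods do not run concurrently unless a larger scheduling pool is configured.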
The first version of the program was therefore as follows:
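/** Deduplication flow
* 1. Copy the title into spare4, then clean the title stored in spare4.
* Provide the parameters.
* 2. Copy the rows into another table, using spare4 to detect equal titles: copy only the first row of each duplicate group and skip the rest, which completes deduplication. The final data can then be exported from that table (not done for now; run group by spare4 in the database instead).
*/
/**
* Deduplication job for progid=3
*/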
@Scheduled(fixedDelay = 60*60*60*1000)
public void clean_3(){
// This SQL is passed on to cleanTheTitle.
String spare4_sql_1="select * from big_customer_data_zhengfu_2 where taskId=201907081530 and progid=3 limit ";
// This SQL is used to decide when to leave the loop.
String spare4_sql_2="select title,id,contentId from big_customer_data_zhengfu_2 where taskId=201907081530 and progid=3 limit ";
// Cleaning category (the progid): 0 and 1 use the first rule set, 2 and 3 use the second.
int type=3;
// Name of the table that holds the source data
String table="big_customer_data_zhengfu_2";
cleanTheSpare4(spare4_sql_1,0,spare4_sql_2,type,table);
}
/**
* Deduplication job for progid=2
*/
@Scheduled(fixedDelay = 60*60*60*1000)
public void clean_2(){
// This SQL is passed on to cleanTheTitle.
String spare4_sql_1="select * from big_customer_data_zhengfu_2 where taskId=201907081530 and progid=2 limit ";
// This SQL is used to decide when to leave the loop.
String spare4_sql_2="select title,id,contentId from big_customer_data_zhengfu_2 where taskId=201907081530 and progid=2 limit ";
// Cleaning category (the progid): 0 and 1 use the first rule set, 2 and 3 use the second.
int type=2;
// Name of the table that holds the source data
String table="big_customer_data_zhengfu_2";
cleanTheSpare4(spare4_sql_1,0,spare4_sql_2,type,table);
}
/**
* Deduplication job for progid=1
*/
@Scheduled(fixedDelay = 60*60*60*1000)
public void clean_1(){
// This SQL is passed on to cleanTheTitle.
String spare4_sql_11="select * from big_customer_data_zhengfu_2 where taskId=201907081530 and progid=1 limit ";
// This SQL is used to decide when to leave the loop.
String spare4_sql_22="select title,id,contentId from big_customer_data_zhengfu_2 where taskId=201907081530 and progid=1 limit ";
// Name of the table that holds the source data
String table="big_customer_data_zhengfu_2";
// Cleaning category (the progid): 0 and 1 use the first rule set, 2 and 3 use the second.
int type_1=1;
cleanTheSpare4(spare4_sql_11,0,spare4_sql_22,type_1,table);
}
/**
* Deduplication job for progid=0
*/
@Scheduled(fixedDelay = 60*60*60*1000)
public void clean_0(){
// This SQL is passed on to cleanTheTitle.
String spare4_sql_11="select * from big_customer_data_zhengfu_2 where taskId=201907081530 and progid=0 limit ";
// This SQL is used to decide when to leave the loop.
String spare4_sql_22="select title,id,contentId from big_customer_data_zhengfu_2 where taskId=201907081530 and progid=0 limit ";
// Name of the table that holds the source data
String table="big_customer_data_zhengfu_2";
// Cleaning category (the progid): 0 and 1 use the first rule set, 2 and 3 use the second.
int type_1=0;
cleanTheSpare4(spare4_sql_11,0,spare4_sql_22,type_1,table);
}
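/**
* Generic entry point, so the cleaning can be invoked directly.
* Takes the table name, taskId, and progid.
* Each progid has its own cleaning rules.
* @param table name of the table that holds the source data
* @param taskId id of the task to clean (long, because values such as 201907081530 do not fit in an int)
* @param progid project state, 0 to 3
*/
public void cleanAbstract(String table,long taskId,int progid){
// The spaces around the concatenated values are required, otherwise the SQL is invalid.
String spare4Sql1="select * from "+table+" where taskId="+taskId+" and progid="+progid+" limit ";
// This SQL is used to decide when to leave the loop.
String spare4Sql2="select id from "+table+" where taskId="+taskId+" and progid="+progid+" limit ";
cleanTheSpare4(spare4Sql1,0,spare4Sql2,progid,table);
}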
/**
* Copy the title into spare4, then clean the title stored in spare4.
*/
public void cleanTheSpare4(String sql_1,int startnum,String sql_2,int type,String table){
int num=1000;
int start=startnum;
while(true){
// Fetch 1000 rows per pass and run every keyword group over each row
String sql=sql_1+start+","+num;
List<Map<String, Object>> WORD = shujuzuJdbcTemplate.queryForList(sql_2+start+","+num);
cleanTheTitle(sql,type,table);
start+=num;
log.info("type "+type+" processed up to: "+start);
if(WORD.size()<num){
break;
}
}
log.info("type "+type+" finished");
}
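/**
* Clean the titles.
* First fetch the list using the SQL passed in, then clean it with the matching rule set. classify is the progid: 0 and 1 are announcements and pre-announcements, 2 and 3 are results and changes.
* sql="select * from <table> order by <duplicated column>,ID",
* Take each title, look up the rows with that title in the database, and fetch that list.
* Finally run one overall deduplication pass.
* @param sql query that selects the current batch
*/
public void cleanTheTitle(String sql,int classify,String table) {
// Copy title into spare4 for the rows selected by sql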
saveTheTitle(sql,table);
if(classify==1||classify==0){
ruleFirstTwoTwo(sql,table);
}
if(classify==2 ||classify==3){
ruleSecondTwoTwo(sql,table);
}
}
/** Currently not executed.
* Copy rows and deduplicate them.
*/
public void cloneAndClean(String sql_1,String sql_2,String sql_3){
// Store the qualifying rows in the database and tag them.
int num=1000;
int start=0;
while(true){
// Fetch 1000 rows per pass and insert a row only if no row with the same title exists yet
List<Map<String, Object>> WORD = shujuzuJdbcTemplate.queryForList(sql_1+start+","+num);
for(Map<String,Object> map:WORD){
// Skip the row if its cleaned title is already present
if(map.get("spare4")!=null){
if(shujuzuJdbcTemplate.queryForList(sql_2,map.get("spare4")).size()==0){
// Insert the row
shujuzuJdbcTemplate.update(sql_3,map.get("id"));
}
}
}
start+=num;
log.info("processed up to: "+start);
if(WORD.size()<num){
break;
}
}
log.info("copy finished");
}
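/**
* Clean the rows with progid=0 or progid=1
* Fetch the list of rows to clean (cleanList)
* Call clean_it for each keyword group in order
* @param sql query that selects the current batch
* @param table name of the table that holds the source data
*/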
public void ruleFirstTwoTwo(String sql,String table){
List<Map<String, Object>> cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"first_xbd_bdx","first_xbd_bdx");
cleanList=shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"first_abd_abbd","first_abd_abbd");
cleanList=shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"first_bbbd","first_bbbd");
cleanList=shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"first_bbbed","first_bbbed");
cleanList=shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"first_bbd_bbed","first_bbd_bbed");
cleanList=shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"first_bd_bed","first_bd_bed");
String str="";
cleanList=shujuzuJdbcTemplate.queryForList(sql);
for(Map<String, Object> clean:cleanList){
// Skip rows without a spare4 value, as ruleSecondTwoTwo does
if(null==clean.get("spare4")){
continue;
}
str=clean.get("spare4").toString();
str=str.replaceAll("\\[.*?]","");
str=str.replaceAll("\\【.*?】","");
str=str.replaceAll(":","");
str=str.replaceAll(":","");
str=str.replaceAll("\\[|\\]","");
str=str.replaceAll("\\【|\\】","");
str=str.replaceAll("_","");
str=str.replaceAll("-","");
str=str.replaceAll("\\.","");
if(!clean.get("spare4").equals(str)){
shujuzuJdbcTemplate.update("update "+table+" set spare4=? where id=?",str,clean.get("id").toString());
}
}
// log.info("words_x finished");
}
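/**
* Order of steps for changes and results (rule set two, applied by ruleSecondTwoTwo)
* Step 1: remove exact matches of the "cdx" and "xcd" combinations;
* Step 2: remove exact matches of the "abcd" combinations;
* Step 3: remove exact matches of the "abbcd" combinations;
* Step 4: remove exact matches of the "bbbcd" and "bbbed" combinations;
* Step 5: remove exact matches of the "bbcd" and "bbced" combinations;
* Step 6: remove exact matches of the "bcd" and "bced" combinations;
* Step 7: remove exact matches of the "ccd" and "cced" combinations;
* Step 8: remove exact matches of the "cc", "cd", "cec", and "ced" combinations;
* Step 9: remove "x";
*/
public void saveTheTitle(String sql,String table){
// Fetch the data
String str="";
List<Map<String, Object>> cleanList = shujuzuJdbcTemplate.queryForList(sql);
// Store each row's title in spare4.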
for(Map<String, Object> clean:cleanList){
if(null!=clean.get("title")){
str=clean.get("title").toString();
shujuzuJdbcTemplate.update("update "+table+" set spare4=? where id=?",str,clean.get("id").toString());
}
}
}
public void clean_it( List<Map<String, Object>> cleanList,String table,String colunmName,String tableName){
String str=null;
List<Map<String, Object>> cleanWords = shujuzuJdbcTemplate.queryForList("select "+colunmName+" from "+tableName);
for(Map<String, Object> clean:cleanList){
if( null!=clean.get("spare4")){
str=clean.get("spare4").toString();
if(!str.isEmpty()){
for(Map<String, Object> cleanwords:cleanWords){
if (str.contains(cleanwords.get(colunmName).toString())){
str=str.replaceAll(cleanwords.get(colunmName).toString(),"");
shujuzuJdbcTemplate.update("update "+table+" set spare4=? where id=?",str,clean.get("id").toString());
}
}
}
}
}
}
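/**
* Clean the rows with progid=2 or progid=3
* Fetch the list of rows to clean (cleanList)
* Call clean_it for each keyword group in order
* @param sql query that selects the current batch
* @param table name of the table that holds the source data
*/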
public void ruleSecondTwoTwo(String sql,String table){
List<Map<String, Object>> cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_cdx_xcd","second_cdx_xcd");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_abcd","second_abcd");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_abbcd","second_abbcd");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_bbbcd","second_bbbcd");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_bbbed","second_bbbed");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_bbcd","second_bbcd");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_bbced","second_bbced");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_bcd","second_bcd");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_bced","second_bced");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_ccd","second_ccd");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_cc","second_cc");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_cd","second_cd");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_cec","second_cec");
cleanList = shujuzuJdbcTemplate.queryForList(sql);
clean_it(cleanList,table,"second_ced","second_ced");
List<Map<String, Object>> WORDS_X = shujuzuJdbcTemplate.queryForList("select words_x from words_x");
cleanList=shujuzuJdbcTemplate.queryForList(sql);
String str="";
for(Map<String, Object> clean:cleanList){
if(null!=clean.get("spare4")){
str=clean.get("spare4").toString();
str=str.replaceAll("\\[.*?]","");
str=str.replaceAll("\\【.*?】","");
str=str.replaceAll(":","");
str=str.replaceAll(":","");
str=str.replaceAll("\\[|\\]","");
str=str.replaceAll("\\【|\\】","");
str=str.replaceAll("_","");
str=str.replaceAll("-","");
str=str.replaceAll("\\.","");
if(!clean.get("spare4").equals(str)){
shujuzuJdbcTemplate.update("update "+table+" set spare4=? where id=?",str,clean.get("id").toString());
}
}
}
}
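In testing, running locally was slow: about fifteen minutes for 20,000 rows.
Around then another requirement came in to deduplicate titles across 800,000 rows. At that rate it would take ten hours.
Nobody can put up with that. It had to go on the server, no question.
On the server, a batch of test data took about three minutes per 20,000 rows. Our team only gets 30% of the bandwidth, but the speed still looked acceptable. At that rate 800,000 rows would take two hours, which was tolerable.
In theory, this deduplication program does more than simple database reads and writes: the rules are fairly complex and there are a great many keyword combinations, and all of that slows it down. Even so, it was far too slow! It had to be fixed, no question!
How? Let's start with the shabby code pasted above. It is actually still somewhat readable, because I factored some of the repeated steps into functions; otherwise it would be even more long-winded and smelly.
Fine, I admit it: it is long-winded and smelly as it stands.
Look here: before matching each group of combinations I fetch the keywords from the database, and while cleaning with that group I write spare4 back on every match.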
public void clean_it( List<Map<String, Object>> cleanList,String table,String colunmName,String tableName){
String str=null;
List<Map<String, Object>> cleanWords = shujuzuJdbcTemplate.queryForList("select "+colunmName+" from "+tableName);
for(Map<String, Object> clean:cleanList){
if( null!=clean.get("spare4")){
str=clean.get("spare4").toString();
if(!str.isEmpty()){
for(Map<String, Object> cleanwords:cleanWords){
if (str.contains(cleanwords.get(colunmName).toString())){
str=str.replaceAll(cleanwords.get(colunmName).toString(),"");
shujuzuJdbcTemplate.update("update "+table+" set spare4=? where id=?",str,clean.get("id").toString());
}
}
}
}
}
}
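Each row goes through seven or nine keyword groups, each batch is 1000 rows, and one million rows means 1000 batches. That adds up to far too many I/O operations.
So this one snippet already reveals two things to improve.
- First, fetching the keywords on every cleaning pass is unnecessary. Load them once when the program starts, assign them to String arrays, and match against the arrays from then on.
- Second, running an update on spare4 for every keyword cleaned is unnecessary. Read the spare4 value into a String variable, clean the variable, and run one update when the cleaning is finished. Taking this one step further: read the title value into the variable directly, clean it, and write the result to spare4. That way there is no need to copy title into spare4 before cleaning starts.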
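The second point can be sketched as follows: apply every keyword group to a local String and hand back the final value, so that the caller issues a single update per row. The `TitleCleaner` class and the tiny keyword lists are made up for illustration; the real keyword groups are far larger.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: clean in memory, write back once.
final class TitleCleaner {
    // Apply each keyword group in order; String.replace treats keywords literally.
    static String clean(String title, List<List<String>> keywordGroups) {
        if (title == null) {
            return "";
        }
        String cleaned = title;
        for (List<String> group : keywordGroups) {
            for (String keyword : group) {
                cleaned = cleaned.replace(keyword, "");
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        List<List<String>> groups = Arrays.asList(
                Arrays.asList("公開招標公告"),
                Arrays.asList("招標公告", "採購公告"));
        // One in-memory pass; the caller would then run a single
        // "update ... set spare4=? where id=?" with the returned value.
        System.out.println(clean("某某大樓公開招標公告", groups));
    }
}
```

With this shape the number of writes per row drops to one, no matter how many keyword groups match.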
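Also, as mentioned, I had put all the generated combinations in the database, spread across several tables. If data in another database needs the same cleaning, would I have to copy my tables over to that database as well? That is far too much hassle.
Look how many tables there are. What a hassle.
At this point I started to wonder: why did I create so many tables in the first place?
Oh, right. The rule-based cleaning runs in steps, so at each step the program just reads the corresponding table.
But if there is an order, why not store the keywords in that order in a single table? Why so many tables?
There are two classes of data with two rule sets, so two tables would do.
If even two tables seem too many, add one column holding 0 or 1 to tell the two rule sets' keywords apart, and a single table is enough. The rows are retrieved in id order anyway, so the order is preserved. Why not?
Hence one more improvement:
- Store the keywords in a single table with an added type column that tells the two classes of keywords apart. The program then fetches the data only once and makes only one cleaning pass, with no need to match against multiple arrays.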
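As a rough sketch of the single-table idea, the rows can be loaded once and split by the type column while keeping their id order. The `KeywordRow` class and the sample rows are hypothetical; in practice the rows would come from one query ordered by id.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a single keyword table with a type column.
final class KeywordTable {
    // One row of the table: id gives the cleaning order, type is 0 or 1.
    static final class KeywordRow {
        final int id;
        final int type;
        final String word;

        KeywordRow(int id, int type, String word) {
            this.id = id;
            this.type = type;
            this.word = word;
        }
    }

    // Split rows (already sorted by id) into one ordered keyword list per type.
    static Map<Integer, List<String>> byType(List<KeywordRow> rowsOrderedById) {
        Map<Integer, List<String>> result = new LinkedHashMap<>();
        for (KeywordRow row : rowsOrderedById) {
            result.computeIfAbsent(row.type, t -> new ArrayList<>()).add(row.word);
        }
        return result;
    }
}
```

Each progid then picks its list by type (0 or 1) and walks it from first to last.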
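Then another question comes up: what if I simply don't want to store the keyword combinations in the database?
That works too. Keep them in memory instead; reading from memory is surely faster than going to disk.
So just generate the keyword combinations before the cleaning starts.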
private String[] GONG_YU_GAO_A = {"定點", "協議", "單一來源", "單一", "資格", "資格預審", "競爭", "競爭性", "公開"};
private String[] GONG_YU_GAO_B = {"招標", "採購", "磋商", "入圍", "談判", "議價", "詢價", "比價", "詢比價", "比選", "項目"};
private String[] GONG_YU_GAO_D = {"公告", "公示", "信息"};
private String[] GONG_YU_GAO_E = {"的"};
private String[] BIAN_RESULT_A = {"定點", "協議", "單一來源", "單一", "資格", "資格預審", "競爭", "競爭性", "公開"};
private String[] BIAN_RESULT_B = {"招標", "採購", "磋商", "入圍", "談判", "議價", "詢價", "比價", "詢比價", "比選", "項目"};
private String[] BIAN_RESULT_C = {"合同", "成交", "結果", "中標", "變更", "候選", "候選人", "成交人", "中標人"};
private String[] BIAN_RESULT_D = {"公告", "公示", "結果"};
private String[] BIAN_RESULT_E = {"的"};
private String[] ALL_X = {":", ":", "\\[", "\\]", "\\【", "\\】", "_", "-", "."};
private List<String> GONG_YU_GAO_bdx = make(GONG_YU_GAO_B, GONG_YU_GAO_D, ALL_X);
private List<String> GONG_YU_GAO_xbd = make(ALL_X, GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_abd = make(GONG_YU_GAO_A, GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_abbd = make(GONG_YU_GAO_A, GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bbbd = make(GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bbbed = make(GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_E, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bbd = make(GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bbed = make(GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_E, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bd = make(GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bed = make(GONG_YU_GAO_B, GONG_YU_GAO_E, GONG_YU_GAO_D);
private List<String> BIAN_RESULT_cdx = make(BIAN_RESULT_C, BIAN_RESULT_D, ALL_X);
private List<String> BIAN_RESULT_xcd = make(ALL_X, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_abcd = make(BIAN_RESULT_A, BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_abbcd = make(BIAN_RESULT_A, BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bbbcd = make(BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bbbed = make(BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_E, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bbcd = make(BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bbced = make(BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_E, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bcd = make(BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bced = make(BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_E, BIAN_RESULT_D);
private List<String> BIAN_RESULT_ccd = make(BIAN_RESULT_C, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_cced = make(BIAN_RESULT_C, BIAN_RESULT_C, BIAN_RESULT_E, BIAN_RESULT_D);
private List<String> BIAN_RESULT_cc = make(BIAN_RESULT_C, BIAN_RESULT_C);
private List<String> BIAN_RESULT_cd = make(BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_cec = make(BIAN_RESULT_C, BIAN_RESULT_E, BIAN_RESULT_C);
The make function is as follows (its overloads appear in the complete code below):
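The overloads in the complete listing differ only in how many arrays they take. As an alternative sketch (not the author's code), a single varargs method can build the same combinations for any number of groups; the class name `Combos` is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical generalisation of the make overloads.
final class Combos {
    // Concatenate one element from each group, in group order (Cartesian product).
    static List<String> make(String[]... groups) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (String[] group : groups) {
            List<String> next = new ArrayList<>(result.size() * group.length);
            for (String prefix : result) {
                for (String word : group) {
                    next.add(prefix + word);
                }
            }
            result = next;
        }
        return result;
    }
}
```

The combinations come out in the same order as the nested loops would produce them. Called with no groups at all, this version returns a list holding one empty string.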
This avoids the original problem of the project failing to start.
The complete improved title-deduplication code is as follows:
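import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.jdbc.core.JdbcTemplate;

public class CleanTitleService {
// Assumption: the original logger declaration is not shown, so an SLF4J logger is declared here for log.info.
private static final Logger log = LoggerFactory.getLogger(CleanTitleService.class);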
private String[] GONG_YU_GAO_A = {"定點", "協議", "單一來源", "單一", "資格", "資格預審", "競爭", "競爭性", "公開"};
private String[] GONG_YU_GAO_B = {"招標", "採購", "磋商", "入圍", "談判", "議價", "詢價", "比價", "詢比價", "比選", "項目"};
private String[] GONG_YU_GAO_D = {"公告", "公示", "信息"};
private String[] GONG_YU_GAO_E = {"的"};
private String[] BIAN_RESULT_A = {"定點", "協議", "單一來源", "單一", "資格", "資格預審", "競爭", "競爭性", "公開"};
private String[] BIAN_RESULT_B = {"招標", "採購", "磋商", "入圍", "談判", "議價", "詢價", "比價", "詢比價", "比選", "項目"};
private String[] BIAN_RESULT_C = {"合同", "成交", "結果", "中標", "變更", "候選", "候選人", "成交人", "中標人"};
private String[] BIAN_RESULT_D = {"公告", "公示", "結果"};
private String[] BIAN_RESULT_E = {"的"};
private String[] ALL_X = {":", ":", "\\[", "\\]", "\\【", "\\】", "_", "-", "."};
private List<String> GONG_YU_GAO_bdx = make(GONG_YU_GAO_B, GONG_YU_GAO_D, ALL_X);
private List<String> GONG_YU_GAO_xbd = make(ALL_X, GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_abd = make(GONG_YU_GAO_A, GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_abbd = make(GONG_YU_GAO_A, GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bbbd = make(GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bbbed = make(GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_E, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bbd = make(GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bbed = make(GONG_YU_GAO_B, GONG_YU_GAO_B, GONG_YU_GAO_E, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bd = make(GONG_YU_GAO_B, GONG_YU_GAO_D);
private List<String> GONG_YU_GAO_bed = make(GONG_YU_GAO_B, GONG_YU_GAO_E, GONG_YU_GAO_D);
private List<String> BIAN_RESULT_cdx = make(BIAN_RESULT_C, BIAN_RESULT_D, ALL_X);
private List<String> BIAN_RESULT_xcd = make(ALL_X, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_abcd = make(BIAN_RESULT_A, BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_abbcd = make(BIAN_RESULT_A, BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bbbcd = make(BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bbbed = make(BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_E, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bbcd = make(BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bbced = make(BIAN_RESULT_B, BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_E, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bcd = make(BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_bced = make(BIAN_RESULT_B, BIAN_RESULT_C, BIAN_RESULT_E, BIAN_RESULT_D);
private List<String> BIAN_RESULT_ccd = make(BIAN_RESULT_C, BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_cced = make(BIAN_RESULT_C, BIAN_RESULT_C, BIAN_RESULT_E, BIAN_RESULT_D);
private List<String> BIAN_RESULT_cc = make(BIAN_RESULT_C, BIAN_RESULT_C);
private List<String> BIAN_RESULT_cd = make(BIAN_RESULT_C, BIAN_RESULT_D);
private List<String> BIAN_RESULT_cec = make(BIAN_RESULT_C, BIAN_RESULT_E, BIAN_RESULT_C);
@Autowired
@Qualifier("shujuzuJdbcTemplate")
private JdbcTemplate shujuzuJdbcTemplate;
// @Scheduled(fixedDelay = 60*60*60*1000)
public void goToCheck() {
log.info("cleaning started");
int limit = 1000;
int num=0;
String taskId = "201907081530";
String table=" big_customer_data ";
while (true) {
List<Map<String, Object>> maps = shujuzuJdbcTemplate.queryForList("select id,contentId,title,progid from "+table+" where taskId = ? AND spare4 is null order by id limit ?", taskId, limit);
for (Map<String, Object> map : maps) {
// A null title is cleaned to "", so the row stops matching "spare4 is null" and is not fetched again.
String title = map.get("title") == null ? null : map.get("title").toString();
shujuzuJdbcTemplate.update("update "+table+" set spare4 = ? where id = ?", gongTitle(title, map.get("progid").toString()), map.get("id"));
}
num+=maps.size();
log.info("processed up to: "+num);
if (maps.size() < limit) {
log.info("cleaning finished");
break;
}
}
}
private String gongTitle(String orgTitle, String progid) {
if (orgTitle == null || orgTitle.equals("")) {
return "";
}
Integer sum = 0;
if (progid.equals("0") || progid.equals("1")) {
sum = 7;
} else if (progid.equals("2") || progid.equals("3")) {
sum = 9;
}
boolean have = false;
for (Integer i = 1; i <= sum; i++) {
boolean x=(sum==7&&i==7)||(sum==9&&i==9);
if(x){
orgTitle=deleteX(orgTitle);
}
else{
for (String s : getKeys(i, progid)) {
if (orgTitle.contains(s)) {
// replace() treats s literally, matching the literal contains() check above.
orgTitle = orgTitle.replace(s, "");
have = true;
break;
}
}
}
if (have) {
break;
}
}
return orgTitle;
}
private List<String> getKeys(Integer i, String progid) {
List<String> keys = new ArrayList<>();
if (progid.equals("0") || progid.equals("1")) {
switch (i) {
case 1:
keys.addAll(GONG_YU_GAO_bdx);
keys.addAll(GONG_YU_GAO_xbd);
break;
case 2:
keys.addAll(GONG_YU_GAO_abd);
break;
case 3:
keys.addAll(GONG_YU_GAO_abbd);
break;
case 4:
keys.addAll(GONG_YU_GAO_bbbd);
keys.addAll(GONG_YU_GAO_bbbed);
break;
case 5:
keys.addAll(GONG_YU_GAO_bbd);
keys.addAll(GONG_YU_GAO_bbed);
break;
case 6:
keys.addAll(GONG_YU_GAO_bd);
keys.addAll(GONG_YU_GAO_bed);
break;
case 7:
Collections.addAll(keys, ALL_X);
break;
default:
break;
}
} else if (progid.equals("2") || progid.equals("3")) {
switch (i) {
case 1:
keys.addAll(BIAN_RESULT_cdx);
keys.addAll(BIAN_RESULT_xcd);
break;
case 2:
keys.addAll(BIAN_RESULT_abcd);
break;
case 3:
keys.addAll(BIAN_RESULT_abbcd);
break;
case 4:
keys.addAll(BIAN_RESULT_bbbcd);
keys.addAll(BIAN_RESULT_bbbed);
break;
case 5:
keys.addAll(BIAN_RESULT_bbcd);
keys.addAll(BIAN_RESULT_bbced);
break;
case 6:
keys.addAll(BIAN_RESULT_bcd);
keys.addAll(BIAN_RESULT_bced);
break;
case 7:
keys.addAll(BIAN_RESULT_ccd);
keys.addAll(BIAN_RESULT_cced);
break;
case 8:
keys.addAll(BIAN_RESULT_cc);
keys.addAll(BIAN_RESULT_cd);
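keys.addAll(BIAN_RESULT_cec);
// Step 8 of the rules also lists "ced"; no BIAN_RESULT_ced field is declared, so the combinations are built here.
keys.addAll(make(BIAN_RESULT_C, BIAN_RESULT_E, BIAN_RESULT_D));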
break;
case 9:
Collections.addAll(keys, ALL_X);
break;
default:
break;
}
}
return keys;
}
private String deleteX(String str){
str=str.replaceAll("\\[.*?]","");
str=str.replaceAll("\\【.*?】","");
str=str.replaceAll(":","");
str=str.replaceAll(":","");
str=str.replaceAll("\\[|\\]","");
str=str.replaceAll("\\【|\\】","");
str=str.replaceAll("_","");
str=str.replaceAll("-","");
str=str.replaceAll("\\.","");
return str;
}
private List<String> make(String[] stringOne, String[] stringTwo) {
List<String> stringSet = new ArrayList<>();
for (String s1 : stringOne) {
for (String s2 : stringTwo) {
stringSet.add(s1 + s2);
}
}
return stringSet;
}
private List<String> make(String[] stringOne, String[] stringTwo, String[] stringThree) {
List<String> stringSet = new ArrayList<>();
for (String s1 : stringOne) {
for (String s2 : stringTwo) {
for (String s3 : stringThree) {
stringSet.add(s1 + s2 + s3);
}
}
}
return stringSet;
}
private List<String> make(String[] stringOne, String[] stringTwo, String[] stringThree, String[] stringFor) {
List<String> stringSet = new ArrayList<>();
for (String s1 : stringOne) {
for (String s2 : stringTwo) {
for (String s3 : stringThree) {
for (String s4 : stringFor) {
stringSet.add(s1 + s2 + s3 + s4);
}
}
}
}
return stringSet;
}
private List<String> make(String[] stringOne, String[] stringTwo, String[] stringThree, String[] stringFor, String[] stringFive) {
List<String> stringSet = new ArrayList<>();
for (String s1 : stringOne) {
for (String s2 : stringTwo) {
for (String s3 : stringThree) {
for (String s4 : stringFor) {
for (String s5 : stringFive) {
stringSet.add(s1 + s2 + s3 + s4 + s5);
}
}
}
}
}
return stringSet;
}
}
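Indeed, this code reads more comfortably than the first version, and it is no longer tied to database tables.
That said, the other option mentioned above, improving the original code and keeping a single table in the database, would also make the code concise and easy to read.
I tested this code against the same batch of data as the first version: locally, 20,000 rows take 9 minutes; on the server, they finish in one minute.
Side by side in a table: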
(20,000 rows) | Local | Server |
First version | 15 min | 3 min |
Second version | 9 min | 1 min |
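In other words, the updated code needs only 40 minutes on the server for 800,000 rows, down from two hours, which cuts the time by two-thirds. That is a considerable gain.
However, the second version currently runs a single thread. With four threads, since the project data is split roughly evenly between announcements and results, the time should be halved; that is, 800,000 rows could be cleaned in about twenty minutes, which is quite comfortable.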
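These estimates are simple proportions and can be checked in a few lines. The figures are the measurements quoted in this post; the `Estimate` class is made up for illustration, and the halving for threads is the post's assumption rather than a measurement.

```java
// Hypothetical helper for the back-of-the-envelope estimates.
final class Estimate {
    // Minutes needed for totalRows, given the measured minutes per 20,000 rows.
    static double minutes(long totalRows, double minutesPer20k) {
        return totalRows / 20000.0 * minutesPer20k;
    }

    public static void main(String[] args) {
        System.out.println(minutes(800_000, 15));    // first version, local
        System.out.println(minutes(800_000, 3));     // first version, server
        System.out.println(minutes(800_000, 1));     // second version, server
        System.out.println(minutes(800_000, 1) / 2); // assuming the two rule classes run in parallel
    }
}
```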
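That's it for today's optimization. To sum up:
- First, reading from memory is certainly faster than going to disk, so decide from your own situation whether extra database tables are really needed.
- Second, minimize the number of I/O operations. The first version ran an update on spare4 after every keyword match; avoid that, and defer the update to the end whenever possible.
- Third, run the program on a server whenever you can. A local machine's bandwidth is no match for the server's, and only a fool has a server and doesn't use it.
- Finally, think things through. There is always more than one way to solve a problem, and each has its own merits.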