Elasticsearch: Modifying the IK Analyzer Source to Hot-Reload Dictionaries from MySQL

Common ways to configure a hot-word dictionary

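1. Use IK's built-in dictionary
Pros: easy to deploy; no other dictionary location has to be specified
Cons: segmentation is fixed; you cannot specify the terms you want recognized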

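2. IK external static dictionary
Pros: relatively easy to deploy; you get the terms you want by editing the specified dictionary file
Cons: you must point IK at an external static file, hand-edit the whole dictionary file every time, put it in the specified directory, and restart ES before the change takes effect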

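3. IK remote dictionary
Pros: the dictionary is served from a static file server that IK is pointed at
Cons: you still have to hand-edit the whole dictionary file to add terms. The IK source uses the Last-Modified and ETag response headers to decide whether to reload, and this sometimes fails to take effect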
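
The header-based check in option 3 can be pictured with a small sketch. This is not IK's actual Monitor code; the class and method names are hypothetical, and it only models the comparison step, assuming the header values have already been read from the HTTP response.

```java
import java.util.Objects;

// Hypothetical sketch of a Last-Modified / ETag change check (not IK's real code).
public class RemoteDictChangeCheck {
    // Header values remembered from the previous poll (null before the first one).
    private String lastModified;
    private String eTag;

    /**
     * Returns true when either header differs from the remembered value,
     * and remembers the new values for the next poll.
     */
    public boolean shouldReload(String newLastModified, String newETag) {
        boolean changed = !Objects.equals(newLastModified, lastModified)
                || !Objects.equals(newETag, eTag);
        if (changed) {
            lastModified = newLastModified;
            eTag = newETag;
        }
        return changed;
    }

    public static void main(String[] args) {
        RemoteDictChangeCheck check = new RemoteDictChangeCheck();
        String date = "Mon, 01 Jan 2018 00:00:00 GMT";
        System.out.println(check.shouldReload(date, "\"v1\"")); // true: first poll
        System.out.println(check.shouldReload(date, "\"v1\"")); // false: nothing changed
        System.out.println(check.shouldReload(date, "\"v2\"")); // true: ETag changed
    }
}
```

With a check of this shape, a file server that omits both headers, or does not update them when the file changes, never triggers a reload. That is one plausible reason the remote dictionary "sometimes does not take effect".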

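Weighing the pros and cons above, I decided to use MySQL as the external hot-word dictionary and refresh the hot words and stop words on a schedule.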

Preparation

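1. Download the IK analyzer release that matches your Elasticsearch version: https://github.com/medcl/elasticsearch-analysis-ik
My local ES is 5.5.0, so I downloaded the IK build for 5.5.0.
2. Look at the files in its config folder.
3. Examine the IKAnalyzer.cfg.xml configuration file: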

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- Configure your own extension dictionaries here -->
	<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
	<!-- Configure your own extension stop-word dictionaries here -->
	<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
	<!-- Configure remote extension dictionaries here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Configure remote extension stop-word dictionaries here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

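ext_dict: location of the extension hot-word dictionaries; separate multiple files with semicolons
ext_stopwords: location of the extension stop-word dictionaries; separate multiple files with semicolons
remote_ext_dict: location of the remote extension hot-word dictionary, e.g. https://xxx.xxx.xxx.xxx/ext_hot.txt
remote_ext_stopwords: location of the remote extension stop-word dictionary, e.g. https://xxx.xxx.xxx.xxx/ext_stop.txt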
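
To make the semicolon-separated format concrete, here is a minimal sketch of how such a value could be split into individual file paths. The class and method names are hypothetical and this is not copied from the IK source; it only illustrates the format described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits an ext_dict / ext_stopwords value into paths.
public class ExtDictPaths {
    /** Splits a semicolon-separated config value into trimmed, non-empty paths. */
    public static List<String> parse(String value) {
        List<String> paths = new ArrayList<>();
        if (value == null) {
            return paths;
        }
        for (String part : value.split(";")) {
            String path = part.trim();
            if (!path.isEmpty()) {
                paths.add(path);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        // prints [custom/mydict.dic, custom/single_word_low_freq.dic]
        System.out.println(parse("custom/mydict.dic;custom/single_word_low_freq.dic"));
    }
}
```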

4. Besides IKAnalyzer.cfg.xml, let's look at what the other files in the config folder are for.
They are loaded by the singleton method public static synchronized Dictionary initial(Configuration cfg) in Dictionary:
 

private DictSegment _MainDict;

private DictSegment _SurnameDict;

private DictSegment _QuantifierDict;

private DictSegment _SuffixDict;

private DictSegment _PrepDict;

private DictSegment _StopWords;
...
public static synchronized Dictionary initial(Configuration cfg) {
	if (singleton == null) {
		synchronized (Dictionary.class) {
			if (singleton == null) {
				singleton = new Dictionary(cfg);
				singleton.loadMainDict();
				singleton.loadSurnameDict();
				singleton.loadQuantifierDict();
				singleton.loadSuffixDict();
				singleton.loadPrepDict();
				singleton.loadStopWordDict();
				if(cfg.isEnableRemoteDict()){
					// start the monitor threads
					for (String location : singleton.getRemoteExtDictionarys()) {
						// 10 is the initial delay (adjustable), 60 is the interval; both in seconds
						pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
					}
					for (String location : singleton.getRemoteExtStopWordDictionarys()) {
						pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
					}
				}
				
				return singleton;
			}
		}
	}
	return singleton;
}

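The load* methods called in initial use the other text files in config to initialize the member fields of Dictionary declared above:
_MainDict: the main dictionary, which is also where hot words are stored
_SurnameDict: surname dictionary
_QuantifierDict: quantifier dictionary, for example the 個 in 1個 or the 兩 in 兩種
_SuffixDict: suffix dictionary
_PrepDict: adverb/preposition dictionary
_StopWords: stop-word dictionary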
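
Each of these fields is a DictSegment, which the later code fills one word at a time with fillSegment(char[]). As a rough mental model only, a dictionary of this kind can be thought of as a character trie. The sketch below is a simplified stand-in with hypothetical names, not IK's DictSegment implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified character trie; a stand-in for the idea behind DictSegment, not its real code.
public class TrieSketch {
    private final Map<Character, TrieSketch> children = new HashMap<>();
    private boolean wordEnd; // true if a word ends at this node

    /** Adds a word to the trie, creating child nodes as needed. */
    public void fill(char[] word) {
        TrieSketch node = this;
        for (char c : word) {
            node = node.children.computeIfAbsent(c, k -> new TrieSketch());
        }
        node.wordEnd = true;
    }

    /** Returns true only if exactly this word was added earlier. */
    public boolean contains(String word) {
        TrieSketch node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.wordEnd;
    }

    public static void main(String[] args) {
        TrieSketch dict = new TrieSketch();
        dict.fill("大耳朵兔子".toCharArray());
        System.out.println(dict.contains("大耳朵兔子")); // true
        System.out.println(dict.contains("大耳朵"));     // false: only a prefix
    }
}
```

Adding a word to a structure like this is cheap and does not disturb existing entries, which is why a reload can simply call the fill method again for every row. Note that re-filling only adds words; it does not remove words that have since been deleted from the source.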

Modifying the Dictionary source

Dictionary class: load hot words with this.loadMySQLExtDict()

private void loadMySQLExtDict() {
	Connection conn = null;
	Statement stmt = null;
	ResultSet rs = null;
	try {
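		// prop is assumed to be a java.util.Properties field declared on Dictionary (not shown here)
		Path file = PathUtils.get(getDictRoot(), "jdbc-loadext.properties");
		try (FileInputStream in = new FileInputStream(file.toFile())) {
			prop.load(in);
		}

		logger.info("jdbc-loadext.properties");
		for(Object key : prop.keySet()) {
			if ("jdbc.password".equals(key)) {
				continue; // keep the password out of the logs
			}
			logger.info(key + "=" + prop.getProperty(String.valueOf(key)));
		}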

		logger.info("query hot dict from mysql, " + prop.getProperty("jdbc.reload.sql") + "......");

		conn = DriverManager.getConnection(
				prop.getProperty("jdbc.url"),
				prop.getProperty("jdbc.user"),
				prop.getProperty("jdbc.password"));
		stmt = conn.createStatement();
		rs = stmt.executeQuery(prop.getProperty("jdbc.reload.sql"));

		while(rs.next()) {
			String theWord = rs.getString("word");
			if (theWord == null || theWord.trim().isEmpty()) {
				continue; // skip NULL or blank rows
			}
			logger.info("hot word from mysql: " + theWord);
			_MainDict.fillSegment(theWord.trim().toCharArray());
		}

	} catch (Exception e) {
		logger.error("error loading hot words from mysql", e);
	} finally {
		if(rs != null) {
			try {
				rs.close();
			} catch (SQLException e) {
				logger.error("error", e);
			}
		}
		if(stmt != null) {
			try {
				stmt.close();
			} catch (SQLException e) {
				logger.error("error", e);
			}
		}
		if(conn != null) {
			try {
				conn.close();
			} catch (SQLException e) {
				logger.error("error", e);
			}
		}
	}
}

Dictionary class: load stop words with this.loadMySQLStopwordDict()

private void loadMySQLStopwordDict() {
	Connection conn = null;
	Statement stmt = null;
	ResultSet rs = null;

	try {
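		// prop is assumed to be a java.util.Properties field declared on Dictionary (not shown here)
		Path file = PathUtils.get(getDictRoot(), "jdbc-loadext.properties");
		try (FileInputStream in = new FileInputStream(file.toFile())) {
			prop.load(in);
		}

		logger.info("jdbc-loadext.properties");
		for(Object key : prop.keySet()) {
			if ("jdbc.password".equals(key)) {
				continue; // keep the password out of the logs
			}
			logger.info(key + "=" + prop.getProperty(String.valueOf(key)));
		}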

		logger.info("query hot stopword dict from mysql, " + prop.getProperty("jdbc.reload.stopword.sql") + "......");

		conn = DriverManager.getConnection(
				prop.getProperty("jdbc.url"),
				prop.getProperty("jdbc.user"),
				prop.getProperty("jdbc.password"));
		stmt = conn.createStatement();
		rs = stmt.executeQuery(prop.getProperty("jdbc.reload.stopword.sql"));

		while(rs.next()) {
			String theWord = rs.getString("word");
			if (theWord == null || theWord.trim().isEmpty()) {
				continue; // skip NULL or blank rows
			}
			logger.info("hot stopword from mysql: " + theWord);
			_StopWords.fillSegment(theWord.trim().toCharArray());
		}

	} catch (Exception e) {
		logger.error("error loading stopwords from mysql", e);
	} finally {
		if(rs != null) {
			try {
				rs.close();
			} catch (SQLException e) {
				logger.error("error", e);
			}
		}
		if(stmt != null) {
			try {
				stmt.close();
			} catch (SQLException e) {
				logger.error("error", e);
			}
		}
		if(conn != null) {
			try {
				conn.close();
			} catch (SQLException e) {
				logger.error("error", e);
			}
		}
	}
}

Method exposed to callers:

public void reLoadSQLDict() {
	this.loadMySQLExtDict();
	this.loadMySQLStopwordDict();
}

MySQLDictReloadThread, a Runnable implementation that calls reLoadSQLDict() to load the hot words:

import org.apache.logging.log4j.Logger;
import org.elasticsearch.common.logging.ESLoggerFactory;


/**
 * Created with IntelliJ IDEA.
 *
 * @author: zhubo
 * @description: runs on a schedule
 * @time: 2018-07-22 13:05:24
 * @modifytime:
 */
public class MySQLDictReloadThread implements Runnable {

    private static final Logger logger = ESLoggerFactory.getLogger(MySQLDictReloadThread.class.getName());

    @Override
    public void run() {
        logger.info("reloading hot_word and stop_word dict from mysql");
        Dictionary.getSingleton().reLoadSQLDict();
    }
}

Finally, the code that calls it on a schedule:

Some of the details are not covered here.
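
As a minimal sketch of what scheduling the reload could look like, the example below mirrors the pool.scheduleAtFixedRate(...) pattern that initial() already uses for the remote Monitor tasks. It is an assumption, not the author's exact code: the class name and helper method are hypothetical, and a plain Runnable stands in for new MySQLDictReloadThread() so the example runs outside the plugin.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical, standalone sketch of fixed-rate scheduling for a reload task.
public class ReloadScheduleSketch {
    /**
     * Runs the task at a fixed rate until it has run `runs` times
     * (or 5 seconds pass), then stops the pool.
     * Returns true if the task ran the requested number of times.
     */
    public static boolean runRepeatedly(Runnable task, int runs, long periodMillis) {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
        CountDownLatch latch = new CountDownLatch(runs);
        try {
            pool.scheduleAtFixedRate(() -> {
                task.run();
                latch.countDown();
            }, 0, periodMillis, TimeUnit.MILLISECONDS);
            return latch.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for new MySQLDictReloadThread(); the real task would call reLoadSQLDict().
        Runnable reload = () -> System.out.println("reloading dictionaries");
        System.out.println(runRepeatedly(reload, 3, 50));
    }
}
```

Inside the plugin, the equivalent would presumably be a single call such as pool.scheduleAtFixedRate(new MySQLDictReloadThread(), 10, 60, TimeUnit.SECONDS) placed in initial() after the load* calls. One thing to keep in mind: if a task scheduled with scheduleAtFixedRate throws, later runs are suppressed, so it matters that the load methods catch and log their own exceptions.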

jdbc-loadext.properties

jdbc.url=jdbc:mysql://xxx.xxx.xxx.xxx:3306/stop_word?useUnicode=true&characterEncoding=UTF-8&characterSetResults=UTF-8
jdbc.user=xxxxxx
jdbc.password=xxxxxxx
jdbc.reload.sql=select word from hot_words
jdbc.reload.stopword.sql=select stopword as word from hot_stopwords

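Place the file in the directory returned by getDictRoot(), which is where the load methods above look for it.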

Packaging

Because we connect to a MySQL database, the Maven project has to pull in the MySQL driver:

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>6.0.6</version>
</dependency>

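That alone is not enough: you also need to modify plugin.xml. (I fell into this trap: for a long time, no matter how I changed the pom, the newly added dependency never made it into the package.)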

Everything is ready, so run the build: mvn clean package

Build finished. Upload the plugin, restart, and try it out. ^_^

Results

Insert the hot words into the database, then call the _analyze API:

GET http://172.16.11.119:9200/g_index/_analyze?text=真是山炮&analyzer=ik_smart
{
    "tokens": [
        {
            "token": "真是山炮",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 0
        }
    ]
}
GET http://172.16.11.119:9200/g_index/_analyze?text=大耳朵兔子&analyzer=ik_smart
{
    "tokens": [
        {
            "token": "大耳朵兔子",
            "start_offset": 0,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 0
        }
    ]
}
GET http://172.16.11.119:9200/g_index/_analyze?text=大耳朵兔子你真是山炮&analyzer=ik_smart
{
    "tokens": [
        {
            "token": "大耳朵兔子",
            "start_offset": 0,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "你",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "真是山炮",
            "start_offset": 6,
            "end_offset": 10,
            "type": "CN_WORD",
            "position": 2
        }
    ]
}
GET http://172.16.11.119:9200/g_index/_analyze?text=大耳朵兔子你真是山炮&analyzer=ik_max_word
{
    "tokens": [
        {
            "token": "大耳朵兔子",
            "start_offset": 0,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "耳朵",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "耳",
            "start_offset": 1,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "朵",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "兔子",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "兔",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "子",
            "start_offset": 4,
            "end_offset": 5,
            "type": "CN_CHAR",
            "position": 6
        },
        {
            "token": "你",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_CHAR",
            "position": 7
        },
        {
            "token": "真是山炮",
            "start_offset": 6,
            "end_offset": 10,
            "type": "CN_WORD",
            "position": 8
        },
        {
            "token": "真是",
            "start_offset": 6,
            "end_offset": 8,
            "type": "CN_WORD",
            "position": 9
        },
        {
            "token": "山炮",
            "start_offset": 8,
            "end_offset": 10,
            "type": "CN_WORD",
            "position": 10
        },
        {
            "token": "炮",
            "start_offset": 9,
            "end_offset": 10,
            "type": "CN_WORD",
            "position": 11
        }
    ]
}

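(⊙o⊙)… I have no idea why I came up with this example. Never mind, let's go with it... 山炮の

I'm a bit slow and hit a few pitfalls along the way; it took several attempts to get this working ^_^. If anything is unclear, feel free to discuss.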

 

 
