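Common ways to configure a hot-word dictionary
1. Use the dictionary built into IK
Pros: easy to deploy; no extra dictionary location has to be specified
Cons: segmentation is fixed; you cannot add the terms you want treated as words
2. IK external static dictionary
Pros: relatively easy to deploy; you get the terms you want by editing the configured dictionary file
Cons: you have to point IK at an external static file, edit the whole dictionary file by hand every time, put it in the configured directory, and restart ES before the change takes effect
3. IK remote dictionary
Pros: the dictionary is configured by pointing IK at a static file server
Cons: you still have to edit the whole dictionary file by hand to add terms, and the IK source decides whether to reload by checking the Last-Modified and ETag response headers, which sometimes does not take effect
Given these trade-offs, I decided to use MySQL as an external hot-word dictionary and to refresh the hot words and stop words on a schedule.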
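Preparation
1. Download the IK analyzer release that matches your ElasticSearch version: https://github.com/medcl/elasticsearch-analysis-ik
2. Take a look at the files in its config folder:
My local ES install is version 5.5.0, so I downloaded the IK build for 5.5.0.
3. Walk through the IKAnalyzer.cfg.xml configuration file: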
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>IK Analyzer extension configuration</comment>
<!-- Configure your own extension dictionaries here -->
<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
<!-- Configure your own extension stop-word dictionaries here -->
<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
<!-- Configure a remote extension dictionary here -->
<!-- <entry key="remote_ext_dict">words_location</entry> -->
<!-- Configure a remote extension stop-word dictionary here -->
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
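ext_dict: location of the extension hot-word dictionaries; separate multiple files with semicolons
ext_stopwords: location of the extension stop-word dictionaries; separate multiple files with semicolons
remote_ext_dict: location of the remote extension hot-word dictionary, e.g. https://xxx.xxx.xxx.xxx/ext_hot.txt
remote_ext_stopwords: location of the remote extension stop-word dictionary, e.g. https://xxx.xxx.xxx.xxx/ext_stop.txt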
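Because IKAnalyzer.cfg.xml declares the standard Java properties DTD, it can be read with java.util.Properties.loadFromXML. The sketch below is only an illustration of the semicolon-separated format described above; the class IkConfigSketch and its methods are hypothetical names, not part of IK, and the sample XML is a shortened copy of the file shown earlier.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper (not part of IK): reads a properties-XML config
// and splits the semicolon-separated dictionary paths.
final class IkConfigSketch {

    // Shortened copy of the configuration file shown in this post.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
        + "<properties>\n"
        + "<comment>IK Analyzer extension configuration</comment>\n"
        + "<entry key=\"ext_dict\">custom/mydict.dic;custom/single_word_low_freq.dic</entry>\n"
        + "<entry key=\"ext_stopwords\">custom/ext_stopword.dic</entry>\n"
        + "<!-- <entry key=\"remote_ext_dict\">words_location</entry> -->\n"
        + "</properties>\n";

    // Parses the XML text into a Properties object.
    static Properties parse(String xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns the non-empty paths configured under key, or an empty list
    // when the key is absent (for example, commented out).
    static List<String> splitPaths(Properties props, String key) {
        List<String> paths = new ArrayList<>();
        String raw = props.getProperty(key);
        if (raw == null) {
            return paths;
        }
        for (String part : raw.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                paths.add(trimmed);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE);
        System.out.println(splitPaths(props, "ext_dict"));
        System.out.println(splitPaths(props, "remote_ext_dict"));
    }
}
```

A commented-out entry such as remote_ext_dict simply yields an empty list here, which mirrors the fact that the remote dictionaries are disabled in the default file.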
4. Besides IKAnalyzer.cfg.xml, let's see what the other files in the config folder are for:
The singleton method in Dictionary, public static synchronized Dictionary initial(Configuration cfg):
private DictSegment _MainDict;
private DictSegment _SurnameDict;
private DictSegment _QuantifierDict;
private DictSegment _SuffixDict;
private DictSegment _PrepDict;
private DictSegment _StopWords;
...
public static synchronized Dictionary initial(Configuration cfg) {
    if (singleton == null) {
        synchronized (Dictionary.class) {
            if (singleton == null) {
                singleton = new Dictionary(cfg);
                singleton.loadMainDict();
                singleton.loadSurnameDict();
                singleton.loadQuantifierDict();
                singleton.loadSuffixDict();
                singleton.loadPrepDict();
                singleton.loadStopWordDict();
                if (cfg.isEnableRemoteDict()) {
                    // Start the monitor threads for the remote dictionaries
                    for (String location : singleton.getRemoteExtDictionarys()) {
                        // 10 is the initial delay (adjustable) and 60 is the interval, both in seconds
                        pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
                    }
                    for (String location : singleton.getRemoteExtStopWordDictionarys()) {
                        pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
                    }
                }
                return singleton;
            }
        }
    }
    return singleton;
}
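The load* methods called in initial use the other text files in config to initialize the Dictionary member variables declared above:
_MainDict : the main dictionary object, which is also where hot words are stored
_SurnameDict : surname dictionary
_QuantifierDict : quantifier dictionary, e.g. 個 in 1個 or 兩 in 2兩
_SuffixDict : suffix dictionary
_PrepDict : adverb/preposition dictionary
_StopWords : stop-word dictionary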
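To make it clearer why filling _MainDict with a new term changes segmentation, here is a minimal prefix-tree sketch. It is a simplified stand-in, assuming only that the dictionary is a character tree; TrieSketch, fill and longestMatch are hypothetical names and IK's real DictSegment is more elaborate. ASCII words are used purely to keep the example portable.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a dictionary tree; not IK's DictSegment.
final class TrieSketch {
    private final Map<Character, TrieSketch> children = new HashMap<>();
    private boolean endOfWord;

    // Adds a word to the tree, one character per level.
    void fill(char[] word) {
        TrieSketch node = this;
        for (char c : word) {
            node = node.children.computeIfAbsent(c, k -> new TrieSketch());
        }
        node.endOfWord = true;
    }

    // Returns the length of the longest dictionary word that starts at
    // offset in text, or 0 if no dictionary word starts there.
    int longestMatch(char[] text, int offset) {
        TrieSketch node = this;
        int best = 0;
        for (int i = offset; i < text.length; i++) {
            node = node.children.get(text[i]);
            if (node == null) {
                break;
            }
            if (node.endOfWord) {
                best = i - offset + 1;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TrieSketch mainDict = new TrieSketch();
        mainDict.fill("new".toCharArray());
        char[] text = "newyorkcity".toCharArray();
        System.out.println(mainDict.longestMatch(text, 0)); // 3, only "new" is known

        // Adding a hot word at runtime makes a longer match possible.
        mainDict.fill("newyork".toCharArray());
        System.out.println(mainDict.longestMatch(text, 0)); // 7
    }
}
```

Note that a tree like this only ever grows: words added at runtime stay matchable until the structure is rebuilt.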
Modifying the Dictionary source
Dictionary class: updating the hot-word dictionary, this.loadMySQLExtDict()
private void loadMySQLExtDict() {
    // prop is a java.util.Properties field of Dictionary; its declaration is not shown in this post.
    try {
        // Read the JDBC settings from the dictionary root directory.
        Path file = PathUtils.get(getDictRoot(), "jdbc-loadext.properties");
        try (FileInputStream in = new FileInputStream(file.toFile())) {
            prop.load(in);
        }
        logger.info("jdbc-loadext.properties");
        // Note: this logs every property, including jdbc.password.
        for (Object key : prop.keySet()) {
            logger.info(key + "=" + prop.getProperty(String.valueOf(key)));
        }
        logger.info("query hot dict from mysql, " + prop.getProperty("jdbc.reload.sql") + "......");
        // try-with-resources closes the result set, statement and connection in reverse order.
        try (Connection conn = DriverManager.getConnection(
                     prop.getProperty("jdbc.url"),
                     prop.getProperty("jdbc.user"),
                     prop.getProperty("jdbc.password"));
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(prop.getProperty("jdbc.reload.sql"))) {
            while (rs.next()) {
                String theWord = rs.getString("word");
                if (theWord == null) {
                    continue; // skip NULL rows instead of aborting the whole load
                }
                logger.info("hot word from mysql: " + theWord);
                // Add the word to the main dictionary.
                _MainDict.fillSegment(theWord.trim().toCharArray());
            }
        }
    } catch (Exception e) {
        logger.error("error", e);
    }
}
Dictionary class: updating the stop words, this.loadMySQLStopwordDict()
private void loadMySQLStopwordDict() {
    // prop is a java.util.Properties field of Dictionary; its declaration is not shown in this post.
    try {
        // Read the JDBC settings from the dictionary root directory.
        Path file = PathUtils.get(getDictRoot(), "jdbc-loadext.properties");
        try (FileInputStream in = new FileInputStream(file.toFile())) {
            prop.load(in);
        }
        logger.info("jdbc-loadext.properties");
        // Note: this logs every property, including jdbc.password.
        for (Object key : prop.keySet()) {
            logger.info(key + "=" + prop.getProperty(String.valueOf(key)));
        }
        logger.info("query hot stopword dict from mysql, " + prop.getProperty("jdbc.reload.stopword.sql") + "......");
        // try-with-resources closes the result set, statement and connection in reverse order.
        try (Connection conn = DriverManager.getConnection(
                     prop.getProperty("jdbc.url"),
                     prop.getProperty("jdbc.user"),
                     prop.getProperty("jdbc.password"));
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(prop.getProperty("jdbc.reload.stopword.sql"))) {
            while (rs.next()) {
                String theWord = rs.getString("word");
                if (theWord == null) {
                    continue; // skip NULL rows instead of aborting the whole load
                }
                logger.info("hot stopword from mysql: " + theWord);
                // Add the word to the stop-word dictionary.
                _StopWords.fillSegment(theWord.trim().toCharArray());
            }
        }
    } catch (Exception e) {
        logger.error("error", e);
    }
}
The method exposed to callers:
public void reLoadSQLDict() {
this.loadMySQLExtDict();
this.loadMySQLStopwordDict();
}
MySQLDictReloadThread, a Runnable implementation that calls reLoadSQLDict() to load the hot words:
import org.apache.logging.log4j.Logger;
import org.elasticsearch.common.logging.ESLoggerFactory;
/**
* Created with IntelliJ IDEA.
*
* @author: zhubo
* @description: runs on a schedule
* @time: 2018-07-22 13:05:24
* @modifytime:
*/
public class MySQLDictReloadThread implements Runnable {
private static final Logger logger = ESLoggerFactory.getLogger(MySQLDictReloadThread.class.getName());
@Override
public void run() {
logger.info("reloading hot_word and stop_word dict from mysql");
Dictionary.getSingleton().reLoadSQLDict();
}
}
The last piece of code is the scheduled call.
I won't go into some of the details here.
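The scheduling code itself is not reproduced in this post. As a rough sketch of what a periodic reload can look like, the example below uses a ScheduledExecutorService in the same way the Monitor threads are scheduled in initial(). Everything named here (ReloadSchedulerSketch, schedule, demo, the thread name) is hypothetical, and a counter stands in for MySQLDictReloadThread so that the block runs without ES or MySQL.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a periodic reload; not the author's actual code.
final class ReloadSchedulerSketch {

    // Runs task repeatedly and returns the executor so the caller can stop it.
    static ScheduledExecutorService schedule(Runnable task, long initialDelay,
                                             long period, TimeUnit unit) {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "mysql-dict-reload");
            t.setDaemon(true); // do not keep the JVM alive just for reloads
            return t;
        });
        // scheduleAtFixedRate cancels all later runs if one run throws,
        // so exceptions are caught and reported here instead.
        Runnable safe = () -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                e.printStackTrace();
            }
        };
        pool.scheduleAtFixedRate(safe, initialDelay, period, unit);
        return pool;
    }

    // Returns true if a stand-in task ran at least three times within five seconds.
    static boolean demo() {
        CountDownLatch threeRuns = new CountDownLatch(3);
        ScheduledExecutorService pool =
            schedule(threeRuns::countDown, 0, 50, TimeUnit.MILLISECONDS);
        try {
            return threeRuns.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("ran three times: " + demo());
    }
}
```

In the plugin, the task passed in would presumably be new MySQLDictReloadThread(), with a delay and interval in seconds similar to the 10 and 60 used for the remote dictionaries; those values are an assumption, not something stated in the post.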
jdbc-loadext.properties
jdbc.url=jdbc:mysql://xxx.xxx.xxx.xxx:3306/stop_word?useUnicode=true&characterEncoding=UTF-8&characterSetResults=UTF-8
jdbc.user=xxxxxx
jdbc.password=xxxxxxx
jdbc.reload.sql=select word from hot_words
jdbc.reload.stopword.sql=select stopword as word from hot_stopwords
Put this file in the directory that getDictRoot() returns, which is where the code above loads it from.
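A missing or misspelled key in jdbc-loadext.properties only shows up at reload time, so it can help to check the file up front. The following is a small sketch under that assumption; JdbcConfigCheck and its methods are hypothetical names, the key list is taken from the file shown above, and the host name in the sample is a placeholder.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper (not part of IK): checks that the JDBC settings
// used by the reload code are all present.
final class JdbcConfigCheck {

    // Keys read by loadMySQLExtDict() and loadMySQLStopwordDict() in this post.
    static final String[] REQUIRED = {
        "jdbc.url", "jdbc.user", "jdbc.password",
        "jdbc.reload.sql", "jdbc.reload.stopword.sql"
    };

    // Parses properties-file text into a Properties object.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns the required keys that are absent or blank.
    static List<String> missingKeys(Properties props) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Placeholder host; jdbc.password is deliberately left out.
        String sample =
            "jdbc.url=jdbc:mysql://db.example:3306/stop_word?useUnicode=true&characterEncoding=UTF-8\n"
            + "jdbc.user=someuser\n"
            + "jdbc.reload.sql=select word from hot_words\n"
            + "jdbc.reload.stopword.sql=select stopword as word from hot_stopwords\n";
        Properties props = parse(sample);
        System.out.println("missing: " + missingKeys(props));
        // Only the first '=' separates key and value, so the URL keeps its query string.
        System.out.println(props.getProperty("jdbc.url"));
    }
}
```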
Packaging
Because we connect to a MySQL database, the Maven project has to include the MySQL driver:
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>6.0.6</version>
</dependency>
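That alone is not enough: plugin.xml also has to be modified (I fell into this trap; I kept editing the pom for a long time and the newly added dependency never made it into the package):
Once everything is ready, build the package: mvn clean package
When the build finishes, upload the package, restart, and try it out. ^_^
Results
Insert records into the database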
GET http://172.16.11.119:9200/g_index/_analyze?text=真是山炮&analyzer=ik_smart
{
"tokens": [
{
"token": "真是山炮",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 0
}
]
}
GET http://172.16.11.119:9200/g_index/_analyze?text=大耳朵兔子&analyzer=ik_smart
{
"tokens": [
{
"token": "大耳朵兔子",
"start_offset": 0,
"end_offset": 5,
"type": "CN_WORD",
"position": 0
}
]
}
GET http://172.16.11.119:9200/g_index/_analyze?text=大耳朵兔子你真是山炮&analyzer=ik_smart
{
"tokens": [
{
"token": "大耳朵兔子",
"start_offset": 0,
"end_offset": 5,
"type": "CN_WORD",
"position": 0
},
{
"token": "你",
"start_offset": 5,
"end_offset": 6,
"type": "CN_CHAR",
"position": 1
},
{
"token": "真是山炮",
"start_offset": 6,
"end_offset": 10,
"type": "CN_WORD",
"position": 2
}
]
}
GET http://172.16.11.119:9200/g_index/_analyze?text=大耳朵兔子你真是山炮&analyzer=ik_max_word
{
"tokens": [
{
"token": "大耳朵兔子",
"start_offset": 0,
"end_offset": 5,
"type": "CN_WORD",
"position": 0
},
{
"token": "耳朵",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 1
},
{
"token": "耳",
"start_offset": 1,
"end_offset": 2,
"type": "CN_WORD",
"position": 2
},
{
"token": "朵",
"start_offset": 2,
"end_offset": 3,
"type": "CN_WORD",
"position": 3
},
{
"token": "兔子",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 4
},
{
"token": "兔",
"start_offset": 3,
"end_offset": 4,
"type": "CN_WORD",
"position": 5
},
{
"token": "子",
"start_offset": 4,
"end_offset": 5,
"type": "CN_CHAR",
"position": 6
},
{
"token": "你",
"start_offset": 5,
"end_offset": 6,
"type": "CN_CHAR",
"position": 7
},
{
"token": "真是山炮",
"start_offset": 6,
"end_offset": 10,
"type": "CN_WORD",
"position": 8
},
{
"token": "真是",
"start_offset": 6,
"end_offset": 8,
"type": "CN_WORD",
"position": 9
},
{
"token": "山炮",
"start_offset": 8,
"end_offset": 10,
"type": "CN_WORD",
"position": 10
},
{
"token": "炮",
"start_offset": 9,
"end_offset": 10,
"type": "CN_WORD",
"position": 11
}
]
}
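(⊙o⊙)… I have no idea why I came up with this example, but never mind, let's go with it...
I'm a bit slow and hit a few pitfalls along the way, so it took several tries to get this working ^_^ If anything is unclear, feel free to get in touch.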