Distributed Search Engine Elasticsearch: Kafka Data Synchronization Plugin

Original

2020-02-21 04:03

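A river represents a data source for Elasticsearch (ES), and it is also a way to synchronize data from other storage systems (such as a database) into ES. It is an ES service that runs as a plugin: it reads data from the river and indexes it into ES. Official rivers exist for CouchDB, RabbitMQ, Twitter, and Wikipedia. For an introduction to Kafka, see the earlier article.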

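1. Open-source plugin: elasticsearch-river-kafka

Installing and using the plugin is described in detail on GitHub (https://github.com/endgameinc/elasticsearch-river-kafka). The point to note here is that the plugin strictly defines the format of the data in Kafka: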

{
    "index" : "example_index",
    "type" : "example_type",
    "id" : "asdkljflkasjdfasdfasdf",
    "source" : { ..... }
}

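Here, index is the index name, type is the index type, id is the ID of this record, and source is the data content. Our news data in Kafka, however, has the following format: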

{
    "id" : "asdkljflkasjdfasdfasdf",
    "site_id" : 100,
    "title" : "hello word！",
    "media_type" : 1
}

Because the two formats do not match, the plugin source code has to be modified to meet our needs.
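To make the mismatch concrete, here is a minimal sketch of the mapping the modified plugin effectively has to perform: take a flat news message and produce the index/type/id/source layout. The class NewsEnvelope and the method toEnvelope are hypothetical names used only for illustration; they are not part of elasticsearch-river-kafka. The index and type values are passed in as parameters because the message itself does not carry them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of elasticsearch-river-kafka.
public class NewsEnvelope {

    // Wraps a flat news message into the index/type/id/source layout.
    // The input map is copied, so the caller's message is left unchanged.
    public static Map<String, Object> toEnvelope(Map<String, Object> message,
                                                 String index, String type) {
        Map<String, Object> source = new LinkedHashMap<String, Object>(message);
        Object id = source.remove("id"); // the id moves to the top level

        Map<String, Object> envelope = new LinkedHashMap<String, Object>();
        envelope.put("index", index);
        envelope.put("type", type);
        envelope.put("id", id);
        envelope.put("source", source);
        return envelope;
    }

    public static void main(String[] args) {
        Map<String, Object> message = new LinkedHashMap<String, Object>();
        message.put("id", "asdkljflkasjdfasdfasdf");
        message.put("site_id", 100);
        message.put("title", "hello word!");
        message.put("media_type", 1);

        // "test" and "news" are sample values for the index and type.
        System.out.println(toEnvelope(message, "test", "news"));
    }
}
```

In the plugin itself this mapping is not done in one place. It is spread across the getIndex, getType, and getSource methods of a custom MessageHandler, which is what the following sections implement.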

2. Project setup

git clone https://github.com/endgameinc/elasticsearch-river-kafka.git

Open the project in Eclipse.

3. Custom MessageHandler

Implement the getSource method:

protected Map<String, Object> getSource() {
    // Copy only the fields that belong in the indexed document body.
    Map<String, Object> src = new HashMap<String, Object>();
    try {
        src.put("site_id", messageMap.get("site_id"));
        src.put("title", messageMap.get("title"));
        src.put("media_type", messageMap.get("media_type"));
    } catch (Exception e) {
        logger.warn("Failed to parse source, msg=" + messageMap.toString(), e);
    }
    return src;
}

Implement the getIndex and getType methods:

Our data does not contain index or type, so they have to be loaded from configuration. Add a module to the configuration:

"news":{
    "index": "test",
    "type": "news"
}

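MessageHandlerFactory now needs access to the configuration internally. Modify the MessageHandler constructor and the MessageHandlerFactory interface to add a settings parameter. For example, in MessageHandlerFactory:

public MessageHandler createMessageHandler(Client client, RiverSettings settings) throws Exception;

This way, NewsJsonMessageHandler can obtain the configuration parameters: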

private Map<String, Object> newsSettings;

public NewsJsonMessageHandler(Client client, RiverSettings settings) {
    this.client = client;
    // The "news" module of the river settings holds the target index and type.
    newsSettings = (Map<String, Object>) settings.settings().get("news");
    logger.info("news settings: " + newsSettings.toString());
}

The getIndex and getType methods are, respectively:

protected String getIndex() {
    return (String) newsSettings.get("index");
}

protected String getType() {
    return (String) newsSettings.get("type");
}
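One thing to watch for: if the "news" module is missing from the river settings, the casts above fail with a NullPointerException that does not say which setting was absent. A minimal sketch of a more defensive lookup follows. The class NewsSettingsReader and the method requireString are hypothetical names, and a plain Map stands in for the map returned by settings.settings() so that the example runs without Elasticsearch on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; a plain Map stands in for
// the map returned by settings.settings().
public class NewsSettingsReader {

    // Returns the string stored at module.key, or fails with a message
    // that names the missing module or key.
    @SuppressWarnings("unchecked")
    public static String requireString(Map<String, Object> riverSettings,
                                       String module, String key) {
        Object section = riverSettings.get(module);
        if (!(section instanceof Map)) {
            throw new IllegalArgumentException("missing settings module: " + module);
        }
        Object value = ((Map<String, Object>) section).get(key);
        if (!(value instanceof String)) {
            throw new IllegalArgumentException(
                    "missing string setting: " + module + "." + key);
        }
        return (String) value;
    }

    public static void main(String[] args) {
        Map<String, Object> news = new HashMap<String, Object>();
        news.put("index", "test");
        news.put("type", "news");

        Map<String, Object> riverSettings = new HashMap<String, Object>();
        riverSettings.put("news", news);

        System.out.println(requireString(riverSettings, "news", "index"));
        System.out.println(requireString(riverSettings, "news", "type"));
    }
}
```

Validating once in the constructor, rather than on every getIndex or getType call, would surface a bad river definition when the river starts instead of when the first message arrives.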

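4. Kafka compile-version issue

elasticsearch-river-kafka is compiled with Java 1.7 and needs to be changed to 1.6. In addition, the kafka-0.7.2.jar that mvn pulls in by default is also compiled with Java 1.7, so we need to use our own build compiled with Java 1.6.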
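A quick way to confirm which Java version a class was compiled for is to read the class-file header: after the magic number 0xCAFEBABE come a two-byte minor version and a two-byte major version, where major 50 corresponds to Java 6 and 51 to Java 7. The sketch below is a hypothetical diagnostic (the class name ClassVersionCheck is made up here) and is not part of the plugin or of Kafka. It assumes you have already extracted a .class file from the jar in question.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical diagnostic for illustration only.
public class ClassVersionCheck {

    // Returns the major version stored in the first 8 bytes of a class file.
    public static int majorVersion(byte[] header) {
        if (header.length < 8) {
            throw new IllegalArgumentException("header too short");
        }
        ByteBuffer buf = ByteBuffer.wrap(header); // big-endian by default
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a Java class file");
        }
        buf.getShort();                 // minor version, ignored here
        return buf.getShort() & 0xFFFF; // major version: 50 = Java 6, 51 = Java 7
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ClassVersionCheck path/to/Some.class");
            return;
        }
        byte[] header = new byte[8];
        DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
        try {
            in.readFully(header);
        } finally {
            in.close();
        }
        int major = majorVersion(header);
        System.out.println("major version " + major
                + (major > 50 ? " (needs a JVM newer than Java 6)" : " (loads on Java 6)"));
    }
}
```

Running it against one class from the plugin jar and one from the Kafka jar shows whether either still targets Java 7 after rebuilding.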

5. The final command for adding the synchronization task

curl -XPUT 'local:9200/_river/news_kafka_river_0/_meta' -d '{
    "type" : "kafka",
    "kafka" : {
        "broker_host" : "mota32",
        "message_handler_factory_class" : "com.weidou.elasticsearch.river.NewsJsonMessageHandlerFactory",
        "zookeeper" : "mota32",
        "topic" : "es-test1",
        "partition" : "0",
        "broker_port" : 9092
    },
    "index" : {
        "bulk_size_bytes" : 10000000,
        "bulk_timeout" : "1000ms"
    },
    "statsd":{
        "prefix": "es-kafka-river",
        "host": "mota33",
        "port": "8125"
    },
    "news":{
        "index": "test",
        "type": "news"
    }
}'

6. How to remove and install elasticsearch-river-kafka

./plugin -remove elasticsearch-river-kafka

./plugin -url http://www.xxoo.com/static/elasticsearch-river-kafka-1.0.2-SNAPSHOT.zip -install elasticsearch-river-kafka