Configuring Logstash to Read from Kafka into Elasticsearch

1. Architecture

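This project uses Logstash to read release upgrade data from Kafka into Elasticsearch (hereafter es). An upgrade record may only update es when its updateTime for a given taskId is the most recent one. The data in Kafka is serialized with protobuf, so Logstash also has to deserialize it and convert it to JSON. The topology of what Logstash does is shown below.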

[Figure: Logstash topology diagram (https://i.imgur.com/f93WbB3.png)]

2. input: reading from Kafka

This section reads protobuf-serialized data from the Kafka cluster and deserializes it.

input{
  kafka {
    bootstrap_servers => ["192.168.144.34:9092,192.168.144.35:9092,192.168.144.36:9092"]
    group_id => "logstash-release-upgrade-test"
    consumer_threads => 5
    decorate_events => false
    topics => ["release_upgrade_ES_test"]
    type => "release-upgrade-es"
    auto_offset_reset => "latest"
    key_deserializer_class => "org.apache.kafka.common.serialization.ByteArrayDeserializer"
    value_deserializer_class => "org.apache.kafka.common.serialization.ByteArrayDeserializer"
    codec => protobuf{ 
      class_name => ["ReleaseRecords.ReleaseRecordOrigin"]
      include_path => ['/opt/logstash-6.5.4/config/fates/ReleaseRecordES_pb.rb']
      protobuf_version => 3
    }
  }
}

2.1 Basic settings and explanations

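- bootstrap_servers
  - Value type: string
  - Default: "localhost:9092"
  - A list of Kafka instance URLs used to establish the initial connection to the cluster, in the form host1:port1,host2:port2. These URLs are only used for the initial connection, to discover the full cluster membership (which may change dynamically), so the list does not need to contain every server (though you may want more than one in case a server is down).
- group_id
  - Value type: string
  - Default: "logstash"
  - The identifier of the group this consumer belongs to. A consumer group is a single logical subscriber made up of multiple processors. Messages in a topic are distributed across all Logstash instances with the same group_id.
- consumer_threads
  - Value type: number
  - Default: 1
  - Ideally, for a perfect balance, you should have as many threads as there are partitions. More threads than partitions means some threads will be idle.
- decorate_events
  - Value type: boolean
  - Default: false
  - Option to add Kafka metadata, such as the topic and message size, to the event. This adds a field named kafka to the Logstash event with the following attributes: topic (the topic this message is associated with), consumer_group (the consumer group used to read this event), partition (the partition this message is associated with), offset (the offset within the partition), and key (a ByteBuffer containing the message key).
- topics
  - Value type: array
  - Default: ["logstash"]
  - The list of topics to subscribe to.
- enable_auto_commit
  - Value type: string
  - Default: "true"
  - If true, the consumer periodically commits to Kafka the offsets of messages it has already returned. When the process fails, this committed offset is used as the position from which consumption resumes.
- type
  - Value type: string
  - No default value
  - Adds a type field to all events handled by this input. Types are mainly used for filter activation. The type is stored as part of the event itself, so you can also search on it in Kibana. If you try to set a type on an event that already has one (for example when you send an event from a shipper to an indexer), the new input will not override the existing type. A type set at the shipper stays with that event for its whole life, even when it is sent to another Logstash server.

For other settings, see:
- English documentation: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kafka.html#plugins-inputs-kafka-bootstrap_servers
- Chinese documentation: https://segmentfault.com/a/1190000016595992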
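
To make the group_id and consumer_threads descriptions concrete, here is a minimal Ruby sketch of how partitions might be spread over the consumers of one group. It assumes a simple round-robin assignment; Kafka's real assignors differ in detail, and assign_partitions, the thread names and the partition count are all hypothetical.

```ruby
# Sketch only: round-robin assignment of partitions to the consumers of one group.
# Kafka's real assignors differ in detail, but every partition is still owned by
# exactly one consumer in the group, which is why messages are not read twice.
def assign_partitions(partition_count, consumer_names)
  assignment = consumer_names.map { |name| [name, []] }.to_h
  partition_count.times do |partition|
    owner = consumer_names[partition % consumer_names.size]
    assignment[owner] << partition
  end
  assignment
end

# Hypothetical numbers: a 3-partition topic read with consumer_threads => 5.
threads = (1..5).map { |i| "thread-#{i}" }
assignment = assign_partitions(3, threads)
idle = assignment.select { |_, partitions| partitions.empty? }.keys

assignment.each { |name, partitions| puts "#{name}: #{partitions.inspect}" }
puts "idle: #{idle.join(', ')}"
```

With more threads than partitions, the extra threads receive nothing, which is the idle case described for consumer_threads.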

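2.2 Deserializing protobuf data

2.2.1 Introduction to Protocol Buffers

Compared with schemaless JSON, Protocol Buffers (protobuf) is a schema-based, efficient serialization format. Much of the data we pass through Kafka is encoded with it. It has two main advantages. First, the encoded data is noticeably smaller than with other encodings. With JSON, for example, the message body carries not only the actual data but also every key and a lot of braces and brackets. For data whose structure rarely changes, sending this extra information is wasteful: once sender and receiver agree on the structure, carrying it in every message is redundant, and the resources it consumes across the log pipeline can be saved. Second, the format of the data a consumer handles is agreed in advance, so unlike JSON an unexpected field cannot suddenly appear and cause a field to be misinterpreted.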
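
The size difference can be illustrated with a small Ruby sketch that hand-encodes three fields in the protobuf wire format and compares the result with the JSON encoding of the same data. This is an illustration under stated assumptions, not how Logstash decodes messages: the helper names and sample values are hypothetical, only non-negative integers are handled, and the field numbers (1, 3, 13) are taken from the ReleaseRecordES.proto file shown later in this article.

```ruby
require 'json'

# Encode a non-negative integer as a protobuf varint (7 bits per byte, low bits first).
def varint(n)
  bytes = []
  loop do
    byte = n & 0x7f
    n >>= 7
    if n.zero?
      bytes << byte
      break
    end
    bytes << (byte | 0x80) # set the continuation bit, more bytes follow
  end
  bytes.pack('C*')
end

# Length-delimited field (wire type 2): key, byte length, raw bytes.
def string_field(number, value)
  data = value.b
  varint((number << 3) | 2) + varint(data.bytesize) + data
end

# Varint field (wire type 0): key, value.
def int_field(number, value)
  varint(number << 3) + varint(value)
end

# Hypothetical sample record.
record = { 'taskId' => 'task-0001', 'appName' => 'demo', 'status' => 3 }
wire = string_field(1, record['taskId']) +
       string_field(3, record['appName']) +
       int_field(13, record['status'])
json = JSON.generate(record)

puts "protobuf: #{wire.bytesize} bytes, JSON: #{json.bytesize} bytes"
```

The protobuf bytes carry only field numbers and values, while the JSON text repeats every key name in every message.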

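Logstash does not support protobuf encoding and decoding out of the box. By default it handles plain text, JSON and a number of other message formats.
The protobuf codec has to be set up manually (it is provided by the logstash-codec-protobuf plugin, installed with bin/logstash-plugin install logstash-codec-protobuf). The configuration is described below.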

2.2.2 Installation

The messages in Kafka use protobuf 3.x, so the protobuf compiler installed locally must also be 3.x.
To build protobuf from source, the following tools are needed:
- autoconf
- automake
- libtool
- make
- g++
- unzip

sudo yum install autoconf automake libtool curl make gcc-c++ unzip

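Install protobuf
(The autotools steps below apply to 3.x source trees. Newer protobuf releases build with CMake or Bazel instead, so check out a 3.x tag after cloning.)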

git clone https://github.com/protocolbuffers/protobuf.git
cd protobuf
git submodule update --init --recursive
./autogen.sh
./configure --prefix=/usr/local/protobuf
make
make check
make install

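Configure environment variables
- vim /etc/profile

export PATH=$PATH:/usr/local/protobuf/bin/
export PKG_CONFIG_PATH=/usr/local/protobuf/lib/pkgconfig/

Save the file and run source /etc/profile. Also add the two lines above to ~/.profile; otherwise the logged-in user will not be able to find the protoc command.

Configure the dynamic linker

Open /etc/ld.so.conf with vim, add /usr/local/protobuf/lib on a new line, then run ldconfig.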

2.2.3 The ReleaseRecordES.proto file

Create ReleaseRecordES.proto in the /opt/logstash-6.5.4/config/fates/ directory:

syntax = "proto3";
// Compile: protoc --ruby_out=. ReleaseRecordES.proto 
package ReleaseRecords;
message ReleaseRecordOrigin {
  string taskId = 1;
  string device = 2;
  string appName = 3;
  string mediaAppName = 4;
  string version = 5;
  string oldVersion = 6;
  string mediaVersion = 7;
  int32 strategyId = 8;
  int32 sp = 9;
  int32 softBit = 10;
  int32 upgradeMode = 11;
  int32 strategyType = 12;
  int32 status = 13;
  string country = 14;
  string province = 15;
  string city = 16;
  string os = 17;
  string ip = 18;
  string systemBit = 19;
  int32 upgradeType = 20;
  int32 situation = 21;
  string time = 22;
  int32 patchId = 23;
}

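2.2.4 Compiling the proto file to Ruby with protoc

Logstash can only decode the messages once the proto file has been compiled into a Ruby file it can load.
Compile command: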

protoc --ruby_out=. ReleaseRecordES.proto
----------------------------------------------
#ls
ReleaseRecordES.proto
ReleaseRecordES_pb.rb

The compiled file, ReleaseRecordES_pb.rb:

# Generated by the protocol buffer compiler.  DO NOT EDIT!
# source: ReleaseRecordES.proto

require 'google/protobuf'

Google::Protobuf::DescriptorPool.generated_pool.build do
  add_message "ReleaseRecords.ReleaseRecordOrigin" do
    optional :taskId, :string, 1
    optional :device, :string, 2
    optional :appName, :string, 3
    optional :mediaAppName, :string, 4
    optional :version, :string, 5
    optional :oldVersion, :string, 6
    optional :mediaVersion, :string, 7
    optional :strategyId, :int32, 8
    optional :sp, :int32, 9
    optional :softBit, :int32, 10
    optional :upgradeMode, :int32, 11
    optional :strategyType, :int32, 12
    optional :status, :int32, 13
    optional :country, :string, 14
    optional :province, :string, 15
    optional :city, :string, 16
    optional :os, :string, 17
    optional :ip, :string, 18
    optional :systemBit, :string, 19
    optional :upgradeType, :int32, 20
    optional :situation, :int32, 21
    optional :time, :string, 22
    optional :patchId, :int32, 23
  end
end

module ReleaseRecords
  ReleaseRecordOrigin = Google::Protobuf::DescriptorPool.generated_pool.lookup("ReleaseRecords.ReleaseRecordOrigin").msgclass
end

2.2.5 Configuring protobuf deserialization in input.kafka

Add the following to input.kafka:
- class_name => ["ReleaseRecords.ReleaseRecordOrigin"]
  - Package name plus class name
- include_path => ['/opt/logstash-6.5.4/config/fates/ReleaseRecordES_pb.rb']
  - Path to the ReleaseRecordES_pb.rb file
- protobuf_version => 3
  - The protobuf compiler version is 3.x

key_deserializer_class => "org.apache.kafka.common.serialization.ByteArrayDeserializer"
value_deserializer_class => "org.apache.kafka.common.serialization.ByteArrayDeserializer"
codec => protobuf{ 
  class_name => ["ReleaseRecords.ReleaseRecordOrigin"]
  include_path => ['/opt/logstash-6.5.4/config/fates/ReleaseRecordES_pb.rb']
  protobuf_version => 3
}

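3. filter: filtering the source data

Upgrade logs are updated by taskId:

First, input reads one upgrade log record from Kafka.

Next, the filter's elasticsearch plugin looks that record up in es by its primary key, taskId.

Finally, the result is checked. If no record exists in es, the record from Kafka is inserted. If one exists, the updateTime from Kafka is compared with the updateTime of the record found in es: if it is newer, the Kafka data is written to es; if it is older or equal, the write is blocked.

The filter plugins implement this as follows: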

3.1 elasticsearch plugin

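elasticsearch{
  hosts => ["192.168.144.23:19200","192.168.144.34:19200","192.168.144.35:19200","192.168.144.36:19200","192.168.162.137:19200"]
  index => "release_upgrade_es_*"
  # Lucene query-string syntax: boolean operators must be upper case,
  # and the routing value is stored in the _routing metadata field
  query => "_routing:%{appName} AND _id:%{taskId}"
  fields => {
    "updateTime" => "oldUpdateTime"
  }
}

- index => "release_upgrade_es_*"
  - Indices are split by month, so the lookup has to match every monthly index.
- query => "_routing:%{appName} AND _id:%{taskId}"
  - The product name in _routing together with the taskId in _id identifies a single upgrade record.
- fields => { "updateTime" => "oldUpdateTime" }
  - Copies the updateTime field fetched from es into the upgrade record read from Kafka, renamed to oldUpdateTime.
  - oldUpdateTime is what the ruby plugin uses for the time comparison.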
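
The %{taskId} and %{appName} references are filled in from the fields of the current event before the query is sent. The following Ruby sketch is a simplified stand-in for that substitution, not Logstash's actual implementation (which also handles nested fields and date patterns); interpolate and the sample event are hypothetical.

```ruby
# Simplified stand-in for Logstash's %{field} substitution.
# References to fields that are missing from the event are left untouched.
def interpolate(template, event)
  template.gsub(/%\{(\w+)\}/) do
    key = Regexp.last_match(1)
    event.key?(key) ? event[key].to_s : "%{#{key}}"
  end
end

# Hypothetical event, as decoded from Kafka.
event = { 'taskId' => 'task-0001', 'appName' => 'demo' }

puts interpolate('_id:%{taskId}', event)
puts interpolate('%{appName}', event)
```

The same mechanism is what resolves document_id => "%{taskId}" and routing => "%{appName}" in the output section.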

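3.2 date plugin

This plugin writes updateTime (in ISO8601 format) into @timestamp. Nothing else in this project reads @timestamp directly, but the output's index => "release_upgrade_es_%{+YYYY.MM}" is formatted from it, so this step decides which monthly index an event is written to. Omit it only if indexing by ingestion time is acceptable.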

date {
  match => ["[updateTime]", "ISO8601"]
  target => "[@timestamp]"
}
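
Because the output's index name release_upgrade_es_%{+YYYY.MM} is built from @timestamp, it can help to see how a timestamp maps to a monthly index. A minimal Ruby sketch, assuming the timestamp is formatted in UTC; monthly_index and the sample updateTime values are hypothetical and not part of the pipeline.

```ruby
require 'time'

# %{+YYYY.MM} is a Joda-style pattern; '%Y.%m' is the strftime equivalent.
# getutc returns a UTC copy, so the caller's Time object is not modified.
def monthly_index(timestamp, prefix = 'release_upgrade_es_')
  prefix + timestamp.getutc.strftime('%Y.%m')
end

# Hypothetical updateTime values in ISO8601 form.
puts monthly_index(Time.iso8601('2020-06-12T02:05:03.626Z'))
# A local time early on 1 July (UTC+8) is still June in UTC.
puts monthly_index(Time.iso8601('2020-07-01T00:30:00+08:00'))
```

Records whose updateTime falls close to a month boundary can therefore land in a different monthly index than their local date suggests.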

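The filters/date plugin supports five kinds of time format:
- ISO8601
  - For example "2011-04-19T03:44:01.103Z"
- UNIX
  - UNIX timestamp: the number of seconds elapsed since the start of 1970
- UNIX_MS
  - The number of milliseconds elapsed since the start of 1970
- TAI64N
  - A less common format that looks like this: @4000000052f88ea32489532c
- Joda-Time patterns
  - Logstash uses the Java Joda-Time library internally for time handling
- For details, see:
  - http://doc.yonyoucloud.com/doc/logstash-best-practice-cn/filter/date.html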
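
For readers who want to check what these formats encode, here is a rough Ruby sketch that converts each of the first four into a time value using only the standard library. The helper names and the sample UNIX values are hypothetical. The TAI64N helper only splits the label into seconds and nanoseconds and deliberately ignores the leap-second offset between TAI and UTC.

```ruby
require 'time'

def from_iso8601(text)
  Time.iso8601(text)
end

def from_unix(seconds)
  Time.at(seconds).utc
end

# Integer arithmetic avoids floating-point rounding of the millisecond part.
def from_unix_ms(millis)
  Time.at(millis / 1000, (millis % 1000) * 1000).utc
end

# TAI64N label: '@', then 16 hex digits of seconds offset by 2**62,
# then 8 hex digits of nanoseconds. Returns [seconds, nanoseconds].
def tai64n_parts(label)
  hex = label.sub(/\A@/, '')
  [hex[0, 16].to_i(16) - 2**62, hex[16, 8].to_i(16)]
end

puts from_iso8601('2011-04-19T03:44:01.103Z').utc.iso8601(3)
puts from_unix(1_392_021_155).iso8601
puts from_unix_ms(1_392_021_155_612).iso8601(3)
p tai64n_parts('@4000000052f88ea32489532c')
```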

3.3 ruby plugin

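Check whether oldUpdateTime is present on the record read from Kafka:

    If it is absent, the record from Kafka is inserted into es.

    If it is present, check whether the updateTime from Kafka is newer than the updateTime of the upgrade record found in es:

        If it is newer, the Kafka data is written to es.

        If it is older or equal, the write to es is blocked.

Finally, the oldUpdateTime field is removed from the Kafka record.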

ruby {
  init => "require 'time'"
  code => "
    # Cancel the event unless its updateTime is strictly newer than the value already stored in es
    old_update_time = event.get('oldUpdateTime')
    unless old_update_time.nil?
      diff_ms = (Time.parse(event.get('updateTime')).to_f - Time.parse(old_update_time).to_f) * 1000
      event.cancel if diff_ms <= 0
    end
  "
  remove_field => ["oldUpdateTime"]
}
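
The comparison can also be exercised outside Logstash. The sketch below restates the rule as a plain Ruby function so it can be tried with sample values; keep_event? is a hypothetical helper, and it assumes both timestamps are ISO8601 strings.

```ruby
require 'time'

# Returns true when the incoming record should be written to es:
# either no stored record exists, or the incoming updateTime is strictly newer.
def keep_event?(update_time, old_update_time)
  return true if old_update_time.nil?

  Time.iso8601(update_time) > Time.iso8601(old_update_time)
end

puts keep_event?('2020-06-12T10:00:00Z', nil)                    # no record in es yet
puts keep_event?('2020-06-12T10:00:00Z', '2020-06-12T09:00:00Z') # incoming is newer
puts keep_event?('2020-06-12T10:00:00Z', '2020-06-12T10:00:00Z') # same time
puts keep_event?('2020-06-12T08:00:00Z', '2020-06-12T09:00:00Z') # incoming is older
```

Only the first two cases let the event through; in the filter, the other two correspond to event.cancel.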

4. output

This section writes the Kafka data that passed the filter into es.

output {
  stdout{codec=>rubydebug}
  if [type] == "release-upgrade-es" {
    elasticsearch{
      hosts => ["192.168.144.23:19200","192.168.144.34:19200","192.168.144.35:19200","192.168.144.36:19200","192.168.162.137:19200"] 
      index => "release_upgrade_es_%{+YYYY.MM}"
      action => "update"
      doc_as_upsert => "true"
      document_id => "%{taskId}"
      routing => "%{appName}" 
      document_type => "doc"
      manage_template => true
      template => "/opt/logstash-6.5.4/config/fates/templateRecord.json"
      template_name => "release_upgrade_record"
      template_overwrite => true
    }
  }
}
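
The combination of action => "update" and doc_as_upsert in the block above behaves roughly like the following in-memory Ruby sketch. The Hash standing in for an index and the upsert helper are illustrative assumptions; the real work is done by Elasticsearch's update API.

```ruby
# In-memory stand-in for one index: document id => document body.
def upsert(index, id, doc)
  if index.key?(id)
    index[id] = index[id].merge(doc) # update: merge the partial document into the stored one
    :updated
  else
    index[id] = doc.dup              # doc_as_upsert: the event itself becomes the new document
    :created
  end
end

index = {}
puts upsert(index, 'task-0001', { 'status' => 1, 'appName' => 'demo' })
puts upsert(index, 'task-0001', { 'status' => 2 })
p index['task-0001']
```

The second call changes only status and leaves appName in place, which is why a later record for the same taskId updates the existing document instead of replacing it wholesale.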

4.1 stdout plugin

During testing this prints the data being written to es to the terminal. It can be removed in production.


4.2 elasticsearch plugin

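Main settings
- hosts
  - The hosts of the remote instances. For an es cluster they can be given as an array
- index => "release_upgrade_es_%{+YYYY.MM}"
  - Splits the index by year and month
- action
  - index: indexes the Logstash event as a document
  - delete: deletes a document by id (this action requires an id)
  - create: indexes a document, and fails if a document with that id already exists
  - update: updates a document by id. There is a special upsert case: if the document does not exist yet, it is created. See the upsert options (doc_as_upsert in this configuration).
- routing
  - By default, the shard a document is written to is given by the following formula:
```
shard_num = hash(_routing) % num_primary_shards
```
  - Here routing is set to the product name, so all upgrade records for a given product are stored in the same shard
- document_id
  - The document id used for indexing. Useful for overwriting an existing entry with the same id in Elasticsearch
- document_type
  - The document type the event is written to. Similar events should generally share a type. %{} can reference the event's type. The default depends on the plugin and Elasticsearch version (logs in older releases)
- manage_template => true
  - A default es mapping template is applied (unless this is set to false so that you manage your own template)
- template
  - A valid file path to your own template. If unset, the built-in one is used
- template_name
  - The name of the template inside es. Any name can be used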
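
The routing formula can be tried out with a small Ruby sketch. Elasticsearch hashes the routing value with Murmur3; Zlib.crc32 is used here only as a stand-in, so the shard numbers printed will not match a real cluster. shard_for and the product names are hypothetical.

```ruby
require 'zlib'

# shard_num = hash(_routing) % num_primary_shards
# CRC32 replaces Elasticsearch's Murmur3 hash purely for illustration.
def shard_for(routing, num_primary_shards)
  Zlib.crc32(routing) % num_primary_shards
end

# Hypothetical product names and a hypothetical 5-shard index.
%w[appA appB appA].each do |app_name|
  puts "#{app_name} -> shard #{shard_for(app_name, 5)}"
end
```

The same routing value always yields the same shard, which is what keeps one product's records together. With number_of_shards set to 1, as in the template in this article, every routing value maps to shard 0.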

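4.2.1 Automatic template

The template is defined to fit the data:

1. The upgrade records of a given product live in a single shard, and each shard has exactly one replica.

2. String fields in the objects deserialized from protobuf are mapped to keyword (a type that is not analyzed). Other types use the default mapping.

3. All indices under this template get the alias release-upgrade-info.

4. A string field named message in the deserialized object is mapped to text (it is analyzed and gets an inverted index, which takes more space). The message_field rule could be removed; it is kept for now.

5. index.refresh_interval is 5s (the value used by Logstash's default template; Elasticsearch's own default is 1s), which makes the Elasticsearch cluster create a new segment (roughly, a Lucene index file) every 5 seconds. Raising it, for example to 30s, allows larger segments to be written and reduces later segment-merge pressure.

6. geoip holds geographic information for a given ip field. Upgrade records already carry geographic information, so this mapping is unused for now and can be removed later.

The template file is placed at the configured path:
- template => "/opt/logstash-6.5.4/config/fates/templateRecord.json"

Template contents: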

{
  "template" : "release_upgrade_es_*",
  "version" : 60001,
  "settings" : {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "refresh_interval" : "5s"
    }
  },
  "mappings" : {
    "doc" : {
      "_source": {"enabled": true},
      "_all": {"enabled": false},
      "dynamic_templates" : [ {
        "message_field" : {
          "path_match" : "message",
          "match_mapping_type" : "string",
          "mapping" : {
            "type" : "text",
            "norms" : false
          }
        }
      }, {
        "string_fields" : {
          "match" : "*",
          "match_mapping_type" : "string",
          "mapping" : {
            "type" : "keyword"
          }
        }
      } ],
      "properties" : {
        "@timestamp": { "type": "date"},
        "@version": { "type": "keyword"},
        "geoip"  : {
          "dynamic": true,
          "properties" : {
            "ip": { "type": "ip" },
            "location" : { "type" : "geo_point" },
            "latitude" : { "type" : "half_float" },
            "longitude" : { "type" : "half_float" }
          }
        }
      }
    }
  },
  "aliases": {
    "release-upgrade-info": {}
  }
}
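
The effect of the two dynamic_templates rules above can be sketched in Ruby as follows. This is a simplified model, assuming that only the field path and the value's type matter and that the first matching rule wins; it ignores the difference between match (field name) and path_match (full dotted path). DYNAMIC_TEMPLATES and mapped_type are hypothetical names.

```ruby
# The two rules in the same order as in the template file; the first match wins.
DYNAMIC_TEMPLATES = [
  { name: 'message_field', pattern: 'message', type: 'text' },
  { name: 'string_fields', pattern: '*',       type: 'keyword' }
].freeze

# Returns the mapped type for a new field, or nil when the value is not a
# string and Elasticsearch's default dynamic mapping would apply instead.
def mapped_type(field_path, value)
  return nil unless value.is_a?(String)

  rule = DYNAMIC_TEMPLATES.find { |t| File.fnmatch(t[:pattern], field_path) }
  rule && rule[:type]
end

# Hypothetical sample fields.
puts mapped_type('message', 'upgrade finished')
puts mapped_type('appName', 'demo')
p mapped_type('status', 3)
```

Since the fields in ReleaseRecordES.proto include no message field, every string field in this project ends up as keyword.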

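5. Cluster

Once the Logstash configuration file has been written as described above, run it on n Logstash servers. Because all of them use the same group_id, the Logstash instances will not consume the same Kafka messages twice.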

6. Startup commands

cd /opt/logstash-6.5.4

bin/logstash -f config/fates/kafkaToES.conf

Run it in the background with nohup:

nohup /opt/logstash-6.5.4/bin/logstash -f /opt/logstash-6.5.4/config/fates/kafkaToES.conf >/dev/null 2>&1 &
