Flume 1.8 TaildirSource (reposted)

Using Flume (1): getting-started demo

Using Flume (2): collecting remote log data into a MySQL database

Using Flume (3): shipping real-time log4j logs to a MySQL database through Flume

Using Flume (4): real-time multi-file monitoring and collection with TaildirSource

This article addresses the two possible problems with Flume's TaildirSource raised in "Using Flume (4): real-time multi-file monitoring and collection with TaildirSource".

I. Problem analysis
(1) log4j log files roll according to their rotation policy: when *.log is full it is renamed to *.log.1 and logging continues in a fresh *.log. Flume then treats *.log.1 as a new file and reads it again from the beginning, producing duplicates (a log4j configuration that produces this rolling pattern is sketched below).

(2) When a log file that Flume is monitoring is moved or deleted, Flume keeps on monitoring it and does not release the resource immediately. It does release it after a while; according to the official documentation the default wait is 120000 ms.
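For reference on problem (1): the rolling comes from log4j's size-based RollingFileAppender. A minimal log4j 1.x sketch (file path and sizes are illustrative) that produces the *.log / *.log.1 pattern described above:

# log4j.properties (sketch): renames app.log to app.log.1 once MaxFileSize is reached
log4j.rootLogger=INFO, roll
log4j.appender.roll=org.apache.log4j.RollingFileAppender
log4j.appender.roll.File=/var/log/app/app.log
log4j.appender.roll.MaxFileSize=10MB
log4j.appender.roll.MaxBackupIndex=5
log4j.appender.roll.layout=org.apache.log4j.PatternLayout
log4j.appender.roll.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c - %m%n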

II. Handling approach
I call this a handling approach rather than a fix. Other articles describe these two behaviours as bugs, but I don't think either of them is. Flume is a top-level Apache project; if these were genuine bugs there would be pull requests against its repository and they would be fixed promptly, certainly by Flume 1.8. I went through the list of bug fixes and new features in Flume 1.8 and found nothing that addresses (1) or (2).

The official 1.8 release notes contain only one Taildir-related item, which fixes the case where Flume closes a file while that file is still being written to.

Official site: http://flume.apache.org/releases/1.8.0.html

(1) Flume re-reads a renamed file as a new file because of the regular expression: the renamed file name still matches the pattern, so, first, it is still being monitored; second, Flume decides whether it is the same file by checking the file's inode together with its absolute path (and whether the cached TailFile is null). You can confirm this in the source: download the source code, import it as a Maven project (keeping the package paths intact), and open the taildir source module.

Let's look at a test run first:

Duplicates do indeed appear. Now let's look at the source code of the flume-taildir-source module.

The updateTailFiles method of the ReliableTaildirEventReader class:
  public List<Long> updateTailFiles(boolean skipToEnd) throws IOException {
    updateTime = System.currentTimeMillis();
    List<Long> updatedInodes = Lists.newArrayList();
 
    for (TaildirMatcher taildir : taildirCache) {
      Map<String, String> headers = headerTable.row(taildir.getFileGroup());
 
      for (File f : taildir.getMatchingFiles()) {
        long inode = getInode(f);
        TailFile tf = tailFiles.get(inode);
        // a file is considered "known" only if both its inode and its absolute path match;
        // a renamed file keeps its inode but not its path, so it is opened and read again
        if (tf == null || !tf.getPath().equals(f.getAbsolutePath())) {
          long startPos = skipToEnd ? f.length() : 0;
          tf = openFile(f, headers, inode, startPos);
        } else {
          boolean updated = tf.getLastUpdated() < f.lastModified();
          if (updated) {
            if (tf.getRaf() == null) {
              tf = openFile(f, headers, inode, tf.getPos());
            }
            if (f.length() < tf.getPos()) {
              logger.info("Pos " + tf.getPos() + " is larger than file size! "
                  + "Restarting from pos 0, file: " + tf.getPath() + ", inode: " + inode);
              tf.updatePos(tf.getPath(), inode, 0);
            }
          }
          tf.setNeedTail(updated);
        }
        tailFiles.put(inode, tf);
        updatedInodes.add(inode);
      }
    }
    return updatedInodes;
  }
The key part:
 for (File f : taildir.getMatchingFiles()) {
        long inode = getInode(f);
        TailFile tf = tailFiles.get(inode);
        if (tf == null || !tf.getPath().equals(f.getAbsolutePath())) {
          long startPos = skipToEnd ? f.length() : 0;
          tf = openFile(f, headers, inode, startPos);
        } 
The updatePos method of the TailFile class:
  public boolean updatePos(String path, long inode, long pos) throws IOException {
    if (this.inode == inode && this.path.equals(path)) {
      setPos(pos);
      updateFilePos(pos);
      logger.info("Updated position, file: " + path + ", inode: " + inode + ", pos: " + pos);
      return true;
    }
    return false;
  }
The trouble is that when a renamed file still matches the regular expression, Flume keeps monitoring it, and because the path has changed (even though the inode is the same) Flume treats it as a new file.

In practice this is an inconvenience developers create for themselves: rotated files can simply be excluded by the filename regular expression.

With a pattern such as .*.log.* , renaming .ac.log to .ac.log.1 will of course lead to the file being read twice.

With the pattern .*.log , once .ac.log is renamed to .ac.log.1 it no longer matches, Flume stops monitoring it, and there is no duplicate read.
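As a sketch (agent name, paths, and channel are illustrative), a TaildirSource configuration that uses the narrower pattern and therefore ignores rotated files might look like this:

a1.sources = r1
a1.channels = c1

a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/flume/taildir_position.json
a1.sources.r1.filegroups = f1
# matches the active .log files only; rotated files such as .ac.log.1 no longer match
a1.sources.r1.filegroups.f1 = /var/log/app/.*.log

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000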

That is my take on why this behaviour exists and why the Flume team has not "fixed" it.

Of course, if patterns like .*.log.* really are indispensable in production, the Flume team will presumably decide whether to change the behaviour based on how loud the GitHub issues get.

If you do need such a pattern to monitor a directory and want to avoid duplicate reads, you have to modify the Flume 1.7 source:

Handling problem (1)
1. Modify ReliableTaildirEventReader
Modify the updateTailFiles method of the ReliableTaildirEventReader class.

Remove tf.getPath().equals(f.getAbsolutePath()): it is enough to check that the cached TailFile is null. The file name should not be part of the check, because log4j renames the file when it rolls the log.

 if (tf == null || !tf.getPath().equals(f.getAbsolutePath())) {
 
Change it to:
 if (tf == null) {
2. Modify TailFile
Modify the updatePos method of the TailFile class.

The inode alone already identifies the file uniquely, so the path does not need to be part of the check:

    if (this.inode == inode && this.path.equals(path)) {
Change it to:
    if (this.inode == inode) {
3. Package the modified code as a custom source jar
It is enough to package just the taildir source component and replace that component's jar in the Flume installation.

Now you can test.

Handling problem (2)
Problem (2) is that when a monitored file no longer exists, Flume does not release the resource.

This is not really a problem either: the resource is in fact released, just after a waiting period.

Look at the TaildirSource section of the Flume 1.7 official documentation:

It shows that if no new lines are appended to a file within the default 120000 ms, the resource is closed; as soon as new lines are appended, it is reopened automatically.

In other words, after the default 120000 ms (2 minutes), the supposedly unreleased resource is closed automatically.

To avoid tying up the resource for that long you can lower this value. But consider why the official default is so large (minutes, where comparable timeouts are usually seconds): the value should not be lowered arbitrarily, because frequently opening and closing file handles wastes system resources as well, and that deserves at least as much attention.

If you have not benchmarked this carefully, the default is fine.
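If you do decide to shorten the wait after testing, the property to change is idleTimeout on the taildir source (value illustrative):

# close a tailed file after 30 seconds without new lines instead of the default 120000 ms
a1.sources.r1.idleTimeout = 30000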

https://www.cloudera.com/documentation/kafka/latest/topics/kafka_flume.html

Using Apache Kafka with Apache Flume

In CDH 5.2 and higher, Apache Flume contains an Apache Kafka source and sink. Use these to stream data from Kafka to Hadoop or from any Flume source to Kafka.

In CDH 5.7 and higher, the Flume connector to Kafka only works with Kafka 2.0 and higher.

Important: Do not configure a Kafka source to send data to a Kafka sink. If you do, the Kafka source sets the topic in the event header, overriding the sink configuration and creating an infinite loop, sending messages back and forth between the source and sink. If you need to use both a source and a sink, use an interceptor to modify the event header and set a different topic.
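As a sketch of that workaround (interceptor and topic names are illustrative), a static interceptor on the source can overwrite the topic header before events reach the Kafka sink:

tier1.sources.source1.interceptors = i1
tier1.sources.source1.interceptors.i1.type = static
tier1.sources.source1.interceptors.i1.key = topic
tier1.sources.source1.interceptors.i1.value = sink-topic
# overwrite the topic header that the Kafka source sets on each event
tier1.sources.source1.interceptors.i1.preserveExisting = false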

For information on configuring Kafka to securely communicate with Flume, see Configuring Flume Security with Kafka.

 

Kafka Source

Use the Kafka source to stream data in Kafka topics to Hadoop. The Kafka source can be combined with any Flume sink, making it easy to write Kafka data to HDFS, HBase, and Solr.

The following Flume configuration example uses a Kafka source to send data to an HDFS sink:

tier1.sources  = source1
 tier1.channels = channel1
 tier1.sinks = sink1
 
 tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
 tier1.sources.source1.kafka.bootstrap.servers = kafka-broker01.example.com:9092
 tier1.sources.source1.kafka.topics = weblogs
 tier1.sources.source1.kafka.consumer.group.id = flume
 tier1.sources.source1.channels = channel1
 tier1.sources.source1.interceptors = i1
 tier1.sources.source1.interceptors.i1.type = timestamp
 tier1.sources.source1.kafka.consumer.timeout.ms = 100
 
 tier1.channels.channel1.type = memory
 tier1.channels.channel1.capacity = 10000
 tier1.channels.channel1.transactionCapacity = 1000
 
 tier1.sinks.sink1.type = hdfs
 tier1.sinks.sink1.hdfs.path = /tmp/kafka/%{topic}/%y-%m-%d
 tier1.sinks.sink1.hdfs.rollInterval = 5
 tier1.sinks.sink1.hdfs.rollSize = 0
 tier1.sinks.sink1.hdfs.rollCount = 0
 tier1.sinks.sink1.hdfs.fileType = DataStream
 tier1.sinks.sink1.channel = channel1

For higher throughput, configure multiple Kafka sources to read from the same topic. If you configure all the sources with the same kafka.consumer.group.id, and the topic contains multiple partitions, each source reads data from a different set of partitions, improving the ingest rate.
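For example, extending the configuration above with a second source in the same consumer group (a sketch; Kafka then assigns each source a disjoint subset of the topic's partitions):

tier1.sources = source1 source2

tier1.sources.source2.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source2.kafka.bootstrap.servers = kafka-broker01.example.com:9092
tier1.sources.source2.kafka.topics = weblogs
tier1.sources.source2.kafka.consumer.group.id = flume
tier1.sources.source2.channels = channel1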

For the list of Kafka Source properties, see Kafka Source Properties.

For the full list of Kafka consumer properties, see the Kafka documentation.

Tuning Notes

The Kafka source overrides two Kafka consumer parameters:

  1. auto.commit.enable is set to false by the source, and every batch is committed. For improved performance, set this to true using the kafka.auto.commit.enable setting. This can lead to data loss if the source goes down before committing.
  2. consumer.timeout.ms is set to 10, so when Flume polls Kafka for new data, it waits no more than 10 ms for the data to be available. Setting this to a higher value can reduce CPU utilization due to less frequent polling, but introduces latency in writing batches to the channel.
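Both overrides are plain source properties; for example (values illustrative, with the trade-offs noted above):

# re-enable Kafka auto-commit for throughput, accepting possible data loss on failure
tier1.sources.source1.kafka.auto.commit.enable = true
# poll for up to 500 ms instead of 10 ms to lower CPU use at the cost of batch latency
tier1.sources.source1.kafka.consumer.timeout.ms = 500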

 

Kafka Sink

Use the Kafka sink to send data to Kafka from a Flume source. You can use the Kafka sink in addition to Flume sinks such as HBase or HDFS.

The following Flume configuration example uses a Kafka sink with an exec source:

tier1.sources  = source1
 tier1.channels = channel1
 tier1.sinks = sink1
 
 tier1.sources.source1.type = exec
 tier1.sources.source1.command = /usr/bin/vmstat 1
 tier1.sources.source1.channels = channel1
 
 tier1.channels.channel1.type = memory
 tier1.channels.channel1.capacity = 10000
 tier1.channels.channel1.transactionCapacity = 1000
 
 tier1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
 tier1.sinks.sink1.topic = sink1
 tier1.sinks.sink1.brokerList = kafka01.example.com:9092,kafka02.example.com:9092
 tier1.sinks.sink1.channel = channel1
 tier1.sinks.sink1.batchSize = 20

For the list of Kafka Sink properties, see Kafka Sink Properties.

For the full list of Kafka producer properties, see the Kafka documentation.

The Kafka sink uses the topic and key properties from the FlumeEvent headers to determine where to send events in Kafka. If the header contains the topic property, that event is sent to the designated topic, overriding the configured topic. If the header contains the key property, that key is used to partition events within the topic. Events with the same key are sent to the same partition. If the key parameter is not specified, events are distributed randomly to partitions. Use these properties to control the topics and partitions to which events are sent through the Flume source or interceptor.

 

Kafka Channel

CDH 5.3 and higher includes a Kafka channel to Flume in addition to the existing memory and file channels. You can use the Kafka channel:

  • To write to Hadoop directly from Kafka without using a source.
  • To write to Kafka directly from Flume sources without additional buffering.
  • As a reliable and highly available channel for any source/sink combination.

The following Flume configuration uses a Kafka channel with an exec source and hdfs sink:

tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1

tier1.sources.source1.type = exec
tier1.sources.source1.command = /usr/bin/vmstat 1
tier1.sources.source1.channels = channel1

tier1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.zookeeperConnect = zk01.example.com:2181
tier1.channels.channel1.parseAsFlumeEvent = false
tier1.channels.channel1.kafka.topic = channel2
tier1.channels.channel1.kafka.consumer.group.id = channel2-grp
tier1.channels.channel1.kafka.consumer.auto.offset.reset = earliest
tier1.channels.channel1.kafka.bootstrap.servers = kafka02.example.com:9092,kafka03.example.com:9092
tier1.channels.channel1.transactionCapacity = 1000
tier1.channels.channel1.kafka.consumer.max.partition.fetch.bytes=2097152

tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = /tmp/kafka/channel
tier1.sinks.sink1.hdfs.rollInterval = 5
tier1.sinks.sink1.hdfs.rollSize = 0
tier1.sinks.sink1.hdfs.rollCount = 0
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1

For the list of Kafka Channel properties, see Kafka Channel Properties.

For the full list of Kafka producer properties, see the Kafka documentation.

