Flume design and performance tuning

Table of contents

Business scenario:

Step 1 (a single Flume agent):

Step 2 (multiple Flume agents in parallel):

Step 3 (multiplexing (routing) mode):

A sample Flume configuration, plus the Java code that sets selector.header


 

Business scenario:

     Every five minutes a new gz-compressed file of about 2.4 GB is generated, containing roughly 16.8 million records. The data needs to be cleaned and transformed with Flume and then written to Kafka.

Server environment: one machine with 32 cores and 125 GB of RAM.
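Before picking a design, it helps to quantify the sustained throughput these numbers imply (a back-of-the-envelope calculation, not from the original post): roughly 16.8 million records arriving every 300 seconds.

```java
public class ThroughputTarget {
    public static void main(String[] args) {
        long recordsPerFile = 16_800_000L; // ~16.8 million records per file
        long windowSeconds = 5 * 60;       // a new file lands every five minutes
        // The sustained rate Flume must achieve just to keep up:
        System.out.println(recordsPerFile / windowSeconds + " events/sec"); // 56000 events/sec
    }
}
```

So any design that cannot sustain about 56,000 events per second will fall behind the incoming files.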

Options considered (lots of head-scratching and frantic testing)

Code outline:

        A custom source reads files from a directory, consumes and parses the data, converts the format, and writes to the channel. The custom source itself is covered in a separate post and is not described here.

        There are two ways to write a custom source: PollableSource and EventDrivenSource. With PollableSource, Flume automatically polls the process() method to pull data; with EventDrivenSource, you define your own callback mechanism to capture new data and store it in the channel.
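As a rough skeleton of the PollableSource style (class and method names here are illustrative, not the actual com.mmtrix.source.MySource, and readNextParsedLine is a hypothetical helper standing in for the directory-reading and parsing logic):

```java
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

// Skeleton only; the real source also handles configuration,
// checkpointing, and batched parsing.
public class MyPollingSource extends AbstractSource implements PollableSource {
    @Override
    public Status process() throws EventDeliveryException {
        // Flume calls process() in a loop on its own polling thread.
        byte[] line = readNextParsedLine(); // hypothetical helper
        if (line == null) {
            return Status.BACKOFF;  // nothing to read; Flume backs off briefly
        }
        getChannelProcessor().processEvent(EventBuilder.withBody(line));
        return Status.READY;        // more data is likely available
    }

    private byte[] readNextParsedLine() {
        // read and parse the next record from the monitored directory (omitted)
        return null;
    }
}
```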

Initial thinking on the Flume topology:

Step 1 (a single Flume agent):

    Run Flume as one agent with one source, one channel, and one sink.

Testing:

     With a file channel, throughput was only a few thousand events per second; switching to a memory channel raised it to roughly 20,000-30,000 per second.

Still too slow; more optimization needed -----------------------------------------------------------------

Step 2 (multiple Flume agents in parallel):

     Run several Flume agents, each monitoring its own directory, so files can be processed in parallel. Let's try it.

    Approach: decompress the archive, cut it into multiple files with the Linux split command, and distribute the pieces across several directories for parallel processing.

    Drawback: decompression takes a long time, and a 2.4 GB gz file expands to more than 10 GB, eating a lot of the server's disk.

Not enough disk, and frankly too much hassle --------------------------------------------------

Step 3 (multiplexing (routing) mode):

     While profiling, with no sink attached (only the source writing into the channel) throughput reached about 50,000 events per second; with a sink writing to Kafka it dropped to about 20,000 per second. So the bottleneck is the sink writing into Kafka,

    and the next idea was to have multiple sinks consume a single source's data to raise concurrency.

    After much searching, the Flume user guide (http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html)

     turned up a mechanism for exactly this, the multiplexing (routing) mode:

    a single source's data can be written to different channels according to rules you define.

    Long-running tests then settled at roughly 50,000-60,000 events per second, about what is needed to keep up with 16.8 million records every five minutes.

CPU and memory usage on the server during the long-running test were as follows:

 

With multiplexing, a selector routes on a header key you define. For example, the data's header carries a State key: if its value is AZ, the event goes to wangsu1Channel; if BZ, it goes to wangsu3Channel; if CZ, to wangsu2Channel; any value other than AZ, BZ, or CZ falls through to the default, wangsu1Channel.

 

Below is a sample Flume configuration and the Java code that sets selector.header.

flume-conf.properties:

agent.sources = wangsu1Source
agent.sinks = wangsu1Sink wangsu2Sink wangsu3Sink
agent.channels = wangsu1Channel wangsu2Channel wangsu3Channel


agent.sources.wangsu1Source.type = com.mmtrix.source.MySource
agent.sources.wangsu1Source.channels = wangsu1Channel wangsu2Channel wangsu3Channel
agent.sources.wangsu1Source.check = /tigard/collector/apache-flume-1.6.0-bin/mvod/wangsu1log/checkpoint.conf
agent.sources.wangsu1Source.batchSize = 1048576
agent.sources.wangsu1Source.startTime = 2018041400
agent.sources.wangsu1Source.backupDir = /data1/wangnei
agent.sources.wangsu1Source.errorLog = /usr/local/errorLog.log
agent.sources.wangsu1Source.regex = (^MG_CDN_3004_)[0-9]{12}_(.+)(.log.gz$)
agent.sources.wangsu1Source.userId = 3004
agent.sources.wangsu1Source.logType = migu
agent.sources.wangsu1Source.resourceIds = 65
agent.sources.wangsu1Source.recordFileName = /tigard/collector/apache-flume-1.6.0-bin/mvod/wangsu1log/recordFileName.conf


agent.sources.wangsu1Source.selector.type = multiplexing
agent.sources.wangsu1Source.selector.header = State
agent.sources.wangsu1Source.selector.mapping.AZ = wangsu1Channel
agent.sources.wangsu1Source.selector.mapping.BZ = wangsu3Channel
agent.sources.wangsu1Source.selector.mapping.CZ = wangsu2Channel
agent.sources.wangsu1Source.selector.default = wangsu1Channel


agent.sinks.wangsu1Sink.type = com.mmtrix.sink.test2
agent.sinks.wangsu1Sink.channel = wangsu1Channel
agent.sinks.wangsu1Sink.topic = migulog_2_replica
agent.sinks.wangsu1Sink.requiredAcks = -1
agent.sinks.wangsu1Sink.batchSize = 10000
agent.sinks.wangsu1Sink.brokerList = mg001.mq.tigard.com:19092,mg002.mq.tigard.com:19092,mg003.mq.tigard.com:19092,mg004.mq.tigard.com:19092,mg005.mq.tigard.com:19092,mg006.mq.tigard.com:19092
agent.sinks.wangsu1Sink.kafka.compression.codec = snappy
agent.sinks.wangsu1Sink.kafka.producer.type = sync
agent.sinks.wangsu1Sink.serializer.class=kafka.serializer.StringEncoder

agent.channels.wangsu1Channel.type = memory
agent.channels.wangsu1Channel.capacity = 5000000
agent.channels.wangsu1Channel.transactionCapacity = 50000


agent.sinks.wangsu2Sink.type = com.mmtrix.sink.test2
agent.sinks.wangsu2Sink.channel = wangsu2Channel
agent.sinks.wangsu2Sink.topic = migulog_2_replica
agent.sinks.wangsu2Sink.requiredAcks = -1
agent.sinks.wangsu2Sink.batchSize = 10000
agent.sinks.wangsu2Sink.brokerList = mg001.mq.tigard.com:19092,mg002.mq.tigard.com:19092,mg003.mq.tigard.com:19092,mg004.mq.tigard.com:19092,mg005.mq.tigard.com:19092,mg006.mq.tigard.com:19092
agent.sinks.wangsu2Sink.kafka.compression.codec = snappy
agent.sinks.wangsu2Sink.kafka.producer.type = sync
agent.sinks.wangsu2Sink.serializer.class=kafka.serializer.StringEncoder

agent.channels.wangsu2Channel.type = memory
agent.channels.wangsu2Channel.capacity = 5000000
agent.channels.wangsu2Channel.transactionCapacity = 50000


agent.sinks.wangsu3Sink.type = com.mmtrix.sink.test2
agent.sinks.wangsu3Sink.channel = wangsu3Channel
agent.sinks.wangsu3Sink.topic = migulog_2_replica
agent.sinks.wangsu3Sink.requiredAcks = -1
agent.sinks.wangsu3Sink.batchSize = 10000
agent.sinks.wangsu3Sink.brokerList = mg001.mq.tigard.com:19092,mg002.mq.tigard.com:19092,mg003.mq.tigard.com:19092,mg004.mq.tigard.com:19092,mg005.mq.tigard.com:19092,mg006.mq.tigard.com:19092
agent.sinks.wangsu3Sink.kafka.compression.codec = snappy
agent.sinks.wangsu3Sink.kafka.producer.type = sync
agent.sinks.wangsu3Sink.serializer.class=kafka.serializer.StringEncoder

agent.channels.wangsu3Channel.type = memory
agent.channels.wangsu3Channel.capacity = 5000000
agent.channels.wangsu3Channel.transactionCapacity = 50000
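One sizing detail worth noting in the channel settings above (an observation about Flume's memory channel semantics, not from the original post): each sink transaction takes up to batchSize events from the channel, so batchSize must stay at or below the channel's transactionCapacity, which in turn must not exceed capacity.

```
# sizing constraint for each memory channel / sink pair:
#   sink batchSize (10000) <= channel transactionCapacity (50000) <= channel capacity (5000000)
```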

 

The source writes its data to multiple channels according to the rule you define; here I assign events to the three channels at random.

private List<String> channelKeys = Arrays.asList("AZ", "BZ", "CZ");

// Pick one of the three header values at random
String selector = this.channelKeys.get((int) (Math.random() * this.channelKeys.size()));
logger.info("selector: " + selector);
for (byte[] value : mess) {
    if (value.length != 0) {
        logger.debug(new String(value));
        Event event = EventBuilder.withBody(value);
        Map<String, String> map = new HashMap<String, String>();
        map.put("State", selector); // matches selector.header = State in the config
        event.setHeaders(map);
        events.add(event);
    }
}
System.out.println("==============" + (System.currentTimeMillis() / 1000) + "===========read: " + events.size());
// Retry a failed batch; the usual cause of failure is a full channel
boolean loop = true;
while (loop) {
    try {
        getChannelProcessor().processEventBatch(events);
    } catch (Exception e) {
        logger.error("processEventBatch err: " + e.getMessage() + " sleeping 10s...");
        try {
            Thread.sleep(10000L);
        } catch (InterruptedException e1) {
            logger.error(e1);
        }
        continue; // retry the same batch after the pause
    }
    loop = false; // batch written successfully
}
mess.clear();
mess = null;
events.clear();

 
