Flume is a highly available, reliable, distributed system for collecting, aggregating, and moving large volumes of log data. It supports pluggable data senders for collecting data from a logging system, can apply simple processing in flight, and writes to a variety of data receivers. A Flume agent consists of three parts: source, channel, and sink.
In this example we use Flume to collect log data and write it both to Kafka (topic: logs) and to HBase (table: logs). The logs come from Sogou Labs; a program simulates a web server by writing them to a file (webserverlogs).
For installing and configuring the Kafka cluster, see: Kafka configuration and distributed deployment. For installing and configuring the HBase database, see: HBase configuration and distributed deployment.
Note: the hostnames of the three machines are bigdata.centos01, bigdata.centos02, and bigdata.centos03.
I. Architecture Diagram
Note: centos02 and centos03 act as Flume log-collection nodes; centos01 aggregates the logs collected by those two machines and then writes them to Kafka and to the HBase database.
II. Installing and Configuring Flume
1. Download and Install
wget http://archive.apache.org/dist/flume/1.7.0/apache-flume-1.7.0-bin.tar.gz
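Then extract the archive. The target directory below (/opt/modules, consistent with the JAVA_HOME path used later) is just an example:
tar -zxf apache-flume-1.7.0-bin.tar.gz -C /opt/modules/
cd /opt/modules/apache-flume-1.7.0-bin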
2. Configuration
2.1 Log-Collection Node Configuration
As shown in the diagram above, modify the Flume configuration on centos02 and centos03.
- Edit conf/flume-env.sh
export JAVA_HOME=/opt/modules/jdk8
- Edit conf/flume-conf.properties
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# source
# exec source: runs a Unix command that tails the webserverlogs file in real time and feeds new lines into Flume
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/datas/webserverlogs
a1.sources.r1.channels = c1
# channel
a1.channels.c1.type = memory
# sink
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = bigdata.centos01
a1.sinks.k1.port = 5555
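The memory channel above runs with Flume's defaults (a capacity of only 100 events). If tail produces lines faster than the avro sink drains them, the limits can be raised the same way the aggregation node does below, for example:
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 100000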
2.2 Log-Aggregation Node Configuration
As shown in the diagram above, modify the Flume configuration on centos01.
- Edit conf/flume-env.sh
export JAVA_HOME=/opt/modules/jdk8
export HADOOP_HOME=/opt/modules/hadoop-2.5.0
export HBASE_HOME=/opt/modules/hbase-0.98.6-cdh5.3.9
- Edit conf/flume-conf.properties
agent.sources = avroSource
agent.channels = kafkaC hbaseC
agent.sinks = kafkaSink hbaseSink
#********************* Flume + HBase integration *********************
# source
agent.sources.avroSource.type = avro
agent.sources.avroSource.channels = hbaseC kafkaC
agent.sources.avroSource.bind = bigdata.centos01
agent.sources.avroSource.port = 5555
# channel
agent.channels.hbaseC.type = memory
agent.channels.hbaseC.capacity = 100000
agent.channels.hbaseC.transactionCapacity = 100000
agent.channels.hbaseC.keep-alive = 10
# sink
agent.sinks.hbaseSink.type = asynchbase
agent.sinks.hbaseSink.table = logs
agent.sinks.hbaseSink.columnFamily = info
agent.sinks.hbaseSink.serializer = org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer
agent.sinks.hbaseSink.channel = hbaseC
# custom columns: the field names the customized serializer (section 2.3) splits each log line into
agent.sinks.hbaseSink.serializer.payloadColumn = time,userid,searchname,retorder,cliorder,url
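# Each event body must be a comma-separated line whose fields map one-to-one
# onto the columns above, e.g. (an illustrative record, not from the real dataset):
# 20111230000005,57375476989eea12893c0c3811607bcf,qiyi,1,1,http://www.qiyi.com/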
#********************* Flume + Kafka integration *********************
# channel
agent.channels.kafkaC.type = memory
agent.channels.kafkaC.capacity = 100000
agent.channels.kafkaC.transactionCapacity = 100000
agent.channels.kafkaC.keep-alive = 10
# sink
agent.sinks.kafkaSink.channel = kafkaC
agent.sinks.kafkaSink.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafkaSink.kafka.topic = logs
agent.sinks.kafkaSink.kafka.bootstrap.servers = bigdata.centos01:9092,bigdata.centos02:9092,bigdata.centos03:9092
agent.sinks.kafkaSink.kafka.producer.acks = 1
agent.sinks.kafkaSink.kafka.flumeBatchSize = 50
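With kafka.producer.acks = 1 the producer waits only for the partition leader's acknowledgement before considering a write successful, and kafka.flumeBatchSize controls how many events the sink hands to Kafka in a single batch.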
2.3 Customizing the Flume HBase Sink
- Download the source
http://archive.apache.org/dist/flume/1.7.0/apache-flume-1.7.0-src.tar.gz
- Code changes
The SimpleAsyncHbaseEventSerializer class configured for the centos01 hbaseSink above does not meet our needs: by default it writes each event into a single HBase column, while we need to split the log line and store the fields in the separate columns defined above. We therefore customize the class, mainly the getActions method.
package org.apache.flume.sink.hbase;

import com.google.common.base.Charsets;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.FlumeException;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.flume.sink.hbase.SimpleHbaseEventSerializer.KeyType;
import org.hbase.async.AtomicIncrementRequest;
import org.hbase.async.PutRequest;

import java.util.ArrayList;
import java.util.List;

public class SimpleAsyncHbaseEventSerializer implements AsyncHbaseEventSerializer {
  private byte[] table;
  private byte[] cf;
  private byte[] payload;
  private byte[] payloadColumn;
  private byte[] incrementColumn;
  private String rowPrefix;
  private byte[] incrementRow;
  private KeyType keyType;

  @Override
  public void initialize(byte[] table, byte[] cf) {
    this.table = table;
    this.cf = cf;
  }

  @Override
  public List<PutRequest> getActions() {
    List<PutRequest> actions = new ArrayList<PutRequest>();
    if (payloadColumn != null) {
      try {
        // Split the configured column names and the event body on commas.
        String[] columns = new String(payloadColumn, Charsets.UTF_8).split(",");
        String[] values = new String(payload, Charsets.UTF_8).split(",");
        // Skip malformed lines whose field count does not match the column list.
        if (columns.length != values.length) {
          return actions;
        }
        // Row key: time + userid + current timestamp, which keeps rows for the
        // same user and second distinct.
        String rowKey = values[0] + values[1] + System.currentTimeMillis();
        // One PutRequest per field, so each value lands in its own column.
        for (int i = 0; i < columns.length; i++) {
          PutRequest putRequest = new PutRequest(table, rowKey.getBytes(Charsets.UTF_8), cf,
              columns[i].getBytes(Charsets.UTF_8), values[i].getBytes(Charsets.UTF_8));
          actions.add(putRequest);
        }
      } catch (Exception e) {
        throw new FlumeException("Could not get row key!", e);
      }
    }
    return actions;
  }

  @Override
  public List<AtomicIncrementRequest> getIncrements() {
    List<AtomicIncrementRequest> actions = new ArrayList<AtomicIncrementRequest>();
    if (incrementColumn != null) {
      AtomicIncrementRequest inc = new AtomicIncrementRequest(table,
          incrementRow, cf, incrementColumn);
      actions.add(inc);
    }
    return actions;
  }

  @Override
  public void cleanUp() {
    // nothing to clean up
  }

  @Override
  public void configure(Context context) {
    String pCol = context.getString("payloadColumn", "pCol");
    String iCol = context.getString("incrementColumn", "iCol");
    rowPrefix = context.getString("rowPrefix", "default");
    String suffix = context.getString("suffix", "uuid");
    if (pCol != null && !pCol.isEmpty()) {
      // The rowPrefix/suffix key-type handling is kept from the original class,
      // but getActions() above builds its own row key and does not use it.
      if (suffix.equals("timestamp")) {
        keyType = KeyType.TS;
      } else if (suffix.equals("random")) {
        keyType = KeyType.RANDOM;
      } else if (suffix.equals("nano")) {
        keyType = KeyType.TSNANO;
      } else {
        keyType = KeyType.UUID;
      }
      payloadColumn = pCol.getBytes(Charsets.UTF_8);
    }
    if (iCol != null && !iCol.isEmpty()) {
      incrementColumn = iCol.getBytes(Charsets.UTF_8);
    }
    incrementRow = context.getString("incrementRow", "incRow").getBytes(Charsets.UTF_8);
  }

  @Override
  public void setEvent(Event event) {
    this.payload = event.getBody();
  }

  @Override
  public void configure(ComponentConfiguration conf) {
    // not used
  }
}
Delete flume-ng-hbase-sink-1.7.0.jar from Flume's lib directory, rebuild the flume-ng-hbase-sink module, rename the resulting jar to flume-ng-hbase-sink-1.7.0.jar, and upload it to the lib directory. My prebuilt jar: flume-ng-hbase-sink.jar
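One way to rebuild just the sink module (a sketch, assuming Maven is installed; -am also builds the modules it depends on):
cd apache-flume-1.7.0-src
mvn clean package -DskipTests -pl flume-ng-sinks/flume-ng-hbase-sink -am
# the jar is produced under flume-ng-sinks/flume-ng-hbase-sink/target/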
III. Developing the Simulator
The simulator writes the Sogou Labs log file line by line into the webserverlogs file, for the log-collection Flume agents to pick up.
package com.wmh.writeread;

import java.io.*;

public class ReadWrite {
  public static void main(String[] args) {
    String readFileName = args[0];
    String writeFileName = args[1];
    try {
      System.out.println("Running...");
      process(readFileName, writeFileName);
      System.out.println("Done!");
    } catch (Exception e) {
      System.out.println("Failed: " + e.getMessage());
    }
  }

  private static void process(String readFileName, String writeFileName) throws Exception {
    File readFile = new File(readFileName);
    if (!readFile.exists()) {
      System.out.println(readFileName + " does not exist, please check the path!");
      System.exit(1);
    }
    File writeFile = new File(writeFileName);
    BufferedReader br = null;
    FileOutputStream fos = null;
    long count = 1L;
    try {
      br = new BufferedReader(new InputStreamReader(new FileInputStream(readFile), "utf-8"));
      fos = new FileOutputStream(writeFile);
      String line;
      while ((line = br.readLine()) != null) {
        fos.write((line + "\n").getBytes("utf-8"));
        fos.flush();
        // Sleep 100 ms per line to simulate a web server appending logs.
        Thread.sleep(100);
        System.out.println(String.format("row:[%d]>>>>>>>>>> %s", count++, line));
      }
    } finally {
      // Close resources whether or not an exception occurred; closing the
      // reader also closes the underlying input stream.
      if (br != null) br.close();
      if (fos != null) fos.close();
    }
  }
}
IV. Starting the Services and Testing
1. Starting the Services
1.1 ZooKeeper
Start ZooKeeper on centos01, centos02, and centos03:
bin/zkServer.sh start
1.2 HDFS
Start the NameNode and DataNode on centos01, and the DataNodes on centos02 and centos03:
sbin/start-dfs.sh
1.3 HBase
- Start the service
Start the HMaster and RegionServer on centos01, and the RegionServers on centos02 and centos03:
bin/start-hbase.sh
- Create the table
# enter the HBase shell
bin/hbase shell
# create the table
> create "logs","info"
1.4 Kafka
- Start the service
Start Kafka on centos01, centos02, and centos03:
bin/kafka-server-start.sh config/server.properties
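Run the command on each of the three nodes; to keep the broker running after the shell exits, the script's -daemon flag can be used:
bin/kafka-server-start.sh -daemon config/server.properties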
- Create the topic
Create the topic named logs that Flume is configured to write to:
# --replication-factor: number of replicas; set it to the cluster size so every broker can serve the topic on its own
bin/kafka-topics.sh --create --zookeeper bigdata.centos01:2181,bigdata.centos02:2181,bigdata.centos03:2181 --replication-factor 3 --partitions 1 --topic logs
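You can confirm the topic exists with the same ZooKeeper connection string:
bin/kafka-topics.sh --describe --zookeeper bigdata.centos01:2181,bigdata.centos02:2181,bigdata.centos03:2181 --topic logs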
- Start a consumer
Start a Kafka console consumer on any one of the machines, for testing:
bin/kafka-console-consumer.sh --zookeeper bigdata.centos01:2181,bigdata.centos02:2181,bigdata.centos03:2181 --topic logs --from-beginning
1.5 Flume
- Start the log-aggregation Flume agent on centos01:
bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties -n agent -Dflume.root.logger=INFO,console
- Start the log-collection Flume agents on centos02 and centos03:
bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties -n a1 -Dflume.root.logger=INFO,console
The main difference is the agent name: the centos01 agent is named agent, while the centos02 and centos03 agents are named a1.
1.6 Start the Simulator
Package the simulator class above into a jar, then run it:
# formatedSougouLog is the pre-formatted data, with \t and spaces replaced by commas (see the sketch after this command)
java -jar ReadWrite.jar formatedSougouLog webserverlogs
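One possible way to produce formatedSougouLog from the raw Sogou file (a sketch; SogouQ.sample is a placeholder filename, and this assumes GNU sed and that the query text itself contains no commas):
sed -e 's/\t/,/g' -e 's/ \+/,/g' SogouQ.sample > formatedSougouLog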
2. Testing
- Data inserted into HBase
- Data fetched from Kafka
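To spot-check the HBase side, scan a few rows from the HBase shell:
bin/hbase shell
> scan 'logs', {LIMIT => 5}
The console consumer started in step 1.4 should print the same comma-separated lines as they arrive.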