Flume custom interceptor: using the timestamp in business data to partition data correctly in HDFS

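Business scenario: tracking (event-instrumentation) data is loaded into a Hive table, and each record carries a field holding the time at which it was produced.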

Data flow: Kafka -> Flume -> HDFS -> Hive

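Problem: which partition does late-arriving tracking data land in? If a record was produced at 9 o'clock but is delayed by reporting or by the Flume sink, does it still land in the 9 o'clock partition? The answer is no.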

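A piece of data ingested by Flume is called an event, and an event consists of headers and a body. Suppose your interceptor and sink are configured like this: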

test.sources.s1.interceptors = i1
test.sources.s1.interceptors.i1.type = timestamp

test.sinks.k1.type = hdfs
test.sinks.k1.hdfs.path = /path/%Y%m%d/%H
test.sinks.k1.hdfs.filePrefix = interceptor-memory-channel-

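The /%Y%m%d/%H part of the path is computed from the timestamp in the event headers. The timestamp interceptor sets that header to the time at which Flume processes the event, not the time at which the data was produced. So if the volume pulled from Kafka is large and a backlog builds up, late-arriving data is stamped with the later processing time and does not land in the partition it belongs to.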
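To make the effect concrete, here is a minimal sketch (not Flume's own code) of how an epoch-millisecond timestamp maps to a %Y%m%d/%H directory. The class name PartitionPathDemo, the helper partitionFor, the sample times, and the fixed Asia/Shanghai zone are all assumptions for illustration; the HDFS sink resolves the escape sequences according to its own time zone settings.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class PartitionPathDemo {
    // Equivalent of the %Y%m%d/%H escape sequence: yyyyMMdd/HH.
    private static final DateTimeFormatter PARTITION =
            DateTimeFormatter.ofPattern("yyyyMMdd/HH");

    // Hypothetical helper: formats a header timestamp as a partition directory.
    public static String partitionFor(long epochMillis, ZoneId zone) {
        return PARTITION.format(Instant.ofEpochMilli(epochMillis).atZone(zone));
    }

    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("Asia/Shanghai"); // assumed zone for the example

        // The record was produced at 09:59:30 ...
        long producedAt = ZonedDateTime.of(2020, 3, 20, 9, 59, 30, 0, zone)
                .toInstant().toEpochMilli();
        // ... but Flume only processed it at 10:02:00 because of a backlog.
        long processedAt = ZonedDateTime.of(2020, 3, 20, 10, 2, 0, 0, zone)
                .toInstant().toEpochMilli();

        // Partition wanted for the record (event time).
        System.out.println(partitionFor(producedAt, zone));
        // Partition obtained when the header carries the processing time.
        System.out.println(partitionFor(processedAt, zone));
    }
}
```

With these sample values the first line is the 09 directory and the second is the 10 directory, which is exactly the mismatch described above.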

To solve this, we need a custom interceptor that changes the timestamp that determines the partition.

Approach: adapt the code of Flume's timestamp interceptor so that the timestamp header is set to the time carried in the event itself.
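The core of this approach is turning the time string in the event into epoch milliseconds, and that step can be tried out without Flume. The sketch below is one possible way to do it, assuming values shaped like 2020-03-20T04:46:42.926+0800. It uses java.time rather than the SimpleDateFormat used in the interceptor code, because DateTimeFormatter is thread-safe and can be shared; the class name ServerTimeParser and the method toEpochMillis are hypothetical names.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class ServerTimeParser {
    // Matches values such as 2020-03-20T04:46:42.926+0800.
    private static final DateTimeFormatter SERVER_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ");

    // Returns the event time in epoch milliseconds, or fallbackMillis
    // when the value is missing or does not match the expected format.
    public static long toEpochMillis(String serverTime, long fallbackMillis) {
        if (serverTime == null || serverTime.isEmpty()) {
            return fallbackMillis;
        }
        try {
            return OffsetDateTime.parse(serverTime, SERVER_TIME)
                    .toInstant().toEpochMilli();
        } catch (DateTimeParseException e) {
            return fallbackMillis;
        }
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // A well-formed value yields the event time.
        System.out.println(toEpochMillis("2020-03-20T04:46:42.926+0800", now));
        // A malformed value falls back to the supplied time.
        System.out.println(toEpochMillis("not a time", now) == now);
    }
}
```

Passing the fallback in as a parameter keeps the method deterministic, which makes it easy to unit test before wiring it into an interceptor.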

pom.xml configuration

    <dependencies>
        <dependency>
            <groupId>org.apache.flume</groupId>
            <artifactId>flume-ng-core</artifactId>
            <version>1.9.0</version>
        </dependency>

        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.1.23</version>
        </dependency>
    </dependencies>


    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
        </plugins>
    </build>
TimeInterceptor.java
package zm.develop;

import com.alibaba.fastjson.JSON;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.nio.charset.StandardCharsets;
import java.text.SimpleDateFormat;
import java.util.List;
import java.util.Map;

public class TimeInterceptor implements Interceptor {

    private static final Logger logger = LoggerFactory.getLogger(TimeInterceptor.class);
    private final boolean preserveExisting;
    private final String header;

    private TimeInterceptor(boolean preserveExisting, String header) {
        this.preserveExisting = preserveExisting;
        this.header = header;
    }

    @Override
    public void initialize() { }

    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        // Keep a timestamp that was set upstream when preserveExisting is enabled.
        if (preserveExisting && headers.containsKey(header)) {
            return event;
        }

        // Fall back to the processing time when server_time is missing or cannot
        // be parsed, so the header is always present for the HDFS sink.
        long timeStamp = System.currentTimeMillis();
        try {
            // Use the server_time field of the event body
            // (e.g. 2020-03-20T04:46:42.926+0800) as the timestamp.
            String line = new String(event.getBody(), StandardCharsets.UTF_8);
            String serverTime = JSON.parseObject(line).getString("server_time");
            if (serverTime != null && !serverTime.isEmpty()) {
                // SimpleDateFormat is not thread-safe, so create one per call.
                timeStamp = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZ")
                        .parse(serverTime).getTime();
            }
        } catch (Exception e) {
            logger.warn("Could not read server_time, using the current time instead", e);
        }
        headers.put(header, Long.toString(timeStamp));
        return event;
    }

    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    public void close() { }


    public static class Builder implements Interceptor.Builder {

        private boolean preserveExisting = false;
        private String header = "timestamp";

        @Override
        public Interceptor build() {
            return new TimeInterceptor(preserveExisting, header);
        }

        @Override
        public void configure(Context context) {
            preserveExisting = context.getBoolean("preserveExisting", false);
            header = context.getString("headerName", "timestamp");
        }
    }
}

Package the custom interceptor as a jar and upload it to the ./lib directory of the Flume installation.

Update the Flume configuration to replace the interceptor with the custom one:

test.sources.s1.interceptors = i1
test.sources.s1.interceptors.i1.type = zm.develop.TimeInterceptor$Builder

In testing and in production use, this interceptor has placed data in the correct partitions.
