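Flume custom interceptor: partitioning data correctly in HDFS by the timestamp in the business data

Scenario: tracking (event-instrumentation) data is loaded into a Hive table, and each record carries a field holding the time the data was produced.

Pipeline: kafka -> flume -> hdfs -> hive

Question: which partition does late-arriving tracking data land in? If a record is produced during the 9 o'clock hour but arrives late, because of reporting delay or a slow Flume sink, does it still land in the 9 o'clock partition? With the stock timestamp interceptor, no.

The data Flume ingests is called an event, and an event consists of headers and a body. Suppose your Flume configuration is: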

test.sources.s1.interceptors = i1
test.sources.s1.interceptors.i1.type = timestamp

test.sinks.k1.type = hdfs
test.sinks.k1.hdfs.path = /path/%Y%m%d/%H
test.sinks.k1.hdfs.filePrefix = interceptor-memory-channel-

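The /%Y%m%d/%H part of the path is computed from the timestamp in the event headers. The timestamp interceptor sets that header to the time at which Flume processes the event. So when Flume is pulling from Kafka and the volume is large enough to build a backlog, late data is stamped with the processing time and does not land in the partition for the time it was actually produced.

To fix this, we need a custom interceptor that changes the timestamp that determines the partition.

Approach: start from the code of Flume's timestamp interceptor and set the timestamp header to the time carried in the event body instead.

pom.xml configuration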
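To make the effect concrete, here is a small stdlib-only sketch of how an epoch-millisecond timestamp maps to a `%Y%m%d/%H` style partition. It is not Flume's own escaping code; the class name `PartitionPathDemo`, the `Asia/Shanghai` time zone, and the five-minute delay are assumptions chosen for illustration.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class PartitionPathDemo {
    // Equivalent of the %Y%m%d/%H escape sequence used in hdfs.path.
    private static final DateTimeFormatter PARTITION =
            DateTimeFormatter.ofPattern("yyyyMMdd/HH");

    // Maps an epoch-millisecond timestamp to its day/hour partition in the given zone.
    public static String partitionFor(long epochMillis, ZoneId zone) {
        return PARTITION.format(Instant.ofEpochMilli(epochMillis).atZone(zone));
    }

    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("Asia/Shanghai"); // assumed time zone
        // The record is produced at 09:59:30 ...
        long producedAt = ZonedDateTime.of(2020, 3, 20, 9, 59, 30, 0, zone)
                .toInstant().toEpochMilli();
        // ... but Flume only processes it five minutes later (assumed delay).
        long processedAt = producedAt + 5 * 60 * 1000L;

        System.out.println(partitionFor(producedAt, zone));  // prints 20200320/09
        System.out.println(partitionFor(processedAt, zone)); // prints 20200320/10
    }
}
```

With the processing time in the header, the record crosses the hour boundary and ends up in the 10 o'clock directory even though it belongs to 9 o'clock.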

    <dependencies>
        <dependency>
            <groupId>org.apache.flume</groupId>
            <artifactId>flume-ng-core</artifactId>
            <version>1.9.0</version>
        </dependency>

        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.1.23</version>
        </dependency>
    </dependencies>


    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
        </plugins>
    </build>
TimeInterceptor.java
package zm.develop;

import com.alibaba.fastjson.JSON;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.nio.charset.StandardCharsets;
import java.text.SimpleDateFormat;
import java.util.List;
import java.util.Map;

public class TimeInterceptor implements Interceptor {

    private static final Logger logger = LoggerFactory.getLogger(TimeInterceptor.class);
    // Format of the server_time field, e.g. 2020-03-20T04:46:42.926+0800
    private static final String SERVER_TIME_PATTERN = "yyyy-MM-dd'T'HH:mm:ss.SSSZ";

    private final boolean preserveExisting;
    private final String header;

    private TimeInterceptor(boolean preserveExisting, String header) {
        this.preserveExisting = preserveExisting;
        this.header = header;
    }

    public void initialize() { }

    public Event intercept(Event event) {
        try {
            Map<String, String> headers = event.getHeaders();

            // Keep an existing header value only when preserveExisting is enabled (default: false).
            if (preserveExisting && headers.containsKey(header)) {
                return event;
            }

            // Use the server_time field of the JSON body as the event timestamp.
            String line = new String(event.getBody(), StandardCharsets.UTF_8);
            String serverTime = JSON.parseObject(line).getString("server_time");

            long timeStamp;
            if (serverTime == null || serverTime.isEmpty()) {
                // No event time in the body: fall back to the processing time.
                timeStamp = System.currentTimeMillis();
            } else {
                // SimpleDateFormat is not thread-safe, so create a new instance per call.
                timeStamp = new SimpleDateFormat(SERVER_TIME_PATTERN).parse(serverTime).getTime();
            }
            headers.put(header, Long.toString(timeStamp));
        } catch (Exception e) {
            // Malformed JSON or date: leave the headers untouched and pass the event on.
            logger.warn("Failed to read server_time from event body", e);
        }
        return event;
    }

    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    public void close() { }


    public static class Builder implements Interceptor.Builder {

        private boolean preserveExisting = false;
        private String header = "timestamp";

        @Override
        public Interceptor build() {
            return new TimeInterceptor(preserveExisting, header);
        }

        @Override
        public void configure(Context context) {
            preserveExisting = context.getBoolean("preserveExisting", false);
            header = context.getString("headerName", "timestamp");
        }
    }
}
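The time-resolution step can also be tried on its own, without Flume or fastjson on the classpath. The sketch below is a variation rather than the interceptor's actual code: it swaps `SimpleDateFormat` for the thread-safe `java.time.format.DateTimeFormatter`, and it falls back to a caller-supplied value for both a missing and a malformed `server_time`. The class name `ServerTimeParser` and the method `resolveTimestamp` are hypothetical names used only for this example.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class ServerTimeParser {
    // Same pattern as the interceptor; DateTimeFormatter is immutable and thread-safe,
    // so a single shared instance is enough.
    private static final DateTimeFormatter SERVER_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ");

    // Returns the epoch milliseconds encoded in serverTime, or fallbackMillis
    // when the field is missing, empty, or cannot be parsed.
    public static long resolveTimestamp(String serverTime, long fallbackMillis) {
        if (serverTime == null || serverTime.isEmpty()) {
            return fallbackMillis;
        }
        try {
            return OffsetDateTime.parse(serverTime, SERVER_TIME).toInstant().toEpochMilli();
        } catch (DateTimeParseException e) {
            return fallbackMillis;
        }
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // A well-formed value yields the event time, independent of "now".
        System.out.println(resolveTimestamp("2020-03-20T04:46:42.926+0800", now));
        // Missing or malformed values yield the fallback.
        System.out.println(resolveTimestamp(null, now) == now);         // prints true
        System.out.println(resolveTimestamp("not-a-date", now) == now); // prints true
    }
}
```

Whether a malformed value should fall back to the processing time or leave the header untouched is a design choice; the sketch picks the fallback so that every event ends up with a usable timestamp.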

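Package the custom interceptor as a jar and upload it to Flume's ./lib directory. If the jar does not bundle fastjson, copy the fastjson jar there as well.

Update the Flume configuration to use the custom interceptor: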

test.sources.s1.interceptors = i1
test.sources.s1.interceptors.i1.type = zm.develop.TimeInterceptor$Builder

In testing and in production use, this interceptor has placed data in the correct partitions.
