Business scenario: tracking-event data is written to a Hive table, and each event carries the time at which it was generated.
Pipeline: kafka -> flume -> hdfs -> hive
Problem: which partition does late-arriving tracking data end up in? If an event is generated at 9:00 but its upload, or the Flume sink, is delayed, will it still land in the 9 o'clock partition? The answer is no.
Each record Flume ingests is called an event, and an event consists of a header and a body. Suppose your Flume agent is configured as follows:
test.sources.s1.interceptors = i1
test.sources.s1.interceptors.i1.type = timestamp
test.sinks.k1.type = hdfs
test.sinks.k1.hdfs.path = /path/%Y%m%d/%H
test.sinks.k1.hdfs.filePrefix = interceptor-memory-channel-
The %Y%m%d/%H in the path is resolved from the timestamp in the event header. When Flume pulls from Kafka and the volume is large enough to build a backlog, late events get stamped with the (delayed) ingestion time, so they fail to land in the partition that matches their actual event time.
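As a rough illustration (this is not Flume's actual BucketPath implementation, and the class name is mine), resolving the path escapes boils down to formatting the header's epoch-millis timestamp in the agent's timezone:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Sketch of how an hdfs.path pattern like %Y%m%d/%H is derived from
// the epoch-millis value stored under the "timestamp" header key.
public class BucketPathSketch {
    public static String resolve(long timestampMillis, TimeZone tz) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd/HH");
        fmt.setTimeZone(tz);
        return fmt.format(new Date(timestampMillis));
    }

    public static void main(String[] args) {
        // 2020-03-20T09:30:00+0800 as epoch millis
        long ts = 1584667800000L;
        System.out.println(resolve(ts, TimeZone.getTimeZone("GMT+8")));
        // prints 20200320/09
    }
}
```

Whatever millis value sits in the header at sink time decides the directory, which is exactly why a backlog shifts events into the wrong hour.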
To solve this, we need a custom interceptor that changes how the partition-determining timestamp is computed.
Approach: modify the code of Flume's built-in timestamp interceptor so that the timestamp placed in the header is taken from the time field inside the event itself.
pom.xml configuration:
<dependencies>
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.9.0</version>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.1.23</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
    </plugins>
</build>
TimeInterceptor.java:
package zm.develop;
import com.alibaba.fastjson.JSON;
import java.nio.charset.StandardCharsets;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;
import java.util.Map;
public class TimeInterceptor implements Interceptor {
    private static final Logger logger = LoggerFactory.getLogger(TimeInterceptor.class);
    private final boolean preserveExisting;
    private final String header;

    private TimeInterceptor(boolean preserveExisting, String header) {
        this.preserveExisting = preserveExisting;
        this.header = header;
    }

    @Override
    public void initialize() { }

    @Override
    public Event intercept(Event event) {
        try {
            Map<String, String> headers = event.getHeaders();
            // Honor preserveExisting: keep a timestamp an earlier interceptor already set
            if (preserveExisting && headers.containsKey(header)) {
                return event;
            }
            // Use the server_time field in the event body
            // (e.g. "2020-03-20T04:46:42.926+0800") as the timestamp
            String line = new String(event.getBody(), StandardCharsets.UTF_8);
            String serverTime = JSON.parseObject(line).getString("server_time");
            long timeStamp;
            if (serverTime == null || serverTime.isEmpty()) {
                // Fall back to ingestion time when the field is absent
                timeStamp = System.currentTimeMillis();
            } else {
                Date date = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZ").parse(serverTime);
                timeStamp = date.getTime();
            }
            headers.put(header, Long.toString(timeStamp));
        } catch (Exception e) {
            // Never drop the event: log the failure and pass it through unchanged
            logger.warn("Failed to extract server_time from event body", e);
        }
        return event;
    }
public List<Event> intercept(List<Event> events) {
for (Event event : events) {
intercept(event);
}
return events;
}
public void close() { }
public static class Builder implements Interceptor.Builder {
private boolean preserveExisting = false;
private String header = "timestamp";
@Override
public Interceptor build() {
return new TimeInterceptor(preserveExisting, header);
}
@Override
public void configure(Context context) {
preserveExisting = context.getBoolean("preserveExisting", false);
header = context.getString("headerName", "timestamp");
}
}
}
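The heart of the interceptor is the server_time conversion. As a self-contained sketch of just that step (the class name ServerTimeParser is mine, for demonstration only):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

// Standalone illustration of the server_time -> epoch-millis conversion
// performed inside intercept() above.
public class ServerTimeParser {
    // Parses e.g. "2020-03-20T04:46:42.926+0800"; the RFC-822 'Z' pattern
    // token consumes the "+0800" offset, so the result is timezone-correct.
    public static long toEpochMillis(String serverTime) throws ParseException {
        return new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZ")
                .parse(serverTime)
                .getTime();
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(toEpochMillis("2020-03-20T04:46:42.926+0800"));
        // prints 1584650802926
    }
}
```

That millis value is what ends up under the header's timestamp key, and hence what the HDFS sink's %Y%m%d/%H escapes resolve against.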
Package the custom interceptor as a jar and upload it to Flume's ./lib directory.
Then modify the Flume conf to swap in the custom interceptor:
test.sources.s1.interceptors = i1
test.sources.s1.interceptors.i1.type = zm.develop.TimeInterceptor$Builder
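For context, here is a minimal single-agent configuration sketch wiring a Kafka source, a memory channel, and the HDFS sink around this interceptor; the broker and topic values are placeholders, and the property names follow the Flume 1.9 conventions:

```properties
test.sources = s1
test.channels = c1
test.sinks = k1

test.sources.s1.type = org.apache.flume.source.kafka.KafkaSource
test.sources.s1.kafka.bootstrap.servers = <broker-list>
test.sources.s1.kafka.topics = <topic>
test.sources.s1.channels = c1
test.sources.s1.interceptors = i1
test.sources.s1.interceptors.i1.type = zm.develop.TimeInterceptor$Builder

test.channels.c1.type = memory

test.sinks.k1.type = hdfs
test.sinks.k1.hdfs.path = /path/%Y%m%d/%H
test.sinks.k1.channel = c1
```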
Verified in both test and production environments, this interceptor lands data in the correct partitions.