Business scenario: tracking (instrumentation) events land in a Hive table partitioned by time, and each event carries a field recording when it was produced.
Pipeline: Kafka -> Flume -> HDFS -> Hive
Problem: which partition do late-arriving events end up in? If an event is produced at 09:00 but its upload, or the Flume sink, is delayed, does it still land in the 09:00 partition? The answer is no.
A record pulled by Flume is called an event, and an event consists of a header and a body. Suppose the agent is configured as follows:
test.sources.s1.interceptors = i1
test.sources.s1.interceptors.i1.type = timestamp
test.sinks.k1.type = hdfs
test.sinks.k1.hdfs.path = /path/%Y%m%d/%H
test.sinks.k1.hdfs.filePrefix = interceptor-memory-channel-
The /%Y%m%d/%H escapes in hdfs.path are resolved from the timestamp entry in the event header. While Flume is pulling from Kafka, a large backlog means late events get stamped with the time they pass through the interceptor rather than the time they were produced, so they do not land in the correct partition.
To fix this, we need a custom interceptor that changes how this partition-determining timestamp is set.
Approach: starting from the source of Flume's built-in timestamp interceptor, set the timestamp header to the event time carried in the event body.
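To make the partition mapping concrete, the sketch below (plain JDK, hypothetical times, time zone fixed to GMT+8 for determinism) mimics how the sink's %Y%m%d/%H escapes turn a header timestamp in epoch milliseconds into a directory, and how stamping a late event with its intake time instead of its event time shifts it into the wrong hour directory:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class PartitionPathDemo {
    // Mimic the sink's /path/%Y%m%d/%H escape: format the header
    // timestamp (epoch millis) in the agent's zone, fixed here to GMT+8.
    static String partition(long timestampMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd/HH");
        fmt.setTimeZone(TimeZone.getTimeZone("GMT+8"));
        return fmt.format(new Date(timestampMillis));
    }

    public static void main(String[] args) throws ParseException {
        SimpleDateFormat parse = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZ");
        // Event produced at 09:58 local time...
        long eventTime = parse.parse("2020-03-20T09:58:00.000+0800").getTime();
        // ...but only pulled through Flume at 10:03 because of backlog.
        long intakeTime = parse.parse("2020-03-20T10:03:00.000+0800").getTime();

        System.out.println(partition(eventTime));   // correct partition: 20200320/09
        System.out.println(partition(intakeTime));  // where the default timestamp interceptor puts it: 20200320/10
    }
}
```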
pom.xml configuration:
<dependencies>
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-core</artifactId>
<version>1.9.0</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.1.23</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
</plugins>
</build>
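Flume does not ship fastjson, so the interceptor jar must bring it along (or the fastjson jar must be copied into Flume's ./lib alongside the interceptor jar). One common approach, sketched here as an assumption rather than part of the original build, is adding maven-shade-plugin to the plugins above so that mvn package produces a fat jar; flume-ng-core should then be marked <scope>provided</scope> so it is not bundled in:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.4</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```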
TimeInterceptor.java
package zm.develop;

import com.alibaba.fastjson.JSON;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.nio.charset.StandardCharsets;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;
import java.util.Map;

public class TimeInterceptor implements Interceptor {

    private static final Logger logger = LoggerFactory.getLogger(TimeInterceptor.class);

    private final boolean preserveExisting;
    private final String header;

    private TimeInterceptor(boolean preserveExisting, String header) {
        this.preserveExisting = preserveExisting;
        this.header = header;
    }

    @Override
    public void initialize() { }

    @Override
    public Event intercept(Event event) {
        try {
            Map<String, String> headers = event.getHeaders();
            // Honor preserveExisting: keep a timestamp that is already set.
            if (preserveExisting && headers.containsKey(header)) {
                return event;
            }
            // Use the timestamp of the event's own server_time field
            // (e.g. 2020-03-20T04:46:42.926+0800) as the header timestamp;
            // fall back to the current time when the field is missing.
            String line = new String(event.getBody(), StandardCharsets.UTF_8);
            String serverTime = JSON.parseObject(line).getString("server_time");
            long timeStamp;
            if (serverTime == null || serverTime.isEmpty()) {
                timeStamp = System.currentTimeMillis();
            } else {
                // SimpleDateFormat is not thread-safe, so create one per call.
                Date date = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZ").parse(serverTime);
                timeStamp = date.getTime();
            }
            headers.put(header, Long.toString(timeStamp));
        } catch (Exception e) {
            // Never drop the event; log and pass it through unchanged.
            logger.warn("Failed to derive timestamp from event body", e);
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() { }

    public static class Builder implements Interceptor.Builder {
        private boolean preserveExisting = false;
        private String header = "timestamp";

        @Override
        public Interceptor build() {
            return new TimeInterceptor(preserveExisting, header);
        }

        @Override
        public void configure(Context context) {
            preserveExisting = context.getBoolean("preserveExisting", false);
            header = context.getString("headerName", "timestamp");
        }
    }
}
Package the custom interceptor and upload the jar to Flume's ./lib directory.
Update the Flume configuration to swap in the custom interceptor:
test.sources.s1.interceptors = i1
test.sources.s1.interceptors.i1.type = zm.develop.TimeInterceptor$Builder
Tested and used in production, this interceptor puts the data into the correct partitions.
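For reference, a minimal end-to-end agent configuration under these assumptions might look like the sketch below. The broker address and topic name are placeholders, and the memory-channel sizing is illustrative; hdfs.useLocalTimeStamp stays at its default of false so the path escapes keep reading the header timestamp set by the interceptor:

```
test.sources = s1
test.channels = c1
test.sinks = k1

test.sources.s1.type = org.apache.flume.source.kafka.KafkaSource
test.sources.s1.kafka.bootstrap.servers = broker1:9092
test.sources.s1.kafka.topics = tracking_events
test.sources.s1.channels = c1
test.sources.s1.interceptors = i1
test.sources.s1.interceptors.i1.type = zm.develop.TimeInterceptor$Builder

test.channels.c1.type = memory
test.channels.c1.capacity = 10000

test.sinks.k1.type = hdfs
test.sinks.k1.channel = c1
test.sinks.k1.hdfs.path = /path/%Y%m%d/%H
test.sinks.k1.hdfs.filePrefix = interceptor-memory-channel-
test.sinks.k1.hdfs.useLocalTimeStamp = false
```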