Serializer interface source article: https://blogs.apache.org/flume/entry/streaming_data_into_apache_hbase
Reference blog: https://blog.csdn.net/m0_37739193/article/details/72868456
Goal: have Flume take data out of the event and use it as the HBase rowkey.
Flume receives the data and writes it straight into HBase; the data must not be staged to intermediate files along the way.
Flume uses an HTTP source as the entry point and an HBase sink for the import, while a file channel persists Flume's in-flight data to local disk (so that if the cluster fails, the data is backed up locally).
The incoming data format is http:10.0.0.1_{asdasd}, i.e. url_data.
The result stored in HBase is:
rowkey: currentTime_url
value: data
That is, the incoming data must be split: the URL becomes one part of the rowkey, the current time the other, and the data is stored as the value.
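The splitting and rowkey-building logic can be sketched in isolation before looking at the serializer itself (class and variable names here are illustrative only):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class RowkeyDemo {
    public static void main(String[] args) {
        // Incoming event body, in the url_data format described above
        String body = "http:10.0.0.1_{asdasd}";
        // Split on the first underscore only, so any underscores
        // inside the data payload are preserved
        String[] parts = body.split("_", 2);
        String url = parts[0];    // "http:10.0.0.1"
        String value = parts[1];  // "{asdasd}"
        // rowkey = currentTime_url, matching the scheme above
        String ts = new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
        String rowkey = ts + "_" + url;
        System.out.println(rowkey + " -> " + value);
    }
}
```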
Steps:
1. Implement Flume's HbaseEventSerializer interface so the rowkey can be set, and package it as a jar.
The Java source is given below.
2. Put the jar into Flume's /home/hadoop/apache-flume-1.6.0-cdh5.5.2-bin/lib directory.
3. Flume configuration file:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = http
a1.sources.r1.port = 44444
a1.sources.r1.bind = 10.0.0.183
# Describe the sink
a1.sinks.k1.type = hbase
a1.sinks.k1.channel = c1
a1.sinks.k1.table = httpdata
a1.sinks.k1.columnFamily = a
a1.sinks.k1.serializer = com.hbase.Rowkey
# Use a file channel that persists events to local disk
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /home/x/oyzm_test/flu-hbase/checkpoint/
a1.channels.c1.useDualCheckpoints = false
a1.channels.c1.dataDirs = /home/x/oyzm_test/flu-hbase/flumedir/
a1.channels.c1.maxFileSize = 2146435071
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 10000
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
4. Create the table in HBase: create 'httpdata','a'
5. Flume start command:
flume-ng agent -c . -f /mysoftware/flume-1.7.0/conf/hbase_simple.conf -n a1 -Dflume.root.logger=INFO,console
6. Command to post data to Flume:
curl -X POST -d '[{"body":"http:10.0.0.1_{asdasd}"}]' http://10.0.0.183:44444
Resulting row in HBase:
20181108104034_http:10.0.0.183 column=a:data, timestamp=1541644834926, value={asdasd}
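To make the row layout concrete, the rowkey shown above can be decomposed back into its timestamp and URL parts (a standalone sketch, not part of the pipeline):

```java
public class RowkeyParseDemo {
    public static void main(String[] args) {
        // Rowkey layout written by the serializer: yyyyMMddHHmmss_url
        String rowkey = "20181108104034_http:10.0.0.183";
        // Split on the first underscore: timestamp, then the URL part
        String[] parts = rowkey.split("_", 2);
        System.out.println("time=" + parts[0] + " url=" + parts[1]);
    }
}
```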
Java source:
package com.hbase;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.flume.sink.hbase.HbaseEventSerializer;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Row;
public class Rowkey implements HbaseEventSerializer {
    // Default column family; overwritten by initialize() with the
    // columnFamily configured on the sink
    private byte[] colFam = "cf".getBytes();
    // The event currently being serialized
    private Event currentEvent;

    @Override
    public void initialize(Event event, byte[] colFam) {
        this.currentEvent = event;
        this.colFam = colFam;
    }

    @Override
    public void configure(Context context) {}

    @Override
    public void configure(ComponentConfiguration conf) {}

    // Build the Put: rowkey, column qualifier, and value
    @Override
    public List<Row> getActions() {
        // Event body format: url_value
        String eventStr = new String(currentEvent.getBody());
        // Split on the first underscore only, so underscores in the
        // value part are preserved
        String[] parts = eventStr.split("_", 2);
        String url = parts[0];
        String value = parts[1];
        // Current system time, formatted as yyyyMMddHHmmss
        SimpleDateFormat df = new SimpleDateFormat("yyyyMMddHHmmss");
        // rowkey = currentTime_url
        byte[] currentRowKey = (df.format(new Date()) + "_" + url).getBytes();
        // HBase put: column family, qualifier "data", value
        // e.g. column=a:data, value={asdasd}
        List<Row> puts = new ArrayList<Row>();
        Put putReq = new Put(currentRowKey);
        putReq.addColumn(colFam, "data".getBytes(), value.getBytes());
        puts.add(putReq);
        return puts;
    }

    @Override
    public List<Increment> getIncrements() {
        // No counter increments needed for this use case
        return new ArrayList<Increment>();
    }

    // Release references
    @Override
    public void close() {
        colFam = null;
        currentEvent = null;
    }
}
pom file (dependencies):
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.flume.flume-ng-sinks</groupId>
<artifactId>flume-ng-hbase-sink</artifactId>
<version>1.7.0</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>1.2.4</version>
</dependency>
<dependency>
<groupId>jdk.tools</groupId>
<artifactId>jdk.tools</artifactId>
<version>1.8</version>
<scope>system</scope>
<systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
</dependency>
</dependencies>