The example on the Flink website for writing to HDFS is fairly minimal and hard to get running; it lacks a more detailed description.
Goal: a local Flink streaming job reads data from Kafka and writes it to HDFS in a remote environment.
Core code:
public static void main(String[] args) throws Exception {
    // set up the streaming execution environment
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(5000);
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    Properties properties = new Properties();
    // IP address and ports of the target environment
    properties.setProperty("bootstrap.servers", "192.168.0.1:9092"); // Kafka brokers
    properties.setProperty("zookeeper.connect", "192.168.0.1:2181"); // ZooKeeper
    properties.setProperty("group.id", "test-consumer-group");       // consumer group id
    // Important: point this at the directory containing hdfs-site.xml and core-site.xml.
    // Copy these two Hadoop config files from the target environment to the local machine;
    // here they are placed under the project's resources directory.
    properties.setProperty("fs.hdfs.hadoopconf", "E:\\Ali-Code\\cn-smart\\cn-components\\cn-flink\\src\\main\\resources");

    // Instantiate the consumer that matches your Kafka version:
    // FlinkKafkaConsumer09<String> flinkKafkaConsumer09 = new FlinkKafkaConsumer09<>("test0", new SimpleStringSchema(), properties);
    FlinkKafkaConsumer010<String> flinkKafkaConsumer010 =
            new FlinkKafkaConsumer010<>("test1", new SimpleStringSchema(), properties);
    // flinkKafkaConsumer010.assignTimestampsAndWatermarks(new CustomWatermarkEmitter());

    DataStream<String> keyedStream = env.addSource(flinkKafkaConsumer010);
    keyedStream.print();

    System.out.println("*********** hdfs ***********************");
    BucketingSink<String> bucketingSink = new BucketingSink<>("/var"); // base path on HDFS
    // Identity bucketer: write every record directly under the base path.
    bucketingSink.setBucketer((Bucketer<String>) (clock, basePath, value) -> basePath);
    bucketingSink.setWriter(new StringWriter<>())
            .setBatchSize(1024 * 1024)       // roll the part file at 1 MB
            .setBatchRolloverInterval(2000); // or after 2 seconds
    keyedStream.addSink(bucketingSink);

    env.execute("test");
}
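Part of why the official example is hard to get running is the classpath: the snippet above needs both the Kafka 0.10 connector and the filesystem connector (for `BucketingSink`), plus a Hadoop client. A sketch of the Maven dependencies, assuming Scala 2.11 artifacts and a Flink 1.6.x / Hadoop 2.7.x setup (the versions are assumptions; match them to your cluster):

```xml
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka-0.10_2.11</artifactId>
    <version>1.6.1</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-filesystem_2.11</artifactId>
    <version>1.6.1</version>
</dependency>
<!-- HDFS client; match the target cluster's Hadoop version -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.3</version>
</dependency>
```

If you use `FlinkKafkaConsumer09` instead, swap the first artifact for `flink-connector-kafka-0.9_2.11`.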
After running, many small files appear under /var on the target environment's HDFS; they contain the data consumed from Kafka.
To be continued...
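The flood of small files comes from the tiny batch size (1 MB), the 2-second rollover interval, and the identity bucketer. One way to tame it, sketched with the `DateTimeBucketer` that ships with `flink-connector-filesystem` (the format string and thresholds here are illustrative assumptions, not tuned values):

```java
// Bucket output by hour and roll part files less aggressively.
BucketingSink<String> sink = new BucketingSink<>("/var");
sink.setBucketer(new DateTimeBucketer<String>("yyyy-MM-dd--HH")) // one directory per hour
    .setWriter(new StringWriter<String>())
    .setBatchSize(64 * 1024 * 1024)           // roll part files at 64 MB ...
    .setBatchRolloverInterval(60 * 60 * 1000) // ... or after one hour
    .setInactiveBucketThreshold(60 * 1000);   // close buckets idle for a minute
keyedStream.addSink(sink);
```

Larger batches mean fewer, bigger part files, at the cost of data sitting longer in in-progress files before being finalized (pending files are moved to their final state on checkpoint, which is why `enableCheckpointing` above matters).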