Development Environment Setup
1. JDK
2. Maven
3. IDEA
Developing a Batch Application with Flink
We take the simplest word count as the example.
Prepare a text file at src/main/test_files/test_file with the following content:
hello,welcome
hello,world,welcome
Development flow:
- set up the batch execution environment
- read
- transform operations: the core of development, where the business logic lives
- execute program
Java Implementation
Official website
First way to create the project:
mvn archetype:generate \
-DarchetypeGroupId=org.apache.flink \
-DarchetypeArtifactId=flink-quickstart-java \
-DarchetypeVersion=1.9.0
Second way to create the project (it is actually a shell script):
$ curl https://flink.apache.org/q/quickstart.sh | bash -s 1.9.0
A problem can occur when creating the project with Maven:
[INFO] Generating project in Interactive mode
The command line can hang at this line for a long time. The fix is to add one more parameter to the mvn command:
mvn archetype:generate \
-DarchetypeGroupId=org.apache.flink \
-DarchetypeArtifactId=flink-quickstart-java \
-DarchetypeVersion=1.9.0 \
-DarchetypeCatalog=local
Then enter your custom names (groupId, artifactId, and so on) at the interactive prompts.
Success.
Java code:
package com.kun.flink.java.chapter02;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

/**
 * A Flink batch application developed with the Java API.
 */
public class BatchWCJavaAPP {

    public static void main(String[] args) throws Exception {
        String input = "src/main/test_files/test_file";

        // step 1: get the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // step 2: read data
        DataSource<String> text = env.readTextFile(input);

        // step 3: transform
        // split each line on the specified delimiter
        text.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] words = s.toLowerCase().split(",");
                for (String word : words) {
                    if (word.length() > 0) {
                        // attach a count of 1 to each word
                        collector.collect(new Tuple2<>(word, 1));
                    }
                }
            }
        })
        // aggregate: group by word, sum the counts
        .groupBy(0).sum(1).print();
    }
}
Run result:
19:29:13,205 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopped Akka RPC service.
(world,1)
(hello,2)
(welcome,2)
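To make the dataflow above easier to reason about without a Flink runtime, here is a plain-Java sketch of the same tokenize-then-count logic. The class and method names are illustrative only and are not part of any Flink API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the batch word-count logic; no Flink involved.
public class BatchWordCountSketch {

    // Mirrors flatMap (split + filter + emit (word, 1)) followed by groupBy(0).sum(1).
    public static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split(",")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum); // emit (word, 1), then sum per key
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same two lines as src/main/test_files/test_file
        List<String> lines = List.of("hello,welcome", "hello,world,welcome");
        countWords(lines).forEach((w, c) -> System.out.println("(" + w + "," + c + ")"));
    }
}
```

The difference in a real Flink job is that the split and the sum run as distributed operators; the per-word arithmetic is the same.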
Scala Implementation
Official website
There are two ways to build the project, SBT and Maven.
Only Maven is tested here:
$ mvn archetype:generate \
-DarchetypeGroupId=org.apache.flink \
-DarchetypeArtifactId=flink-quickstart-scala \
-DarchetypeVersion=1.9.0 \
-DarchetypeCatalog=local
The steps are the same as for the Java project above.
Scala code:
package com.kun.flink.scala.chapter02

import org.apache.flink.api.scala._

object BatchWCScalaAPP {

  def main(args: Array[String]): Unit = {
    val input = "src/main/test_files/test_file"

    val env = ExecutionEnvironment.getExecutionEnvironment
    val text = env.readTextFile(input)

    text.flatMap(_.toLowerCase.split(","))
      .filter(_.nonEmpty)
      .map((_, 1))
      .groupBy(0)
      .sum(1)
      .print()
  }
}
Result:
19:53:00,164 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopped Akka RPC service.
(world,1)
(hello,2)
(welcome,2)
Developing a Streaming Application with Flink
Before testing the streaming job, start nc first (for example, nc -lk 9999).
Java:
package com.kun.flink.java.chapter02;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

/**
 * A Flink streaming (real-time) application developed with the Java API.
 *
 * The word-count data comes from a socket.
 */
public class StreamingWCJavaAPP {

    public static void main(String[] args) throws Exception {
        // step 1: get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // step 2: read data
        DataStreamSource<String> text = env.socketTextStream("hadoop", 9999);

        // step 3: transform
        // split each line on the specified delimiter
        text.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] words = s.toLowerCase().split(",");
                for (String word : words) {
                    if (word.length() > 0) {
                        // attach a count of 1 to each word
                        collector.collect(new Tuple2<>(word, 1));
                    }
                }
            }
        })
        // aggregate: key by word, count within 5-second windows
        .keyBy(0).timeWindow(Time.seconds(5)).sum(1).print();

        env.execute("StreamingWCJavaAPP");
    }
}
Result (the 2>, 6>, 4> prefixes are the indices of the parallel subtasks that emitted each record):
2> (b,2)
6> (a,3)
4> (c,4)
Refactoring the Program to Accept Command-Line Parameters
package com.kun.flink.java.chapter02;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

/**
 * A Flink streaming (real-time) application developed with the Java API.
 *
 * The word-count data comes from a socket.
 */
public class StreamingWCJava02APP {

    public static void main(String[] args) throws Exception {
        // get the port parameter via ParameterTool, a Flink utility class
        int port;
        try {
            ParameterTool tool = ParameterTool.fromArgs(args);
            port = tool.getInt("port");
        } catch (Exception e) {
            System.err.println("port is not set, falling back to 9998");
            port = 9998;
        }

        // step 1: get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // step 2: read data
        DataStreamSource<String> text = env.socketTextStream("hadoop", port);

        // step 3: transform
        // split each line on the specified delimiter
        text.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] words = s.toLowerCase().split(",");
                for (String word : words) {
                    if (word.length() > 0) {
                        // attach a count of 1 to each word
                        collector.collect(new Tuple2<>(word, 1));
                    }
                }
            }
        })
        // aggregate: key by word, count within 5-second windows
        .keyBy(0).timeWindow(Time.seconds(5)).sum(1).print();

        env.execute("StreamingWCJava02APP");
    }
}
How to run: pass the port via the program arguments, in the form --port 9999 (run result omitted).
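ParameterTool.fromArgs reads arguments given as --key value pairs. As a rough plain-Java sketch of that convention (illustrative only; this is not Flink's actual implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the "--key value" argument convention that
// ParameterTool.fromArgs follows; not Flink's real implementation.
public class ArgsSketch {

    public static Map<String, String> fromArgs(String[] args) {
        Map<String, String> params = new HashMap<>();
        // walk the arguments in (flag, value) pairs
        for (int i = 0; i < args.length - 1; i += 2) {
            if (args[i].startsWith("--")) {
                params.put(args[i].substring(2), args[i + 1]);
            }
        }
        return params;
    }

    public static void main(String[] args) {
        // e.g. program arguments: --port 9999
        Map<String, String> params = fromArgs(new String[]{"--port", "9999"});
        // fall back to a default, as the try/catch above does
        int port = params.containsKey("port") ? Integer.parseInt(params.get("port")) : 9998;
        System.out.println("port = " + port);
    }
}
```

This is why the refactored program is started with program arguments such as --port 9999 instead of a hard-coded port.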
Scala:
package com.kun.flink.scala.chapter02

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time

object StreamingWCScalaAPP {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("hadoop", 9999)

    import org.apache.flink.streaming.api.scala._
    text.flatMap(_.split(","))
      .map((_, 1))
      .keyBy(0)
      .timeWindow(Time.seconds(5))
      .sum(1)
      .print()

    env.execute("StreamingWCScalaAPP")
  }
}
Run result:
4> (c,1)
6> (a,2)
2> (b,3)