Flink and Its Components
Environment
-
getExecutionEnvironment
Creates an execution environment that represents the context in which the current program runs. If the program is invoked standalone, this method returns a local execution environment; if the program is submitted to a cluster through the command-line client, this method returns that cluster's environment. In other words, getExecutionEnvironment decides which environment to return based on how the program is run, and it is the most common way to create an execution environment.
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
If no parallelism is set in the program, the value configured in flink-conf.yaml is used; the default is 1.
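A parallelism set explicitly on the environment overrides that default (a minimal sketch; the value 4 is arbitrary):
env.setParallelism(4)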
-
createLocalEnvironment
Returns a local execution environment; the default parallelism has to be specified when calling it.
val env = StreamExecutionEnvironment.createLocalEnvironment(1)
-
createRemoteEnvironment
Returns a cluster execution environment and submits the Jar to a remote server. You have to specify the JobManager's IP and port when calling it, and name the Jar to run on the cluster.
val env = ExecutionEnvironment.createRemoteEnvironment("jobmanager-hostname", 6123,"C://jar//flink//wordcount.jar")
Source
Flink + Kafka (Flink consumes data from Kafka)
-
Start ZooKeeper and Kafka.
-
Create the topic and start a console producer.
bin/kafka-topics.sh --create --partitions 3 --replication-factor 2 --topic testnew --zookeeper vm0:2181,vm1:2181,vm2:2181
bin/kafka-console-producer.sh --broker-list vm0:9092,vm1:9092,vm2:9092 --topic testnew
-
pom file
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka-0.8_2.11</artifactId>
<version>1.6.1</version>
</dependency>
-
Code implementation
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08

/**
 * Flink reads data from Kafka.
 */
object KafkaCousumerToflink {
  def main(args: Array[String]): Unit = {
    demo01
  }

  def demo01: Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "vm2:9092")
    // only required for Kafka 0.8
    properties.setProperty("zookeeper.connect", "vm2:2181")
    properties.setProperty("group.id", "test")
    // consume the "mdj" topic as plain strings
    val stream = env
      .addSource(new FlinkKafkaConsumer08[String]("mdj", new SimpleStringSchema(), properties))
    stream.print()
    env.execute("KafkaCousumerToflink")
  }
}
Transform
Introduction to Transformations
Flink provides a large number of operators.
1. Map
Takes one element and produces one element; map applies a transformation to each input element.
val streamMap = stream.map { x => x * 2 }
2. flatMap
Takes one element and produces zero, one, or more elements; mostly used for splitting operations.
val streamFlatMap = stream.flatMap{
x => x.split(" ")
}
3. filter
Evaluates a boolean for each element and keeps only the elements for which it returns true.
val streamFilter = stream.filter{
x => x == 1
}
4. KeyBy
DataStream → KeyedStream: logically splits a stream into disjoint partitions, each containing the elements with the same key; internally this is implemented with hash partitioning. When keying by field position (e.g. keyBy(0)), the input has to be a Tuple type.
Note: the following types cannot be used as keys.
- POJO types that do not override hashCode
- arrays of any kind
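A minimal sketch of keyBy, assuming the usual streaming imports (org.apache.flink.streaming.api.scala._) and an env created as above:
val words: DataStream[(String, Int)] = env.fromElements(("a", 1), ("b", 2), ("a", 3))
val keyed = words.keyBy(0) // key by tuple position 0; a key-selector function such as keyBy(_._1) also works
keyed.sum(1).print()       // rolling sum per key
env.execute()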
5. Distinct
Deduplication.
6. join and outerJoin
Joins two data sets on a key.
7. cross
Cartesian product.
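These three operators belong to the DataSet (batch) API; a minimal sketch, assuming import org.apache.flink.api.scala._ and a batch ExecutionEnvironment:
val benv  = ExecutionEnvironment.getExecutionEnvironment
val left  = benv.fromElements((1, "a"), (2, "b"), (2, "b"))
val right = benv.fromElements((1, "x"), (3, "y"))
left.distinct().print()                      // deduplication
left.join(right).where(0).equalTo(0).print() // inner join on the first field
left.leftOuterJoin(right).where(0).equalTo(0) { (l, r) => (l._1, l._2, if (r == null) "-" else r._2) }.print()
left.cross(right).print()                    // Cartesian product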
8. reduce
A rolling aggregation: combines the current element with the previously reduced value.
//accumulate the count for each channel (key)
val value: DataStream[(Int, Int)] = env.fromElements((1, 2), (1, 3))
val kst: KeyedStream[(Int, Int), Tuple] = value.keyBy(0)
kst.reduce { (t1, t2) => (t1._1, t1._2 + t2._2) }.print().setParallelism(1)
env.execute()
9. fold
Starts from an initial value and rolls it up with each element of the stream.
private def myFold(env: StreamExecutionEnvironment): Unit = {
val value: DataStream[(Int, Int)] = env.fromElements((1, 2), (1, 3))
val kst: KeyedStream[(Int, Int), Tuple] = value.keyBy(0)
val ds: DataStream[String] = kst.fold("")((str, i) => {
str + "-" + i
})
ds.print()
env.execute()
}
More complex operators
- aggregation
KeyedStream → DataStream: rolling aggregations on a keyed stream. The difference between min and minBy is that min returns the minimum value (the other fields are taken from the first record seen), whereas minBy returns the element that contains the minimum value in that field (the same applies to max and maxBy).
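A minimal sketch of the min / minBy difference (same env and imports as in the earlier examples):
val data = env.fromElements(("a", 3, 30), ("a", 1, 10), ("a", 2, 20))
data.keyBy(0).min(1).print()   // rolling minimum of field 1; the remaining fields stay as in the first record
data.keyBy(0).minBy(1).print() // the complete record holding the minimum of field 1
env.execute()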
- window
KeyedStream → DataStream: windows are defined on a keyed stream; a window groups the data of each key according to some characteristic (for example, the data that arrived within the last 5 s).
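A minimal sketch of a keyed 5-second tumbling window over a stream of (String, Int) tuples (assumes import org.apache.flink.streaming.api.windowing.time.Time):
val windowedCounts = stream
  .keyBy(0)
  .timeWindow(Time.seconds(5)) // 5 s tumbling window per key
  .sum(1)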
- windowAll
DataStream → AllWindowedStream: windows can also be defined on a regular DataStream; they group all elements of the stream according to some characteristic (for example, the data that arrived within the last 5 s). In most cases this is not a parallel operation: all records are collected into a single task running the windowAll operator.
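A minimal sketch of windowAll on the same stream (note the single, non-parallel window task):
val globalCounts = stream
  .timeWindowAll(Time.seconds(5)) // one window over the whole stream, parallelism 1
  .sum(1)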
- window apply
WindowedStream → DataStream, AllWindowedStream → DataStream: applies a general function to the window as a whole.
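A minimal sketch of window apply, keyed with a key selector so the key is a String; the function receives all elements of one window at once (assumes imports of TimeWindow and Collector):
stream
  .keyBy(_._1)
  .timeWindow(Time.seconds(5))
  .apply { (key: String, window: TimeWindow, input: Iterable[(String, Int)], out: Collector[String]) =>
    out.collect(s"key=$key elements=${input.size}")
  }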
- window reduce
WindowedStream → DataStream: applies a reduce function to the window and returns the reduced result.
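A minimal sketch of window reduce; the function is applied incrementally as elements arrive in the window:
stream
  .keyBy(0)
  .timeWindow(Time.seconds(5))
  .reduce { (a, b) => (a._1, a._2 + b._2) }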
- window fold
WindowedStream → DataStream: applies a fold function to the window and returns the folded result.
- aggregation on windows
WindowedStream → DataStream: aggregates the elements of a window; as above, min returns the minimum value, whereas minBy returns the element that contains the minimum value in that field (the same applies to max and maxBy).
- union
DataStream → DataStream: unions two or more DataStreams into a new DataStream that contains all elements of all inputs. Note: if you union a DataStream with itself, every element appears twice in the resulting stream.
private def myUnion(env: StreamExecutionEnvironment): Unit = {
//myConnAndCoMap(env)
val dsm: DataStream[Int] = env.fromElements(1, 3, 5)
val dsm01: DataStream[Int] = env.fromElements(2, 4, 6)
val unit: DataStream[Int] = dsm.union(dsm01)
unit.print()
env.execute()
}
- window join
DataStream, DataStream → DataStream: joins two DataStreams on a given key and a common window.
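A minimal sketch of a window join, assuming two hypothetical streams orders and payments of (String, Int) tuples keyed by their first field (assumes imports of TumblingProcessingTimeWindows and Time):
val joined = orders.join(payments)
  .where(_._1)   // key of the first stream
  .equalTo(_._1) // key of the second stream
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
  .apply { (o, p) => (o._1, o._2, p._2) } // one result per matching pair in the window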
- window coGroup
DataStream, DataStream → DataStream: co-groups two DataStreams on a given key and a common window.
- connect
DataStream, DataStream → ConnectedStreams: connects two data streams while keeping their types.
- coMap, coFlatMap
ConnectedStreams → DataStream: works on a connected stream; like map and flatMap, but with a separate function for each of the two inputs.
//print after merging the two streams
private def myConnAndCoMap(env: StreamExecutionEnvironment): Unit = {
env.setParallelism(1)
val src: DataStream[Int] = env.fromElements(1, 3, 5)
val stringMap: DataStream[String] = src.map(line => "x " + line)
val result = stringMap.connect(src).map(new CoMapFunction[String, Int, String] {
override def map2(value: Int): String = {
"x " + (value + 1)
}
override def map1(value: String): String = {
value
}
})
result.print()
env.execute()
}
- split
DataStream → SplitStream: splits one DataStream into two or more DataStreams according to some criteria.
- select
SplitStream → DataStream: selects one or more DataStreams from a SplitStream.
private def selectAndSplit(env: StreamExecutionEnvironment): Unit = {
val dsm: DataStream[Long] = env.fromElements(1l, 2l, 3l, 4l)
val split:SplitStream[Long] = dsm.split(new OutputSelector[Long] {
override def select(out: Long): lang.Iterable[String] = {
val list = new util.ArrayList[String]()
if (out % 2 == 0) {
list.add("even")
} else {
list.add("odd")
}
list
}
})
split.select("odd").print().setParallelism(1)
env.execute()
}
- iterate
DataStream → IterativeStream → DataStream: creates a feedback loop in the dataflow by redirecting the output of one operator back to an earlier operator. This is useful for algorithms that continuously update a model.
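A minimal sketch of iterate: values are decremented and fed back until they reach zero (the step function returns the feedback stream and the output stream; the timeout lets the loop terminate once no more feedback arrives):
val numbers: DataStream[Long] = env.fromElements(8L, 5L, 3L)
val result = numbers.iterate(iteration => {
  val minusOne = iteration.map(_ - 1)
  val feedback = minusOne.filter(_ > 0)  // routed back to the head of the loop
  val output   = minusOne.filter(_ <= 0) // leaves the loop
  (feedback, output)
}, 5000)
result.print()
env.execute()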
- extract timestamps
DataStream → DataStream: extracts timestamps from the records so that windows working on event time can use them.
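A minimal sketch using a BoundedOutOfOrdernessTimestampExtractor, assuming a hypothetical stream events of (String, Long) records whose second field is the event timestamp in milliseconds (assumes imports of TimeCharacteristic, BoundedOutOfOrdernessTimestampExtractor and Time):
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val withTimestamps = events.assignTimestampsAndWatermarks(
  new BoundedOutOfOrdernessTimestampExtractor[(String, Long)](Time.seconds(2)) {
    override def extractTimestamp(element: (String, Long)): Long = element._2 // event timestamp in ms
  })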
Sink
Flink + Kafka
Flink has no foreach method like Spark's that lets users write out data directly inside an iteration. All output to external systems has to go through a Sink; the final output step of a job typically looks like this:
myDstream.addSink(new MySink(xxxx))
The official distribution provides sinks for a number of frameworks; for anything else you have to implement a custom sink.
Kafka Sink
Create a Maven project and add the pom dependency.
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka-0.8_2.11 -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka-0.8_2.11</artifactId>
<version>1.6.1</version>
</dependency>
Code implementation
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer08

/**
 * Write data into Kafka.
*/
object KafkaProducerFromFlink {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val dsm: DataStream[String] = env.fromElements("1", "2")
val prop: Properties = new Properties()
prop.setProperty("bootstrap.servers", "vm2:9092")
val value: FlinkKafkaProducer08[String] = new FlinkKafkaProducer08("mdj", new SimpleStringSchema(), prop)
dsm.addSink(value)
env.execute()
}
}
Redis Sink
Create a Maven project and add the pom dependency.
<!-- https://mvnrepository.com/artifact/org.apache.bahir/flink-connector-redis -->
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>flink-connector-redis_2.11</artifactId>
<version>1.0</version>
</dependency>
Code implementation
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.redis.RedisSink
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig
import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}
import org.apache.flink.streaming.api.scala._
object MyRedisUtil {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.fromCollection(List(("flink","redis"))).map( x=>(x._1,x._2+"" )).addSink(MyRedisUtil.getRedisSink()).setParallelism(1)
env.execute("redissink")
}
val conf = new FlinkJedisPoolConfig.Builder().setHost("192.168.44.127").setPort(6379).build()
def getRedisSink(): RedisSink[(String,String)] ={
new RedisSink[(String,String)](conf,new MyRedisMapper)
}
class MyRedisMapper extends RedisMapper[(String,String)]{
override def getCommandDescription: RedisCommandDescription = {
// new RedisCommandDescription(RedisCommand.HSET, "channel_count")
new RedisCommandDescription(RedisCommand.SET ,"myset" )
}
override def getValueFromData(t: (String, String)): String = t._2
override def getKeyFromData(t: (String, String)): String = t._1
}
}
Elasticsearch
Add the pom dependencies.
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-elasticsearch6_2.11</artifactId>
<version>1.7.0</version>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.3</version>
</dependency>
Add MyEsUtil.
import java.util
import com.alibaba.fastjson.{JSON, JSONObject}
import org.apache.flink.api.common.functions.RuntimeContext
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.connectors.elasticsearch.{ElasticsearchSinkFunction, RequestIndexer}
import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink
import org.apache.http.HttpHost
import org.elasticsearch.action.index.IndexRequest
import org.elasticsearch.client.Requests
import org.apache.flink.api.scala._
/**
 * Sink Flink data into Elasticsearch.
*/
object MyEsUtil {
def main(args: Array[String]): Unit = {
val esSink: ElasticsearchSink[String] = MyEsUtil.getElasticSearchSink("gmall0503_startup")
//get the execution environment
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//build the input data: the elements must be JSON strings, otherwise JSON.parseObject in the sink fails
val ds: DataStream[String] = env.fromCollection(List("""{"key1":"value1"}"""))
//sink the data to Elasticsearch
ds.addSink(esSink)
env.execute()
}
val httpHosts = new util.ArrayList[HttpHost]
httpHosts.add(new HttpHost("vm0", 9200, "http"))
httpHosts.add(new HttpHost("vm1", 9200, "http"))
httpHosts.add(new HttpHost("vm2", 9200, "http"))
def getElasticSearchSink(indexName: String): ElasticsearchSink[String] = {
val esFunc = new ElasticsearchSinkFunction[String] {
override def process(element: String, ctx: RuntimeContext, indexer: RequestIndexer): Unit = {
println("試圖保存:" + element)
val jsonObj: JSONObject = JSON.parseObject(element)
val indexRequest: IndexRequest = Requests.indexRequest().index(indexName).`type`("_doc").source(jsonObj)
indexer.add(indexRequest)
println("保存1條")
}
}
val sinkBuilder = new ElasticsearchSink.Builder[String](httpHosts, esFunc)
//maximum number of buffered actions before a bulk flush
sinkBuilder.setBulkFlushMaxActions(10)
sinkBuilder.build()
}
}
Custom JDBC sink
Add the pom dependencies.
<!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.44</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>druid</artifactId>
<version>1.1.10</version>
</dependency>
Add MyJdbcSink.
import java.sql.{Connection, DriverManager, PreparedStatement}
import com.bw.StreamSink.Stuents
import org.apache.flink.api.common.functions.MapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.datastream.{DataStream, SingleOutputStreamOperator}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
object MyJdbcSink {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// val source: DataStream[String] = env.socketTextStream("vm2", 9999)
val source: DataStream[String] = env.readTextFile("d:/person.txt")
val map: SingleOutputStreamOperator[Stuents] = source.map(new MapFunction[String, Stuents]() {
override def map(value: String): Stuents = {
val split: Array[String] = value.split(",")
val stu: Stuents = new Stuents
println(split(0))
stu.setId(split(0))
stu.setName(split(1))
stu.setAge(split(2).toInt)
stu
}
})
map.addSink(new SinkToMySql())
env.execute("MyJdbcSink")
}
case class student(id: String, name: String,age:String)
class SinkToMySql() extends RichSinkFunction[Stuents] {
var conn: Connection = null;
var ps: PreparedStatement = null
val driver = "com.mysql.jdbc.Driver"
val url: String = "jdbc:mysql://vm2:3306/myflink"
val username = "root"
val password = "123456"
val maxActive = "20"
//initialization: open the connection once
override def open(parameters: Configuration): Unit = {
super.open(parameters)
Class.forName("com.mysql.jdbc.Driver")
conn = DriverManager.getConnection(url, username, password)
conn.setAutoCommit(false)
}
//called once for every incoming record
override def invoke(value: Stuents): Unit = {
val sql: String = "insert into student(name,age) values(?,?)"
ps = conn.prepareStatement(sql)
//ps.setString(0,value.getId)
ps.setString(1,value.getName)
ps.setString(2,value.getAge.toString)
ps.execute()
conn.commit()
}
override def close(): Unit = {
super.close()
if (conn != null) {
conn.close()
}
}
}
}