Flink, Final Part: Custom UDFs and Sink Definitions

Original post

2020-05-14 22:08

I. Implementing UDFs: Finer-Grained Control over the Stream

1.1 Function Classes

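Flink exposes an interface for every UDF (each is implemented as an interface or an abstract class), for example MapFunction, FilterFunction, and ProcessFunction.
The following example implements the FilterFunction interface: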

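import org.apache.flink.api.common.functions.FilterFunction

// Keeps only the tweets that contain "flink"
class FlinkFilter extends FilterFunction[String] {
  override def filter(value: String): Boolean = {
    value.contains("flink")
  }
}
val flinkTweets = tweets.filter(new FlinkFilter)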

The function can also be implemented as an anonymous class:

val flinkTweets = tweets.filter(
	new RichFilterFunction[String] {
		override def filter(value: String): Boolean = {
			value.contains("flink")
		}
	}
)

The string "flink" that we filter on can also be passed in as a constructor parameter:

val tweets: DataStream[String] = ...
val flinkTweets = tweets.filter(new KeywordFilter("flink"))
class KeywordFilter(keyWord: String) extends FilterFunction[String] {
	override def filter(value: String): Boolean = {
		value.contains(keyWord)
	}
}

1.2 Anonymous Functions (Lambda Functions)

val tweets: DataStream[String] = ...
val flinkTweets = tweets.filter(_.contains("flink"))

1.3 Rich Functions

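A "rich function" is a function-class interface provided by the DataStream API, and every Flink function class has a Rich version. Unlike a regular function, a rich function can access the context of the runtime environment and has lifecycle methods, so it can implement more complex functionality.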

RichMapFunction
RichFlatMapFunction
RichFilterFunction
…

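A rich function has a lifecycle. The typical lifecycle methods are:

open() is the initialization method of a rich function. It is called once, before the operator (for example map or filter) processes its first record.
close() is the last method called in the lifecycle and is where cleanup work is done.
getRuntimeContext() gives access to the function's RuntimeContext, for example the parallelism at which the function runs, the task name, and state.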

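import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

class MyFlatMap extends RichFlatMapFunction[Int, (Int, Int)] {
	var subTaskIndex = 0

	override def open(configuration: Configuration): Unit = {
		subTaskIndex = getRuntimeContext.getIndexOfThisSubtask
		// Initialization work can go here, for example opening a connection to HDFS
	}

	override def flatMap(in: Int, out: Collector[(Int, Int)]): Unit = {
		// Emit the value only when its remainder mod 2 equals this subtask's index
		if (in % 2 == subTaskIndex) {
			out.collect((subTaskIndex, in))
		}
	}

	override def close(): Unit = {
		// Cleanup work can go here, for example closing the connection to HDFS
	}
}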
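
To see what the `in % 2 == subTaskIndex` check in flatMap lets through, the filtering can be modeled in plain Scala without Flink. This is only a sketch: `RoutingModel` and `emitted` are hypothetical names, and it assumes, purely for illustration, that every subtask is offered the same inputs, which is not how Flink actually distributes records.

```scala
// Plain-Scala model of MyFlatMap.flatMap; no Flink dependency.
object RoutingModel {
  // `subTaskIndex` stands in for getRuntimeContext.getIndexOfThisSubtask.
  def emitted(subTaskIndex: Int, inputs: Seq[Int]): Seq[(Int, Int)] =
    inputs.collect { case in if in % 2 == subTaskIndex => (subTaskIndex, in) }
}

val inputs = Seq(1, 2, 3, 4)
val fromSubtask0 = RoutingModel.emitted(0, inputs) // even values only
val fromSubtask1 = RoutingModel.emitted(1, inputs) // odd values only
val fromSubtask2 = RoutingModel.emitted(2, inputs) // nothing matches
```

Under these assumptions subtask 0 emits (0,2) and (0,4), subtask 1 emits (1,1) and (1,3), and a subtask with index 2 or higher emits nothing.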

II. Defining Sinks

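Flink has no equivalent of Spark's foreach method that lets users iterate over the data themselves. All output to external systems must go through a Sink. The final output step of a job is wired up along these lines: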

  stream.addSink(new MySink(xxxx))

Flink officially provides sinks for a number of frameworks. For anything else, users need to implement a custom sink.

2.1 Kafka

Dependency in pom.xml:

<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka-0.11 -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka-0.11_2.11</artifactId>
    <version>1.10.0</version>
</dependency>

Add the sink in the main function:

val union = high.union(low).map(_.temperature.toString)
union.addSink(new FlinkKafkaProducer011[String]("localhost:9092", "test", new SimpleStringSchema()))

2.2 Redis

Dependency in pom.xml:

<!-- https://mvnrepository.com/artifact/org.apache.bahir/flink-connector-redis -->
<dependency>
    <groupId>org.apache.bahir</groupId>
    <artifactId>flink-connector-redis_2.11</artifactId>
    <version>1.0</version>
</dependency>

Define a Redis mapper class that specifies which command is called when saving to Redis:

class MyRedisMapper extends RedisMapper[SensorReading]{
  override def getCommandDescription: RedisCommandDescription = {
    new RedisCommandDescription(RedisCommand.HSET, "sensor_temperature")
  }
  override def getValueFromData(t: SensorReading): String = t.temperature.toString

  override def getKeyFromData(t: SensorReading): String = t.id
}

Call it in the main function:

val conf = new FlinkJedisPoolConfig.Builder().setHost("localhost").setPort(6379).build()
dataStream.addSink( new RedisSink[SensorReading](conf, new MyRedisMapper) )
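
The mapper above describes an HSET into the hash `sensor_temperature`, with the sensor id as the field and the temperature as the value. The following Flink-free sketch renders that command as a string to make the mapping concrete. `SensorReading` is not defined in this article, so the two-field case class here is a hypothetical stand-in based only on the fields the code uses, and `RedisCommandModel` is a hypothetical helper.

```scala
// Hypothetical definition: the article only uses `id` and `temperature`;
// the real class may carry more fields.
case class SensorReading(id: String, temperature: Double)

object RedisCommandModel {
  // Mirrors MyRedisMapper: HSET <hash name> <getKeyFromData> <getValueFromData>
  def command(r: SensorReading): String =
    s"HSET sensor_temperature ${r.id} ${r.temperature.toString}"
}

val cmd = RedisCommandModel.command(SensorReading("sensor_1", 35.8))
```

Because every reading from the same sensor writes the same hash field, a later reading overwrites the earlier one, so the hash holds the latest temperature per sensor.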

2.3 Elasticsearch

Dependency in pom.xml:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-elasticsearch6_2.11</artifactId>
    <version>1.10.0</version>
</dependency>

Call it in the main function:

val httpHosts = new util.ArrayList[HttpHost]()
httpHosts.add(new HttpHost("localhost", 9200))

val esSinkBuilder = new ElasticsearchSink.Builder[SensorReading]( httpHosts, new ElasticsearchSinkFunction[SensorReading] {
  override def process(t: SensorReading, runtimeContext: RuntimeContext, requestIndexer: RequestIndexer): Unit = {
    println("saving data: " + t)
    val json = new util.HashMap[String, String]()
    json.put("data", t.toString)
    val indexRequest = Requests.indexRequest().index("sensor").`type`("readingData").source(json)
    requestIndexer.add(indexRequest)
    println("saved successfully")
  }
} )
dataStream.addSink( esSinkBuilder.build() )

2.4 JDBC Custom Sink

<!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.44</version>
</dependency>

Add MyJdbcSink:

class MyJdbcSink() extends RichSinkFunction[SensorReading]{
  var conn: Connection = _
  var insertStmt: PreparedStatement = _
  var updateStmt: PreparedStatement = _

  // open mainly creates the connection and prepares the statements
  override def open(parameters: Configuration): Unit = {
    super.open(parameters)

    conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "root", "123456")
    insertStmt = conn.prepareStatement("INSERT INTO temperatures (sensor, temp) VALUES (?, ?)")
    updateStmt = conn.prepareStatement("UPDATE temperatures SET temp = ? WHERE sensor = ?")
  }
  // invoke uses the connection to run the SQL: update first, insert if no row was updated
  override def invoke(value: SensorReading, context: SinkFunction.Context[_]): Unit = {
    updateStmt.setDouble(1, value.temperature)
    updateStmt.setString(2, value.id)
    updateStmt.execute()

    if (updateStmt.getUpdateCount == 0) {
      insertStmt.setString(1, value.id)
      insertStmt.setDouble(2, value.temperature)
      insertStmt.execute()
    }
  }

  override def close(): Unit = {
    insertStmt.close()
    updateStmt.close()
    conn.close()
  }
}
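
The update-then-insert logic in invoke can be checked without a database. The sketch below replaces the `temperatures` table with an in-memory map; `UpsertModel` and `upsert` are hypothetical names, and the model only mirrors the control flow of invoke, not JDBC behavior.

```scala
import scala.collection.mutable

// In-memory stand-in for the `temperatures` table, keyed by sensor id.
object UpsertModel {
  val table = mutable.Map.empty[String, Double]

  // Mirrors invoke(): try the UPDATE first, INSERT only if no row was updated.
  def upsert(sensor: String, temp: Double): String = {
    val updatedRows =
      if (table.contains(sensor)) { table(sensor) = temp; 1 } else 0
    if (updatedRows == 0) { table(sensor) = temp; "insert" } else "update"
  }
}

val first = UpsertModel.upsert("sensor_1", 35.8)  // no row yet, so it inserts
val second = UpsertModel.upsert("sensor_1", 36.2) // row exists, so it updates
```

The result is one row per sensor holding its most recent temperature, which is what the sink maintains in MySQL.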

In the main method, add the sink to save the detail records to MySQL:

dataStream.addSink(new MyJdbcSink())
