Flink Series (2) -- Flink Data Sources in Detail

Original post (includes source code and images): http://note.youdao.com/noteshare?id=c91f71fd16bedf7dfaac3b6fa663a243&sub=B79A8354FB1D4CB5BE44A1513C4F7A6C

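I. DataSource

1. Flink is a stream-processing framework. It can do batch processing, that is, process static or historical data sets. It can also do stream processing, that is, process live data streams in real time and produce result streams in real time. As long as data keeps arriving, Flink keeps computing. Data sources are where that data comes from.

2. In Flink you can call StreamExecutionEnvironment.addSource(sourceFunction) to add a data source to your program.

3. Flink ships with several ready-made source functions. You can also write your own: implement SourceFunction for a non-parallel source, or implement the ParallelSourceFunction interface or extend RichParallelSourceFunction for a parallel source.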

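II. Collection-based sources

1. fromCollection(Collection) - creates a data stream from a Java java.util.Collection. All elements in the collection must be of the same type.

2. fromCollection(Iterator, Class) - creates a data stream from an iterator. Class specifies the type of the elements returned by the iterator.

3. fromElements(T ...) - creates a data stream from the given sequence of objects. All objects must be of the same type.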

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStream<Event> input = env.fromElements(
        new Event(1, "barfoo", 1.0),
        new Event(2, "start", 2.0),
        new Event(3, "foobar", 3.0),
        ...
    );

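4. fromParallelCollection(SplittableIterator, Class) - creates a parallel data stream from an iterator. Class specifies the type of the elements returned by the iterator.

5. generateSequence(from, to) - creates a parallel data stream that generates the sequence of numbers in the given interval.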

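III. File-based sources

1. readTextFile(path) - reads text files, i.e. files that conform to the TextInputFormat specification, line by line and returns the lines as strings.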

    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStream<String> text = env.readTextFile("file:///path/to/file");

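2. readFile(fileInputFormat, path) - reads files (once) according to the specified file input format.

3. readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo) - this is the method that the two methods above call internally. It reads the files under path according to the given fileInputFormat. Depending on watchType, the source either monitors the path periodically (every interval milliseconds) for new data (FileProcessingMode.PROCESS_CONTINUOUSLY), or processes the data currently in the path once and exits (FileProcessingMode.PROCESS_ONCE). With pathFilter you can further exclude files from being processed.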

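Implementation: Flink splits the file-reading process into two subtasks, directory monitoring and data reading. Each subtask is implemented by a separate entity. Directory monitoring is done by a single non-parallel task (parallelism 1), while data reading is done by multiple tasks running in parallel. The parallelism of the readers equals the parallelism of the job. The single monitoring task scans the directory (periodically or only once, depending on watchType), finds the files to process, divides them into splits, and assigns those splits to the downstream readers. The readers read the actual data. Each split is read by only one reader, but one reader can read several splits, one after another.

Important notes:

If watchType is set to FileProcessingMode.PROCESS_CONTINUOUSLY, a file's contents are reprocessed when the file is modified. This breaks "exactly-once" semantics, because appending data to the end of a file causes all of its contents to be reprocessed.

If watchType is set to FileProcessingMode.PROCESS_ONCE, the source scans the path once and then exits, without waiting for the readers to finish reading the file contents. The readers keep reading until all file contents have been read. Once the source has closed, no more checkpoints are taken. This can make recovery after a node failure slower, because the job resumes reading from the last checkpoint.

    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStream<MyEvent> stream = env.readFile(
        myFormat,
        myFilePath,
        FileProcessingMode.PROCESS_CONTINUOUSLY,
        100,
        FilePathFilter.createDefaultFilter(),
        typeInfo);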
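To see why an append under PROCESS_CONTINUOUSLY leads to the whole file being processed again, here is a minimal sketch in plain Java. It is not Flink's implementation: the class ModificationTimeMonitor and its scan method are hypothetical names, and the sketch assumes the monitor only remembers a modification timestamp per path, not how far each file was already read.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a directory monitor; NOT a Flink class.
class ModificationTimeMonitor {
    // Last modification time seen for each path.
    private final Map<String, Long> seen = new HashMap<>();

    // Returns the paths that are new or modified since the previous scan.
    // A returned file is handed to the readers in full, because only the
    // timestamp is tracked, not the position that was already read.
    List<String> scan(Map<String, Long> currentModTimes) {
        List<String> toProcess = new ArrayList<>();
        // Sort the paths so the result order is deterministic.
        Map<String, Long> sorted = new TreeMap<>(currentModTimes);
        for (Map.Entry<String, Long> entry : sorted.entrySet()) {
            Long previous = seen.get(entry.getKey());
            if (previous == null || entry.getValue() > previous) {
                toProcess.add(entry.getKey());
                seen.put(entry.getKey(), entry.getValue());
            }
        }
        return toProcess;
    }

    public static void main(String[] args) {
        ModificationTimeMonitor monitor = new ModificationTimeMonitor();
        Map<String, Long> dir = new HashMap<>();
        dir.put("a.log", 100L);
        dir.put("b.log", 100L);
        System.out.println(monitor.scan(dir)); // first scan: both files are new
        System.out.println(monitor.scan(dir)); // nothing changed: nothing to do
        dir.put("a.log", 200L);                // data appended to a.log
        System.out.println(monitor.scan(dir)); // a.log is scheduled again, in full
    }
}
```

In this model the third scan schedules a.log again even though only its tail changed, which is the duplicate-processing effect the note above warns about.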

IV. Socket-based sources

socketTextStream(String hostname, int port) - reads from a socket. Elements can be separated by a delimiter.

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStream<Tuple2<String, Integer>> dataStream = env
        .socketTextStream("localhost", 9999) // read the data arriving on port 9999 of localhost
        .flatMap(new Splitter())
        .keyBy(0)
        .timeWindow(Time.seconds(5))
        .sum(1);
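The Splitter class is not shown in the listing above. Assuming it splits each line on spaces and emits one (word, 1) pair per word, as in the usual word-count example, the work done for a single 5-second window can be sketched in plain Java without Flink. WindowWordCount and its methods are hypothetical names used only for this illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of flatMap(Splitter) -> keyBy(0) -> sum(1) for one window.
class WindowWordCount {
    // Assumed Splitter behaviour: one (word, 1) pair per space-separated word.
    static List<Map.Entry<String, Integer>> split(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // keyBy(0) groups the pairs by word; sum(1) adds up the counts per word.
    static Map<String, Integer> countWindow(List<String> linesInWindow) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : linesInWindow) {
            for (Map.Entry<String, Integer> pair : split(line)) {
                counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> window = new ArrayList<>();
        window.add("hello flink");
        window.add("hello world");
        // Prints one count per distinct word seen in this window.
        System.out.println(countWindow(window));
    }
}
```

In the real job Flink performs the grouping and summing per key and per window; the sketch only shows what ends up in one window's result.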

V. Custom sources

addSource - adds a new source function. For example, you can call addSource(new FlinkKafkaConsumer011<>(…)) to read data from Apache Kafka.

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStream<KafkaEvent> input = env
        .addSource(
            new FlinkKafkaConsumer011<>(
                parameterTool.getRequired("input-topic"), // topic passed in through the program arguments
                new KafkaEventSchema(),
                parameterTool.getProperties())
            .assignTimestampsAndWatermarks(new CustomWatermarkExtractor()));

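How do you write a custom source?

What if you want to define a source of your own?

Then you need to get to know the SourceFunction interface. It is the root interface of all stream sources, and it extends the marker (empty) interface Function.

SourceFunction defines two interface methods:

1. run: starts a source, i.e. connects to an external data source and emits elements to form a stream (in most cases by running a while loop inside this method).

2. cancel: cancels a source, i.e. stops the emit loop in run.

Normally a SourceFunction only needs to implement these two methods. In effect, the two methods also fix an implementation template.

For example, to implement an XXXSourceFunction, the template looks roughly like this (taken directly from an example in the Flink source code):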
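A minimal sketch of the run/cancel template follows. To keep it runnable without any dependency, it is written against two simplified stand-in interfaces, SimpleSourceContext and SimpleSourceFunction, which are hypothetical and only mirror the shape of the two methods described above; they are not Flink's own interfaces, and CountSource is an illustrative name.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins reduced to what the template needs.
// They are NOT the Flink interfaces.
interface SimpleSourceContext<T> {
    void collect(T element);
}

interface SimpleSourceFunction<T> {
    void run(SimpleSourceContext<T> ctx) throws Exception;
    void cancel();
}

// Emits 0, 1, 2, ... until it reaches the limit or cancel() is called.
class CountSource implements SimpleSourceFunction<Long> {
    private final long limit;
    private long count = 0L;
    // volatile: cancel() is normally called from a different thread than run().
    private volatile boolean isRunning = true;

    CountSource(long limit) {
        this.limit = limit;
    }

    @Override
    public void run(SimpleSourceContext<Long> ctx) {
        // The emit loop: keeps going until cancelled or exhausted.
        while (isRunning && count < limit) {
            ctx.collect(count);
            count++;
        }
    }

    @Override
    public void cancel() {
        // Flips the flag so the loop in run() ends at its next check.
        isRunning = false;
    }

    public static void main(String[] args) {
        List<Long> out = new ArrayList<>();
        CountSource source = new CountSource(1000);
        // Cancel from inside the consumer once five elements have arrived.
        source.run(element -> {
            out.add(element);
            if (out.size() == 5) {
                source.cancel();
            }
        });
        System.out.println(out);
    }
}
```

The design point is the flag: run owns the loop, and cancel only signals it to stop, so cancel returns immediately and never blocks on the external system.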

VI. Flink Kafka source

[KafkaUtils.scala]

    package com.test.flink.model

    import java.util.Properties

    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.slf4j.{Logger, LoggerFactory}

    import scala.collection.mutable

    /*
     * Create the topic:
     *   ./kafka-topics.sh --create --zookeeper 172.16.208.149:2181,172.16.208.150:2181 --replication-factor 2 --partitions 8 --topic metric
     * Console consumer
     */

    /**
     * Writes data to Kafka.
     */
    object KafkaUtils {
      val logger: Logger = LoggerFactory.getLogger(KafkaUtils.getClass)

      final val broker_list: String = "172.16.208.149:6667,172.16.208.150:6667,172.16.208.151:6667,172.16.208.152:6667"
      final val zk_list: String = "172.16.208.149:2181,172.16.208.150:2181"
      final val topic: String = "metric1"

      def writeToKafka(): Unit = {
        val props: Properties = new Properties()
        props.put("bootstrap.servers", broker_list)
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")   // key serializer
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer") // value serializer
        val producer: KafkaProducer[String, String] = new KafkaProducer[String, String](props)

        val tags = new mutable.HashMap[String, String]()
        val fields = new mutable.HashMap[String, String]()
        tags.put("cluster", "ghf")
        tags.put("host_ip", "111.111.111.111")
        fields.put("used_percent", "90d")
        fields.put("max", "27244873d")
        fields.put("used", "17244873d")
        fields.put("init", "27244873d")

        val metric = Metric("mem", System.currentTimeMillis(), fields, tags)
        // JavaDemoUtils.seq2Josn is the author's own JSON helper; it is not shown in this post.
        val str = JavaDemoUtils.seq2Josn(metric)

        // No partition and no key: Kafka chooses the partition.
        val record: ProducerRecord[String, String] = new ProducerRecord[String, String](topic, null, null, str)
        producer.send(record)
        println(s"Sent data: ${str}")
        producer.flush()
        // A new producer is created on every call, so release this one.
        producer.close()
      }

      def main(args: Array[String]): Unit = {
        // Send one record roughly every 300 ms, forever.
        while (true) {
          Thread.sleep(300)
          writeToKafka()
        }
      }
    }

    import scala.beans.BeanProperty
    import scala.collection.mutable

    case class Metric(@BeanProperty var name: String,
                      @BeanProperty var timestamp: Long,
                      @BeanProperty var fields: mutable.HashMap[String, String],
                      @BeanProperty var tags: mutable.HashMap[String, String]) {
      override def toString: String =
        s"Metric{name='$name', timestamp='$timestamp', fields=$fields, tags=$tags}"
    }

[KafkaSource.scala]

    package com.test.flink.kafkaSource

    import java.util.Properties

    import org.apache.flink.api.common.serialization.SimpleStringSchema
    import org.apache.flink.streaming.api.datastream.DataStreamSource
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011

    object KafkaSource {
      final val broker_list: String = "172.16.208.149:6667,172.16.208.150:6667,172.16.208.151:6667,172.16.208.152:6667"
      final val zk_list: String = "172.16.208.149:2181,172.16.208.150:2181"
      final val topic: String = "metric1"

      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        val props = new Properties()
        props.put("bootstrap.servers", broker_list)
        props.put("zookeeper.connect", zk_list)
        props.put("group.id", "metric-group")
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")   // key deserializer
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer") // value deserializer
        props.put("auto.offset.reset", "latest")

        // Read the topic as plain strings with a single source subtask.
        val dataStream: DataStreamSource[String] = env
          .addSource(new FlinkKafkaConsumer011[String](topic, new SimpleStringSchema(), props))
          .setParallelism(1)

        dataStream.print()
        env.execute("flink kafka source")
      }
    }

VII. Custom source (SQL)

    package com.ghf.test.flink

    import java.sql.{Connection, DriverManager, PreparedStatement, ResultSet}

    import org.apache.flink.configuration.Configuration
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
    import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}

    class SourceFromMySQL extends RichSourceFunction[Student] {
      var conn: Connection = null
      var ps: PreparedStatement = null
      // Set to false by cancel() so that the loop in run() stops emitting.
      @volatile private var isRunning = true

      /**
       * Open the connection once in open(), so that it does not have to be
       * opened and released on every call.
       */
      override def open(parameters: Configuration): Unit = {
        super.open(parameters)
        conn = SourceFromMySQL.getConn()
        val sql = "select * from student"
        ps = conn.prepareStatement(sql)
      }

      /**
       * Once the program has finished, close the connection and release the resources.
       */
      override def close(): Unit = {
        super.close()
        // Close the statement before the connection that owns it.
        if (ps != null) {
          ps.close()
        }
        if (conn != null) {
          conn.close()
        }
      }

      /**
       * The DataStream calls run() once to fetch the data.
       */
      override def run(sourceContext: SourceFunction.SourceContext[Student]): Unit = {
        val rs: ResultSet = ps.executeQuery()
        while (isRunning && rs.next()) {
          val student = Student(
            rs.getInt("id"),
            rs.getString("name").trim,
            rs.getString("password").trim,
            rs.getInt("age")
          )
          sourceContext.collect(student)
        }
      }

      // Stop the emit loop in run() instead of throwing NotImplementedError.
      override def cancel(): Unit = {
        isRunning = false
      }
    }

    object SourceFromMySQL {
      val url: String = "jdbc:mysql://172.16.190.76:3306/hive"
      val user: String = "root"
      val password: String = "root"

      def main(args: Array[String]): Unit = {
        // getConn()
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.addSource(new SourceFromMySQL).print()
        env.execute("flink add sql source")
      }

      // Returns null if the connection cannot be established.
      def getConn(): Connection = {
        var con: Connection = null
        try {
          Class.forName("com.mysql.jdbc.Driver")
          con = DriverManager.getConnection(url, user, password)
        } catch {
          case ex: Exception =>
            println("-----------mysql get connection has exception , msg = " + ex.getMessage())
        }
        con
      }
    }

    package com.ghf.test.flink

    import scala.beans.BeanProperty

    case class Student(@BeanProperty val id: Int,
                       @BeanProperty val name: String,
                       @BeanProperty val password: String,
                       @BeanProperty val age: Int) {
      override def toString: String =
        "Student{" + "id=" + id + ", name='" + name + '\'' + ", password='" + password + '\'' + ", age=" + age + '}'
    }

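VIII. The RichSourceFunction abstract class

It extends AbstractRichFunction and provides the base capabilities for implementing a rich SourceFunction. The class has three subclasses: two are abstract classes that provide more specific implementations on top of it, and the third is ContinuousFileMonitoringFunction.

MessageAcknowledgingSourceBase: targets the case where the data source is a message queue, and provides an ID-based acknowledgement mechanism.

MultipleIdsMessageAcknowledgingSourceBase: builds on MessageAcknowledgingSourceBase and handles ID acknowledgement at a finer granularity. It supports two kinds of acknowledgement IDs: session id and unique message id.

ContinuousFileMonitoringFunction: a single (non-parallel) monitoring task. It takes a FileInputFormat and, depending on the FileProcessingMode and the FilePathFilter, is responsible for monitoring the user-provided path, deciding which files should be read and processed further, creating the FileInputSplit objects for those files, and assigning them to downstream tasks for further processing.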
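The idea behind ID-based acknowledgement can be sketched in plain Java. This is a simplified model, not the code of MessageAcknowledgingSourceBase: it assumes that message IDs are collected per checkpoint and acknowledged back to the queue only once that checkpoint has completed, so that a message that was never acknowledged can be delivered again after a failure. AckTracker and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of checkpoint-driven acknowledgement; NOT a Flink class.
class AckTracker {
    // IDs of messages emitted since the last checkpoint was taken.
    private List<String> currentIds = new ArrayList<>();
    // IDs grouped by the checkpoint that covers them, not yet acknowledged.
    private final TreeMap<Long, List<String>> pending = new TreeMap<>();
    // IDs acknowledged back to the queue so far (kept only for the demo).
    final List<String> acknowledged = new ArrayList<>();

    // Called for every message that the source emits.
    void emitted(String messageId) {
        currentIds.add(messageId);
    }

    // Called when a checkpoint is taken: freeze the IDs seen so far under its number.
    void snapshot(long checkpointId) {
        pending.put(checkpointId, currentIds);
        currentIds = new ArrayList<>();
    }

    // Called when a checkpoint has completed: acknowledge everything it covers,
    // including IDs of earlier checkpoints that were never confirmed.
    void checkpointComplete(long checkpointId) {
        Iterator<Map.Entry<Long, List<String>>> it =
            pending.headMap(checkpointId, true).entrySet().iterator();
        while (it.hasNext()) {
            acknowledged.addAll(it.next().getValue());
            it.remove();
        }
    }

    public static void main(String[] args) {
        AckTracker tracker = new AckTracker();
        tracker.emitted("m1");
        tracker.emitted("m2");
        tracker.snapshot(1L);
        tracker.emitted("m3");          // arrives after checkpoint 1 was taken
        tracker.checkpointComplete(1L); // m1 and m2 are acknowledged, m3 is not yet
        System.out.println(tracker.acknowledged);
    }
}
```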

IX. Custom SQL sink

    package com.test.ghf.sqlSink

    import java.sql.{Connection, DriverManager, PreparedStatement}

    import org.apache.flink.configuration.Configuration
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
    import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}

    import scala.beans.BeanProperty

    class SQLSink extends RichSinkFunction[Student] {
      var conn: Connection = null
      var ps: PreparedStatement = null

      val url: String = "jdbc:mysql://172.16.190.76:3306/hive"
      val user: String = "root"
      val password: String = "root"

      /**
       * Open the connection once in open(), so that it does not have to be
       * opened and released on every invoke().
       */
      override def open(parameters: Configuration): Unit = {
        super.open(parameters)
        conn = getConn()
        val sql = "insert into student1(id, name, password, age) values(?,?,?,?);"
        ps = conn.prepareStatement(sql)
      }

      // Returns null if the connection cannot be established.
      def getConn(): Connection = {
        var con: Connection = null
        try {
          Class.forName("com.mysql.jdbc.Driver")
          con = DriverManager.getConnection(url, user, password)
        } catch {
          case ex: Exception =>
            println("-----------mysql get connection has exception , msg = " + ex.getMessage())
        }
        con
      }

      /**
       * Once the program has finished, close the connection and release the resources.
       */
      override def close(): Unit = {
        super.close()
        // Close the statement before the connection that owns it.
        if (ps != null) {
          ps.close()
        }
        if (conn != null) {
          conn.close()
        }
      }

      /**
       * invoke() is called once for every record to insert.
       */
      override def invoke(value: Student, context: SinkFunction.Context[_]): Unit = {
        ps.setInt(1, value.getId)
        ps.setString(2, value.getName)
        ps.setString(3, value.getPassword)
        ps.setInt(4, value.getAge)
        ps.executeUpdate()
      }
    }

    case class Student(@BeanProperty val id: Int,
                       @BeanProperty val name: String,
                       @BeanProperty val password: String,
                       @BeanProperty val age: Int) {
      override def toString: String =
        "Student{" + "id=" + id + ", name='" + name + '\'' + ", password='" + password + '\'' + ", age=" + age + '}'
    }

    object Main {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        // SQLSource is not defined in this listing; it stands for a source that
        // emits Student records, such as the custom MySQL source in section VII.
        val stream = env.addSource(new SQLSource)
        stream.addSink(new SQLSink)
        env.execute("sql to sql")
      }
    }
