Learning Flink: how data gets in

Learning Flink 002: how data gets in

1. Flink's worldview

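In Flink's worldview everything is made of streams: offline data is a stream with a boundary, and real-time data is a stream without one. These are the so-called bounded and unbounded streams.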

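Unbounded streams: an unbounded stream has a start but no end. It never terminates and keeps delivering data as it is produced, so it must be processed continuously; that is, each event has to be handled promptly after it is ingested. We cannot wait for all of the data to arrive, because the input is unbounded and is never complete at any point in time. Processing unbounded data often requires ingesting events in a specific order (for example, the order in which they occurred) so that the completeness of a result can be reasoned about.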

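Bounded streams: a bounded stream has a clearly defined start and end. It can be processed by ingesting all of the data before performing any computation. Ordered ingestion is not required, because a bounded data set can always be sorted. Processing bounded streams is also known as batch processing.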
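
To make the ordering point concrete, here is a minimal plain-Java sketch (no Flink involved). The class and method names are made up for illustration. The bounded path can sort everything first; the unbounded path sees events one at a time in arrival order and can only notice that an event is older than something it has already seen.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Plain-Java illustration only; none of these types are Flink classes. */
class BoundedVsUnbounded {

    /** Bounded input: everything is available, so sort by timestamp before processing. */
    static List<Long> processBounded(List<Long> timestamps) {
        List<Long> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted);
        return sorted;
    }

    /** Unbounded input: events are handled one at a time, in arrival order. */
    static final class ArrivalOrderProcessor {
        private long maxSeen = Long.MIN_VALUE;

        /** Returns true if this event is older than one that was already seen. */
        boolean onEvent(long timestamp) {
            boolean outOfOrder = timestamp < maxSeen;
            maxSeen = Math.max(maxSeen, timestamp);
            return outOfOrder;
        }
    }

    public static void main(String[] args) {
        List<Long> arrivals = Arrays.asList(1L, 3L, 2L, 4L);
        System.out.println("bounded, sorted first: " + processBounded(arrivals));

        ArrivalOrderProcessor processor = new ArrivalOrderProcessor();
        for (long ts : arrivals) {
            System.out.println("unbounded, ts=" + ts + " outOfOrder=" + processor.onEvent(ts));
        }
    }
}
```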

2. WordCount

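import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.MultipleParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
// WordCountData ships with Flink's streaming examples module
import org.apache.flink.streaming.examples.wordcount.util.WordCountData;
import org.apache.flink.util.Collector;
import org.apache.flink.util.Preconditions;

public class WordCount {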

	// *************************************************************************
	// PROGRAM
	// *************************************************************************

	public static void main(String[] args) throws Exception {

		// Checking input parameters
		final MultipleParameterTool params = MultipleParameterTool.fromArgs(args);

		// set up the execution environment
		final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

		// make parameters available in the web interface
		env.getConfig().setGlobalJobParameters(params);

		// get input data
		DataStream<String> text = null;
		if (params.has("input")) {
			// union all the inputs from text files
			for (String input : params.getMultiParameterRequired("input")) {
				if (text == null) {
					text = env.readTextFile(input);
				} else {
					text = text.union(env.readTextFile(input));
				}
			}
			Preconditions.checkNotNull(text, "Input DataStream should not be null.");
		} else {
			System.out.println("Executing WordCount example with default input data set.");
			System.out.println("Use --input to specify file input.");
			// get default test text data
			text = env.fromElements(WordCountData.WORDS);
		}

		DataStream<Tuple2<String, Integer>> counts =
			// split up the lines in pairs (2-tuples) containing: (word,1)
			text.flatMap(new Tokenizer())
			// group by the tuple field "0" and sum up tuple field "1"
			.keyBy(0).sum(1);

		// emit result
		if (params.has("output")) {
			counts.writeAsText(params.get("output"));
		} else {
			System.out.println("Printing result to stdout. Use --output to specify output path.");
			counts.print();
		}
		// execute program
		env.execute("Streaming WordCount");
	}

	// *************************************************************************
	// USER FUNCTIONS
	// *************************************************************************

	/**
	 * Implements the string tokenizer that splits sentences into words as a
	 * user-defined FlatMapFunction. The function takes a line (String) and
	 * splits it into multiple pairs in the form of "(word,1)" ({@code Tuple2<String,
	 * Integer>}).
	 */
	public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {

		@Override
		public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
			// normalize and split the line
			String[] tokens = value.toLowerCase().split("\\W+");

			// emit the pairs
			for (String token : tokens) {
				if (token.length() > 0) {
					out.collect(new Tuple2<>(token, 1));
				}
			}
		}
	}

}
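
To get a feel for what `flatMap(new Tokenizer()).keyBy(0).sum(1)` produces on a stream, the following plain-Java simulation may help. It is not Flink code: the `HashMap` merely stands in for the per-key state, and the class name is made up. The point it illustrates is that a streaming sum emits an updated count for every incoming word rather than one final total per word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java simulation of tokenize -> key by word -> running sum; not Flink code. */
class WordCountSimulation {

    /** Processes lines one by one and returns every (word,runningCount) update in order. */
    static List<String> run(List<String> lines) {
        Map<String, Integer> state = new HashMap<>();   // running sum per key
        List<String> emitted = new ArrayList<>();
        for (String line : lines) {
            // same normalisation and split as the Tokenizer above
            for (String token : line.toLowerCase().split("\\W+")) {
                if (token.length() > 0) {
                    int count = state.merge(token, 1, Integer::sum);
                    emitted.add("(" + token + "," + count + ")");
                }
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        // prints (to,1) (be,1) (or,1) (not,1) (to,2) (be,2), one per line
        run(Arrays.asList("To be, or not to be")).forEach(System.out::println);
    }
}
```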

4. Data sources

Data sources are built through methods on StreamExecutionEnvironment, so the first step is to obtain one:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

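The WordCount example above reads its data with the readTextFile method of StreamExecutionEnvironment, but that approach does not fit our current use case because it is not real-time processing. We use socketTextStream as the example instead: given a hostname and a port, it builds a data source that receives data over the network.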

public DataStreamSource<String> socketTextStream(String hostname, int port) {
   return socketTextStream(hostname, port, "\n");
}

public DataStreamSource<String> socketTextStream(String hostname, int port, String delimiter) {
   return socketTextStream(hostname, port, delimiter, 0);
}

public DataStreamSource<String> socketTextStream(String hostname, int port, String delimiter, long maxRetry) {
   return addSource(new SocketTextStreamFunction(hostname, port, delimiter, maxRetry),
         "Socket Stream");
}

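As shown, the hostname and port passed in, together with the default line delimiter "\n" and a maximum retry count of 0, are used to construct a SocketTextStreamFunction instance, and the source node gets the default name "Socket Stream".
The class hierarchy diagram of SocketTextStreamFunction shows that it is an implementation of SourceFunction, which is the base interface for data sources in Flink.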

That is, SocketTextStreamFunction implements the SourceFunction interface; SourceFunction extends both Function and Serializable, and Function itself also extends Serializable.

These are the methods declared by SourceFunction:

@Public
public interface SourceFunction<T> extends Function, Serializable {
   void run(SourceContext<T> ctx) throws Exception;
   void cancel();
   @Public
   interface SourceContext<T> {
      void collect(T element);
      @PublicEvolving
      void collectWithTimestamp(T element, long timestamp);
      @PublicEvolving
      void emitWatermark(Watermark mark);
      @PublicEvolving
      void markAsTemporarilyIdle();
      Object getCheckpointLock();
      void close();
   }
}

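run(SourceContext): this is where the data acquisition logic lives. Through the ctx parameter (of type SourceContext) it forwards data to downstream nodes.
cancel(): stops the source from producing data. The run method usually contains a loop that keeps producing data, and cancel makes that loop terminate.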
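
The run/cancel contract can be sketched without Flink. In the minimal example below, `MiniSource` is a hypothetical stand-in for SourceFunction and a plain `Consumer` plays the role of `SourceContext.collect`; only the loop-plus-flag pattern is meant to carry over.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class CounterSourceDemo {
    /** Runs the source and cancels it once `limit` elements have been collected. */
    static List<Long> collectUntil(int limit) {
        CounterSource source = new CounterSource();
        List<Long> collected = new ArrayList<>();
        source.run(value -> {
            collected.add(value);
            if (collected.size() == limit) {
                source.cancel();
            }
        });
        return collected;
    }

    public static void main(String[] args) {
        System.out.println(collectUntil(5)); // prints [0, 1, 2, 3, 4]
    }
}

/** Hypothetical stand-in for SourceFunction; not a Flink interface. */
interface MiniSource<T> {
    void run(Consumer<T> out) throws Exception;
    void cancel();
}

/** Emits 0, 1, 2, ... until cancel() is called. */
class CounterSource implements MiniSource<Long> {
    // volatile so that a cancel() issued from another thread is seen by the run() loop
    private volatile boolean isRunning = true;

    @Override
    public void run(Consumer<Long> out) {
        long next = 0;
        while (isRunning) {
            out.accept(next++);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}
```

Here cancel is called from inside the consumer to keep the example single-threaded; in a real job it would typically be called from a different thread, which is why the flag is volatile.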

Concretely, let's look at how SocketTextStreamFunction is implemented, mainly its run method.

First, the class description:

/**
 * A source function that reads strings from a socket. The source will read bytes from the socket
 * stream and convert them to characters, each byte individually. When the delimiter character is
 * received, the function will output the current string, and begin a new string.
 */

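SocketTextStreamFunction reads bytes from a socket and converts them to characters. Characters read before a delimiter arrives are accumulated into one String; once the delimiter is received, that string is emitted and a new one begins.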
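
The buffering behaviour just described can be tried out without any socket. `DelimiterSplitter` below is a hypothetical helper written for this post, not a Flink class; it keeps incomplete data in a buffer across chunks and, like the run method of SocketTextStreamFunction, drops a trailing "\r" when the delimiter is "\n".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that mirrors the buffering idea, without any socket. */
class DelimiterSplitter {
    private final StringBuilder buffer = new StringBuilder();
    private final String delimiter;

    DelimiterSplitter(String delimiter) {
        this.delimiter = delimiter;
    }

    /** Appends a chunk and returns every record that the chunk completed. */
    List<String> feed(String chunk) {
        buffer.append(chunk);
        List<String> records = new ArrayList<>();
        int delimPos;
        while ((delimPos = buffer.indexOf(delimiter)) != -1) {
            String record = buffer.substring(0, delimPos);
            // treat "\r\n" like "\n" when splitting on newlines
            if (delimiter.equals("\n") && record.endsWith("\r")) {
                record = record.substring(0, record.length() - 1);
            }
            records.add(record);
            buffer.delete(0, delimPos + delimiter.length());
        }
        return records;
    }

    /** Whatever is left in the buffer, i.e. an unterminated last record. */
    String remainder() {
        return buffer.toString();
    }

    public static void main(String[] args) {
        DelimiterSplitter splitter = new DelimiterSplitter("\n");
        System.out.println(splitter.feed("hello\r\nwor")); // prints [hello]
        System.out.println(splitter.feed("ld\nta"));       // prints [world]
        System.out.println(splitter.remainder());          // prints ta
    }
}
```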

These are the main fields of SocketTextStreamFunction:

/** Default delay between successive connection attempts. */
    private static final int DEFAULT_CONNECTION_RETRY_SLEEP = 500;

    /** Default connection timeout when connecting to the server socket (infinite). */
    private static final int CONNECTION_TIMEOUT_TIME = 0;

    private final String hostname;
    private final int port;
    private final String delimiter;
    private final long maxNumRetries;
    private final long delayBetweenRetries;

    private transient Socket currentSocket;
        
    private volatile boolean isRunning = true;

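isRunning is the volatile boolean flag behind the loop mentioned above: cancel sets it to false and the loop in run exits. delimiter is passed in through the constructor and specifies what separates two Strings. Below is the core of every data source class, the implementation of the run method: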

public void run(SourceContext<String> ctx) throws Exception {
   final StringBuilder buffer = new StringBuilder();
   long attempt = 0;  // number of connection attempts so far
   /** First (outer) loop: keeps going for as long as the source is in the running state */
   while (isRunning) {
      try (Socket socket = new Socket()) {
         /** Connect a socket to the given hostname and port, and build a BufferedReader to read data from it */
         currentSocket = socket;
         LOG.info("Connecting to server socket " + hostname + ':' + port);
         socket.connect(new InetSocketAddress(hostname, port), CONNECTION_TIMEOUT_TIME);
         BufferedReader reader = new BufferedReader(new InputStreamReader(socket.getInputStream()));
         char[] cbuf = new char[8192];
         int bytesRead;
         /** Second loop: re-checks the running state and checks how many characters were read (-1 means end of stream) */
         while (isRunning && (bytesRead = reader.read(cbuf)) != -1) {
            buffer.append(cbuf, 0, bytesRead);
            int delimPos;
            /** Third loop: split the buffered data on the delimiter and forward each complete record downstream as one string */
            while (buffer.length() >= delimiter.length() && (delimPos = buffer.indexOf(delimiter)) != -1) {
               String record = buffer.substring(0, delimPos);
               if (delimiter.equals("\n") && record.endsWith("\r")) {
                  record = record.substring(0, record.length() - 1);
               }
               /** Forward the record through the ctx parameter */
               ctx.collect(record);
               buffer.delete(0, delimPos + delimiter.length());
            }
         }
      }
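      /** If the read loop exited because the end of the stream was reached, use the running state and the configured maximum number of retries to decide whether to sleep and retry or to leave the loop */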
      if (isRunning) {
         attempt++;
         if (maxNumRetries == -1 || attempt < maxNumRetries) {
            LOG.warn("Lost connection to server socket. Retrying in " + delayBetweenRetries + " msecs...");
            Thread.sleep(delayBetweenRetries);
         }
         else {
            break;
         }
      }
   }
   /** After the outer loop has exited, check whether the buffer still holds data and, if so, forward it downstream */
   if (buffer.length() > 0) {
      ctx.collect(buffer.toString());
   }
}
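
The retry branch is easy to misread, so here is a small stand-alone sketch of just the counting logic from the listing above. It assumes the worst case, where every connection immediately hits end of stream; `connectionAttempts` and its `cap` parameter are made up for this illustration (the cap only exists so that the -1 case terminates).

```java
class RetryCountDemo {
    /**
     * Counts how many times the loop would connect if every connection ended
     * immediately with end of stream. `cap` bounds the otherwise infinite -1 case.
     */
    static int connectionAttempts(long maxNumRetries, int cap) {
        long attempt = 0;
        int connections = 0;
        while (connections < cap) {
            connections++;   // stands in for "connect and read until end of stream"
            attempt++;
            if (maxNumRetries == -1 || attempt < maxNumRetries) {
                continue;    // the real code sleeps for delayBetweenRetries here
            }
            break;
        }
        return connections;
    }

    public static void main(String[] args) {
        for (long max : new long[] {0, 1, 3, -1}) {
            System.out.println("maxNumRetries=" + max
                    + " -> connections=" + connectionAttempts(max, 10));
        }
    }
}
```

With the condition exactly as written in the listing, values 0 and 1 both result in a single connection, 3 results in three connections in total, and -1 keeps reconnecting until the source is cancelled.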

The cancel method:

public void cancel() {
   isRunning = false;
   Socket theSocket = this.currentSocket;
   /** Close the current socket if it is not null */
   if (theSocket != null) {
      IOUtils.closeSocket(theSocket);
   }
}

The StreamExecutionEnvironment.addSource() method:

public <OUT> DataStreamSource<OUT> addSource(SourceFunction<OUT> function, String sourceName) {
   return addSource(function, sourceName, null);
}

public <OUT> DataStreamSource<OUT> addSource(SourceFunction<OUT> function, String sourceName, TypeInformation<OUT> typeInfo) {
   /** If no output type information was passed in, try to extract it */
   if (typeInfo == null) {
      if (function instanceof ResultTypeQueryable) {
         /** If the function implements ResultTypeQueryable, ask it directly */
         typeInfo = ((ResultTypeQueryable<OUT>) function).getProducedType();
      } else {
         try {
            /** Otherwise extract the type information via reflection */
            typeInfo = TypeExtractor.createTypeInfo(
                  SourceFunction.class,
                  function.getClass(), 0, null, null);
         } catch (final InvalidTypesException e) {
            /** If extraction fails, fall back to a MissingTypeInfo instance */
            typeInfo = (TypeInformation<OUT>) new MissingTypeInfo(sourceName, e);
         }
      }
   }
   /** The source is parallel if the function is an instance of ParallelSourceFunction */
   boolean isParallel = function instanceof ParallelSourceFunction;
   /** Closure cleaning: reduces what has to be serialized and helps avoid serialization errors */
   clean(function);
   StreamSource<OUT, ?> sourceOperator;
   /** Build a different StreamOperator depending on whether the function is a StoppableFunction */
   if (function instanceof StoppableFunction) {
      sourceOperator = new StoppableStreamSource<>(cast2StoppableSourceFunction(function));
   } else {
      sourceOperator = new StreamSource<>(function);
   }
   /** Return a newly built DataStreamSource instance */
   return new DataStreamSource<>(this, typeInfo, sourceOperator, isParallel, sourceName);
}
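
The reflection branch in addSource relies on the fact that generic type arguments written in a class declaration survive compilation. The sketch below shows that idea with plain Java reflection. `TypedSource` is a hypothetical stand-in for SourceFunction, and the logic is far simpler than Flink's TypeExtractor; it is only meant to show why a class declared as a source of String can be resolved while a still-generic class cannot.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.Optional;

class TypeExtractionDemo {
    /** Hypothetical stand-in for SourceFunction<T>. */
    interface TypedSource<T> {}

    /** The element type is fixed in the class declaration, so reflection can see it. */
    static class StringSource implements TypedSource<String> {}

    /** The element type is still a type variable, so nothing concrete can be extracted. */
    static class GenericSource<T> implements TypedSource<T> {}

    /** Returns the concrete T of TypedSource<T> as declared by `clazz`, if there is one. */
    static Optional<Class<?>> elementType(Class<?> clazz) {
        for (Type iface : clazz.getGenericInterfaces()) {
            if (iface instanceof ParameterizedType) {
                ParameterizedType pt = (ParameterizedType) iface;
                if (pt.getRawType() == TypedSource.class) {
                    Type arg = pt.getActualTypeArguments()[0];
                    if (arg instanceof Class) {
                        return Optional.of((Class<?>) arg);
                    }
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(elementType(StringSource.class));  // Optional[class java.lang.String]
        System.out.println(elementType(GenericSource.class)); // Optional.empty
    }
}
```

The empty result corresponds loosely to the case where addSource falls back to MissingTypeInfo; implementing ResultTypeQueryable or passing typeInfo explicitly avoids the guesswork.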

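Following the chain of addSource overloads, we end up with a DataStreamSource instance.
TypeInformation is the core class of Flink's type system. The input and output types of functions must be described by a TypeInformation, which can be viewed as a toolbox for a data type: from it you can obtain the type's serializer, comparator, and so on.

The class hierarchy diagram of StreamSource shows that it is a concrete implementation of the StreamOperator interface. Its constructor takes an instance of a SourceFunction subclass, here the SocketTextStreamFunction instance introduced earlier. It is constructed as follows: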

public StreamSource(SRC sourceFunction) {
   super(sourceFunction);
   this.chainingStrategy = ChainingStrategy.HEAD;
}

public AbstractUdfStreamOperator(F userFunction) {
   this.userFunction = requireNonNull(userFunction);
   checkUdfCheckpointingPreconditions();
}

private void checkUdfCheckpointingPreconditions() {
   if (userFunction instanceof CheckpointedFunction && userFunction instanceof ListCheckpointed) {
      throw new IllegalStateException("User functions are not allowed to implement CheckpointedFunction AND ListCheckpointed.");
   }
}

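The constructor stores the userFunction passed in as a field, validates it, and then sets the chaining strategy to HEAD.
To improve execution efficiency, Flink merges (chains) adjacent nodes of a processing pipeline where possible. There are three chaining strategies:
ALWAYS: chain with the preceding and following nodes whenever possible;
NEVER: do not chain with the preceding or following nodes;
HEAD: chain only with following nodes, never with a preceding node.
A data source is the very first node of the pipeline, so it can only use HEAD or NEVER; StreamSource uses HEAD.
StreamOperator is the base interface for stream operators in Flink. Its abstract subclass AbstractStreamOperator implements some common methods, and user-defined processing logic is wrapped in concrete StreamOperator implementations.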
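
The three chaining strategies listed above can be turned into a small decision function. This is a simplified sketch with a local enum, not Flink's own class, and it looks at the strategies only; the real decision also depends on other conditions such as parallelism and how the two nodes are connected.

```java
class ChainingDemo {
    /** Mirrors the three strategies listed above; a local enum, not Flink's class. */
    enum Strategy { ALWAYS, NEVER, HEAD }

    /**
     * Simplified rule: the downstream node must be willing to chain to its
     * predecessor (ALWAYS), and the upstream node must be willing to chain to
     * its successor (ALWAYS or HEAD).
     */
    static boolean canChain(Strategy upstream, Strategy downstream) {
        return downstream == Strategy.ALWAYS
                && (upstream == Strategy.ALWAYS || upstream == Strategy.HEAD);
    }

    public static void main(String[] args) {
        // a source (HEAD) followed by an ordinary operator (ALWAYS)
        System.out.println(canChain(Strategy.HEAD, Strategy.ALWAYS));   // true
        // nothing may be chained in front of a HEAD node
        System.out.println(canChain(Strategy.ALWAYS, Strategy.HEAD));   // false
        // NEVER blocks chaining on either side
        System.out.println(canChain(Strategy.NEVER, Strategy.ALWAYS));  // false
        System.out.println(canChain(Strategy.ALWAYS, Strategy.NEVER));  // false
    }
}
```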

Once the sourceOperator variable has been assigned, the DataStreamSource instance is built and returned as the result of the source-construction call:

return new DataStreamSource<>(this, typeInfo, sourceOperator, isParallel, sourceName);

In Flink, a DataStream describes a stream of elements of the same type and provides the APIs for operating on that data, such as map and reduce. DataStreamSource is constructed as follows:

public DataStreamSource(StreamExecutionEnvironment environment,
      TypeInformation<T> outTypeInfo, StreamSource<T, ?> operator,
      boolean isParallel, String sourceName) {
   super(environment, new SourceTransformation<>(sourceName, operator, outTypeInfo, environment.getParallelism()));
   this.isParallel = isParallel;
   if (!isParallel) {
      setParallelism(1);
   }
}

protected SingleOutputStreamOperator(StreamExecutionEnvironment environment, StreamTransformation<T> transformation) {
   super(environment, transformation);
}

public DataStream(StreamExecutionEnvironment environment, StreamTransformation<T> transformation) {
   this.environment = Preconditions.checkNotNull(environment, "Execution Environment must not be null.");
   this.transformation = Preconditions.checkNotNull(transformation, "Stream Transformation must not be null.");
}

So construction boils down to initializing the environment and transformation fields of DataStream.

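The transformation field is assigned a SourceTransformation instance. SourceTransformation is a subclass of StreamTransformation, and a StreamTransformation describes the operation that creates a DataStream. Every DataStream is backed by a concrete StreamTransformation, which is why the constructor sets the transformation field. Many DataStream methods simply delegate to the corresponding StreamTransformation methods, for example those for parallelism, id, output type information, and resource specifications.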

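With that, the data source that produces data from the given hostname and port is fully constructed. The result is a DataStreamSource instance describing the source of a stream whose output type is String.
Four abstractions appeared during this construction: Function (SourceFunction), StreamOperator, StreamTransformation, and DataStream: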

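Function: users implement their own processing logic by implementing one of its sub-interfaces. Above, SourceFunction was implemented to receive data from a given hostname and port and forward it as strings.
StreamOperator: the base interface for stream operators. Its concrete implementations hold the user-defined function in a field and are responsible for calling userFunction with the arguments it needs. For example, when StreamSource calls the run method of the SourceFunction, it builds a concrete SourceContext and passes it in so that run can forward data.

StreamTransformation: describes the operation that builds a DataStream, along with its parallelism, output type, and so on, and has a field that holds a concrete StreamOperator instance.
DataStream: describes a stream of elements of the same type. It is backed by a concrete StreamTransformation and provides the APIs for transforming the data on the stream.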

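Through these relationships, the user-defined processing function, the parallelism, the output type, and so on all end up contained in the DataStream, so a DataStream can fully describe a concrete data stream.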

The containment relationship between the four is: Function -> StreamOperator -> StreamTransformation -> DataStream.
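
The containment chain can be mimicked with a few toy classes. Everything below (`MiniFunction`, `MiniOperator`, `MiniTransformation`, `MiniStream`, and this `addSource`) is hypothetical and only loosely modelled on the constructors quoted earlier; it shows how each layer wraps the previous one and how a non-parallel source ends up with parallelism 1.

```java
class LayersDemo {
    /** User logic; stands in for a SourceFunction. */
    interface MiniFunction {
        String produce();
    }

    /** Holds the user function and is the layer that would call it; stands in for StreamSource. */
    static final class MiniOperator {
        final MiniFunction userFunction;

        MiniOperator(MiniFunction userFunction) {
            this.userFunction = userFunction;
        }
    }

    /** Describes how the stream is created: name, operator, parallelism. */
    static final class MiniTransformation {
        final String name;
        final MiniOperator operator;
        int parallelism;

        MiniTransformation(String name, MiniOperator operator, int parallelism) {
            this.name = name;
            this.operator = operator;
            this.parallelism = parallelism;
        }
    }

    /** User-facing handle that delegates to its transformation; stands in for DataStreamSource. */
    static final class MiniStream {
        final MiniTransformation transformation;

        MiniStream(MiniTransformation transformation, boolean isParallel) {
            this.transformation = transformation;
            if (!isParallel) {
                transformation.parallelism = 1; // same idea as in the DataStreamSource constructor
            }
        }

        int getParallelism() {
            return transformation.parallelism;
        }
    }

    /** Wraps function -> operator -> transformation -> stream. */
    static MiniStream addSource(MiniFunction function, String name,
                                int defaultParallelism, boolean isParallel) {
        MiniOperator operator = new MiniOperator(function);
        MiniTransformation transformation =
                new MiniTransformation(name, operator, defaultParallelism);
        return new MiniStream(transformation, isParallel);
    }

    public static void main(String[] args) {
        MiniStream stream = addSource(() -> "hello", "Socket Stream", 4, false);
        System.out.println(stream.transformation.name + " parallelism=" + stream.getParallelism());
        System.out.println(stream.transformation.operator.userFunction.produce());
    }
}
```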

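Having built a data source and clarified how these Flink abstractions relate to each other, the next step is to apply operations to the source to reach the final goal of data statistics and analysis.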
