Flink Series (2): Setting Up a Flink Environment from Scratch, with a WordCount Example

(1) Environment Prerequisites

This article sets up a Flink environment from scratch to get familiar with the Flink development workflow. Everything runs on a single Windows machine, the development language is Java, and a few notes on cluster configuration come at the end.

Before setting up the local single-node Flink environment, make sure Java and Maven are installed. The versions used here are Java 1.8.0_241, Maven 3.6.3, and Flink 1.9.2.
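Both can be verified from a command prompt; the reported versions should match (or be close to) the ones above:

java -version
mvn -v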

Next, download the Flink binary package from the official site: https://flink.apache.org/downloads.html

The download page offers the following choices:

For a local environment, the Scala 2.11 and Scala 2.12 builds make no difference; pick either one. After extraction, the directory looks like this:

(2) Starting the Local Cluster and Web UI

The official docs describe starting a local instance as shown below:

But the actual contents of the bin directory look like this:

There is no start-local.bat. As it turns out, newer versions of Flink no longer ship this script; run start-cluster.bat directly instead, as shown below:
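As a concrete sketch (the extraction path C:\flink-1.9.2 is an assumption; substitute your own), the commands in a Windows command prompt are:

cd C:\flink-1.9.2\bin
start-cluster.bat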

Then open http://localhost:8081 in a browser. The dashboard looks like this:

The dashboard is empty at this point, but that does not affect testing; once a cluster environment is up and jobs are running, their information appears here.

There is one more way to start Flink on Windows: install Cygwin and launch it the Linux way, i.e., run start-cluster.sh. Since this is only a demo, that approach is not recommended here. For real development work, setting up a small cluster for simulation and development is the better option.

(3) Creating the Maven Project

The official recommendation is to build the project with mvn, using the following command:

mvn archetype:generate                               \
      -DarchetypeGroupId=org.apache.flink              \
      -DarchetypeArtifactId=flink-quickstart-java      \
      -DarchetypeCatalog=https://repository.apache.org/content/repositories/snapshots/ \
      -DarchetypeVersion=1.3-SNAPSHOT

Three problems came up during the actual setup:

First, -DarchetypeCatalog causes an error; simply delete that line.

Second, access to the official Maven repository is consistently slow, so switch to the Aliyun mirror. To do so, open the conf directory under the Maven installation and edit settings.xml. Note that if you use an IDE such as Eclipse, the Maven it uses may not be the one you installed yourself, so make sure to edit the settings.xml that is actually in effect.

In settings.xml, search for "mirror", uncomment the section that is commented out by default, and change it to the following:

<mirror>
    <id>nexus-aliyun</id>
    <mirrorOf>*</mirrorOf>
    <name>Nexus aliyun</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>

One caveat: the Aliyun mirror differs slightly from the official repository. With the latest Flink versions, the Aliyun mirror can fail to find the corresponding pom. In that case, edit the Maven configuration back to the official repository and rebuild, which resolves the issue.
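Switching back just means commenting the mirror out again:

<!--
<mirror>
    <id>nexus-aliyun</id>
    <mirrorOf>*</mirrorOf>
    <name>Nexus aliyun</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>
-->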

Third, 1.3-SNAPSHOT is a long-outdated version; it was changed manually to 1.9.2. The corresponding local repository directory looks like this:

The jar's download page can be found on the Maven repository site:

https://mvnrepository.com/artifact/org.apache.flink/flink-quickstart-java/1.9.2

After downloading, place the files into the corresponding local repository directory:
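As an alternative to copying files by hand, Maven can install a downloaded jar into the local repository itself; a sketch, assuming the jar sits in the current directory:

mvn install:install-file -Dfile=flink-quickstart-java-1.9.2.jar -DgroupId=org.apache.flink -DartifactId=flink-quickstart-java -Dversion=1.9.2 -Dpackaging=jar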

After these changes, the official build command runs correctly.

The final mvn command is:

mvn archetype:generate -DarchetypeGroupId=org.apache.flink  -DarchetypeArtifactId=flink-quickstart-java  -DarchetypeVersion=1.9.2 -DgroupId=wiki-edits -DartifactId=wiki-edits -Dversion=0.1 -Dpackage=wikiedits -DinteractiveMode=false

A successful build looks like this:


After the build completes, open Eclipse, import the Maven project, and run mvn clean install. The project then looks like this:
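For reference, the generated pom should already pin the Flink version and pull in the core dependencies; an abbreviated sketch (exact contents may differ slightly by archetype version, and scopes/plugins are omitted):

<properties>
    <flink.version>1.9.2</flink.version>
</properties>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>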

(4) Creating and Running the WordCount Job

The Quickstart project contains a WordCount implementation, the Hello World of big data processing systems. The goal of WordCount is to compute word frequencies over text, e.g., how many times the words "the" or "house" occur across all Wikipedia text.

Sample input:

big data is big

Sample output:

big 2
data 1
is 1

The code below is the Quickstart project's WordCount implementation. It processes some text with two operations (FlatMap and Reduce) and prints the word counts to standard output.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class WordCount {

  public static void main(String[] args) throws Exception {

    // set up the execution environment
    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // get input data
    DataSet<String> text = env.fromElements(
        "To be, or not to be,--that is the question:--",
        "Whether 'tis nobler in the mind to suffer",
        "The slings and arrows of outrageous fortune",
        "Or to take arms against a sea of troubles,"
        );

    DataSet<Tuple2<String, Integer>> counts =
        // split up the lines in pairs (2-tuples) containing: (word,1)
        text.flatMap(new LineSplitter())
        // group by the tuple field "0" and sum up tuple field "1"
        .groupBy(0)
        .sum(1);

    // execute and print result
    counts.print();
  }
}

These operations are defined in dedicated classes. Below is the LineSplitter implementation; it is a static nested class of WordCount and additionally needs imports for org.apache.flink.api.common.functions.FlatMapFunction and org.apache.flink.util.Collector.

public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {

  @Override
  public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
    // normalize and split the line
    String[] tokens = value.toLowerCase().split("\\W+");

    // emit the pairs
    for (String token : tokens) {
      if (token.length() > 0) {
        out.collect(new Tuple2<String, Integer>(token, 1));
      }
    }
  }
}

The run output is as follows:
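A few of the printed tuples, for reference (the full list is longer, and ordering may vary between runs):

(and,1)
(be,2)
(of,2)
(or,2)
(the,3)
(to,4)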

The implementation above is very simple, so let's design a more elaborate WordCount example that adds input and output handling around the DataSet structure.

First, write a WordCountData class:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class WordCountData {

	public static final String[] WORDS = new String[] {
		"To be, or not to be,--that is the question:--",
		"Whether 'tis nobler in the mind to suffer",
		"The slings and arrows of outrageous fortune",
		"Or to take arms against a sea of troubles,",
		"And by opposing end them?--To die,--to sleep,--",
		"No more; and by a sleep to say we end",
		"The heartache, and the thousand natural shocks",
		"That flesh is heir to,--'tis a consummation",
		"Devoutly to be wish'd. To die,--to sleep;--",
		"To sleep! perchance to dream:--ay, there's the rub;",
		"For in that sleep of death what dreams may come,",
		"When we have shuffled off this mortal coil,",
		"Must give us pause: there's the respect",
		"That makes calamity of so long life;",
		"For who would bear the whips and scorns of time,",
		"The oppressor's wrong, the proud man's contumely,",
		"The pangs of despis'd love, the law's delay,",
		"The insolence of office, and the spurns",
		"That patient merit of the unworthy takes,",
		"When he himself might his quietus make",
		"With a bare bodkin? who would these fardels bear,",
		"To grunt and sweat under a weary life,",
		"But that the dread of something after death,--",
		"The undiscover'd country, from whose bourn",
		"No traveller returns,--puzzles the will,",
		"And makes us rather bear those ills we have",
		"Than fly to others that we know not of?",
		"Thus conscience does make cowards of us all;",
		"And thus the native hue of resolution",
		"Is sicklied o'er with the pale cast of thought;",
		"And enterprises of great pith and moment,",
		"With this regard, their currents turn awry,",
		"And lose the name of action.--Soft you now!",
		"The fair Ophelia!--Nymph, in thy orisons",
		"Be all my sins remember'd."
	};

	public static DataSet<String> getDefaultTextLineDataSet(ExecutionEnvironment env) {
		return env.fromElements(WORDS);
	}
}

Then write the WordCount example itself:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.MultipleParameterTool;
import org.apache.flink.util.Collector;
import org.apache.flink.util.Preconditions;

public class WordCount {

	// *************************************************************************
	//     PROGRAM
	// *************************************************************************

	public static void main(String[] args) throws Exception {

		final MultipleParameterTool params = MultipleParameterTool.fromArgs(args);

		// set up the execution environment
		final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

		// make parameters available in the web interface
		env.getConfig().setGlobalJobParameters(params);

		// get input data
		DataSet<String> text = null;
		if (params.has("input")) {
			// union all the inputs from text files
			for (String input : params.getMultiParameterRequired("input")) {
				if (text == null) {
					text = env.readTextFile(input);
				} else {
					text = text.union(env.readTextFile(input));
				}
			}
			Preconditions.checkNotNull(text, "Input DataSet should not be null.");
		} else {
			// get default test text data
			System.out.println("Executing WordCount example with default input data set.");
			System.out.println("Use --input to specify file input.");
			text = WordCountData.getDefaultTextLineDataSet(env);
		}

		DataSet<Tuple2<String, Integer>> counts =
				// split up the lines in pairs (2-tuples) containing: (word,1)
				text.flatMap(new Tokenizer())
				// group by the tuple field "0" and sum up tuple field "1"
				.groupBy(0)
				.sum(1);

		// emit result
		if (params.has("output")) {
			counts.writeAsCsv(params.get("output"), "\n", " ");
			// execute program
			env.execute("WordCount Example");
		} else {
			System.out.println("Printing result to stdout. Use --output to specify output path.");
			counts.print();
		}

	}

	// *************************************************************************
	//     USER FUNCTIONS
	// *************************************************************************

	/**
	 * Implements the string tokenizer that splits sentences into words as a user-defined
	 * FlatMapFunction. The function takes a line (String) and splits it into
	 * multiple pairs in the form of "(word,1)" ({@code Tuple2<String, Integer>}).
	 */
	public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {

		@Override
		public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
			// normalize and split the line
			String[] tokens = value.toLowerCase().split("\\W+");

			// emit the pairs
			for (String token : tokens) {
				if (token.length() > 0) {
					out.collect(new Tuple2<>(token, 1));
				}
			}
		}
	}

}

The output is the same as in the first example.
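To exercise the --input/--output branches, pass program arguments in the IDE run configuration, or package the project and submit it to the local cluster. A sketch (the jar name follows the archetype settings above; the main class name and file paths are placeholders to adjust):

bin\flink.bat run -c wikiedits.WordCount wiki-edits-0.1.jar --input C:\tmp\hamlet.txt --output C:\tmp\counts.csv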

(5) Cluster Execution and Kafka Integration

First, add the corresponding Java dependency to the pom:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka-0.8_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>
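The streaming snippet below also uses WikipediaEditsSource, which ships as a separate connector. Assuming the same ${flink.version} property, the extra dependency would look like:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-wikiedits_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>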

Next, remove the print() call from the original code and write the results to Kafka instead:

    StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();

    DataStream<WikipediaEditEvent> edits = see.addSource(new WikipediaEditsSource());

    KeyedStream<WikipediaEditEvent, String> keyedEdits = edits
      .keyBy(new KeySelector<WikipediaEditEvent, String>() {
        @Override
        public String getKey(WikipediaEditEvent event) {
          return event.getUser();
        }
      });

    DataStream<Tuple2<String, Long>> result = keyedEdits
      .timeWindow(Time.seconds(5))
      .fold(new Tuple2<>("", 0L), new FoldFunction<WikipediaEditEvent, Tuple2<String, Long>>() {
        @Override
        public Tuple2<String, Long> fold(Tuple2<String, Long> acc, WikipediaEditEvent event) {
          acc.f0 = event.getUser();
          acc.f1 += event.getByteDiff();
          return acc;
        }
      });

    result
        .map(new MapFunction<Tuple2<String, Long>, String>() {
            @Override
            public String map(Tuple2<String, Long> tuple) {
                return tuple.toString();
            }
        })
        .addSink(new FlinkKafkaProducer08<>("localhost:9092", "wiki-result", new SimpleStringSchema()));

    see.execute();
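The sink assumes a broker at localhost:9092 and a topic named wiki-result. With a local Kafka installation from the 0.8.x era (matching the connector above), the broker and topic can be prepared roughly like this, run from the Kafka directory:

bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wiki-result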

Finally, attach a console consumer to the topic and watch the output:

bin/kafka-console-consumer.sh  --zookeeper localhost:2181 --topic wiki-result

The running job can be observed in the web UI:
