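Flink on YARN Quick Start Guide

Apache Flink is an efficient, distributed, general-purpose big data analytics engine implemented in Java and Scala (mostly Java). It combines the efficiency, flexibility, and scalability of distributed MapReduce-style platforms with the query optimization techniques of parallel databases. It supports both batch and stream data analysis and provides Java and Scala APIs.

According to the official Flink documentation, Flink currently supports three deployment modes: Local, Cluster, and Cloud, as shown in the figure below:

This article briefly describes how to deploy Apache Flink on YARN, that is, how to run Flink jobs on YARN. It is based on Apache Flink 1.0.0 and Hadoop 2.2.0.

There are two main ways to start Flink on YARN: (1) start a YARN session (a long-running Flink cluster on YARN); (2) submit and run a Flink job directly on YARN (run a Flink job on YARN). Both are described below.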

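Flink YARN Session

In this mode a YARN session is started, along with the two services Flink needs: the JobManager and the TaskManagers. You can then submit jobs to the cluster, and several Flink jobs can be submitted within the same session. Note that this mode requires Hadoop 2.2 or later and that HDFS must be installed, because starting a YARN session uploads the required jar and configuration files to HDFS. The session is started with the ./bin/yarn-session.sh script. Since this is our first time using the script, let's first see which options it supports: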

$ ./bin/yarn-session.sh
Usage:
   Required
     -n,--container <arg>            Number of YARN container to allocate (=Number of Task Managers)
   Optional
     -D <arg>                        Dynamic properties
     -d,--detached                   Start detached
     -jm,--jobManagerMemory <arg>    Memory for JobManager Container [in MB]
     -nm,--name <arg>                Set a custom name for the application on YARN
     -q,--query                      Display available YARN resources (memory, cores)
     -qu,--queue <arg>               Specify YARN queue.
     -s,--slots <arg>                Number of slots per TaskManager
     -st,--streaming                 Start Flink in streaming mode
     -tm,--taskManagerMemory <arg>   Memory per TaskManager Container [in MB]

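The usage text explains each option well enough. At startup you can set the number of TaskManagers and their memory (1 GB by default), as well as the JobManager memory, but there can be only one JobManager. Now let's start a YARN session: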

./bin/yarn-session.sh -n 4 -tm 8192 -s 8
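As a quick sanity check before starting a session, the totals that this yarn-session.sh invocation asks YARN for can be worked out with a little shell arithmetic. This is only a sketch: the variable names are made up for illustration, and the JobManager container (plus any overhead YARN adds) is not included.

```shell
#!/bin/sh
# Values copied from the yarn-session.sh command: -n 4 -tm 8192 -s 8
# (variable names are illustrative, not Flink settings)
task_managers=4      # -n : number of TaskManager containers
tm_memory_mb=8192    # -tm: memory per TaskManager container, in MB
slots_per_tm=8       # -s : slots per TaskManager

total_memory_mb=$((task_managers * tm_memory_mb))
total_slots=$((task_managers * slots_per_tm))

echo "TaskManager memory requested: ${total_memory_mb} MB"
echo "Total task slots: ${total_slots}"
```

With these numbers the session requests 32768 MB for TaskManagers and offers 32 slots in total. Comparing that against the output of ./bin/yarn-session.sh -q is one way to see whether the queue can accommodate it.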

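The yarn-session.sh command above starts 4 TaskManagers, each with 8 GB of memory and 8 slots (the -s value applies per TaskManager and defaults to 1; a slot typically corresponds to one core). When the YARN session starts, it loads the conf/flink-conf.yaml configuration file, and you can adjust the settings there to suit your needs (see the official Flink documentation for what each setting means). If everything goes well, a page similar to the one below appears at https://www.iteblog.com:9981/proxy/application_1453101066555_2766724/#/overview: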

How do we run a job once the YARN session is up? It is simple: we submit jobs with the ./bin/flink script. Again, let's look at which options the script supports:

[iteblog@www.iteblog.com flink-1.0.0]$ bin/flink
./flink <ACTION> [OPTIONS] [ARGUMENTS]

The following actions are available:

Action "run" compiles and runs a program.

  Syntax: run [OPTIONS] <jar-file> <arguments>
  "run" action options:
     -c,--class <classname>               Class with the program entry point
                                          ("main" method or "getPlan()" method.
                                          Only needed if the JAR file does not
                                          specify the class in its manifest.
     -C,--classpath <url>                 Adds a URL to each user code
                                          classloader on all nodes in the
                                          cluster. The paths must specify a
                                          protocol (e.g. file://) and be
                                          accessible on all nodes (e.g. by means
                                          of a NFS share). You can use this
                                          option multiple times for specifying
                                          more than one URL. The protocol must
                                          be supported by the {@link
                                          java.net.URLClassLoader}.
     -d,--detached                        If present, runs the job in detached
                                          mode
     -m,--jobmanager <host:port>          Address of the JobManager (master) to
                                          which to connect. Specify
                                          'yarn-cluster' as the JobManager to
                                          deploy a YARN cluster for the job. Use
                                          this flag to connect to a different
                                          JobManager than the one specified in
                                          the configuration.
     -p,--parallelism <parallelism>       The parallelism with which to run the
                                          program. Optional flag to override the
                                          default value specified in the
                                          configuration.
     -q,--sysoutLogging                   If present, supress logging output to
                                          standard out.
     -s,--fromSavepoint <savepointPath>   Path to a savepoint to reset the job
                                          back to (for example
                                          file:///flink/savepoint-1537).

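We can run a Flink job with the run action. The script discovers the address of the YARN session automatically, so the --jobmanager option can be omitted. Taking the WordCount program that ships with Flink as an example, first upload a test file to HDFS: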

hadoop fs -copyFromLocal LICENSE hdfs:///user/iteblog/

Then run the WordCount program with this file as its input:

./bin/flink run ./examples/batch/WordCount.jar --input hdfs:///user/iteblog/LICENSE

If everything goes well, the computed result is printed to the terminal:

(0,9)
(1,6)
(10,3)
(12,1)
(15,1)
(17,1)
(2,9)
(2004,1)
(2010,2)
(2011,2)
(2012,5)
(2013,4)
(2014,6)
(2015,7)
(2016,2)
(3,6)
(4,4)
(5,3)
(50,1)
(6,3)
(7,3)
(8,2)
(9,2)
(a,25)
(above,4)
(acceptance,1)
(accepting,3)
(act,1)
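To make the (word,count) output format easier to read, here is a rough local imitation of the computation using standard Unix tools. It is not Flink's implementation: it assumes the example lowercases the text and splits it on non-word characters, and the function name word_count is made up for this sketch.

```shell
#!/bin/sh
# word_count: read text on stdin and print "(word,count)" lines sorted by word.
# Steps: lowercase the input, turn every run of non-word characters into a
# newline, drop empty tokens, then sort, count duplicates, and format.
word_count() {
    tr '[:upper:]' '[:lower:]' |
    tr -cs '[:alnum:]_' '\n' |
    grep -v '^$' |
    LC_ALL=C sort |
    uniq -c |
    awk '{ printf "(%s,%d)\n", $2, $1 }'
}

printf 'To be, or not to be\n' | word_count
```

For the sample sentence this prints (be,2), (not,1), (or,1), and (to,2), one pair per line. Running the function over a local copy of the LICENSE file gives a result that can be compared with what the Flink job prints.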

If we would rather save the result to a file than print it to the terminal, the --output option specifies where to store it:

./bin/flink run ./examples/batch/WordCount.jar \
          --input hdfs:///user/iteblog/LICENSE \
          --output hdfs:///user/iteblog/result.txt

We can then look in hdfs:///user/iteblog/result.txt to see the result of the run.
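One way to inspect the saved result is to stream it out of HDFS with hadoop fs -cat. The helper below is a small convenience sketch; the name show_result is made up here, and it assumes the output is a single file. If the job ran with a parallelism greater than 1, the output path may instead be a directory of part files, in which case a glob such as result.txt/* would probably be needed.

```shell
#!/bin/sh
# show_result PATH [N]: print the first N lines (default 20) of an HDFS file.
# Hypothetical helper; it relies only on the standard "hadoop fs -cat" command.
show_result() {
    hadoop fs -cat "$1" | head -n "${2:-20}"
}

# Usage (requires a configured Hadoop client on the PATH):
#   show_result hdfs:///user/iteblog/result.txt 10
```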

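Two things to note: 1. The --input and --output options above are not Flink options; they are defined by the WordCount program itself.
2. When specifying a path, remember to include the scheme, such as hdfs:// above; otherwise the program looks for the file on the local file system.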
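Since a missing scheme is easy to overlook, a wrapper script could check its paths before submitting the job. The following is a crude sketch of such a check; has_scheme is a made-up helper name, and the pattern only tests for a leading letter followed somewhere by "://", which is enough for paths like the ones used in this article.

```shell
#!/bin/sh
# has_scheme PATH: succeed if PATH looks like it starts with a URI scheme
# such as hdfs:// or file:// (hypothetical helper, deliberately simple).
has_scheme() {
    case "$1" in
        [A-Za-z]*://*) return 0 ;;
        *)             return 1 ;;
    esac
}

for p in hdfs:///user/iteblog/LICENSE /user/iteblog/LICENSE; do
    if has_scheme "$p"; then
        echo "ok:      $p"
    else
        echo "warning: $p has no scheme and may be resolved on the local file system"
    fi
done
```

The first path passes the check and the second one triggers the warning.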

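Run a single Flink job on YARN

The YARN session described above starts a Flink cluster in a Hadoop YARN environment, and its resources can be shared among Flink jobs. We can also launch a single Flink job on YARN directly. We still use ./bin/flink, but no YARN session needs to be started beforehand: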

./bin/flink run -m yarn-cluster -yn 2 ./examples/batch/WordCount.jar \
          --input hdfs:///user/iteblog/LICENSE \
          --output hdfs:///user/iteblog/result.txt
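The -yn flag here is the yarn-session.sh option -n with a y prefix. As far as I know, the other short session options (for example -tm and -s) are exposed by ./bin/flink in the same way, as -ytm and -ys. The sketch below shows that mapping; to_per_job_flags is a made-up helper, it handles short options only, and the generated flags should be checked against the ./bin/flink usage text for your version.

```shell
#!/bin/sh
# to_per_job_flags: prefix each short yarn-session.sh option with "y" so the
# result can be passed to "flink run -m yarn-cluster" (hypothetical helper).
# Option values and long options (--name etc.) are passed through unchanged.
to_per_job_flags() {
    out=""
    for arg in "$@"; do
        case "$arg" in
            -[a-zA-Z]*) out="$out -y${arg#-}" ;;
            *)          out="$out $arg" ;;
        esac
    done
    echo "${out# }"
}

to_per_job_flags -n 2 -tm 4096 -s 4
```

The call prints -yn 2 -ytm 4096 -ys 4, which can be placed after -m yarn-cluster in the command above.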