Background
Following Spark, the third-generation in-memory computing framework Flink emerged. Flink absorbs the best of the design of the second-generation framework Spark and still uses a DAG model for task splitting. In stream processing, however, Spark's micro-batching suffers from poor real-time latency and in some respects cannot even match the first-generation stream framework Storm in performance. This is the main reason the third-generation engine Flink was born: Flink is a pure streaming engine, in which a micro-batch engine like Spark is just a special case of the streaming model. In this respect Flink's design philosophy is exactly the opposite of Spark's.
As shown in the figure below, Spark's module stack is built on a core batch engine (RDDs), with DStream (micro-batches) implemented on top of that batch layer. As a result, Spark Streaming cannot escape the criticism of relatively high latency inherited from batch processing.
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Computation over bounded data is essentially batch processing, while unbounded data corresponds to Flink's stream processing. In other words, Flink implements batch processing on top of a streaming model, whereas Spark implements stream processing from a batch perspective.
It is easy to see that Flink's architecture is quite similar to Spark's in elegance of design. For resource management, Flink can likewise run standalone or on YARN, Kubernetes, and so on. Above that it exposes stream processing and batch processing as two ways of handling unbounded and bounded data respectively, and on top of the DataStream and DataSet APIs it provides corresponding libraries such as SQL, CEP (Complex Event Processing), and Machine Learning.
Flink Architecture
The Flink runtime consists of two types of processes:
- JobManagers (also called masters) coordinate the distributed execution. They schedule tasks, coordinate checkpoints, coordinate recovery on failures, and so on. There is always at least one JobManager. A high-availability setup has multiple JobManagers, one of which is always the leader while the others are standby.
- TaskManagers (also called workers) execute the tasks (or more specifically, the subtasks) of a dataflow, and buffer and exchange the data streams. There must always be at least one TaskManager.
JobManagers and TaskManagers can be started in various ways: directly on machines in standalone mode, in containers, or managed by resource frameworks such as YARN or Mesos. TaskManagers connect to JobManagers, announce themselves as available, and are assigned work.
The client is not part of the runtime program execution; it is used to prepare the dataflow and send it to the JobManager. After that, the client can either disconnect or stay connected to receive progress reports. It runs either as part of the Java/Scala program that triggers the execution, or in a command-line process via ./bin/flink run ...
Each worker (TaskManager) is a JVM process and may execute one or more subtasks in separate threads. To control how many tasks a worker accepts, it offers task slots (at least one). Each task slot represents a fixed subset of the TaskManager's resources. For example, a TaskManager with 3 task slots dedicates 1/3 of its managed memory to each slot. Slicing the resources this way isolates a task's execution: once a slot has been allocated to a job's task, it cannot be taken over by tasks of other jobs. If a TaskManager has multiple task slots, more subtasks share the same JVM. Tasks in the same JVM share TCP connections (via multiplexing) and heartbeat messages.
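The fixed-resource-per-slot rule above can be sketched in plain Scala. This is a toy illustration of the arithmetic, not Flink's actual memory manager; the object and method names are invented for the example:

```scala
// Toy sketch of the slot-resource rule: a TaskManager's managed memory is
// split evenly across its task slots, and an allocated slot is not shared
// with other jobs.
object SlotMemory {
  def memoryPerSlotMb(managedMemoryMb: Long, numSlots: Int): Long = {
    require(numSlots >= 1, "a TaskManager runs at least one task slot")
    managedMemoryMb / numSlots
  }

  def main(args: Array[String]): Unit = {
    // A TaskManager with 3 slots dedicates 1/3 of its managed memory to each.
    println(memoryPerSlotMb(3072, 3)) // 1024
  }
}
```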
By default, Flink allows subtasks to share task slots, even subtasks of different tasks, as long as they belong to the same job. A single slot can therefore hold an entire pipeline of the job. Allowing this slot sharing has two main benefits:
- A Flink cluster needs exactly as many task slots as the highest parallelism used in the job; there is no need to calculate how many tasks a program contains in total.
- Better resource utilization. Sharing task slots within a job lets the system use resources more fully. Without slot sharing, the non-intensive source/map() subtasks would block as many resources as the resource-intensive window subtasks. With slot sharing, increasing the base parallelism in the example from 2 to 6 makes full use of the slotted resources while ensuring that the heavy subtasks are fairly distributed among the TaskManagers.
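The first benefit can be made concrete with a small plain-Scala sketch (not a Flink API; the names are invented for illustration): with slot sharing, the number of slots a job needs equals its highest operator parallelism, not the total subtask count.

```scala
// Toy model of slot sharing: subtasks of different operators from the same
// job may share a slot, so the cluster only needs as many slots as the
// highest parallelism among the job's operators.
object SlotSharing {
  def requiredSlots(operatorParallelism: Seq[Int]): Int =
    if (operatorParallelism.isEmpty) 0 else operatorParallelism.max

  def main(args: Array[String]): Unit = {
    // e.g. source p=6, map p=6, window p=6, sink p=1:
    // 19 subtasks in total, but only 6 slots are needed.
    println(requiredSlots(Seq(6, 6, 6, 1))) // 6
  }
}
```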
Environment Setup
Hadoop Environment
- Set CentOS process and open-file limits (takes effect after reboot)
[root@CentOS ~]# vi /etc/security/limits.conf
* soft nofile 204800
* hard nofile 204800
* soft nproc 204800
* hard nproc 204800
These limits tune Linux performance; adjust the maximums as needed.
- Configure the hostname (takes effect after reboot)
[root@CentOS ~]# vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=CentOS
[root@CentOS ~]# reboot
- Set up the IP mapping
[root@CentOS ~]# vi /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.239.131 CentOS
- Firewall service
# stop the service temporarily
[root@CentOS ~]# service iptables stop
iptables: Setting chains to policy ACCEPT: filter [ OK ]
iptables: Flushing firewall rules: [ OK ]
iptables: Unloading modules: [ OK ]
[root@CentOS ~]# service iptables status
iptables: Firewall is not running.
# disable start on boot
[root@CentOS ~]# chkconfig iptables off
[root@CentOS ~]# chkconfig --list | grep iptables
iptables 0:off 1:off 2:off 3:off 4:off 5:off 6:off
- Install JDK 1.8+
[root@CentOS ~]# rpm -ivh jdk-8u171-linux-x64.rpm
[root@CentOS ~]# ls -l /usr/java/
total 4
lrwxrwxrwx. 1 root root 16 Mar 26 00:56 default -> /usr/java/latest
drwxr-xr-x. 9 root root 4096 Mar 26 00:56 jdk1.8.0_171-amd64
lrwxrwxrwx. 1 root root 28 Mar 26 00:56 latest -> /usr/java/jdk1.8.0_171-amd64
[root@CentOS ~]# vi .bashrc
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
[root@CentOS ~]# source ~/.bashrc
- Configure passwordless SSH
[root@CentOS ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
4b:29:93:1c:7f:06:93:67:fc:c5:ed:27:9b:83:26:c0 root@CentOS
The key's randomart image is:
+--[ RSA 2048]----+
| |
| o . . |
| . + + o .|
| . = * . . . |
| = E o . . o|
| + = . +.|
| . . o + |
| o . |
| |
+-----------------+
[root@CentOS ~]# ssh-copy-id CentOS
The authenticity of host 'centos (192.168.40.128)' can't be established.
RSA key fingerprint is 3f:86:41:46:f2:05:33:31:5d:b6:11:45:9c:64:12:8e.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'centos,192.168.40.128' (RSA) to the list of known hosts.
root@centos's password:
Now try logging into the machine, with "ssh 'CentOS'", and check in:
.ssh/authorized_keys
to make sure we haven’t added extra keys that you weren’t expecting.
[root@CentOS ~]# ssh root@CentOS
Last login: Tue Mar 26 01:03:52 2019 from 192.168.40.1
[root@CentOS ~]# exit
logout
Connection to CentOS closed.
- Configure HDFS|YARN
Extract hadoop-2.9.2.tar.gz into the /usr directory, then edit the [core|hdfs|yarn|mapred]-site.xml configuration files.
[root@CentOS ~]# vi /usr/hadoop-2.9.2/etc/hadoop/core-site.xml
<!-- NameNode access endpoint -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://CentOS:9000</value>
</property>
<!-- HDFS working base directory -->
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/hadoop-2.9.2/hadoop-${user.name}</value>
</property>
[root@CentOS ~]# vi /usr/hadoop-2.9.2/etc/hadoop/hdfs-site.xml
<!-- block replication factor -->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<!-- physical host of the Secondary NameNode -->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>CentOS:50090</value>
</property>
<!-- maximum number of files a DataNode serves at any one time -->
<property>
<name>dfs.datanode.max.xcievers</name>
<value>4096</value>
</property>
<!-- DataNode handler count (parallel processing capacity) -->
<property>
<name>dfs.datanode.handler.count</name>
<value>6</value>
</property>
[root@CentOS ~]# vi /usr/hadoop-2.9.2/etc/hadoop/yarn-site.xml
<!-- the shuffle service required by the MapReduce framework -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- host of the ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>CentOS</value>
</property>
<!-- disable the physical memory check -->
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<!-- disable the virtual memory check -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
[root@CentOS ~]# vi /usr/hadoop-2.9.2/etc/hadoop/mapred-site.xml
<!-- run the MapReduce framework on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
- Configure the Hadoop environment variables
[root@CentOS ~]# vi .bashrc
HADOOP_HOME=/usr/hadoop-2.9.2
JAVA_HOME=/usr/java/latest
CLASSPATH=.
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export JAVA_HOME
export CLASSPATH
export PATH
export HADOOP_HOME
export HADOOP_CONF_DIR
export HADOOP_CLASSPATH=$(hadoop classpath)
[root@CentOS ~]# source .bashrc
- Start the Hadoop services
[root@CentOS ~]# hdfs namenode -format # format the NameNode, creating the initial fsimage
[root@CentOS ~]# start-dfs.sh
[root@CentOS ~]# start-yarn.sh
Flink Environment
Download the Flink distribution and extract it into the /usr directory:
[root@CentOS ~]# tar -zxf flink-1.8.0-bin-scala_2.11.tgz -C /usr/
Download link: http://mirror.bit.edu.cn/apache/flink/flink-1.8.0/flink-1.8.0-bin-scala_2.11.tgz
Start a Flink session on YARN (here: 7 containers, 1024 MB per TaskManager, 2 task slots each, detached mode):
[root@CentOS flink-1.8.0]# ./bin/yarn-session.sh -n 7 -tm 1024 -s 2 -d
...
Flink JobManager is now running on centos:47486 with leader id 00000000-0000-0000-0000-000000000000.
JobManager Web Interface: http://centos:47486
Write the Code
Add the following Maven dependencies and build plugins to the project:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-scala_${flink.scala.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<!-- Plugins -->
<plugin>
<!-- compiles the Scala code -->
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<id>scala-compile-first</id>
<phase>process-resources</phase>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
</executions>
</plugin>
<!-- package a fat jar -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>8</source>
<target>8</target>
</configuration>
</plugin>
Write the WordCount program and package it into a jar:
import org.apache.flink.api.scala._

object TestBatch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    env.readTextFile("hdfs://CentOS:9000/demo/words/")
      .flatMap(_.split("\\W+"))
      .map((_, 1))
      .groupBy(0)
      .sum(1)
      .print()
  }
}
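For a quick sanity check of the transformation chain without a running cluster, the same logic can be expressed over plain Scala collections. This is a local sketch for illustration, not the Flink API; the object name is invented:

```scala
// Local, collections-only version of the WordCount pipeline above:
// split on non-word characters, pair each word with 1, group, and sum.
object LocalWordCount {
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\W+"))                      // like flatMap(_.split("\\W+"))
      .filter(_.nonEmpty)                            // split can emit empty tokens
      .map((_, 1))                                   // like map((_, 1))
      .groupBy(_._1)                                 // like groupBy(0)
      .map { case (w, ps) => (w, ps.map(_._2).sum) } // like sum(1)

  def main(args: Array[String]): Unit = {
    println(count(Seq("this is a good good day")))
  }
}
```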
Execute:
[root@CentOS flink-1.8.0]# ./bin/flink run -c com.jiangzz.demo01.TestBatch -m CentOS:59800 -p 3 /root/original-flinkbatch-1.0-SNAPSHOT.jar
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/flink-1.8.0/lib/slf4j-log4j12-1.7.15.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop-2.9.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2019-04-25 13:29:47,838 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - Found Yarn properties file under /tmp/.yarn-properties-root.
2019-04-25 13:29:47,838 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - Found Yarn properties file under /tmp/.yarn-properties-root.
Starting execution of program
(day,2)
(a,1)
(demo,1)
(up,1)
(good,2)
(is,1)
(study,1)
(this,1)
Program execution finished
Job with JobID 5b35ef6eb8951510adcf461d6cd5d45f has finished.
Job Runtime: 31367 ms
Accumulator Results:
- a15eb45b31ea5df2e656861d6e7749d4 (java.util.ArrayList) [8 elements]
How Flink Integrates with YARN
When a new Flink YARN session is started, the client first checks that the requested resources (memory and vcores for the ApplicationMaster) are available. It then uploads a jar containing Flink and its configuration to HDFS (step 1). Next, the client requests a YARN container to start the ApplicationMaster (steps 2 and 3). Because the client registered the configuration and jar files as resources for the container, the YARN NodeManager on that particular machine takes care of preparing the container (e.g., downloading the files). Once that is done, the ApplicationMaster (AM) is started. The JobManager and the AM run in the same container; once they are up successfully, the AM knows the address of the JobManager (its own host). It generates a new Flink configuration file for the TaskManagers (so that they can connect to the JobManager) and uploads it to HDFS as well. The AM container also serves Flink's web interface. All ports allocated by the YARN code are ephemeral, which lets users run multiple Flink YARN sessions in parallel. Finally, the AM starts allocating containers for Flink's TaskManagers, which download the jar file and the modified configuration from HDFS. Once these steps are complete, Flink is set up and ready to accept jobs.
Standalone Mode
[root@CentOS flink-1.8.0]# vi conf/flink-conf.yaml
jobmanager.rpc.address: CentOS
# The RPC port where the JobManager is reachable.
jobmanager.rpc.port: 6123
# The heap size for the JobManager JVM
jobmanager.heap.size: 1024m
# The heap size for the TaskManager JVM
taskmanager.heap.size: 1024m
# The number of task slots that each TaskManager offers. Each slot runs one parallel pipeline.
taskmanager.numberOfTaskSlots: 8
# The parallelism used for programs that did not specify and other parallelism.
parallelism.default: 3
[root@CentOS flink-1.8.0]# vi conf/slaves
CentOS
Start the Flink cluster
[root@CentOS flink-1.8.0]# ./bin/start-cluster.sh
Visit the web UI at http://CentOS:8081.
Submit the job as follows:
[root@CentOS flink-1.8.0]# ./bin/flink run -m CentOS:8081 -c com.jiangzz.demo01.TestBatch /root/original-flinkbatch-1.0-SNAPSHOT.jar
Starting execution of program
(day,2)
(a,1)
(demo,1)
(up,1)
(good,2)
(is,1)
(study,1)
(this,1)
Program execution finished
Job with JobID 515fa31d6641100f11a20d37232fff6c has finished.
Job Runtime: 18229 ms
Accumulator Results:
- c9ca57cd45122888fa33e38682138617 (java.util.ArrayList) [8 elements]