Background:
My company recently needed to adopt the Flink framework for some big-data report analysis tasks. I had no hands-on experience with Flink, so I had to learn it. To keep from forgetting what I learned, and to make later review easier, I am writing this up here; I hope it also helps anyone who needs it.
Environment setup:
I set everything up on my own laptop:
- VMware with a CentOS 7 virtual machine, network configured
- IDEA on Windows 10, with Maven configured (3.0 or later is required; I used 3.6.2)
- flink-1.7.2-bin-hadoop27-scala_2.12.tgz
- JDK (1.8 or later is required)
There are three ways to set up Flink development in IDEA:
1. Run mvn archetype:generate from the command line to generate a Flink template project, build it with mvn clean package, then import it into IDEA.
2. Create a Maven project directly in IDEA (either an empty Maven project, or a template project created from an archetype; the template generates the dependencies for you, so little modification is needed), then add the Flink dependencies to the pom.
3. Fetch the template project with curl url..., then import it into IDEA.
I only tried methods 1 and 2; both work. Below I use method 2 with an empty Maven project as the example; a screenshot of creating a project from an archetype is attached at the end.
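For reference, method 1's template generation can be sketched as follows. The archetype coordinates below follow the official Flink quickstart convention; adjust the version to match your cluster:

```shell
mvn archetype:generate \
    -DarchetypeGroupId=org.apache.flink \
    -DarchetypeArtifactId=flink-quickstart-java \
    -DarchetypeVersion=1.7.2
```

Maven will prompt for the groupId, artifactId, and package of the new project, then generate it in the current directory.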
1. First, create an empty Maven project.
2. Then edit pom.xml and add the required dependencies (you can look them up on mvnrepository.com).
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>asn</groupId>
    <artifactId>flinkLearn</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-java -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>1.7.2</version>
            <scope>provided</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.12</artifactId>
            <version>1.7.2</version>
            <scope>provided</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-scala -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_2.12</artifactId>
            <version>1.7.2</version>
            <scope>provided</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-scala -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.12</artifactId>
            <version>1.7.2</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>
</project>
<scope>provided</scope> keeps the dependency jars out of the packaged artifact, avoiding conflicts with the jars already on the Flink cluster. When running locally, however, you may need to comment it out; in other words, it is only used when packaging.
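Instead of commenting the scopes out by hand, the official Flink quickstart poms use a Maven profile that re-adds the dependencies in compile scope when the project is opened in IntelliJ IDEA (which sets the idea.version property). A sketch of that approach, to place alongside <dependencies> in the pom (the profile id and activation follow the quickstart convention; list all four dependencies the same way):

```xml
<profiles>
    <profile>
        <!-- Activated automatically inside IntelliJ IDEA, which defines idea.version -->
        <id>add-dependencies-for-IDEA</id>
        <activation>
            <property>
                <name>idea.version</name>
            </property>
        </activation>
        <dependencies>
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-java</artifactId>
                <version>1.7.2</version>
                <scope>compile</scope>
            </dependency>
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-streaming-java_2.12</artifactId>
                <version>1.7.2</version>
                <scope>compile</scope>
            </dependency>
        </dependencies>
    </profile>
</profiles>
```

With this in place, local runs inside the IDE see the Flink classes, while mvn clean package from the command line still treats them as provided.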
3. Wait for Maven to download the dependencies (this can take a while; over half an hour for me).
4. Once the dependencies are downloaded, you can start writing Flink programs. Below are two word-count programs, BatchWordCountJava and SocketWindowWordCountJava.
For some reason the netcat I downloaded would not listen on a port (nc -l 9000), so I could not test the streaming job and tested the batch job first. The complete code follows.
package wordCount;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.AggregateOperator;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class BatchWordCountJava {
    public static void main(String[] args) throws Exception {
        String inputPath = "/opt/testBatch.txt";
        String outPath = "/opt/output";
        // Get the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // Get the data source
        DataSource<String> text = env.readTextFile(inputPath);
        AggregateOperator<Tuple2<String, Integer>> sum = text.flatMap(new Tokenizer()).groupBy(0).sum(1);
        sum.writeAsCsv(outPath, "\n", " ").setParallelism(1);
        env.execute();
    }

    public static class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
            String[] split = s.toLowerCase().split("\\s+");
            for (String word : split) {
                if (word.length() > 0) {
                    collector.collect(new Tuple2<>(word, 1));
                }
            }
        }
    }
}
You can test locally first and then package and test on the server; just remember to adjust the input and output paths when packaging. (The paths in the code are on my server; change them to your own.)
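For completeness, here is a sketch of the SocketWindowWordCountJava streaming counterpart mentioned above, modeled on the standard Flink 1.7 socket word-count example. The hostname and port are placeholders; start a listener first (e.g. nc -lk 9000 on the server) before submitting the job:

```java
package wordCount;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class SocketWindowWordCountJava {
    public static void main(String[] args) throws Exception {
        // Get the streaming execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read lines from the socket; replace "localhost" with your server's address
        DataStream<String> text = env.socketTextStream("localhost", 9000, "\n");

        DataStream<Tuple2<String, Integer>> counts = text
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.toLowerCase().split("\\s+")) {
                            if (word.length() > 0) {
                                out.collect(new Tuple2<>(word, 1));
                            }
                        }
                    }
                })
                .keyBy(0)                      // group by the word
                .timeWindow(Time.seconds(5))   // tumbling 5-second windows
                .sum(1);                       // sum the counts within each window

        counts.print().setParallelism(1);
        env.execute("Socket Window WordCount");
    }
}
```

Once netcat works, type words into the nc session and the per-window counts appear in the job's stdout (visible in the TaskManager logs on a cluster).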
5. If the local run works, the next step is packaging, which needs some additional pom configuration. The complete configuration follows.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>asn</groupId>
    <artifactId>flinkLearn</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-java -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>1.7.2</version>
            <scope>provided</scope><!-- keeps extra dependencies out of the packaged jar, avoiding conflicts with the jars on the cluster -->
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.12</artifactId>
            <version>1.7.2</version>
            <scope>provided</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-scala -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_2.12</artifactId>
            <version>1.7.2</version>
            <scope>provided</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-scala -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.12</artifactId>
            <version>1.7.2</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- compiler plugin -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <!-- Scala compiler plugin -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.1.6</version>
                <configuration>
                    <scalaCompatVersion>2.12</scalaCompatVersion>
                    <scalaVersion>2.12</scalaVersion>
                    <encoding>UTF-8</encoding>
                </configuration>
                <executions>
                    <execution>
                        <id>compile-scala</id>
                        <phase>compile</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>test-compile-scala</id>
                        <phase>test-compile</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <!-- jar packaging plugin (bundles all dependencies) -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.6</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <!-- Optionally set the jar's entry class here; if unset, pass it with -c when running the jar -->
                            <mainClass>wordCount.BatchWordCountJava</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
Open a terminal in the project directory and run mvn clean package. Two jars are produced in the target folder (one of them with dependencies bundled).
Upload the second jar, the one with dependencies, to the CentOS machine; then start Flink and run the program (make sure the input and output paths exist):
[root@flink1 flink-1.7.2]# bin/flink run ../flinkLearn-1.0-SNAPSHOT-jar-with-dependencies.jar
Afterwards you can see the output file being generated, and the job's execution can also be observed in the web UI.
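If the mainClass had not been set in the assembly plugin's manifest, the entry class could instead be passed on the command line with -c, along these lines:

```shell
bin/flink run -c wordCount.BatchWordCountJava ../flinkLearn-1.0-SNAPSHOT-jar-with-dependencies.jar
```

This is also how you choose between multiple main classes packaged in the same jar.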
Creating a Maven project from an archetype: