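Passing and Using Parameters in Flink Applications

Almost every Flink application, batch or streaming, depends on external configuration parameters. They specify input and output sources (such as paths or addresses), system parameters (parallelism, runtime configuration), and application-specific parameters (typically used inside user-defined functions).

Flink provides a simple utility called ParameterTool that offers basic tools for solving these problems. You do not have to use the ParameterTool described here; other frameworks such as Commons CLI and argparse4j also work well with Flink.

I. Obtaining configuration values and loading them into ParameterTool.

ParameterTool provides a set of predefined static methods for reading configuration. Internally it is backed by a Map<String, String>, so it is easy to integrate with your own configuration style.

1. From the command line.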

public static void main(String[] args) throws Exception {
   ParameterTool parameter = ParameterTool.fromArgs(args);
   ……
}

Pass the parameters on the command line when launching the job: --name 張三 --age 20
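To make the mapping from `--key value` arguments to a key/value map concrete, here is a minimal plain-Java sketch. It is not Flink's parser: `ArgsSketch` and `parse` are hypothetical names, and the sketch only handles strict `--key value` pairs. It uses an ASCII name in place of the one in the example command.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ArgsSketch {

    /** Parses strict "--key value" pairs into a map (simplified, not Flink's actual parser). */
    public static Map<String, String> parse(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (!args[i].startsWith("--")) {
                throw new IllegalArgumentException("Expected --key but found: " + args[i]);
            }
            String key = args[i].substring(2);
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException("Missing value for key: " + key);
            }
            // Consume the next token as the value for this key
            params.put(key, args[++i]);
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params = parse(new String[] {"--name", "ZhangSan", "--age", "20"});
        System.out.println(params.get("name"));
        System.out.println(Integer.parseInt(params.get("age")));
    }
}
```

In a real job, ParameterTool.fromArgs(args) does the parsing, and the values are read with parameter.get("name") and parameter.getInt("age").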

2. From a properties file.

// kafka.properties
ParameterTool parameter = ParameterTool.fromPropertiesFile(Constant.CONFIG_NAME_KAFKA);

3. From system properties.

When starting a JVM, you can pass system properties to it, for example -Dinput=hdfs:///mydata. You can initialize ParameterTool from these system properties:

ParameterTool parameter = ParameterTool.fromSystemProperties();
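As a rough illustration of what "system properties" means here, the plain-Java sketch below copies the JVM's properties into a Map<String, String>. The class and method names are hypothetical, and `System.setProperty` stands in for launching the JVM with `-Dinput=hdfs:///mydata`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class SystemPropsSketch {

    /** Copies the JVM's string-valued system properties into a plain map. */
    public static Map<String, String> snapshot() {
        Properties sys = System.getProperties();
        Map<String, String> map = new HashMap<>();
        for (String key : sys.stringPropertyNames()) {
            map.put(key, sys.getProperty(key));
        }
        return map;
    }

    public static void main(String[] args) {
        // Has the same effect as starting the JVM with -Dinput=hdfs:///mydata
        System.setProperty("input", "hdfs:///mydata");
        System.out.println(snapshot().get("input"));
    }
}
```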

II. Using ParameterTool parameters in the program.

1. Read values directly from ParameterTool.

ParameterTool parameter = ParameterTool.fromPropertiesFile(Constant.CONFIG_NAME_KAFKA);
parameter.getNumberOfParameters(); // number of parameters
String topic = parameter.getRequired("kafka.topic");
String key = parameter.get("kafka.key", "test");
int port = parameter.getInt("kafka.port", 5672);

// Kafka configuration
// servers, groupId and reset are assumed to have been read from `parameter` in the same way as above
Properties properties = new Properties();
properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, servers);
properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, groupId);
properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, reset);

// Kafka consumer
FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<>(topic, new SimpleStringSchema(), properties);
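The getRequired, get and getInt calls above differ mainly in how they treat a missing key: the required form fails, while the forms with a second argument fall back to a default. The plain-Java sketch below mimics that behaviour over an ordinary map. It is a stand-in, not Flink's code: the helper names are hypothetical, as is the exception type.

```java
import java.util.HashMap;
import java.util.Map;

public class LookupSketch {

    /** Returns the value for key, or fails if the key is absent. */
    public static String getRequired(Map<String, String> params, String key) {
        String value = params.get(key);
        if (value == null) {
            throw new IllegalStateException("No data for required key '" + key + "'");
        }
        return value;
    }

    /** Returns the value for key, or the given default if the key is absent. */
    public static String get(Map<String, String> params, String key, String defaultValue) {
        String value = params.get(key);
        return value == null ? defaultValue : value;
    }

    /** Returns the value parsed as an int, or the given default if the key is absent. */
    public static int getInt(Map<String, String> params, String key, int defaultValue) {
        String value = params.get(key);
        return value == null ? defaultValue : Integer.parseInt(value);
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("kafka.topic", "user-opinion"); // hypothetical value

        System.out.println(getRequired(params, "kafka.topic"));
        System.out.println(get(params, "kafka.key", "test"));
        System.out.println(getInt(params, "kafka.port", 5672));
    }
}
```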

2. Because ParameterTool is serializable, it can be passed to functions.

// Data processing
sourceStream
        .connect(configBroadcastStream)
        .process(new KafkaUserOpinionBroadcastProcessFunction(parameterTool))
        .uid("broadcast-connect-process");

public class KafkaUserOpinionBroadcastProcessFunction extends KeyedBroadcastProcessFunction<Tuple, UserOpinionData, SensitiveWordConfig, String> implements Serializable {
    private static final long serialVersionUID = 10000L;

    // Parameter information
    private Map<String, String> globalJobParametersMap;

    public KafkaUserOpinionBroadcastProcessFunction(ParameterTool parameterTool) {
        this.globalJobParametersMap = parameterTool.toMap();
    }
    ……
}

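3. Register ParameterTool as global job parameters.

Parameters registered as global job parameters in the ExecutionConfig can be accessed as configuration values from the JobManager web interface and from all user-defined functions.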

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
ParameterTool parameter = ParameterTool.fromPropertiesFile(Constant.CONFIG_NAME_KAFKA);
env.getConfig().setGlobalJobParameters(parameter);

static final class UserOpinionFilter extends RichFlatMapFunction<UserOpinionData, Tuple2<String, Integer>> {
        ParameterTool parameterTool;

        @Override
        public void open(Configuration parameters) throws Exception {
            parameterTool = (ParameterTool) getRuntimeContext().getExecutionConfig().getGlobalJobParameters();
        }

        @Override
        public void flatMap(UserOpinionData userOpinionData, Collector<Tuple2<String, Integer>> collector) throws Exception {

        }
    }

Used improperly, this approach can prevent a Flink on YARN job from being submitted, for example:

[root@snd-gp2-slave bin]# ./flink run -m yarn-cluster -yn 1 -p 4 -yjm 1024 -ytm 4096 -ynm FlinkOnYarnSession-UserOpinionDataConsumer -d -c com.igg.flink.tool.userOpinionMonitor.kafka.consumer.JavaKafkaUserOpinionDataConsumer /home/flink/igg-flink-tool-1.0.0-SNAPSHOT.jar
2020-01-08 22:31:05,143 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2020-01-08 22:31:05,143 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar

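It can also make checkpoints take a long time.

III. Using the distributed cache

ParameterTool is convenient for passing parameters, but it is only suitable for a small number of them. For larger amounts of data, Flink offers other mechanisms, one of which is the distributed cache.

Specify the cached file when defining the DAG: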

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Register a file from HDFS
env.registerCachedFile("hdfs:///path/to/file", "sensitiveInfo");

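Flink also supports registering a local file as a cached file, but in general it is better to point to a file on distributed storage such as HDFS and give it a name.

Using it is simple: extend a Rich function and retrieve the file in the open method.

// Field holding the cached file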
private File sensitiveInfo = null;

@Override
public void open(Configuration parameters) throws Exception {
    sensitiveInfo = getRuntimeContext().getDistributedCache().getFile("sensitiveInfo");
}
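Once open has obtained the File, its contents still need to be loaded into memory for use by the function. The plain-Java sketch below shows one way to do that, assuming the cached file holds one sensitive word per line. That format, the class name and the `demoFile` helper are assumptions made for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.HashSet;
import java.util.Set;

public class SensitiveWordLoader {

    /** Reads one word per line, skipping blank lines (the file format is an assumption). */
    public static Set<String> load(File file) {
        try {
            Set<String> words = new HashSet<>();
            for (String line : Files.readAllLines(file.toPath(), StandardCharsets.UTF_8)) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
            return words;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a small temporary file standing in for the cached file (demo only). */
    public static File demoFile() {
        try {
            File tmp = File.createTempFile("sensitiveInfo", ".txt");
            tmp.deleteOnExit();
            Files.write(tmp.toPath(), "foo\n\nbar\n".getBytes(StandardCharsets.UTF_8));
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In a Rich function, the File would come from getDistributedCache().getFile("sensitiveInfo")
        Set<String> words = load(demoFile());
        System.out.println(words.contains("foo"));
    }
}
```

Loading the file once in open, rather than on every record, keeps the per-record work down to a set lookup.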

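A cached file defined this way is fixed and does not change. What if the cached data itself changes over time?

In that case, use connected streams, which is effectively a way of combining streams.

IV. Using connected streams

This is also a widely used approach in other compute engines.

Using ConnectedStreams requires a dynamic stream. For example, besides the main data there may be rule data that is published through a RESTful service.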

// Broadcast stream
BroadcastStream<SensitiveWordConfig> configBroadcastStream = configStream
        .map((MapFunction<String, SensitiveWordConfig>) s -> SensitiveWordConfig.buildSensitiveWordConfig(s))
        .filter((FilterFunction<SensitiveWordConfig>) sensitiveWordConfig -> sensitiveWordConfig != null)
        .uid("sensitive-word-source")
        .broadcast(new MapStateDescriptor<>(
                "user-opinion-broadcast-state-desc",
                BasicTypeInfo.STRING_TYPE_INFO,
                TypeInformation.of(new TypeHint<SensitiveWordConfig>(){})
        ));

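For the complete code, see the author's blog post "A Real-Time User Comment Monitoring Case Study Based on Kafka + Flink + Hutool" (基於Kafka+Flink+Hutool的用戶言論實時監控案例).

The approaches above all deliver data to the tasks running on the TaskManagers (TM). Sometimes we need to send data the other way, from the TMs back to the JobManager (JM). How can that be done? With an accumulator.

Flink provides accumulators for returning data, that is, from the TMs back to the JM.

Flink ships with several built-in accumulators: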

IntCounter, LongCounter, DoubleCounter – allows summing together int, long, double values sent from task managers
AverageAccumulator – calculates an average of double values
LongMaximum, LongMinimum, IntMaximum, IntMinimum, DoubleMaximum, DoubleMinimum – accumulators to determine maximum and minimum values for different types
Histogram – used to compute the distribution of values from task managers
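To illustrate what "summing together values sent from task managers" means, here is a small plain-Java sketch: each simulated parallel task counts locally, and the partial counts are merged into one result. `Counter` and `countLines` are hypothetical stand-ins written for this illustration, not Flink classes.

```java
public class CounterSketch {

    /** A tiny stand-in for an int counter: local adds plus a merge step. */
    public static final class Counter {
        private int value;

        public void add(int delta) {
            value += delta;
        }

        public void merge(Counter other) {
            value += other.value;
        }

        public int get() {
            return value;
        }
    }

    /** Simulates parallel tasks counting their own lines, then merges the partial counts. */
    public static int countLines(String[][] linesPerTask) {
        Counter total = new Counter();
        for (String[] taskLines : linesPerTask) {
            Counter local = new Counter(); // one local counter per simulated task
            for (int i = 0; i < taskLines.length; i++) {
                local.add(1);
            }
            total.merge(local); // partial results are combined into the final value
        }
        return total.get();
    }

    public static void main(String[] args) {
        String[][] input = { {"a b", "c"}, {"d"}, {} };
        System.out.println(countLines(input)); // prints 3
    }
}
```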

First define an accumulator, then register it inside a user-defined function; the client can then retrieve its value.

new RichFlatMapFunction<String, Tuple2<String, Integer>>() {

    // Create an accumulator
    private IntCounter linesNum = new IntCounter();

    @Override
    public void open(Configuration parameters) throws Exception {
        // Register accumulator
        getRuntimeContext().addAccumulator("linesNum", linesNum);
    }

    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
        String[] words = line.split("\\W+");
        for (String word : words) {
            out.collect(new Tuple2<>(word, 1));
        }
        
        // Increment after each line is processed
        linesNum.add(1);
    }
}

Retrieve the returned data where the DAG is defined:

public static void main(String[] args) throws Exception {
  // todo:

  // Run the job first; accumulator results are only available once it has finished
  JobExecutionResult result = env.execute();

  // Get accumulator result
  int linesNum = result.getAccumulatorResult("linesNum");
  System.out.println(linesNum);
}

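The sections above cover several ways of passing parameters. In day-to-day use you will often combine them rather than rely on just one, for example passing an HDFS path through ParameterTool and then reading the file through the distributed cache.

If anything here is incorrect, corrections are welcome. If you have questions, feel free to join the QQ group: 176098255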
