Analysis of the Flink Job Submission Flow

Let's start with a sequence diagram of the startup flow.

As the diagram shows, everything up to and including MiniCluster belongs to the client side; everything after it runs on the remote side.

Each of the classes the job passes through could fill several articles on its own; below we walk through them in more detail alongside the code.

The StreamExecutionEnvironment class represents the environment needed to run a Flink job, and comes in two flavors: the local LocalStreamEnvironment and the remote RemoteStreamEnvironment. This environment lets us configure how the job is run. Let's look at what the class contains:

private final ExecutionConfig config = new ExecutionConfig();

StreamExecutionEnvironment holds an ExecutionConfig instance, which sets the default job parallelism (used when a function does not specify one explicitly), the number of restart attempts and the delay between them, the data exchange mode (batch or pipelined), UDF code analysis (the ClosureCleaner), serializer registration, and other options.
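
For orientation, here is a minimal sketch of these settings at the user-API level (Flink 1.x; MyEvent is a hypothetical user type):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// default parallelism for operators that do not set one explicitly
env.setParallelism(4);
// retry a failed job 3 times, 10 seconds apart
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 10_000L));
// register a custom type with the Kryo serializer
env.getConfig().registerKryoType(MyEvent.class);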

The ClosureCleaner deserves special mention: when enabled, it analyzes user code and sets unneeded closure fields to null, which in most cases makes closures and anonymous classes serializable. User code must be serializable so that tasks can be shipped between the nodes of the cluster.
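
A hedged sketch of what this means in practice: the anonymous MapFunction below carries a hidden this$0 reference to its enclosing object, and since map() never touches that object, the ClosureCleaner can null the reference out, keeping the function serializable even when the enclosing class is not.

env.fromElements(1, 2, 3)
   .map(new MapFunction<Integer, Integer>() {
      @Override
      public Integer map(Integer value) {
         return value * 2; // never uses the enclosing instance
      }
   })
   .print();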

Next is CheckpointConfig, a configuration class dedicated to the checkpointing feature:

private final CheckpointConfig checkpointCfg = new CheckpointConfig();

It covers the checkpoint mode (EXACTLY_ONCE by default), the checkpoint timeout, the trigger interval, the number of concurrent checkpoints, cleanup of persisted checkpoint files (delete or retain them when the job is cancelled), the failure handling policy, and more.
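
A sketch of these options through the public API (Flink 1.x):

env.enableCheckpointing(60_000L); // trigger a checkpoint every 60 seconds
CheckpointConfig checkpointConf = env.getCheckpointConfig();
checkpointConf.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE); // the default
checkpointConf.setCheckpointTimeout(120_000L);  // abort checkpoints that run too long
checkpointConf.setMaxConcurrentCheckpoints(1);  // at most one checkpoint in flight
// retain the persisted checkpoint files when the job is cancelled
checkpointConf.enableExternalizedCheckpoints(
      CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);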

The next field, transformations, is worth attention: it is the collection of all StreamTransformation instances belonging to this job.

protected final List<StreamTransformation<?>> transformations = new ArrayList<>();

A StreamTransformation<T> represents an operation that produces a DataStream, and every DataStream<T> holds a reference to the StreamTransformation<T> that produces it. When a DataStream method such as map is called, Flink builds a tree of StreamTransformations reflecting the topology of the computation; only when the job is actually executed does the StreamGraphGenerator turn it into a StreamGraph. Note that not every method creates a physical operation: calls such as union, split, select, rebalance, and partition only classify or reorganize other transformations and do not create operators of their own, as the sketch below shows.
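
A quick sketch of that distinction (Flink 1.x API; the returns() call only works around lambda type erasure):

DataStream<String> a = env.fromElements("foo", "bar");
DataStream<String> b = env.fromElements("baz");

// map() registers a physical OneInputTransformation with the environment
DataStream<Integer> lengths = a.map(s -> s.length()).returns(Types.INT);

// union() only records a virtual UnionTransformation; no operator is created
DataStream<String> merged = a.union(b);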

We will come back to the StreamGraph code later; first, let's look at the StreamTransformation<T> class. Each instance has a unique int id, obtained from a static, incrementing idCounter (which, notably, has no concurrency protection; more on that below). There is also a String uid, specified by the user, which is meant to stay stable across job restarts.

protected static Integer idCounter = 0;

public static int getNewNodeId() {
   idCounter++;
   return idCounter;
}
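
The doubt is justified: idCounter++ on a static boxed Integer is not atomic. In practice the graph is built from a single thread, so it rarely matters, but a thread-safe variant (our sketch, not Flink's actual code) would use an AtomicInteger:

protected static final AtomicInteger idCounter = new AtomicInteger(0);

public static int getNewNodeId() {
   // incrementAndGet is atomic, so concurrent callers never see duplicate ids
   return idCounter.incrementAndGet();
}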

The output type is wrapped in a TypeInformation instance, which is used to create serializers, comparators, and to perform some type checks.

protected TypeInformation<T> outputType;
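
A small sketch of how a TypeInformation is obtained explicitly and used to create a serializer, much as the runtime does internally (Flink 1.x API):

TypeInformation<Tuple2<String, Integer>> info =
      TypeInformation.of(new TypeHint<Tuple2<String, Integer>>() {});
TypeSerializer<Tuple2<String, Integer>> serializer =
      info.createSerializer(new ExecutionConfig());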

The following fields declare resource requirements. A ResourceSpec specifies the resources this StreamTransformation needs: number of CPU cores, heap memory, direct memory, native memory, and so on.

private ResourceSpec minResources = ResourceSpec.DEFAULT;
private ResourceSpec preferredResources = ResourceSpec.DEFAULT;

The hashCode and equals methods are overridden as follows:

@Override
public boolean equals(Object o) {
   if (this == o) {
      return true;
   }
   if (!(o instanceof StreamTransformation)) {
      return false;
   }

   StreamTransformation<?> that = (StreamTransformation<?>) o;

   if (bufferTimeout != that.bufferTimeout) {
      return false;
   }
   if (id != that.id) {
      return false;
   }
   if (parallelism != that.parallelism) {
      return false;
   }
   if (!name.equals(that.name)) {
      return false;
   }
   return outputType != null ? outputType.equals(that.outputType) : that.outputType == null;
}

@Override
public int hashCode() {
   int result = id;
   result = 31 * result + name.hashCode();
   result = 31 * result + (outputType != null ? outputType.hashCode() : 0);
   result = 31 * result + parallelism;
   result = 31 * result + (int) (bufferTimeout ^ (bufferTimeout >>> 32));
   return result;
}

That roughly covers StreamTransformation, so let's return to StreamExecutionEnvironment. The component that stores key/value pairs and state snapshots is abstracted as the StateBackend class. It must be serializable, because it is shipped together with the job code to many distributed nodes running in parallel. Subclasses of AbstractStateBackend are therefore usually implemented as factories which, after serialization and deserialization, can recreate the correct state and point to the correct storage service; this keeps them lightweight and serialization cheap. StateBackend implementations must also be thread-safe, so that multiple operators can use them concurrently.

Setting bufferTimeout controls how often output buffers are flushed, trading latency off against throughput.
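
Both settings are exposed on the environment; a sketch (the HDFS path is a placeholder):

// keep state snapshots on a file system; the backend object itself is just a
// lightweight, serializable factory, as described above
env.setStateBackend(new FsStateBackend("hdfs://namenode:9000/flink/checkpoints"));
// flush output buffers every 100 ms: lower latency at some cost in throughput
env.setBufferTimeout(100);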

Now for the API used to build the computation DAG. The addSource method adds a data source to the job. By default a source is non-parallel; to get a parallel source, the user implements the ParallelSourceFunction interface or extends RichParallelSourceFunction.

addSource wraps a SourceFunction in a StreamSource operator; when the source starts running, SourceFunction#run(SourceContext<T> ctx) is invoked and continuously emits the generated records to the SourceContext.

public <OUT> DataStreamSource<OUT> addSource(SourceFunction<OUT> function, String sourceName, TypeInformation<OUT> typeInfo) {

   // code omitted...

   boolean isParallel = function instanceof ParallelSourceFunction;

   clean(function);
   StreamSource<OUT, ?> sourceOperator;
   if (function instanceof StoppableFunction) {
      sourceOperator = new StoppableStreamSource<>(cast2StoppableSourceFunction(function));
   } else {
      sourceOperator = new StreamSource<>(function);
   }

   return new DataStreamSource<>(this, typeInfo, sourceOperator, isParallel, sourceName);
}
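
For reference, a usage sketch of a parallel source: because the function below implements ParallelSourceFunction, the isParallel flag above becomes true and a parallelism greater than one is allowed.

DataStreamSource<Long> numbers = env.addSource(new ParallelSourceFunction<Long>() {
   private volatile boolean running = true;

   @Override
   public void run(SourceContext<Long> ctx) throws Exception {
      long next = 0;
      while (running) {
         ctx.collect(next++); // continuously emit records to the SourceContext
      }
   }

   @Override
   public void cancel() {
      running = false;
   }
});
numbers.setParallelism(4); // would throw for a non-parallel source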

How is the DAG for the whole job produced? The getStreamGraph method calls StreamGraphGenerator#generate, which builds the graph from the StreamExecutionEnvironment and all the transformations it holds.

@Internal
public StreamGraph getStreamGraph() {
   if (transformations.size() <= 0) {
      throw new IllegalStateException("No operators defined in streaming topology. Cannot execute.");
   }
   return StreamGraphGenerator.generate(this, transformations);
}
public static StreamGraph generate(StreamExecutionEnvironment env, List<StreamTransformation<?>> transformations) {
   return new StreamGraphGenerator(env).generateInternal(transformations);
}
private StreamGraph generateInternal(List<StreamTransformation<?>> transformations) {
   for (StreamTransformation<?> transformation: transformations) {
      // iterate over each transformation and translate it into the graph
      transform(transformation);
   }
   return streamGraph;
}

The actual conversion is defined in StreamGraphGenerator#transform. It returns immediately for instances that have already been transformed; otherwise it dispatches to the handler for the concrete StreamTransformation type:

private Collection<Integer> transform(StreamTransformation<?> transform) {
   // return immediately if this instance has already been transformed
   if (alreadyTransformed.containsKey(transform)) {
      return alreadyTransformed.get(transform);
   }

   LOG.debug("Transforming " + transform);

   if (transform.getMaxParallelism() <= 0) {

      // if the max parallelism hasn't been set, then first use the job wide max parallelism
      // from the ExecutionConfig.
      int globalMaxParallelismFromConfig = env.getConfig().getMaxParallelism();
      if (globalMaxParallelismFromConfig > 0) {
         transform.setMaxParallelism(globalMaxParallelismFromConfig);
      }
   }

   // call at least once to trigger exceptions about MissingTypeInfo
   transform.getOutputType();
   // dispatch to the conversion logic for the concrete type
   Collection<Integer> transformedIds;
   if (transform instanceof OneInputTransformation<?, ?>) {
      transformedIds = transformOneInputTransform((OneInputTransformation<?, ?>) transform);
   } else if (transform instanceof TwoInputTransformation<?, ?, ?>) {
      transformedIds = transformTwoInputTransform((TwoInputTransformation<?, ?, ?>) transform);
   } else if (transform instanceof SourceTransformation<?>) {
      transformedIds = transformSource((SourceTransformation<?>) transform);
   } else if (transform instanceof SinkTransformation<?>) {
      transformedIds = transformSink((SinkTransformation<?>) transform);
   } else if (transform instanceof UnionTransformation<?>) {
      transformedIds = transformUnion((UnionTransformation<?>) transform);
   } else if (transform instanceof SplitTransformation<?>) {
      transformedIds = transformSplit((SplitTransformation<?>) transform);
   } else if (transform instanceof SelectTransformation<?>) {
      transformedIds = transformSelect((SelectTransformation<?>) transform);
   } else if (transform instanceof FeedbackTransformation<?>) {
      transformedIds = transformFeedback((FeedbackTransformation<?>) transform);
   } else if (transform instanceof CoFeedbackTransformation<?>) {
      transformedIds = transformCoFeedback((CoFeedbackTransformation<?>) transform);
   } else if (transform instanceof PartitionTransformation<?>) {
      transformedIds = transformPartition((PartitionTransformation<?>) transform);
   } else if (transform instanceof SideOutputTransformation<?>) {
      transformedIds = transformSideOutput((SideOutputTransformation<?>) transform);
   } else {
      throw new IllegalStateException("Unknown transformation: " + transform);
   }

   // need this check because the iterate transformation adds itself before
   // transforming the feedback edges
   if (!alreadyTransformed.containsKey(transform)) {
      alreadyTransformed.put(transform, transformedIds);
   }

   if (transform.getBufferTimeout() >= 0) {
      streamGraph.setBufferTimeout(transform.getId(), transform.getBufferTimeout());
   }
   if (transform.getUid() != null) {
      streamGraph.setTransformationUID(transform.getId(), transform.getUid());
   }
   if (transform.getUserProvidedNodeHash() != null) {
      streamGraph.setTransformationUserHash(transform.getId(), transform.getUserProvidedNodeHash());
   }

   if (transform.getMinResources() != null && transform.getPreferredResources() != null) {
      streamGraph.setResources(transform.getId(), transform.getMinResources(), transform.getPreferredResources());
   }

   return transformedIds;
}

Let's look first at the implementation of transformOneInputTransform. It fetches the input StreamTransformation instance, recursively calls transform on it, then decides which slot sharing group the current instance belongs to, adds it as an operator in the DAG (together with the corresponding edges), and sets the key serializer needed for partitioning as well as the operator's parallelism.

private <IN, OUT> Collection<Integer> transformOneInputTransform(OneInputTransformation<IN, OUT> transform) {
   // recursively transform the input StreamTransformation instance
   Collection<Integer> inputIds = transform(transform.getInput());

   // the recursive call might have already transformed this
   if (alreadyTransformed.containsKey(transform)) {
      return alreadyTransformed.get(transform);
   }
   // determine which slot sharing group this instance belongs to
   String slotSharingGroup = determineSlotSharingGroup(transform.getSlotSharingGroup(), inputIds);
   // add the corresponding operator to the DAG
   streamGraph.addOperator(transform.getId(),
         slotSharingGroup,
         transform.getCoLocationGroupKey(),
         transform.getOperator(),
         transform.getInputType(),
         transform.getOutputType(),
         transform.getName());

   if (transform.getStateKeySelector() != null) {
      TypeSerializer<?> keySerializer = transform.getStateKeyType().createSerializer(env.getConfig());
      streamGraph.setOneInputStateKey(transform.getId(), transform.getStateKeySelector(), keySerializer);
   }

   streamGraph.setParallelism(transform.getId(), transform.getParallelism());
   streamGraph.setMaxParallelism(transform.getId(), transform.getMaxParallelism());

   for (Integer inputId: inputIds) {
      // add the corresponding edges to the DAG
      streamGraph.addEdge(inputId, transform.getId(), 0);
   }

   return Collections.singleton(transform.getId());
}

The other StreamTransformation types are handled in roughly the same steps: 1) recursively transform the input StreamTransformation instances, 2) determine the slot sharing group, 3) add the operator node and its edges to the DAG, and 4) set the parallelism and the serializers needed for partitioning.
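
At the DataStream API level, the slot sharing group consulted in step 2) is supplied by the user or inherited from the inputs; a sketch, where stream and MyMapper are hypothetical:

stream.map(new MyMapper())
      .slotSharingGroup("heavy")       // place this operator in slot group "heavy"
      .filter(value -> value != null); // downstream inherits "heavy" unless overridden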

The final step is to actually launch the job by calling StreamExecutionEnvironment#execute. Local and remote modes each implement execute; here is the local implementation:

@Override
public JobExecutionResult execute(String jobName) throws Exception {
   // transform the streaming program into a JobGraph
   // turn the defined StreamTransformations into the DAG stream graph
   StreamGraph streamGraph = getStreamGraph();
   streamGraph.setJobName(jobName);
   // turn the stream graph into a job graph
   JobGraph jobGraph = streamGraph.getJobGraph();
   jobGraph.setAllowQueuedScheduling(true);

   Configuration configuration = new Configuration();
   configuration.addAll(jobGraph.getJobConfiguration());
   configuration.setString(TaskManagerOptions.MANAGED_MEMORY_SIZE, "0");

   // add (and override) the settings with what the user defined
   configuration.addAll(this.configuration);

   if (!configuration.contains(RestOptions.PORT)) {
      configuration.setInteger(RestOptions.PORT, 0);
   }

   int numSlotsPerTaskManager = configuration.getInteger(TaskManagerOptions.NUM_TASK_SLOTS, jobGraph.getMaximumParallelism());

   MiniClusterConfiguration cfg = new MiniClusterConfiguration.Builder()
      .setConfiguration(configuration)
      .setNumSlotsPerTaskManager(numSlotsPerTaskManager)
      .build();

   if (LOG.isInfoEnabled()) {
      LOG.info("Running job on local embedded Flink mini cluster");
   }

   MiniCluster miniCluster = new MiniCluster(cfg);

   try {
      miniCluster.start();
      configuration.setInteger(RestOptions.PORT, miniCluster.getRestAddress().getPort());

      return miniCluster.executeJobBlocking(jobGraph);
   }
   finally {
      transformations.clear();
      miniCluster.close();
   }
}

Before execution, the StreamGraph must be converted into a JobGraph, which roughly involves the following steps:

1) compute a hash for every node
2) where possible, chain adjacent operations together to improve execution efficiency
3) set up the edges of the graph and record them in the configuration
4) assign slot sharing groups and the placement of co-located nodes
5) configure the checkpointing mechanism
6) upload user-provided files to the distributed cache

public JobGraph getJobGraph(@Nullable JobID jobID) {
   // temporarily forbid checkpointing for iterative jobs
   if (isIterative() && checkpointConfig.isCheckpointingEnabled() && !checkpointConfig.isForceCheckpointing()) {
      throw new UnsupportedOperationException(
         "Checkpointing is currently not supported by default for iterative jobs, as we cannot guarantee exactly once semantics. "
            + "State checkpoints happen normally, but records in-transit during the snapshot will be lost upon failure. "
            + "\nThe user can force enable state checkpoints with the reduced guarantees by calling: env.enableCheckpointing(interval,true)");
   }

   return StreamingJobGraphGenerator.createJobGraph(this, jobID);
}
public static JobGraph createJobGraph(StreamGraph streamGraph, @Nullable JobID jobID) {
   return new StreamingJobGraphGenerator(streamGraph, jobID).createJobGraph();
}
private JobGraph createJobGraph() {

   // make sure that all vertices start immediately
   jobGraph.setScheduleMode(ScheduleMode.EAGER);

   // Generate deterministic hashes for the nodes in order to identify them across
   // submission iff they didn't change.
   Map<Integer, byte[]> hashes = defaultStreamGraphHasher.traverseStreamGraphAndGenerateHashes(streamGraph);

   // Generate legacy version hashes for backwards compatibility
   List<Map<Integer, byte[]>> legacyHashes = new ArrayList<>(legacyStreamGraphHashers.size());
   for (StreamGraphHasher hasher : legacyStreamGraphHashers) {
      legacyHashes.add(hasher.traverseStreamGraphAndGenerateHashes(streamGraph));
   }

   Map<Integer, List<Tuple2<byte[], byte[]>>> chainedOperatorHashes = new HashMap<>();
   // chain together the operators that can be merged
   setChaining(hashes, legacyHashes, chainedOperatorHashes);
   // set up the edges of the job graph and write them into the configuration
   setPhysicalEdges();
   // assign slot sharing groups and the placement of co-located nodes
   setSlotSharingAndCoLocation();

   // configure the checkpointing mechanism
   configureCheckpointing();

   // upload user-provided files to the distributed cache
   JobGraphGenerator.addUserArtifactEntries(streamGraph.getEnvironment().getCachedFiles(), jobGraph);

   // set the ExecutionConfig last when it has been finalized
   try {
      jobGraph.setExecutionConfig(streamGraph.getExecutionConfig());
   }
   catch (IOException e) {
      throw new IllegalConfigurationException("Could not serialize the ExecutionConfig." +
            "This indicates that non-serializable types (like custom serializers) were registered");
   }

   return jobGraph;
}

Once we have the JobGraph, we need to submit it to a service that implements the JobExecutor interface. In local mode, the MiniCluster class implements JobExecutor#executeJobBlocking(JobGraph job) to execute the JobGraph:

@Override
public JobExecutionResult executeJobBlocking(JobGraph job) throws JobExecutionException, InterruptedException {
   checkNotNull(job, "job is null");

   final CompletableFuture<JobSubmissionResult> submissionFuture = submitJob(job);

   final CompletableFuture<JobResult> jobResultFuture = submissionFuture.thenCompose(
      (JobSubmissionResult ignored) -> requestJobResult(job.getJobID()));

   final JobResult jobResult;

   try {
      jobResult = jobResultFuture.get();
   } catch (ExecutionException e) {
      throw new JobExecutionException(job.getJobID(), "Could not retrieve JobResult.", ExceptionUtils.stripExecutionException(e));
   }

   try {
      return jobResult.toJobExecutionResult(Thread.currentThread().getContextClassLoader());
   } catch (IOException | ClassNotFoundException e) {
      throw new JobExecutionException(job.getJobID(), e);
   }
}

The method above calls submitJob to submit the job. The concrete steps are: enable queued scheduling, upload the jar files the job needs to the blob server, and submit the job to the DispatcherGateway. The code is as follows:

public CompletableFuture<JobSubmissionResult> submitJob(JobGraph jobGraph) {
   final DispatcherGateway dispatcherGateway;
   try {
      dispatcherGateway = getDispatcherGateway();
   } catch (LeaderRetrievalException | InterruptedException e) {
      ExceptionUtils.checkInterrupted(e);
      return FutureUtils.completedExceptionally(e);
   }

   // we have to allow queued scheduling in Flip-6 mode because we need to request slots
   // from the ResourceManager
   jobGraph.setAllowQueuedScheduling(true);

   final CompletableFuture<InetSocketAddress> blobServerAddressFuture = createBlobServerAddress(dispatcherGateway);

   final CompletableFuture<Void> jarUploadFuture = uploadAndSetJobFiles(blobServerAddressFuture, jobGraph);

   final CompletableFuture<Acknowledge> acknowledgeCompletableFuture = jarUploadFuture.thenCompose(
      (Void ack) -> dispatcherGateway.submitJob(jobGraph, rpcTimeout));

   return acknowledgeCompletableFuture.thenApply(
      (Acknowledge ignored) -> new JobSubmissionResult(jobGraph.getJobID()));
}

In the Dispatcher class, the submission process is implemented as follows:

@Override
public CompletableFuture<Acknowledge> submitJob(JobGraph jobGraph, Time timeout) {
   return internalSubmitJob(jobGraph).whenCompleteAsync((acknowledge, throwable) -> {
      if (throwable != null) {
         cleanUpJobData(jobGraph.getJobID(), true);
      }
   }, getRpcService().getExecutor());
}

private CompletableFuture<Acknowledge> internalSubmitJob(JobGraph jobGraph) {
   final JobID jobId = jobGraph.getJobID();

   log.info("Submitting job {} ({}).", jobId, jobGraph.getName());
   final RunningJobsRegistry.JobSchedulingStatus jobSchedulingStatus;

   try {
      jobSchedulingStatus = runningJobsRegistry.getJobSchedulingStatus(jobId);
   } catch (IOException e) {
      return FutureUtils.completedExceptionally(new FlinkException(String.format("Failed to retrieve job scheduling status for job %s.", jobId), e));
   }

   if (jobSchedulingStatus == RunningJobsRegistry.JobSchedulingStatus.DONE || jobManagerRunnerFutures.containsKey(jobId)) {
      return FutureUtils.completedExceptionally(
         new JobSubmissionException(jobId, String.format("Job has already been submitted and is in state %s.", jobSchedulingStatus)));
   } else {
      final CompletableFuture<Acknowledge> persistAndRunFuture = waitForTerminatingJobManager(jobId, jobGraph, this::persistAndRunJob)
         .thenApply(ignored -> Acknowledge.get());

      return persistAndRunFuture.exceptionally(
         (Throwable throwable) -> {
            final Throwable strippedThrowable = ExceptionUtils.stripCompletionException(throwable);
            log.error("Failed to submit job {}.", jobId, strippedThrowable);
            throw new CompletionException(
               new JobSubmissionException(jobId, "Failed to submit job.", strippedThrowable));
         });
   }
}

The persistAndRunJob method stores the submitted JobGraph as a SubmittedJobGraph and then runs the job; if running it throws an exception, the job is removed again.

private CompletableFuture<Void> persistAndRunJob(JobGraph jobGraph) throws Exception {
   submittedJobGraphStore.putJobGraph(new SubmittedJobGraph(jobGraph));

   final CompletableFuture<Void> runJobFuture = runJob(jobGraph);

   return runJobFuture.whenComplete(BiConsumerWithException.unchecked((Object ignored, Throwable throwable) -> {
      if (throwable != null) {
         submittedJobGraphStore.removeJobGraph(jobGraph.getJobID());
      }
   }));
}
private CompletableFuture<Void> runJob(JobGraph jobGraph) {
   Preconditions.checkState(!jobManagerRunnerFutures.containsKey(jobGraph.getJobID()));

   final CompletableFuture<JobManagerRunner> jobManagerRunnerFuture = createJobManagerRunner(jobGraph);

   jobManagerRunnerFutures.put(jobGraph.getJobID(), jobManagerRunnerFuture);

   return jobManagerRunnerFuture
      .thenApply(FunctionUtils.nullFn())
      .whenCompleteAsync(
         (ignored, throwable) -> {
            if (throwable != null) {
               jobManagerRunnerFutures.remove(jobGraph.getJobID());
            }
         },
         getMainThreadExecutor());
}
private CompletableFuture<JobManagerRunner> createJobManagerRunner(JobGraph jobGraph) {
   final RpcService rpcService = getRpcService();

   final CompletableFuture<JobManagerRunner> jobManagerRunnerFuture = CompletableFuture.supplyAsync(
      CheckedSupplier.unchecked(() ->
         jobManagerRunnerFactory.createJobManagerRunner(
            ResourceID.generate(),
            jobGraph,
            configuration,
            rpcService,
            highAvailabilityServices,
            heartbeatServices,
            blobServer,
            jobManagerSharedServices,
            new DefaultJobManagerJobMetricGroupFactory(jobManagerMetricGroup),
            fatalErrorHandler)),
      rpcService.getExecutor());

   return jobManagerRunnerFuture.thenApply(FunctionUtils.uncheckedFunction(this::startJobManagerRunner));
}

We can see that it creates a JobManagerRunner instance, which creates a JobMaster for the job and builds the ExecutionGraph held inside that JobMaster. It is startJobManagerRunner that finally kicks off execution.

private JobManagerRunner startJobManagerRunner(JobManagerRunner jobManagerRunner) throws Exception {
   final JobID jobId = jobManagerRunner.getJobGraph().getJobID();
   jobManagerRunner.getResultFuture().whenCompleteAsync(
      (ArchivedExecutionGraph archivedExecutionGraph, Throwable throwable) -> {
         // check if we are still the active JobManagerRunner by checking the identity
         //noinspection ObjectEquality
         if (jobManagerRunner == jobManagerRunnerFutures.get(jobId).getNow(null)) {
            if (archivedExecutionGraph != null) {
               jobReachedGloballyTerminalState(archivedExecutionGraph);
            } else {
               final Throwable strippedThrowable = ExceptionUtils.stripCompletionException(throwable);

               if (strippedThrowable instanceof JobNotFinishedException) {
                  jobNotFinished(jobId);
               } else {
                  jobMasterFailed(jobId, strippedThrowable);
               }
            }
         } else {
            log.debug("There is a newer JobManagerRunner for the job {}.", jobId);
         }
      }, getMainThreadExecutor());

   jobManagerRunner.start();

   return jobManagerRunner;
}

JobManagerRunner's start method illustrates how leader consistency is handled in a distributed system. JobManagerRunner itself implements the LeaderContender interface, meaning, as the name suggests, that it can take part in leader election. Its start method passes itself to LeaderElectionService's start method, which starts the election service and tries to become leader; when leadership is won, the service calls back the contender's (JobManagerRunner's) grantLeadership method, which in turn calls verifyJobSchedulingStatusAndStartJobManager to start the execution of the corresponding JobMaster.

public void start() throws Exception {
   try {
      // contend for leadership; on success the election service calls back grantLeadership
      leaderElectionService.start(this);
   } catch (Exception e) {
      log.error("Could not start the JobManager because the leader election service did not start.", e);
      throw new Exception("Could not start the leader election service.", e);
   }
}
@Override
public void grantLeadership(final UUID leaderSessionID) {
   synchronized (lock) {
      if (shutdown) {
         log.info("JobManagerRunner already shutdown.");
         return;
      }

      try {
         verifyJobSchedulingStatusAndStartJobManager(leaderSessionID);
      } catch (Exception e) {
         handleJobManagerRunnerError(e);
      }
   }
}
private void verifyJobSchedulingStatusAndStartJobManager(UUID leaderSessionId) throws Exception {
   final JobSchedulingStatus jobSchedulingStatus = runningJobsRegistry.getJobSchedulingStatus(jobGraph.getJobID());

   if (jobSchedulingStatus == JobSchedulingStatus.DONE) {
      log.info("Granted leader ship but job {} has been finished. ", jobGraph.getJobID());
      jobFinishedByOther();
   } else {
      log.info("JobManager runner for job {} ({}) was granted leadership with session id {} at {}.",
         jobGraph.getName(), jobGraph.getJobID(), leaderSessionId, getAddress());

      runningJobsRegistry.setJobRunning(jobGraph.getJobID());

      final CompletableFuture<Acknowledge> startFuture = jobMaster.start(new JobMasterId(leaderSessionId), rpcTimeout);
      final CompletableFuture<JobMasterGateway> currentLeaderGatewayFuture = leaderGatewayFuture;

      startFuture.whenCompleteAsync(
         (Acknowledge ack, Throwable throwable) -> {
            if (throwable != null) {
               handleJobManagerRunnerError(new FlinkException("Could not start the job manager.", throwable));
            } else {
               confirmLeaderSessionIdIfStillLeader(leaderSessionId, currentLeaderGatewayFuture);
            }
         },
         jobManagerSharedServices.getScheduledExecutorService());
   }
}
public CompletableFuture<Acknowledge> start(final JobMasterId newJobMasterId, final Time timeout) throws Exception {
   // make sure we receive RPC and async calls
   super.start();

   return callAsyncWithoutFencing(() -> startJobExecution(newJobMasterId), timeout);
}
private Acknowledge startJobExecution(JobMasterId newJobMasterId) throws Exception {
   validateRunsInMainThread();

   checkNotNull(newJobMasterId, "The new JobMasterId must not be null.");

   if (Objects.equals(getFencingToken(), newJobMasterId)) {
      log.info("Already started the job execution with JobMasterId {}.", newJobMasterId);

      return Acknowledge.get();
   }

   setNewFencingToken(newJobMasterId);

   startJobMasterServices();

   log.info("Starting execution of job {} ({})", jobGraph.getName(), jobGraph.getJobID());

   resetAndScheduleExecutionGraph();

   return Acknowledge.get();
}

The resetAndScheduleExecutionGraph method turns the JobGraph into an ExecutionGraph and schedules it for execution.

private void resetAndScheduleExecutionGraph() throws Exception {
   validateRunsInMainThread();

   final CompletableFuture<Void> executionGraphAssignedFuture;

   if (executionGraph.getState() == JobStatus.CREATED) {
      executionGraphAssignedFuture = CompletableFuture.completedFuture(null);
   } else {
      suspendAndClearExecutionGraphFields(new FlinkException("ExecutionGraph is being reset in order to be rescheduled."));
      final JobManagerJobMetricGroup newJobManagerJobMetricGroup = jobMetricGroupFactory.create(jobGraph);
      final ExecutionGraph newExecutionGraph = createAndRestoreExecutionGraph(newJobManagerJobMetricGroup);

      executionGraphAssignedFuture = executionGraph.getTerminationFuture().handleAsync(
         (JobStatus ignored, Throwable throwable) -> {
            assignExecutionGraph(newExecutionGraph, newJobManagerJobMetricGroup);
            return null;
         },
         getMainThreadExecutor());
   }

   executionGraphAssignedFuture.thenRun(this::scheduleExecutionGraph);
}

createAndRestoreExecutionGraph creates a new ExecutionGraph from the JobGraph and restores its state from a saved checkpoint. ExecutionGraphBuilder.buildGraph describes the creation process: all nodes of the JobGraph are added, in topological order, to an existing or newly created ExecutionGraph:

private ExecutionGraph createAndRestoreExecutionGraph(JobManagerJobMetricGroup currentJobManagerJobMetricGroup) throws Exception {
   ExecutionGraph newExecutionGraph = createExecutionGraph(currentJobManagerJobMetricGroup);

   final CheckpointCoordinator checkpointCoordinator = newExecutionGraph.getCheckpointCoordinator();

   if (checkpointCoordinator != null) {
      // check whether we find a valid checkpoint
      if (!checkpointCoordinator.restoreLatestCheckpointedState(
         newExecutionGraph.getAllVertices(),
         false,
         false)) {

         // check whether we can restore from a savepoint
         // try to restore the ExecutionGraph state from a savepoint
         tryRestoreExecutionGraphFromSavepoint(newExecutionGraph, jobGraph.getSavepointRestoreSettings());
      }
   }

   return newExecutionGraph;
}
private ExecutionGraph createExecutionGraph(JobManagerJobMetricGroup currentJobManagerJobMetricGroup) throws JobExecutionException, JobException {
   return ExecutionGraphBuilder.buildGraph(
      null,
      jobGraph,
      jobMasterConfiguration.getConfiguration(),
      scheduledExecutorService,
      scheduledExecutorService,
      slotPool.getSlotProvider(),
      userCodeLoader,
      highAvailabilityServices.getCheckpointRecoveryFactory(),
      rpcTimeout,
      restartStrategy,
      currentJobManagerJobMetricGroup,
      blobServer,
      jobMasterConfiguration.getSlotRequestTimeout(),
      log);
}

The actual attachment is defined in the ExecutionGraph.attachJobGraph method.

public void attachJobGraph(List<JobVertex> topologicallySorted) throws JobException {

   LOG.debug("Attaching {} topologically sorted vertices to existing job graph with {} " +
         "vertices and {} intermediate results.",
         topologicallySorted.size(), tasks.size(), intermediateResults.size());

   final ArrayList<ExecutionJobVertex> newExecJobVertices = new ArrayList<>(topologicallySorted.size());
   final long createTimestamp = System.currentTimeMillis();

   for (JobVertex jobVertex : topologicallySorted) {

      if (jobVertex.isInputVertex() && !jobVertex.isStoppable()) {
         this.isStoppable = false;
      }

      // create the execution job vertex and attach it to the graph
      ExecutionJobVertex ejv = new ExecutionJobVertex(
         this,
         jobVertex,
         1,
         rpcTimeout,
         globalModVersion,
         createTimestamp);

      ejv.connectToPredecessors(this.intermediateResults);

      ExecutionJobVertex previousTask = this.tasks.putIfAbsent(jobVertex.getID(), ejv);
      if (previousTask != null) {
         throw new JobException(String.format("Encountered two job vertices with ID %s : previous=[%s] / new=[%s]",
               jobVertex.getID(), ejv, previousTask));
      }

      for (IntermediateResult res : ejv.getProducedDataSets()) {
         IntermediateResult previousDataSet = this.intermediateResults.putIfAbsent(res.getId(), res);
         if (previousDataSet != null) {
            throw new JobException(String.format("Encountered two intermediate data set with ID %s : previous=[%s] / new=[%s]",
                  res.getId(), res, previousDataSet));
         }
      }

      this.verticesInCreationOrder.add(ejv);
      this.numVerticesTotal += ejv.getParallelism();
      newExecJobVertices.add(ejv);
   }

   terminationFuture = new CompletableFuture<>();
   failoverStrategy.notifyNewVertices(newExecJobVertices);
}

With the ExecutionGraph built, scheduleExecutionGraph can schedule it for execution. Flink supports two scheduling modes: LAZY_FROM_SOURCES initializes an operator node only once its input data is ready, while EAGER loads all nodes of the graph up front in topological order.

private void scheduleExecutionGraph() {
   checkState(jobStatusListener == null);
   // register self as job status change listener
   jobStatusListener = new JobManagerJobStatusListener();
   executionGraph.registerJobStatusListener(jobStatusListener);

   try {
      executionGraph.scheduleForExecution();
   }
   catch (Throwable t) {
      executionGraph.failGlobal(t);
   }
}
public void scheduleForExecution() throws JobException {

   final long currentGlobalModVersion = globalModVersion;

   if (transitionState(JobStatus.CREATED, JobStatus.RUNNING)) {

      final CompletableFuture<Void> newSchedulingFuture;

      switch (scheduleMode) {

         case LAZY_FROM_SOURCES:
            // initialize an operator node only when its input data is ready
            newSchedulingFuture = scheduleLazy(slotProvider);
            break;

         case EAGER:
            // initialize all operator nodes right away
            newSchedulingFuture = scheduleEager(slotProvider, allocationTimeout);
            break;

         default:
            throw new JobException("Schedule mode is invalid.");
      }

      if (state == JobStatus.RUNNING && currentGlobalModVersion == globalModVersion) {
         schedulingFuture = newSchedulingFuture;

         newSchedulingFuture.whenCompleteAsync(
            (Void ignored, Throwable throwable) -> {
               if (throwable != null && !(throwable instanceof CancellationException)) {
                  // only fail if the scheduling future was not canceled
                  failGlobal(ExceptionUtils.stripCompletionException(throwable));
               }
            },
            futureExecutor);
      } else {
         newSchedulingFuture.cancel(false);
      }
   }
   else {
      throw new IllegalStateException("Job may only be scheduled from state " + JobStatus.CREATED);
   }
}

EAGER-mode initialization allocates resources (asynchronously) for each ExecutionJobVertex in topological order. Once allocation completes, it yields a collection of Execution instances, each representing one execution attempt of a task, and Execution.deploy is called on each of them to deploy it onto the allocated resources.

private CompletableFuture<Void> scheduleEager(SlotProvider slotProvider, final Time timeout) {
   checkState(state == JobStatus.RUNNING, "job is not running currently");

   // Important: reserve all the space we need up front.
   // that way we do not have any operation that can fail between allocating the slots
   // and adding them to the list. If we had a failure in between there, that would
   // cause the slots to get lost
   final boolean queued = allowQueuedScheduling;

   // collecting all the slots may resize and fail in that operation without slots getting lost
   final ArrayList<CompletableFuture<Execution>> allAllocationFutures = new ArrayList<>(getNumberOfExecutionJobVertices());

   final Set<AllocationID> allPreviousAllocationIds =
      Collections.unmodifiableSet(computeAllPriorAllocationIdsIfRequiredByScheduling());

   // allocate the slots (obtain all their futures
   for (ExecutionJobVertex ejv : getVerticesTopologically()) {
      // these calls are not blocking, they only return futures
      Collection<CompletableFuture<Execution>> allocationFutures = ejv.allocateResourcesForAll(
         slotProvider,
         queued,
         LocationPreferenceConstraint.ALL,
         allPreviousAllocationIds,
         timeout);

      allAllocationFutures.addAll(allocationFutures);
   }

   // this future is complete once all slot futures are complete.
   // the future fails once one slot future fails.
   final ConjunctFuture<Collection<Execution>> allAllocationsFuture = FutureUtils.combineAll(allAllocationFutures);

   final CompletableFuture<Void> currentSchedulingFuture = allAllocationsFuture
      .thenAccept(
         (Collection<Execution> executionsToDeploy) -> {
            for (Execution execution : executionsToDeploy) {
               try {
                  execution.deploy();
               } catch (Throwable t) {
                  throw new CompletionException(
                     new FlinkException(
                        String.format("Could not deploy execution %s.", execution),
                        t));
               }
            }
         })
      // Generate a more specific failure message for the eager scheduling
      .exceptionally(
         (Throwable throwable) -> {
            final Throwable strippedThrowable = ExceptionUtils.stripCompletionException(throwable);
            final Throwable resultThrowable;

            if (strippedThrowable instanceof TimeoutException) {
               int numTotal = allAllocationsFuture.getNumFuturesTotal();
               int numComplete = allAllocationsFuture.getNumFuturesCompleted();
               String message = "Could not allocate all requires slots within timeout of " +
                  timeout + ". Slots required: " + numTotal + ", slots allocated: " + numComplete;

               resultThrowable = new NoResourceAvailableException(message);
            } else {
               resultThrowable = strippedThrowable;
            }

            throw new CompletionException(resultThrowable);
         });

   return currentSchedulingFuture;
}

Good, we are very close to the end of the whole flow! The Execution class represents one concrete execution; let's see how it gets deployed.

public void deploy() throws JobException {
   final LogicalSlot slot  = assignedResource;

   checkNotNull(slot, "In order to deploy the execution we first have to assign a resource via tryAssignResource.");

   // Check if the TaskManager died in the meantime
   // This only speeds up the response to TaskManagers failing concurrently to deployments.
   // The more general check is the rpcTimeout of the deployment call
   if (!slot.isAlive()) {
      throw new JobException("Target slot (TaskManager) for deployment is no longer alive.");
   }

   // make sure exactly one deployment call happens from the correct state
   // note: the transition from CREATED to DEPLOYING is for testing purposes only
   ExecutionState previous = this.state;
   if (previous == SCHEDULED || previous == CREATED) {
      if (!transitionState(previous, DEPLOYING)) {
         // race condition, someone else beat us to the deploying call.
         // this should actually not happen and indicates a race somewhere else
         throw new IllegalStateException("Cannot deploy task: Concurrent deployment call race.");
      }
   }
   else {
      // vertex may have been cancelled, or it was already scheduled
      throw new IllegalStateException("The vertex must be in CREATED or SCHEDULED state to be deployed. Found state " + previous);
   }

   if (this != slot.getPayload()) {
      throw new IllegalStateException(
         String.format("The execution %s has not been assigned to the assigned slot.", this));
   }

   try {

      // race double check, did we fail/cancel and do we need to release the slot?
      if (this.state != DEPLOYING) {
         slot.releaseSlot(new FlinkException("Actual state of execution " + this + " (" + state + ") does not match expected state DEPLOYING."));
         return;
      }

      if (LOG.isInfoEnabled()) {
         LOG.info(String.format("Deploying %s (attempt #%d) to %s", vertex.getTaskNameWithSubtaskIndex(),
               attemptNumber, getAssignedResourceLocation()));
      }

      final TaskDeploymentDescriptor deployment = vertex.createDeploymentDescriptor(
         attemptId,
         slot,
         taskRestore,
         attemptNumber);

      // null taskRestore to let it be GC'ed
      taskRestore = null;

      final TaskManagerGateway taskManagerGateway = slot.getTaskManagerGateway();

      final CompletableFuture<Acknowledge> submitResultFuture = taskManagerGateway.submitTask(deployment, rpcTimeout);

      submitResultFuture.whenCompleteAsync(
         (ack, failure) -> {
            // only respond to the failure case
            if (failure != null) {
               if (failure instanceof TimeoutException) {
                  String taskname = vertex.getTaskNameWithSubtaskIndex() + " (" + attemptId + ')';

                  markFailed(new Exception(
                     "Cannot deploy task " + taskname + " - TaskManager (" + getAssignedResourceLocation()
                        + ") not responding after a rpcTimeout of " + rpcTimeout, failure));
               } else {
                  markFailed(failure);
               }
            }
         },
         executor);
   }
   catch (Throwable t) {
      markFailed(t);
      ExceptionUtils.rethrow(t);
   }
}

The TaskManagerGateway interface defines the methods for communicating with the TaskManager, and has two implementations: one based on the actor model and one on RPC. The RPC-based implementation delegates the actual work of submitting a task to TaskExecutor, an implementation of TaskExecutorGateway.

@Override
public CompletableFuture<Acknowledge> submitTask(
      TaskDeploymentDescriptor tdd,
      JobMasterId jobMasterId,
      Time timeout) {

   try {
      final JobID jobId = tdd.getJobId();
      final JobManagerConnection jobManagerConnection = jobManagerTable.get(jobId);

      if (jobManagerConnection == null) {
         final String message = "Could not submit task because there is no JobManager " +
            "associated for the job " + jobId + '.';

         log.debug(message);
         throw new TaskSubmissionException(message);
      }

      if (!Objects.equals(jobManagerConnection.getJobMasterId(), jobMasterId)) {
         final String message = "Rejecting the task submission because the job manager leader id " +
            jobMasterId + " does not match the expected job manager leader id " +
            jobManagerConnection.getJobMasterId() + '.';

         log.debug(message);
         throw new TaskSubmissionException(message);
      }

      if (!taskSlotTable.tryMarkSlotActive(jobId, tdd.getAllocationId())) {
         final String message = "No task slot allocated for job ID " + jobId +
            " and allocation ID " + tdd.getAllocationId() + '.';
         log.debug(message);
         throw new TaskSubmissionException(message);
      }

      // re-integrate offloaded data:
      try {
         // read the data this task needs from the blob file service
         tdd.loadBigData(blobCacheService.getPermanentBlobService());
      } catch (IOException | ClassNotFoundException e) {
         throw new TaskSubmissionException("Could not re-integrate offloaded TaskDeploymentDescriptor data.", e);
      }

      // deserialize the pre-serialized information
      // deserialize the data to initialize the task
      final JobInformation jobInformation;
      final TaskInformation taskInformation;
      try {
         jobInformation = tdd.getSerializedJobInformation().deserializeValue(getClass().getClassLoader());
         taskInformation = tdd.getSerializedTaskInformation().deserializeValue(getClass().getClassLoader());
      } catch (IOException | ClassNotFoundException e) {
         throw new TaskSubmissionException("Could not deserialize the job or task information.", e);
      }

      if (!jobId.equals(jobInformation.getJobId())) {
         throw new TaskSubmissionException(
            "Inconsistent job ID information inside TaskDeploymentDescriptor (" +
               tdd.getJobId() + " vs. " + jobInformation.getJobId() + ")");
      }

      TaskMetricGroup taskMetricGroup = taskManagerMetricGroup.addTaskForJob(
         jobInformation.getJobId(),
         jobInformation.getJobName(),
         taskInformation.getJobVertexId(),
         tdd.getExecutionAttemptId(),
         taskInformation.getTaskName(),
         tdd.getSubtaskIndex(),
         tdd.getAttemptNumber());

      InputSplitProvider inputSplitProvider = new RpcInputSplitProvider(
         jobManagerConnection.getJobManagerGateway(),
         taskInformation.getJobVertexId(),
         tdd.getExecutionAttemptId(),
         taskManagerConfiguration.getTimeout());

      TaskManagerActions taskManagerActions = jobManagerConnection.getTaskManagerActions();
      CheckpointResponder checkpointResponder = jobManagerConnection.getCheckpointResponder();

      LibraryCacheManager libraryCache = jobManagerConnection.getLibraryCacheManager();
      ResultPartitionConsumableNotifier resultPartitionConsumableNotifier = jobManagerConnection.getResultPartitionConsumableNotifier();
      PartitionProducerStateChecker partitionStateChecker = jobManagerConnection.getPartitionStateChecker();

      final TaskLocalStateStore localStateStore = localStateStoresManager.localStateStoreForSubtask(
         jobId,
         tdd.getAllocationId(),
         taskInformation.getJobVertexId(),
         tdd.getSubtaskIndex());

      final JobManagerTaskRestore taskRestore = tdd.getTaskRestore();

      final TaskStateManager taskStateManager = new TaskStateManagerImpl(
         jobId,
         tdd.getExecutionAttemptId(),
         localStateStore,
         taskRestore,
         checkpointResponder);

      Task task = new Task(
         jobInformation,
         taskInformation,
         tdd.getExecutionAttemptId(),
         tdd.getAllocationId(),
         tdd.getSubtaskIndex(),
         tdd.getAttemptNumber(),
         tdd.getProducedPartitions(),
         tdd.getInputGates(),
         tdd.getTargetSlotNumber(),
         taskExecutorServices.getMemoryManager(),
         taskExecutorServices.getIOManager(),
         taskExecutorServices.getNetworkEnvironment(),
         taskExecutorServices.getBroadcastVariableManager(),
         taskStateManager,
         taskManagerActions,
         inputSplitProvider,
         checkpointResponder,
         blobCacheService,
         libraryCache,
         fileCache,
         taskManagerConfiguration,
         taskMetricGroup,
         resultPartitionConsumableNotifier,
         partitionStateChecker,
         getRpcService().getExecutor());

      log.info("Received task {}.", task.getTaskInfo().getTaskNameWithSubtasks());

      boolean taskAdded;

      try {
         taskAdded = taskSlotTable.addTask(task);
      } catch (SlotNotFoundException | SlotNotActiveException e) {
         throw new TaskSubmissionException("Could not submit task.", e);
      }

      if (taskAdded) {
         task.startTaskThread();

         return CompletableFuture.completedFuture(Acknowledge.get());
      } else {
         final String message = "TaskManager already contains a task for id " +
            task.getExecutionId() + '.';

         log.debug(message);
         throw new TaskSubmissionException(message);
      }
   } catch (TaskSubmissionException e) {
      return FutureUtils.completedExceptionally(e);
   }
}

At this point, startTaskThread actually starts the thread that runs the task.

Looking back, we can see that a job's DAG passes through roughly three stages:

  1. StreamGraph
    The topology closest to the logic the code expresses: StreamTransformations are added to the StreamExecutionEnvironment in the order the user code runs, forming the streaming graph.
  2. JobGraph
    Generated from the StreamGraph: chainable nodes are merged, edges between nodes are set up, slot sharing groups and the placement of co-located nodes are arranged, the files the job needs are uploaded, checkpointing is configured, and so on. Effectively a partially initialized and optimized job graph.
  3. ExecutionGraph
    Converted from the JobGraph; it contains everything the actual execution needs and is the graph closest to the underlying runtime.