FileSystem/JDBC/Kafka - Implementation Principles and Examples of Flink's Three Major Connectors

This article walks through the source-code implementation and example code of Flink's three major connectors: the FileSystem connector, the JDBC connector, and the Kafka connector.

FileSystem Connector

Sink

The factory builds a FileSystemTableSink object and passes in the relevant properties:

public TableSink<RowData> createTableSink(TableSinkFactory.Context context) {
    Configuration conf = new Configuration();
    context.getTable().getOptions().forEach(conf::setString);

    return new FileSystemTableSink(
            context.getObjectIdentifier(),          // connector identifier
            context.isBounded(),                    // whether the stream is bounded
            context.getTable().getSchema(),         // table schema
            getPath(conf),                          // file path
            context.getTable().getPartitionKeys(),  // partition keys
            conf.get(PARTITION_DEFAULT_NAME),       // default partition name
            context.getTable().getOptions());       // options
}

FileSystemTableSink then builds a DataStreamSink from the DataStream. consumeDataStream mainly does the following:

  1. Build a RowDataPartitionComputer, which separates the indexes and types of partition fields from those of non-partition fields.

  2. Create an EmptyMetaStoreFactory, an empty metastore implementation.

  3. Generate the file prefix with a UUID.

  4. Build a FileSystemFactory implementation.

  5. Branch the processing depending on whether the stream is bounded.

public final DataStreamSink<RowData> consumeDataStream(DataStream<RowData> dataStream) {
    RowDataPartitionComputer computer = new RowDataPartitionComputer(
            defaultPartName,
            schema.getFieldNames(),
            schema.getFieldDataTypes(),
            partitionKeys.toArray(new String[0]));

    EmptyMetaStoreFactory metaStoreFactory = new EmptyMetaStoreFactory(path);
    OutputFileConfig outputFileConfig = OutputFileConfig.builder()
            .withPartPrefix("part-" + UUID.randomUUID().toString())
            .build();
    FileSystemFactory fsFactory = FileSystem::get;

    if (isBounded) {
        FileSystemOutputFormat.Builder<RowData> builder = new FileSystemOutputFormat.Builder<>();
        builder.setPartitionComputer(computer);
        builder.setDynamicGrouped(dynamicGrouping);
        builder.setPartitionColumns(partitionKeys.toArray(new String[0]));
        builder.setFormatFactory(createOutputFormatFactory());
        builder.setMetaStoreFactory(metaStoreFactory);
        builder.setFileSystemFactory(fsFactory);
        builder.setOverwrite(overwrite);
        builder.setStaticPartitions(staticPartitions);
        builder.setTempPath(toStagingPath());
        builder.setOutputFileConfig(outputFileConfig);
        return dataStream.writeUsingOutputFormat(builder.build())
                .setParallelism(dataStream.getParallelism());
    } else {
        Configuration conf = new Configuration();
        properties.forEach(conf::setString);
        Object writer = createWriter();
        TableBucketAssigner assigner = new TableBucketAssigner(computer);
        TableRollingPolicy rollingPolicy = new TableRollingPolicy(
                !(writer instanceof Encoder),
                conf.get(SINK_ROLLING_POLICY_FILE_SIZE).getBytes(),
                conf.get(SINK_ROLLING_POLICY_ROLLOVER_INTERVAL).toMillis());

        BucketsBuilder<RowData, String, ? extends BucketsBuilder<RowData, ?, ?>> bucketsBuilder;
        if (writer instanceof Encoder) {
            //noinspection unchecked
            bucketsBuilder = StreamingFileSink.forRowFormat(
                    path, new ProjectionEncoder((Encoder<RowData>) writer, computer))
                    .withBucketAssigner(assigner)
                    .withOutputFileConfig(outputFileConfig)
                    .withRollingPolicy(rollingPolicy);
        } else {
            //noinspection unchecked
            bucketsBuilder = StreamingFileSink.forBulkFormat(
                    path, new ProjectionBulkFactory((BulkWriter.Factory<RowData>) writer, computer))
                    .withBucketAssigner(assigner)
                    .withOutputFileConfig(outputFileConfig)
                    .withRollingPolicy(rollingPolicy);
        }
        return createStreamingSink(
                conf,
                path,
                partitionKeys,
                tableIdentifier,
                overwrite,
                dataStream,
                bucketsBuilder,
                metaStoreFactory,
                fsFactory,
                conf.get(SINK_ROLLING_POLICY_CHECK_INTERVAL).toMillis());
    }
}

Streaming jobs are generally unbounded, so the else branch is taken:

  1. Create a writer object according to the format; for parquet, for example, it is created from a BulkWriter.

  2. Wrap the RowDataPartitionComputer in a TableBucketAssigner.

  3. Build a TableRollingPolicy, which controls when files are rolled; with a BulkWriter, files are rolled when checkpoints execute.

  4. Build the BucketsBuilder object.

createStreamingSink

  1. The BucketsBuilder is wrapped into a StreamingFileWriter, an operator that extends AbstractStreamOperator.

  2. This operator is appended after the input stream; the main processing logic lives inside it.

  3. If sink.partition-commit.policy.kind is configured, commit handling is added (for example, adding partitions to the metastore or writing a _SUCCESS file), again by appending another operator.

  4. Finally, a DiscardingSink function discards the records, since they have already been handled by the operators above.

public static DataStreamSink<RowData> createStreamingSink(
        Configuration conf,
        Path path,
        List<String> partitionKeys,
        ObjectIdentifier tableIdentifier,
        boolean overwrite,
        DataStream<RowData> inputStream,
        BucketsBuilder<RowData, String, ? extends BucketsBuilder<RowData, ?, ?>> bucketsBuilder,
        TableMetaStoreFactory msFactory,
        FileSystemFactory fsFactory,
        long rollingCheckInterval) {
    if (overwrite) {
        throw new IllegalStateException("Streaming mode not support overwrite.");
    }

    StreamingFileWriter fileWriter = new StreamingFileWriter(
            rollingCheckInterval,
            bucketsBuilder);
    DataStream<CommitMessage> writerStream = inputStream.transform(
            StreamingFileWriter.class.getSimpleName(),
            TypeExtractor.createTypeInfo(CommitMessage.class),
            fileWriter).setParallelism(inputStream.getParallelism());

    DataStream<?> returnStream = writerStream;

    // save committer when we don't need it.
    if (partitionKeys.size() > 0 && conf.contains(SINK_PARTITION_COMMIT_POLICY_KIND)) {
        StreamingFileCommitter committer = new StreamingFileCommitter(
                path, tableIdentifier, partitionKeys, msFactory, fsFactory, conf);
        returnStream = writerStream
                .transform(StreamingFileCommitter.class.getSimpleName(), Types.VOID, committer)
                .setParallelism(1)
                .setMaxParallelism(1);
    }
    //noinspection unchecked
    return returnStream.addSink(new DiscardingSink()).setParallelism(1);
}

PS: Note the Java 8 functional-interface usage here, which can be puzzling the first time you see it. An interface with a single abstract method is a functional interface, and it can be implemented in several ways: most commonly with an anonymous inner class, or with a lambda or a method/constructor reference. For example:

FileSystemFactory fsFactory = FileSystem::get;

// equivalent anonymous class
FileSystemFactory fileSystemFactory = new FileSystemFactory() {
    public FileSystem create(URI fsUri) throws IOException {
        return FileSystem.get(fsUri);
    }
};

// equivalent lambda
FileSystemFactory fileSystemFactory = uri -> FileSystem.get(uri);

Writing data to the file system

Record processing happens in StreamingFileWriter#processElement:

public void processElement(StreamRecord<RowData> element) throws Exception {
    helper.onElement(
            element.getValue(),
            getProcessingTimeService().getCurrentProcessingTime(),
            element.hasTimestamp() ? element.getTimestamp() : null,
            currentWatermark);
}

Before that, initializeState creates the Buckets through the BucketsBuilder and wraps them into a StreamingFileSinkHelper:

@Override
public void initializeState(StateInitializationContext context) throws Exception {
    super.initializeState(context);
    buckets = bucketsBuilder.createBuckets(getRuntimeContext().getIndexOfThisSubtask());

    // Set listener before the initialization of Buckets.
    inactivePartitions = new HashSet<>();
    buckets.setBucketLifeCycleListener(new BucketLifeCycleListener<RowData, String>() {
        @Override
        public void bucketCreated(Bucket<RowData, String> bucket) {
        }

        @Override
        public void bucketInactive(Bucket<RowData, String> bucket) {
            inactivePartitions.add(bucket.getBucketId());
        }
    });

    helper = new StreamingFileSinkHelper<>(
            buckets,
            context.isRestored(),
            context.getOperatorStateStore(),
            getRuntimeContext().getProcessingTimeService(),
            bucketCheckInterval);
    currentWatermark = Long.MIN_VALUE;
}

Back in processElement: following the code, you will find that records are eventually written to files by Bucket#write:

void write(IN element, long currentTime) throws IOException {
    // if there is no in-progress part file, or the rolling policy says to roll, start a new one
    if (inProgressPart == null || rollingPolicy.shouldRollOnEvent(inProgressPart, element)) {

        if (LOG.isDebugEnabled()) {
            LOG.debug("Subtask {} closing in-progress part file for bucket id={} due to element {}.",
                    subtaskIndex, bucketId, element);
        }

        inProgressPart = rollPartFile(currentTime);
    }
    inProgressPart.write(element, currentTime);
}

Ultimately the data is written to the file system by calling the write methods of third-party libraries such as Hadoop, Hive, Parquet, and ORC.
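
To make the row-format path concrete, here is a toy Encoder<RowData> (a sketch, not one of the built-in formats, and it assumes every field is STRING and non-null): for row formats, this is the kind of object that inProgressPart.write ultimately delegates to, while bulk formats such as parquet/orc go through a BulkWriter.Factory instead.

import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.serialization.Encoder;
import org.apache.flink.table.data.RowData;

// Toy row-format encoder: one comma-separated line per RowData.
// Assumes every field is a non-null STRING, purely for illustration.
public class ToyRowDataEncoder implements Encoder<RowData> {

    @Override
    public void encode(RowData element, OutputStream stream) throws IOException {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < element.getArity(); i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(element.getString(i)); // StringData is appended via toString()
        }
        line.append('\n');
        stream.write(line.toString().getBytes(StandardCharsets.UTF_8));
    }
}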

checkpoint

Checkpointing happens in the snapshotState method; the main logic is in the Buckets class:

public void snapshotState(
        final long checkpointId,
        final ListState<byte[]> bucketStatesContainer,
        final ListState<Long> partCounterStateContainer) throws Exception {

    Preconditions.checkState(
            bucketWriter != null && bucketStateSerializer != null,
            "sink has not been initialized");

    LOG.info("Subtask {} checkpointing for checkpoint with id={} (max part counter={}).",
            subtaskIndex, checkpointId, maxPartCounter);

    bucketStatesContainer.clear();
    partCounterStateContainer.clear();

    snapshotActiveBuckets(checkpointId, bucketStatesContainer);
    partCounterStateContainer.add(maxPartCounter);
}

private void snapshotActiveBuckets(
        final long checkpointId,
        final ListState<byte[]> bucketStatesContainer) throws Exception {

    for (Bucket<IN, BucketID> bucket : activeBuckets.values()) {
        final BucketState<BucketID> bucketState = bucket.onReceptionOfCheckpoint(checkpointId);

        final byte[] serializedBucketState = SimpleVersionedSerialization
                .writeVersionAndSerialize(bucketStateSerializer, bucketState);

        bucketStatesContainer.add(serializedBucketState);

        if (LOG.isDebugEnabled()) {
            LOG.debug("Subtask {} checkpointing: {}", subtaskIndex, bucketState);
        }
    }
}

Here a snapshot is taken of every Bucket in the active state:

BucketState<BucketID> onReceptionOfCheckpoint(long checkpointId) throws IOException {
    prepareBucketForCheckpointing(checkpointId);

    InProgressFileWriter.InProgressFileRecoverable inProgressFileRecoverable = null;
    long inProgressFileCreationTime = Long.MAX_VALUE;

    if (inProgressPart != null) {
        inProgressFileRecoverable = inProgressPart.persist();
        inProgressFileCreationTime = inProgressPart.getCreationTime();
        this.inProgressFileRecoverablesPerCheckpoint.put(checkpointId, inProgressFileRecoverable);
    }

    // return the BucketState, which will be serialized into state
    return new BucketState<>(bucketId, bucketPath, inProgressFileCreationTime, inProgressFileRecoverable, pendingFileRecoverablesPerCheckpoint);
}

private void prepareBucketForCheckpointing(long checkpointId) throws IOException {
    if (inProgressPart != null && rollingPolicy.shouldRollOnCheckpoint(inProgressPart)) {
        if (LOG.isDebugEnabled()) {
            LOG.debug("Subtask {} closing in-progress part file for bucket id={} on checkpoint.", subtaskIndex, bucketId);
        }
        closePartFile();
    }

    if (!pendingFileRecoverablesForCurrentCheckpoint.isEmpty()) {
        pendingFileRecoverablesPerCheckpoint.put(checkpointId, pendingFileRecoverablesForCurrentCheckpoint);
        pendingFileRecoverablesForCurrentCheckpoint = new ArrayList<>(); // reset
    }
}

The core logic is in closePartFile: it closes the in-progress file and commits it from memory to the file system, producing a PendingFileRecoverable that is stored in the pendingFileRecoverablesForCurrentCheckpoint list in preparation for the snapshot.

private InProgressFileWriter.PendingFileRecoverable closePartFile() throws IOException {
    InProgressFileWriter.PendingFileRecoverable pendingFileRecoverable = null;
    if (inProgressPart != null) {
        pendingFileRecoverable = inProgressPart.closeForCommit();
        pendingFileRecoverablesForCurrentCheckpoint.add(pendingFileRecoverable);
        inProgressPart = null; // set to null
    }
    return pendingFileRecoverable;
}

A file that is still being written is in the in-progress state and cannot be read yet; when it becomes readable downstream depends on when it is committed. The previous step wrote the data into the file, but the file has not been formally committed. Recall the steps of a checkpoint (see the earlier post if you are not familiar with them): in the final step, the CheckpointCoordinator calls notifyCheckpointComplete on each operator.

public void notifyCheckpointComplete(long checkpointId) throws Exception {
    super.notifyCheckpointComplete(checkpointId);
    commitUpToCheckpoint(checkpointId);
}

public void commitUpToCheckpoint(final long checkpointId) throws IOException {
    final Iterator<Map.Entry<BucketID, Bucket<IN, BucketID>>> activeBucketIt =
            activeBuckets.entrySet().iterator();

    LOG.info("Subtask {} received completion notification for checkpoint with id={}.", subtaskIndex, checkpointId);

    while (activeBucketIt.hasNext()) {
        final Bucket<IN, BucketID> bucket = activeBucketIt.next().getValue();
        bucket.onSuccessfulCompletionOfCheckpoint(checkpointId);

        if (!bucket.isActive()) { // after the cleanup above, the bucket will no longer be active
            // We've dealt with all the pending files and the writer for this bucket is not currently open.
            // Therefore this bucket is currently inactive and we can remove it from our state.
            activeBucketIt.remove();
            notifyBucketInactive(bucket);
        }
    }
}

The files are committed in Bucket#onSuccessfulCompletionOfCheckpoint:

void onSuccessfulCompletionOfCheckpoint(long checkpointId) throws IOException {
    checkNotNull(bucketWriter);

    Iterator<Map.Entry<Long, List<InProgressFileWriter.PendingFileRecoverable>>> it =
            pendingFileRecoverablesPerCheckpoint.headMap(checkpointId, true)
                    .entrySet().iterator();

    while (it.hasNext()) {
        Map.Entry<Long, List<InProgressFileWriter.PendingFileRecoverable>> entry = it.next();

        for (InProgressFileWriter.PendingFileRecoverable pendingFileRecoverable : entry.getValue()) {
            bucketWriter.recoverPendingFile(pendingFileRecoverable).commit();
        }
        it.remove();
    }

    cleanupInProgressFileRecoverables(checkpointId);
}

The commit method renames the file so that it can be read downstream; for example, the Hadoop commit implementation:

@Override
public void commit() throws IOException {
    final Path src = recoverable.tempFile();
    final Path dest = recoverable.targetFile();
    final long expectedLength = recoverable.offset();

    final FileStatus srcStatus;
    try {
        srcStatus = fs.getFileStatus(src);
    }
    catch (IOException e) {
        throw new IOException("Cannot clean commit: Staging file does not exist.");
    }

    if (srcStatus.getLen() != expectedLength) {
        // something was done to this file since the committer was created.
        // this is not the "clean" case
        throw new IOException("Cannot clean commit: File has trailing junk data.");
    }

    try {
        fs.rename(src, dest);
    }
    catch (IOException e) {
        throw new IOException("Committing file by rename failed: " + src + " to " + dest, e);
    }
}

Finally, some state of the in-progress files is cleaned up:

private void cleanupInProgressFileRecoverables(long checkpointId) throws IOException {
    Iterator<Map.Entry<Long, InProgressFileWriter.InProgressFileRecoverable>> it =
            inProgressFileRecoverablesPerCheckpoint.headMap(checkpointId, false)
                    .entrySet().iterator();

    while (it.hasNext()) {
        final InProgressFileWriter.InProgressFileRecoverable inProgressFileRecoverable = it.next().getValue();

        // this check is redundant, as we only put entries in the inProgressFileRecoverablesPerCheckpoint map
        // list when the requiresCleanupOfInProgressFileRecoverableState() returns true, but having it makes
        // the code more readable.

        // returns false for every file system except S3
        final boolean successfullyDeleted = bucketWriter.cleanupInProgressFileRecoverable(inProgressFileRecoverable);
        if (LOG.isDebugEnabled() && successfullyDeleted) {
            LOG.debug("Subtask {} successfully deleted incomplete part for bucket id={}.", subtaskIndex, bucketId);
        }
        it.remove(); // clean up
    }
}

partition commit

This section covers when a partition commit is triggered and which commit policy is applied. The trigger condition is either process-time or partition-time. With process-time, each partition the current checkpoint needs to commit is registered in the pendingPartitions map together with the current system time; at commit time the partition is committed if its registration time plus the delay is earlier than the current system time, and with delay = 0 it is committed directly. So with delay = 0 partitions are committed immediately, which may commit a partition too early when data arrives late; with delay equal to the partition duration, the partitions registered by the previous checkpoint are committed after one checkpoint interval plus the delay.

@Override
public void addPartition(String partition) {
    if (!StringUtils.isNullOrWhitespaceOnly(partition)) {
        this.pendingPartitions.putIfAbsent(partition, procTimeService.getCurrentProcessingTime());
    }
}

@Override
public List<String> committablePartitions(long checkpointId) {
    List<String> needCommit = new ArrayList<>();
    long currentProcTime = procTimeService.getCurrentProcessingTime();
    Iterator<Map.Entry<String, Long>> iter = pendingPartitions.entrySet().iterator();
    while (iter.hasNext()) {
        Map.Entry<String, Long> entry = iter.next();
        long creationTime = entry.getValue();
        if (commitDelay == 0 || currentProcTime > creationTime + commitDelay) {
            needCommit.add(entry.getKey());
            iter.remove();
        }
    }
    return needCommit;
}

With partition-time, the decision is based on whether the watermark has reached the partition time plus the delay:

@Override
public void addPartition(String partition) {
    if (!StringUtils.isNullOrWhitespaceOnly(partition)) {
        this.pendingPartitions.add(partition);
    }
}

@Override
public List<String> committablePartitions(long checkpointId) {
    if (!watermarks.containsKey(checkpointId)) {
        throw new IllegalArgumentException(String.format(
                "Checkpoint(%d) has not been snapshot. The watermark information is: %s.",
                checkpointId, watermarks));
    }

    long watermark = watermarks.get(checkpointId);
    watermarks.headMap(checkpointId, true).clear();

    List<String> needCommit = new ArrayList<>();
    Iterator<String> iter = pendingPartitions.iterator();
    while (iter.hasNext()) {
        String partition = iter.next();
        // extract the time from the path, e.g. partition='day=2020-12-01/hour=11/minute=11' becomes 2020-12-01 11:11:00
        LocalDateTime partTime = extractor.extract(
                partitionKeys, extractPartitionValues(new Path(partition)));
        if (watermark > toMills(partTime) + commitDelay) {
            needCommit.add(partition);
            iter.remove();
        }
    }
    return needCommit;
}
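
The watermark-based check above boils down to "commit once watermark > partition time + delay". A minimal standalone sketch of that arithmetic (the dt=.../hour=... partition layout and the use of the system time zone are assumptions for illustration):

import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class PartitionTimeCommitCheck {

    // Commit the partition once the watermark has passed its start time plus the delay.
    static boolean canCommit(String partition, long watermark, long commitDelayMs) {
        String dt = null;
        String hour = null;
        for (String kv : partition.split("/")) {
            String[] parts = kv.split("=", 2);
            if ("dt".equals(parts[0])) {
                dt = parts[1];
            } else if ("hour".equals(parts[0])) {
                hour = parts[1];
            }
        }
        LocalDateTime partTime = LocalDateTime.parse(
                dt + " " + hour + ":00:00",
                DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
        long partMillis = partTime.atZone(ZoneId.systemDefault()).toInstant().toEpochMilli();
        return watermark > partMillis + commitDelayMs;
    }

    public static void main(String[] args) {
        long watermark = LocalDateTime.of(2020, 5, 20, 12, 0, 1)
                .atZone(ZoneId.systemDefault()).toInstant().toEpochMilli();
        // With delay = 1h, the hour=11 partition commits only after the watermark passes 12:00.
        System.out.println(canCommit("dt=2020-05-20/hour=11", watermark, 60 * 60 * 1000L));
    }
}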

Source

Reading data is somewhat simpler than writing it.

Create the FileSystemTableSource object:

public TableSource<RowData> createTableSource(TableSourceFactory.Context context) {
    Configuration conf = new Configuration();
    context.getTable().getOptions().forEach(conf::setString);

    return new FileSystemTableSource(
            context.getTable().getSchema(),
            getPath(conf),
            context.getTable().getPartitionKeys(),
            conf.get(PARTITION_DEFAULT_NAME),
            context.getTable().getProperties());
}

Build the source function and pass in the input format that reads the source data:

public DataStream<RowData> getDataStream(StreamExecutionEnvironment execEnv) {
    @SuppressWarnings("unchecked")
    TypeInformation<RowData> typeInfo =
            (TypeInformation<RowData>) TypeInfoDataTypeConverter.fromDataTypeToTypeInfo(getProducedDataType());
    // Avoid using ContinuousFileMonitoringFunction
    InputFormatSourceFunction<RowData> func = new InputFormatSourceFunction<>(getInputFormat(), typeInfo);
    DataStreamSource<RowData> source = execEnv.addSource(func, explainSource(), typeInfo);
    return source.name(explainSource());
}

The run method reads records in a loop and emits them to the downstream operator:

public void run(SourceContext<OUT> ctx) throws Exception {
    try {

        Counter completedSplitsCounter = getRuntimeContext().getMetricGroup().counter("numSplitsProcessed");
        if (isRunning && format instanceof RichInputFormat) {
            ((RichInputFormat) format).openInputFormat();
        }

        OUT nextElement = serializer.createInstance();
        while (isRunning) {
            format.open(splitIterator.next());

            // for each element we also check if cancel
            // was called by checking the isRunning flag

            while (isRunning && !format.reachedEnd()) {
                nextElement = format.nextRecord(nextElement);
                if (nextElement != null) {
                    ctx.collect(nextElement);
                } else {
                    break;
                }
            }
            format.close();
            completedSplitsCounter.inc();

            if (isRunning) {
                isRunning = splitIterator.hasNext();
            }
        }
    } finally {
        format.close();
        if (format instanceof RichInputFormat) {
            ((RichInputFormat) format).closeInputFormat();
        }
        isRunning = false;
    }
}

A complete example:

Read from Kafka as a stream, write continuously into the file system, and query fs_table:

CREATE TABLE kafka_table (
  user_id STRING,
  order_amount DOUBLE,
  log_ts TIMESTAMP(3),
  WATERMARK FOR log_ts AS log_ts - INTERVAL '5' SECOND
) WITH (...);

CREATE TABLE fs_table (
  user_id STRING,
  order_amount DOUBLE,
  dt STRING,
  `hour` STRING
) PARTITIONED BY (dt, `hour`) WITH (
  'connector'='filesystem',
  'path'='...',
  'format'='parquet',
  'sink.partition-commit.delay'='1 h',
  'sink.partition-commit.policy.kind'='success-file'
);

-- streaming sql, insert into file system table
INSERT INTO fs_table SELECT user_id, order_amount, DATE_FORMAT(log_ts, 'yyyy-MM-dd'), DATE_FORMAT(log_ts, 'HH') FROM kafka_table;

-- batch sql, select with partition pruning
SELECT * FROM fs_table WHERE dt='2020-05-20' and `hour`='12';

JDBC Connector

The entry point of the JDBC connector is JdbcDynamicTableFactory, which provides both source and sink support.

Source

The factory creates a JdbcDynamicTableSource in createDynamicTableSource and passes all required parameters to it. As a source, JDBC serves two purposes: 1. scan, as a regular data source; 2. lookup, for dimension-table joins.

Scan

Data reading is implemented by JdbcRowDataInputFormat, which also supports column pruning and limit pushdown.

Note: the scan source supports batch mode only.

public ScanRuntimeProvider getScanRuntimeProvider(ScanContext runtimeProviderContext) {
    // build the JdbcRowDataInputFormat and pass the basic connection properties
    final JdbcRowDataInputFormat.Builder builder =
            JdbcRowDataInputFormat.builder()
                    .setDrivername(options.getDriverName())
                    .setDBUrl(options.getDbURL())
                    .setUsername(options.getUsername().orElse(null))
                    .setPassword(options.getPassword().orElse(null))
                    .setAutoCommit(readOptions.getAutoCommit());

    if (readOptions.getFetchSize() != 0) {
        builder.setFetchSize(readOptions.getFetchSize());
    }
    // the JDBC dialect (currently MySQL, Postgres and Derby), inferred from the URL
    final JdbcDialect dialect = options.getDialect();
    // build the SELECT statement
    String query =
            dialect.getSelectFromStatement(
                    options.getTableName(), physicalSchema.getFieldNames(), new String[0]);
    if (readOptions.getPartitionColumnName().isPresent()) { // parallel reads to speed up scanning
        long lowerBound = readOptions.getPartitionLowerBound().get();
        long upperBound = readOptions.getPartitionUpperBound().get();
        int numPartitions = readOptions.getNumPartitions().get();
        builder.setParametersProvider(
                new JdbcNumericBetweenParametersProvider(lowerBound, upperBound)
                        .ofBatchNum(numPartitions));
        // append the range predicate to the SQL
        query +=
                " WHERE "
                        + dialect.quoteIdentifier(readOptions.getPartitionColumnName().get())
                        + " BETWEEN ? AND ?";
    }
    if (limit >= 0) { // if a limit was specified
        query = String.format("%s %s", query, dialect.getLimitClause(limit));
    }
    builder.setQuery(query);
    final RowType rowType = (RowType) physicalSchema.toRowDataType().getLogicalType();
    // the converter that turns JDBC values into Flink's internal data types
    builder.setRowConverter(dialect.getRowConverter(rowType));
    builder.setRowDataTypeInfo(
            runtimeProviderContext.createTypeInformation(physicalSchema.toRowDataType()));

    return InputFormatProvider.of(builder.build());
}
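
To see what the parallel-scan options translate into, here is a toy version of how the BETWEEN ranges are derived from scan.partition.lower-bound / upper-bound / num-partitions. It illustrates the same idea as JdbcNumericBetweenParametersProvider, but is not that class's actual code:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BetweenRangesDemo {

    // Each sub-range is bound to the "BETWEEN ? AND ?" placeholders of one input split.
    static List<long[]> ranges(long lowerBound, long upperBound, int numPartitions) {
        long count = upperBound - lowerBound + 1;
        long batchSize = (count + numPartitions - 1) / numPartitions; // ceil
        List<long[]> result = new ArrayList<>();
        for (long start = lowerBound; start <= upperBound; start += batchSize) {
            result.add(new long[] {start, Math.min(start + batchSize - 1, upperBound)});
        }
        return result;
    }

    public static void main(String[] args) {
        // 4 splits over id in [0, 99] -> [0,24] [25,49] [50,74] [75,99]
        ranges(0, 99, 4).forEach(r -> System.out.println(Arrays.toString(r)));
    }
}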

The core methods of JdbcRowDataInputFormat are as follows:

  1. openInputFormat: initialize the JDBC connection and create the PreparedStatement.

  2. open: fetch the data for a split in batches.

  3. reachedEnd: whether the input is exhausted.

  4. nextRecord: return the next record.

Lookup

When used for dimension-table joins, the main implementation is JdbcRowDataLookupFunction. The logic is similar; the differences are that the query is restricted to the given lookup keys, and the query results can be cached to speed things up. The cache is a Guava cache, configurable in both maximum size and expiration time.

public void eval(Object... keys) {
    RowData keyRow = GenericRowData.of(keys);
    if (cache != null) {
        List<RowData> cachedRows = cache.getIfPresent(keyRow);
        if (cachedRows != null) {
            for (RowData cachedRow : cachedRows) {
                collect(cachedRow);
            }
            return;
        }
    }

    for (int retry = 0; retry <= maxRetryTimes; retry++) {
        try {
            statement.clearParameters();
            statement = lookupKeyRowConverter.toExternal(keyRow, statement);
            try (ResultSet resultSet = statement.executeQuery()) {
                if (cache == null) {
                    while (resultSet.next()) {
                        collect(jdbcRowConverter.toInternal(resultSet));
                    }
                } else {
                    ArrayList<RowData> rows = new ArrayList<>();
                    while (resultSet.next()) {
                        RowData row = jdbcRowConverter.toInternal(resultSet);
                        rows.add(row);
                        collect(row);
                    }
                    rows.trimToSize();
                    cache.put(keyRow, rows);
                }
            }
            break;
        } catch (SQLException e) {
            LOG.error(String.format("JDBC executeBatch error, retry times = %d", retry), e);
            if (retry >= maxRetryTimes) {
                throw new RuntimeException("Execution of JDBC statement failed.", e);
            }

            try {
                if (!connectionProvider.isConnectionValid()) {
                    statement.close();
                    connectionProvider.closeConnection();
                    establishConnectionAndStatement();
                }
            } catch (SQLException | ClassNotFoundException excpetion) {
                LOG.error(
                        "JDBC connection is not valid, and reestablish connection failed",
                        excpetion);
                throw new RuntimeException("Reestablish JDBC connection failed", excpetion);
            }

            try {
                Thread.sleep(1000 * retry);
            } catch (InterruptedException e1) {
                throw new RuntimeException(e1);
            }
        }
    }
}
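
The cache used above is created in open(), roughly as in the sketch below, driven by the lookup.cache.max-rows and lookup.cache.ttl options (shown here with plain Guava; the connector itself uses Flink's shaded Guava copy):

import java.util.List;
import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.apache.flink.table.data.RowData;

public class LookupCacheSketch {

    // Build the key -> rows cache; -1 for either option disables caching.
    static Cache<RowData, List<RowData>> buildCache(long cacheMaxSize, long cacheExpireMs) {
        if (cacheMaxSize == -1 || cacheExpireMs == -1) {
            return null;
        }
        return CacheBuilder.newBuilder()
                .expireAfterWrite(cacheExpireMs, TimeUnit.MILLISECONDS) // lookup.cache.ttl
                .maximumSize(cacheMaxSize)                              // lookup.cache.max-rows
                .build();
    }
}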

Sink

As a sink, two modes are supported, append-only and upsert, chosen by whether the DDL defines a primary key. The factory creates a JdbcDynamicTableSink, which uses JdbcDynamicOutputFormatBuilder to create a JdbcBatchingOutputFormat that writes the data. As the class name suggests, writes are batched, and for good reason: Flink is a real-time stream processing engine, and writing to the database for every single record would neither perform well nor be gentle on the database.

The main write logic is in the following methods:

@Override
public final synchronized void writeRecord(In record) throws IOException {
    checkFlushException();

    try {
        addToBatch(record, jdbcRecordExtractor.apply(record));
        batchCount++;
        if (executionOptions.getBatchSize() > 0
                && batchCount >= executionOptions.getBatchSize()) {
            flush();
        }
    } catch (Exception e) {
        throw new IOException("Writing records to JDBC failed.", e);
    }
}

protected void addToBatch(In original, JdbcIn extracted) throws SQLException {
    jdbcStatementExecutor.addToBatch(extracted);
}

What happens next depends on the jdbcStatementExecutor implementation: for a plain insert, the record is simply added to the statement batch,

public void addToBatch(RowData record) throws SQLException {
    converter.toExternal(record, st);
    st.addBatch();
}

while for upsert it first checks whether the row exists: insert if it does not, update if it does.

public void addToBatch(RowData record) throws SQLException {
    processOneRowInBatch(keyExtractor.apply(record), record);
}

private void processOneRowInBatch(RowData pk, RowData row) throws SQLException {
    if (exist(pk)) {
        updateSetter.toExternal(row, updateStatement);
        updateStatement.addBatch();
    } else {
        insertSetter.toExternal(row, insertStatement);
        insertStatement.addBatch();
    }
}

A flush is performed when the batch size is reached or the configured time interval elapses; buffered data is also flushed on checkpoint.
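
The time-based flush is driven by a single-threaded scheduler set up in JdbcBatchingOutputFormat#open. Below is a minimal sketch of that pattern (field and option names here are placeholders, not the connector's actual members):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the periodic flush: flush() runs every sink.buffer-flush.interval
// as long as the writer is not closed; it is the same flush() triggered when
// batchCount reaches sink.buffer-flush.max-rows.
public class BufferedJdbcWriterSketch implements AutoCloseable {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile boolean closed = false;
    private volatile Exception flushException;

    public void open(long flushIntervalMs) {
        scheduler.scheduleWithFixedDelay(() -> {
            synchronized (this) {
                if (!closed) {
                    try {
                        flush();
                    } catch (Exception e) {
                        flushException = e; // surfaced on the next writeRecord()/flush()
                    }
                }
            }
        }, flushIntervalMs, flushIntervalMs, TimeUnit.MILLISECONDS);
    }

    private synchronized void flush() throws Exception {
        // executeBatch() on the buffered statements would happen here
    }

    @Override
    public synchronized void close() {
        closed = true;
        scheduler.shutdown();
    }
}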

PS: You can extend the JDBC dialect to support other JDBC-based databases, such as ClickHouse; a skeleton follows.
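
A rough skeleton of such a dialect, using ClickHouse only as a placeholder. It overrides just the methods that appear in the scan code above (canHandle, quoteIdentifier, getRowConverter, getLimitClause, defaultDriverName); the exact set of methods JdbcDialect requires depends on your Flink version, and in 1.12 the dialect is resolved from a hard-coded list in JdbcDialects, so registering a new one still needs a small change there.

import java.util.Optional;

import org.apache.flink.connector.jdbc.dialect.JdbcDialect;
import org.apache.flink.connector.jdbc.internal.converter.JdbcRowConverter;
import org.apache.flink.table.types.logical.RowType;

// Skeleton only: a real dialect must also provide a working row converter
// (typically by extending AbstractJdbcRowConverter) and any upsert statements it needs.
public class ClickHouseDialect implements JdbcDialect {

    @Override
    public boolean canHandle(String url) {
        return url.startsWith("jdbc:clickhouse:");
    }

    @Override
    public Optional<String> defaultDriverName() {
        return Optional.of("ru.yandex.clickhouse.ClickHouseDriver");
    }

    @Override
    public String quoteIdentifier(String identifier) {
        return "`" + identifier + "`";
    }

    @Override
    public String getLimitClause(long limit) {
        return "LIMIT " + limit;
    }

    @Override
    public JdbcRowConverter getRowConverter(RowType rowType) {
        // Map ClickHouse types to Flink's internal types here; omitted in this skeleton.
        throw new UnsupportedOperationException("TODO: implement row converter");
    }
}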

A complete example:

-- register a MySQL table 'users' in Flink SQL
CREATE TABLE MyUserTable (
  id BIGINT,
  name STRING,
  age INT,
  status BOOLEAN,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://localhost:3306/mydatabase',
  'table-name' = 'users'
);

-- write data into the JDBC table from the other table "T"
INSERT INTO MyUserTable
SELECT id, name, age, status FROM T;

-- scan data from the JDBC table
SELECT id, name, age, status FROM MyUserTable;

-- temporal join the JDBC table as a dimension table
SELECT * FROM myTopic
LEFT JOIN MyUserTable FOR SYSTEM_TIME AS OF myTopic.proctime
ON myTopic.key = MyUserTable.id;

Kafka Connector

This article is based on Flink 1.12, which no longer requires specifying a concrete Kafka version.

From the SQL point of view, let's look at how Flink reads from and writes to Kafka once a Kafka table has been created.

Entry point

As usual, the Kafka factory (KafkaDynamicTableFactory) is discovered through the SPI mechanism. Flink uses SPI extensively; a separate write-up on SPI in Flink may follow. Without further ado, let's get into it.

Source

The Kafka source is created by createDynamicTableSource, which mainly does the following:

  1. Get the table DDL information from the context, such as the schema and the WITH options, and create a TableFactoryHelper helper.

  2. Discover the key/value formats based on the key/value format options in the WITH clause.

  3. Validate the various options.

  4. Build the KafkaDynamicSource object.

In KafkaDynamicSource, the corresponding deserialization schemas are created from the key/value formats, the metadata fields in the schema are separated from the regular fields, and a FlinkKafkaConsumer is created and wrapped in a SourceFunctionProvider:

@Override
public ScanRuntimeProvider getScanRuntimeProvider(ScanContext context) {
    final DeserializationSchema<RowData> keyDeserialization =
            createDeserialization(context, keyDecodingFormat, keyProjection, keyPrefix);

    final DeserializationSchema<RowData> valueDeserialization =
            createDeserialization(context, valueDecodingFormat, valueProjection, null);

    final TypeInformation<RowData> producedTypeInfo =
            context.createTypeInformation(producedDataType);

    final FlinkKafkaConsumer<RowData> kafkaConsumer =
            createKafkaConsumer(keyDeserialization, valueDeserialization, producedTypeInfo);

    return SourceFunctionProvider.of(kafkaConsumer, false);
}

FlinkKafkaConsumer is what actually reads from Kafka; the core logic lives in its parent class, FlinkKafkaConsumerBase. Its core methods are listed below, followed by a small usage sketch:

  1. open: initialize the Kafka consumer objects, including the offset commit mode, dynamic partition discovery, the startup mode, and the deserializer.

  2. run: pull data from Kafka through the KafkaFetcher.

  3. runWithPartitionDiscovery: run dynamic partition discovery in a separate thread.

  4. snapshotState: snapshot partition and offset information on checkpoint, for failover.

  5. initializeState: restore state when recovering from a checkpoint.

  6. notifyCheckpointComplete: commit offsets to Kafka when a checkpoint completes.
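
A DataStream-level sketch of how these pieces get exercised (topic name, group id and discovery interval are made up; the partition-discovery property key is the one defined on FlinkKafkaConsumerBase):

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaSourceDemo {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // offsets are committed back to Kafka on checkpoint completion

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "testGroup");
        // enable periodic partition discovery (milliseconds)
        props.setProperty("flink.partition-discovery.interval-millis", "30000");

        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<>("user_behavior", new SimpleStringSchema(), props);
        consumer.setStartFromEarliest(); // corresponds to scan.startup.mode = 'earliest-offset'

        env.addSource(consumer).print();
        env.execute("kafka-source-demo");
    }
}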

Regarding dynamic partition discovery: open fetches all partitions of the topic once; when discovery runs periodically and new partitions have been added, all partitions are fetched again, the newly added ones are identified by partition id, and the following assignment algorithm decides which subtask subscribes to and consumes each of them.

public static int assign(KafkaTopicPartition partition, int numParallelSubtasks) {
    int startIndex =
            ((partition.getTopic().hashCode() * 31) & 0x7FFFFFFF) % numParallelSubtasks;

    // here, the assumption is that the id of Kafka partitions are always ascending
    // starting from 0, and therefore can be used directly as the offset clockwise from the
    // start index
    return (startIndex + partition.getPartition()) % numParallelSubtasks;
}

KafkaFetcher consumes Kafka data through the consumer thread KafkaConsumerThread, which internally uses Kafka's own KafkaConsumer. The KafkaFetcher repeatedly calls pollNext on the Handover, while KafkaConsumerThread produces the records it consumes into the Handover; the Handover plays the role of the blocking queue in a producer-consumer model.

public void runFetchLoop() throws Exception {
    try {
        // kick off the actual Kafka consumer
        consumerThread.start();

        while (running) {
            // this blocks until we get the next records
            // it automatically re-throws exceptions encountered in the consumer thread
            final ConsumerRecords<byte[], byte[]> records = handover.pollNext();

            // get the records for each topic partition
            for (KafkaTopicPartitionState<T, TopicPartition> partition :
                    subscribedPartitionStates()) {

                List<ConsumerRecord<byte[], byte[]>> partitionRecords =
                        records.records(partition.getKafkaPartitionHandle());

                partitionConsumerRecordsHandler(partitionRecords, partition);
            }
        }
    } finally {
        // this signals the consumer thread that no more work is to be done
        consumerThread.shutdown();
    }

    // on a clean exit, wait for the runner thread
    try {
        consumerThread.join();
    } catch (InterruptedException e) {
        // may be the result of a wake-up interruption after an exception.
        // we ignore this here and only restore the interruption state
        Thread.currentThread().interrupt();
    }
}

Sink

The sink is similar: createDynamicTableSink creates the KafkaDynamicSink, which is mainly responsible for:

  1. The same steps as the source, with one special case: for avro-confluent or debezium-avro-confluent, if schema-registry.subject is not set, it is filled in automatically.

  2. Discover the key/value encoding formats based on the WITH options.

  3. Validate the options.

  4. Build the KafkaDynamicSink object.

getSinkRuntimeProvider then builds the FlinkKafkaProducer and wraps it in a SinkFunctionProvider:

public SinkRuntimeProvider getSinkRuntimeProvider(Context context) {
    final SerializationSchema<RowData> keySerialization =
            createSerialization(context, keyEncodingFormat, keyProjection, keyPrefix);

    final SerializationSchema<RowData> valueSerialization =
            createSerialization(context, valueEncodingFormat, valueProjection, null);

    final FlinkKafkaProducer<RowData> kafkaProducer =
            createKafkaProducer(keySerialization, valueSerialization);

    return SinkFunctionProvider.of(kafkaProducer, parallelism);
}

FlinkKafkaProducer writes the data to Kafka. To guarantee exactly-once semantics it extends TwoPhaseCommitSinkFunction, the two-phase commit sink, and relies on Kafka's transaction mechanism to achieve exactly-once delivery.

The core methods of FlinkKafkaProducer:

  • open: initialize the Kafka-related properties.

  • invoke: the record-processing logic; serialize key and value into a ProducerRecord and send it through Kafka's KafkaProducer API according to the partitioning strategy.

  • beginTransaction: start a transaction.

  • preCommit: pre-commit.

  • commit: the formal commit.

  • snapshotState: snapshot state on checkpoint, mainly transaction-related state.

  • notifyCheckpointComplete: a parent-class method, called back when a checkpoint completes, to commit the transaction.

  • initializeState: state initialization, used to restore state when the job recovers from a checkpoint.

The overall flow of sending data and committing the transaction is:

initializeState (on job start or recovery from a checkpoint, open the first transaction via beginTransaction) → invoke (process records and send them to Kafka) → snapshotState (store the current transaction in state, open the next transaction, and pre-commit via preCommit) → notifyCheckpointComplete (commit the previously pending transactions, the formal commit)

If an error occurs anywhere along the way, close is eventually called to abort the transaction.
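
Putting this together at the DataStream level, here is a minimal exactly-once sketch (topic and properties are made up; it assumes the KafkaSerializationSchema-based constructor and that the producer's transaction.timeout.ms does not exceed the broker's transaction.max.timeout.ms):

import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaExactlyOnceSinkDemo {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // transactions are committed on checkpoint completion

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        // must not exceed the broker's transaction.max.timeout.ms
        props.setProperty("transaction.timeout.ms", "900000");

        KafkaSerializationSchema<String> schema = (element, timestamp) ->
                new ProducerRecord<>("user_behavior", element.getBytes(StandardCharsets.UTF_8));

        FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<>(
                "user_behavior",                           // default topic
                schema,
                props,
                FlinkKafkaProducer.Semantic.EXACTLY_ONCE); // enables the two-phase commit path

        DataStream<String> stream = env.fromElements("a", "b", "c");
        stream.addSink(producer);

        env.execute("kafka-exactly-once-demo");
    }
}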

A complete example:

CREATE TABLE KafkaTable (
  `user_id` BIGINT,
  `item_id` BIGINT,
  `behavior` STRING,
  `ts` TIMESTAMP(3) METADATA FROM 'timestamp'
) WITH (
  'connector' = 'kafka',
  'topic' = 'user_behavior',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'testGroup',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'csv'
)



