Data Types in FlinkX

Original

2020-05-24 21:14

In the previous chapter we saw:

DataStream<Row> dataStream = dataReader.readData();

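From this one line of code we can conclude:

Each row of data is converted into a Row object
The data as a whole is converted into a data stream

Next, let's look at how Row manages to accommodate every data type.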

Row in FlinkX

The Row here is org.apache.flink.types.Row

A Row can have arbitrary number of fields and contain a set of fields, which may all be different types. The fields in Row can be null. Due to Row is not strongly typed, Flink’s type extraction mechanism can’t extract correct field types. So that users should manually tell Flink the type information via creating a RowTypeInfo.
The fields in the Row can be accessed by position (zero-based) getField(int). And can set fields by setField(int, Object).
Row is in principle serializable. However, it may contain non-serializable fields, in which case serialization will fail.
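To make the positional, weakly typed access described above concrete, here is a minimal sketch. It uses a hypothetical stand-in class, SimpleRow, which only mimics the getArity/getField/setField surface quoted in the Javadoc so that the example runs with the JDK alone. It is not Flink's implementation, and the sample values are made up.

```java
import java.util.Arrays;

// Hypothetical stand-in for org.apache.flink.types.Row (assumption: only the
// positional API described in the Javadoc above is mimicked here).
class SimpleRow {
    private final Object[] fields;

    SimpleRow(int arity) {
        this.fields = new Object[arity]; // every field starts out as null
    }

    int getArity() {
        return fields.length;
    }

    Object getField(int pos) {
        return fields[pos]; // zero-based position
    }

    void setField(int pos, Object value) {
        fields[pos] = value; // any type is accepted, including null
    }

    @Override
    public String toString() {
        return Arrays.toString(fields);
    }
}

class RowSketch {
    // Builds a row whose fields have different types.
    static SimpleRow sample() {
        SimpleRow row = new SimpleRow(3);
        row.setField(0, 1L);
        row.setField(1, "bond");
        // field 2 is never set, so it stays null
        return row;
    }

    public static void main(String[] args) {
        SimpleRow row = sample();
        // The caller has to know the type at each position and cast accordingly.
        long id = (Long) row.getField(0);
        String name = (String) row.getField(1);
        System.out.println(id + " " + name + " " + row.getField(2));
    }
}
```

Because every field is just an Object, nothing checks the types at compile time, which is why the Javadoc asks users to describe the field types through a RowTypeInfo when Flink needs them.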

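Introducing Row

First, let's see where Row sits within Flink as a whole.

Flink builds its own type system internally. As the figure shows, the types Flink currently supports fall into basic types (Basic), arrays (Arrays), composite types (Composite), auxiliary types (Auxiliary), and generic and other types (Generic). Flink supports arbitrary Java or Scala types. Unlike Hadoop, there is no need to implement a specific interface (org.apache.hadoop.io.Writable); Flink can identify data types automatically.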

[Figure: Flink type classification]

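So Row is not a FlinkX concept but a Flink concept: an abstraction of one row of data. The counterpart in DataX is Record, in HBase it is also called Row, and the same idea covers a row in Hive, a row in a relational database, and so on.

Reading a Row from MySQL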

public Row nextRecordInternal(Row row) throws IOException {
    if (!hasNext) {
        return null;
    }
    row = new Row(columnCount); // one field per result-set column

    try {
        for (int pos = 0; pos < row.getArity(); pos++) {
            Object obj = resultSet.getObject(pos + 1); // JDBC columns are 1-based
            if (obj != null) {
                if (CollectionUtils.isNotEmpty(descColumnTypeList)) {
                    String columnType = descColumnTypeList.get(pos);
                    if ("year".equalsIgnoreCase(columnType)) {
                        java.util.Date date = (java.util.Date) obj;
                        obj = DateUtil.dateToYearString(date); // year -> string
                    } else if ("tinyint".equalsIgnoreCase(columnType)
                            || "bit".equalsIgnoreCase(columnType)) {
                        if (obj instanceof Boolean) {
                            obj = ((Boolean) obj ? 1 : 0); // boolean -> 1 / 0
                        }
                    }
                }
                obj = clobToString(obj); // Clob -> String, other values unchanged
            }

            row.setField(pos, obj);
        }
        return super.nextRecordInternal(row);
    } catch (Exception e) {
        throw new IOException("Couldn't read data - " + e.getMessage(), e);
    }
}

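In the code above, line 5 first initializes the row,

then line 9 fetches the corresponding field value from the ResultSet,

applying some processing and conversion along the way (year type, bit => 0/1, clob => string),

and finally row.setField assigns the value to the row.

That is the implementation for MySQL, a relational database.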
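The per-value conversion in the middle of that loop can be isolated into a small function. The sketch below follows the same branch structure but is not FlinkX code. It assumes the year is rendered as a four-digit string (the article does not show what DateUtil.dateToYearString returns), and it adds an instanceof check before the Date cast. The class and method names are hypothetical.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

class MysqlValueSketch {
    // Mirrors the branch structure of the reader above for a single value.
    // Assumption: a year column is rendered as a four-digit string.
    static Object normalize(String columnType, Object obj) {
        if (obj == null) {
            return null; // null values are passed through untouched
        }
        if ("year".equalsIgnoreCase(columnType) && obj instanceof Date) {
            return new SimpleDateFormat("yyyy").format((Date) obj);
        }
        if (("tinyint".equalsIgnoreCase(columnType) || "bit".equalsIgnoreCase(columnType))
                && obj instanceof Boolean) {
            return ((Boolean) obj) ? 1 : 0; // boolean -> 1 / 0
        }
        return obj; // everything else is stored as the driver returned it
    }

    public static void main(String[] args) {
        Date date = new GregorianCalendar(2020, Calendar.MAY, 24).getTime();
        System.out.println(normalize("year", date));
        System.out.println(normalize("tinyint", Boolean.TRUE));
        System.out.println(normalize("varchar", "abc"));
    }
}
```

Whatever comes out of this step goes into the Row unchanged, so the Row ends up holding a mix of Integer, String, and driver-specific objects depending on the column.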

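What about non-relational data sources?

What is the advantage of using Row? How are nested types handled?
Let's test with MongoDB.

Reading and Writing Row with MongoDB

Building the client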

// Obtain a client and resolve the configured database and collection
client = MongodbClientUtil.getClient(mongodbConfig);
MongoDatabase db = client.getDatabase(mongodbConfig.getDatabase());
MongoCollection<Document> collection = db.getCollection(mongodbConfig.getCollectionName());

// Apply the filter only when one is configured
if (filter == null) {
    findIterable = collection.find();
} else {
    findIterable = collection.find(filter);
}

// Restrict this reader to its own split and set the fetch batch size
findIterable = findIterable.skip(split.getSkip())
        .limit(split.getLimit())
        .batchSize(mongodbConfig.getFetchSize());
cursor = findIterable.iterator();

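Obtain a MongoDB client. The reader does not manage a connection pool of its own here; the Java driver's MongoClient maintains one internally.
Use the basic find method of MongoCollection, restrict the result to the split's skip and limit, and use batchSize to pull data in batches.
Iterating the result returns a MongoCursor. Calling find does not query the database immediately; the query is only sent once data is actually requested (similar to IQueryable in LINQ), and the cursor lets you control the final result.
Each document is then processed as follows.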

Document doc = cursor.next();
if (metaColumns.size() == 1 && ConstantValue.STAR_SYMBOL.equals(metaColumns.get(0).getName())) {
    // "*" column: copy every top-level key of the document, in key order
    row = new Row(doc.size());
    String[] names = doc.keySet().toArray(new String[0]);
    for (int i = 0; i < names.length; i++) {
        row.setField(i, doc.get(names[i]));
    }
} else {
    // Explicit columns: one Row field per configured column
    row = new Row(metaColumns.size());
    for (int i = 0; i < metaColumns.size(); i++) {
        MetaColumn metaColumn = metaColumns.get(i);

        Object value = null;
        if (metaColumn.getName() != null) {
            value = doc.get(metaColumn.getName());
            // Fall back to the configured constant when the key is missing or null
            if (value == null && metaColumn.getValue() != null) {
                value = metaColumn.getValue();
            }
        } else if (metaColumn.getValue() != null) {
            value = metaColumn.getValue();
        }

        // Strings are converted to the configured column type
        if (value instanceof String) {
            value = StringUtil.string2col(String.valueOf(value), metaColumn.getType(), metaColumn.getTimeFormat());
        }

        row.setField(i, value);
    }
}

return row;

DEBUG org.mongodb.driver.protocol.command - Sending command '{"find": "bond_info", "limit": 37, "batchSize": 100}' with request id 32 to database data on connection [connectionId{localValue:7, serverValue:30}] to server 192.168.1.101:27017
23:31:01.902 [Legacy Source Thread - Source: mongodbreader -> Sink: mysqlwriter (1/3)] DEBUG org.mongodb.driver.protocol.command - Sending command '{"find": "bond_info", "skip": 37, "limit": 37, "batchSize": 100}' with request id 31 to database data on connection [connectionId{localValue:6, serverValue:31}] to server 192.168.1.101:27017
23:31:01.902 [Legacy Source Thread - Source: mongodbreader -> Sink: mysqlwriter (2/3)] DEBUG org.mongodb.driver.protocol.command - Sending command '{"find": "bond_info", "skip": 74, "limit": 37, "batchSize": 100}' with request id 30 to database data on connection [connectionId{localValue:8, serverValue:32}] to server 192.168.1.101:27017
23:31:01.905 [Legacy Source Thread - Source: mongodbreader -> Sink: mysqlwriter (3/3

The log shows that with a parallelism of 3, the read is divided into three ranges, each executed as its own paged query.
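The skip and limit values in the log can be reproduced with a small planning function. This is only a sketch under stated assumptions: the ranges are equal-sized with any remainder going to the last reader, and the total of 111 documents is a hypothetical count chosen to match the log. The article does not show how FlinkX actually computes its splits, and the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
    // Returns {skip, limit} pairs, one per parallel reader.
    // Assumption: equal-sized ranges, remainder assigned to the last range.
    static List<long[]> plan(long totalDocs, int parallelism) {
        List<long[]> splits = new ArrayList<>();
        long size = totalDocs / parallelism;
        for (int i = 0; i < parallelism; i++) {
            long skip = i * size;
            long limit = (i == parallelism - 1) ? totalDocs - skip : size;
            splits.add(new long[] {skip, limit});
        }
        return splits;
    }

    public static void main(String[] args) {
        // 111 is a hypothetical document count chosen to match the log above.
        for (long[] s : plan(111, 3)) {
            System.out.println("skip=" + s[0] + ", limit=" + s[1]);
        }
    }
}
```

With 111 documents and a parallelism of 3 this yields (0, 37), (37, 37), and (74, 37), the same skip and limit values that appear in the three find commands.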

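We can also see that the reader above offers no simple mapping for nested data: each column is looked up as a top-level key of the document.

Aside: clobToString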
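A small sketch shows what that means for a nested document. A LinkedHashMap stands in for org.bson.Document here, on the assumption that only the Map-style get(key) lookup used by the reader matters. The field names code, issuer, and issuer.name are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class NestedFieldSketch {
    // A LinkedHashMap stands in for org.bson.Document (assumption: only the
    // Map-style get(key) lookup used by the reader above is relevant).
    static Map<String, Object> sampleDoc() {
        Map<String, Object> issuer = new LinkedHashMap<>();
        issuer.put("name", "ACME");
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("code", "B001");
        doc.put("issuer", issuer);
        return doc;
    }

    // Same lookup as the reader: one top-level key per configured column.
    static Object[] toFields(Map<String, Object> doc, String... columns) {
        Object[] fields = new Object[columns.length];
        for (int i = 0; i < columns.length; i++) {
            fields[i] = doc.get(columns[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        Object[] fields = toFields(sampleDoc(), "code", "issuer", "issuer.name");
        for (Object f : fields) {
            System.out.println(f);
        }
    }
}
```

The issuer column receives the whole nested map as a single field value, while the dotted name issuer.name finds nothing and stays null, so a nested value has to be unpacked somewhere else before it can be written to a flat target such as a MySQL column.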

public static Object clobToString(Object obj) throws Exception {
    String dataStr;
    if (obj instanceof Clob) {
        Clob clob = (Clob) obj;
        // A Clob exposes its content as a character stream
        BufferedReader bf = new BufferedReader(clob.getCharacterStream());
        StringBuilder stringBuilder = new StringBuilder();
        String line;
        // readLine() strips line terminators, so lines are joined without separators
        while ((line = bf.readLine()) != null) {
            stringBuilder.append(line);
        }
        dataStr = stringBuilder.toString();
    } else {
        // Anything that is not a Clob is returned unchanged
        return obj;
    }

    return dataStr;
}

This shows that a CLOB is exposed as a character stream.

Summary
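One detail of the readLine loop in clobToString is that line terminators are dropped, so a multi-line CLOB comes back as a single line. If that matters, the stream can be drained in chunks instead. The sketch below is an alternative under stated assumptions, not FlinkX code: a StringReader stands in for clob.getCharacterStream(), and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReaderToStringSketch {
    // Drains a character stream in chunks, so line terminators are preserved.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        try (Reader in = reader) { // closes the stream when done
            int n;
            while ((n = in.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // StringReader stands in for clob.getCharacterStream() (assumption).
        System.out.println(readAll(new StringReader("line1\nline2")));
    }
}
```

The try-with-resources block also closes the reader, which the original loop leaves open.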

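This article walked through Row, the class FlinkX uses to represent each row of data, and showed that it is a native Flink type. That is why FlinkX can carry values of any type: the support comes from Flink's own capabilities.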
