Getting Started with Apache Beam: Splitting and Merging PCollection Datasets

Original

DrawnBreak

2020-06-04 09:59

Index: Apache Beam personal experience summary index and getting-started guide (Java)

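Sometimes we need to split a PCollection into several parts, run a different computation on each part, and then merge the results back into a single dataset.

Splitting a dataset

There are two ways to split a dataset. The first is to use a PartitionFn.
Suppose we want to split a dataset of integers into two parts so that odd and even numbers are processed separately. This can be written as follows: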

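    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.Partition;
    import org.apache.beam.sdk.transforms.Partition.PartitionFn;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;

    // `pipeline` is assumed to be a Pipeline created earlier.
    PCollection<Integer> numbers = pipeline.apply(Create.of(1, 2, 3, 4, 5));
    PCollectionList<Integer> numbersList =
            numbers.apply(Partition.of(2, new PartitionFn<Integer>() {
                // numPartitions is the 2 passed to Partition.of above, i.e. the number of partitions.
                @Override
                public int partitionFor(Integer input, int numPartitions) {
                    if (input % 2 == 0) {
                        // Even: return 0 to send the element to partition 0.
                        return 0;
                    } else {
                        // Odd: return 1 to send the element to partition 1.
                        return 1;
                    }
                }
            }));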

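In Partition.of(count, PartitionFn), count is the number of partitions you want. When you implement partitionFor, the value you return is the index of the partition that the element is assigned to.
Once numbersList has been created, you can take the partitions you need from it and process them: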

    PCollection<Integer> pEven = numbersList.get(0);
    PCollection<Integer> pOdd = numbersList.get(1);

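However, this approach can only split data by assigning each element a partition index.
Note that the partition count must be fixed before the pipeline runs; it cannot be computed while the pipeline is running.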
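
Beam expects the value returned by partitionFor to fall in the range 0 to numPartitions - 1. The plain-Java sketch below (no Beam dependency) shows one way to compute such an index for an arbitrary partition count. The class name and the bucketFor helper are hypothetical and only illustrate the arithmetic that could go inside partitionFor; they are not part of the Beam API.

```java
public class PartitionIndexSketch {
    // Hypothetical helper, not part of Beam: maps any int to a bucket in [0, numPartitions).
    static int bucketFor(int input, int numPartitions) {
        // Math.floorMod returns a non-negative result for a positive divisor,
        // whereas the % operator keeps the sign of the dividend (-4 % 3 == -1).
        return Math.floorMod(input, numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 3; // chosen up front, before any data is processed
        for (int value : new int[] {-4, -1, 0, 5}) {
            System.out.println(value + " -> partition " + bucketFor(value, numPartitions));
        }
    }
}
```

Using floorMod rather than % matters only when the input may contain negative numbers; for the sample data 1 to 5 above, both give the same result.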
——————————————————————————————————————

The other way to split is to write your own filters and filter the dataset twice, producing two datasets.

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    // First filter: keep only the even numbers.
    PCollection<Integer> pEven = numbers.apply(ParDo.of(new DoFn<Integer, Integer>() {
        @ProcessElement
        public void processElement(ProcessContext context) {
            int element = context.element();
            if (element % 2 == 0) {
                context.output(element);
            }
        }
    }));
    // Second filter: keep only the odd numbers.
    PCollection<Integer> pOdd = numbers.apply(ParDo.of(new DoFn<Integer, Integer>() {
        @ProcessElement
        public void processElement(ProcessContext context) {
            int element = context.element();
            // "!= 0" rather than "== 1", because in Java -3 % 2 is -1.
            if (element % 2 != 0) {
                context.output(element);
            }
        }
    }));

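However, this approach applies two separate operations to the source dataset, so every element is examined twice, which can hurt performance. Prefer Partition for splitting a dataset whenever possible.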
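
When a dataset is split with two independent filters, the two conditions need to be complementary, so that every element ends up in exactly one output. The plain-Java sketch below (no Beam dependency) checks this for a pair of parity predicates. The class and field names are hypothetical stand-ins for the conditions inside the two DoFns.

```java
import java.util.function.IntPredicate;

public class ParityFilterSketch {
    // Hypothetical predicates standing in for the conditions of the two filter DoFns.
    static final IntPredicate IS_EVEN = x -> x % 2 == 0;
    // Written as "!= 0" so that negative odd numbers (remainder -1) still match.
    static final IntPredicate IS_ODD = x -> x % 2 != 0;

    // True when the element would be emitted by exactly one of the two filters.
    static boolean inExactlyOneOutput(int x) {
        return IS_EVEN.test(x) ^ IS_ODD.test(x);
    }

    public static void main(String[] args) {
        for (int x : new int[] {-3, -2, 0, 1, 4}) {
            String side = IS_EVEN.test(x) ? "even" : "odd";
            System.out.println(x + " -> " + side
                    + ", in exactly one output: " + inExactlyOneOutput(x));
        }
    }
}
```

A check of the form x % 2 == 1 behaves the same for the sample data 1 to 5, but it would drop negative odd numbers from both outputs, because Java's % operator keeps the sign of the dividend.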

Merging datasets

To merge, first combine the collections into a single PCollectionList, then apply Flatten to it.

    import org.apache.beam.sdk.transforms.Flatten;

    PCollectionList<Integer> numberList = PCollectionList.of(pOdd).and(pEven);
    PCollection<Integer> mergeNumber = numberList.apply(Flatten.<Integer>pCollections());

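Note that all datasets being merged must have the same element type. A PCollectionList<Integer> only accepts PCollection<Integer> members, so mixing element types causes the code to fail to compile or the pipeline to fail at run time.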
