1. Introduction
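When you want to split a data stream, the usual approach is to replicate the stream and then filter out of each copy the records you do not want. The split and side output operations handle this case better.
split is used together with select: split labels each record according to a condition, and select pulls out the records carrying a given label. It has one limitation: a stream can be split only once, so second-level splitting is not supported.
Side outputs do support second-level splitting.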
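For comparison, the copy-and-filter approach mentioned above might look like the sketch below. It assumes `SplitAccess` is a case class with `province` and `city` fields (the post does not show its definition), and the names `FilterSplitSketch`, `isGuangdong`, `isJiangsu`, and `filterSplit` are made up for illustration.

```scala
import org.apache.flink.streaming.api.scala._

// Hypothetical record type; the fields mirror the `province` and `city`
// accessors used by the examples in this post.
case class SplitAccess(province: String, city: String)

object FilterSplitSketch {
  // Predicates kept as plain functions so they can be checked without a cluster.
  def isGuangdong(a: SplitAccess): Boolean = "广东省" == a.province
  def isJiangsu(a: SplitAccess): Boolean = "江苏省" == a.province

  // Each filter is its own operator and sees every record of the input stream,
  // so N target streams mean N passes over the same data.
  def filterSplit(accessStream: DataStream[SplitAccess])
      : (DataStream[SplitAccess], DataStream[SplitAccess]) =
    (accessStream.filter(a => isGuangdong(a)), accessStream.filter(a => isJiangsu(a)))
}
```

Every record is evaluated once per filter here, which is the overhead that split and side outputs avoid by routing each record in a single pass.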
2. Practice
2.1 The split approach
Splitting once works. Splitting a second time raises: Consecutive multiple splits are not supported. Splits are deprecated. Please use side-outputs.
import java.{lang, util}

import org.apache.flink.streaming.api.collector.selector.OutputSelector
import org.apache.flink.streaming.api.scala._

/**
 * Splits by province and then by city. The second split is rejected with
 * java.lang.IllegalStateException:
 * Consecutive multiple splits are not supported. Splits are deprecated. Please use side-outputs.
 * In other words, split does not support multi-level splitting; use side outputs instead.
 *
 * @param accessStream stream of access records to split
 * @return the print sink attached to the 深圳市 sub-stream
 */
def splitTest(accessStream: DataStream[SplitAccess]) = {
  // First level: label each record with its province.
  val splitProvinceStream = accessStream.split(new OutputSelector[SplitAccess] {
    override def select(splitAccess: SplitAccess): lang.Iterable[String] = {
      val list = new util.ArrayList[String]()
      if ("广东省" == splitAccess.province) list.add("广东省")
      else if ("江苏省" == splitAccess.province) list.add("江苏省")
      list
    }
  })
  //splitProvinceStream.select("广东省").print()
  // Second level: split the 广东省 records by city. This is the step that fails.
  val splitCityStream = splitProvinceStream.select("广东省").split(new OutputSelector[SplitAccess] {
    override def select(splitAccess: SplitAccess): lang.Iterable[String] = {
      val list = new util.ArrayList[String]()
      if ("深圳市" == splitAccess.city) list.add("深圳市")
      else if ("广州市" == splitAccess.city) list.add("广州市")
      list
    }
  })
  splitCityStream.select("深圳市").print()
}
2.2 The side outputs approach
Side outputs support multi-level splitting.
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

/**
 * Splits by province and then by city using side outputs.
 *
 * @param accessStream stream of access records to split
 * @return the print sink attached to the 深圳市 side output
 */
def sideOutputsTest(accessStream: DataStream[SplitAccess]) = {
  // 1. Split by province
  val guangdongTag = new OutputTag[SplitAccess]("guangdong")
  val jiangsuTag = new OutputTag[SplitAccess]("jiangsu")
  val provinceSplitStream = accessStream.process(new ProcessFunction[SplitAccess, SplitAccess] {
    override def processElement(value: SplitAccess, ctx: ProcessFunction[SplitAccess, SplitAccess]#Context,
                                out: Collector[SplitAccess]): Unit = {
      // Records go only to side outputs; nothing is emitted to the main output `out`.
      // Records from any other province are dropped, as in the split version.
      if ("广东省" == value.province) {
        ctx.output(guangdongTag, value)
      } else if ("江苏省" == value.province) {
        ctx.output(jiangsuTag, value)
      }
    }
  })
  //provinceSplitStream.getSideOutput(guangdongTag).print("广东省分流结果:")
  // 2. Split the 广东省 records by city
  val shenzhenTag = new OutputTag[SplitAccess]("shenzhen")
  val guangzhouTag = new OutputTag[SplitAccess]("guangzhou")
  val guangdongDataStream = provinceSplitStream.getSideOutput(guangdongTag)
  val citySplitStream = guangdongDataStream.process(new ProcessFunction[SplitAccess, SplitAccess] {
    override def processElement(value: SplitAccess, ctx: ProcessFunction[SplitAccess, SplitAccess]#Context,
                                out: Collector[SplitAccess]): Unit = {
      if ("深圳市" == value.city) {
        ctx.output(shenzhenTag, value)
      } else if ("广州市" == value.city) {
        ctx.output(guangzhouTag, value)
      }
    }
  })
  citySplitStream.getSideOutput(shenzhenTag).print("广东省深圳市分流结果:")
}
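To try two-level side outputs locally, a small self-contained driver could look like the sketch below. It is only an illustration: the `SplitAccess` case class, the sample records, the "other" tags, and the names `SideOutputDriver` and `Router` are assumptions and do not come from the original job.

```scala
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Hypothetical record type and sample data, for illustration only.
case class SplitAccess(province: String, city: String)

object SideOutputDriver {
  // Plain predicates, so the routing rules can be checked without running a job.
  def isGuangdong(a: SplitAccess): Boolean = "广东省" == a.province
  def isShenzhen(a: SplitAccess): Boolean = "深圳市" == a.city

  // Sends a record to `matchTag` when `matches` holds, otherwise to `otherTag`.
  // Nothing is written to the main output.
  class Router(matches: SplitAccess => Boolean,
               matchTag: OutputTag[SplitAccess],
               otherTag: OutputTag[SplitAccess])
      extends ProcessFunction[SplitAccess, SplitAccess] {
    override def processElement(value: SplitAccess,
                                ctx: ProcessFunction[SplitAccess, SplitAccess]#Context,
                                out: Collector[SplitAccess]): Unit =
      ctx.output(if (matches(value)) matchTag else otherTag, value)
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val accessStream = env.fromElements(
      SplitAccess("广东省", "深圳市"),
      SplitAccess("广东省", "广州市"),
      SplitAccess("江苏省", "苏州市"))

    // Level 1: province
    val guangdongTag = new OutputTag[SplitAccess]("guangdong")
    val otherProvinceTag = new OutputTag[SplitAccess]("other-province")
    val byProvince = accessStream.process(
      new Router(a => isGuangdong(a), guangdongTag, otherProvinceTag))

    // Level 2: city, applied to the side output of level 1
    val shenzhenTag = new OutputTag[SplitAccess]("shenzhen")
    val otherCityTag = new OutputTag[SplitAccess]("other-city")
    val byCity = byProvince.getSideOutput(guangdongTag).process(
      new Router(a => isShenzhen(a), shenzhenTag, otherCityTag))

    byCity.getSideOutput(shenzhenTag).print("shenzhen")
    env.execute("side-output-sketch")
  }
}
```

With this sample data, only the 广东省 / 深圳市 record should reach the print sink. The second `process` call consumes a side output of the first, which is the chaining that split rejects.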
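3. Summary
- If you only need to split a stream once, either split or side outputs will work.
- If you need to split more than once, use side outputs. The error message itself notes that split is deprecated and recommends side outputs.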