Chapter 2 Data Processing Using the DataStream API: Partitioning

Physical partitioning

Flink allows us to perform physical partitioning of stream data. You also have the option of providing a custom partitioner. Let us have a look at the different types of partitioning.

Custom partitioning

As mentioned earlier, you can provide a custom implementation of a partitioner.

In Java:

inputStream.partitionCustom(partitioner, "somekey");
inputStream.partitionCustom(partitioner, 0);

In Scala:

inputStream.partitionCustom(partitioner, "somekey")
inputStream.partitionCustom(partitioner, 0)

While writing a custom partitioner, you need to make sure that you implement an efficient hash function.
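For illustration, here is a minimal sketch of what such a partitioner might look like in Java, assuming String keys (the class name KeyHashPartitioner is hypothetical, not from the original text):

import org.apache.flink.api.common.functions.Partitioner;

// Maps a String key to a partition index using its hash code.
public class KeyHashPartitioner implements Partitioner<String> {
    @Override
    public int partition(String key, int numPartitions) {
        // Mask off the sign bit so the index stays non-negative
        // even when hashCode() returns a negative value.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It can then be plugged in as shown above: inputStream.partitionCustom(new KeyHashPartitioner(), "somekey");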

Random partitioning

Random partitioning randomly distributes data stream records in an even manner across partitions.
In Java:

inputStream.shuffle();

In Scala:

inputStream.shuffle()

Rebalancing partitioning

This type of partitioning helps distribute the data evenly by using a round-robin method for distribution. It is a good choice when the data is skewed.

In Java:

inputStream.rebalance();

In Scala:

inputStream.rebalance()

Rescaling

Rescaling is used to distribute the data across operations, perform transformations on subsets of the data, and combine them together. This rebalancing happens over a single node only, hence it does not require any data transfer across the network.
The following diagram shows the distribution:


In Java:

inputStream.rescale();

In Scala:

inputStream.rescale()
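
To make the rescale semantics concrete, here is a hedged, self-contained Java sketch (the pipeline, element values, and parallelism settings are assumptions for illustration): with two upstream subtasks and four downstream subtasks, each upstream subtask round-robins only over its own pair of downstream subtasks.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RescaleExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Upstream stage running with 2 parallel subtasks.
        DataStream<String> upstream = env
                .fromElements("a", "b", "c", "d")
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) {
                        return value;
                    }
                })
                .setParallelism(2);

        // Each upstream subtask distributes round-robin over a subset of
        // the 4 downstream subtasks rather than over all of them.
        upstream.rescale()
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) {
                        return value.toUpperCase();
                    }
                })
                .setParallelism(4)
                .print();

        env.execute("Rescale example");
    }
}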

Broadcasting

Broadcasting distributes all records to each partition, fanning out every element to all downstream partitions.

In Java:

inputStream.broadcast();

In Scala:

inputStream.broadcast()

Data sinks

After the data transformations are done, we need to save the results somewhere. The following are some of the options Flink provides for saving results; a combined usage sketch follows the list:

  • writeAsText(): Writes records one line at a time as strings.
  • writeAsCsv(): Writes tuples as comma-separated value files. The row and field delimiters can also be configured.
  • print()/printToErr(): Writes records to the standard output. You can also choose to write to the standard error stream.
  • writeUsingOutputFormat(): You can also choose to provide a custom output format. While defining the custom format, you need to implement OutputFormat, which takes care of serialization and deserialization.
  • writeToSocket(): Flink supports writing data to a specific socket as well. It requires a SerializationSchema to be defined for proper serialization and formatting.
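
As a rough illustration, the following Java sketch wires several of these sinks together (the file paths, host, and port are hypothetical, and the SimpleStringSchema import path varies between Flink versions):

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SinkExamples {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> results = env.fromElements("a", "b", "c");
        results.writeAsText("/tmp/results");   // one record per line, as strings
        results.print();                       // standard output
        results.printToErr();                  // standard error
        results.writeToSocket("localhost", 9000, new SimpleStringSchema());

        // Tuples can be written as comma-separated values.
        DataStream<Tuple2<String, Integer>> pairs =
                env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2));
        pairs.writeAsCsv("/tmp/pairs.csv");

        env.execute("Sink examples");
    }
}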