Spark Streaming + Kafka Integration Guide: Location Strategies and Consumer Strategies

LocationStrategies
The new Kafka consumer API will pre-fetch messages into buffers. Therefore it is important for performance reasons that the Spark integration keep cached consumers on executors (rather than recreating them for each batch), and prefer to schedule partitions on the host locations that have the appropriate consumers.

In most cases, you should use LocationStrategies.PreferConsistent. This will distribute partitions evenly across available executors. If your executors are on the same hosts as your Kafka brokers, use PreferBrokers, which will prefer to schedule partitions on the Kafka leader for that partition. Finally, if you have a significant skew in load among partitions, use PreferFixed. This allows you to specify an explicit mapping of partitions to hosts (any unspecified partitions will use a consistent location).
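A minimal sketch of the three strategies, assuming a two-partition topic named "test" and placeholder executor host names; adjust both to your own cluster:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies

// Default choice: spread partitions evenly over the available executors.
val consistent = LocationStrategies.PreferConsistent

// Only when your executors run on the same hosts as your Kafka brokers.
val brokers = LocationStrategies.PreferBrokers

// Pin heavily loaded partitions to specific hosts; any partition not listed
// here falls back to a consistent placement.
val fixed = LocationStrategies.PreferFixed(Map(
  new TopicPartition("test", 0) -> "executor-host-1",
  new TopicPartition("test", 1) -> "executor-host-2"
))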

 

The cache for consumers has a default maximum size of 64. If you expect to be handling more than (64 * number of executors) Kafka partitions, you can change this setting via spark.streaming.kafka.consumer.cache.maxCapacity.
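For example, a sketch of raising the cache capacity through SparkConf; the application name and the value 128 are only illustrative:

import org.apache.spark.SparkConf

// Allow up to 128 cached Kafka consumers per executor instead of the default 64.
val conf = new SparkConf()
  .setAppName("kafka-streaming-example")
  .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128")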

 

The cache is keyed by topicpartition and group.id, so use a separate group.id for each call to createDirectStream.
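A small sketch of keeping group.id distinct per stream; the bootstrap server and group names are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer

val baseParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer]
)
// Each createDirectStream call gets its own group.id, so the two streams do not
// collide in the consumer cache (or in the same Kafka consumer group).
val paramsForStreamA = baseParams + ("group.id" -> "stream-a-group")
val paramsForStreamB = baseParams + ("group.id" -> "stream-b-group")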

 

ConsumerStrategies
The new Kafka consumer API has a number of different ways to specify topics, some of which require considerable post-object-instantiation setup. ConsumerStrategies provides an abstraction that allows Spark to obtain properly configured consumers even after restart from checkpoint.

ConsumerStrategies.Subscribe allows you to subscribe to a fixed collection of topics. SubscribePattern allows you to use a regex to specify topics of interest. Note that unlike the 0.8 integration, using Subscribe or SubscribePattern should respond to adding partitions during a running stream. Finally, Assign allows you to specify a fixed collection of partitions. All three strategies have overloaded constructors that allow you to specify the starting offset for a particular partition.
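A sketch of the three strategies, assuming kafkaParams is defined as in the guide's createDirectStream example and that the topic is named "test":

import java.util.regex.Pattern
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies

// Subscribe to a fixed collection of topics.
val subscribe = ConsumerStrategies.Subscribe[String, String](Seq("test"), kafkaParams)

// Subscribe to every topic matching a regex.
val subscribePattern = ConsumerStrategies.SubscribePattern[String, String](
  Pattern.compile("test.*"), kafkaParams)

// Assign a fixed collection of partitions, here with explicit starting offsets.
val startingOffsets = Map(
  new TopicPartition("test", 0) -> 100L,
  new TopicPartition("test", 1) -> 200L)
val assign = ConsumerStrategies.Assign[String, String](
  startingOffsets.keys.toList, kafkaParams, startingOffsets)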

 

If you have specific consumer setup needs that are not met by the options above, ConsumerStrategy is a public class that you can extend.
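A heavily hedged sketch of what such a subclass could look like; the class name and the assign-then-seek behaviour are purely illustrative, and the exact abstract members should be checked against the spark-streaming-kafka-0-10 version you use:

import java.{lang => jl, util => ju}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.{Consumer, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategy

// Hypothetical strategy: always assign a fixed set of partitions, then seek to
// whatever offsets Spark hands back (for example when restoring from a checkpoint).
class FixedAssignStrategy[K, V](
    partitions: Seq[TopicPartition],
    kafkaParams: ju.Map[String, Object]) extends ConsumerStrategy[K, V] {

  // Kafka parameters used for the consumers created on the executors.
  override def executorKafkaParams: ju.Map[String, Object] = kafkaParams

  // Called on the driver to build the consumer used for offset handling.
  override def onStart(
      currentOffsets: ju.Map[TopicPartition, jl.Long]): Consumer[K, V] = {
    val consumer = new KafkaConsumer[K, V](kafkaParams)
    consumer.assign(partitions.asJava)
    // assign() makes seek() legal immediately, unlike subscribe().
    currentOffsets.asScala.foreach { case (tp, offset) =>
      consumer.seek(tp, offset.longValue())
    }
    consumer
  }
}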


Creating an RDD
If you have a use case that is better suited to batch processing, you can create an RDD for a defined range of offsets.

// Import dependencies and create kafka params as in Create Direct Stream above
val offsetRanges = Array(
  // topic, partition, inclusive starting offset, exclusive ending offset
  OffsetRange("test", 0, 0, 100),
  OffsetRange("test", 1, 0, 100)
)
 
val rdd = KafkaUtils.createRDD[String, String](sparkContext, kafkaParams, offsetRanges, PreferConsistent)


For reference, the definition of OffsetRange:

final class OffsetRange private(val topic: String, val partition: Int, val fromOffset: Long, val untilOffset: Long) extends Serializable
Note that you cannot use PreferBrokers, because without the stream there is not a driver-side consumer to automatically look up broker metadata for you. Use PreferFixed with your own metadata lookups if necessary.
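Building on the createRDD example above, a sketch of PreferFixed; the host names stand in for whatever your own metadata lookup returns:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies

val hostMap = Map(
  new TopicPartition("test", 0) -> "kafka-host-1",
  new TopicPartition("test", 1) -> "kafka-host-2")

val rddWithFixedHosts = KafkaUtils.createRDD[String, String](
  sparkContext, kafkaParams, offsetRanges, LocationStrategies.PreferFixed(hostMap))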


Obtaining Offsets

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}

For reference, the inheritance structure of the KafkaRDD class:

private[spark] class KafkaRDD extends RDD with Logging with HasOffsetRanges 
Note that the typecast to HasOffsetRanges will only succeed if it is done in the first method called on the result of createDirectStream, not later down a chain of methods. Be aware that the one-to-one mapping between RDD partition and Kafka partition does not remain after any methods that shuffle or repartition, e.g. reduceByKey() or window().


Storing Offsets
Kafka delivery semantics in the case of failure depend on how and when offsets are stored. Spark output operations are at-least-once. So if you want the equivalent of exactly-once semantics, you must either store offsets after an idempotent output, or store offsets in an atomic transaction alongside output. With this integration, you have 3 options, in order of increasing reliability (and code complexity), for how to store offsets.

Checkpoints
If you enable Spark checkpointing, offsets will be stored in the checkpoint. This is easy to enable, but there are drawbacks. Your output operation must be idempotent, since you will get repeated outputs; transactions are not an option. Furthermore, you cannot recover from a checkpoint if your application code has changed. For planned upgrades, you can mitigate this by running the new code at the same time as the old code (since outputs need to be idempotent anyway, they should not clash). But for unplanned failures that require code changes, you will lose data unless you have another way to identify known good starting offsets.
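A sketch of enabling checkpointing with the standard getOrCreate pattern; sparkConf, the batch interval, and the checkpoint directory are placeholders:

import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(sparkConf, Seconds(5))
  // Offsets, along with the rest of the DStream metadata, go into this directory.
  ssc.checkpoint("hdfs:///checkpoints/kafka-example")
  // set up the direct stream and its output operations here
  ssc
}

// On restart, recover the context, including stored offsets, from the checkpoint.
val streamingContext =
  StreamingContext.getOrCreate("hdfs:///checkpoints/kafka-example", () => createContext())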

Kafka itself
Kafka has an offset commit API that stores offsets in a special Kafka topic. By default, the new consumer will periodically auto-commit offsets. This is almost certainly not what you want, because messages successfully polled by the consumer may not yet have resulted in a Spark output operation, resulting in undefined semantics. This is why the stream example above sets “enable.auto.commit” to false. However, you can commit offsets to Kafka after you know your output has been stored, using the commitAsync API. The benefit as compared to checkpoints is that Kafka is a durable store regardless of changes to your application code. However, Kafka is not transactional, so your outputs must still be idempotent.

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
 
  // some time later, after outputs have completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
As with HasOffsetRanges, the cast to CanCommitOffsets will only succeed if called on the result of createDirectStream, not after transformations. The commitAsync call is threadsafe, but must occur after outputs if you want meaningful semantics.

Your own data store
For data stores that support transactions, saving offsets in the same transaction as the results can keep the two in sync, even in failure situations. If you’re careful about detecting repeated or skipped offset ranges, rolling back the transaction prevents duplicated or lost messages from affecting results. This gives the equivalent of exactly-once semantics. It is also possible to use this tactic even for outputs that result from aggregations, which are typically hard to make idempotent.

// The details depend on your data store, but the general idea looks like this
 
// begin from the offsets committed to the database
val fromOffsets = selectOffsetsFromYourDatabase.map { resultSet =>
  new TopicPartition(resultSet.string("topic"), resultSet.int("partition")) -> resultSet.long("offset")
}.toMap
 
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
)
 
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
 
  val results = yourCalculation(rdd)
 
  // begin your transaction
 
  // update results
  // update offsets where the end of existing offsets matches the beginning of this batch of offsets
  // assert that offsets were updated correctly
 
  // end your transaction
}
