How-to: make Spark Streaming collect data from Kafka topics and store it into HDFS

Development steps:
  1. Develop a class that connects to Kafka topics and stores the data into HDFS.
    In the Spark project:
    ./examples/src/main/scala/org/apache/spark/examples/streaming/Kafka.scala
    package org.apache.spark.examples.streaming

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming._
    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.kafka._

    object Kafka {

      def main(args: Array[String]): Unit = {
        if (args.length < 5) {
          System.err.println("Usage: Kafka <zkQuorum> <group> <topics> <numThreads> <output>")
          System.exit(1)
        }
        val Array(zkQuorum, group, topics, numThreads, output) = args
        val sparkConf = new SparkConf().setAppName("Kafka")
        val ssc = new StreamingContext(sparkConf, Seconds(2))
        ssc.checkpoint("checkpoint")

        // Map each topic name to the number of consumer threads to use for it.
        val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
        // createStream yields (key, message) pairs; keep only the message body.
        val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
        lines.print()
        // Each 2-second batch is written to HDFS under a directory named
        // from <output> and the batch time.
        lines.saveAsTextFiles(output, "txt")
        ssc.start()
        ssc.awaitTermination()
      }
    }
  2. Generate a new spark-examples jar:
    cd examples
    mvn -Pyarn -DskipTests clean package
  3. Replace the cluster's spark-examples-*.jar with the newly generated one.
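The topic-to-thread mapping built in step 1 can be exercised on its own. The sketch below (the `TopicMapDemo` object and its sample argument values are made up for illustration) mirrors the single line from Kafka.scala that turns the comma-separated <topics> argument into the Map that KafkaUtils.createStream expects:

```scala
// Standalone demo of the topic map construction used in Kafka.scala:
// each topic from the comma-separated <topics> argument is paired with
// the <numThreads> consumer-thread count.
object TopicMapDemo {
  def topicMap(topics: String, numThreads: String): Map[String, Int] =
    topics.split(",").map((_, numThreads.toInt)).toMap

  def main(args: Array[String]): Unit =
    println(topicMap("test,logs", "2")) // Map(test -> 2, logs -> 2)
}
```

Note that every topic gets the same thread count; to weight topics differently the map would have to be built by hand.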
Testing:
  • Start the Kafka server and a console producer:
    cd ${KAFKA_HOME}
    bin/kafka-server-start.sh config/server.properties
    bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
  • Start the Spark Streaming job and connect it to Kafka:
    bin/spark-submit --master yarn-cluster --class org.apache.spark.examples.streaming.Kafka /opt/spark/lib/spark-examples-1.3.0-cdh5.4.1-hadoop2.6.0-cdh5.4.1.jar localhost:2183 group_kafka test 1 topics
    Notice: group_kafka is the consumer group id for this Spark Streaming consumer; it can be any string.
  • When the YARN application turns into state RUNNING, type a message in the Kafka producer:
    this is a testing message
  • Data is being written into HDFS:
    The number after the topics- prefix in each directory name is the batch time in milliseconds (TIME_IN_MS).
    [hadoop@master root]$ hadoop fs -ls /user/hadoop/
    Found 82 items
    drwxr-xr-x   - hadoop supergroup          0 2015-06-18 13:13 /user/hadoop/.sparkStaging
    drwxr-xr-x   - hadoop supergroup          0 2015-06-18 13:13 /user/hadoop/checkpoint
    drwxr-xr-x   - hadoop supergroup          0 2015-06-18 13:11 /user/hadoop/topics-1434604268000
    drwxr-xr-x   - hadoop supergroup          0 2015-06-18 13:11 /user/hadoop/topics-1434604270000
    drwxr-xr-x   - hadoop supergroup          0 2015-06-18 13:11 /user/hadoop/topics-1434604272000
    drwxr-xr-x   - hadoop supergroup          0 2015-06-18 13:11 /user/hadoop/topics-1434604274000
    [hadoop@master root]$ hadoop fs -cat /user/hadoop/topics-1434604274000/part-00000
    this is a testing message
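One of those batch timestamps can be decoded to confirm the naming scheme. This sketch (the `BatchTimeDemo` name is made up for illustration) converts the number from the last directory above back to a UTC instant; the result is consistent with the 13:11 listing time on a machine whose clock runs at UTC+8:

```scala
import java.time.Instant

// The number in a batch directory name (e.g. topics-1434604274000) is the
// batch time in epoch milliseconds; decode it to a human-readable instant.
object BatchTimeDemo {
  def main(args: Array[String]): Unit =
    println(Instant.ofEpochMilli(1434604274000L)) // 2015-06-18T05:11:14Z
}
```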