Learning Spark (8): Getting Started with Spark Streaming

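Spark Streaming concepts

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources such as Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions such as map, reduce, join, and window. Finally, the processed data can be pushed out to file systems, databases, and live dashboards. In fact, you can also apply Spark's machine learning and graph processing algorithms to data streams.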

My own definition of Spark Streaming:
Data from different sources is processed by Spark Streaming, and the results are written to an external file system.

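Features
Low latency
Efficient recovery from failures (fault-tolerant)
Scales to hundreds or thousands of nodes
Can be combined with the other Spark sub-frameworks, such as batch processing, machine learning, and graph computation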

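Does Spark Streaming need to be installed separately?
No. Spark is a one-stack framework, so Spark Streaming ships with it.
This is the idea behind the slogan "One stack to rule them all".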

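Spark Streaming use cases

The upper half of the figure shows real-time detection of transaction fraud.
The lower half shows real-time monitoring of electronic sensors.

In real production systems the range of applications is much wider.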

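Using Spark Streaming with the rest of the Spark ecosystem

Combining batch processing with stream processing

See the figure above; a later article will cover the implementation.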
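As a rough illustration of mixing batch and streaming data, the sketch below joins each micro-batch of a stream with a static RDD by way of DStream.transform. It is not the implementation promised for the later article. The object name, the blacklist contents, and the rule that the first token of a line is the user name are all hypothetical; the host and port are the ones used elsewhere in this article.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical example: drop lines whose first token is on a static blacklist.
object BlacklistFilterSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("BlacklistFilterSketch")
      .setIfMissing("spark.master", "local[2]") // only used when no master is supplied
    val ssc = new StreamingContext(conf, Seconds(1))

    // Static (batch) data set, built once. The names are made up for illustration.
    val blacklist = ssc.sparkContext
      .parallelize(Seq("spam", "bot"))
      .map(name => (name, true))

    // Streaming data, keyed by the first token of every line.
    val lines = ssc.socketTextStream("hadoop000", 9999)
    val keyed = lines.map(line => (line.split(" ")(0), line))

    // transform exposes each micro-batch as an RDD, so ordinary RDD joins are available.
    val allowed = keyed.transform { rdd =>
      rdd.leftOuterJoin(blacklist)
        .filter { case (_, (_, flagged)) => flagged.isEmpty } // keep keys not on the blacklist
        .map { case (_, (line, _)) => line }
    }

    allowed.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because the function passed to transform is evaluated for every batch, the static RDD could also be reloaded periodically if the reference data changes.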

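Models trained offline can be plugged into Spark Streaming and applied online.

Querying streaming data interactively with SQL

See the figure above; a later article will cover the implementation.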
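One common way to run SQL over a stream, sketched here under assumptions, is to turn each micro-batch into a DataFrame inside foreachRDD and query it through a temporary view. The object name, the view name words, and the query are illustrative choices, and the host and port are the ones used elsewhere in this article; the later article may well take a different approach.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical example: word count expressed as a SQL query over each micro-batch.
object SqlWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SqlWordCountSketch")
      .setIfMissing("spark.master", "local[2]") // only used when no master is supplied
    val ssc = new StreamingContext(conf, Seconds(1))

    val words = ssc.socketTextStream("hadoop000", 9999).flatMap(_.split(" "))

    words.foreachRDD { rdd =>
      // Reuse (or lazily create) a single SparkSession built from the same configuration.
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._

      // Register the current batch as a temporary view and query it with SQL.
      rdd.toDF("word").createOrReplaceTempView("words")
      spark.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word").show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The view is replaced on every batch, so each query only sees the data of the current batch interval.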

History of Spark Streaming

Spark Streaming graduated from alpha status in version 0.9 and has been used in production since then.

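Getting started with Spark Streaming through a word count

The Spark source code is on GitHub:
https://github.com/apache/spark
The repository contains many examples to learn from. The one used here is NetworkWordCount: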

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

// scalastyle:off println
package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
 *
 * Usage: NetworkWordCount <hostname> <port>
 * <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive data.
 *
 * To run this on your local machine, you need to first run a Netcat server
 *    `$ nc -lk 9999`
 * and then run the example
 *    `$ bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999`
 */
object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    // Create the context with a 1 second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
// scalastyle:on println

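Testing NetworkWordCount

Submitting with spark-submit

As the usage note in the example says, first open port 9999 with Netcat (nc -lk 9999).

Then submit the application with spark-submit, which is the approach used in production: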

./spark-submit --master local[2] \
--class org.apache.spark.examples.streaming.NetworkWordCount \
--name NetworkWordCount \
/home/hadoop/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/jars/spark-examples_2.11-2.2.0.jar hadoop000 9999

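Open another client terminal

Test: type some input

Check the terminal where spark-submit is running

Type more input

Check the spark-submit terminal again

Submitting with spark-shell

spark-shell is convenient for testing. Start it with: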

./spark-shell --master local[2]

Then paste the following code at the spark-shell prompt:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("hadoop000", 9999)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()

The test procedure is the same as for spark-submit: type test data in one client terminal and check the results in the spark-shell terminal.

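How Spark Streaming works (coarse-grained)

Spark Streaming receives a live data stream, slices the data into small batches according to the configured batch interval, and then hands each batch to the Spark engine for processing.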
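The slicing idea can be imitated in plain Scala without Spark. The sketch below is only a toy model: the Record type, the function names, and the sample data are hypothetical, and real Spark Streaming batches data by arrival time at the receiver rather than by a timestamp field.

```scala
// Hypothetical record type: a timestamp in milliseconds plus a line of text.
final case class Record(timestampMs: Long, line: String)

object MicroBatchSketch {
  // Assign every record to the batch whose time window contains it.
  // Batch k covers [k * batchIntervalMs, (k + 1) * batchIntervalMs).
  def sliceIntoBatches(records: Seq[Record], batchIntervalMs: Long): Map[Long, Seq[Record]] =
    records.groupBy(r => r.timestampMs / batchIntervalMs)

  // The per-batch job: an ordinary word count over one small batch.
  def wordCount(batch: Seq[Record]): Map[String, Int] =
    batch
      .flatMap(_.line.split(" "))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val input = Seq(
      Record(100L, "a b a"),
      Record(900L, "b"),
      Record(1200L, "c c")
    )
    // With a 1 second interval, the first two records share a batch and the third starts a new one.
    val batches = sliceIntoBatches(input, batchIntervalMs = 1000L)
    batches.toSeq.sortBy(_._1).foreach { case (id, batch) =>
      println(s"batch $id -> ${wordCount(batch)}")
    }
  }
}
```

Each batch is processed independently with the same logic, which is why the streaming word count in this article looks almost identical to a batch word count.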

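How Spark Streaming works (fine-grained)

1. The Driver builds the contexts that prepare the application for processing. StreamingContext is built on top of SparkContext.
2. The Driver launches Receivers to receive the incoming data.
3. Each Receiver runs as a task on an Executor.
4. As data arrives, the Receiver splits it into blocks and stores them in memory. If replication is configured, the blocks are also copied to other Executors.
5. The Receiver reports the block information back to StreamingContext, which submits jobs to SparkContext.
6. SparkContext distributes the jobs to the Executors for processing.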
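The steps above map onto a few settings that can be chosen in code. The following is a minimal sketch, assuming the same hadoop000:9999 socket source used earlier; the object name is hypothetical and the values are examples rather than recommendations.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical example showing where receiver-related settings are applied.
object ReceiverSettingsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ReceiverSettingsSketch")
      // Step 3: the receiver runs as a long-lived task and occupies one core,
      // so a local run needs at least two threads (receiver + batch processing).
      .setIfMissing("spark.master", "local[2]")
      // Step 4: how often the receiver cuts buffered data into a new block
      // (200ms is Spark's default value).
      .set("spark.streaming.blockInterval", "200ms")

    // Step 1: StreamingContext creates and wraps a SparkContext.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Step 4: storage levels ending in "_2" keep a second copy of each block
    // on another Executor when the cluster has more than one.
    val lines = ssc.socketTextStream("hadoop000", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)

    // Steps 5 and 6: at every batch interval a job is generated for this output
    // operation and scheduled on the Executors.
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

In a purely local run there is only one Executor, so the replicated storage level has no practical effect; it matters once the application runs on a cluster.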
