Using Spark Streaming with Flume

First, add the dependency to the project in IDEA (${spark.version} is a Maven property defined in the POM's <properties> section; the _2.10 suffix means the artifact is built for Scala 2.10):
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.10</artifactId>
    <version>${spark.version}</version>
</dependency>

To install Flume on Linux, extract the Flume archive, then in the conf directory copy flume-env.sh.template to flume-env.sh and set JAVA_HOME in it to your Java installation directory.
1. Flume actively pushes data to Spark Streaming
import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object FlumePushDemo {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.WARN)
    // local[2]: at least two threads are required here -- one to run the receiver
    // and one to process the received data on the workers
    val config = new SparkConf().setAppName("FlumePushDemo").setMaster("local[2]")
    val sc = new SparkContext(config)
    val ssc = new StreamingContext(sc, Seconds(2))
    // the address of the node where the Spark program runs; the Flume avro sink pushes to it
    val flumeStream = FlumeUtils.createStream(ssc, "192.168.10.11", 8008)
    // word count over each 2-second batch
    flumeStream.flatMap(x => new String(x.event.getBody.array()).split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Flume configuration file:
# Start command (run from the Flume installation directory):
# bin/flume-ng agent -n a1 -c conf/ -f config/flume-push.conf -Dflume.root.logger=INFO,console
# Flume actively pushes data to Spark
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# source
a1.sources.r1.type = exec
# monitor (tail) a file on the Linux host
a1.sources.r1.command = tail -F /home/hadoop/access.log
a1.sources.r1.channels = c1

# Describe the sink
# avro sink bound to a host and port (the address the Spark receiver listens on)
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.10.11
a1.sinks.k1.port = 8008
# also print events to the console
a1.sinks.k2.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
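
To check the push pipeline end to end, it helps to append test lines to the file the exec source tails. The small generator below is only an illustrative sketch, not part of the original setup; it assumes it runs on the same host as the Flume agent and writes to /home/hadoop/access.log from the config above.

import java.io.{FileWriter, PrintWriter}

// Hypothetical helper: append space-separated words to the tailed log file
// so the word count in FlumePushDemo has something to aggregate.
object AccessLogGenerator {
  def main(args: Array[String]): Unit = {
    val words = Seq("spark", "flume", "streaming", "hadoop")
    // true = append mode, so tail -F picks up the new lines
    val writer = new PrintWriter(new FileWriter("/home/hadoop/access.log", true))
    try {
      for (i <- 1 to 100) {
        writer.println(s"${words(i % words.length)} ${words((i + 1) % words.length)}")
        writer.flush()
        Thread.sleep(500) // spread the lines across several 2-second batches
      }
    } finally {
      writer.close()
    }
  }
}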

2. Spark Streaming actively pulls data from Flume (this is better than the push approach above, because Spark pulls data according to its capacity to process it)
First, copy these three jars into Flume's lib directory:
spark-streaming-flume-sink_2.10-1.6.1.jar
scala-library-2.10.5.jar
commons-lang3-3.3.2.jar
Second, create the DStream with FlumeUtils.createPollingStream:
import java.net.InetSocketAddress

import org.apache.log4j.{Level, Logger}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object FlumePullDemo {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.WARN)
    // local[2]: at least two threads are required here -- one to run the receiver
    // and one to process the received data on the workers
    val config = new SparkConf().setAppName("FlumePullDemo").setMaster("local[2]")
    val sc = new SparkContext(config)
    val ssc = new StreamingContext(sc, Seconds(2))
    // address of the Flume agent running the SparkSink (here the same node the Spark
    // program runs on); more addresses can be added to the Seq
    val addresses: Seq[InetSocketAddress] = Seq(new InetSocketAddress("192.168.10.11", 8008))
    val flumeStream = FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_ONLY)
    // word count over each 2-second batch
    flumeStream.flatMap(x => new String(x.event.getBody.array()).split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
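
As the comment in FlumePullDemo notes, several agent addresses can be passed to createPollingStream. The sketch below shows what that would look like; the second host 192.168.10.12 is hypothetical, and each listed agent would need to run the SparkSink on the given port.

import java.net.InetSocketAddress

import org.apache.log4j.{Level, Logger}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: the same word count as FlumePullDemo, but polling two Flume agents.
object FlumeMultiPullDemo {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.WARN)
    val config = new SparkConf().setAppName("FlumeMultiPullDemo").setMaster("local[2]")
    val ssc = new StreamingContext(new SparkContext(config), Seconds(2))
    val addresses: Seq[InetSocketAddress] = Seq(
      new InetSocketAddress("192.168.10.11", 8008),
      new InetSocketAddress("192.168.10.12", 8008) // hypothetical second agent
    )
    FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_ONLY)
      .flatMap(x => new String(x.event.getBody.array()).split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}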

Flume configuration file:
# Start command (run from the Flume installation directory):
# bin/flume-ng agent -n a1 -c conf/ -f config/flume-pull.conf -Dflume.root.logger=INFO,console
# Spark actively pulls data from Flume
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/access.log
a1.sources.r1.channels = c1

# Describe the sink
# tell Flume to sink into the component provided by Spark (SparkSink), so Spark can pull from it
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = 192.168.10.11
a1.sinks.k1.port = 8008
# also print events to the console
a1.sinks.k2.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
