Spark Streaming 实时监控一个HDFS的文件夹，当新的文件进来（名字不能重复），将对新文件进行处理。

原創

2020-02-20 23:32

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Spark Streaming 实时监控一个HDFS的文件夹，当新的文件进来（名字不能重复），将对新文件进行处理。
  * Created by csw on 2017/7/4.
  */
object HDFSDemo {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.WARN)
    val config = new SparkConf().setAppName("Spark shell")
    val ssc = new StreamingContext(config, Seconds(10))
    val lines = ssc.textFileStream("hdfs://master:9000/csw/tmp2/test/")
    val words: DStream[String] = lines.flatMap(_.split(" "))
    val wordCounts: DStream[(String, Int)] = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

//下满是获取Linux本地的文件

val lines = ssc.textFileStream("file:///csw/tmp/test2")

发布了45 篇原创文章 · 获赞 23 · 访问量 7万+

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

相關文章

第四范式OpenMLDB: 拓展Spark源码实现高性能Join

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"

第四范式技术团队

2021-09-18 17:23:51

伴鱼数仓演进

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"typ

伴鱼技术团队

2021-08-14 08:03:57

Apache Kyuubi PPMC燕青：为什么说这是开源最好的时代？

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"typ

2021-08-04 09:33:50

如何从Pandas迁移到Spark？这8个问答解决你所有疑问

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"typ

2021-06-18 08:03:55

伴鱼实时计算平台 Palink 的设计与实现

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"typ

伴鱼技术团队

2021-06-13 07:03:55

提效7倍，Apache Spark 自适应查询优化在网易的深度实践及改进

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null

2021-05-19 11:08:57

大数据技术升级脉络及认知陷阱 | InfoQ 大咖说

直播內容：多年來，大數據技術經歷了幾輪更迭，在計算、存儲、大規模落地等層面均取得了不錯的進展，並在不斷的成長和成熟，整個生態領域也得到了快速發展。目前，基於分析的大數據計算平臺在各大公司發揮着非常重要的基礎設施的作用。本期，網易數據科學

InfoQ 中文站

2021-04-26 10:43:51

实时数据仓库的发展、架构和趋势

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null

2021-04-02 09:43:51

大数据+云：Kylin/Spark/Clickhouse/Hudi 的大佬们怎么看？

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"typ

2021-03-22 18:35:29

如何用Spark计算引擎执行FATE联邦学习任务？

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null

2021-03-22 18:34:37

估值突破280亿美元！大数据独角兽公司Databricks再获10亿美元融资

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"typ

2021-02-02 03:03:58

数据倾斜？Spark 3.0 AQE专治各种不服

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"typ

2021-01-21 19:33:54

英雄惜英雄-当Spark遇上Zeppelin之实战案例

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"typ

2021-01-18 18:53:58

Apache Spark 3.0新特性在FreeWheel核心业务数据团队的应用与实战

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"引言"}]},{"t

2021-01-06 15:53:58

深入浅出Spark（四）：存储系统

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null

2020-12-28 09:03:52

24小時熱門文章

最新文章

最新評論文章