To read data from files on any file system compatible with the HDFS API (HDFS, S3, NFS, etc.), a DStream can be created as:
streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)
Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written in nested directories are not supported). Note that:
- The files must have the same data format.
- The files must be created in the dataDirectory by atomically moving or renaming them into it.
- Once moved, the files must not be changed. If a file is continuously appended to, the new data will not be read.

For simple text files, there is an easier method: streamingContext.textFileStream(dataDirectory). File streams do not require running a receiver, and hence do not require allocating cores.
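The typed fileStream variant takes a key class, a value class, and a Hadoop (new-API) InputFormat. A minimal sketch, assuming a local two-core master and a hypothetical monitored directory (replace the "hdfs://..." placeholder with a real path); for TextInputFormat the key is the line's byte offset and the value is the line itself:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TypedFileStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TypedFileStream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(20))
    // fileStream yields (key, value) pairs; with TextInputFormat the key is
    // the byte offset (LongWritable) and the value is the line (Text).
    val stream = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs://...")
    // Keep only the line text; discard the offsets.
    val lines = stream.map { case (_, text) => text.toString }
    lines.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

For plain text this is equivalent to textFileStream, but the typed form lets you plug in other InputFormats such as SequenceFileInputFormat.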
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Created by MingDong on 2016/12/6.
 */
object FileWordCount {
  // Point Hadoop at a local installation when running on Windows
  System.setProperty("hadoop.home.dir", "D:\\hadoop")

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("FileWordCount").setMaster("local[2]")
    // Create the streaming context from the Spark configuration,
    // with a batch interval of 20 seconds
    val ssc = new StreamingContext(sparkConf, Seconds(20))
    // Directory to monitor for newly created files
    val lines = ssc.textFileStream("file:///D:/data/")
    // Count the words in newly arrived files and print the result
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    // Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}
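To feed a program like this, the atomic-move requirement matters: write each file completely in a staging directory, then move it into the monitored directory in one step. A minimal sketch using java.nio (the directory and file names here are illustrative, not from the original):

```scala
import java.nio.file.{Files, StandardCopyOption}

// Stage the file outside the monitored directory and write it completely.
val staging = Files.createTempDirectory("staging")
val watched = Files.createTempDirectory("watched")
val tmp = staging.resolve("batch-001.txt")
Files.write(tmp, "hello streaming\n".getBytes("UTF-8"))

// Move it into the monitored directory in a single atomic step, so the
// streaming job never observes a half-written file.
Files.move(tmp, watched.resolve("batch-001.txt"), StandardCopyOption.ATOMIC_MOVE)
```

If the staging and monitored directories are on different file systems, an atomic move may not be possible; keep both on the same mount (or, on HDFS, use a rename within the same cluster).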