To read data from files on any file system compatible with the HDFS API (HDFS, S3, NFS, etc.), a DStream can be created as:
streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)
Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written in nested directories are not supported). Note that:
- The files must have the same data format.
- The files must be created in the dataDirectory by atomically moving or renaming them into it.
- Once moved, the files must not be changed. If a file is continuously appended to, the new data will not be read.

For simple text files, there is an easier method: streamingContext.textFileStream(dataDirectory). File streams do not require running a receiver, and hence do not require allocating cores.
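The typed fileStream variant takes a key class, a value class, and a Hadoop (new-API) InputFormat. A minimal sketch, assuming a local two-core master and a hypothetical monitored directory (replace the "hdfs://..." placeholder with a real path); for TextInputFormat the key is the line's byte offset and the value is the line itself:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TypedFileStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TypedFileStream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(20))
    // fileStream yields (key, value) pairs; with TextInputFormat the key is
    // the byte offset (LongWritable) and the value is the line (Text).
    val stream = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs://...")
    // Keep only the line text; discard the offsets.
    val lines = stream.map { case (_, text) => text.toString }
    lines.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

For plain text this is equivalent to textFileStream, but the typed form lets you plug in other InputFormats such as SequenceFileInputFormat.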
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Created by MingDong on 2016/12/6.
 */
object FileWordCount {
  // Point Hadoop at a local installation when running on Windows
  System.setProperty("hadoop.home.dir", "D:\\hadoop")

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("FileWordCount").setMaster("local[2]")
    // Create the streaming context from the Spark configuration,
    // with a batch interval of 20 seconds
    val ssc = new StreamingContext(sparkConf, Seconds(20))
    // Directory to monitor for newly created files
    val lines = ssc.textFileStream("file:///D:/data/")
    // Count the words in newly arrived files and print the result
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    // Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}
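To feed a program like this, the atomic-move requirement matters: write each file completely in a staging directory, then move it into the monitored directory in one step. A minimal sketch using java.nio (the directory and file names here are illustrative, not from the original):

```scala
import java.nio.file.{Files, StandardCopyOption}

// Stage the file outside the monitored directory and write it completely.
val staging = Files.createTempDirectory("staging")
val watched = Files.createTempDirectory("watched")
val tmp = staging.resolve("batch-001.txt")
Files.write(tmp, "hello streaming\n".getBytes("UTF-8"))

// Move it into the monitored directory in a single atomic step, so the
// streaming job never observes a half-written file.
Files.move(tmp, watched.resolve("batch-001.txt"), StandardCopyOption.ATOMIC_MOVE)
```

If the staging and monitored directories are on different file systems, an atomic move may not be possible; keep both on the same mount (or, on HDFS, use a rename within the same cluster).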