使用SparkStreaming+SparkSQL實現在線動態計算出特定時間窗口下的不同種類商品中的熱門商品排名
1、Streaming+SQL技術實現解析
2、Streaming+SQL實現實戰
啓動hive metastore
hive --service metastore &
package com.tom.spark.sparkstreaming
import org.apache.spark.{SparkConf, rdd}
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.streaming.{Durations, Seconds, StreamingContext}
/**
* 使用SparkStreaming+Spark SQL來在線動態計算電商中不同類別中最熱門的商品排名,例如手機這個類別下面最熱門的三種手機
* 電視這個類別下最熱門的三種電視,該實例在實際生產環境下具有非常重大的意義
*
* 實現技術:SparkStreaming+Spark SQL,之所以Spark Streaming能夠使用ML、SQL、graphX等功能是因爲有foreachRDD和transform
* 等接口,這些接口中其實是基於RDD進行操作的,所以以RDD爲基石,就可以直接使用Spark其他所有的功能,就像直接調用API一樣簡單。
* 假設說這裏的數據的格式:user item category,例如Rocky Samsung Android
*/
/**
 * Computes, over a sliding window, the top-3 most-clicked items per category
 * and persists each window's result to the `categorytop3` DB table.
 *
 * Input lines arrive on a socket in the form "user item category"
 * (e.g. "Rocky Samsung Android"). Streaming hands each window's data to
 * Spark SQL via foreachRDD, which is what lets us rank with a window function.
 */
object OnlineTop3ItemForEachCategory2DB {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("OnlineTop3ItemForEachCategory2DB").setMaster("local[2]")
    // The batch interval is the basic unit of job generation in Spark Streaming;
    // the window length and slide interval below must be integer multiples of it.
    val ssc = new StreamingContext(conf, Durations.seconds(5))
    // Checkpointing is required by reduceByKeyAndWindow's inverse-function form.
    ssc.checkpoint("/root/Documents/sparkApps/checkpoint")

    val userClickLogDStream = ssc.socketTextStream("Master", 9999)

    // Keep only well-formed "user item category" lines and key each click by
    // "category_item" with a count of 1.
    // FIX: the original built the key as fields(2) + "_" + fields(2) — the
    // category joined with itself — so the item was dropped and every ranking
    // was meaningless. The key must be category (index 2) + "_" + item (index 1).
    val formattedUserClickLogsDStream = userClickLogDStream
      .filter(_.split(" ").length == 3)
      .map { clickLog =>
        val fields = clickLog.split(" ")
        (fields(2) + "_" + fields(1), 1)
      }

    // 60-second window sliding every 20 seconds. The inverse function (_ - _)
    // lets Spark update the window incrementally instead of recomputing it.
    val categoryUserClickLogsDStream =
      formattedUserClickLogsDStream.reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(20))

    categoryUserClickLogsDStream.foreachRDD { rdd =>
      if (rdd.isEmpty()) {
        println("No data inputted!!!")
      } else {
        // Split each "category_item" key back apart and build Rows for SQL.
        val categoryItemRow = rdd.map { reducedItem =>
          val parts = reducedItem._1.split("_")
          Row(parts(0), parts(1), reducedItem._2)
        }
        val structType = StructType(Array(
          StructField("category", StringType, true),
          StructField("item", StringType, true),
          StructField("click_count", IntegerType, true)
        ))
        val hiveContext = new HiveContext(rdd.context)
        val categoryItemDF = hiveContext.createDataFrame(categoryItemRow, structType)
        categoryItemDF.registerTempTable("categoryItemTable")

        // Rank items within each category and keep the top 3.
        // FIX: the original concatenation yielded "...subqueryWHERE rank <=3"
        // (no space before WHERE), which is invalid SQL and fails at runtime.
        val resultDataFrame = hiveContext.sql(
          "SELECT category, item, click_count FROM " +
            "(SELECT category, item, click_count, " +
            "row_number() OVER (PARTITION BY category ORDER BY click_count DESC) rank " +
            "FROM categoryItemTable) subquery " +
            "WHERE rank <= 3")

        val resultRowRDD = resultDataFrame.rdd
        resultRowRDD.foreachPartition { partitionOfRecords =>
          // ConnectionPool is a static, lazily initialized pool of connections.
          if (partitionOfRecords.isEmpty) {
            println("This RDD is not null, but partition is null!!!")
          } else {
            val connection = ConnectionPool.getConnection()
            try {
              // FIX: the original created a new Statement for every record and
              // never closed any of them — a per-record resource leak. One
              // Statement per partition, closed in finally, is sufficient.
              // NOTE(review): string-built SQL is injection-prone; a
              // PreparedStatement would be safer if the input were untrusted.
              val stmt = connection.createStatement()
              try {
                partitionOfRecords.foreach { record =>
                  val sql = "insert into categorytop3(category, item, click_count) values ('" +
                    record.getAs("category") + "','" +
                    record.getAs("item") + "'," + record.getAs("click_count") + ")"
                  stmt.executeUpdate(sql)
                }
              } finally {
                stmt.close()
              }
            } finally {
              // FIX: return the connection even when an insert throws,
              // otherwise the pool leaks a connection per failed partition.
              ConnectionPool.returnConnection(connection)
            }
          }
        }
      }
    }
    // In production the per-window results would typically also be published to
    // Kafka for downstream consumers (e.g. a billing system) to pull.
    ssc.start()
    ssc.awaitTermination()
  }
}