Steps to build the project
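- First install Scala, sbt, and Spark, and note which version of each is installed.
- The sbt version can be checked by running the following in the sbt shell:
sbtVersion
- Starting spark-shell shows the Spark version on the machine and the Scala version it was built with.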
- Install the Scala plugin from the Plugins settings in IDEA.
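The version check above can also be done from code. The following is a minimal sketch, assuming only the Scala standard library; the object name ShowScalaVersion and its two methods are introduced here for illustration. It prints the full Scala version and the major.minor "binary" version that library artifact names carry as a suffix.

```scala
// Hypothetical helper for illustration; not part of the original project.
object ShowScalaVersion {
  // Full version of the Scala library on the classpath, for example 2.12.8.
  def fullVersion: String = scala.util.Properties.versionNumberString

  // Binary version (major.minor), the suffix used in artifact names such as spark-core_2.12.
  def binaryVersion: String = fullVersion.split('.').take(2).mkString(".")

  def main(args: Array[String]): Unit = {
    println(s"Scala full version:   $fullVersion")
    println(s"Scala binary version: $binaryVersion")
  }
}
```

Inside spark-shell, evaluating sc.version additionally reports the Spark version of the running context.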
Edit the configuration file to fix slow sbt dependency downloads in IDEA
Open the repositories file with vi ~/.sbt/repositories and add:
    [repositories]
    local
    osc: http://maven.aliyun.com/nexus/content/groups/public
    typesafe: http://repo.typesafe.com/typesafe/ivy-releases/, [organization]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)[revision]/[type]s/[artifact](-[classifier]).[ext], bootOnly
    sonatype-oss-releases
    maven-central
    sonatype-oss-snapshots
Then find the sbt settings in IDEA, change them as shown in the figure, and restart IDEA.
Create the Scala project with sbt, choosing the correct versions
Edit build.sbt and build.properties to set the appropriate versions, and add the Spark dependency:
    // build.sbt
    name := "Name_of_APP"
    version := "0.1"
    scalaVersion := "2.12.8"
    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.2"

    # project/build.properties
    sbt.version = 1.2.4
The coordinates of the Spark dependency can be found on the Spark download page, or in the "Linking with Spark" section of http://spark.apache.org/docs/latest/rdd-programming-guide.html
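Choosing matching versions matters because of how sbt resolves the dependency line. The sketch below is a build.sbt fragment that uses the same versions as above; the name sparkVersion is introduced here for illustration only. It shows that the %% operator appends the Scala binary version to the artifact name, so the Scala version in the build has to be one that the chosen Spark release was published for.

```scala
// build.sbt (sketch; sparkVersion is a name introduced here for illustration)
val sparkVersion = "2.4.2"

scalaVersion := "2.12.8"

// %% appends the Scala binary version, so this resolves the artifact spark-core_2.12
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion

// Equivalent explicit form with a single %, shown for comparison:
// libraryDependencies += "org.apache.spark" % "spark-core_2.12" % sparkVersion
```

If dependency resolution fails, a likely cause is that no artifact exists for that combination of Scala binary version and Spark version.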
Code
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.log4j.{Level, Logger}

    object ScalaApp {
      def main(args: Array[String]): Unit = {
        // Suppress Spark startup messages and other verbose logging
        Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
        Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

        // Path to the data file
        val path = "/Users/shayue/Sample_Code/Machine-Learning-with-Spark/Chapter01/scala-spark-app/data/UserPurchaseHistory.csv"

        // Initialize the SparkContext
        val sc = new SparkContext("local[2]", "First Spark App")

        // Convert the raw CSV data into records of the form (user, product, price)
        val data = sc.textFile(path)
          .map(line => line.split(","))
          .map(purchaseRecord => (purchaseRecord(0), purchaseRecord(1), purchaseRecord(2)))

        // Total number of purchases
        val numPurchases = data.count()

        // Number of distinct users who made a purchase
        val uniqueUsers = data.map { case (user, product, price) => user }.distinct().count()

        // Total revenue, summed over all prices
        val totalRevenue = data.map { case (user, product, price) => price.toDouble }.sum()

        // Most popular product: count purchases per product, then sort in descending order
        val productsByPopularity = data
          .map { case (user, product, price) => (product, 1) }
          .reduceByKey(_ + _)
          .collect()
          .sortBy(-_._2)
        val mostPopular = productsByPopularity(0)

        // Print the results
        println("Total purchases: " + numPurchases)
        println("Unique users: " + uniqueUsers)
        println("Total revenue: " + totalRevenue)
        println("Most popular product: %s with %d purchases".format(mostPopular._1, mostPopular._2))

        // Release the resources held by the SparkContext
        sc.stop()
      }
    }
Output:
Total purchases: 5
Unique users: 4
Total revenue: 39.91
Most popular product: iPhone Cover with 2 purchases
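The program indexes the split fields directly, so a CSV line with a missing field or a non-numeric price would make the job fail. The following is a minimal sketch of a more defensive parsing step in plain Scala. The names Purchase and PurchaseParser are hypothetical and not part of the book's code, and the sketch assumes the three-column layout (user, product, price) described in the comments of the program.

```scala
import scala.util.Try

// Hypothetical record type for illustration; the original code uses a plain tuple.
final case class Purchase(user: String, product: String, price: Double)

// Hypothetical helper, not part of the book's code.
object PurchaseParser {
  // Returns None when the line does not have exactly three fields
  // or when the price field is not a valid number.
  def parse(line: String): Option[Purchase] =
    line.split(",", -1).map(_.trim) match {
      case Array(user, product, price) =>
        Try(price.toDouble).toOption.map(p => Purchase(user, product, p))
      case _ => None
    }
}
```

For a made-up line such as "John,iPhone Cover,9.99", parse returns Some(Purchase("John", "iPhone Cover", 9.99)), while a line with only two fields yields None. In the Spark job this could presumably replace the two map calls with a single flatMap over PurchaseParser.parse, at the cost of changing the later pattern matches from tuples to Purchase.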
References
- The first IDEA screenshot is from https://www.cnblogs.com/memento/p/9153012.html
- The code is from Machine Learning with Spark, Second Edition