Spark + sbt + IDEA + HelloWorld + MacOS

Steps to Build the Project

  1. First install Scala, sbt, and Spark, and note which versions you have.
    • The sbt version can be checked by running sbtVersion in the sbt shell.
    • Launching spark-shell prints the Spark version on the machine and the Scala version it was built with.
  2. Install the Scala plugin in IDEA.
    • pass
  3. Edit the sbt repositories file to work around slow dependency downloads in IDEA:

    vi ~/.sbt/repositories
    Add the following:
    [repositories]
    local
    osc: http://maven.aliyun.com/nexus/content/groups/public
    typesafe: http://repo.typesafe.com/typesafe/ivy-releases/, [organization]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)[revision]/[type]s/[artifact](-[classifier]).[ext], bootOnly
    sonatype-oss-releases
    maven-central
    sonatype-oss-snapshots
    (end of the file contents)
    Then open the sbt settings in IDEA, change them as shown in the figure, and restart:
  4. Create the Scala project through sbt in IDEA, making sure to pick the matching Scala and sbt versions.

  5. Edit build.sbt and project/build.properties, setting the appropriate versions and adding the Spark dependency:

    // build.sbt
    name := "Name_of_APP"
    
    version := "0.1"
    
    scalaVersion := "2.12.8"
    
    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.2"
    
    # project/build.properties
    sbt.version = 1.2.4
    

    The Spark dependency coordinates can be found via the Spark download page, or in the "Linking with Spark" section of http://spark.apache.org/docs/latest/rdd-programming-guide.html (see the note on %% right after this list).
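A note on the dependency line above: sbt's %% operator appends the Scala binary version to the artifact name, so with scalaVersion := "2.12.8" it resolves spark-core_2.12. A sketch of the equivalent explicitly-suffixed form (the same dependency, just written out by hand):

// equivalent to "org.apache.spark" %% "spark-core" % "2.4.2" when scalaVersion is 2.12.x
libraryDependencies += "org.apache.spark" % "spark-core_2.12" % "2.4.2"

Whichever form is used, the Scala binary version must match the one the chosen Spark release was built for; otherwise dependency resolution or runtime errors follow.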

Code

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.{Level,Logger}

object ScalaApp {
    def main(args: Array[String]) {
        // Suppress Spark startup and Jetty logging
        Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
        Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

        // Path to the data file
        val path = "/Users/shayue/Sample_Code/Machine-Learning-with-Spark/Chapter01/scala-spark-app/data/UserPurchaseHistory.csv"

        // Initialize the SparkContext with a local master (2 threads) and an app name
        val sc = new SparkContext("local[2]", "First Spark App")

        // Convert the raw CSV data into records of (user, product, price)
        val data = sc.textFile(path)
            .map(line => line.split(","))
            .map(purchaseRecord => (purchaseRecord(0), purchaseRecord(1), purchaseRecord(2)))

        // Total number of purchases
        val numPurchases = data.count()

        // Number of distinct users who made a purchase
        val uniqueUsers = data.map{ case (user, product, price) => user }.distinct().count()

        // Total revenue, summing the price column
        val totalRevenue = data.map{ case (user, product, price) => price.toDouble }.sum()

        // Most popular product by purchase count
        val productsByPopularity = data
            .map{ case (user, product, price) => (product, 1) }
            .reduceByKey(_ + _ ).collect()
            .sortBy(-_._2)
        val mostPopular = productsByPopularity(0)

        // Print the results
        println("Total purchases: " + numPurchases)
        println("Unique users: " + uniqueUsers)
        println("Total revenue: " + totalRevenue)
        println("Most popular product: %s with %d purchases" .format(mostPopular._1, mostPopular._2))
    }
}
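The program assumes UserPurchaseHistory.csv holds header-less, comma-separated (user, product, price) rows. The original data file is not reproduced in this post; the rows below are an illustrative sketch consistent with the output that follows (the user names and most of the product names are assumptions, not taken from the actual file):

John,iPhone Cover,9.99
John,Headphones,5.49
Jack,iPhone Cover,9.99
Jill,Samsung Galaxy Cover,8.95
Bob,iPad Cover,5.49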

Output:

Total purchases: 5
Unique users: 4
Total revenue: 39.91
Most popular product: iPhone Cover with 2 purchases
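
For reference, the output above can be reproduced by running the app from the project root with sbt; this is the standard sbt invocation rather than a step spelled out in the original post:

# from the project root: compiles the project and runs its single main class (ScalaApp)
sbt run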

References

  • The first IDEA screenshot comes from https://www.cnblogs.com/memento/p/9153012.html
  • The code comes from Machine Learning with Spark, Second Edition