Spark Study Notes (6): Spark SQL, Part 2

Table of Contents

4. Data read, write, and save modes

4.1 Reading data

4.2 Writing data

4.3 Save modes

5. Spark SQL data sources

5.1 JSON data source

5.2 Parquet data source

5.3 CSV data source

5.4 JDBC data source

5.5 Hive data source


4. Data read, write, and save modes

4.1 Reading data

Some common data sources are shown below. For parquet, the path is the directory the earlier parquet files were written to; all files under that directory are read.

student.json

{"name":"jack", "age":"22"}
{"name":"rose", "age":"21"}
{"name":"mike", "age":"19"}

product.csv

phone,5000,100
xiaomi,3000,300

val spark = SparkSession.builder()
  .master("local[*]")
  .appName(this.getClass.getSimpleName)
  .getOrCreate()

// Option 1:
val jsonSource: DataFrame = spark.read.json("E:\\student.json")
val csvSource: DataFrame = spark.read.csv("e://product.csv")
val parquetSource: DataFrame = spark.read.parquet("E:/parquetOutput/*")

// Option 2:
val jsonSource1: DataFrame = spark.read.format("json").load("E:\\student.json")
val csvSource1: DataFrame = spark.read.format("csv").load("e://product.csv")
val parquetSource1: DataFrame = spark.read.format("parquet").load("E:/parquetOutput/*")
// Option 3: defaults to parquet format (sqlContext.load is deprecated; spark.read.load is preferred)
val df: DataFrame = spark.sqlContext.load("E:/parquetOutput/*")
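
After loading, it is worth a quick sanity check of what Spark inferred. A minimal sketch using the jsonSource and csvSource values created above:

// Inspect the inferred schema and a few rows
jsonSource.printSchema()  // name and age both come back as string, since the JSON values are quoted
jsonSource.show()

csvSource.printSchema()   // columns default to _c0, _c1, _c2, all typed as string
csvSource.show()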

4.2 Writing data

// Option 1:
jsonSource.write.json("./jsonOutput")
jsonSource.write.parquet("./parquetOutput")
jsonSource.write.csv("./csvOut")
// Option 2:
jsonSource.write.format("json").save("./jsonOutput")
jsonSource.write.format("parquet").save("./parquetOutput")
jsonSource.write.format("csv").save("./csvOut")
// Option 3: defaults to parquet format
jsonSource.write.save("./parquetOutput")
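
Writer options can be attached before saving. The sketch below is an assumed variation rather than part of the original example; header and sep are standard Spark CSV writer options, and ./csvWithHeader is a made-up output path:

// Write CSV with a header row and a custom delimiter
jsonSource.write
  .option("header", "true")
  .option("sep", "|")
  .csv("./csvWithHeader")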

4.3 Save modes

result1.write.mode(SaveMode.Append).json("spark_day01/jsonOutput1")

Scala/Java                       | Any language      | Meaning
-------------------------------- | ----------------- | --------------------------------------------
SaveMode.ErrorIfExists (default) | "error" (default) | Throw an error if the output already exists
SaveMode.Append                  | "append"          | Append to the existing data
SaveMode.Overwrite               | "overwrite"       | Overwrite the existing data
SaveMode.Ignore                  | "ignore"          | Skip the write if the data already exists
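
The mode can be passed either as the SaveMode enum or as its string form; the two lines below are equivalent (a small sketch reusing the jsonSource DataFrame from section 4.1):

// Enum form
jsonSource.write.mode(SaveMode.Overwrite).parquet("./parquetOutput")
// Equivalent string form
jsonSource.write.mode("overwrite").parquet("./parquetOutput")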

5. Spark SQL data sources

5.1 JSON data source

As shown above: the data source used in the earlier examples is JSON.

5.2 Parquet data source

As shown above: parquet is Spark's default data source format, and it is a compressed columnar format.
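
By default Spark writes parquet with snappy compression; the codec can be switched through the compression write option. A minimal sketch (the ./parquetGzip path is made up for illustration):

// Write parquet with gzip instead of the default snappy codec
jsonSource.write
  .option("compression", "gzip")
  .parquet("./parquetGzip")

// Reading it back needs no extra options
spark.read.parquet("./parquetGzip").show()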

5.3 CSV data source

As shown above.

The default delimiter is the comma (,), so the files can be opened directly in Excel.

By default the generated schema uses the column names _c0, _c1, ..., and every column is typed as String.
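
To get meaningful column names and types instead of _c0/_c1 and String, either turn on type inference and rename the columns, or supply an explicit schema. A sketch for the product.csv file above; the column names product, price, and amount are assumptions for illustration:

import org.apache.spark.sql.types._

// Option A: infer the types, then name the columns (product.csv has no header line)
val csvInferred = spark.read
  .option("inferSchema", "true")
  .csv("e://product.csv")
  .toDF("product", "price", "amount")
csvInferred.printSchema()  // price and amount are now inferred as int

// Option B: supply an explicit schema up front
val productSchema = StructType(Seq(
  StructField("product", StringType),
  StructField("price", IntegerType),
  StructField("amount", IntegerType)
))
val csvTyped = spark.read.schema(productSchema).csv("e://product.csv")
csvTyped.printSchema()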

5.4 JDBC data source

To work with a MySQL database, first add the MySQL driver dependency:

<!-- MySQL driver dependency -->
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.35</version>
</dependency>

Add an application.conf file to the resources configuration directory:

db.driver="com.mysql.jdbc.Driver"
db.url="jdbc:mysql://localhost:3306/test?characterEncoding=utf-8"
db.user="root"
db.password="1234"

import java.util.Properties
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql._

object JDBCSource {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName(this.getClass.getSimpleName)
      .getOrCreate()

    import spark.implicits._
    // Default load order: application.conf, application.json, application.properties
    val config = ConfigFactory.load()

    // Read from the MySQL database ---> transform ---> write back to MySQL
    // Set up the database connection properties
    val url = config.getString("db.url")
    val conn = new Properties()
    conn.setProperty("user", config.getString("db.user"))
    conn.setProperty("password", config.getString("db.password"))

    // Read the data
    val jdbc: DataFrame = spark.read.jdbc(url, "emp", conn)
    jdbc.printSchema()
    
    val result1: Dataset[Row] = jdbc.where("sal > 2500").select("empno")
    // Write the data
    result1.write.mode(SaveMode.Append).jdbc(url, "emp10", conn)
    spark.close()
  }
}
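
The same read can also be written through the generic format("jdbc") API, passing the connection details as options. The sketch below would slot into the same main method, reusing the url and config values defined above:

// Alternative read via the generic jdbc format
val jdbcAlt: DataFrame = spark.read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "emp")
  .option("user", config.getString("db.user"))
  .option("password", config.getString("db.password"))
  .load()
jdbcAlt.where("sal > 2500").show()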

5.5 Hive data source

Preparation: add the following dependency:

<!-- Spark's Hive support jar must be included -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>

When developing in IDEA, add a hive-site.xml file under the resources directory; in a cluster environment, copy Hive's configuration file into the $SPARK_HOME/conf directory.

Contents of hive-site.xml; adjust the connection URL, database name, username, and password to match your environment:

<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop101:3306/metastore?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
        <description>password to use against metastore database</description>
    </property>
</configuration>

Test code:

import org.apache.spark.sql.{DataFrame, Dataset, SaveMode, SparkSession}

object HiveDemo {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName(this.getClass.getSimpleName)
      .enableHiveSupport() // enable Spark's Hive support
      .getOrCreate()

    import spark.implicits._

    // Impersonate the hadoop user before connecting to the cluster's databases
    System.setProperty("HADOOP_USER_NAME","root")

    /*val result1 = spark.sql("select * from db_hive.student")
    result1.show()*/

    // Create a table
    //spark.sql("create table student(name string, age string, sex string) row format delimited fields terminated by ','")

    // Drop the table
    //spark.sql("drop table student")

    // Insert data
    //val result = spark.sql("insert into student select * from db_hive.student")

    // Overwrite existing data
    //spark.sql("insert overwrite table student select * from db_hive.student")

    // Load new data into the table with overwrite
    //spark.sql("load data local inpath 'spark_day01/student.txt' overwrite into table default.student")

    // Truncate the table
    //spark.sql("truncate table student")

    // Build a custom dataset
    val students: Dataset[String] = spark.createDataset(List("jack,18,male","rose,19,female","mike,20,male"))
    val result: DataFrame = students.map(student => {
      val fields = student.split(",")
      (fields(0), fields(1), fields(2))
    }).toDF("name", "age", "sex")
    result.show()

    result.createTempView("v_student")

    // Insert the custom data into the table using SQL
    //spark.sql("insert into student select * from v_student")

    // Write the custom data into the Hive table using the DataFrame API
    result.write.mode(SaveMode.Append).insertInto("student")

    // Query the table
    spark.sql("select * from default.student").show()

    spark.close()
  }
}
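
Besides insertInto, a DataFrame can also be persisted as a managed Hive table with saveAsTable, which creates the table if it does not exist. A small sketch that would go inside the same main method; the table name student_copy is made up for illustration:

// Create (or overwrite) a managed Hive table from the DataFrame
result.write.mode(SaveMode.Overwrite).saveAsTable("student_copy")
spark.sql("select * from student_copy").show()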

 
