4. Data read, write, and SaveMode
4.1 Reading data
Some common data sources follow. For parquet, the path is the directory the earlier example wrote its parquet files to; Spark reads all files under that directory.
student.json
{"name":"jack", "age":"22"}
{"name":"rose", "age":"21"}
{"name":"mike", "age":"19"}
product.csv
phone,5000,100
xiaomi,3000,300
val spark = SparkSession.builder()
  .master("local[*]")
  .appName(this.getClass.getSimpleName)
  .getOrCreate()

// Approach 1: format-specific shortcut methods
val jsonSource: DataFrame = spark.read.json("E:\\student.json")
val csvSource: DataFrame = spark.read.csv("e://product.csv")
val parquetSource: DataFrame = spark.read.parquet("E:/parquetOutput/*")

// Approach 2: generic format(...).load(...)
val jsonSource1: DataFrame = spark.read.format("json").load("E:\\student.json")
val csvSource1: DataFrame = spark.read.format("csv").load("e://product.csv")
val parquetSource1: DataFrame = spark.read.format("parquet").load("E:/parquetOutput/*")

// Approach 3: the default format is parquet
// (sqlContext.load is deprecated in newer Spark versions; prefer spark.read.load)
val df: DataFrame = spark.sqlContext.load("E:/parquetOutput/*")
4.2 Writing data
// Approach 1: format-specific shortcut methods
jsonSource.write.json("./jsonOutput")
jsonSource.write.parquet("./parquetOutput")
jsonSource.write.csv("./csvOut")

// Approach 2: generic format(...).save(...)
jsonSource.write.format("json").save("./jsonOutput")
jsonSource.write.format("parquet").save("./parquetOutput")
jsonSource.write.format("csv").save("./csvOut")

// Approach 3: the default format is parquet
jsonSource.write.save("./parquetOutput")
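Each of the calls above produces a directory containing one part file per partition, not a single file. Where one output file is wanted for a small result, the data can first be coalesced to a single partition; a minimal sketch reusing jsonSource from the read examples (the ./jsonSingleOut path is made up for illustration):

// Collapse to one partition so the output directory holds a single
// part file; this funnels all data through one task, so it only
// suits small results
jsonSource.coalesce(1).write.mode("overwrite").json("./jsonSingleOut")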
4.3 Save modes
The save mode controls what happens when the output target already exists, for example:

result1.write.mode(SaveMode.Append).json("spark_day01/jsonOutput1")
Scala/Java | Any Language | Meaning
SaveMode.ErrorIfExists (default) | "error" (default) | Throw an error if the output already exists
SaveMode.Append | "append" | Append to the existing data
SaveMode.Overwrite | "overwrite" | Overwrite the existing data
SaveMode.Ignore | "ignore" | Skip the write if data already exists
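The string form in the middle column can be passed straight to mode(); the following is equivalent to the SaveMode.Append example above:

result1.write.mode("append").json("spark_day01/jsonOutput1")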
5. Spark SQL data sources
5.1 JSON data source
As shown above; the earlier examples all used JSON as the data source.
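Each line of the file must be a complete JSON object; Spark scans the data to infer the schema. A quick check against the student.json sample above, reusing the spark session from the read examples (expected output shown as comments; age stays a string because the values are quoted in the file):

val jsonDF: DataFrame = spark.read.json("E:\\student.json")
jsonDF.printSchema()
// root
//  |-- age: string (nullable = true)
//  |-- name: string (nullable = true)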
5.2 Parquet data source
As shown above. Parquet is Spark's default data source format; it is a compressed, columnar storage format.
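In Spark 2.x the parquet writer compresses with snappy by default, and the codec can be changed per write through the standard compression option. A minimal sketch (parquetSource is the DataFrame read earlier; the ./parquetGzipOut path is made up):

// Write parquet with an explicit codec; gzip trades write speed
// for a smaller file than the default snappy
parquetSource.write.option("compression", "gzip").parquet("./parquetGzipOut")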
5.3 CSV data source
As shown above.
The default delimiter is the comma, so the files can be opened directly in Excel.
The generated schema defaults to column names _c0, _c1, ..., and every column defaults to type String.
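Both defaults can be overridden: inferSchema asks Spark to sample the data and guess the column types, while an explicit schema fixes both names and types. A sketch against product.csv above, assuming the spark session from the earlier examples (the column names name, price, amount are assumed for illustration):

import org.apache.spark.sql.types._

// Option 1: let Spark sample the file and infer numeric columns
val inferred: DataFrame = spark.read
  .option("inferSchema", "true")
  .csv("e://product.csv")

// Option 2: supply the schema explicitly, fixing names and types
val schema = StructType(List(
  StructField("name", StringType),
  StructField("price", IntegerType),
  StructField("amount", IntegerType)
))
val typed: DataFrame = spark.read.schema(schema).csv("e://product.csv")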
5.4 JDBC data source
To work with a MySQL database, first add the MySQL driver dependency:
<!-- MySQL JDBC driver dependency -->
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.35</version>
</dependency>
Add an application.conf file under the configuration (resources) directory:
db.driver="com.mysql.jdbc.Driver"
db.url="jdbc:mysql://localhost:3306/test?characterEncoding=utf-8"
db.user="root"
db.password="1234"
import java.util.Properties

import com.typesafe.config.ConfigFactory
import org.apache.spark.sql._

object JDBCSource {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName(this.getClass.getSimpleName)
      .getOrCreate()
    import spark.implicits._

    // Default load order: application.conf, application.json, application.properties
    val config = ConfigFactory.load()

    // Read from MySQL ---> transform ---> write back to MySQL
    // Set up the connection information
    val url = config.getString("db.url")
    val conn = new Properties()
    conn.setProperty("user", config.getString("db.user"))
    conn.setProperty("password", config.getString("db.password"))

    // Read data
    val jdbc: DataFrame = spark.read.jdbc(url, "emp", conn)
    jdbc.printSchema()
    val result1: Dataset[Row] = jdbc.where("sal > 2500").select("empno")

    // Write data
    result1.write.mode(SaveMode.Append).jdbc(url, "emp10", conn)
    spark.close()
  }
}
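The read above goes through a single JDBC connection. For larger tables, spark.read.jdbc has an overload that partitions the read over a numeric column; a sketch assuming emp has a numeric empno column roughly spanning 1000 to 9999:

// Parallel JDBC read: 4 partitions, each running its own
// range query over empno between the bounds
val jdbcParallel: DataFrame = spark.read.jdbc(
  url,
  "emp",
  "empno", // numeric partition column
  1000L,   // lower bound used for partition ranges
  9999L,   // upper bound used for partition ranges
  4,       // number of partitions
  conn
)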
5.5 Hive data source
Preparation: add the following dependency.
<!-- spark-hive is required for Spark's Hive support -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
When developing in IDEA, put a hive-site.xml file under the resources directory; in a cluster environment, copy Hive's configuration file into $SPARK_HOME/conf.
Contents of hive-site.xml; adjust the metastore connection URL, database name, username, and password to match your environment:
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop101:3306/metastore?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
        <description>password to use against metastore database</description>
    </property>
</configuration>
Test code:
import org.apache.spark.sql.{DataFrame, Dataset, SaveMode, SparkSession}

object HiveDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName(this.getClass.getSimpleName)
      .enableHiveSupport() // enable Spark's Hive support
      .getOrCreate()
    import spark.implicits._

    // To connect to the cluster's metastore, impersonate the root user first
    System.setProperty("HADOOP_USER_NAME", "root")

    /*val result1 = spark.sql("select * from db_hive.student")
    result1.show()*/

    // Create a table
    //spark.sql("create table student(name string, age string, sex string) row format delimited fields terminated by ','")
    // Drop a table
    //spark.sql("drop table student")
    // Insert data
    //val result = spark.sql("insert into student select * from db_hive.student")
    // Overwrite existing data
    //spark.sql("insert overwrite table student select * from db_hive.student")
    // Overwrite by loading new data
    //spark.sql("load data local inpath 'spark_day01/student.txt' overwrite into table default.student")
    // Truncate the table
    //spark.sql("truncate table student")

    // Write custom data
    val students: Dataset[String] = spark.createDataset(List("jack,18,male", "rose,19,female", "mike,20,male"))
    val result: DataFrame = students.map(student => {
      val fields = student.split(",")
      (fields(0), fields(1), fields(2))
    }).toDF("name", "age", "sex")
    result.show()
    result.createTempView("v_student")

    // Insert the custom data into the table via SQL
    //spark.sql("insert into student select * from v_student")
    // Or write the custom data directly with the DataFrame API
    result.write.mode(SaveMode.Append).insertInto("student")

    // Query
    spark.sql("select * from default.student").show()
    spark.close()
  }
}
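insertInto writes into an existing table by column position. When the table should instead be created from the DataFrame's own schema, saveAsTable is the usual alternative; a minimal sketch (student_copy is a hypothetical table name):

// Create (or overwrite) a managed table whose schema is derived
// from the DataFrame itself, resolving columns by name
result.write.mode(SaveMode.Overwrite).saveAsTable("default.student_copy")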