Spark Schema Analysis--doublehappy

schema

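1. Once you have loaded data with Spark, you can make use of its schema.

What can it be used for?
The simplest things:
1. Automatically creating Hive tables, partitioned or plain
2. Detecting field changes in the data source and then altering the Hive table to match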

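This is really useful:
	1. If an upstream data source changes its fields without telling you, the data can no longer be loaded into the warehouse, or
	the blame simply gets dumped on you and you have no way to explain it.
	With one generic piece of code that handles this, the users never even notice that a field changed. How nice is that?

	2. It is more convenient than doing it with Hive scripts.
	   When you load data with Hive and the upstream changes, fixing it makes you want to curse out the upstream data team.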
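
To make the field-change idea concrete, here is a minimal sketch of how it could look. It is not the code from my warehouse: the object name SchemaDiff and both helper functions are made up for illustration, and it only handles newly added columns (renamed or retyped columns need extra care). It assumes spark-sql is on the classpath.

```scala
import org.apache.spark.sql.types._

object SchemaDiff {
  // Fields present in `source` but missing from `target`.
  // Names are compared in lower case because Hive column names are case-insensitive.
  def addedFields(source: StructType, target: StructType): Seq[StructField] = {
    val existing = target.fieldNames.map(_.toLowerCase).toSet
    source.fields.filterNot(f => existing.contains(f.name.toLowerCase)).toSeq
  }

  // Build an ALTER TABLE statement for the new columns, or None if nothing was added.
  def addColumnsSql(table: String, added: Seq[StructField]): Option[String] =
    if (added.isEmpty) None
    else {
      val cols = added.map(f => s"`${f.name}` ${f.dataType.catalogString}").mkString(", ")
      Some(s"ALTER TABLE $table ADD COLUMNS ($cols)")
    }
}
```

In a job you would pass the schema of the freshly read source as `source` and `spark.table(...).schema` of the Hive table as `target`, then run the returned statement with spark.sql() before writing.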

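We are all just out here to earn a living, so there is really no need for scheming against each other. As for me, I will honestly do my work and learn the technology while I am still young,
earn a little money, and then go home to farm. Office politics is too deep for a simple village guy like me.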

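Schema:
	is of type StructType

StructType:
	is StructType(fields: Array[StructField]), and it can also be built from a Seq[StructField]
	It contains StructField
	You can think of it like this:
		StructType is a table
		StructField is a column in that table

StructField:
/**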
 * A field inside a StructType.
 * @param name The name of this field.
 * @param dataType The data type of this field.
 * @param nullable Indicates if values of this field can be `null` values.
 * @param metadata The metadata of this field. The metadata should be preserved during
 *                 transformation if the content of the column is not modified, e.g, in selection.
 *
 * @since 1.3.0
 */
@InterfaceStability.Stable
case class StructField(
    name: String,
    dataType: DataType,
    nullable: Boolean = true,
    metadata: Metadata = Metadata.empty) 

StructField:
	name: String,
	dataType: DataType,
	nullable: Boolean = true,
	metadata: Metadata

That is all there is to it. Once you have these, you can do a lot more!
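
A small sketch of what "getting these things" looks like in code. The schema below is built by hand as a stand-in for one read from a real table, and the object name SchemaWalk is just an illustrative name. The point is that a StructType behaves like a Seq[StructField], so the usual collection methods work on it.

```scala
import org.apache.spark.sql.types._

object SchemaWalk {
  // A hand-built schema standing in for one read from a real table.
  val schema: StructType = StructType(Seq(
    StructField("Id", LongType, nullable = true),
    StructField("Type", StringType, nullable = true),
    StructField("TotalPrice", DecimalType(24, 6), nullable = true)
  ))

  // StructType is a Seq[StructField], so map/filter/foreach are available directly.
  val nameAndType: Seq[(String, String)] =
    schema.map(f => (f.name, f.dataType.typeName))

  def main(args: Array[String]): Unit =
    nameAndType.foreach { case (name, tpe) => println(s"$name -> $tpe") }
}
```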

Test:

package com.offline.task

import com.common.util.ContextUtil
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object Test {

  def main(args: Array[String]): Unit = {

    System.setProperty("HADOOP_USER_NAME", "hive")
    val spark: SparkSession = ContextUtil.getSparkSession(this.getClass.getSimpleName)

    val dbase = spark.sparkContext.getConf.get("spark.source.hive.dbabase","doublehappy_erp")
    val table = spark.sparkContext.getConf.get("spark.source.hive.table","doublehappy_depothead")

    // Only the table's schema is needed here, not its rows.
    val schema: StructType = spark.table(s"${dbase}.ods_${table}").schema

    println("schema : " + schema)

    spark.stop()
  }

}

Result:
schema :
StructType(StructField(Id,LongType,true),
StructField(Type,StringType,true),
StructField(SubType,StringType,true),
StructField(ProjectId,LongType,true),
StructField(DefaultNumber,StringType,true),
StructField(Number,StringType,true),
StructField(OperPersonName,StringType,true),
StructField(CreateTime,TimestampType,true),
StructField(OperTime,TimestampType,true),
StructField(OrganId,LongType,true),
StructField(HandsPersonId,LongType,true),
StructField(AccountId,LongType,true),
StructField(ChangeAmount,DecimalType(24,6),true),
StructField(AllocationProjectId,LongType,true),
StructField(TotalPrice,DecimalType(24,6),true),
StructField(PayType,StringType,true),
StructField(Remark,StringType,true),
StructField(Salesman,StringType,true),
StructField(AccountIdList,StringType,true),
StructField(AccountMoneyList,StringType,true),
StructField(Discount,DecimalType(24,6),true),
StructField(DiscountMoney,DecimalType(24,6),true),
StructField(DiscountLastMoney,DecimalType(24,6),true),
StructField(OtherMoney,DecimalType(24,6),true),
StructField(OtherMoneyList,StringType,true),
StructField(OtherMoneyItem,StringType,true),
StructField(AccountDay,IntegerType,true),
StructField(Status,StringType,true),
StructField(LinkNumber,StringType,true),
StructField(tenant_id,LongType,true),
StructField(delete_Flag,StringType,true),
StructField(UpdateTime,TimestampType,true))


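From this you can get, for your data:
	1. the field name: name
	2. the field type: DataType

These two are enough.

Note:
DecimalType(24,6)
What if it is not (24,6) but some other precision and scale? How would you create the column then?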
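
One possible answer to the DecimalType question, as a sketch: do not match on (24,6) at all, and let each DataType render itself through catalogString, which carries whatever precision and scale the field has. HiveDdl, columnDef and createTableSql are names I made up for this example, and STORED AS PARQUET is an assumption about the table format.

```scala
import org.apache.spark.sql.types._

object HiveDdl {
  // Render one field as "`name` type". catalogString includes the precision and
  // scale of any DecimalType, so nothing about (24,6) is hard-coded.
  def columnDef(f: StructField): String =
    s"`${f.name}` ${f.dataType.catalogString}"

  // Build a CREATE TABLE statement; columns listed in `partitionCols`
  // are moved into the PARTITIONED BY clause.
  def createTableSql(
      table: String,
      schema: StructType,
      partitionCols: Seq[String] = Nil): String = {
    val (parts, cols) = schema.fields.partition(f => partitionCols.contains(f.name))
    val partitionClause =
      if (parts.isEmpty) ""
      else parts.map(columnDef).mkString(" PARTITIONED BY (", ", ", ")")
    s"CREATE TABLE IF NOT EXISTS $table (${cols.map(columnDef).mkString(", ")})" +
      partitionClause + " STORED AS PARQUET"
  }
}
```

The generated string can be handed to spark.sql() on a Hive-enabled session. This only covers the two things taken from the schema above (name and type); comments, nullability and table properties are left out.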

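Writing data into a Hive partitioned table

First of all, there are really a lot of ways to write into a Hive partitioned table.

But let's pick the best one:
	1. spark.sql() with an insert overwrite statement inside. The simplest, right? (But then the code is hard-coded, and switching to another business table is a lot of work.)
	2. The API way:
		1. Writing under the partitioned table's path:
		       1. Write to the partition path + MSCK REPAIR TABLE: append
		       2. Use ADD PARTITION: append
		2. insertInto: overwrite

	Personally I use either the API way or the SQL way.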
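
For the insertInto variant, a rough sketch could look like the following. InsertIntoWriter and its two functions are illustrative names, not an existing API. Two assumptions to be aware of: insertInto matches columns by position, so the DataFrame is reordered to the table's column order first, and the partitionOverwriteMode setting (Spark 2.3+) applies to tables Spark treats as data source tables; Hive SerDe tables may additionally need hive.exec.dynamic.partition.mode=nonstrict.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object InsertIntoWriter {
  // insertInto resolves columns by position, not by name, so reorder the
  // DataFrame to the target table's column order (partition columns come last).
  def alignTo(df: DataFrame, targetColumns: Seq[String]): DataFrame =
    df.select(targetColumns.head, targetColumns.tail: _*)

  def overwriteInto(df: DataFrame, table: String): Unit = {
    val spark = df.sparkSession
    // With "dynamic", only the partitions present in df are replaced.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    val target = spark.table(table).columns.toSeq
    alignTo(df, target).write.mode(SaveMode.Overwrite).insertInto(table)
  }
}
```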

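However:
	the API way has a pitfall:
		1. How do you guarantee idempotency of the data?
			1. insertInto has no problem
			2. Writing to the partition path is done in append mode, while the SQL way is overwrite, right?
The problem is right here:
	when writing to the partition path, how do you guarantee idempotency? Re-running the job appends the same data again.
	Solution:
		For both path-based variants you can:
			delete the previous path first, then write the data,
			and then:
				run msck or add partition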

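Note:
	With drop table, under normal circumstances:
		1. the data in the partitions is deleted as well, right?
		But with the add partition approach, when you drop the table the path and the data in it are not deleted!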

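The problem with the API write-to-path approach:
   1. Idempotency:
	    The usual idea is to drop the partition first and then backfill the data, right? (That is how most people understand it.)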

But:
	alter table ods_ruozedata_depothead_update drop partition (UpdateDay='2020-03-20');

Does this operation really delete the data on HDFS?

Test:

First:
alter table ods_ruozedata_depothead_update drop partition (UpdateDay='2020-03-20');

Check the HDFS path:
[hive@hadoop003 ~]$ hadoop fs -ls  /user/hive/warehouse/ruozedata_erp.db/ods_ruozedata_depothead_update/UpdateDay=2020-03-20
Found 1 items
-rwxrwx--x+  1 hive hive       7895 /user/hive/warehouse/ruozedata_erp.db/ods_ruozedata_depothead_update/UpdateDay=2020-03-20/part-00000-33d18a8c-0e94-417e-be2a-c00581721545.c000.snappy.parquet
[hive@hadoop003 ~]$ 


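So:
	do not just believe whatever other people tell you.
	After dropping the partition this way the data can indeed no longer be queried, but the files are still on HDFS.
	Whether the files survive depends on the table type (Hive keeps them for EXTERNAL tables), so test it against your own table.

To restore the data here, the only option is:
	ALTER TABLE ods_ruozedata_depothead_update ADD PARTITION (UpdateDay='2020-03-20')
LOCATION '/user/hive/warehouse/ruozedata_erp.db/ods_ruozedata_depothead_update/UpdateDay=2020-03-20';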

There is another interesting thing here:
	1. After using add partition,
	 drop table cannot delete the data on HDFS.
Normally drop table deletes the data as well (managed table).

Here it is not deleted.


So, to keep writes idempotent when writing to Hive by path:
	Approach with add partition:
		1. dropPartition
		2. deleteOldPartitionDir
		3. write the data
		4. addPartition
   Approach with msck:
      1. deleteOldPartitionDir
      2. write the data
      3. msck
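
The four add partition steps above could be sketched roughly like this. It is a sketch under assumptions, not my production code: PartitionPathWriter and its function names are invented for the example, the table is assumed to be stored as Parquet with a single string partition column, and the partition column is dropped from the data files because Hive takes its value from the path.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SaveMode}

object PartitionPathWriter {
  def partitionDir(tableLocation: String, col: String, value: String): String =
    s"${tableLocation.stripSuffix("/")}/$col=$value"

  def dropPartitionSql(table: String, col: String, value: String): String =
    s"ALTER TABLE $table DROP IF EXISTS PARTITION ($col='$value')"

  def addPartitionSql(table: String, col: String, value: String, dir: String): String =
    s"ALTER TABLE $table ADD IF NOT EXISTS PARTITION ($col='$value') LOCATION '$dir'"

  // Runs the four steps in order; re-running it for the same partition value
  // leaves exactly one copy of the data.
  def rewritePartition(
      df: DataFrame,
      table: String,
      tableLocation: String,
      col: String,
      value: String): Unit = {
    val spark = df.sparkSession
    val dir = partitionDir(tableLocation, col, value)

    // 1. dropPartition: remove the partition from the metastore
    spark.sql(dropPartitionSql(table, col, value))

    // 2. deleteOldPartitionDir: remove the old files, since step 1 may leave them behind
    val path = new Path(dir)
    val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
    if (fs.exists(path)) fs.delete(path, true)

    // 3. write the data into the now-empty partition directory
    df.drop(col).write.mode(SaveMode.Append).parquet(dir)

    // 4. addPartition: register the directory with the metastore again
    spark.sql(addPartitionSql(table, col, value, dir))
  }
}
```

For the msck variant, steps 1 and 4 would go away and a `MSCK REPAIR TABLE` statement would run after the write instead.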


I would choose the add partition approach, because with drop table the data is not deleted.

All it takes is:
	recreating the table + add partition, and the data is restored

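I was bored today:
	so I rewrote the whole data warehouse code from scratch and added the things that had been missing since the early days!
So when learning big data, the way of thinking and the architecture design really matter, otherwise you waste a lot of people's time and money.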
