I. Feature Engineering
- Inspect the data distribution
df.describe().show()
df.summary().show() // summary() reports more statistics than describe()
1 Removing single-value columns
Drop every column that contains only a single distinct value.
def uniqueValueRemove(df: DataFrame): DataFrame = {
  // Count the distinct values of every column in a single pass
  val dfTmp = df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)
  val mapDict = dfTmp.columns.zip(dfTmp.first.toSeq).toMap
  // Keep only the columns with more than one distinct value
  val keepCols = df.columns.filter(c => mapDict(c) != 1L)
  df.select(keepCols.map(col): _*)
}

// Variant: one distinct-count job per column (simpler to read,
// but slower than the single-pass version on wide tables)
def removeSingleValue(df: DataFrame): DataFrame = {
  val keepCols = df.columns.filter(c => df.select(c).distinct.count != 1)
  df.select(keepCols.map(col): _*)
}
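The column-selection logic above boils down to pairing each column name with its distinct count and keeping the names whose count is not 1. A minimal plain-Scala sketch of that step (no Spark needed; the `distinctCounts` argument stands in for the single row that `countDistinct` returns):

```scala
// Mirrors the filtering step: column name -> distinct count, keep count != 1.
def keepColumns(names: Seq[String], distinctCounts: Seq[Long]): Seq[String] = {
  val mapDict = names.zip(distinctCounts).toMap // column name -> distinct count
  names.filter(c => mapDict(c) != 1L)
}

// "constant" has only one distinct value, so it is dropped
println(keepColumns(Seq("id", "constant", "age"), Seq(3L, 1L, 2L)))
// → List(id, age)
```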
2 Removing duplicate rows
1 Drop rows duplicated in the specified columns
val newDF = df.dropDuplicates(Seq("salary"))
2 Drop fully duplicated rows (if two rows agree in every field, only one is kept)
val newDF = df.dropDuplicates()
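`dropDuplicates(Seq("salary"))` keeps one row per distinct value of the listed columns (Spark does not guarantee which row survives). On local collections, Scala 2.13's `distinctBy` behaves analogously and makes the semantics easy to check:

```scala
// Rows as (name, salary) tuples; hypothetical data for illustration.
val rows = Seq(("alice", 100), ("bob", 100), ("carol", 200))

// Analogue of df.dropDuplicates(Seq("salary")): one row per salary value
println(rows.distinctBy(_._2)) // → List((alice,100), (carol,200))

// Analogue of df.dropDuplicates(): fully identical rows collapse to one
println((rows :+ ("alice", 100)).distinct.size) // → 3
```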
3 Filtering out abnormal values
3.1 filter
Note that filter("age >= 5") keeps the records with age >= 5; do not read it the other way round.
df.filter("id = 1 or c1 = 'b'").show()
A column name held in a variable must be spliced into the condition with the s interpolator:
val id_col = "id"
df.filter(s"$id_col = 2")
3.2 where
df.where("age > 10").describe().show(10)
3.3 Special handling of selected fields
df.selectExpr("id", "c3 as time", "round(c4) as c4").show(false)
4 Filling missing values
Filling missing values in continuous columns.
1 Categorical columns default to "other"; continuous columns are filled with the mean
def fillDataframeNumberMeanNa(df: DataFrame, num_df: DataFrame, obj_col: Array[String]): DataFrame = {
  println("Filling missing values: categorical -> other, continuous -> mean")
  var df_temp = df.na.fill(value = "other", cols = obj_col)
  df_temp = df_temp.na.fill(num_df.columns.zip(
    num_df.select(num_df.columns.map(mean(_)): _*).first.toSeq
  ).toMap)
  df_temp
}
2 Categorical columns default to "other"; continuous columns are filled with the max
def fillDataframeNumberMaxNa(df: DataFrame, num_df: DataFrame, obj_col: Array[String]): DataFrame = {
  println("Filling missing values: categorical -> other, continuous -> max")
  var df_temp = df.na.fill(value = "other", cols = obj_col)
  df_temp = df_temp.na.fill(num_df.columns.zip(
    num_df.select(num_df.columns.map(max(_)): _*).first.toSeq
  ).toMap)
  df_temp
}
3 Other fill strategies follow the same pattern.
A fixed value per column can also be given:
val map_dict = Map("poi_type1" -> "other", "poi_type2" -> "other")
var TrainFillData = TrainData.na.fill(map_dict)
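Both helpers build their fill map the same way: zip the numeric column names with the single row of aggregates, then hand the resulting `Map` to `na.fill`. The map construction itself is plain Scala (column names and values below are hypothetical):

```scala
// Pairs column names with the aggregate row, as in
// num_df.columns.zip(num_df.select(cols.map(mean(_)): _*).first.toSeq).toMap
val numCols  = Seq("age", "weight") // hypothetical numeric columns
val meansRow = Seq(41.2, 73.5)      // what .first.toSeq would return
val fillMap  = numCols.zip(meansRow).toMap

println(fillMap("age")) // → 41.2
```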
5 Correlation analysis
5.1 pearson
Correlation of each feature column with the label (label and feature columns are both continuous):
// threshold is the cut-off on the absolute correlation
val featCols = dfTrain.columns
  .filter(c => math.abs(dfTrain.stat.corr(c, label_col, "pearson")) > threshold)
  .filter(_ != "cardio") // drop the label column itself
Note what the official documentation stresses:
Calculates the correlation of two columns of a DataFrame. Currently only supports the Pearson Correlation Coefficient. For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.
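For reference, `stat.corr(..., "pearson")` returns the usual Pearson coefficient cov(x, y) / (σx · σy). A self-contained Scala version of the same formula (not Spark's implementation, just a sanity check for choosing thresholds):

```scala
// Pearson correlation of two equally sized numeric sequences.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  val n   = xs.size
  val mx  = xs.sum / n
  val my  = ys.sum / n
  val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
  val sx  = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
  val sy  = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
  cov / (sx * sy)
}

// Perfectly linear columns correlate to 1.0
println(pearson(Seq(1.0, 2.0, 3.0), Seq(2.0, 4.0, 6.0)))
```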
6 Standardization
6.1 onehot
First run the categorical variables through StringIndexer, turning string columns into numeric index columns:
val onehot_col = Array(
  "gender",
  "cholesterol",
  "gluc",
  "smoke",
  "alco"
)
val onehotColToInt = onehot_col.map(c => c + "ToInt")
val standardIndex = onehot_col.map { line =>
  new StringIndexer().setInputCol(line).setOutputCol(line + "ToInt")
}
val vectorAssembler = new VectorAssembler()
  .setInputCols(onehotColToInt ++ Array("col1", "col2"))
  .setOutputCol("features")
val pipelineFinal = new Pipeline()
  .setStages(standardIndex ++ Array(vectorAssembler))
val modelFinal = pipelineFinal.fit(scaledfTrain)
val scaledfTrain1 = modelFinal.transform(scaledfTrain)
val scaleDfTest1 = modelFinal.transform(scaleDfTest)
Then apply OneHotEncoder() for the actual one-hot step, one encoder per indexed column:
val encoders = standardIndex.map { indexer =>
  new OneHotEncoder()
    .setInputCol(indexer.getOutputCol)
    .setOutputCol(indexer.getOutputCol + "Vec")
}
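What the StringIndexer → OneHotEncoder pair produces can be sketched without Spark: the indexer maps each category to an index ordered by descending frequency, and the encoder turns an index into a 0/1 vector. Two simplifications in the sketch below: Spark's OneHotEncoder drops the last category by default, while this version keeps all of them, and ties in frequency are broken alphabetically here purely for determinism:

```scala
// StringIndexer sketch: categories indexed by descending frequency
// (ties broken alphabetically here for determinism).
def stringIndex(values: Seq[String]): Map[String, Int] =
  values.groupBy(identity).toSeq
    .sortBy { case (v, occ) => (-occ.size, v) }
    .map(_._1)
    .zipWithIndex
    .toMap

// OneHotEncoder sketch: index -> 0/1 vector (all categories kept).
def oneHot(idx: Int, numCategories: Int): Seq[Int] =
  Seq.tabulate(numCategories)(i => if (i == idx) 1 else 0)

val idx = stringIndex(Seq("smoke", "no_smoke", "no_smoke"))
println(oneHot(idx("smoke"), 2)) // → List(0, 1)
```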
6.2 zscore
Note ⚠️: StandardScaler() must be fed a vector column
(org.apache.spark.ml.linalg.Vector), so before standardizing, the input columns first have to be assembled into a vector with new VectorAssembler():
val scale_col = Array(
  "age",
  "height",
  "weight",
  "ap_hi",
  "ap_lo"
)
val vectorScale = new VectorAssembler()
  .setInputCols(scale_col)
  .setOutputCol("feaScale")
val scale = new StandardScaler().setInputCol("feaScale").setOutputCol("sfea")
val pipeline = new Pipeline().setStages(Array(vectorScale, scale))
val model = pipeline.fit(dfTrain)
val scaledfTrain = model.transform(dfTrain)
val scaleDfTest = model.transform(dfValid)
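One detail worth knowing: ml's StandardScaler defaults to withStd=true, withMean=false, so out of the box each feature is divided by its sample standard deviation (n − 1 in the denominator) but is not mean-centered. The arithmetic, in plain Scala:

```scala
// StandardScaler default behaviour: divide by the corrected sample
// standard deviation; no centering unless withMean is set to true.
def scaleStd(xs: Seq[Double]): Seq[Double] = {
  val n  = xs.size
  val m  = xs.sum / n
  val sd = math.sqrt(xs.map(x => (x - m) * (x - m)).sum / (n - 1))
  xs.map(_ / sd)
}

// The sample std of (1, 2, 3) is exactly 1.0, so this column is unchanged
println(scaleStd(Seq(1.0, 2.0, 3.0))) // → List(1.0, 2.0, 3.0)
```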
6.3 minmax scaling
Applying minmax to the one-hot encoded vectors is still plain minmax:
val scale = new MinMaxScaler().setInputCol("feaScale").setOutputCol("sfea")
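MinMaxScaler rescales each feature into [0, 1] (the range is configurable via setMin/setMax) using (x − min) / (max − min). The default formula in plain Scala:

```scala
// MinMaxScaler's per-feature formula with the default [0, 1] range.
// Assumes the feature is not constant (max > min).
def minMaxScale(xs: Seq[Double]): Seq[Double] = {
  val lo = xs.min
  val hi = xs.max
  xs.map(x => (x - lo) / (hi - lo))
}

println(minMaxScale(Seq(0.0, 5.0, 10.0))) // → List(0.0, 0.5, 1.0)
```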