I. Feature Engineering
- Inspect the data distribution
df.describe().show()
df.summary().show() // summary() reports more statistics than describe()
1 Removing single-value columns
Drop every column that contains only a single distinct value.
def uniqueValueRemove(df: DataFrame): DataFrame = {
  // Count the distinct values of every column in a single pass
  val dfTmp = df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)
  val mapDict = dfTmp.columns.zip(dfTmp.first.toSeq).toMap
  // Keep only the columns with more than one distinct value
  val keepCols = df.columns.filter(c => mapDict(c) != 1L)
  df.select(keepCols.map(col): _*)
}

// Variant: one distinct-count job per column (simpler to read,
// but slower than the single-pass version on wide tables)
def removeSingleValue(df: DataFrame): DataFrame = {
  val keepCols = df.columns.filter(c => df.select(c).distinct.count != 1)
  df.select(keepCols.map(col): _*)
}
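The column-selection logic above boils down to pairing each column name with its distinct count and keeping the names whose count is not 1. A minimal plain-Scala sketch of that step (no Spark needed; the `distinctCounts` argument stands in for the single row that `countDistinct` returns):

```scala
// Mirrors the filtering step: column name -> distinct count, keep count != 1.
def keepColumns(names: Seq[String], distinctCounts: Seq[Long]): Seq[String] = {
  val mapDict = names.zip(distinctCounts).toMap // column name -> distinct count
  names.filter(c => mapDict(c) != 1L)
}

// "constant" has only one distinct value, so it is dropped
println(keepColumns(Seq("id", "constant", "age"), Seq(3L, 1L, 2L)))
// → List(id, age)
```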
2 Removing duplicate rows
1 Drop rows duplicated in the specified columns
val newDF = df.dropDuplicates(Seq("salary"))
2 Drop fully duplicated rows (if two rows agree in every field, only one is kept)
val newDF = df.dropDuplicates()
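`dropDuplicates(Seq("salary"))` keeps one row per distinct value of the listed columns (Spark does not guarantee which row survives). On local collections, Scala 2.13's `distinctBy` behaves analogously and makes the semantics easy to check:

```scala
// Rows as (name, salary) tuples; hypothetical data for illustration.
val rows = Seq(("alice", 100), ("bob", 100), ("carol", 200))

// Analogue of df.dropDuplicates(Seq("salary")): one row per salary value
println(rows.distinctBy(_._2)) // → List((alice,100), (carol,200))

// Analogue of df.dropDuplicates(): fully identical rows collapse to one
println((rows :+ ("alice", 100)).distinct.size) // → 3
```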
3 Filtering out abnormal values
3.1 filter
Note that filter("age >= 5") keeps the records with age >= 5; do not read it the other way round.
df.filter("id = 1 or c1 = 'b'").show()
A column name held in a variable must be spliced into the condition with the s interpolator:
val id_col = "id"
df.filter(s"$id_col = 2")
3.2 where
df.where("age > 10").describe().show(10)
3.3 Special handling of selected fields
df.selectExpr("id", "c3 as time", "round(c4) as c4").show(false)
4 Filling missing values
Filling missing values in continuous columns.
1 Categorical columns default to "other"; continuous columns are filled with the mean
def fillDataframeNumberMeanNa(df: DataFrame, num_df: DataFrame, obj_col: Array[String]): DataFrame = {
  println("Filling missing values: categorical -> other, continuous -> mean")
  var df_temp = df.na.fill(value = "other", cols = obj_col)
  df_temp = df_temp.na.fill(num_df.columns.zip(
    num_df.select(num_df.columns.map(mean(_)): _*).first.toSeq
  ).toMap)
  df_temp
}
2 Categorical columns default to "other"; continuous columns are filled with the max
def fillDataframeNumberMaxNa(df: DataFrame, num_df: DataFrame, obj_col: Array[String]): DataFrame = {
  println("Filling missing values: categorical -> other, continuous -> max")
  var df_temp = df.na.fill(value = "other", cols = obj_col)
  df_temp = df_temp.na.fill(num_df.columns.zip(
    num_df.select(num_df.columns.map(max(_)): _*).first.toSeq
  ).toMap)
  df_temp
}
3 Other fill strategies follow the same pattern.
A fixed value per column can also be given:
val map_dict = Map("poi_type1" -> "other", "poi_type2" -> "other")
var TrainFillData = TrainData.na.fill(map_dict)
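Both helpers build their fill map the same way: zip the numeric column names with the single row of aggregates, then hand the resulting `Map` to `na.fill`. The map construction itself is plain Scala (column names and values below are hypothetical):

```scala
// Pairs column names with the aggregate row, as in
// num_df.columns.zip(num_df.select(cols.map(mean(_)): _*).first.toSeq).toMap
val numCols  = Seq("age", "weight") // hypothetical numeric columns
val meansRow = Seq(41.2, 73.5)      // what .first.toSeq would return
val fillMap  = numCols.zip(meansRow).toMap

println(fillMap("age")) // → 41.2
```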
5 Correlation analysis
5.1 pearson
Correlation of each feature column with the label (label and feature columns are both continuous):
// threshold is the cut-off on the absolute correlation
val featCols = dfTrain.columns
  .filter(c => math.abs(dfTrain.stat.corr(c, label_col, "pearson")) > threshold)
  .filter(_ != "cardio") // drop the label column itself
Note what the official documentation stresses:
Calculates the correlation of two columns of a DataFrame. Currently only supports the Pearson Correlation Coefficient. For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.
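For reference, `stat.corr(..., "pearson")` returns the usual Pearson coefficient cov(x, y) / (σx · σy). A self-contained Scala version of the same formula (not Spark's implementation, just a sanity check for choosing thresholds):

```scala
// Pearson correlation of two equally sized numeric sequences.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  val n   = xs.size
  val mx  = xs.sum / n
  val my  = ys.sum / n
  val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
  val sx  = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
  val sy  = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
  cov / (sx * sy)
}

// Perfectly linear columns correlate to 1.0
println(pearson(Seq(1.0, 2.0, 3.0), Seq(2.0, 4.0, 6.0)))
```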
6 Standardization
6.1 onehot
First run the categorical variables through StringIndexer, turning string columns into numeric index columns:
val onehot_col = Array(
  "gender",
  "cholesterol",
  "gluc",
  "smoke",
  "alco"
)
val onehotColToInt = onehot_col.map(c => c + "ToInt")
val standardIndex = onehot_col.map { line =>
  new StringIndexer().setInputCol(line).setOutputCol(line + "ToInt")
}
val vectorAssembler = new VectorAssembler()
  .setInputCols(onehotColToInt ++ Array("col1", "col2"))
  .setOutputCol("features")
val pipelineFinal = new Pipeline()
  .setStages(standardIndex ++ Array(vectorAssembler))
val modelFinal = pipelineFinal.fit(scaledfTrain)
val scaledfTrain1 = modelFinal.transform(scaledfTrain)
val scaleDfTest1 = modelFinal.transform(scaleDfTest)
Then apply OneHotEncoder() for the actual one-hot step, one encoder per indexed column:
val encoders = standardIndex.map { indexer =>
  new OneHotEncoder()
    .setInputCol(indexer.getOutputCol)
    .setOutputCol(indexer.getOutputCol + "Vec")
}
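What the StringIndexer → OneHotEncoder pair produces can be sketched without Spark: the indexer maps each category to an index ordered by descending frequency, and the encoder turns an index into a 0/1 vector. Two simplifications in the sketch below: Spark's OneHotEncoder drops the last category by default, while this version keeps all of them, and ties in frequency are broken alphabetically here purely for determinism:

```scala
// StringIndexer sketch: categories indexed by descending frequency
// (ties broken alphabetically here for determinism).
def stringIndex(values: Seq[String]): Map[String, Int] =
  values.groupBy(identity).toSeq
    .sortBy { case (v, occ) => (-occ.size, v) }
    .map(_._1)
    .zipWithIndex
    .toMap

// OneHotEncoder sketch: index -> 0/1 vector (all categories kept).
def oneHot(idx: Int, numCategories: Int): Seq[Int] =
  Seq.tabulate(numCategories)(i => if (i == idx) 1 else 0)

val idx = stringIndex(Seq("smoke", "no_smoke", "no_smoke"))
println(oneHot(idx("smoke"), 2)) // → List(0, 1)
```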
6.2 zscore
Note ⚠️: StandardScaler() must be fed a vector column
(org.apache.spark.ml.linalg.Vector), so before standardizing, the input columns first have to be assembled into a vector with new VectorAssembler():
val scale_col = Array(
  "age",
  "height",
  "weight",
  "ap_hi",
  "ap_lo"
)
val vectorScale = new VectorAssembler()
  .setInputCols(scale_col)
  .setOutputCol("feaScale")
val scale = new StandardScaler().setInputCol("feaScale").setOutputCol("sfea")
val pipeline = new Pipeline().setStages(Array(vectorScale, scale))
val model = pipeline.fit(dfTrain)
val scaledfTrain = model.transform(dfTrain)
val scaleDfTest = model.transform(dfValid)
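One detail worth knowing: ml's StandardScaler defaults to withStd=true, withMean=false, so out of the box each feature is divided by its sample standard deviation (n − 1 in the denominator) but is not mean-centered. The arithmetic, in plain Scala:

```scala
// StandardScaler default behaviour: divide by the corrected sample
// standard deviation; no centering unless withMean is set to true.
def scaleStd(xs: Seq[Double]): Seq[Double] = {
  val n  = xs.size
  val m  = xs.sum / n
  val sd = math.sqrt(xs.map(x => (x - m) * (x - m)).sum / (n - 1))
  xs.map(_ / sd)
}

// The sample std of (1, 2, 3) is exactly 1.0, so this column is unchanged
println(scaleStd(Seq(1.0, 2.0, 3.0))) // → List(1.0, 2.0, 3.0)
```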
6.3 minmax scaling
Applying minmax to the one-hot encoded vectors is still plain minmax:
val scale = new MinMaxScaler().setInputCol("feaScale").setOutputCol("sfea")
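MinMaxScaler rescales each feature into [0, 1] (the range is configurable via setMin/setMax) using (x − min) / (max − min). The default formula in plain Scala:

```scala
// MinMaxScaler's per-feature formula with the default [0, 1] range.
// Assumes the feature is not constant (max > min).
def minMaxScale(xs: Seq[Double]): Seq[Double] = {
  val lo = xs.min
  val hi = xs.max
  xs.map(x => (x - lo) / (hi - lo))
}

println(minMaxScale(Seq(0.0, 5.0, 10.0))) // → List(0.0, 0.5, 1.0)
```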