First, construct the data:
import scala.util.Random.{setSeed, nextDouble}
setSeed(1)

// Define the record type
case class Record(foo: Double, target: Double, x1: Double, x2: Double, x3: Double)

// Generate 10 records
val rows = sc.parallelize(
  (1 to 10).map(_ => Record(
    nextDouble, nextDouble, nextDouble, nextDouble, nextDouble
  ))
)

// Create a DataFrame and register it as a temporary table
val df = sqlContext.createDataFrame(rows)
df.registerTempTable("df")

// Query; ROUND(foo, 2) rounds to 2 decimal places
sqlContext.sql("""
  SELECT ROUND(foo, 2) foo,
         ROUND(target, 2) target,
         ROUND(x1, 2) x1,
         ROUND(x2, 2) x2,
         ROUND(x3, 2) x3
  FROM df""").show
The resulting data looks like this (first rows shown):
+----+------+----+----+----+
| foo|target|  x1|  x2|  x3|
+----+------+----+----+----+
|0.73|  0.41|0.21|0.33|0.33|
|0.01|  0.96|0.94|0.95|0.95|
Suppose we want to exclude x2 and foo, and extract LabeledPoint(target, Array(x1, x3)):
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// Map feature names to their column positions in the DataFrame
val featInd = List("x1", "x3").map(df.columns.indexOf(_))

// Alternatively, list the columns to exclude and keep the rest
// (don't declare featInd twice in the same scope):
// val ignored = List("foo", "target", "x2")
// val featInd = df.columns.diff(ignored).map(df.columns.indexOf(_))

// Get the position of the target (label) column
val targetInd = df.columns.indexOf("target")

df.rdd.map(r => LabeledPoint(
  r.getDouble(targetInd),                            // label value
  Vectors.dense(featInd.map(r.getDouble(_)).toArray) // feature values by index
))
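The index bookkeeping above does not actually need Spark; a minimal plain-Scala sketch (using a hypothetical column layout matching the DataFrame above) shows that selecting features by name and by exclusion yield the same positions:

```scala
// Column order as in the DataFrame above
val columns = Array("foo", "target", "x1", "x2", "x3")

// Direct selection: indices of the named feature columns
val direct = List("x1", "x3").map(columns.indexOf(_))

// Exclusion: drop label and unwanted columns, keep the rest
val ignored = List("foo", "target", "x2")
val byExclusion = columns.diff(ignored).map(columns.indexOf(_)).toList

println(direct)      // List(2, 4)
println(byExclusion) // List(2, 4)
```

Either list can then be passed to `r.getDouble(_)` when building the feature vector, so pick whichever is shorter to write for your schema.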