First, construct the data:
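import scala.util.Random.{setSeed, nextDouble}

// Fix the seed so the generated values are reproducible
setSeed(1)

// Define the record type
case class Record(foo: Double, target: Double, x1: Double, x2: Double, x3: Double)

// Generate 10 records
val rows = sc.parallelize(
  (1 to 10).map(_ => Record(
    nextDouble, nextDouble, nextDouble, nextDouble, nextDouble
  ))
)

// Create a DataFrame and register it as a temporary table
val df = sqlContext.createDataFrame(rows)
df.registerTempTable("df")

// Query; ROUND(foo, 2) rounds to 2 decimal places.
// Note: x3 is computed as ROUND(x2, 2), so the displayed x3 column repeats x2;
// use ROUND(x3, 2) to show the real x3 values.
sqlContext.sql("""
  SELECT ROUND(foo, 2) foo,
         ROUND(target, 2) target,
         ROUND(x1, 2) x1,
         ROUND(x2, 2) x2,
         ROUND(x2, 2) x3
  FROM df""").show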
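The output looks like this (only the first rows are shown; the x3 column repeats x2 because the query computes it as ROUND(x2, 2)):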
+----+------+----+----+----+
| foo|target|  x1|  x2|  x3|
+----+------+----+----+----+
|0.73|  0.41|0.21|0.33|0.33|
|0.01|  0.96|0.94|0.95|0.95|
Suppose we want to exclude x2 and foo and extract LabeledPoint(target, Array(x1, x3)):
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// Map feature names to their column indices in the DataFrame
val featInd = List("x1", "x3").map(df.columns.indexOf(_))

// Alternatively, exclude columns: drop the three names in `ignored`
// and look up the indices of the remaining columns (same result here)
val ignored = List("foo", "target", "x2")
val featIndByExclusion = df.columns.diff(ignored).map(df.columns.indexOf(_))

// Get the index of the target column
val targetInd = df.columns.indexOf("target")

df.rdd.map(r => LabeledPoint(
  r.getDouble(targetInd), // target value
  // Map feature indices to values
  Vectors.dense(featInd.map(r.getDouble(_)).toArray)
))
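The index lookup can be tried without a Spark cluster. The following is a minimal sketch in plain Scala, assuming a hypothetical `columns` array standing in for `df.columns` and a hypothetical `row` array with made-up values standing in for a single Row; neither comes from the data generated above.

```scala
// Plain-Scala sketch of the name-to-index lookup; no Spark required.
// `columns` and `row` are hypothetical stand-ins (assumptions for illustration).
object FeatureIndexSketch {
  // Stand-in for df.columns
  val columns: Array[String] = Array("foo", "target", "x1", "x2", "x3")
  // Stand-in for one row; values are made up and aligned with `columns`
  val row: Array[Double] = Array(10.0, 20.0, 30.0, 40.0, 50.0)

  // Inclusion list: look up each wanted column by name
  val included: List[Int] = List("x1", "x3").map(columns.indexOf(_))

  // Exclusion list: drop unwanted names, then look up what remains
  val ignored: List[String] = List("foo", "target", "x2")
  val excluded: List[Int] =
    columns.diff(ignored).map(columns.indexOf(_)).toList

  // Index of the label column and the extracted values
  val targetInd: Int = columns.indexOf("target")
  val label: Double = row(targetInd)
  val features: Array[Double] = included.map(row(_)).toArray

  def main(args: Array[String]): Unit = {
    println(s"included = $included")   // List(2, 4)
    println(s"excluded = $excluded")   // List(2, 4)
    println(s"label = $label, features = ${features.mkString(", ")}")
  }
}
```

Both approaches yield the same indices here. Note that `indexOf` returns -1 when a name is not found, so it may be worth checking for negative indices before calling `r.getDouble` on a real Row.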