SparkML Regression (1): Linear Regression

---------------------------------------------------- Contents ----------------------------------------------------

Linear regression theory

Spark source code

Spark experiment

------------------------------------------------- Simple (One-Variable) Linear Regression -------------------------------------------------

Model

It describes the linear relationship between one dependent variable and one independent variable. The simple linear regression model is:

y = β0 + β1·x + ε                                                         (1)

where:

β0, β1 are the regression coefficients

x is the independent variable

y is the dependent variable

ε is the random error, usually assumed to follow N(0, σ²)


From this it follows that y follows N(β0 + β1·x, σ²).

Suppose we have made n observations of (x, y); we then obtain a series of data points

(x_i, y_i), for i = 1, 2, ..., n

Substituting these values into equation (1) gives n equations involving β0 and β1. As we know, to determine k parameters in the full-rank case, k equations suffice; here, however, we have n equations for only two parameters, so the question becomes how to choose the best β0 and β1 from the historical observations. Once β0 and β1 are fixed, any input x yields a corresponding y; in particular, picking a "future" x lets us compute a "future" y, which is exactly the prediction we are after.


Ordinary least squares (OLS)

So what counts as the best β0 and β1? The idea of least squares is that the parameters which, when substituted into the resulting equation, minimize the total squared error are the best. We write the total squared error as:

Q(β0, β1) = Σ_{i=1..n} (y_i − β0 − β1·x_i)²

Now we minimize Q with respect to the parameters; the extremum is attained where the partial derivatives with respect to β0 and β1 are zero:

∂Q/∂β0 = −2 Σ_{i=1..n} (y_i − β0 − β1·x_i) = 0
∂Q/∂β1 = −2 Σ_{i=1..n} x_i·(y_i − β0 − β1·x_i) = 0


Rearranging gives the normal equations:

n·β0 + (Σ x_i)·β1 = Σ y_i
(Σ x_i)·β0 + (Σ x_i²)·β1 = Σ x_i·y_i

From these we can solve directly:

β1 = Σ_{i=1..n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1..n} (x_i − x̄)² ,   β0 = ȳ − β1·x̄

When there are many independent variables it becomes difficult to compute β0, β1, ..., βp directly from formulas like these, and the normal equations must instead be solved with Cramer's Rule (or, equivalently, in matrix form).

The values obtained in this way are the least-squares estimates of β0, β1, ..., βp.
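To make the closed-form solution above concrete, here is a minimal plain-Scala sketch (no Spark involved; the observations are made up purely for illustration) that computes the least-squares estimates for the one-variable case:

object SimpleOLS {
  def main(args: Array[String]): Unit = {
    // made-up observations (x_i, y_i)
    val x = Array(1.0, 2.0, 3.0, 4.0, 5.0)
    val y = Array(2.1, 3.9, 6.2, 8.1, 9.8)
    val n = x.length

    val xMean = x.sum / n
    val yMean = y.sum / n

    // beta1 = sum((x_i - xMean)(y_i - yMean)) / sum((x_i - xMean)^2)
    val beta1 = (x zip y).map { case (xi, yi) => (xi - xMean) * (yi - yMean) }.sum /
      x.map(xi => (xi - xMean) * (xi - xMean)).sum
    // beta0 = yMean - beta1 * xMean
    val beta0 = yMean - beta1 * xMean

    println(s"beta0 = $beta0, beta1 = $beta1")
  }
}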


Goodness-of-fit analysis

1. Sample variance of the residuals

Residuals: e_i = y_i − ŷ_i  (i = 1, 2, ..., n)

Sample mean of the residuals: ē = (1/n) Σ_{i=1..n} e_i = 0

Sample variance of the residuals: MSE = (1/(n−2)) Σ_{i=1..n} e_i²

Here n−2 is the degrees of freedom: the two normal equations impose two constraints on the residuals (Σ e_i = 0 and Σ x_i·e_i = 0), so two degrees of freedom are lost. The better our fitted equation ŷ = β0 + β1·x explains the dependent variable, the smaller the MSE. You will find that this MSE is an unbiased estimator of the error variance σ² in the population regression model.

The corresponding standard deviation (the standard error of the regression) is then: sqrt(MSE) = sqrt( (1/(n−2)) Σ e_i² )


2. Coefficient of determination (R²)

Let us reconsider our sample regression function:

y_i = ŷ_i + e_i ,  where ŷ_i = β0 + β1·x_i are the fitted values

The fitted regression line necessarily passes through the point of sample means (x̄, ȳ). To see this, sum both sides over the n observations and divide by the sample size n; since Σ e_i = 0, we get:

ȳ = (1/n) Σ_{i=1..n} ŷ_i = β0 + β1·x̄

Viewed graphically (or simply by subtracting ȳ from both sides of y_i = ŷ_i + e_i), each observation decomposes as:

y_i − ȳ = (ŷ_i − ȳ) + (y_i − ŷ_i)

Squaring both sides and summing over all observations (the cross term vanishes because of the normal equations) gives:

Σ_{i=1..n} (y_i − ȳ)² = Σ_{i=1..n} (ŷ_i − ȳ)² + Σ_{i=1..n} (y_i − ŷ_i)²
Σ (y_i − ȳ)²: the sum of squared deviations of the observations from their mean, with n−1 degrees of freedom

Σ (ŷ_i − ȳ)²: the sum of squares explained by the fitted line, with 1 degree of freedom

Σ (y_i − ŷ_i)²: the sum of squared differences between the observed and fitted values, i.e. the residual sum of squares, with n−2 degrees of freedom

Full names and abbreviations (following the convention used in foreign textbooks):

Total sum of squares (SST): the total sum of squared deviations

Residual sum of squares (SSR): the residual sum of squares

Explained sum of squares (SSE): the regression sum of squares (the Chinese name 迴歸平方和 seems to be a domestic coinage based on its practical meaning?)

So we have: SST = SSE + SSR

The ratio of the part the regression actually explains to the total is the coefficient of determination, denoted R²:

R² = SSE / SST = 1 − SSR / SST

When R² = 1, i.e. SSR = 0 and SST = SSE, the original data is fully explained by the fitted values, and the fit is perfect.

In general, 0 ≤ R² ≤ 1.

SSR is easy to compute: it is simply the sum of squared differences between the observed values and the fitted values, so in practice R² is computed through SSR as R² = 1 − SSR / SST.
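As a small illustration (a plain-Scala sketch with made-up observed and fitted values), SST, SSR, and R² can be computed as follows:

// observed values y_i and fitted values yHat_i (made-up numbers)
val y    = Array(2.1, 3.9, 6.2, 8.1, 9.8)
val yHat = Array(2.0, 4.0, 6.0, 8.0, 10.0)

val yMean = y.sum / y.length
val sst = y.map(v => math.pow(v - yMean, 2)).sum                      // total sum of squares
val ssr = (y zip yHat).map { case (v, p) => math.pow(v - p, 2) }.sum  // residual sum of squares
val r2  = 1.0 - ssr / sst                                             // R^2 = 1 - SSR/SST

println(s"SST = $sst, SSR = $ssr, R^2 = $r2")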


Significance testing

Once the parameters have been fitted, you need to assess whether such a model is statistically significant for the problem we want to explain (R² alone is not enough). If it is not significant, you should switch to another modeling approach. The usual tests are the F-test and the t-test; since the focus of this article is linear regression in Spark MLlib and this section is only background, the details of these tests are omitted.

------------------------------------------------- Multiple Linear Regression -------------------------------------------------

Model

It describes the linear relationship between one dependent variable and several independent variables. The multiple linear regression model is:

y = β0 + β1·x1 + β2·x2 + ... + βp·xp + ε    (2)

where β0, β1, ..., βp and σ² are unknown parameters that do not depend on x1, ..., xp; β0, β1, ..., βp are the regression coefficients.

Now suppose we have n samples (x_i1, x_i2, ..., x_ip, y_i), i = 1, ..., n; substituting them into (2) gives:

y_i = β0 + β1·x_i1 + ... + βp·x_ip + ε_i ,  i = 1, ..., n    (3)

We can write (3) in matrix form:

Y = X·β + ε    (4)

where:

Y = (y_1, ..., y_n)^T ,  β = (β0, β1, ..., βp)^T ,  ε = (ε_1, ..., ε_n)^T , and X is the n×(p+1) design matrix whose i-th row is (1, x_i1, ..., x_ip).

The solution proceeds exactly as in simple linear regression and gives:

β = (X^T X)^(-1) X^T Y
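As an illustration of β = (X^T X)^(-1) X^T Y, here is a small sketch using the Breeze linear algebra library (the same library Spark MLlib uses underneath). The data matrix and labels are made up, and the formula only applies when X^T X is invertible:

import breeze.linalg.{DenseMatrix, DenseVector, inv}

// design matrix X: a first column of ones for the intercept, then the features (made-up data)
val X = DenseMatrix(
  (1.0, 1.0, 2.0),
  (1.0, 2.0, 1.0),
  (1.0, 3.0, 4.0),
  (1.0, 4.0, 3.0))
val y = DenseVector(6.0, 7.0, 14.0, 15.0)

// beta = (X^T X)^-1 X^T y, valid only when X has full column rank
val beta = inv(X.t * X) * (X.t * y)
println(beta)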


The coefficient of determination R² is computed just as in simple regression; the linear relationship is usually considered clear only when R² exceeds 0.8.

=================================== Limitations of Least Squares ===================================

1. Least squares can only be used when X has full (column) rank, because the derivation requires X^T X to be invertible. In other words, the independent variables that determine y must be linearly independent of one another; if some of them are related, say one column can be written as a linear combination of the others, then X is not of full rank, X^T X is not invertible, and applying least squares is simply wrong.

2. Least squares has high computational complexity and is time-consuming on large-scale data.



------------------------------------------------- Gradient Descent -------------------------------------------------

Because of these limitations of least squares, in computing practice gradient descent is generally used to obtain an approximate solution.

To stay consistent with the notation of reference 2, we drop the symbols used above and adopt those of reference 2, starting directly from multiple linear regression.


The linear hypothesis:

h_θ(x) = θ0 + θ1·x1 + θ2·x2 + ... + θn·xn

Letting x0 = 1, the equation becomes:

h_θ(x) = Σ_{j=0..n} θ_j·x_j = θ^T·x

Suppose we have made m observations of (x, y); we then obtain a series of data points

(x^(i), y^(i)), for i = 1, 2, ..., m. Following the earlier idea, we measure how far off we are, i.e. the cost function:

J(θ) = (1/(2m)) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))²


(A side note: I am not sure why many people drop the m above; both the Andrew Ng course and the Spark source code keep this m, and including it reflects the problem better.)


In other words, we want to make J(θ) as small as possible. With the earlier least-squares approach we would take the partial derivatives with respect to each θ_j, set them all to zero, and solve the resulting system of equations jointly.


Since we know the drawbacks of least squares, we use gradient descent instead to find the optimal θ:

θ_j := θ_j − α · (1/m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x_j^(i)

Here α is the learning rate, and θ is initialized as a zero vector of length n+1; we keep iterating until convergence.

When the sample is very large and many iterations are needed, only a subset of the samples is used to update θ in each iteration (the mini-batch / stochastic variant); a plain-Scala sketch of the basic batch loop follows below.

For more details, see: http://blog.csdn.net/legotime/article/details/51277141
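Below is a minimal, self-contained plain-Scala sketch of the batch gradient descent loop described above (made-up data, a fixed learning rate, and a fixed iteration count rather than a convergence test):

object GradientDescentSketch {
  def main(args: Array[String]): Unit = {
    // each row is (x0 = 1, x1); y is the label (made-up data with y roughly 1 + 2*x1)
    val xs = Array(Array(1.0, 1.0), Array(1.0, 2.0), Array(1.0, 3.0), Array(1.0, 4.0))
    val ys = Array(3.1, 4.9, 7.2, 8.8)
    val m = xs.length

    var theta = Array(0.0, 0.0)   // theta initialized to the zero vector
    val alpha = 0.05              // learning rate
    val numIterations = 1000

    for (_ <- 1 to numIterations) {
      // gradient_j = (1/m) * sum_i (h_theta(x_i) - y_i) * x_ij
      val grad = Array.fill(theta.length)(0.0)
      for (i <- xs.indices) {
        val h = (theta zip xs(i)).map { case (t, x) => t * x }.sum   // h_theta(x_i)
        val diff = h - ys(i)
        for (j <- theta.indices) grad(j) += diff * xs(i)(j) / m
      }
      // theta_j := theta_j - alpha * gradient_j
      theta = (theta zip grad).map { case (t, g) => t - alpha * g }
    }
    println(theta.mkString("theta = [", ", ", "]"))
  }
}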

-------------------------------------------------------------------------------------------------------------------------------------------------

Spark source code

The package org.apache.spark.mllib.regression contains two parts: LinearRegressionModel and LinearRegressionWithSGD.

1. The regression model (a class plus a companion object). The class passes its parameters (weights and intercept) to GeneralizedLinearModel, the generalized linear model it extends, and thereby forms a complete linear regression model; the methods on the companion object are used to load a previously saved model and use it for regression.

2. LinearRegressionWithSGD: stochastic gradient descent, with cost function f(weights) = 1/n ||A·weights − y||², i.e. the J(θ) given above.


Remember, again, that keeping the m (written n here) reflects the problem better (dividing by m turns the sum into the mean squared error).

LinearRegressionWithSGD extends the generalized regression class GeneralizedLinearAlgorithm[LinearRegressionModel].



1. The source code of the regression model is as follows

/**
 * Regression model trained using LinearRegression.
 *
 * @param weights Weights computed for every feature. (the weight vector, one entry per feature)
 * @param intercept Intercept computed for this model. (the intercept/bias term of this model)
 *
 */
@Since("0.8.0")
class LinearRegressionModel @Since("1.1.0") (
    @Since("1.0.0") override val weights: Vector,
    @Since("0.8.0") override val intercept: Double)
  extends GeneralizedLinearModel(weights, intercept) with RegressionModel with Serializable
  with Saveable with PMMLExportable {

  // Prediction: Y = W*X + intercept
  override protected def predictPoint(
      dataMatrix: Vector,
      weightMatrix: Vector,
      intercept: Double): Double = {
    weightMatrix.toBreeze.dot(dataMatrix.toBreeze) + intercept
  }
  // Saving the model records: the save path, the class name, the weights and the intercept
  @Since("1.3.0")
  override def save(sc: SparkContext, path: String): Unit = {
    GLMRegressionModel.SaveLoadV1_0.save(sc, path, this.getClass.getName, weights, intercept)
  }

  override protected def formatVersion: String = "1.0"
}
// Load a model saved as above, using load(sc, path)
@Since("1.3.0")
object LinearRegressionModel extends Loader[LinearRegressionModel] {

  @Since("1.3.0")
  override def load(sc: SparkContext, path: String): LinearRegressionModel = {
    val (loadedClassName, version, metadata) = Loader.loadMetadata(sc, path)
    // Hard-code class name string in case it changes in the future
    val classNameV1_0 = "org.apache.spark.mllib.regression.LinearRegressionModel"
    (loadedClassName, version) match {
      case (className, "1.0") if className == classNameV1_0 =>
        val numFeatures = RegressionModel.getNumFeatures(metadata)
        val data = GLMRegressionModel.SaveLoadV1_0.loadData(sc, path, classNameV1_0, numFeatures)
        new LinearRegressionModel(data.weights, data.intercept)
      case _ => throw new Exception(
        s"LinearRegressionModel.load did not recognize model with (className, format version):" +
        s"($loadedClassName, $version).  Supported:\n" +
        s"  ($classNameV1_0, 1.0)")
    }
  }
}

2. The LinearRegressionWithSGD class, based on stochastic gradient descent without regularization; it extends the generalized regression class GeneralizedLinearAlgorithm[LinearRegressionModel].

/**
 * Train a linear regression model with no regularization using Stochastic Gradient Descent.
 * This solves the least squares regression formulation
 *              f(weights) = 1/n ||A weights-y||^2^
 * (which is the mean squared error).
 * Here the data matrix has n rows, and the input RDD holds the set of rows of A, each with
 * its corresponding right hand side label y.
 * See also the documentation for the precise formulation.
 */
@Since("0.8.0")
class LinearRegressionWithSGD private[mllib] (
    private var stepSize: Double, // step size
    private var numIterations: Int, // number of iterations
    private var miniBatchFraction: Double) // fraction of samples used in each iteration
  extends GeneralizedLinearAlgorithm[LinearRegressionModel] with Serializable {

  private val gradient = new LeastSquaresGradient()  // see section 3 below
  private val updater = new SimpleUpdater()  // see section 4 below
  @Since("0.8.0")
  override val optimizer = new GradientDescent(gradient, updater) // see section 5 below
    .setStepSize(stepSize)
    .setNumIterations(numIterations)
    .setMiniBatchFraction(miniBatchFraction)

  /**
   * Construct a LinearRegression object with default parameters: {stepSize: 1.0,
   * numIterations: 100, miniBatchFraction: 1.0}.
   */
  @Since("0.8.0")
  def this() = this(1.0, 100, 1.0) 

  override protected[mllib] def createModel(weights: Vector, intercept: Double) = {
    new LinearRegressionModel(weights, intercept)
  }
}

/**
 * Top-level methods for calling LinearRegression.
 *
 */
@Since("0.8.0")
object LinearRegressionWithSGD {

  /**
   * Train a Linear Regression model given an RDD of (label, features) pairs. We run a fixed number
   * of iterations of gradient descent using the specified step size. Each iteration uses
   * `miniBatchFraction` fraction of the data to calculate a stochastic gradient. The weights used
   * in gradient descent are initialized using the initial weights provided.
   *
   * @param input RDD of (label, array of features) pairs. Each pair describes a row of the data
   *              matrix A as well as the corresponding right hand side label y
   * @param numIterations Number of iterations of gradient descent to run.
   * @param stepSize Step size to be used for each iteration of gradient descent.
   * @param miniBatchFraction Fraction of data to be used per iteration.
   * @param initialWeights Initial set of weights to be used. Array should be equal in size to
   *        the number of features in the data.
   *
   */
  @Since("1.0.0")
  def train(
      input: RDD[LabeledPoint],
      numIterations: Int,
      stepSize: Double,
      miniBatchFraction: Double,
      initialWeights: Vector): LinearRegressionModel = {
    new LinearRegressionWithSGD(stepSize, numIterations, miniBatchFraction)
      .run(input, initialWeights)
  }

  /**
   * Train a LinearRegression model given an RDD of (label, features) pairs. We run a fixed number
   * of iterations of gradient descent using the specified step size. Each iteration uses
   * `miniBatchFraction` fraction of the data to calculate a stochastic gradient.
   *
   * @param input RDD of (label, array of features) pairs. Each pair describes a row of the data
   *              matrix A as well as the corresponding right hand side label y
   * @param numIterations Number of iterations of gradient descent to run.
   * @param stepSize Step size to be used for each iteration of gradient descent.
   * @param miniBatchFraction Fraction of data to be used per iteration.
   *
   */
  @Since("0.8.0")
  def train(
      input: RDD[LabeledPoint],
      numIterations: Int,
      stepSize: Double,
      miniBatchFraction: Double): LinearRegressionModel = {
    new LinearRegressionWithSGD(stepSize, numIterations, miniBatchFraction).run(input)
  }

  /**
   * Train a LinearRegression model given an RDD of (label, features) pairs. We run a fixed number
   * of iterations of gradient descent using the specified step size. We use the entire data set to
   * compute the true gradient in each iteration.
   *
   * @param input RDD of (label, array of features) pairs. Each pair describes a row of the data
   *              matrix A as well as the corresponding right hand side label y
   * @param stepSize Step size to be used for each iteration of Gradient Descent.
   * @param numIterations Number of iterations of gradient descent to run.
   * @return a LinearRegressionModel which has the weights and offset from training.
   *
   */
  @Since("0.8.0")
  def train(
      input: RDD[LabeledPoint],
      numIterations: Int,
      stepSize: Double): LinearRegressionModel = {
    train(input, numIterations, stepSize, 1.0)
  }

  /**
   * Train a LinearRegression model given an RDD of (label, features) pairs. We run a fixed number
   * of iterations of gradient descent using a step size of 1.0. We use the entire data set to
   * compute the true gradient in each iteration.
   *
   * @param input RDD of (label, array of features) pairs. Each pair describes a row of the data
   *              matrix A as well as the corresponding right hand side label y
   * @param numIterations Number of iterations of gradient descent to run.
   * @return a LinearRegressionModel which has the weights and offset from training.
   *
   */
  @Since("0.8.0")
  def train(
      input: RDD[LabeledPoint],
      numIterations: Int): LinearRegressionModel = {
    train(input, numIterations, 1.0, 1.0)
  }
}

3. The least-squares gradient (LeastSquaresGradient). First, recall our cost (loss) function:

J(θ) = (1/(2m)) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))²

In the source code the loss function is written as: L = 1/2n ||A weights-y||^2

Gradient contributed by a single sample: (h_θ(x) − y) · x

Loss contributed by a single sample: (h_θ(x) − y)² / 2

The first compute returns the pair (gradient, loss) for one sample; the second compute accumulates the sample's gradient into cumGradient in place and returns only the loss.

class LeastSquaresGradient extends Gradient {
  override def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) = {
    val diff = dot(data, weights) - label
    val loss = diff * diff / 2.0 // loss for this sample
    val gradient = data.copy
    scal(diff, gradient) // gradient: x * (h(x) - y)
    (gradient, loss)
  }

  override def compute(
      data: Vector,
      label: Double,
      weights: Vector,
      cumGradient: Vector): Double = {
    val diff = dot(data, weights) - label//h(x)-y
    axpy(diff, data, cumGradient)//y = x*(h(x)-y)+cumGradient
    /**axpy用法:
      * Computes y += x * a, possibly doing less work than actually doing that operation
      *  def axpy[A, X, Y](a: A, x: X, y: Y)(implicit axpy: CanAxpy[A, X, Y]) { axpy(a,x,y) }
      */
    diff * diff / 2.0
  }
}
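A small usage sketch for a single sample (made-up values), showing what the two compute overloads return; the expected numbers in the comments follow directly from the formulas above:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.LeastSquaresGradient

val data    = Vectors.dense(1.0, 2.0)   // features x
val label   = 5.0                       // label y
val weights = Vectors.dense(1.0, 1.0)   // current weights

val gradient = new LeastSquaresGradient()

// first overload: returns (gradient, loss) for this sample
// diff = w·x - y = 3 - 5 = -2, so gradient = (-2, -4) and loss = 2.0
val (grad, loss) = gradient.compute(data, label, weights)
println(s"gradient = $grad, loss = $loss")

// second overload: adds the sample's gradient into cumGradient in place and returns the loss
val cumGradient = Vectors.dense(0.0, 0.0)
val loss2 = gradient.compute(data, label, weights, cumGradient)
println(s"cumGradient = $cumGradient, loss = $loss2")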

4. Weight update (SimpleUpdater). The update formula is:

weights := weights − (stepSize / sqrt(iter)) · gradient

Note that the second element of the returned tuple (the regularization value) is set to 0, because SimpleUpdater applies no regularization.

class SimpleUpdater extends Updater {
  override def compute(
      weightsOld: Vector, // weight vector from the previous iteration
      gradient: Vector, // gradient vector computed in this iteration
      stepSize: Double, // step size
      iter: Int, // current iteration number
      regParam: Double): (Vector, Double) = {
    val thisIterStepSize = stepSize / math.sqrt(iter) // effective learning rate for this iteration
    val brzWeights: BV[Double] = weightsOld.toBreeze.toDenseVector
    brzAxpy(-thisIterStepSize, gradient.toBreeze, brzWeights)
    // i.e. brzWeights -= thisIterStepSize * gradient.toBreeze

    (Vectors.fromBreeze(brzWeights), 0)
  }
}
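For example, with stepSize = 1.0 the effective step decays as 1.0, 0.707, 0.577, 0.5, ... at iterations 1, 2, 3, 4, ... A tiny sketch of a single update using SimpleUpdater (made-up weights and gradient):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.SimpleUpdater

val weightsOld = Vectors.dense(1.0, 1.0)
val gradient   = Vectors.dense(0.5, -0.5)
val stepSize   = 1.0

val updater = new SimpleUpdater()
// at iteration 4 the effective step is stepSize / sqrt(4) = 0.5,
// so the new weights are (1.0, 1.0) - 0.5 * (0.5, -0.5) = (0.75, 1.25)
val (newWeights, regVal) = updater.compute(weightsOld, gradient, stepSize, 4, 0.0)
println(s"newWeights = $newWeights, regVal = $regVal")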

5. Weight optimization

Weight optimization uses stochastic gradient descent, but the default is miniBatchFraction = 1.0 (i.e. classical full-batch gradient descent).


/**
 * Class used to solve an optimization problem using Gradient Descent.
 * @param gradient Gradient function to be used.
 * @param updater Updater to be used to update weights after every iteration.
 */
class GradientDescent private[spark] (private var gradient: Gradient, private var updater: Updater)
  extends Optimizer with Logging {

  private var stepSize: Double = 1.0
  private var numIterations: Int = 100
  private var regParam: Double = 0.0
  private var miniBatchFraction: Double = 1.0
  private var convergenceTol: Double = 0.001 // convergence tolerance

  /**
   * Set the initial step size of SGD for the first step. Default 1.0.
   * In subsequent steps, the step size will decrease with stepSize/sqrt(t)
   */
  def setStepSize(step: Double): this.type = {
    this.stepSize = step
    this
  }

  /**
   * :: Experimental ::
   * Set fraction of data to be used for each SGD iteration.
   * Default 1.0 (corresponding to deterministic/classical gradient descent)
   */
  @Experimental
  def setMiniBatchFraction(fraction: Double): this.type = {
    this.miniBatchFraction = fraction
    this
  }

  /**
   * Set the number of iterations for SGD. Default 100.
   */
  def setNumIterations(iters: Int): this.type = {
    this.numIterations = iters
    this
  }

  /**
   * Set the regularization parameter. Default 0.0.
   */
  def setRegParam(regParam: Double): this.type = {
    this.regParam = regParam
    this
  }

  /**
   * Set the convergence tolerance. Default 0.001
   * convergenceTol is a condition which decides iteration termination.
   * The end of iteration is decided based on below logic.
   *
   *  - If the norm of the new solution vector is >1, the diff of solution vectors
   *    is compared to relative tolerance which means normalizing by the norm of
   *    the new solution vector.
   *  - If the norm of the new solution vector is <=1, the diff of solution vectors
   *    is compared to absolute tolerance which is not normalizing.
   *
   * Must be between 0.0 and 1.0 inclusively.
   */
  def setConvergenceTol(tolerance: Double): this.type = {
    require(0.0 <= tolerance && tolerance <= 1.0)
    this.convergenceTol = tolerance
    this
  }

  /**
   * Set the gradient function (of the loss function of one single data example)
   * to be used for SGD.
   */
  def setGradient(gradient: Gradient): this.type = {
    this.gradient = gradient
    this
  }


  /**
   * Set the updater function to actually perform a gradient step in a given direction.
   * The updater is responsible to perform the update from the regularization term as well,
   * and therefore determines what kind or regularization is used, if any.
   */
  def setUpdater(updater: Updater): this.type = {
    this.updater = updater
    this
  }

  /**
   * :: DeveloperApi ::
   * Runs gradient descent on the given training data.
   * @param data training data
   * @param initialWeights initial weights
   * @return solution vector
   */
  @DeveloperApi
  def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector = {
    val (weights, _) = GradientDescent.runMiniBatchSGD(
      data,
      gradient,
      updater,
      stepSize,
      numIterations,
      regParam,
      miniBatchFraction,
      initialWeights,
      convergenceTol)
    weights
  }

}

/**
 * :: DeveloperApi ::
 * Top-level method to run gradient descent.
 */
@DeveloperApi
object GradientDescent extends Logging {
  /**
   * Run stochastic gradient descent (SGD) in parallel using mini batches.
   * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data
   * in order to compute a gradient estimate.
   * Sampling, and averaging the subgradients over this subset is performed using one standard
   * spark map-reduce in each iteration.
   *
   * @param data Input data for SGD. RDD of the set of data examples, each of
   *             the form (label, [feature values]).
   * @param gradient Gradient object (used to compute the gradient of the loss function of
   *                 one single data example)
   * @param updater Updater function to actually perform a gradient step in a given direction.
   * @param stepSize initial step size for the first step
   * @param numIterations number of iterations that SGD should be run.
   * @param regParam regularization parameter
   * @param miniBatchFraction fraction of the input data set that should be used for
   *                          one iteration of SGD. Default value 1.0.
   * @param convergenceTol Minibatch iteration will end before numIterations if the relative
   *                       difference between the current weight and the previous weight is less
   *                       than this value. In measuring convergence, L2 norm is calculated.
   *                       Default value 0.001. Must be between 0.0 and 1.0 inclusively.
   * @return A tuple containing two elements. The first element is a column matrix containing
   *         weights for every feature, and the second element is an array containing the
   *         stochastic loss computed for every iteration.
   */
  def runMiniBatchSGD(
      data: RDD[(Double, Vector)],
      gradient: Gradient,
      updater: Updater,
      stepSize: Double,
      numIterations: Int,
      regParam: Double,
      miniBatchFraction: Double,
      initialWeights: Vector,
      convergenceTol: Double): (Vector, Array[Double]) = {

    // convergenceTol should be set with non minibatch settings
    if (miniBatchFraction < 1.0 && convergenceTol > 0.0) {
      logWarning("Testing against a convergenceTol when using miniBatchFraction " +
        "< 1.0 can be unstable because of the stochasticity in sampling.")
    }
    // keep the history of stochastic losses in an array
    val stochasticLossHistory = new ArrayBuffer[Double](numIterations)
    // Record previous weight and current one to calculate solution vector difference
    // previous and current weights, initially undefined
    var previousWeights: Option[Vector] = None
    var currentWeights: Option[Vector] = None
    // number of training samples
    val numExamples = data.count()

    // if no data, return initial weights to avoid NaNs
    if (numExamples == 0) {
      logWarning("GradientDescent.runMiniBatchSGD returning initial weights, no data found")
      return (initialWeights, stochasticLossHistory.toArray)
    }

    if (numExamples * miniBatchFraction < 1) {
      logWarning("The miniBatchFraction is too small")
    }

    // Initialize weights as a column vector
    var weights = Vectors.dense(initialWeights.toArray)
    val n = weights.size

    /**
     * For the first iteration, the regVal will be initialized as sum of weight squares
     * if it's L2 updater; for L1 updater, the same logic is followed.
     */
    var regVal = updater.compute(
      weights, Vectors.zeros(weights.size), 0, 1, regParam)._2

    var converged = false // indicates whether converged based on convergenceTol
    var i = 1
    while (!converged && i <= numIterations) {
      // broadcast the current weights to the executors
      val bcWeights = data.context.broadcast(weights)

      // Sample a subset (fraction miniBatchFraction) of the total data
      // compute and sum up the subgradients on this subset (this is one map-reduce)
      val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
        .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
          seqOp = (c, v) => {
            // c: (grad, loss, count), v: (label, features)
            val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
            (c._1, c._2 + l, c._3 + 1)
          },
          combOp = (c1, c2) => {
            // c: (grad, loss, count)
            (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
          })

      if (miniBatchSize > 0) {
        /**
         * lossSum is computed using the weights from the previous iteration
         * and regVal is the regularization value computed in the previous iteration as well.
         */
        // record the loss and update the weights
        stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
        val update = updater.compute(
          weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble),
          stepSize, i, regParam)
        weights = update._1
        regVal = update._2

        previousWeights = currentWeights
        currentWeights = Some(weights)
        if (previousWeights != None && currentWeights != None) {
          converged = isConverged(previousWeights.get,
            currentWeights.get, convergenceTol)
        }
      } else {
        logWarning(s"Iteration ($i/$numIterations). The size of sampled batch is zero")
      }
      i += 1
    }

    logInfo("GradientDescent.runMiniBatchSGD finished. Last 10 stochastic losses %s".format(
      stochasticLossHistory.takeRight(10).mkString(", ")))
    // return the final weights and the array of historical losses
    (weights, stochasticLossHistory.toArray)

  }
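For completeness, here is a sketch of calling the optimizer directly through the developer API, assuming an existing SparkContext named sc, made-up (label, features) pairs, and a Spark version whose runMiniBatchSGD has the signature shown above. Normally you would go through LinearRegressionWithSGD.train instead, as in the experiment below:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{GradientDescent, LeastSquaresGradient, SimpleUpdater}

// (label, features) pairs; `sc` is assumed to be an existing SparkContext
val trainingData = sc.parallelize(Seq(
  (3.0, Vectors.dense(1.0, 1.0)),
  (5.0, Vectors.dense(1.0, 2.0)),
  (7.0, Vectors.dense(1.0, 3.0))))

val (weights, lossHistory) = GradientDescent.runMiniBatchSGD(
  trainingData,
  new LeastSquaresGradient(),
  new SimpleUpdater(),
  0.1,              // stepSize
  100,              // numIterations
  0.0,              // regParam
  1.0,              // miniBatchFraction
  Vectors.zeros(2), // initialWeights
  0.001)            // convergenceTol

println(s"weights = $weights, last loss = ${lossHistory.last}")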


SparkML experiment:

package Regression

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionModel, LinearRegressionWithSGD}
import org.apache.spark.{SparkConf, SparkContext}


object RegressionWithSGD {
  def main(args: Array[String]) {
   val conf = new SparkConf().setAppName("LinearRegressionWithSGDExample").setMaster("local")
    val sc = new SparkContext(conf)

    // Load and parse the data
    val data = sc.textFile("E:\\SparkCore2\\data\\mllib\\ridge-data\\lpsa.data")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }
    /** parsedData looks like:
      * (-0.4307829,[-1.63735562648104,-2.00621178480549,-1.86242597251066,-1.02470580167082,-0.522940888712441,
      * -0.863171185425945,-1.04215728919298,-0.864466507337306])
      */

    // Building the model
    val numIterations = 100 // number of iterations
    val stepSize = 0.00000001 // step size
    val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize) // train the model

    // Evaluate model on training examples and compute training error
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val numCount = valuesAndPreds.count()
    println("The sample count"+numCount)

    val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2) }.mean() // mean squared error on the training set
    println("training Mean Squared Error = " + MSE)
    println("模型的權重"+model.weights)
    println("模型的殘差"+model.intercept)

    // Save and load model
    model.save(sc, "E:\\SparkCore2\\data\\mllib\\ridge-data\\scalaLinearRegressionWithSGDModel")
    val sameModel = LinearRegressionModel.load(sc, "E:\\SparkCore2\\data\\mllib\\ridge-data\\scalaLinearRegressionWithSGDModel")

    sc.stop()

    /**
      * The sample count: 67
      * training Mean Squared Error = 7.4510328101026
      * Model weights: [1.440209460949548E-8,1.0686674736254139E-8,9.608973495307957E-9,4.553409983798095E-9,1.2221496560765207E-8,8.910773406981891E-9,5.5962085583952E-9,1.2255699128757168E-8]
      * Model intercept: 0.0
      */

  }
}

References:

1. Andrew Ng's linear regression lecture notes: link: http://pan.baidu.com/s/1bTgHgq  password: 7mbt










