SparkML Regression (3): Isotonic Regression

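While writing this post I looked through some material online and found that reference [1] covers the topic fairly systematically, so I recommend reading it. It does contain some errors, however, so I give a brief summary here. If you get stuck following its derivations, my summary may help.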

------------------------------------Preface

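Background:

(1) In medical dose-response studies, efficacy and side effects follow certain trends as the drug dose increases. For example, the higher the dose, the greater the efficacy, and the higher the dose, the greater the toxicity.

(2) A trial evaluates the toxicity of a drug at different dose levels and recommends a dose that is both safe and effective for patients. This dose is called the Maximum Tolerated Dose (MTD).

(3) Toxicity is non-decreasing as the dose increases. The MTD is defined as the highest dose level whose toxicity probability does not exceed the target toxicity level.

(4) The toxicity probability at each dose level is estimated from the proportion of patients who show a toxic response at that level, so the estimated probabilities may fail to be a non-decreasing function of the dose level. In that case we can apply isotonic regression.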
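Points (3) and (4) can be made concrete with a small sketch. The patient counts, toxicity counts and target level below are hypothetical numbers chosen only for illustration, and `isNonDecreasing` and `mtdIndex` are helper names introduced here; none of them come from Spark or from reference [1].

```scala
object MtdSketch {
  // Hypothetical trial data, ordered from the lowest to the highest dose level.
  val patients: Vector[Int] = Vector(6, 6, 6, 6)
  val toxicities: Vector[Int] = Vector(0, 2, 1, 4)

  // Raw toxicity rate observed at each dose level.
  val rawRates: Vector[Double] =
    toxicities.zip(patients).map { case (t, n) => t.toDouble / n }

  // True when every value is <= its successor.
  def isNonDecreasing(xs: Seq[Double]): Boolean =
    xs.zip(xs.drop(1)).forall { case (a, b) => a <= b }

  // Index of the highest dose whose toxicity probability does not exceed the target,
  // or None if even the lowest dose is too toxic. Assumes `probs` is non-decreasing.
  def mtdIndex(probs: Seq[Double], target: Double): Option[Int] = {
    val idx = probs.lastIndexWhere(_ <= target)
    if (idx >= 0) Some(idx) else None
  }
}
```

With these numbers the raw rates are 0, 1/3, 1/6 and 2/3, so `isNonDecreasing(rawRates)` is false. Pooling the second and third levels gives (2 + 1) / (6 + 6) = 0.25 for both, which restores the ordering, and `mtdIndex` can then be applied to the pooled values.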

L2 isotonic regression
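For reference, the weighted L2 isotonic regression problem is commonly stated as follows. The symbols $y_i$, $w_i$ and $x_i$ are notation chosen here and may differ from reference [1]. Given observations $y_1, \dots, y_n$ sorted by feature value and positive weights $w_i$, find fitted values $x_i$ that solve

```latex
\min_{x_1,\dots,x_n} \; \sum_{i=1}^{n} w_i \,(y_i - x_i)^2
\quad \text{subject to} \quad x_1 \le x_2 \le \dots \le x_n .
```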

L2 isotonic regression algorithm

For the specific definitions and propositions, see reference [1].
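As a rough illustration of the idea, here is a minimal sequential sketch of the pool adjacent violators (PAV) algorithm in plain Scala. It is not the Spark implementation and not the formulation from reference [1]; the names `PavSketch`, `Block` and `fit` are made up for this example. It assumes the labels are already sorted by feature and that all weights are strictly positive.

```scala
object PavSketch {
  // A block of pooled points: weighted sum of labels, total weight, and number of points.
  private final case class Block(weightedSum: Double, weight: Double, count: Int) {
    def mean: Double = weightedSum / weight
    def merge(other: Block): Block =
      Block(weightedSum + other.weightedSum, weight + other.weight, count + other.count)
  }

  // Weighted L2 non-decreasing fit for labels that are already sorted by feature.
  def fit(labels: Seq[Double], weights: Seq[Double]): Vector[Double] = {
    require(labels.length == weights.length, "labels and weights must have the same length")
    // The head of `stack` is the most recently created block.
    var stack = List.empty[Block]
    for ((y, w) <- labels.zip(weights)) {
      var current = Block(y * w, w, 1)
      // Pool backwards while the previous block's mean violates the ordering.
      while (stack.nonEmpty && stack.head.mean > current.mean) {
        current = stack.head.merge(current)
        stack = stack.tail
      }
      stack = current :: stack
    }
    // Expand each block back to one fitted value per original point.
    stack.reverse.flatMap(b => Vector.fill(b.count)(b.mean)).toVector
  }
}
```

For example, `PavSketch.fit(Seq(1.0, 3.0, 2.0, 4.0), Seq(1.0, 1.0, 1.0, 1.0))` pools the violating pair (3, 2) into their mean 2.5 and leaves the other points unchanged.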

Spark source code analysis (see the appendix for the full-size image)

/**
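 * Isotonic regression model.
 *
 * @param boundaries Array of boundaries used for prediction; it must be sorted in increasing
 *                   order (the breakpoints of the piecewise function).
 * @param predictions Result of the isotonic regression, i.e. the predicted value at each boundary x.
 * @param isotonic Whether the sequence is increasing (true) or decreasing (false).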
 */
@Since("1.3.0")
class IsotonicRegressionModel @Since("1.3.0") (
    @Since("1.3.0") val boundaries: Array[Double],
    @Since("1.3.0") val predictions: Array[Double],
    @Since("1.3.0") val isotonic: Boolean) extends Serializable with Saveable {

  private val predictionOrd = if (isotonic) Ordering[Double] else Ordering[Double].reverse

  require(boundaries.length == predictions.length)
  assertOrdered(boundaries)
  assertOrdered(predictions)(predictionOrd)

  /**
   * A Java-friendly constructor that takes two Iterable parameters and one Boolean parameter.
   */
  @Since("1.4.0")
  def this(boundaries: java.lang.Iterable[Double],
      predictions: java.lang.Iterable[Double],
      isotonic: java.lang.Boolean) = {
    this(boundaries.asScala.toArray, predictions.asScala.toArray, isotonic)
  }

  /** Checks that the sequence is ordered. */
  private def assertOrdered(xs: Array[Double])(implicit ord: Ordering[Double]): Unit = {
    var i = 1
    val len = xs.length
    while (i < len) {
      require(ord.compare(xs(i - 1), xs(i)) <= 0,
        s"Elements (${xs(i - 1)}, ${xs(i)}) are not ordered.")
      i += 1
    }
  }

  /**
   * Predicts labels for the given features using the piecewise linear function.
   *
   * @param testData Features to be labeled.
   * @return Predicted labels.
   *
   */
  @Since("1.3.0")
  def predict(testData: RDD[Double]): RDD[Double] = {
    testData.map(predict)
  }

  /**
   * Predicts labels for the given features using the piecewise linear function.
   *
   * @param testData Features to be labeled.
   * @return Predicted labels.
   *
   */
  @Since("1.3.0")
  def predict(testData: JavaDoubleRDD): JavaDoubleRDD = {
    JavaDoubleRDD.fromRDD(predict(testData.rdd.retag.asInstanceOf[RDD[Double]]))
  }

  /**
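   * Predicts the label for a single feature using the piecewise linear function.
   *
   * @param testData Feature to be labeled.
   * @return Predicted label.
   *         1) If testData exactly matches a boundary, the associated prediction is returned.
   *            If several predictions share that boundary, one of them is returned; which one
   *            is undefined (same as java.util.Arrays.binarySearch).
   *         2) If testData is lower or higher than all boundaries, the first or last
   *            prediction is returned respectively.
   *         3) If testData falls between two boundaries, the prediction is obtained by linear
   *            interpolation of the piecewise function.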
   *
   */
  @Since("1.3.0")
  def predict(testData: Double): Double = {

    def linearInterpolation(x1: Double, y1: Double, x2: Double, y2: Double, x: Double): Double = {
      y1 + (y2 - y1) * (x - x1) / (x2 - x1)
    }

    val foundIndex = binarySearch(boundaries, testData)
    val insertIndex = -foundIndex - 1

    // Find if the index was lower than all values,
    // higher than all values, in between two values or exact match.
    if (insertIndex == 0) {
      predictions.head
    } else if (insertIndex == boundaries.length) {
      predictions.last
    } else if (foundIndex < 0) {
      linearInterpolation(
        boundaries(insertIndex - 1),
        predictions(insertIndex - 1),
        boundaries(insertIndex),
        predictions(insertIndex),
        testData)
    } else {
      predictions(foundIndex)
    }
  }

  /** A convenient method for boundaries called by the Python API. */
  private[mllib] def boundaryVector: Vector = Vectors.dense(boundaries)

  /** A convenient method for predictions called by the Python API. */
  private[mllib] def predictionVector: Vector = Vectors.dense(predictions)

  @Since("1.4.0")
  override def save(sc: SparkContext, path: String): Unit = {
    IsotonicRegressionModel.SaveLoadV1_0.save(sc, path, boundaries, predictions, isotonic)
  }

  override protected def formatVersion: String = "1.0"
}

@Since("1.4.0")
object IsotonicRegressionModel extends Loader[IsotonicRegressionModel] {

  import org.apache.spark.mllib.util.Loader._

  private object SaveLoadV1_0 {

    def thisFormatVersion: String = "1.0"

    /** Hard-code class name string in case it changes in the future */
    def thisClassName: String = "org.apache.spark.mllib.regression.IsotonicRegressionModel"

    /** Model data for model import/export */
    case class Data(boundary: Double, prediction: Double)

    def save(
        sc: SparkContext,
        path: String,
        boundaries: Array[Double],
        predictions: Array[Double],
        isotonic: Boolean): Unit = {
      val sqlContext = SQLContext.getOrCreate(sc)

      val metadata = compact(render(
        ("class" -> thisClassName) ~ ("version" -> thisFormatVersion) ~
          ("isotonic" -> isotonic)))
      sc.parallelize(Seq(metadata), 1).saveAsTextFile(metadataPath(path))

      sqlContext.createDataFrame(
        boundaries.toSeq.zip(predictions).map { case (b, p) => Data(b, p) }
      ).write.parquet(dataPath(path))
    }

    def load(sc: SparkContext, path: String): (Array[Double], Array[Double]) = {
      val sqlContext = SQLContext.getOrCreate(sc)
      val dataRDD = sqlContext.read.parquet(dataPath(path))

      checkSchema[Data](dataRDD.schema)
      val dataArray = dataRDD.select("boundary", "prediction").collect()
      val (boundaries, predictions) = dataArray.map { x =>
        (x.getDouble(0), x.getDouble(1))
      }.toList.sortBy(_._1).unzip
      (boundaries.toArray, predictions.toArray)
    }
  }

  @Since("1.4.0")
  override def load(sc: SparkContext, path: String): IsotonicRegressionModel = {
    implicit val formats = DefaultFormats
    val (loadedClassName, version, metadata) = loadMetadata(sc, path)
    val isotonic = (metadata \ "isotonic").extract[Boolean]
    val classNameV1_0 = SaveLoadV1_0.thisClassName
    (loadedClassName, version) match {
      case (className, "1.0") if className == classNameV1_0 =>
        val (boundaries, predictions) = SaveLoadV1_0.load(sc, path)
        new IsotonicRegressionModel(boundaries, predictions, isotonic)
      case _ => throw new Exception(
        s"IsotonicRegressionModel.load did not recognize model with (className, format version):" +
        s"($loadedClassName, $version).  Supported:\n" +
        s"  ($classNameV1_0, 1.0)"
      )
    }
  }
}

/**
 * Isotonic regression.
 * Currently implemented using parallelized pool adjacent violators algorithm.
 * Only univariate (single feature) algorithm supported.
 *
 * Sequential PAV implementation based on:
 * Tibshirani, Ryan J., Holger Hoefling, and Robert Tibshirani.
 *   "Nearly-isotonic regression." Technometrics 53.1 (2011): 54-61.
 *   Available from [[http://www.stat.cmu.edu/~ryantibs/papers/neariso.pdf]]
 *
 * Sequential PAV parallelization based on:
 * Kearsley, Anthony J., Richard A. Tapia, and Michael W. Trosset.
 *   "An approach to parallelizing isotonic regression."
 *   Applied Mathematics and Parallel Computing. Physica-Verlag HD, 1996. 141-147.
 *   Available from [[http://softlib.rice.edu/pub/CRPC-TRs/reports/CRPC-TR96640.pdf]]
 *
 * @see [[http://en.wikipedia.org/wiki/Isotonic_regression Isotonic regression (Wikipedia)]]
 */
@Since("1.3.0")
class IsotonicRegression private (private var isotonic: Boolean) extends Serializable {

  /**
   * Constructs an IsotonicRegression instance with the default parameter: isotonic = true.
   *
   * @return New instance of IsotonicRegression.
   */
  @Since("1.3.0")
  def this() = this(true)

  /**
   * Sets the isotonic parameter.
   *
   * @param isotonic Whether the sequence is increasing (true) or decreasing (false).
   * @return This instance of IsotonicRegression.
   */
  @Since("1.3.0")
  def setIsotonic(isotonic: Boolean): this.type = {
    this.isotonic = isotonic
    this
  }

  /**
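   * Runs the isotonic regression algorithm to build an isotonic regression model.
   *
   * @param input RDD of tuples (label, feature, weight), where label is the dependent variable
   *              for which the isotonic regression is calculated, feature is the independent
   *              variable, and weight is the weight of the observation (default 1).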
   * @return Isotonic regression model.
   */
  @Since("1.3.0")
  def run(input: RDD[(Double, Double, Double)]): IsotonicRegressionModel = {
    val preprocessedInput = if (isotonic) {
      input
    } else {
      input.map(x => (-x._1, x._2, x._3))
    }

    val pooled = parallelPoolAdjacentViolators(preprocessedInput)

    val predictions = if (isotonic) pooled.map(_._1) else pooled.map(-_._1)
    val boundaries = pooled.map(_._2)

    new IsotonicRegressionModel(boundaries, predictions, isotonic)
  }

  /**
   * Run pool adjacent violators algorithm to obtain isotonic regression model.
   *
   * @param input JavaRDD of tuples (label, feature, weight) where label is dependent variable
   *              for which we calculate isotonic regression, feature is independent variable
   *              and weight represents number of measures with default 1.
   *              If multiple labels share the same feature value then they are ordered before
   *              the algorithm is executed.
   * @return Isotonic regression model.
   */
  @Since("1.3.0")
  def run(input: JavaRDD[(JDouble, JDouble, JDouble)]): IsotonicRegressionModel = {
    run(input.rdd.retag.asInstanceOf[RDD[(Double, Double, Double)]])
  }

  /**
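   * Performs the pool adjacent violators algorithm (PAV).
   *
   * @param input Input data of tuples (label, feature, weight).
   * @return Result tuples (label, feature, weight) where the labels were updated
   *         to form a monotone sequence as per the isotonic regression definition.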
   */
  private def poolAdjacentViolators(
      input: Array[(Double, Double, Double)]): Array[(Double, Double, Double)] = {

    if (input.isEmpty) {
      return Array.empty
    }

    // Pools sub array within given bounds assigning weighted average value to all elements.
    def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = {
      val poolSubArray = input.slice(start, end + 1)

      val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
      val weight = poolSubArray.map(_._3).sum

      var i = start
      while (i <= end) {
        input(i) = (weightedSum / weight, input(i)._2, input(i)._3)
        i = i + 1
      }
    }

    var i = 0
    val len = input.length
    while (i < len) {
      var j = i

      // Find monotonicity violating sequence, if any.
      while (j < len - 1 && input(j)._1 > input(j + 1)._1) {
        j = j + 1
      }

      // If monotonicity was not violated, move to next data point.
      if (i == j) {
        i = i + 1
      } else {
        // Otherwise pool the violating sequence
        // and check if pooling caused monotonicity violation in previously processed points.
        while (i >= 0 && input(i)._1 > input(i + 1)._1) {
          pool(input, i, j)
          i = i - 1
        }

        i = j
      }
    }

    // For points having the same prediction, we only keep two boundary points.
    val compressed = ArrayBuffer.empty[(Double, Double, Double)]

    var (curLabel, curFeature, curWeight) = input.head
    var rightBound = curFeature
    def merge(): Unit = {
      compressed += ((curLabel, curFeature, curWeight))
      if (rightBound > curFeature) {
        compressed += ((curLabel, rightBound, 0.0))
      }
    }
    i = 1
    while (i < input.length) {
      val (label, feature, weight) = input(i)
      if (label == curLabel) {
        curWeight += weight
        rightBound = feature
      } else {
        merge()
        curLabel = label
        curFeature = feature
        curWeight = weight
        rightBound = curFeature
      }
      i += 1
    }
    merge()

    compressed.toArray
  }

  /**
   * Performs the parallel pool adjacent violators algorithm.
   * Runs PAV on each partition and then once more on the merged result.
   * @param input Input data of tuples (label, feature, weight).
   * @return Result tuples (label, feature, weight) where labels were updated
   *         to form a monotone sequence as per isotonic regression definition.
   */
  private def parallelPoolAdjacentViolators(
      input: RDD[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
    val parallelStepResult = input
      .sortBy(x => (x._2, x._1))
      .glom()
      .flatMap(poolAdjacentViolators)
      .collect()
      .sortBy(x => (x._2, x._1)) // Sort again because collect() doesn't promise ordering.
    poolAdjacentViolators(parallelStepResult)
  }
}
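The prediction rule used by the model in the listing above (clamp outside the boundaries, return the stored value on an exact match, interpolate linearly in between) can be tried without a Spark cluster. The following is a standalone sketch of that rule, not the Spark class itself; `PredictSketch` is a name introduced here, and the boundaries and predictions are passed in as plain arrays.

```scala
import java.util.Arrays.binarySearch

object PredictSketch {
  // `boundaries` must be sorted ascending and have the same length as `predictions`.
  def predict(boundaries: Array[Double], predictions: Array[Double], x: Double): Double = {
    require(boundaries.nonEmpty && boundaries.length == predictions.length)
    val found = binarySearch(boundaries, x)
    if (found >= 0) {
      // Exact match on a boundary.
      predictions(found)
    } else {
      // Position at which x would be inserted to keep the array sorted.
      val insert = -found - 1
      if (insert == 0) predictions.head
      else if (insert == boundaries.length) predictions.last
      else {
        // Linear interpolation between the two neighbouring boundaries.
        val (x1, y1) = (boundaries(insert - 1), predictions(insert - 1))
        val (x2, y2) = (boundaries(insert), predictions(insert))
        y1 + (y2 - y1) * (x - x1) / (x2 - x1)
      }
    }
  }
}
```

For boundaries (1, 2, 4) with predictions (10, 20, 40), an input of 3 lies halfway between the last two boundaries, so the sketch returns 30. Inputs below 1 or above 4 are clamped to 10 and 40.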

Spark experiment

import org.apache.spark.mllib.regression.{IsotonicRegression, IsotonicRegressionModel}
import org.apache.spark.{SparkConf, SparkContext}
object IsotonicRegressionExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("IsotonicRegressionExample").setMaster("local")
    val sc = new SparkContext(conf)

    val data = sc.textFile("C:\\Users\\alienware\\IdeaProjects\\sparkCore\\data\\mllib\\sample_isotonic_regression_data.txt")

    // Create label, feature, weight tuples from input data with weight set to default value 1.0.
    val parsedData = data.map { line =>
      val parts = line.split(',').map(_.toDouble)
      (parts(0), parts(1), 1.0)
    }

    // Split data into training (60%) and test (40%) sets.
    val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0)
    val test = splits(1)

    // Create isotonic regression model from training data.
    // Isotonic parameter defaults to true so it is only shown for demonstration
    val model = new IsotonicRegression().setIsotonic(true).run(training)

    // Create tuples of predicted and real labels.
    val predictionAndLabel = test.map { point =>
      val predictedLabel = model.predict(point._2)
      (predictedLabel, point._1)
    }
    //predictionAndLabel.foreach(println)

    /**
      * (0.16868944399999988,0.31208567)
(0.16868944399999988,0.35900051)
(0.16868944399999988,0.03926568)
(0.16868944399999988,0.12952575)
(0.16868944399999988,0.0)
(0.16868944399999988,0.01376849)
(0.16868944399999988,0.13105558)
(0.19545421571428565,0.13717491)
(0.19545421571428565,0.19020908)
(0.19545421571428565,0.19581846)
(0.31718510999999966,0.29576747)
(0.5322114566666667,0.4854666)
(0.5368859433333334,0.49209587)
(0.5602243760000001,0.5017848)
(0.5701674724126985,0.58286588)
(0.5801105688253968,0.64660887)
(0.5900536652380952,0.65782764)
(0.5900536652380952,0.63029067)
(0.5900536652380952,0.63233044)
(0.5900536652380952,0.33299337)
(0.5900536652380952,0.36206017)
(0.5900536652380952,0.56348802)
(0.5900536652380952,0.48393677)
(0.5900536652380952,0.46965834)
(0.5900536652380952,0.45843957)
(0.5900536652380952,0.47118817)
(0.5900536652380952,0.51555329)
(0.5900536652380952,0.56297807)
(0.6881693,0.65119837)
(0.7135390099999999,0.66598674)
(0.861295255,0.91330954)
(0.903875573,0.90719021)
(0.9275879659999999,0.93115757)
(0.9275879659999999,0.91942886)
      */

    // Calculate mean squared error between predicted and real labels.
    val meanSquaredError = predictionAndLabel.map { case (p, l) => math.pow((p - l), 2) }.mean()
    println("Mean Squared Error = " + meanSquaredError)
    //Mean Squared Error = 0.010049744711808193

    // Save and load model
    model.save(sc, "target/tmp/myIsotonicRegressionModel")
    val sameModel = IsotonicRegressionModel.load(sc, "target/tmp/myIsotonicRegressionModel")



  }
}

References

[1] http://wenku.baidu.com/link?url=rbcbI3L7M83F62Aey_kyGZk7kwuJxr5ZW61EqFH5T45umsdZOCrAbfpl8a1yuMyzObd1_kG-kQ9DPcSTl7wnoX6UyNN_gT5bBYh_p1yMgD7url=rbcbI3L7M83F62Aey_kyGZk7kwuJxr5ZW61EqFH5T45umsdZOCrAbfpl8a1yuMyzObd1_kG-kQ9DPcSTl7wnoX6UyNN_gT5bBYh_p1yMgD7

Appendix

Link: http://pan.baidu.com/s/1i4DwQs1 Password: moor

legotime
