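While writing this blog post I looked through some material online and found that reference [1] covers the topic fairly systematically, so I recommend reading it. It does contain some errors, however, so I give a brief summary here. If you get stuck following the derivation there, refer to this summary.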
------------------------------------ Preface
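Background:
(1) In medical dose-response studies, both efficacy and side effects follow a trend as the dose increases: for example, the higher the dose, the greater the efficacy, and the higher the dose, the greater the toxicity.
(2) The toxicity of a drug is evaluated at different dose levels in order to recommend a dose that is both safe and effective for patients. This dose is called the Maximum Tolerated Dose (MTD).
(3) Toxicity is non-decreasing as the dose increases. The MTD is defined as the highest dose level whose toxicity probability does not exceed the target toxicity level.
(4) The toxicity probability at each dose level is estimated from the proportion of patients showing a toxic response at that level. Because these estimates vary from level to level, the estimated probabilities may fail to be a non-decreasing function of the dose level, so we can use isotonic regression to restore monotonicity.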
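To make point (4) concrete, here is a minimal sketch in Scala. The patient counts are hypothetical and chosen only for illustration, and the names `DoseToxicityExample`, `counts` and `isNonDecreasing` are assumptions, not part of any library. It shows raw toxicity rates that violate monotonicity and how pooling the two violating dose levels restores it.

```scala
object DoseToxicityExample {
  // Hypothetical data: (patients treated, patients with a toxic response) per dose level.
  val counts: Vector[(Int, Int)] = Vector((6, 1), (6, 3), (6, 2), (6, 4))

  // Raw estimate of the toxicity probability at each dose level.
  val rawRates: Vector[Double] = counts.map { case (n, tox) => tox.toDouble / n }

  // True when the sequence never decreases from one dose level to the next.
  def isNonDecreasing(xs: Seq[Double]): Boolean =
    xs.zip(xs.drop(1)).forall { case (a, b) => a <= b }

  // Dose levels 2 and 3 (indices 1 and 2) violate the ordering, so pool them:
  // the pooled rate is the average weighted by the number of patients.
  val pooledRate: Double = (3 + 2).toDouble / (6 + 6)
  val adjusted: Vector[Double] = Vector(rawRates(0), pooledRate, pooledRate, rawRates(3))

  def main(args: Array[String]): Unit = {
    println(s"raw rates:      $rawRates, non-decreasing: ${isNonDecreasing(rawRates)}")
    println(s"adjusted rates: $adjusted, non-decreasing: ${isNonDecreasing(adjusted)}")
  }
}
```

Pooling neighbouring violators like this is the basic step of the PAV algorithm used in the Spark source analysed later in this post.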
L2 isotonic regression
L2 isotonic regression algorithm
For the detailed definitions and propositions, see reference [1].
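For orientation, the problem can be written out as follows. This is the standard weighted least-squares formulation of isotonic regression. The symbols $y_i$ (observed labels, ordered by feature), $w_i$ (positive weights) and $x_i$ (fitted values) are notation assumed here and may differ from the notation in reference [1].

```latex
\min_{x \in \mathbb{R}^n} \; \sum_{i=1}^{n} w_i \,(y_i - x_i)^2
\quad \text{subject to} \quad x_1 \le x_2 \le \dots \le x_n .

% Basic pooling step of PAV: if two adjacent values violate the order,
% i.e. y_i > y_{i+1}, both are replaced by their weighted mean
\bar{y} = \frac{w_i \, y_i + w_{i+1} \, y_{i+1}}{w_i + w_{i+1}} ,
% and the pooled block is then treated as one point with weight w_i + w_{i+1}.
```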
Spark source code analysis (see the appendix for the full-size figure)
/**
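* Isotonic regression model.
*
* @param boundaries Array of boundaries used for prediction; it must be sorted (the breakpoints of the piecewise function).
* @param predictions Results of the isotonic regression, i.e. the predicted value at each boundary x.
* @param isotonic Whether the sequence is increasing or decreasing (true means increasing).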
*/
@Since("1.3.0")
class IsotonicRegressionModel @Since("1.3.0") (
@Since("1.3.0") val boundaries: Array[Double],
@Since("1.3.0") val predictions: Array[Double],
@Since("1.3.0") val isotonic: Boolean) extends Serializable with Saveable {
private val predictionOrd = if (isotonic) Ordering[Double] else Ordering[Double].reverse
require(boundaries.length == predictions.length)
assertOrdered(boundaries)
assertOrdered(predictions)(predictionOrd)
/**
* A Java-friendly constructor that takes two Iterable parameters and one Boolean parameter.
*/
@Since("1.4.0")
def this(boundaries: java.lang.Iterable[Double],
predictions: java.lang.Iterable[Double],
isotonic: java.lang.Boolean) = {
this(boundaries.asScala.toArray, predictions.asScala.toArray, isotonic)
}
/** Asserts that the input array is ordered. */
private def assertOrdered(xs: Array[Double])(implicit ord: Ordering[Double]): Unit = {
var i = 1
val len = xs.length
while (i < len) {
require(ord.compare(xs(i - 1), xs(i)) <= 0,
s"Elements (${xs(i - 1)}, ${xs(i)}) are not ordered.")
i += 1
}
}
/**
* Predicts labels for the given features using a piecewise linear function.
*
* @param testData Features to be labeled.
* @return Predicted labels.
*
*/
@Since("1.3.0")
def predict(testData: RDD[Double]): RDD[Double] = {
testData.map(predict)
}
/**
* Predicts labels for the given features using a piecewise linear function.
*
* @param testData Features to be labeled.
* @return Predicted labels.
*
*/
@Since("1.3.0")
def predict(testData: JavaDoubleRDD): JavaDoubleRDD = {
JavaDoubleRDD.fromRDD(predict(testData.rdd.retag.asInstanceOf[RDD[Double]]))
}
/**
* Predicts a single label for the given feature using a piecewise linear function.
*
* @param testData Feature to be labeled.
* @return Predicted label.
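* 1) If testData exactly matches a boundary, the associated prediction is returned. If several predictions share that boundary, one of them is returned; which one is undefined (same as java.util.Arrays.binarySearch).
* 2) If testData is lower or higher than all boundaries, the first or last prediction is returned, respectively.
* 3) If testData falls between two boundaries, the value obtained by linear interpolation of the piecewise function is returned.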
*
*/
@Since("1.3.0")
def predict(testData: Double): Double = {
def linearInterpolation(x1: Double, y1: Double, x2: Double, y2: Double, x: Double): Double = {
y1 + (y2 - y1) * (x - x1) / (x2 - x1)
}
val foundIndex = binarySearch(boundaries, testData)
val insertIndex = -foundIndex - 1
// Find if the index was lower than all values,
// higher than all values, in between two values or exact match.
if (insertIndex == 0) {
predictions.head
} else if (insertIndex == boundaries.length) {
predictions.last
} else if (foundIndex < 0) {
linearInterpolation(
boundaries(insertIndex - 1),
predictions(insertIndex - 1),
boundaries(insertIndex),
predictions(insertIndex),
testData)
} else {
predictions(foundIndex)
}
}
/** A convenient method for boundaries called by the Python API. */
private[mllib] def boundaryVector: Vector = Vectors.dense(boundaries)
/** A convenient method for boundaries called by the Python API. */
private[mllib] def predictionVector: Vector = Vectors.dense(predictions)
@Since("1.4.0")
override def save(sc: SparkContext, path: String): Unit = {
IsotonicRegressionModel.SaveLoadV1_0.save(sc, path, boundaries, predictions, isotonic)
}
override protected def formatVersion: String = "1.0"
}
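The three prediction cases handled by predict above can be tried without Spark. The following is a minimal standalone sketch, not the library class itself. The object name `PiecewiseLinearPredict` and the sample arrays are assumptions for illustration; only `java.util.Arrays.binarySearch` is a real API, and it is the same search the source relies on.

```scala
import java.util.Arrays.binarySearch

object PiecewiseLinearPredict {
  // boundaries must be sorted in increasing order and have the same length as predictions.
  def predict(boundaries: Array[Double], predictions: Array[Double], x: Double): Double = {
    require(boundaries.nonEmpty && boundaries.length == predictions.length)
    // binarySearch returns the index on an exact match, otherwise -(insertionPoint) - 1.
    val foundIndex = binarySearch(boundaries, x)
    val insertIndex = -foundIndex - 1
    if (foundIndex >= 0) {
      predictions(foundIndex)            // case 1: exact match on a boundary
    } else if (insertIndex == 0) {
      predictions.head                   // case 2: below every boundary
    } else if (insertIndex == boundaries.length) {
      predictions.last                   // case 2: above every boundary
    } else {
      // case 3: linear interpolation between the two neighbouring boundaries
      val (x1, y1) = (boundaries(insertIndex - 1), predictions(insertIndex - 1))
      val (x2, y2) = (boundaries(insertIndex), predictions(insertIndex))
      y1 + (y2 - y1) * (x - x1) / (x2 - x1)
    }
  }

  def main(args: Array[String]): Unit = {
    val boundaries = Array(1.0, 2.0, 4.0)   // hypothetical model
    val predictions = Array(0.0, 1.0, 3.0)
    Seq(0.5, 2.0, 3.0, 5.0).foreach { x =>
      println(s"predict($x) = ${predict(boundaries, predictions, x)}")
    }
  }
}
```

The branch order differs from the source (the exact-match test comes first), but the result is the same for every input because an exact match always yields a non-negative foundIndex.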
@Since("1.4.0")
object IsotonicRegressionModel extends Loader[IsotonicRegressionModel] {
import org.apache.spark.mllib.util.Loader._
private object SaveLoadV1_0 {
def thisFormatVersion: String = "1.0"
/** Hard-code class name string in case it changes in the future */
def thisClassName: String = "org.apache.spark.mllib.regression.IsotonicRegressionModel"
/** Model data for model import/export */
case class Data(boundary: Double, prediction: Double)
def save(
sc: SparkContext,
path: String,
boundaries: Array[Double],
predictions: Array[Double],
isotonic: Boolean): Unit = {
val sqlContext = SQLContext.getOrCreate(sc)
val metadata = compact(render(
("class" -> thisClassName) ~ ("version" -> thisFormatVersion) ~
("isotonic" -> isotonic)))
sc.parallelize(Seq(metadata), 1).saveAsTextFile(metadataPath(path))
sqlContext.createDataFrame(
boundaries.toSeq.zip(predictions).map { case (b, p) => Data(b, p) }
).write.parquet(dataPath(path))
}
def load(sc: SparkContext, path: String): (Array[Double], Array[Double]) = {
val sqlContext = SQLContext.getOrCreate(sc)
val dataRDD = sqlContext.read.parquet(dataPath(path))
checkSchema[Data](dataRDD.schema)
val dataArray = dataRDD.select("boundary", "prediction").collect()
val (boundaries, predictions) = dataArray.map { x =>
(x.getDouble(0), x.getDouble(1))
}.toList.sortBy(_._1).unzip
(boundaries.toArray, predictions.toArray)
}
}
@Since("1.4.0")
override def load(sc: SparkContext, path: String): IsotonicRegressionModel = {
implicit val formats = DefaultFormats
val (loadedClassName, version, metadata) = loadMetadata(sc, path)
val isotonic = (metadata \ "isotonic").extract[Boolean]
val classNameV1_0 = SaveLoadV1_0.thisClassName
(loadedClassName, version) match {
case (className, "1.0") if className == classNameV1_0 =>
val (boundaries, predictions) = SaveLoadV1_0.load(sc, path)
new IsotonicRegressionModel(boundaries, predictions, isotonic)
case _ => throw new Exception(
s"IsotonicRegressionModel.load did not recognize model with (className, format version):" +
s"($loadedClassName, $version). Supported:\n" +
s" ($classNameV1_0, 1.0)"
)
}
}
}
/**
* Isotonic regression.
* Currently implemented using parallelized pool adjacent violators algorithm.
* Only univariate (single feature) algorithm supported.
*
* Sequential PAV implementation based on:
* Tibshirani, Ryan J., Holger Hoefling, and Robert Tibshirani.
* "Nearly-isotonic regression." Technometrics 53.1 (2011): 54-61.
* Available from [[http://www.stat.cmu.edu/~ryantibs/papers/neariso.pdf]]
*
* Sequential PAV parallelization based on:
* Kearsley, Anthony J., Richard A. Tapia, and Michael W. Trosset.
* "An approach to parallelizing isotonic regression."
* Applied Mathematics and Parallel Computing. Physica-Verlag HD, 1996. 141-147.
* Available from [[http://softlib.rice.edu/pub/CRPC-TRs/reports/CRPC-TR96640.pdf]]
*
* @see [[http://en.wikipedia.org/wiki/Isotonic_regression Isotonic regression (Wikipedia)]]
*/
@Since("1.3.0")
class IsotonicRegression private (private var isotonic: Boolean) extends Serializable {
/**
* Constructs an IsotonicRegression instance with the default parameter: isotonic = true.
*
* @return New instance of IsotonicRegression.
*/
@Since("1.3.0")
def this() = this(true)
/**
* Sets the isotonic parameter.
*
* @param isotonic Whether the sequence is increasing (true) or decreasing (false).
* @return This instance of IsotonicRegression.
*/
@Since("1.3.0")
def setIsotonic(isotonic: Boolean): this.type = {
this.isotonic = isotonic
this
}
/**
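* Runs the isotonic regression algorithm to build an isotonic regression model.
* @param input RDD of tuples (label, feature, weight), where label is the dependent variable for which the isotonic regression is computed,
* feature is the independent variable, and weight is the number of measures (default 1).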
* @return Isotonic regression model.
*/
@Since("1.3.0")
def run(input: RDD[(Double, Double, Double)]): IsotonicRegressionModel = {
val preprocessedInput = if (isotonic) {
input
} else {
input.map(x => (-x._1, x._2, x._3))
}
val pooled = parallelPoolAdjacentViolators(preprocessedInput)
val predictions = if (isotonic) pooled.map(_._1) else pooled.map(-_._1)
val boundaries = pooled.map(_._2)
new IsotonicRegressionModel(boundaries, predictions, isotonic)
}
/**
* Run pool adjacent violators algorithm to obtain isotonic regression model.
*
* @param input JavaRDD of tuples (label, feature, weight) where label is dependent variable
* for which we calculate isotonic regression, feature is independent variable
* and weight represents number of measures with default 1.
* If multiple labels share the same feature value then they are ordered before
* the algorithm is executed.
* @return Isotonic regression model.
*/
@Since("1.3.0")
def run(input: JavaRDD[(JDouble, JDouble, JDouble)]): IsotonicRegressionModel = {
run(input.rdd.retag.asInstanceOf[RDD[(Double, Double, Double)]])
}
/**
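* Performs a pool adjacent violators algorithm (PAV).
* @param input Input data of tuples (label, feature, weight).
* @return Result tuples (label, feature, weight) whose labels form an ordered sequence, as per the isotonic regression definition.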
*/
private def poolAdjacentViolators(
input: Array[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
if (input.isEmpty) {
return Array.empty
}
// Pools sub array within given bounds assigning weighted average value to all elements.
def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = {
val poolSubArray = input.slice(start, end + 1)
val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
val weight = poolSubArray.map(_._3).sum
var i = start
while (i <= end) {
input(i) = (weightedSum / weight, input(i)._2, input(i)._3)
i = i + 1
}
}
var i = 0
val len = input.length
while (i < len) {
var j = i
// Find monotonicity violating sequence, if any.
while (j < len - 1 && input(j)._1 > input(j + 1)._1) {
j = j + 1
}
// If monotonicity was not violated, move to next data point.
if (i == j) {
i = i + 1
} else {
// Otherwise pool the violating sequence
// and check if pooling caused monotonicity violation in previously processed points.
while (i >= 0 && input(i)._1 > input(i + 1)._1) {
pool(input, i, j)
i = i - 1
}
i = j
}
}
// For points having the same prediction, we only keep two boundary points.
val compressed = ArrayBuffer.empty[(Double, Double, Double)]
var (curLabel, curFeature, curWeight) = input.head
var rightBound = curFeature
def merge(): Unit = {
compressed += ((curLabel, curFeature, curWeight))
if (rightBound > curFeature) {
compressed += ((curLabel, rightBound, 0.0))
}
}
i = 1
while (i < input.length) {
val (label, feature, weight) = input(i)
if (label == curLabel) {
curWeight += weight
rightBound = feature
} else {
merge()
curLabel = label
curFeature = feature
curWeight = weight
rightBound = curFeature
}
i += 1
}
merge()
compressed.toArray
}
/**
* Performs a parallel pool adjacent violators algorithm.
* Runs PAV on each partition, then merges the results and runs PAV once more.
* @param input Input data of tuples (label, feature, weight).
* @return Result tuples (label, feature, weight) where labels were updated
* to form a monotone sequence as per isotonic regression definition.
*/
private def parallelPoolAdjacentViolators(
input: RDD[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
val parallelStepResult = input
.sortBy(x => (x._2, x._1))
.glom()
.flatMap(poolAdjacentViolators)
.collect()
.sortBy(x => (x._2, x._1)) // Sort again because collect() doesn't promise ordering.
poolAdjacentViolators(parallelStepResult)
}
}
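To experiment with the pooling idea outside Spark, here is a minimal sequential sketch. It is not the Spark implementation: instead of pooling in place over an array, it keeps a stack of blocks and merges the top block whenever its mean exceeds the mean of the incoming one. The names `PavSketch`, `Block`, `fit` and `fitDecreasing` are assumptions made for this illustration. The decreasing case negates the labels before and after fitting, which mirrors what run does when isotonic is false.

```scala
object PavSketch {
  // A block of pooled points: weighted sum of labels, total weight, number of points.
  private final case class Block(weightedSum: Double, weight: Double, count: Int) {
    def mean: Double = weightedSum / weight
    def merge(other: Block): Block =
      Block(weightedSum + other.weightedSum, weight + other.weight, count + other.count)
  }

  /** Fits non-decreasing values to labels that are already ordered by feature. */
  def fit(labels: Seq[Double], weights: Seq[Double]): Vector[Double] = {
    require(labels.length == weights.length, "labels and weights must have the same length")
    require(weights.forall(_ > 0.0), "weights must be positive")
    var stack = List.empty[Block] // head is the most recently added block
    for ((y, w) <- labels.zip(weights)) {
      var current = Block(y * w, w, 1)
      // Pool backwards while the previous block violates monotonicity.
      while (stack.nonEmpty && stack.head.mean > current.mean) {
        current = stack.head.merge(current)
        stack = stack.tail
      }
      stack = current :: stack
    }
    // Expand each block back to one fitted value per original point.
    stack.reverse.flatMap(b => List.fill(b.count)(b.mean)).toVector
  }

  /** Fits non-increasing values by negating the labels, fitting, and negating back. */
  def fitDecreasing(labels: Seq[Double], weights: Seq[Double]): Vector[Double] =
    fit(labels.map(-_), weights).map(-_)

  def main(args: Array[String]): Unit = {
    val ones = Seq(1.0, 1.0, 1.0, 1.0)
    println(fit(Seq(1.0, 3.0, 2.0, 4.0), ones))
    println(fitDecreasing(Seq(3.0, 1.0, 2.0), Seq(1.0, 1.0, 1.0)))
  }
}
```

Unlike the Spark code, this sketch returns one fitted value per input point and does not compress runs of equal predictions into two boundary points.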
Spark experiment
import org.apache.spark.mllib.regression.{IsotonicRegression, IsotonicRegressionModel}
import org.apache.spark.{SparkConf, SparkContext}

object IsotonicRegressionExample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("IsotonicRegressionExample").setMaster("local")
    val sc = new SparkContext(conf)
    val data = sc.textFile("C:\\Users\\alienware\\IdeaProjects\\sparkCore\\data\\mllib\\sample_isotonic_regression_data.txt")

    // Create label, feature, weight tuples from input data with weight set to default value 1.0.
    val parsedData = data.map { line =>
      val parts = line.split(',').map(_.toDouble)
      (parts(0), parts(1), 1.0)
    }

    // Split data into training (60%) and test (40%) sets.
    val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0)
    val test = splits(1)

    // Create isotonic regression model from training data.
    // Isotonic parameter defaults to true so it is only shown for demonstration
    val model = new IsotonicRegression().setIsotonic(true).run(training)

    // Create tuples of predicted and real labels.
    val predictionAndLabel = test.map { point =>
      val predictedLabel = model.predict(point._2)
      (predictedLabel, point._1)
    }
    //predictionAndLabel.foreach(println)
    /**
     * (0.16868944399999988,0.31208567)
     * (0.16868944399999988,0.35900051)
     * (0.16868944399999988,0.03926568)
     * (0.16868944399999988,0.12952575)
     * (0.16868944399999988,0.0)
     * (0.16868944399999988,0.01376849)
     * (0.16868944399999988,0.13105558)
     * (0.19545421571428565,0.13717491)
     * (0.19545421571428565,0.19020908)
     * (0.19545421571428565,0.19581846)
     * (0.31718510999999966,0.29576747)
     * (0.5322114566666667,0.4854666)
     * (0.5368859433333334,0.49209587)
     * (0.5602243760000001,0.5017848)
     * (0.5701674724126985,0.58286588)
     * (0.5801105688253968,0.64660887)
     * (0.5900536652380952,0.65782764)
     * (0.5900536652380952,0.63029067)
     * (0.5900536652380952,0.63233044)
     * (0.5900536652380952,0.33299337)
     * (0.5900536652380952,0.36206017)
     * (0.5900536652380952,0.56348802)
     * (0.5900536652380952,0.48393677)
     * (0.5900536652380952,0.46965834)
     * (0.5900536652380952,0.45843957)
     * (0.5900536652380952,0.47118817)
     * (0.5900536652380952,0.51555329)
     * (0.5900536652380952,0.56297807)
     * (0.6881693,0.65119837)
     * (0.7135390099999999,0.66598674)
     * (0.861295255,0.91330954)
     * (0.903875573,0.90719021)
     * (0.9275879659999999,0.93115757)
     * (0.9275879659999999,0.91942886)
     */

    // Calculate mean squared error between predicted and real labels.
    val meanSquaredError = predictionAndLabel.map { case (p, l) => math.pow((p - l), 2) }.mean()
    println("Mean Squared Error = " + meanSquaredError)
    //Mean Squared Error = 0.010049744711808193

    // Save and load model
    model.save(sc, "target/tmp/myIsotonicRegressionModel")
    val sameModel = IsotonicRegressionModel.load(sc, "target/tmp/myIsotonicRegressionModel")
  }
}
References
[1] http://wenku.baidu.com/link?url=rbcbI3L7M83F62Aey_kyGZk7kwuJxr5ZW61EqFH5T45umsdZOCrAbfpl8a1yuMyzObd1_kG-kQ9DPcSTl7wnoX6UyNN_gT5bBYh_p1yMgD7
Appendix
Link: http://pan.baidu.com/s/1i4DwQs1 Password: moor