30. Machine Learning in Practice: Naive Bayes Binary Classification

1. Introduction to Naive Bayes

      Naive Bayes analysis is based on Bayes' theorem: it uses probability estimates computed from the data to decide which class an unlabeled record most likely belongs to.

      Naive Bayes classification uses the probabilities relating the features in the data to the labels as its basis for classification.

      Take humidity (feature), air pressure (feature), and wind direction (feature) versus whether it will rain (label) as an example:

P(high pressure, humidity 51~60, west wind, rain) = P(rain) × P(high | rain) × P(51~60 | rain) × P(west | rain) = (6/10) × (2/6) × (2/6) × (2/6) ≈ 0.02222

P(high pressure, humidity 51~60, west wind, no rain) = P(no rain) × P(high | no rain) × P(51~60 | no rain) × P(west | no rain) = (4/10) × (3/4) × (2/4) × (2/4) = 0.075

Since 0.075 > 0.02222, the classifier predicts "no rain" for this combination of features.
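The two class scores above can be checked with a few lines of Python (a standalone sketch, independent of Spark):

```python
def naive_bayes_score(prior, conditionals):
    # multiply the class prior by each conditional feature probability
    score = prior
    for p in conditionals:
        score *= p
    return score

# P(rain) * P(high | rain) * P(51~60 | rain) * P(west | rain)
p_rain = naive_bayes_score(6 / 10.0, [2 / 6.0, 2 / 6.0, 2 / 6.0])
# P(no rain) * P(high | no rain) * P(51~60 | no rain) * P(west | no rain)
p_no_rain = naive_bayes_score(4 / 10.0, [3 / 4.0, 2 / 4.0, 2 / 4.0])

print(round(p_rain, 5))     # 0.02222
print(round(p_no_rain, 5))  # 0.075
```

The larger product wins, so the prediction is "no rain".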

2. Implementation with MLlib

               model = NaiveBayes.train(input, lambda)   # returns a NaiveBayesModel

Parameter   Description
input       the training data (an RDD of LabeledPoint)
lambda      the additive (Laplace) smoothing parameter; default 1.0
            (in PySpark it is passed positionally, since lambda is a reserved word in Python)
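The lambda parameter is the additive (Laplace) smoothing constant applied when estimating the per-class feature probabilities. The following pure-Python toy sketch (an illustration of the idea, not MLlib's actual implementation) shows where lambda enters a multinomial naive Bayes estimate:

```python
import math

def train_multinomial_nb(rows, labels, lam=1.0):
    # rows: feature-count vectors; labels: class labels.
    # lam plays the role of MLlib's lambda parameter.
    n_features = len(rows[0])
    log_prior, log_cond = {}, {}
    for c in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == c]
        log_prior[c] = math.log(len(members) / float(len(rows)))
        totals = [sum(r[j] for r in members) for j in range(n_features)]
        denom = float(sum(totals)) + lam * n_features        # lam smooths the denominator...
        log_cond[c] = [math.log((t + lam) / denom) for t in totals]  # ...and each count
    return log_prior, log_cond

def predict(row, log_prior, log_cond):
    # argmax over classes of log P(c) + sum_j x_j * log P(feature j | c)
    return max(log_prior,
               key=lambda c: log_prior[c] +
                   sum(x * lp for x, lp in zip(row, log_cond[c])))
```

With lam > 0, a feature never seen for a class still gets a small nonzero probability, so a single unseen feature cannot zero out the whole product.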
# -*- coding: UTF-8 -*-
import sys
from time import time
import pandas as pd
import matplotlib.pyplot as plt
from pyspark import SparkConf, SparkContext
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.regression import LabeledPoint
import numpy as np
from pyspark.mllib.evaluation import BinaryClassificationMetrics
from pyspark.mllib.feature import StandardScaler


def SetLogger(sc):
    logger = sc._jvm.org.apache.log4j
    logger.LogManager.getLogger("org").setLevel(logger.Level.ERROR)
    logger.LogManager.getLogger("akka").setLevel(logger.Level.ERROR)
    logger.LogManager.getRootLogger().setLevel(logger.Level.ERROR)

def SetPath(sc):
    global Path
    if sc.master[0:5] == "local":
        Path = "file:/home/hduser/pythonwork/PythonProject/"
    else:
        Path = "hdfs://master:9000/user/hduser/"
# To run in cluster mode (Hadoop YARN or Spark standalone), first upload the data files to the HDFS directory as described in the book.

def get_mapping(rdd, idx):
    return rdd.map(lambda fields: fields[idx]).distinct().zipWithIndex().collectAsMap()

def extract_label(record):
    # the label is the last column of the record
    label = record[-1]
    return float(label)

def extract_features(field, categoriesMap, featureEnd):
    # one-hot encode the category feature (column 3)
    categoryIdx = categoriesMap[field[3]]
    categoryFeatures = np.zeros(len(categoriesMap))
    categoryFeatures[categoryIdx] = 1
    # numeric features: columns 4 .. featureEnd
    numericalFeatures = [convert_float(x) for x in field[4:featureEnd]]
    return np.concatenate((categoryFeatures, numericalFeatures))

def convert_float(x):
    # treat missing values ("?") and negative values as 0
    ret = 0 if x == "?" else float(x)
    return 0 if ret < 0 else ret


def PrepareData(sc):
    #----------------------1. Import and transform the data-------------
    print("Importing data...")
    rawDataWithHeader = sc.textFile(Path+"data/train.tsv")
    header = rawDataWithHeader.first()
    rawData = rawDataWithHeader.filter(lambda x: x != header)
    rData = rawData.map(lambda x: x.replace("\"", ""))
    lines = rData.map(lambda x: x.split("\t"))
    print("Total: " + str(lines.count()) + " records")
    #----------------------2. Build the RDD[LabeledPoint] needed for training and evaluation-------------
    print "Before standardization:",
    categoriesMap = lines.map(lambda fields: fields[3]). \
                                        distinct().zipWithIndex().collectAsMap()
    labelRDD = lines.map(lambda r: extract_label(r))
    featureRDD = lines.map(lambda r: extract_features(r, categoriesMap, len(r) - 1))
    for i in featureRDD.first():
        print (str(i)+","),
    print ""

    print "After standardization:",
    stdScaler = StandardScaler(withMean=False, withStd=True).fit(featureRDD)
    ScalerFeatureRDD = stdScaler.transform(featureRDD)
    for i in ScalerFeatureRDD.first():
        print (str(i)+","),

    labelpoint = labelRDD.zip(ScalerFeatureRDD)
    labelpointRDD = labelpoint.map(lambda r: LabeledPoint(r[0], r[1]))

    #----------------------3. Randomly split the data into 3 parts and return-------------
    (trainData, validationData, testData) = labelpointRDD.randomSplit([8, 1, 1])
    print("Split into trainData:" + str(trainData.count()) +
              "   validationData:" + str(validationData.count()) +
              "   testData:" + str(testData.count()))
    return (trainData, validationData, testData, categoriesMap)  # return the data

    
def PredictData(sc, model, categoriesMap):
    print("Importing data...")
    rawDataWithHeader = sc.textFile(Path+"data/test.tsv")
    header = rawDataWithHeader.first()
    rawData = rawDataWithHeader.filter(lambda x: x != header)
    rData = rawData.map(lambda x: x.replace("\"", ""))
    lines = rData.map(lambda x: x.split("\t"))
    print("Total: " + str(lines.count()) + " records")
    # the test data has no label column, so the features run to the end of the record
    dataRDD = lines.map(lambda r: (r[0],
                            extract_features(r, categoriesMap, len(r))))
    DescDict = {
           0: "ephemeral web page",
           1: "evergreen web page"
     }
    for data in dataRDD.take(10):
        predictResult = model.predict(data[1])
        print " URL: " + str(data[0]) + "\n" + \
                  "             ==> prediction: " + str(predictResult) + \
                  " description: " + DescDict[predictResult] + "\n"

def evaluateModel(model, validationData):
    score = model.predict(validationData.map(lambda p: p.features))
    scoreAndLabels = score.zip(validationData.map(lambda p: p.label)) \
                          .map(lambda (x, y): (float(x), float(y)))
    metrics = BinaryClassificationMetrics(scoreAndLabels)
    return metrics.areaUnderROC


def trainEvaluateModel(trainData, validationData, lambdaParam):
    startTime = time()
    model = NaiveBayes.train(trainData, lambdaParam)
    AUC = evaluateModel(model, validationData)
    duration = time() - startTime
    print "Train & evaluate: lambda=" + str(lambdaParam) + \
          " time=" + str(duration) + \
          " AUC = " + str(AUC)
    return (AUC, duration, lambdaParam, model)


def evalParameter(trainData, validationData, evalparm,
                  lambdaParamList):
    metrics = [trainEvaluateModel(trainData, validationData, lambdaParam)
               for lambdaParam in lambdaParamList]
    df = pd.DataFrame(metrics, index=lambdaParamList,
            columns=['AUC', 'duration', 'lambdaParam', 'model'])
    showchart(df, evalparm, 'AUC', 'duration', 0.5, 0.7)
    
def showchart(df,evalparm ,barData,lineData,yMin,yMax):
    ax = df[barData].plot(kind='bar', title =evalparm,figsize=(10,6),legend=True, fontsize=12)
    ax.set_xlabel(evalparm,fontsize=12)
    ax.set_ylim([yMin,yMax])
    ax.set_ylabel(barData,fontsize=12)
    ax2 = ax.twinx()
    ax2.plot(df[[lineData ]].values, linestyle='-', marker='o', linewidth=2.0,color='r')
    plt.show()

def evalAllParameter(trainData, validationData, lambdaParamList):
    metrics = [trainEvaluateModel(trainData, validationData, lambdaParam)
                        for lambdaParam in lambdaParamList]
    Smetrics = sorted(metrics, key=lambda k: k[0], reverse=True)
    bestParameter = Smetrics[0]
    print("Best parameter after tuning: lambdaParam=" + str(bestParameter[2]) +
          ", AUC = " + str(bestParameter[0]))
    return bestParameter[3]

    
def parametersEval(trainData, validationData):
    print("----- Evaluating the lambda parameter ---------")
    evalParameter(trainData, validationData, "lambdaParam",
            lambdaParamList=[1.0, 3.0, 5.0, 15.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 60.0])
         


def CreateSparkContext():
    sparkConf = SparkConf()                                                       \
                         .setAppName("RunNaiveBayesBinary")                         \
                         .set("spark.ui.showConsoleProgress", "false") 
    sc = SparkContext(conf = sparkConf)
    print ("master="+sc.master)    
    SetLogger(sc)
    SetPath(sc)
    return (sc)

if __name__ == "__main__":
    print("RunNaiveBayesBinary")
    sc = CreateSparkContext()
    print("========== Data preparation ==========")
    (trainData, validationData, testData, categoriesMap) = PrepareData(sc)
    trainData.persist(); validationData.persist(); testData.persist()
    print("========== Training & evaluation ==========")

    (AUC, duration, lambdaParam, model) = \
            trainEvaluateModel(trainData, validationData, 60.0)

    #if (len(sys.argv) == 2) and (sys.argv[1] == "-e"):
    #    parametersEval(trainData, validationData)
    #elif (len(sys.argv) == 2) and (sys.argv[1] == "-a"):
    print("----- Evaluating all parameters to find the best combination ---------")
    model = evalAllParameter(trainData, validationData,
                             [1.0, 3.0, 5.0, 15.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 60.0])

    print("========== Test phase ==========")
    auc = evaluateModel(model, testData)
    print("Testing the best model on the test data, AUC: " + str(auc))
    print("========== Prediction ==========")
    PredictData(sc, model, categoriesMap)
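evaluateModel above hands (score, label) pairs to BinaryClassificationMetrics and reads areaUnderROC. Conceptually, AUC is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. A minimal pure-Python sketch of that rank-based definition (not MLlib's threshold-sweep implementation):

```python
def area_under_roc(score_and_labels):
    # AUC as the probability that a random positive example scores higher
    # than a random negative one; ties count as half a win
    pos = [s for s, y in score_and_labels if y == 1.0]
    neg = [s for s, y in score_and_labels if y == 0.0]
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0)
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 means the scores carry no ranking information; 1.0 means every positive outranks every negative. The ~0.63 values in the run below indicate a modest but real signal.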
master=spark://master:7077
========== Data preparation ==========
Importing data...
Total: 7395 records
Before standardization: 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.789131, 2.055555556, 0.676470588, 0.205882353, 0.047058824, 0.023529412, 0.443783175, 0.0, 0.0, 0.09077381, 0.0, 0.245831182, 0.003883495, 1.0, 1.0, 24.0, 0.0, 5424.0, 170.0, 8.0, 0.152941176, 0.079129575, 
After standardization: 0.0, 3.088234470373542, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.382109905825835, 0.23846925874240185, 3.3301793603170884, 1.4030154477199817, 0.49030735543205883, 0.323968347812901, 0.07779782993228768, 0.0, 0.0, 2.190189633120838, 0.0, 4.683697355817031, 0.002066886525539329, 2.0555083992759973, 2.1113276370640266, 1.17686860239841, 0.0, 0.6111251528371162, 0.9472535877741962, 2.474396677695976, 0.8344415706300091, 0.998721352144073, Split into trainData:5924   validationData:740   testData:731
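The "after standardization" values above come from StandardScaler(withMean=False, withStd=True): each feature column is divided by its standard deviation, without centering, so zero entries stay zero and sparsity is preserved. A small sketch of the idea for a single column (assuming, as in MLlib, the unbiased sample standard deviation):

```python
import math

def scale_column(values):
    # withMean=False, withStd=True: divide by the sample standard deviation
    # without subtracting the mean, so zero entries remain zero
    mean = sum(values) / float(len(values))
    var = sum((v - mean) ** 2 for v in values) / float(len(values) - 1)
    std = math.sqrt(var)
    return [v / std for v in values]
```

Scaling matters here because NaiveBayes.train requires non-negative feature values and the raw columns differ by several orders of magnitude.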
========== Training & evaluation ==========
Train & evaluate: lambda=60.0 time=18.0418760777 AUC = 0.6375
----- Evaluating all parameters to find the best combination ---------
Train & evaluate: lambda=1.0 time=1.85026407242 AUC = 0.634868421053
Train & evaluate: lambda=3.0 time=1.76581716537 AUC = 0.634868421053
Train & evaluate: lambda=5.0 time=1.58911108971 AUC = 0.634868421053
Train & evaluate: lambda=15.0 time=1.46647286415 AUC = 0.634868421053
Train & evaluate: lambda=25.0 time=1.38512301445 AUC = 0.634868421053
Train & evaluate: lambda=30.0 time=1.35139298439 AUC = 0.634868421053
Train & evaluate: lambda=35.0 time=1.21870493889 AUC = 0.634868421053
Train & evaluate: lambda=40.0 time=1.15822100639 AUC = 0.634868421053
Train & evaluate: lambda=45.0 time=1.11964797974 AUC = 0.634868421053
Train & evaluate: lambda=50.0 time=1.13146996498 AUC = 0.634868421053
Train & evaluate: lambda=60.0 time=1.08395195007 AUC = 0.6375
Best parameter after tuning: lambdaParam=60.0, AUC = 0.6375
========== Test phase ==========
Testing the best model on the test data, AUC: 0.644337616438
========== Prediction ==========
Importing data...
Total: 3171 records
 URL: http://www.lynnskitchenadventures.com/2009/04/homemade-enchilada-sauce.html
             ==> prediction: 1.0 description: evergreen web page

 URL: http://lolpics.se/18552-stun-grenade-ar
             ==> prediction: 1.0 description: evergreen web page

 URL: http://www.xcelerationfitness.com/treadmills.html
             ==> prediction: 1.0 description: evergreen web page

 URL: http://www.bloomberg.com/news/2012-02-06/syria-s-assad-deploys-tactics-of-father-to-crush-revolt-threatening-reign.html
             ==> prediction: 1.0 description: evergreen web page

 URL: http://www.wired.com/gadgetlab/2011/12/stem-turns-lemons-and-limes-into-juicy-atomizers/
             ==> prediction: 1.0 description: evergreen web page

 URL: http://www.latimes.com/health/boostershots/la-heb-fat-tax-denmark-20111013,0,2603132.story
             ==> prediction: 1.0 description: evergreen web page

 URL: http://www.howlifeworks.com/a/a?AG_ID=1186&cid=7340ci
             ==> prediction: 1.0 description: evergreen web page

 URL: http://romancingthestoveblog.wordpress.com/2010/01/13/sweet-potato-ravioli-with-lemon-sage-brown-butter-sauce/
             ==> prediction: 1.0 description: evergreen web page

 URL: http://www.funniez.net/Funny-Pictures/turn-men-down.html
             ==> prediction: 1.0 description: evergreen web page

 URL: http://youfellasleepwatchingadvd.com/
             ==> prediction: 1.0 description: evergreen web page
