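For continuous attributes, the likelihood can be modelled with a probability density function (for discrete attributes, simple counting is enough).
Bayesian statistics rests on Bayes' theorem:
$$p(y \mid x) = \frac{p(x \mid y)\,p(y)}{p(x)}$$
1) Continuous attributes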
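Example 1: The following data describe children and adults; in each pair the first number is the height and the second is the weight. Use the data to decide whether the new points (120, 120) and (165, 110) are adults or children.
First, we assume that height and weight are conditionally independent given the class, that is, each attribute influences the decision on its own. Naive Bayes can then be combined with one Gaussian distribution per attribute, and the likelihood of a class is the product of the likelihoods of its attributes:
$$p(x \mid y) = \prod_{i} p(x_i \mid y), \qquad p(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{i,y}^2}} \exp\!\left(-\frac{(x_i-\mu_{i,y})^2}{2\sigma_{i,y}^2}\right)$$
Analysis: the goal is the posterior distribution, so we first compute the prior and the likelihood.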
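Written out for the two classes of this example, the posterior takes the form sketched here. The shorthand $a$ (adult), $c$ (child), $h$ (height) and $w$ (weight) is introduced for illustration; the denominator is the evidence $p(h, w)$ expanded over both classes, which is the same normalisation the full program applies when it computes p_a and p_c.

```latex
\begin{aligned}
p(a \mid h, w) &= \frac{p(a)\,p(h \mid a)\,p(w \mid a)}
                       {p(a)\,p(h \mid a)\,p(w \mid a) + p(c)\,p(h \mid c)\,p(w \mid c)}, \\
p(c \mid h, w) &= 1 - p(a \mid h, w).
\end{aligned}
```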
(1) Obtain the prior directly by counting
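A minimal sketch of the counting step, assuming the class labels are available as a plain list. The labels and the helper name class_priors are hypothetical and only illustrate the idea; they are not the article's data set.

```python
from collections import Counter

# Hypothetical class labels, for illustration only (not the article's data).
labels = ["child"] * 6 + ["adult"] * 4


def class_priors(labels):
    """Estimate p(y) for each class as its relative frequency."""
    counts = Counter(labels)
    total = len(labels)
    return {y: n / total for y, n in counts.items()}


priors = class_priors(labels)
print(priors)
```

The full program does the same thing with len(height_a) and len(height_c): each prior is the number of samples in the class divided by the total number of samples.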
(2) Use the data set to estimate the two parameters of each Gaussian distribution: the mean and the variance.
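The variance in this article's program divides by len(ob), so it is the population (maximum-likelihood) variance rather than the sample variance. The sketch here shows the difference using Python's standard statistics module; the three heights are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import statistics

# Hypothetical heights, for illustration only (not the article's data).
heights = [160.0, 170.0, 180.0]

mean = statistics.fmean(heights)            # arithmetic mean
pop_var = statistics.pvariance(heights)     # divides by n
sample_var = statistics.variance(heights)   # divides by n - 1

print(mean, pop_var, sample_var)
```

With only a handful of samples per class the two estimates can differ noticeably, so it is worth knowing which one a given implementation uses.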
The following program computes both parameters from the spreadsheet:
# -*- coding: utf-8 -*-
"""
Created on Thu Nov 22 20:01:46 2018
@author: wudl
"""
import numpy as np
import xlrd  # note: xlrd 2.0 and later read only .xls; an .xlsx file needs xlrd < 2.0


def mean_function(ob):
    # arithmetic mean of a list of numbers
    mean = sum(ob)/len(ob)
    return mean


def variance_function(ob):
    # population variance: divides by n, not n-1
    mean = mean_function(ob)
    ob_array = np.array(ob)
    variance = sum((ob_array-mean)**2)/len(ob)
    return variance


if __name__=="__main__":
    workbook = xlrd.open_workbook('C:/Users/Lenovo/Desktop/bayes.xlsx')
    sheet = workbook.sheet_by_name('Sheet1')
    # rows 0 and 1 hold the children's heights and weights
    height_c = sheet.row_values(0)
    weight_c = sheet.row_values(1)
    # rows 3 and 4 hold the adults' heights and weights
    height_a = sheet.row_values(3)
    height_a = [i for i in height_a if i !='']  # the list comprehension drops empty cells
    weight_a = sheet.row_values(4)
    weight_a = [i for i in weight_a if i !='']
    mean_height_c = mean_function(height_c)
    variance_height_c = variance_function(height_c)
    mean_weight_c = mean_function(weight_c)
    variance_weight_c = variance_function(weight_c)
    mean_height_a = mean_function(height_a)
    variance_height_a = variance_function(height_a)
    mean_weight_a = mean_function(weight_a)
    variance_weight_a = variance_function(weight_a)
    print('mean_height_c ==>> %.2f; variance_height_c ==>> %.2f' %(mean_height_c,variance_height_c),'\n' )
    print('mean_weight_c ==>> %.2f; variance_weight_c ==>> %.2f' %(mean_weight_c,variance_weight_c),'\n' )
    print('mean_height_a ==>> %.2f; variance_height_a ==>> %.2f' %(mean_height_a,variance_height_a),'\n' )
    print('mean_weight_a ==>> %.2f; variance_weight_a ==>> %.2f' %(mean_weight_a,variance_weight_a),'\n' )
This gives:
mean_height_c ==>> 59.17; variance_height_c ==>> 424.31
mean_weight_c ==>> 59.17; variance_weight_c ==>> 424.31
mean_height_a ==>> 170.00; variance_height_a ==>> 50.00
mean_weight_a ==>> 170.00; variance_weight_a ==>> 50.00
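Rounding these values gives the following results:
(3) Compute the posterior
With the means and variances in hand, we can compute the likelihood of each attribute for adults and for children, and from these the posterior.
For example, the likelihood of a height $h$ for an adult is
$$p(h \mid a) = \frac{1}{\sqrt{2\pi\sigma_{h,a}^2}} \exp\!\left(-\frac{(h-\mu_{h,a})^2}{2\sigma_{h,a}^2}\right)$$
where $\mu_{h,a}$ and $\sigma_{h,a}^2$ are the mean and variance of the adult heights estimated in step (2).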
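As a worked instance of this density, the sketch here evaluates it at the query height 165, using the adult-height mean 170.00 and variance 50.00 printed in step (2). The helper name gaussian_pdf is hypothetical and is not part of the article's program.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of the normal distribution N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


# Parameters taken from the printed output: mean_height_a = 170.00, variance_height_a = 50.00.
like_height_adult = gaussian_pdf(165, 170.0, 50.0)
print('p(h=165|a) ==>> %.6f' % like_height_adult)
```

The result is a density, not a probability, so it may exceed 1 for small variances; only the ratio between the classes matters for the posterior.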
Full program:
# -*- coding: utf-8 -*-
"""
Created on Thu Nov 22 20:01:46 2018
@author: wudl
"""
import numpy as np
import xlrd  # note: xlrd 2.0 and later read only .xls; an .xlsx file needs xlrd < 2.0


def mean_function(ob):
    # arithmetic mean of a list of numbers
    mean = sum(ob)/len(ob)
    return mean


def variance_function(ob):
    # population variance: divides by n, not n-1
    mean = mean_function(ob)
    ob_array = np.array(ob)
    variance = sum((ob_array-mean)**2)/len(ob)
    return variance


def prior_distribution(h,w,ob1,ob2):
    # Despite its name, this returns the class-conditional likelihood p(h,w|class):
    # the product of two Gaussian densities, one fitted to the heights ob1 and
    # one fitted to the weights ob2. The arguments are ordered (h, w) so that
    # they match the calls in the main block.
    prior_h = 1/np.sqrt(2*np.pi*variance_function(ob1))*np.exp(-(h-mean_function(ob1))**2/(2*variance_function(ob1)))
    prior_w = 1/np.sqrt(2*np.pi*variance_function(ob2))*np.exp(-(w-mean_function(ob2))**2/(2*variance_function(ob2)))
    return prior_h*prior_w

#def p_sum():
#

if __name__=="__main__":
    height,weight = map(float,input('Enter height and weight(separated by space):').split())
    workbook = xlrd.open_workbook('C:/Users/Lenovo/Desktop/bayes.xlsx')
    sheet = workbook.sheet_by_name('Sheet1')
    height_c = sheet.row_values(0)
    weight_c = sheet.row_values(1)
    height_a = sheet.row_values(3)
    height_a = [i for i in height_a if i !='']  # drop empty cells
    weight_a = sheet.row_values(4)
    weight_a = [i for i in weight_a if i !='']
    mean_height_c = mean_function(height_c)
    variance_height_c = variance_function(height_c)
    mean_weight_c = mean_function(weight_c)
    variance_weight_c = variance_function(weight_c)
    mean_height_a = mean_function(height_a)
    variance_height_a = variance_function(height_a)
    mean_weight_a = mean_function(weight_a)
    variance_weight_a = variance_function(weight_a)
    print('mean_height_c ==>> %.2f; variance_height_c ==>> %.2f' %(mean_height_c,variance_height_c),'\n' )
    print('mean_weight_c ==>> %.2f; variance_weight_c ==>> %.2f' %(mean_weight_c,variance_weight_c),'\n' )
    print('mean_height_a ==>> %.2f; variance_height_a ==>> %.2f' %(mean_height_a,variance_height_a),'\n' )
    print('mean_weight_a ==>> %.2f; variance_weight_a ==>> %.2f' %(mean_weight_a,variance_weight_a),'\n' )
    # for forecasting
    # prior distribution: relative frequency of each class
    prior_a = len(height_a)/(len(height_a)+len(height_c))
    prior_c = len(height_c)/(len(height_a)+len(height_c))
    # likelihood
    like_a = prior_distribution(height,weight,height_a,weight_a)
    like_c = prior_distribution(height,weight,height_c,weight_c)
    # results: posterior by Bayes' theorem, normalised over both classes
    p_a = prior_a*like_a/(prior_a*like_a+prior_c*like_c)
    p_c = prior_c*like_c/(prior_a*like_a+prior_c*like_c)
    print(p_a)
    print('p(y=c|x)==>>%.4f' %p_c)
    '''
    print('p(y=a|x)==>>'+str(p_a)) # very small values always round to zero with a fixed format, so print them as strings
    print('p(y=c|x)==>>'+str(p_c))
    '''
Entering (120 120) gives: classified as a child
Entering (165 110) gives: classified as a child
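The commented-out print statements in the program note that very small posteriors round to zero. A related problem is that the likelihoods themselves can underflow to 0.0 far from both class means, which would make the normalisation divide by zero. One common remedy, sketched here as an assumption rather than as the author's method, is to work with log densities and normalise with the log-sum-exp trick. All parameter values and helper names are hypothetical.

```python
import math


def log_gaussian_pdf(x, mean, var):
    """Log of the normal density N(mean, var) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def posteriors(x, params, priors):
    """Return {class: p(class | x)} computed in log space.

    x      : tuple of feature values, e.g. (height, weight)
    params : {class: [(mean, var), ...]}, one pair per feature
    priors : {class: p(class)}
    """
    # log p(y) + sum_i log p(x_i | y): the naive Bayes product as a sum of logs
    log_joint = {
        y: math.log(priors[y])
        + sum(log_gaussian_pdf(xi, m, v) for xi, (m, v) in zip(x, params[y]))
        for y in params
    }
    # log-sum-exp: subtract the maximum before exponentiating to avoid underflow
    top = max(log_joint.values())
    log_evidence = top + math.log(sum(math.exp(v - top) for v in log_joint.values()))
    return {y: math.exp(v - log_evidence) for y, v in log_joint.items()}


# Hypothetical parameters, for illustration only (not fitted to the article's data).
params = {
    "child": [(100.0, 400.0), (40.0, 200.0)],   # (mean, var) for height, weight
    "adult": [(170.0, 50.0), (140.0, 300.0)],
}
priors = {"child": 0.6, "adult": 0.4}

print(posteriors((120, 120), params, priors))
print(posteriors((1000, 1000), params, priors))  # extreme point: still well defined
```

Because only differences of log values are exponentiated, the largest term is always exp(0) = 1, so the denominator can never be zero.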