Chinese Restaurant Process: Pseudocode and Python Implementation

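#Two ways to sample from the Chinese Restaurant Process:
Both assume the conditional seating probabilities of the process are known.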
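
For reference, the conditional probabilities meant here are the standard CRP seating rule. The notation is introduced here for illustration: $z_i$ is the table of customer $i$, and $n_k$ is the number of the first $i-1$ customers already seated at table $k$.

$$
P(z_i = k \mid z_1,\dots,z_{i-1}) =
\begin{cases}
\dfrac{n_k}{i-1+\alpha} & \text{for an occupied table } k \\[2ex]
\dfrac{\alpha}{i-1+\alpha} & \text{for a new table}
\end{cases}
$$

The first customer ($i = 1$) has no occupied table to join and therefore opens a new table with probability 1.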

##**Algorithm 1:** Sampling directly from the joint distribution

N: total number of customers in the restaurant
T: total number of samples (number of draws)
$\alpha$: Dirichlet (concentration) parameter

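Code 1:
# Algorithm 1: sample directly from the joint distribution using the CRP conditional probabilities:
# draw z1 first, then z2 given z1, and so on, until a full joint sample is obtained.
# N = 10 is the number of customers, alpha = 3 is the Dirichlet (concentration) parameter,
# and T = 50 is the number of samples, i.e. how many complete seatings are drawn.
# VERBOSE = False suppresses the detailed trace output while running.
# Z[0] is initialised to 1; all other entries start at 0, meaning "not yet seated".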
import numpy as np
def Draw_CRP_Direct_Sample(N = 10, alpha = 3, T = 50, VERBOSE = False):
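    Z_table = np.zeros((T,N))  # (T, N) array of zeros; each entry will hold the table number a customer sits at
    for t in range(T):
        Z = np.zeros(N,dtype=int)  # one sample: the table assignments of the N customers
        for i in range(N):
            if i == 0:
                Z[i] = 1  # the first customer always sits at table 1
            else:
                if VERBOSE:
                    print('initial Z=',Z)  # the sample so far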
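                unique, counts = np.unique(Z, return_counts=True)
                # unique holds the distinct table labels in Z, sorted ascending;
                # counts holds how many customers carry each label (the sampling weights).
                # For a 1-D array or list, np.unique removes duplicates and returns the sorted distinct values.
                # return_index=True also returns the index of each distinct value's first occurrence in the input.
                # return_inverse=True also returns, for each input element, its index in the distinct-value array.
                # return_counts=True also returns how many times each distinct value occurs in the input.

                # remove the zeros (customers not yet assigned to a table)
                if unique[0] == 0:
                    unique = np.delete(unique,0)  # drop label 0
                    counts = np.delete(counts,0)  # drop its count as well, so counts stays aligned with unique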

                if VERBOSE:
                    print("unique,counts,alpha", unique,counts,alpha)
                # add alpha to the end of the counts (weights) array
                counts = np.append(counts,alpha)  # counts = [counts, alpha]
                # the new table's label is one more than the largest label seen so far
                unique = np.append(unique,max(unique)+1)  # add the candidate new table
                if VERBOSE:
                    print("np.append(counts,alpha)",counts)  # counts with alpha appended
                    print("np.append(unique,max(unique)+1)",unique)
                    print("sum(counts)=",sum(counts))
                # roulette-wheel selection:
                # u is uniform on [0, sum(counts)), so each label is picked with probability proportional to its weight
                u = np.random.uniform()*sum(counts)
                # np.cumsum computes running totals, e.g. np.cumsum([1,2,3,4,5]) gives array([ 1,  3,  6, 10, 15])
                a_counts = np.cumsum(counts)  # running totals of the weights; normalising to probabilities first gives the same result
                if VERBOSE:
                    print("a_counts,counts, u, (a_counts > u)",a_counts,counts, u, a_counts > u)

                # first index where the accumulated sum is greater than the random variable
                index =  np.argmax(a_counts > u)  # argmax of a boolean array returns the index of the first True
                if VERBOSE:
                    print("index:", index)
                Z[i] = unique[index]

            if VERBOSE:
                print("final Z=",Z)
                print("\n\n")
                
        Z_table[t,:] = Z
    return Z_table
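
The roulette-wheel step is the core of the sampler, so it can help to look at it in isolation. The following is a minimal sketch, not part of the original code: `roulette_pick` is a hypothetical helper name, and `u` is passed in explicitly (instead of being drawn at random) so that the behaviour is reproducible.

```python
import numpy as np

def roulette_pick(weights, u):
    """Return the first index whose cumulative weight exceeds u * sum(weights).

    weights: non-negative weights (table counts followed by alpha)
    u: a number in [0, 1), normally drawn with np.random.uniform()
    """
    cumulative = np.cumsum(weights)
    threshold = u * cumulative[-1]
    # argmax on a boolean array gives the position of the first True
    return int(np.argmax(cumulative > threshold))

# Two occupied tables with 2 and 1 customers, plus alpha = 3 for a new table.
weights = [2, 1, 3]
print(roulette_pick(weights, 0.10))  # threshold 0.6 falls in the first slot
print(roulette_pick(weights, 0.40))  # threshold 2.4 falls in the second slot
print(roulette_pick(weights, 0.90))  # threshold 5.4 falls in the last slot (new table)
```

Because the slot widths equal the weights, a uniformly drawn `u` selects each index with probability proportional to its weight, which is exactly the CRP seating rule.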

##**Algorithm 2:** Sampling from the joint distribution with Gibbs sampling

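N: total number of customers in the restaurant
T: total number of samples (number of draws kept)
$\alpha$: Dirichlet (concentration) parameter
b: burn-in, the number of initial samples that are discarded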
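
For reference, each Gibbs update resamples one customer given all the others. Because CRP seatings are exchangeable, customer $i$ can be treated as the last to arrive. The notation is introduced here for illustration: $z_{-i}$ denotes all assignments except $z_i$, and $n_k^{-i}$ is the number of customers at table $k$ once customer $i$ is removed.

$$
P(z_i = k \mid z_{-i}) =
\begin{cases}
\dfrac{n_k^{-i}}{N-1+\alpha} & \text{for an occupied table } k \\[2ex]
\dfrac{\alpha}{N-1+\alpha} & \text{for a new table}
\end{cases}
$$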

Code 2: Gibbs sampling
import numpy as np
# Algorithm 2: sample from the joint distribution with Gibbs sampling
# burn_in = 10 means the first 10 samples are discarded
def Draw_CRP_Gibbs_Sample(N = 10, alpha = 3, T = 50, burn_in = 10, VERBOSE = False):
    Z = np.ones(N,dtype=int)  # current state: every customer starts at table 1
    Z_table = np.zeros((T,N))  # storage for the T retained samples, initialised to 0
    for t in range(T+burn_in):
        for i in range(N):
            if VERBOSE:
                print("initial Z0=",Z)
            # remove the current table assignment: the customer being resampled is set to 0,
            # so the entry updated in Z2 is exactly the one that is 0 in Z1
            Z[i] = 0

            if VERBOSE:
                print("Z1=",Z)

            unique, counts = np.unique(Z, return_counts=True)

            # remove the zero label (the customer that is currently unassigned)
            if unique[0] == 0:
                unique = np.delete(unique,0)
                counts = np.delete(counts,0)

            if VERBOSE:
                print("unique,counts,alpha", unique,counts,alpha)

            # add alpha to the end of the counts (weights) array
            counts = np.append(counts,alpha)

            # the new table's label is one more than the largest label seen so far
            # (default=0 covers N = 1, where no table is occupied after the removal)
            unique = np.append(unique,max(unique, default=0)+1)

            if VERBOSE:
                print("np.append(counts,alpha)",counts)
                print("np.append(unique,max(unique)+1)",unique)
                print("sum(counts)=",sum(counts))
            u = np.random.uniform()*sum(counts)

            a_counts = np.cumsum(counts)

            if VERBOSE:
                print("a_counts,counts, u, a_counts > u",a_counts,counts, u, a_counts > u)

            # first index where the accumulated sum is greater than the random variable
            index =  np.argmax(a_counts > u)

            if VERBOSE:
                print("index:", index)

            Z[i] = unique[index]

            if VERBOSE:
                print("Z2=",Z)
                print("\n") 

        old_table = np.unique(Z)  # labels currently in use; they may have gaps or not start at 1
        new_table = np.array(range(1,len(old_table)+1))  # same length as old_table: the labels 1..K
        if VERBOSE:
            print("old_table=",old_table)
            print("new_table=",new_table)

        for k in range(len(old_table)):
            Z[Z == old_table[k]]=new_table[k]  # relabel the k-th old label as new label k+1

        if t >= burn_in:  # discard the first burn_in samples
            Z_table[t-burn_in,:] = Z

        if VERBOSE:
            print("Z3=",Z)
            print("\n\n\n") 

    if VERBOSE:
        print("Z_table=",Z_table)
    
    return Z_table
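
The relabelling loop at the end of each sweep maps whatever labels are in use onto 1..K. As a hedged alternative (a sketch, not the original author's code), the same mapping can be obtained from the `return_inverse` output of `np.unique`; `relabel_tables` is a hypothetical helper name.

```python
import numpy as np

def relabel_tables(Z):
    """Map arbitrary table labels to 1..K, keeping the sorted order of the labels."""
    Z = np.asarray(Z)
    # inverse[i] is the position of Z[i] within the sorted distinct labels (0-based)
    _, inverse = np.unique(Z, return_inverse=True)
    return inverse.reshape(Z.shape) + 1

Z = np.array([2, 5, 2, 7, 5])
print(relabel_tables(Z))  # [1 2 1 3 2]
```

This returns a new array instead of modifying `Z` in place, so inside the sampler it would be used as `Z = relabel_tables(Z)`.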

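How can the correctness of the sampling algorithms be verified?

Compare the samples with theory in two respects:

The expected number of occupied tables K (sample mean versus theoretical mean)
The probability P(K=k) of each number of occupied tables (sample frequency versus theoretical probability)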
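
A minimal sketch of such a check follows. It relies on two standard facts about the CRP: customer $i$ opens a new table with probability $\alpha/(\alpha+i-1)$, so $E[K]=\sum_{i=1}^{N}\alpha/(\alpha+i-1)$, and $P(K=k)$ can be built up by the same rule one customer at a time. All function names are hypothetical, and the block carries its own compact sampler so that it runs standalone.

```python
import numpy as np

def expected_num_tables(N, alpha):
    """Theoretical E[K]: sum over customers of the probability of opening a new table."""
    seated = np.arange(N)  # number of customers already seated: 0..N-1
    return float(np.sum(alpha / (alpha + seated)))

def num_tables_pmf(N, alpha):
    """Theoretical P(K = k) for k = 0..N, built up one customer at a time."""
    pmf = np.zeros(N + 1)
    pmf[0] = 1.0  # no customers, no tables
    for n in range(N):  # n customers are already seated
        p_new = alpha / (alpha + n)
        nxt = pmf * (1.0 - p_new)      # customer joins an occupied table
        nxt[1:] += pmf[:-1] * p_new    # customer opens a new table
        pmf = nxt
    return pmf

def sample_num_tables(N, alpha, T, rng):
    """Draw T seatings with the CRP rule and return the table count of each."""
    K = np.empty(T, dtype=int)
    for t in range(T):
        counts = []  # counts[k] = customers at table k
        for _ in range(N):
            weights = np.array(counts + [alpha], dtype=float)
            k = rng.choice(len(weights), p=weights / weights.sum())
            if k == len(counts):
                counts.append(1)   # new table
            else:
                counts[k] += 1     # existing table
        K[t] = len(counts)
    return K

N, alpha, T = 10, 3, 2000
rng = np.random.default_rng(0)
K = sample_num_tables(N, alpha, T, rng)
print("sample mean of K:", K.mean(), " theory:", expected_num_tables(N, alpha))
empirical = np.bincount(K, minlength=N + 1) / T
for k, (p_hat, p) in enumerate(zip(empirical, num_tables_pmf(N, alpha))):
    print(k, round(p_hat, 3), round(p, 3))
```

To check `Draw_CRP_Direct_Sample` or `Draw_CRP_Gibbs_Sample` instead, replace `K` with the table count of each row of their output, for example `np.array([len(np.unique(row)) for row in Z_table])`.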

##References
徐亦達 (Xu Yida): Chinese Restaurant Process demo
