[GuidedLDA] Code Analysis

Initialization

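First, every word in every document is given a starting topic.
GuidedLDA changes this initial [document:topic] assignment during the initialization phase.
seed_topics is a dictionary of the form {position of the word in the bag of words (word id): index of the seed word list it belongs to (topic id)}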
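
To illustrate that dictionary format, here is a minimal sketch of how seed_topics could be built from one list of seed words per topic. The vocabulary, the seed words, and the names vocab, word2id, and seed_topic_list are made up for this example.

```python
# Toy vocabulary; the position of each word is its id in the bag of words.
vocab = ["game", "team", "ball", "vote", "party", "law", "market", "price"]
word2id = {word: idx for idx, word in enumerate(vocab)}

# One list of seed words per topic; the index of the list is the topic id.
seed_topic_list = [
    ["game", "team"],    # topic 0
    ["vote", "party"],   # topic 1
    ["market"],          # topic 2
]

# seed_topics maps word id -> topic id.
seed_topics = {}
for topic_id, seed_words in enumerate(seed_topic_list):
    for word in seed_words:
        seed_topics[word2id[word]] = topic_id

print(seed_topics)  # {0: 0, 1: 0, 3: 1, 4: 1, 6: 2}
```

Words that are not in any seed list ("ball", "law", "price" here) simply have no entry in the dictionary.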

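# Seeded initialization
# Iterate over all word tokens
for i in range(N):
    # WS[k] contains the k-th word in the corpus
    # DS[k] contains the document index for the k-th word
    w, d = WS[i], DS[i]
    # This pass only handles tokens whose word is a seed word
    if w not in seed_topics:
        continue
    # check if seeded initialization:
    # with probability seed_confidence, start the seed word in its seed topic
    if w in seed_topics and random.random() < seed_confidence:
        # use the user-specified topic id
        z_new = seed_topics[w]
    else:
        # otherwise fall back to the round-robin topic i % n_topics
        z_new = i % n_topics
    ZS[i] = z_new
    # increment the matching entries of the count arrays
    ndz_[d, z_new] += 1
    nzw_[z_new, w] += 1
    nz_[z_new] += 1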
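
The excerpt uses WS, DS and the count arrays without showing where they come from. The sketch below is a self-contained toy version in pure Python: it expands a small document-term count matrix into the token lists and then runs a seeded initialization over them. The helper names expand_counts and seeded_init are hypothetical, and giving non-seed tokens the i % n_topics topic in the same loop is an assumption, since the excerpt above skips those tokens.

```python
import random

def expand_counts(doc_word):
    """Expand a document-term count matrix into parallel token lists.

    WS[i] is the word id of token i, DS[i] is the index of its document.
    """
    WS, DS = [], []
    for d, row in enumerate(doc_word):
        for w, count in enumerate(row):
            WS.extend([w] * count)
            DS.extend([d] * count)
    return WS, DS

def seeded_init(doc_word, n_topics, seed_topics, seed_confidence, rng):
    n_docs, n_words = len(doc_word), len(doc_word[0])
    WS, DS = expand_counts(doc_word)
    ZS = [0] * len(WS)
    ndz = [[0] * n_topics for _ in range(n_docs)]   # document-topic counts
    nzw = [[0] * n_words for _ in range(n_topics)]  # topic-word counts
    nz = [0] * n_topics                             # tokens per topic
    for i, (w, d) in enumerate(zip(WS, DS)):
        if w in seed_topics and rng.random() < seed_confidence:
            z = seed_topics[w]   # seed word starts in its seed topic
        else:
            z = i % n_topics     # assumed fallback for every other token
        ZS[i] = z
        ndz[d][z] += 1
        nzw[z][w] += 1
        nz[z] += 1
    return WS, DS, ZS, ndz, nzw, nz

doc_word = [[2, 1, 0, 0],
            [0, 0, 1, 3]]        # 2 documents, 4 vocabulary words
seed_topics = {0: 0, 3: 1}       # word 0 -> topic 0, word 3 -> topic 1
WS, DS, ZS, ndz, nzw, nz = seeded_init(
    doc_word, n_topics=2, seed_topics=seed_topics,
    seed_confidence=1.0, rng=random.Random(0))
print(ZS, nz)
```

With seed_confidence=1.0 every seed word token lands in its seed topic; lowering the value lets some of them fall back to the round-robin topic.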

Iteration

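The core formula. In the source this part lives in a compiled extension module for CPython, so it cannot be read from the Python files: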
_guidedlda.cpython-36m-darwin.so

log p(w,z) = log p(w|z) + log p(z)

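nzw_: matrix of counts recording the topic-word assignments in the final iteration
ndz_: matrix of counts recording the document-topic assignments in the final iteration
nz_: array of topic assignment counts in the final iteration
z: topic
d: document
w: word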
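
For symmetric Dirichlet priors alpha and eta, both terms of log p(w,z) have a standard closed form in the count arrays defined above (the Dirichlet-multinomial formula). The sketch below is written from that textbook formula in pure Python; it is not copied from the package's compiled code, and the function name log_joint is hypothetical.

```python
from math import lgamma, log

def log_joint(nzw, ndz, alpha, eta):
    """log p(w, z) = log p(w|z) + log p(z) for symmetric priors."""
    n_topics = len(nzw)
    vocab_size = len(nzw[0])
    ll = 0.0
    # log p(w|z): one Dirichlet-multinomial term per topic
    for k in range(n_topics):
        nz_k = sum(nzw[k])  # number of tokens assigned to topic k
        ll += lgamma(vocab_size * eta) - vocab_size * lgamma(eta)
        ll += sum(lgamma(eta + n) for n in nzw[k])
        ll -= lgamma(vocab_size * eta + nz_k)
    # log p(z): one Dirichlet-multinomial term per document
    for row in ndz:
        nd = sum(row)       # number of tokens in the document
        ll += lgamma(n_topics * alpha) - n_topics * lgamma(alpha)
        ll += sum(lgamma(alpha + n) for n in row)
        ll -= lgamma(n_topics * alpha + nd)
    return ll

# One document with two tokens: word 0 in topic 0, word 1 in topic 1.
nzw = [[1, 0], [0, 1]]
ndz = [[1, 1]]
value = log_joint(nzw, ndz, alpha=1.0, eta=1.0)
print(value, -log(24))
```

With alpha = eta = 1 the toy case can be checked by hand: p(z) = 1/6 and p(w|z) = 1/4, so the joint probability is 1/24.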

    def _fit(self, X, seed_topics, seed_confidence):
        """Fit the model to the data X

        Parameters
        ----------
        X: array-like, shape (n_samples, n_features)
            Training vector, where n_samples is the number of samples and
            n_features is the number of features. Sparse matrix allowed.
        """
        random_state = guidedlda.utils.check_random_state(self.random_state)
        rands = self._rands.copy()

        self._initialize(X, seed_topics, seed_confidence)
        # Iterate
        for it in range(self.n_iter):
            # FIXME: using numpy.roll with a random shift might be faster
            random_state.shuffle(rands)
            if it % self.refresh == 0:
                ll = self.loglikelihood()
                logger.info("<{}> log likelihood: {:.0f}".format(it, ll))
                # keep track of loglikelihoods for monitoring convergence
                self.loglikelihoods_.append(ll)
            self._sample_topics(rands)
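        # The Python code does not show what the sampling step above computes:
        # it is written in C and compiled into a .so file
        # Compute the log likelihood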
        ll = self.loglikelihood()
        logger.info("<{}> log likelihood: {:.0f}".format(self.n_iter - 1, ll))
        # eta: Dirichlet parameter for distribution over words
        # alpha: Dirichlet parameter for distribution over topics
        self.components_ = (self.nzw_ + self.eta).astype(float)
        # Summing over axis 1 and adding a new axis gives shape (n_topics, 1)
        self.components_ /= np.sum(self.components_, axis=1)[:, np.newaxis]
        # Probability that topic t generates the i-th word of the vocabulary V
        self.topic_word_ = self.components_

        self.word_topic_ = (self.nzw_ + self.eta).astype(float)
        self.word_topic_ /= np.sum(self.word_topic_, axis=0)[np.newaxis, :]
        self.word_topic_ = self.word_topic_.T
        # Probability of the i-th topic in T for document d
        self.doc_topic_ = (self.ndz_ + self.alpha).astype(float)
        self.doc_topic_ /= np.sum(self.doc_topic_, axis=1)[:, np.newaxis]

        # delete attributes no longer needed after fitting to save memory and reduce clutter
        del self.WS
        del self.DS
        del self.ZS
        return self
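
The three normalizations at the end of _fit differ only in the axis they sum over. A small NumPy sketch with made-up counts shows the effect on the topic-word matrix; the array values and the eta value are illustrative only.

```python
import numpy as np

# Toy topic-word counts, shape (n_topics, n_words) = (2, 3).
nzw = np.array([[3, 1, 0],
                [0, 2, 2]])
eta = 0.01

smoothed = (nzw + eta).astype(float)

# Divide each row by its sum: every topic becomes a distribution over words.
topic_word = smoothed / np.sum(smoothed, axis=1)[:, np.newaxis]

# Divide each column by its sum, then transpose: every word gets its
# smoothed counts normalized across topics, shape (n_words, n_topics).
word_topic = (smoothed / np.sum(smoothed, axis=0)[np.newaxis, :]).T

print(topic_word.sum(axis=1))   # one value per topic, each 1.0
print(word_topic.shape)         # (3, 2)
```

The [:, np.newaxis] and [np.newaxis, :] indexing turns the sums into a column or a row so that broadcasting divides along the intended axis.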

Summary

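For LDA, p(w|d) = Σ_t p(w|t) p(t|d): the distribution of words in a document is the distribution of words in each topic times the distribution of topics in the document, summed over the topics. There are two sets of parameters, θd and αt: θd is the probability distribution of document d over topics, and αt is the probability distribution over words generated by topic t. (This αt is not the alpha in the code. The hyperparameters proper are the Dirichlet parameters alpha and eta.)
First, θd and αt are initialized. Then, for every word w in every document d, enumerate the topics in T and compute p(w|t)p(t|d) for each topic t; the topic with the largest value is taken as the topic of word w in document d. (A Gibbs sampler, which is what a step named _sample_topics suggests, draws the topic in proportion to these values rather than always taking the maximum.)
The parameters are then updated repeatedly until they converge.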
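The Python source of the package does not contain the implementation of the LDA formula. In other words, it does not show how the variables are tied together: the updates of the core variables do not appear in the .py code and are implemented in C (or C++).
So far I understand the definitions and shapes of the variables and the data preprocessing.
What can be seen is that, when seed words are given, guidedlda changes the initial topic-distribution and word-distribution count matrices. In plain lda that starting assignment is arbitrary. I have annotated the code accordingly.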
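
Since the update step is hidden in compiled code, the sketch below shows what one sweep of a textbook collapsed Gibbs sampler for LDA looks like in pure Python. It is an assumption about the general technique, not a transcription of the package's implementation; gibbs_sweep and the toy corpus are made up for illustration.

```python
import random

def gibbs_sweep(WS, DS, ZS, ndz, nzw, nz, alpha, eta, rng):
    """Resample the topic of every token once, updating the counts in place."""
    n_topics, vocab_size = len(nz), len(nzw[0])
    for i, (w, d) in enumerate(zip(WS, DS)):
        # Take the token out of the counts.
        z_old = ZS[i]
        ndz[d][z_old] -= 1
        nzw[z_old][w] -= 1
        nz[z_old] -= 1
        # Unnormalized p(z = k | everything else): p(w|k) * p(k|d).
        weights = [
            (nzw[k][w] + eta) / (nz[k] + vocab_size * eta) * (ndz[d][k] + alpha)
            for k in range(n_topics)
        ]
        # Draw the new topic in proportion to the weights.
        z_new = rng.choices(range(n_topics), weights=weights)[0]
        # Put the token back under its new topic.
        ZS[i] = z_new
        ndz[d][z_new] += 1
        nzw[z_new][w] += 1
        nz[z_new] += 1

# Toy corpus: 2 documents, 4 vocabulary words, 7 tokens.
WS = [0, 0, 1, 2, 3, 3, 3]
DS = [0, 0, 0, 1, 1, 1, 1]
n_topics, vocab_size, n_docs = 2, 4, 2
ZS = [i % n_topics for i in range(len(WS))]
ndz = [[0] * n_topics for _ in range(n_docs)]
nzw = [[0] * vocab_size for _ in range(n_topics)]
nz = [0] * n_topics
for w, d, z in zip(WS, DS, ZS):
    ndz[d][z] += 1
    nzw[z][w] += 1
    nz[z] += 1

rng = random.Random(0)
for _ in range(50):
    gibbs_sweep(WS, DS, ZS, ndz, nzw, nz, alpha=0.1, eta=0.01, rng=rng)
print(ZS, nz)
```

Seeding only changes the starting ZS and counts that go into the first sweep; the sweep itself never looks at seed_topics in this sketch.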
