Offline Incremental Article Profile Computation

2.5 Offline Incremental Article Profile Computation

Learning Objectives

  • Goal
    • Understand the incremental update code flow
  • Application

2.5.1 Offline Article Profile Update Requirements

An article profile assigns a set of descriptive words to each article.

  • Keywords: words shared by the TextRank and IDF results (weighted by TextRank × IDF)
  • Topic words: words shared by the TextRank and TFIDF results (see the toy sketch at the end of this subsection)

  • Article update schedule:

1. In the toutiao database, merge news_article_content and news_article_basic into the article_data table of the article database, which is easier to work with.

The first run updates all articles; after that, data is updated incrementally every hour, e.g. on the 26th: 1:00~2:00, 2:00~3:00, with left-closed, right-open intervals.

2. For the newly updated articles, compute TFIDF values using the existing IDF model, and compute the Hive table textrank_keywords_values with TextRank.

3. Update the Hive table article_profile.
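
As a concrete illustration of the two definitions above, here is a toy sketch (made-up values, plain Python) of the weighting and intersection that the code later in this section performs per article:

# Toy illustration of the keyword / topic-word definitions (values are made up)
textrank = {"weather": 0.9, "mood": 0.7, "today": 0.3}   # TextRank scores for one article
idf = {"weather": 2.1, "mood": 1.8, "beautiful": 2.5}    # global IDF values per word
tfidf_topk = {"weather", "beautiful"}                    # top-K words of the article by TFIDF

# Keywords: words present in both the TextRank result and the IDF vocabulary,
# weighted by textrank * idf (the same formula as result.textrank * result.idf below)
keywords = {w: round(textrank[w] * idf[w], 4) for w in textrank if w in idf}

# Topic words: words that appear in both the TFIDF top-K and the TextRank result
topics = tfidf_topk & set(textrank)

print(keywords)  # {'weather': 1.89, 'mood': 1.26}
print(topics)    # {'weather'}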

2.5.2 Scheduled Article Update Setup

  • Purpose: use Supervisor to manage an APScheduler job that runs the update program on a schedule
  • Steps:
    • 1. Organize the update program code and test-run it
    • 2. Set the APScheduler run interval and add startup logging (a sketch follows the update program below)
    • 3. Manage the process with Supervisor

2.5.2.1 Organize the Update Program Code and Test-Run It

Note that when running in PyCharm, the following environment variables must be set:

PYTHONUNBUFFERED=1
JAVA_HOME=/root/bigdata/jdk
SPARK_HOME=/root/bigdata/spark
HADOOP_HOME=/root/bigdata/hadoop
PYSPARK_PYTHON=/root/anaconda3/envs/reco_sys/bin/python
PYSPARK_DRIVER_PYTHON=/root/anaconda3/envs/reco_sys/bin/python
import os
import sys
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.insert(0, os.path.join(BASE_DIR))
from offline import SparkSessionBase
from datetime import datetime
from datetime import timedelta
import pyspark.sql.functions as F
import pyspark
import gc
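# Note: segmentation() and textrank() used below are the word-segmentation and
# TextRank helper functions defined elsewhere in the offline package; their
# import is not shown here.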

class UpdateArticle(SparkSessionBase):
    """
    Update article profiles
    """
    SPARK_APP_NAME = "updateArticle"
    ENABLE_HIVE_SUPPORT = True

    SPARK_EXECUTOR_MEMORY = "7g"

    def __init__(self):
        self.spark = self._create_spark_session()

        self.cv_path = "hdfs://hadoop-master:9000/headlines/models/countVectorizerOfArticleWords.model"
        self.idf_path = "hdfs://hadoop-master:9000/headlines/models/IDFOfArticleWords.model"

    def get_cv_model(self):
        # Load the word / term-frequency (CountVectorizer) model
        from pyspark.ml.feature import CountVectorizerModel
        cv_model = CountVectorizerModel.load(self.cv_path)
        return cv_model

    def get_idf_model(self):
        from pyspark.ml.feature import IDFModel
        idf_model = IDFModel.load(self.idf_path)
        return idf_model

    @staticmethod
    def compute_keywords_tfidf_topk(words_df, cv_model, idf_model):
        """保存tfidf值高的20個關鍵詞
        :param spark:
        :param words_df:
        :return:
        """
        cv_result = cv_model.transform(words_df)
        tfidf_result = idf_model.transform(cv_result)
        # print("transform compelete")

        # Take the top-N words by TFIDF value
        def func(partition):
            TOPK = 20
            for row in partition:
                _ = list(zip(row.idfFeatures.indices, row.idfFeatures.values))
                _ = sorted(_, key=lambda x: x[1], reverse=True)
                result = _[:TOPK]
                #         words_index = [int(i[0]) for i in result]
                #         yield row.article_id, row.channel_id, words_index

                for word_index, tfidf in result:
                    yield row.article_id, row.channel_id, int(word_index), round(float(tfidf), 4)

        _keywordsByTFIDF = tfidf_result.rdd.mapPartitions(func).toDF(["article_id", "channel_id", "index", "tfidf"])

        return _keywordsByTFIDF

    def merge_article_data(self):
        """
        合併業務中增量更新的文章數據
        :return:
        """
        # 獲取文章相關數據, 指定過去一個小時整點到整點的更新數據
        # 如:26日:1:00~2:00,2:00~3:00,左閉右開
        self.spark.sql("use toutiao")
        _yester = datetime.today().replace(minute=0, second=0, microsecond=0)
        start = datetime.strftime(_yester + timedelta(days=0, hours=-1, minutes=0), "%Y-%m-%d %H:%M:%S")
        end = datetime.strftime(_yester, "%Y-%m-%d %H:%M:%S")

        # Columns kept after the merge: article_id, channel_id, channel_name, title, content
        # +----------+----------+--------------------+--------------------+
        # | article_id | channel_id | title | content |
        # +----------+----------+--------------------+--------------------+
        # | 141462 | 3 | test - 20190316 - 115123 | 今天天氣不錯,心情很美麗!!! |
        basic_content = self.spark.sql(
            "select a.article_id, a.channel_id, a.title, b.content from news_article_basic a "
            "inner join news_article_content b on a.article_id=b.article_id where a.review_time >= '{}' "
            "and a.review_time < '{}' and a.status = 2".format(start, end))
        # Add the channel name, which is used later
        basic_content.registerTempTable("temparticle")
        channel_basic_content = self.spark.sql(
            "select t.*, n.channel_name from temparticle t left join news_channel n on t.channel_id=n.channel_id")

        # Use concat_ws to merge several columns (channel name, title and content) into one long text field
        self.spark.sql("use article")
        sentence_df = channel_basic_content.select("article_id", "channel_id", "channel_name", "title", "content", \
                                                   F.concat_ws(
                                                       ",",
                                                       channel_basic_content.channel_name,
                                                       channel_basic_content.title,
                                                       channel_basic_content.content
                                                   ).alias("sentence")
                                                   )
        del basic_content
        del channel_basic_content
        gc.collect()

        sentence_df.write.insertInto("article_data")
        return sentence_df

    def generate_article_label(self, sentence_df):
        """
        生成文章標籤  tfidf, textrank
        :param sentence_df: 增量的文章內容
        :return:
        """
        # 進行分詞
        words_df = sentence_df.rdd.mapPartitions(segmentation).toDF(["article_id", "channel_id", "words"])
        cv_model = self.get_cv_model()
        idf_model = self.get_idf_model()

        # 1. Compute TFIDF for the new articles and map vocabulary indices back to keywords
        #    via the stored index in idf_keywords_values (keep utility code separate from business logic)
        _keywordsByTFIDF = UpdateArticle.compute_keywords_tfidf_topk(words_df, cv_model, idf_model)

        keywordsIndex = self.spark.sql("select keyword, index idx, idf from idf_keywords_values")  # idf is needed later in get_article_profile

        keywordsByTFIDF = _keywordsByTFIDF.join(keywordsIndex, keywordsIndex.idx == _keywordsByTFIDF.index).select(
            ["article_id", "channel_id", "keyword", "tfidf"])

        keywordsByTFIDF.write.insertInto("tfidf_keywords_values")

        del cv_model
        del idf_model
        del words_df
        del _keywordsByTFIDF
        gc.collect()

        # Compute TextRank keywords
        textrank_keywords_df = sentence_df.rdd.mapPartitions(textrank).toDF(
            ["article_id", "channel_id", "keyword", "textrank"])
        textrank_keywords_df.write.insertInto("textrank_keywords_values")

        return textrank_keywords_df, keywordsIndex

    def get_article_profile(self, textrank, keywordsIndex):
        """
        Build the article profile keywords and topic words
        :param textrank: TextRank result for each article
        :param keywordsIndex: keyword, index and idf values from idf_keywords_values
        :return: the newly built incremental article profile
        """
        keywordsIndex = keywordsIndex.withColumnRenamed("keyword", "keyword1")
        result = textrank.join(keywordsIndex, textrank.keyword == keywordsIndex.keyword1)

        # 1. Keywords (word, weight)
        # Compute keyword weights as textrank * idf
        _articleKeywordsWeights = result.withColumn("weights", result.textrank * result.idf).select(
            ["article_id", "channel_id", "keyword", "weights"])

        # Merge keyword weights into a per-article dictionary
        _articleKeywordsWeights.registerTempTable("temptable")
        articleKeywordsWeights = self.spark.sql(
            "select article_id, min(channel_id) channel_id, collect_list(keyword) keyword_list, collect_list(weights) weights_list from temptable group by article_id")
        def _func(row):
            return row.article_id, row.channel_id, dict(zip(row.keyword_list, row.weights_list))
        articleKeywords = articleKeywordsWeights.rdd.map(_func).toDF(["article_id", "channel_id", "keywords"])

        # 2. Topic words
        # Words that appear in both the TFIDF and TextRank results of an article are used as its topic words
        topic_sql = """
                select t.article_id article_id2, collect_set(t.keyword) topics from tfidf_keywords_values t
                inner join
                textrank_keywords_values r
                on t.article_id = r.article_id
                where t.keyword = r.keyword
                group by article_id2
                """
        articleTopics = self.spark.sql(topic_sql)

        # 3. Join the topic-word table with the keyword table and insert into article_profile
        articleProfile = articleKeywords.join(articleTopics,
                                              articleKeywords.article_id == articleTopics.article_id2).select(
            ["article_id", "channel_id", "keywords", "topics"])
        articleProfile.write.insertInto("article_profile")

        del keywordsIndex
        del _articleKeywordsWeights
        del articleKeywords
        del articleTopics
        gc.collect()

        return articleProfile


if __name__ == '__main__':
    ua = UpdateArticle()
    sentence_df = ua.merge_article_data()
    if sentence_df.take(1):  # proceed only if the window contains new articles (avoids collecting the whole batch)
        rank, idf = ua.generate_article_label(sentence_df)
        articleProfile = ua.get_article_profile(rank, idf)
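
Step 2 of the earlier list runs this program on a schedule with APScheduler. The following is a minimal sketch of such a scheduler script, assuming the UpdateArticle class above is importable; the wrapper name, one-hour interval trigger, and logging setup are illustrative, not the original configuration:

# update_scheduler.py -- hypothetical scheduler sketch (names and settings are assumptions)
from apscheduler.schedulers.blocking import BlockingScheduler
import logging

logging.basicConfig(level=logging.INFO)  # the startup logging mentioned in step 2

def update_article_profile():
    """Run one incremental article-profile update (same steps as the __main__ block above)."""
    ua = UpdateArticle()  # assumes UpdateArticle (defined above) is importable in this script
    sentence_df = ua.merge_article_data()
    if sentence_df.take(1):
        rank, idf = ua.generate_article_label(sentence_df)
        ua.get_article_profile(rank, idf)

scheduler = BlockingScheduler()
# Run once per hour, matching the left-closed / right-open hourly window used above
scheduler.add_job(update_article_profile, trigger='interval', hours=1)
scheduler.start()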

2.5.3 Incrementally Updating Article TFIDF and TextRank (test code, not written to Hive)

Implement the computation in a Jupyter notebook.

  • Purpose: be able to update newly published articles incrementally on a schedule
  • Steps:
    • Merge the new article data
    • Compute and store TFIDF for the new articles using the existing CV and IDF models, and save the TextRank results
    • Use the new articles' TFIDF and TextRank results to update the article profiles
  • Import packages
import os
# Configure the Python interpreter path used by the Spark driver and the PySpark runtime
PYSPARK_PYTHON = "/miniconda2/envs/reco_sys/bin/python"
# When multiple Python versions are installed, not setting this is likely to cause errors
os.environ["PYSPARK_PYTHON"] = PYSPARK_PYTHON
os.environ["PYSPARK_DRIVER_PYTHON"] = PYSPARK_PYTHON
import sys
BASE_DIR = os.path.dirname(os.getcwd())
sys.path.insert(0, os.path.join(BASE_DIR))
from datetime import datetime
from datetime import timedelta
import pyspark.sql.functions as F
from offline import SparkSessionBase
import pyspark
import gc

2.5.3.1 Merge New Article Data

class UpdateArticle(SparkSessionBase):
    """
    Update article profiles
    """
    SPARK_APP_NAME = "updateArticle"
    ENABLE_HIVE_SUPPORT = True

    SPARK_EXECUTOR_MEMORY = "7g"

    def __init__(self):
        self.spark = self._create_spark_session()
  • Incrementally merge articles

You can define an update schedule that fits your business at its current stage, e.g. updating daily or hourly.

ua.spark.sql("use toutiao")
_yester = datetime.today().replace(minute=0, second=0, microsecond=0)
start = datetime.strftime(_yester + timedelta(days=0, hours=-1, minutes=0), "%Y-%m-%d %H:%M:%S")
end = datetime.strftime(_yester, "%Y-%m-%d %H:%M:%S")

Select the new articles within the specified time range (when testing, shift the offset back by more days, e.g. days=-50, so that some data shows up, as sketched below).
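
# For testing only: widen the window so that some articles fall inside it (sketch)
start = datetime.strftime(_yester + timedelta(days=-50), "%Y-%m-%d %H:%M:%S")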

Note: make sure news_article_basic and news_article_content are consistent with each other.
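
A quick way to spot-check that consistency (a sketch, assuming article_id is the shared key as in the join below) is to count basic rows that have no matching content row:

# Count news_article_basic rows without a matching news_article_content row
ua.spark.sql(
    "select count(*) as missing_content from news_article_basic a "
    "left join news_article_content b on a.article_id = b.article_id "
    "where b.article_id is null").show()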

# Columns kept after the merge: article_id, channel_id, channel_name, title, content
# select * from news_article_basic where review_time > "2019-03-05";
# +----------+----------+--------------------+--------------------+
# | article_id | channel_id | title | content |
# +----------+----------+--------------------+--------------------+
# | 141462 | 3 | test - 20190316 - 115123 | 今天天氣不錯,心情很美麗!!! |
basic_content = ua.spark.sql(
  "select a.article_id, a.channel_id, a.title, b.content from news_article_basic a "
  "inner join news_article_content b on a.article_id=b.article_id where a.review_time >= '{}' "
  "and a.review_time < '{}' and a.status = 2".format(start, end))
# Add the channel name, which is used later
basic_content.registerTempTable("temparticle")
channel_basic_content = ua.spark.sql(
  "select t.*, n.channel_name from temparticle t left join news_channel n on t.channel_id=n.channel_id")

# Use concat_ws to merge several columns (channel name, title and content) into one long text field
ua.spark.sql("use article")
sentence_df = channel_basic_content.select("article_id", "channel_id", "channel_name", "title", "content", \
                                           F.concat_ws(
                                             ",",
                                             channel_basic_content.channel_name,
                                             channel_basic_content.title,
                                             channel_basic_content.content
                                           ).alias("sentence")
                                          )
del basic_content
del channel_basic_content
gc.collect()

# sentence_df.write.insertInto("article_data")

2.5.3.2 Update TFIDF

  • Problem: TFIDF is computed from TF (term frequency within a document) and IDF (inverse document frequency, derived from the total number of documents and the number of documents containing a word). The IDF of words over the existing N articles changes dynamically as new articles are added, which would require incremental TFIDF computation.
    • Solution: periodically recompute the CV and IDF models over all article data at a fixed time and simply replace the stored models (a sketch follows below).
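
A minimal sketch of that periodic full retrain, assuming all merged articles are stored in article_data and segmentation() is the same tokenizer used in this section; the vocabSize/minDF values and the column names countFeatures/idfFeatures (matching the row.idfFeatures usage here) are assumptions:

from pyspark.ml.feature import CountVectorizer, IDF

# Re-segment the full corpus (article_data holds all merged article text)
all_words_df = ua.spark.sql("select * from article_data").rdd \
    .mapPartitions(segmentation).toDF(["article_id", "channel_id", "words"])

# Retrain the CountVectorizer model over all articles (vocabSize/minDF are assumptions)
cv = CountVectorizer(inputCol="words", outputCol="countFeatures", vocabSize=200 * 10000, minDF=1.0)
cv_model = cv.fit(all_words_df)
cv_model.write().overwrite().save("hdfs://hadoop-master:9000/headlines/models/countVectorizerOfArticleWords.model")

# Retrain the IDF model on the new count features and replace the old model
cv_result = cv_model.transform(all_words_df)
idf_model = IDF(inputCol="countFeatures", outputCol="idfFeatures").fit(cv_result)
idf_model.write().overwrite().save("hdfs://hadoop-master:9000/headlines/models/IDFOfArticleWords.model")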

Segment the new articles and load the models.

# Segment the sentence_df computed above
words_df = sentence_df.rdd.mapPartitions(segmentation).toDF(["article_id", "channel_id", "words"])
cv_model = get_cv_model()
idf_model = get_idf_model()

Define the two model-loading functions.

cv_path = "hdfs://hadoop-master:9000/headlines/models/countVectorizerOfArticleWords.model"
idf_path = "hdfs://hadoop-master:9000/headlines/models/IDFOfArticleWords.model"

def get_cv_model():
    # Load the word / term-frequency (CountVectorizer) model
    from pyspark.ml.feature import CountVectorizerModel
    cv_model = CountVectorizerModel.load(cv_path)
    return cv_model

def get_idf_model():
    from pyspark.ml.feature import IDFModel
    idf_model = IDFModel.load(idf_path)
    return idf_model

def compute_keywords_tfidf_topk(words_df, cv_model, idf_model):
    """Keep the 20 keywords with the highest TFIDF values
    :param words_df: segmented words per article
    :param cv_model: CountVectorizer model
    :param idf_model: IDF model
    :return: DataFrame of (article_id, channel_id, index, tfidf)
    """
    cv_result = cv_model.transform(words_df)
    tfidf_result = idf_model.transform(cv_result)
    # print("transform complete")

    # Take the top-N words by TFIDF value
    def func(partition):
        TOPK = 20
        for row in partition:
            _ = list(zip(row.idfFeatures.indices, row.idfFeatures.values))
            _ = sorted(_, key=lambda x: x[1], reverse=True)
            result = _[:TOPK]
            for word_index, tfidf in result:
                yield row.article_id, row.channel_id, int(word_index), round(float(tfidf), 4)

    _keywordsByTFIDF = tfidf_result.rdd.mapPartitions(func).toDF(["article_id", "channel_id", "index", "tfidf"])

    return _keywordsByTFIDF
# 1. Compute TFIDF for the new articles and map vocabulary indices back to keywords
#    via the stored index in idf_keywords_values (keep utility code separate from business logic)
_keywordsByTFIDF = compute_keywords_tfidf_topk(words_df, cv_model, idf_model)

keywordsIndex = ua.spark.sql("select keyword, index idx from idf_keywords_values")

keywordsByTFIDF = _keywordsByTFIDF.join(keywordsIndex, keywordsIndex.idx == _keywordsByTFIDF.index).select(
  ["article_id", "channel_id", "keyword", "tfidf"])

# keywordsByTFIDF.write.insertInto("tfidf_keywords_values")

del cv_model
del idf_model
del words_df
del _keywordsByTFIDF
gc.collect()

# Compute TextRank keywords
textrank_keywords_df = sentence_df.rdd.mapPartitions(textrank).toDF(
  ["article_id", "channel_id", "keyword", "textrank"])
# textrank_keywords_df.write.insertInto("textrank_keywords_values")

The steps above produce textrank_keywords_df; next we continue with the article profile update.

2.5.3.3 Incrementally Update the Article Profile Results

Compute profiles for the new articles.

  • Steps:
    • 1. Load the IDF values, keep the keywords and compute their weights (TextRank * IDF)
    • 2. Merge the keyword weights into a dictionary result
    • 3. Take words that appear in both the TFIDF and TextRank results as topic words
    • 4. Join the topic-word table with the keyword table and insert into the profile table

Load the IDF values, keep the keywords and compute their weights (TextRank * IDF).

idf = ua.spark.sql("select * from idf_keywords_values")
idf = idf.withColumnRenamed("keyword", "keyword1")
result = textrank_keywords_df.join(idf,textrank_keywords_df.keyword==idf.keyword1)
keywords_res = result.withColumn("weights", result.textrank * result.idf).select(["article_id", "channel_id", "keyword", "weights"])

Merge the keyword weights into a dictionary result.

keywords_res.registerTempTable("temptable")
merge_keywords = ua.spark.sql("select article_id, min(channel_id) channel_id, collect_list(keyword) keywords, collect_list(weights) weights from temptable group by article_id")

# Merge the keyword weights into a per-article dictionary
def _func(row):
    return row.article_id, row.channel_id, dict(zip(row.keywords, row.weights))

keywords_info = merge_keywords.rdd.map(_func).toDF(["article_id", "channel_id", "keywords"])

Take words that appear in both the TFIDF and TextRank results of an article as its topic words.

topic_sql = """
                select t.article_id article_id2, collect_set(t.keyword) topics from tfidf_keywords_values t
                inner join
                textrank_keywords_values r
                on t.article_id = r.article_id
                where t.keyword = r.keyword
                group by article_id2
                """
articleTopics = ua.spark.sql(topic_sql)

Join the topic-word table with the keyword table.

article_profile = keywords_info.join(articleTopics, keywords_info.article_id == articleTopics.article_id2).select(["article_id", "channel_id", "keywords", "topics"])

# article_profile.write.insertInto("article_profile")
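
As a quick usage check (a sketch; the columns follow the select above), the newly built profile can be inspected directly:

# keywords is a word -> weight dict, topics is the set of topic words for each article
article_profile.show(1, truncate=False)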

 
