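PySpark TF-IDF Calculation (2)

Computing TF-IDF with PySpark

This post records how I computed TF-IDF statistics with PySpark and presents more than one way to do it.

1. Prepare the data

To keep things simple, and to make it easy to check the program for mistakes, I use the following test data: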

1 我来到北京清华大学
2 他来到了网易杭研大厦
3 我来到北京清华大学
4 他来到了网易杭研大厦
5 我来到北京清华大学，我来到北京清华大学

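There are five lines in total, one article per line. In each line the article id and the body are separated by a space; in the first line, for example, 1 is the article id and "我来到北京清华大学" is the text.
Write this text to a file named test.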
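One way to create that file, as a small sketch. It writes the five lines to a file called test inside a temporary directory and reads them back, splitting each line on the first space. The temporary directory and the helper names write_test_file and read_docs are my own choices for illustration, not part of the original setup.

```python
import os
import tempfile

# The five test articles from above, one "<id> <text>" entry per line.
LINES = [
    "1 我来到北京清华大学",
    "2 他来到了网易杭研大厦",
    "3 我来到北京清华大学",
    "4 他来到了网易杭研大厦",
    "5 我来到北京清华大学，我来到北京清华大学",
]


def write_test_file(directory):
    """Write the test articles to <directory>/test as UTF-8 and return the path."""
    path = os.path.join(directory, "test")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(LINES) + "\n")
    return path


def read_docs(path):
    """Read the file back as a list of (doc_id, text) pairs, skipping blank lines."""
    docs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            doc_id, text = line.split(" ", 1)
            docs.append((doc_id, text))
    return docs


with tempfile.TemporaryDirectory() as tmp:
    docs = read_docs(write_test_file(tmp))

print(len(docs), docs[0])
```

Writing with an explicit UTF-8 encoding avoids depending on the platform default, which matters for Chinese text on Windows.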

2. Load the data and segment it

Word segmentation uses jieba. The code is as follows:

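def seg(data):
    """
    Segment one text with jieba and return the tokens as a list.
    Single-character tokens and tokens that do not start with a
    Chinese character are dropped.
    :param data: raw text of one article
    :return: list of tokens
    """
    return [w for w in jieba.cut(data.strip(), cut_all=False) if len(w) > 1 and re.match(remove_pattern, w) is not None]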
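To see what the filter inside seg keeps, here is a sketch that applies the same two conditions to a hand-written token list. The token list is an assumption about what jieba returns for the first test sentence, with two extra tokens added to show what gets dropped. remove_pattern is copied from the complete code at the end of the post. jieba itself is not called.

```python
import re

# Same pattern as in the complete code: one or more Chinese characters.
remove_pattern = '[\u4e00-\u9fa5]+'


def keep_token(w):
    """Same condition as in seg: longer than one character and starting with Chinese characters."""
    return len(w) > 1 and re.match(remove_pattern, w) is not None


# Assumed jieba output for "我来到北京清华大学", plus two tokens that should be dropped.
tokens = ['我', '来到', '北京', '清华大学', 'spark', '，']
kept = [w for w in tokens if keep_token(w)]
print(kept)
```

Note that re.match only anchors at the start of the token, so a token that begins with Chinese characters and continues with Latin letters would still pass.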

The main function:

if __name__ == '__main__':
    conf = SparkConf().setAppName('text_trans').setMaster("local[*]")
    # pass conf explicitly so that the app name and master take effect
    sc = SparkContext(conf=conf)
    sc.setLogLevel(logLevel='ERROR')
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    # read the local file
    data = spark.read.text('test')
    # keep the total number of texts in a broadcast variable
    count = data.count()
    brocast_count = sc.broadcast(count)
    # show the data
    data.show()
    # split each line on the first space into article id and body, then segment the body with jieba
    data = data.rdd.map(lambda x: x[0].split(' ', 1))\
        .map(lambda x: (x[0], x[1], seg(x[1])))  # segmentation

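Filter out blank lines

If the data contains blank lines, or a line has no tokens left after segmentation and stop-word removal, that line can be dropped. In Spark this is done with filter. My habit is to mark such lines with a special symbol and then filter on it:

# mark empty lines with '--' and filter them out; filter keeps a row when the predicate is True and drops it when it is False
data = data\
    .map(lambda x: (x[0], '--' if len(x[2]) == 0 else x[1], x[2]))\
    .filter(lambda x: x[1] != '--')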
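The same rows can also be selected without the '--' marker by testing the token list directly. The sketch below uses Python's built-in filter on a plain list to illustrate the predicate. The sample rows are made up. On an RDD the same lambda would be passed to its filter method.

```python
# Rows have the same shape as in the post: (doc_id, text, tokens).
rows = [
    ('1', '我来到北京清华大学', ['来到', '北京', '清华大学']),
    ('6', 'ok', []),  # hypothetical row: nothing left after segmentation
]

# Sentinel version from the post: mark empty rows, then drop the marked ones.
marked = [(r[0], '--' if len(r[2]) == 0 else r[1], r[2]) for r in rows]
by_sentinel = [r for r in marked if r[1] != '--']

# Direct version: keep a row when its token list is not empty.
direct = list(filter(lambda r: len(r[2]) > 0, rows))

print(by_sentinel == direct)
```

The direct form also avoids dropping a legitimate article whose text happens to be exactly '--'.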

Compute TF

TF is defined per article: the number of times a word occurs in the article divided by the total number of words in that article. The calculation is straightforward.

def calc_tf(line):
    """
    Compute the tf of every word in one article.
    :param line: (doc_id, text, tokens)
    :return: list of (doc_id, (word, tf))
    """
    cnt_map = {}
    for w in line[2]:
        cnt_map[w] = cnt_map.get(w, 0) + 1
    # total number of tokens in this article (line[2] is the token list, line itself is a 3-tuple)
    lens = len(line[2])
    return [(line[0], (w, cnt * 1.0 / lens)) for w, cnt in cnt_map.items()]
# compute tf
tf_data = data.flatMap(calc_tf)

Printing the result shows:

print(tf_data.collect())

[('1', ('来到', 0.3333333333333333)), ('1', ('北京', 0.3333333333333333)), ('1', ('清华大学', 0.3333333333333333)), ('2', ('来到', 0.25)), ('2', ('网易', 0.25)), ('2', ('杭研', 0.25)), ('2', ('大厦', 0.25)), ('3', ('来到', 0.3333333333333333)), ('3', ('北京', 0.3333333333333333)), ('3', ('清华大学', 0.3333333333333333)), ('4', ('来到', 0.25)), ('4', ('网易', 0.25)), ('4', ('杭研', 0.25)), ('4', ('大厦', 0.25)), ('5', ('来到', 0.3333333333333333)), ('5', ('北京', 0.3333333333333333)), ('5', ('清华大学', 0.3333333333333333))]
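As a cross-check that does not need Spark, the same TF definition can be written with collections.Counter. This is only a sketch. The token lists are what I assume jieba yields for articles 2 and 5 after filtering.

```python
from collections import Counter


def tf(tokens):
    """Term frequency: occurrences of each word divided by the number of tokens in the article."""
    total = len(tokens)
    return {w: c / total for w, c in Counter(tokens).items()}


# Assumed token lists for articles 2 and 5.
doc2 = ['来到', '网易', '杭研', '大厦']
doc5 = ['来到', '北京', '清华大学', '来到', '北京', '清华大学']

print(tf(doc2)['网易'])
print(tf(doc5)['来到'])
print(sum(tf(doc5).values()))
```

A quick sanity test for any TF implementation is that the values of one article sum to 1, up to floating-point rounding.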

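Compute IDF

IDF stands for inverse document frequency. It is derived from the document frequency (DF) of a word, that is, the number of articles in the corpus in which the word appears. The code in this post computes the ratio DF / N, where N is the total number of articles, and stores it as idf_value; the textbook IDF is the logarithm of the inverse of that ratio, log(N / DF).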
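For reference, the relation between the document frequency ratio and the logarithmic IDF can be sketched in plain Python. The function name df_and_idf and the token lists are illustrative assumptions. No smoothing is applied, which is only one of several common variants.

```python
import math


def df_and_idf(docs):
    """docs: dict doc_id -> token list. Returns dict word -> (df / N, log(N / df))."""
    n = len(docs)
    df = {}
    for tokens in docs.values():
        for w in set(tokens):  # count each word at most once per article
            df[w] = df.get(w, 0) + 1
    return {w: (c / n, math.log(n / c)) for w, c in df.items()}


# Assumed token lists for the five test articles.
docs = {
    '1': ['来到', '北京', '清华大学'],
    '2': ['来到', '网易', '杭研', '大厦'],
    '3': ['来到', '北京', '清华大学'],
    '4': ['来到', '网易', '杭研', '大厦'],
    '5': ['来到', '北京', '清华大学', '来到', '北京', '清华大学'],
}

stats = df_and_idf(docs)
print(stats['来到'])
print(stats['网易'])
```

With the logarithmic form, a word that occurs in every article gets an IDF of 0 and rare words get the largest weights, which is the opposite ordering of the plain DF ratio.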
Two ways to compute it are given below. When computing it, the flatMap has to carry the article ID along:

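Approach 1. flatMap the segmentation result into an RDD of (word, article ID) pairs and then count. This is also a chance to get familiar with combineByKey.

def flat_with_doc_id(data):
    """
    Carry the article ID along in the flatMap.
    Each distinct word is emitted once per article, so the count per word is a document count.
    :param data: (doc_id, text, tokens)
    :return: list of (word, doc_id)
    """
    return [(w, data[0]) for w in set(data[2])]
t = data.flatMap(flat_with_doc_id)\
    .combineByKey(lambda v: 1,         # createCombiner: first article seen for this word
                  lambda x, v: x + 1,  # mergeValue: another article in the same partition
                  lambda x, y: x + y)\
    .map(lambda x: (x[0], x[1]*1.0/brocast_count.value))

The following code prints the result:

print('t', t.collect())

which shows

t [('来到', 1.0), ('北京', 0.6), ('清华大学', 0.6), ('网易', 0.4), ('杭研', 0.4), ('大厦', 0.4)]

Each value is the share of the five articles that contain the word, so the DF calculation is correct.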
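combineByKey takes three functions: one that creates the initial accumulator from the first value seen for a key in a partition, one that merges a further value into that accumulator, and one that merges accumulators coming from different partitions. The plain-Python sketch below mimics that contract on two hand-made partitions. combine_by_key is a hypothetical helper written for illustration, not Spark's implementation.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Mimic the contract of RDD.combineByKey on a list of partitions of (key, value) pairs."""
    # Step 1: aggregate inside each partition.
    partials = []
    for part in partitions:
        acc = {}
        for k, v in part:
            acc[k] = merge_value(acc[k], v) if k in acc else create_combiner(v)
        partials.append(acc)
    # Step 2: merge the per-partition accumulators.
    result = {}
    for acc in partials:
        for k, c in acc.items():
            result[k] = merge_combiners(result[k], c) if k in result else c
    return result


# (word, doc_id) pairs, already de-duplicated per article, spread over two partitions.
partitions = [
    [('来到', '1'), ('北京', '1'), ('来到', '2'), ('网易', '2')],
    [('来到', '3'), ('北京', '3')],
]

doc_counts = combine_by_key(partitions,
                            lambda v: 1,
                            lambda x, v: x + 1,
                            lambda x, y: x + y)
print(doc_counts)
```

The accumulator here is a single integer per word, so only counts travel between partitions, never the article IDs themselves.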

Approach 2. In the end we want TF-IDF. If TF is computed first and then combined with the IDF from the step above, a join is unavoidable, and that shuffle is expensive in both time and resources, so it should be avoided where possible. Can the IDF be computed through RDD transformations on the TF data instead? It can. Below is a version I wrote; its efficiency seems acceptable to me: on 3 million texts, on a single machine with 16 GB of memory, it finishes within 1 hour.

def create_combiner(v):
    # v is (doc_id, tf): start the list of pairs with a document count of 1
    return ([v], 1)

def merge(x, v):
    # x ==> (list, count): append one more (doc_id, tf) pair within a partition
    x[0].append(v)
    return (x[0], x[1] + 1)

def merge_combine(x, y):
    # merge two partial results; list.extend works in place and returns None,
    # so its result must not be assigned back
    x[0].extend(y[0])
    return (x[0], x[1] + y[1])

def flat_map_2(line):
    # line ==> (token, (list of (doc_id, tf), document count))
    rst = []
    idf_value = line[1][-1] * 1.0 / brocast_count.value
    for p in line[1][0]:
        rst.append(Row(docId=p[0], token=line[0], tf_value=p[1], idf_value=idf_value, tf_idf_value=p[1] * idf_value))
    return rst

idf_rdd = tf_data.map(lambda x: (x[1][0], (x[0], x[1][1])))\
    .combineByKey(create_combiner,
                   merge,
                   merge_combine)\
    .flatMap(flat_map_2)
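A detail worth keeping in mind when merging list-valued combiners: list.extend changes the list in place and returns None, so its return value must not be assigned back to the list. The sketch below merges two (list, count) combiners of the shape used above. The sample pairs and the name merge_lists are illustrative.

```python
def merge_lists(x, y):
    """Merge two (pairs, count) combiners into one."""
    pairs = x[0] + y[0]  # builds a new list; x[0].extend(y[0]) would do the same in place
    return (pairs, x[1] + y[1])


# Two partial results for the same word, as they might arrive from two partitions.
a = ([('1', 1 / 3)], 1)
b = ([('3', 1 / 3), ('5', 1 / 3)], 2)

merged = merge_lists(a, b)
returned = [].extend([1])  # extend returns None, not the extended list

print(merged[1], len(merged[0]), returned)
```

The merge function is only called when a key has partial results in more than one partition, so a mistake here can go unnoticed on a small single-partition test file and only surface on larger data.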

Convert it to a DataFrame and show it:

    tf_idf_df = spark.createDataFrame(idf_rdd)
    tf_idf_df.show()
    tf_idf_df.printSchema()

which shows

+-----+---------+-------------------+------------------+-----+
|docId|idf_value|       tf_idf_value|          tf_value|token|
+-----+---------+-------------------+------------------+-----+
|    1|      1.0| 0.3333333333333333|0.3333333333333333|   来到|
|    2|      1.0|               0.25|              0.25|   来到|
|    3|      1.0| 0.3333333333333333|0.3333333333333333|   来到|
|    4|      1.0|               0.25|              0.25|   来到|
|    5|      1.0| 0.3333333333333333|0.3333333333333333|   来到|
|    1|      0.6|0.19999999999999998|0.3333333333333333|   北京|
|    3|      0.6|0.19999999999999998|0.3333333333333333|   北京|
|    5|      0.6|0.19999999999999998|0.3333333333333333|   北京|
|    1|      0.6|0.19999999999999998|0.3333333333333333| 清华大学|
|    3|      0.6|0.19999999999999998|0.3333333333333333| 清华大学|
|    5|      0.6|0.19999999999999998|0.3333333333333333| 清华大学|
|    2|      0.4|                0.1|              0.25|   网易|
|    4|      0.4|                0.1|              0.25|   网易|
|    2|      0.4|                0.1|              0.25|   杭研|
|    4|      0.4|                0.1|              0.25|   杭研|
|    2|      0.4|                0.1|              0.25|   大厦|
|    4|      0.4|                0.1|              0.25|   大厦|
+-----+---------+-------------------+------------------+-----+

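If all you need is DF, the first method is enough. Its combineByKey keeps only a running count per word instead of the values themselves. For a plain count like this, combineByKey and reduceByKey perform the same map-side aggregation (PySpark implements reduceByKey on top of combineByKey); the memory saving is relative to groupByKey, which collects every value of a key before reducing, and it becomes more noticeable as the data grows.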

Complete code:

# -*- coding: utf-8 -*-

"""
 Compute TF-IDF
 @Time    : 2019/2/18 18:03
 @Author  : MaCan ([email protected])
 @File    : text_transformator.py
"""

from pyspark.sql import SparkSession, Row
from pyspark.sql.types import *
from pyspark import SparkConf, SparkContext

# from spark_work.io_utils import mrp_hdfs_2016_path,mrp_hdfs_2017_path,mrp_hdfs_2018_path, user_dict_path

import jieba
import os
import re

# pattern that matches Chinese characters; used to filter out English tokens
remove_pattern = '[\u4e00-\u9fa5]+'


# if os.path.exists(user_dict_path):
#     try:
#         jieba.load_userdict(user_dict_path)
#     except Exception as e:
#         print(e)


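def seg(data):
    """
    Segment one text with jieba and return the tokens as a list.
    Single-character tokens and tokens that do not start with a
    Chinese character are dropped.
    :param data: raw text of one article
    :return: list of tokens
    """
    return [w for w in jieba.cut(data.strip(), cut_all=False) if len(w) > 1 and re.match(remove_pattern, w) is not None]


def flat_with_doc_id(data):
    """
    Carry the article ID along in the flatMap.
    Each distinct word is emitted once per article, so the count per word is a document count.
    :param data: (doc_id, text, tokens)
    :return: list of (doc_id, word)
    """
    return [(data[0], w) for w in set(data[2])]


def calc_tf(line):
    """
    Compute the tf of every word in one article.
    :param line: (doc_id, text, tokens)
    :return: list of (doc_id, (word, tf))
    """
    cnt_map = {}
    for w in line[2]:
        cnt_map[w] = cnt_map.get(w, 0) + 1
    # total number of tokens in this article (line[2] is the token list, line itself is a 3-tuple)
    lens = len(line[2])
    return [(line[0], (w, cnt * 1.0 / lens)) for w, cnt in cnt_map.items()]

def create_combiner(v):
    # v is (doc_id, tf): start the list of pairs with a document count of 1
    return ([v], 1)

def merge(x, v):
    # x ==> (list, count): append one more (doc_id, tf) pair within a partition
    x[0].append(v)
    return (x[0], x[1] + 1)

def merge_combine(x, y):
    # merge two partial results; list.extend works in place and returns None,
    # so its result must not be assigned back
    x[0].extend(y[0])
    return (x[0], x[1] + y[1])

def flat_map_2(line):
    # line ==> (token, (list of (doc_id, tf), document count))
    rst = []
    idf_value = line[1][-1] * 1.0 / brocast_count.value
    for p in line[1][0]:
        rst.append(Row(docId=p[0], token=line[0], tf_value=p[1], idf_value=idf_value, tf_idf_value=p[1] * idf_value))
    return rst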


if __name__ == '__main__':
    conf = SparkConf().setAppName('text_trans').setMaster("local[*]")
    # pass conf explicitly so that the app name and master take effect
    sc = SparkContext(conf=conf)
    sc.setLogLevel(logLevel='ERROR')
    spark = SparkSession.builder.config(conf=conf).getOrCreate()

    data = spark.read.text('test')
    count = data.count()
    brocast_count = sc.broadcast(count)

    data.show()
    data = data.rdd.map(lambda x: x[0].split(' ', 1))\
        .map(lambda x: (x[0], x[1], seg(x[1])))  # segmentation

    # drop the rows for which the filter predicate is False
    data = data\
        .map(lambda x: (x[0], '--' if len(x[2]) == 0 else x[1], x[2]))\
        .filter(lambda x: x[1] != '--')
    data.cache()  # x[0] ==> docId, x[1] ==> text, x[2] ==> tokens
    print('data', data.collect())
    # tf
    tf_data = data.flatMap(calc_tf)
    print(tf_data.collect())


    # df ratio per word (approach 1)
    t = data.flatMap(flat_with_doc_id)\
        .map(lambda x: (x[1], x[0]))\
        .combineByKey(lambda v: 1,
                      lambda x, v: x + 1,
                      lambda x, y: x+y)\
        .map(lambda x: (x[0], x[1]*1.0/brocast_count.value)).collect()
    print('t', t)
    print('*'*20)

    # tf, df ratio and their product in one pass (approach 2)
    idf_rdd = tf_data.map(lambda x: (x[1][0], (x[0], x[1][1])))\
        .combineByKey(create_combiner,
                       merge,
                       merge_combine)\
        .flatMap(flat_map_2)

    tf_idf_df = spark.createDataFrame(idf_rdd)
    tf_idf_df.show()
    tf_idf_df.printSchema()
    spark.stop()
