Python multiprocess jieba word segmentation: efficient segmentation with multiprocessing

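Natural language tasks often use jieba for word segmentation. How can it be sped up when the dataset is large? Segmentation is CPU-bound, so asyncio does not make jieba any faster, but multiprocessing works.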

import jieba
import jieba.analyse
import multiprocessing

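# Load custom dictionaries
jieba.load_userdict("user_dic.txt")
jieba.load_userdict("cate_group.txt")
# Note: this stop-word list is used by jieba.analyse keyword extraction,
# not by jieba.cut below
jieba.analyse.set_stop_words('stopwords_v1.txt')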

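def process_text(text):
    # Segment in full mode (cut_all=True), which emits every dictionary word
    # found in the text, including overlapping ones
    words = jieba.cut(text, cut_all=True)

    # Drop tokens shorter than 2 or longer than 10 characters, and purely numeric tokens
    filtered_words = [w for w in words if 2 <= len(w) <= 10 and not w.isdigit()]

    # Return the filtered tokens
    return filtered_words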


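# The __main__ guard is needed on platforms that start workers with "spawn"
# (Windows, macOS); without it each worker would try to create its own pool
if __name__ == "__main__":
    # List of texts to process
    # texts = ["這是一段測試文本", "這是另一段測試文本"]
    texts = data["new_text"]  # data is loaded elsewhere; not shown in this post

    # Create the process pool (one worker per CPU core by default) and close it when done
    with multiprocessing.Pool() as pool:
        results = pool.map(process_text, texts)

    # Print the results
    print(results)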
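
For large inputs, passing a chunksize so that each task carries a batch of texts can cut down the communication overhead between processes. The following is only a minimal sketch of that pattern, not the code used in this post: `toy_cut` is a hypothetical stand-in for `jieba.cut` so the example runs without jieba or the dictionary files, and `segment_all` and the chunksize value of 64 are illustrative assumptions.

```python
import multiprocessing


def toy_cut(text):
    # Hypothetical stand-in for jieba.cut: splits on whitespace only
    return text.split()


def process_text(text):
    # Same filter as process_text in the post: keep tokens of length 2 to 10
    # that are not purely numeric
    return [w for w in toy_cut(text) if 2 <= len(w) <= 10 and not w.isdigit()]


def segment_all(texts, processes=None, chunksize=64):
    # imap yields results in input order; chunksize groups several texts
    # into one task so fewer messages cross the process boundary
    with multiprocessing.Pool(processes=processes) as pool:
        return list(pool.imap(process_text, texts, chunksize=chunksize))


if __name__ == "__main__":
    texts = ["coal valuation 2023 a", "medical device industry"] * 100
    results = segment_all(texts, processes=2, chunksize=50)
    print(len(results), results[0])
```

To use it with real data, swap `toy_cut(text)` for `jieba.cut(text, cut_all=True)`. The best chunksize depends on text length and core count, so it is worth timing a few values.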

Result:

[['估值', '有待', '修復', '煤炭', '平均', '市盈率', '美元'],
 ['國產',
  '醫療',
  '醫療器械',
  '器械',
  '行業',
  '發展',
  '迅速',
  '作爲',
  '國內',
  '最大',
  '醫療',
  '醫療器械',
  '器械',
  '企業',
  '基本',
  '一枝',
  '一枝獨秀',
  '獨秀'],
 ['今日', '上海', '現貨'],
 ['消息', '準備'],
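
The length and digit filter in process_text can be checked on its own, without jieba. This is a small sketch under that assumption: `keep_token` and `filter_tokens` are hypothetical helper names, and the sample tokens are made up for illustration rather than taken from the dataset.

```python
def keep_token(token, min_len=2, max_len=10):
    # True when the token is 2 to 10 characters long and not purely numeric
    return min_len <= len(token) <= max_len and not token.isdigit()


def filter_tokens(tokens):
    # Apply the rule to an iterable of tokens, preserving their order
    return [t for t in tokens if keep_token(t)]


sample = ["的", "醫療", "2023", "醫療器械", "a" * 11]
print(filter_tokens(sample))
```

Only the two-character and four-character words survive here: the single character is too short, "2023" is purely numeric, and the 11-character string is too long.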
