Asynchronous regex string replacement in Python with asyncio and re

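Natural language processing work often uses the re module for string replacement, but with a very large volume of text the job takes a long time. This post runs the replacements as asyncio tasks to try to speed the processing up.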
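As a quick illustration of the cleanup involved, the sketch below applies the same five substitutions that the script below uses, in the same order, to a few strings. The function name `clean_text` and the sample strings are assumptions made up for illustration; they do not come from the dataset.

```python
import re

def clean_text(text):
    """Apply the script's five substitutions, in the same order, to one string."""
    text = str(text)
    text = re.sub(r'\$[^$]*\$;', '', text)     # drop "$...$;" spans
    text = re.sub(r'[^\w\s]+', '', text)       # drop punctuation (CJK characters count as \w)
    text = re.sub(r'\s+', '', text)            # drop all whitespace
    text = re.sub(r'[a-zA-Z]{30,}', '', text)  # drop runs of 30 or more Latin letters
    text = re.sub(r'autoimg\w+', '', text)     # drop "autoimg" plus the word characters after it
    return text

# Hypothetical sample inputs, not taken from the real CSV.
samples = [
    "$示例股票(SH000000)$; 今天 大盤，漲了！",
    "hello, world!",
    "看圖 autoimg123 之後的文字",
]
for s in samples:
    print(repr(s), "->", repr(clean_text(s)))
```

The three samples come out as '今天大盤漲了', 'helloworld' and '看圖'. Because punctuation and whitespace are already gone when the autoimg rule runs, that rule deletes everything from "autoimg" to the end of the string, as the third sample shows. If only the token itself should go, moving that rule ahead of the whitespace step may be closer to the intent.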

import pandas as pd
import re
import asyncio

# Load the posts
data = pd.read_csv("guba_all_post_20230413.csv")

# Drop rows that contain missing values
data.dropna(inplace=True)



# Synchronous version (commented out):
# def replace_between_dollars(strings):
#     pattern = r'\$[^$]*\$;'
#     pattern1 = r'[^\w\s]+'
#     new_strings = []
#     for idx,text in enumerate(strings):
#         text = re.sub(pattern, '', text)
#         text = re.sub(pattern1, '', text)
#         text = re.sub(r'\s+', '', text)
#         new_strings.append(text)
        
#     return new_strings

# replace_between_dollars(data["text"])

# data["new_text"] = replace_between_dollars(data["text"])
# data[:50]

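# Spans of the form "$...$;"
pattern = r'\$[^$]*\$;'
# Anything that is neither a word character nor whitespace (punctuation, symbols)
pattern1 = r'[^\w\s]+'
async def replace_between_dollars(long_string):
    # Convert non-string cells (e.g. numbers) so re.sub accepts them
    text = str(long_string)
    text = re.sub(pattern, '', text)
    text = re.sub(pattern1, '', text)
    text = re.sub(r'\s+', '', text)  # remove all whitespace
    text = re.sub(r'[a-zA-Z]{30,}', '', text)  # remove runs of 30 or more Latin letters
    text = re.sub(r"autoimg\w+", "", text)  # remove "autoimg" and the word characters that follow

    return text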
 
async def main():
    # Schedule one task per row of the "text" column
    tasks = []
    for i in data["text"]:
        tasks.append(asyncio.create_task(replace_between_dollars(i)))
    # gather returns the results in the same order as the tasks
    matches_list = await asyncio.gather(*tasks)

    data["new_text"] = matches_list

    print(matches_list[:200])
    data.to_csv("guba_all_newtext_20230413.csv",index=False)


if __name__ == '__main__':
    asyncio.run(main())

Result:

['估值有待修復煤炭平均市盈率6倍3美元', '國產醫療器械行業發展迅速邁瑞作爲的國內最大的醫療器械企業基本一枝獨秀了', '今日上海現貨鉬價', '出消息了準備套人', '你爺爺要紅了', '買個了鬼半年多了沒一點長進而且還跌', '沒有萬手哥55過不去', '今天972抄底了感覺大盤要怕怕的明天希望你給給機會出來', '可從研究開放式基金入手如010379013626005108010341等', '明570收']
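One point worth keeping in mind: re.sub is CPU-bound, and a coroutine that never awaits runs to completion once it is scheduled, so the tasks in the script above execute one after another on a single thread. A minimal sketch of one way to spread the work over several cores while staying inside asyncio is to hand chunks of rows to a process pool through `loop.run_in_executor`. The names `clean_chunk`, `clean_all`, `clean_with_processes`, `chunk_size` and `workers` are hypothetical, and the chunk size is an arbitrary assumption, not a tuned value.

```python
import asyncio
import re
from concurrent.futures import ProcessPoolExecutor

# Same substitutions as the script, compiled once at import time.
RULES = [
    re.compile(r'\$[^$]*\$;'),
    re.compile(r'[^\w\s]+'),
    re.compile(r'\s+'),
    re.compile(r'[a-zA-Z]{30,}'),
    re.compile(r'autoimg\w+'),
]

def clean_chunk(texts):
    """Clean a list of strings; this is the function that runs inside a worker."""
    cleaned = []
    for text in texts:
        text = str(text)
        for rule in RULES:
            text = rule.sub('', text)
        cleaned.append(text)
    return cleaned

async def clean_all(texts, executor=None, chunk_size=10_000):
    """Split texts into chunks and clean each chunk in the given executor.

    With executor=None the loop's default thread pool is used, which does not
    run the regex work in parallel; pass a ProcessPoolExecutor for that.
    """
    texts = list(texts)
    loop = asyncio.get_running_loop()
    chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]
    futures = [loop.run_in_executor(executor, clean_chunk, chunk) for chunk in chunks]
    results = await asyncio.gather(*futures)  # chunk order is preserved
    return [text for chunk in results for text in chunk]

def clean_with_processes(texts, workers=4):
    """Hypothetical entry point: clean chunks in parallel worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return asyncio.run(clean_all(texts, executor=pool))
```

In a script this could be called as `data["new_text"] = clean_with_processes(data["text"])` from inside an `if __name__ == '__main__':` block, which process pools need on platforms that start workers by re-importing the main module. Whether this is faster than a plain loop depends on the data size, since sending the strings to and from the workers has its own cost.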
