How to Extract Chinese Strings from Source Code

Original

2020-06-25 08:50

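Preface

Ideally, the Chinese strings in a program are stored in a separate file (such as JSON) and read from there, but in practice they are very often written directly into the code. To pull those strings out, we then have to hunt for them one by one, or rely on the string search built into the IDE. Searching in the IDE works, but collecting the results still means clicking through every match, which is tedious. This post shows how to extract Chinese strings with code instead.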

Regular Expression

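Chinese strings can be extracted with a regular expression.

The specific pattern is [\u4e00-\u9fa5], a character class that matches a single character in the range U+4E00 to U+9FA5, which covers the commonly used Chinese characters.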
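As a minimal sketch of how this character class behaves in Python, the snippet below pulls runs of consecutive Chinese characters out of a line of text. The helper name `find_chinese_runs` and the sample string are assumptions made for illustration. Note that full-width punctuation such as `，` lies outside the range, so it splits a sentence into separate runs.

```python
import re

# One or more consecutive characters in the range U+4E00..U+9FA5.
CHINESE_RUN = re.compile(r'[\u4e00-\u9fa5]+')


def find_chinese_runs(text):
    """Return every run of consecutive Chinese characters found in text."""
    return CHINESE_RUN.findall(text)


if __name__ == '__main__':
    sample = 'alert("保存成功"); // 提示'
    print(find_chinese_runs(sample))
```

Adding `+` after the character class is a design choice: without it, `findall` would return the characters one at a time instead of as whole runs.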

Extracting Chinese Strings with Python

Here is the code:

Environment

Python version: 3.7

IDE: Spyder

Code


# -*- coding: utf-8 -*-
import os
import re

# Matches a single Chinese character
CHINESE = re.compile(r'[\u4e00-\u9fa5]')

# Every distinct entry collected so far, in order of first appearance
list1 = []


def file_name(file_dir):
    """Walk file_dir recursively and scan the files in each folder."""
    for root, dirs, files in os.walk(file_dir):
        getChineseStrings(root, files)


def getChineseStrings(root, files):
    """Collect the Chinese strings from every .js file in root."""
    for f in files:
        if f.endswith('.js'):
            # Join with os.path.join so the separator between folder and file name is included
            path = os.path.join(root, f)
            print(path)
            # Record the file path as well, so the final list shows where strings came from
            getNoRepeatList(path, list1)
            with open(path, encoding='UTF-8') as lines:
                for line in lines:
                    # Remove // comments
                    line = re.sub(r'//.*$', '', line)
                    # Remove /* ... */ comments that open and close on this line
                    line = re.sub(r'/\*.*\*/', '', line)
                    # Remove an opening /* and everything to its right
                    line = re.sub(r'/\*.*$', '', line)
                    # Remove a closing */ and everything to its left
                    line = re.sub(r'.*\*/', '', line)
                    # Remove a * and everything after it
                    line = re.sub(r'\*.*$', '', line)
                    # Find Chinese strings between double quotes
                    findPart(r'".*[\u4e00-\u9fa5]+.*"', line)
                    # Remove the double-quoted Chinese strings
                    line = re.sub(r'".*[\u4e00-\u9fa5]+.*"', '', line)
                    # Find Chinese strings between single quotes
                    findPart(r"'.*[\u4e00-\u9fa5]+.*'", line)
                    # Remove the single-quoted Chinese strings
                    line = re.sub(r"'.*[\u4e00-\u9fa5]+.*'", '', line)
                    # Find Chinese strings between > and <
                    findPart(r'>.*[\u4e00-\u9fa5]+.*<', line)


def findPart(regex, text):
    """Split each match on its delimiter and keep the parts that contain Chinese."""
    for r in re.findall(regex, text):
        if '"' in r:
            parts = r.split('"')
        elif "'" in r:
            parts = r.split("'")
        else:
            parts = re.split(r'>|<', r)
        for part in parts:
            if CHINESE.search(part):
                print(part)
                getNoRepeatList(part, list1)


def getNoRepeatList(i, lists):
    """Append i to lists unless it is already there."""
    if i not in lists:
        lists.append(i)


if __name__ == '__main__':
    file_name('paste the path of the files here')
    print("==========================================================================================this is no repeating list")
    for i in list1:
        print(i)

Code Walkthrough

The first function:
file_name
This function walks through the files in a folder, including its subfolders.
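The traversal can also be sketched on its own as a generator that yields the path of each matching file. This is an illustrative variant, not the function from the script: the name `iter_source_files` and the `suffix` parameter are assumptions introduced here.

```python
import os


def iter_source_files(file_dir, suffix='.js'):
    """Yield the full path of every file under file_dir whose name ends with suffix."""
    for root, _dirs, files in os.walk(file_dir):
        for name in files:
            if name.endswith(suffix):
                # os.path.join inserts the path separator that plain
                # string concatenation would leave out.
                yield os.path.join(root, name)


if __name__ == '__main__':
    # Hypothetical usage: list the .js files under the current folder.
    for path in iter_source_files('.'):
        print(path)
```

Separating the traversal from the extraction this way makes it easy to reuse the same walk for other file types by passing a different `suffix`.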

The second function:
getChineseStrings
This function extracts the Chinese strings and is the core of the whole script.

Taking JavaScript as an example, comments come in three main forms.

The first form
//這是註釋內容

The second form

/*這是註釋內容*/

The third form

/*
*這是註釋內容
 */
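Because the second and third forms can span several lines, one possible way to handle them is to strip block comments from the whole file text before splitting it into lines. The sketch below illustrates that idea under stated assumptions and is not the approach the script in this post takes: the helper name `strip_block_comments` is hypothetical, and the pattern does not recognise a `/*` that appears inside a string literal.

```python
import re

# /* ... */ comments; DOTALL lets '.' cross line breaks and the
# non-greedy '.*?' stops at the first closing '*/'.
BLOCK_COMMENT = re.compile(r'/\*.*?\*/', re.DOTALL)


def strip_block_comments(source):
    """Remove every /* ... */ comment from source, single-line or multi-line."""
    return BLOCK_COMMENT.sub('', source)


if __name__ == '__main__':
    source = (
        '/*這是註釋內容*/ var b = "內容";\n'
        '/*\n'
        ' *這是註釋內容\n'
        ' */\n'
        'var c = "標題";\n'
    )
    print(strip_block_comments(source))
```

The non-greedy quantifier matters here: a greedy `.*` with `re.DOTALL` would swallow everything between the first `/*` and the last `*/` in the file, including the code in between.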

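To extract the Chinese strings, we first need to remove any strings that sit inside comments. Taking // as an example:
line =re.sub(r'//.*$','',line)
This replaces the // and everything after it on each line with '' (the empty string), so the later extraction steps automatically skip the strings in these comments.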
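To see the effect of this substitution in isolation, here is a small sketch. The function name `strip_line_comment` and the sample lines are assumptions made for illustration. One caveat worth knowing: the pattern also cuts off anything after a `//` that sits inside a string literal, such as a URL.

```python
import re


def strip_line_comment(line):
    """Replace '//' and everything after it on the line with an empty string."""
    return re.sub(r'//.*$', '', line)


if __name__ == '__main__':
    # The comment and its Chinese text are dropped; the quoted string survives.
    print(repr(strip_line_comment('var title = "首頁"; // 頁面標題')))
    # Caveat: the '//' inside the URL is treated as a comment start too.
    print(repr(strip_line_comment('var url = "http://example.com";')))
```

If the scanned files contain URLs or other strings with `//` in them, it may be safer to extract the quoted strings first and strip comments afterwards, or to use a real JavaScript tokenizer.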
