Program analysis: copying files out of folders with Python (files whose content matches a chosen pattern)

Original

献世online

2020-06-16 09:38

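Deep learning hardly needs an introduction by now, and anyone who works on it knows that it largely comes down to train, train, train... Training needs a dataset, and preparing the data is laborious and time-consuming. During my experiments I ran into the following requirement:

1) Extract the files that contain a certain "keyword".

2) The dataset is laid out as follows:

There are 40 folders here, each containing 1000 files.

With a dataset like this, we may need only a few files from each folder, and we have to find the ones that are actually useful to us. Checking the files one by one would take far too much time and effort. The rest of this post describes how to solve this problem.

(1) Read each file and match its content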

import re


def readFile(filepath):
    """Return filepath if any line of the file contains 'call.value', else None."""
    contentre = re.compile(r'call\.value')  # escape the dot so that it matches a literal '.'
    with open(filepath) as fp:  # the with block closes the file on every path
        # Match line by line and stop at the first hit.
        for line in fp:
            if contentre.search(line):
                return filepath
    return None
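
As a quick check of the matching step, the sketch below writes two throwaway files to a temporary directory and scans each one line by line. The helper name `file_contains` and the sample file contents are assumptions made for illustration and are not part of the original script. It also shows why the dot in the pattern is worth escaping: an unescaped `.` matches any character, so `call.value` also matches `call_value`.

```python
import os
import re
import tempfile


# Hypothetical helper for illustration; the article's function is readFile.
def file_contains(filepath, pattern):
    """Return True if any line of the file matches the regular expression."""
    regex = re.compile(pattern)
    with open(filepath, encoding="utf-8") as fp:
        return any(regex.search(line) for line in fp)


with tempfile.TemporaryDirectory() as tmp:
    hit = os.path.join(tmp, "hit.txt")
    miss = os.path.join(tmp, "miss.txt")
    # Made-up sample contents.
    with open(hit, "w", encoding="utf-8") as fp:
        fp.write("line one\nmsg.sender.call.value(amount)();\n")
    with open(miss, "w", encoding="utf-8") as fp:
        fp.write("uint call_value = 0;\n")

    escaped_hit = file_contains(hit, r"call\.value")
    escaped_miss = file_contains(miss, r"call\.value")
    unescaped_miss = file_contains(miss, r"call.value")  # '.' also matches '_'

print(escaped_hit, escaped_miss, unescaped_miss)
```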

(2) Filter out the empty (None) entries

def getValidData(filelist=None):
    """Drop the None entries from the list."""
    if filelist is None:  # avoid a mutable default argument
        filelist = []
    return [fp for fp in filelist if fp is not None]
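
The filtering step amounts to dropping `None` from a list. A minimal sketch, with made-up paths standing in for the values that `readFile` returns:

```python
# Made-up results: readFile returns the path on a match and None otherwise.
results = ["a/1.txt", None, "b/7.txt", None]

# Keep only the entries that are not None.
valid = [path for path in results if path is not None]

print(valid)
```

Comparing with `is not None` rather than relying on truthiness keeps the intent explicit: only missing results are dropped.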

(3) Copy a file (shutil.copy)

import shutil


def copyFile(src_file):
    dst = "/media/jion1/data/train_data"  # destination folder (must already exist)
    shutil.copy(src_file, dst)  # shutil.copy copies the file into the destination folder
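
One thing to watch when copying from many folders into a single one: `shutil.copy` overwrites an existing file with the same name in the destination. The post does not say whether file names repeat across the 40 folders, so the following is only a sketch of one possible safeguard, assuming they might. The function name `copy_with_parent_prefix` and the sample folder names are hypothetical.

```python
import os
import shutil
import tempfile


# Hypothetical variant of the copy step; the destination is a parameter here.
def copy_with_parent_prefix(src_file, dst_dir):
    """Copy src_file into dst_dir as '<parent folder>_<file name>'."""
    os.makedirs(dst_dir, exist_ok=True)  # create the destination if missing
    parent = os.path.basename(os.path.dirname(src_file))
    target = os.path.join(dst_dir, parent + "_" + os.path.basename(src_file))
    shutil.copy(src_file, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    # Two source folders that both contain a file called 1.txt.
    for folder in ("dir_a", "dir_b"):
        os.makedirs(os.path.join(tmp, folder))
        with open(os.path.join(tmp, folder, "1.txt"), "w") as fp:
            fp.write(folder)

    dst = os.path.join(tmp, "train_data")
    copy_with_parent_prefix(os.path.join(tmp, "dir_a", "1.txt"), dst)
    copy_with_parent_prefix(os.path.join(tmp, "dir_b", "1.txt"), dst)
    copied = sorted(os.listdir(dst))

print(copied)
```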

(4) Collect the matching files and copy them

import os
import re


class FileException(Exception):
    """Raised when no .txt file is found (not defined in the original snippet)."""


def getDirList(filepath):
    """Walk filepath, collect the .txt files whose content matches, and copy them."""
    txtre = re.compile(r'.+\.txt$')  # only scan files whose names end in .txt
    txtlist = []  # every .txt file found
    needfile = []  # readFile results: a path on a match, otherwise None
    for parent, listdir, listfile in os.walk(filepath):
        for files in listfile:
            if txtre.match(files):
                filecontext = os.path.join(parent, files)
                txtlist.append(filecontext)
                needfile.append(readFile(filecontext))

    if not needfile:
        raise FileException("no file can be found!")

    validatedata = getValidData(needfile)
    print(validatedata)  # print the paths of the matching files

    # Copy every matching file.
    for src_file in validatedata:
        copyFile(src_file)

    print('total file %s , validate file %s.' % (len(txtlist), len(validatedata)))
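
For comparison, the whole pipeline can also be sketched in a few lines with `pathlib`. This is not the author's implementation: the function name `copy_matching`, its parameters, and the sample files are all assumptions. It also reads each file in one go and uses a plain substring test instead of a line-by-line regular expression.

```python
import shutil
import tempfile
from pathlib import Path


# Hypothetical helper; the name and parameters are illustrative only.
def copy_matching(src_root, dst_dir, keyword="call.value"):
    """Copy every .txt file under src_root whose text contains keyword."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    scanned = 0
    copied = []
    for path in sorted(Path(src_root).rglob("*.txt")):
        scanned += 1
        # Plain substring test instead of a regular expression.
        if keyword in path.read_text(encoding="utf-8", errors="ignore"):
            shutil.copy(path, dst / path.name)
            copied.append(path.name)
    return scanned, copied


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "dataset"
    # Made-up sample files; d2/d.md is skipped because it is not a .txt file.
    samples = {"d1/a.txt": "x.call.value(1)();", "d1/b.txt": "nothing here",
               "d2/c.txt": "addr.call.value(v)();", "d2/d.md": "call.value"}
    for rel, text in samples.items():
        (root / rel).parent.mkdir(parents=True, exist_ok=True)
        (root / rel).write_text(text, encoding="utf-8")

    scanned, copied = copy_matching(root, Path(tmp) / "out")

print(scanned, copied)
```

Keeping the destination outside the scanned tree, as here, avoids picking up the copies on a later run.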

In this way, the files you need can be picked out of a huge dataset.

The complete, runnable code is at: https://github.com/Messi-Q/SmartContract-Detection-Based-DeepLearning/blob/master/tools/find_call_value.py
