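Splitting Your Own Dataset

0 Overview

I have recently been using yolov4 to detect whether safety helmets are being worn. The dataset was labeled by several team members, and during labeling some near-duplicate images were left unlabeled but were never deleted. As a result, the number of labels does not match the number of images (9608 images, 9263 labels). To make it easier to split the dataset into a training set and a test set, I decided to delete the unlabeled images first and then do the split.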

1 Deleting Unlabeled Images

1.1 Counting the images and labels

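import os

# Get the names and the count of files of a given type in a given folder
def getFilsNum(path_dir, type_files):
    # step0: check that the folder exists and is a directory
    # step1: list the folder
    # step2: collect the files with the given extension
    # step3: return the names (without extension) and their count
    if not os.path.isdir(path_dir):
        print("{0} does not exist or is not a directory".format(path_dir))
        # return the same (names, count) shape as the normal path
        return [], 0

    num = 0
    files_name = []
    for path_file in os.listdir(path_dir):
        # splitext also handles names with no dot or with several dots
        file_name, postfix = os.path.splitext(path_file)
        if postfix == '.' + type_files:
            num += 1
            files_name.append(file_name)
    return files_name, num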


# step1: point at the two folders
path_images_dir = r'D:\Dataset\helmet\helmetAll2020\JPEGImages'
path_annotations_dir = r'D:\Dataset\helmet\helmetAll2020\Annotations'


# files_data and files_label are used to find which images or labels are redundant
files_data,length_imgs = getFilsNum(path_images_dir,'jpg')
files_label,length_ans = getFilsNum(path_annotations_dir,'xml')

Printed results:

print(length_ans)
# 9263
print(length_imgs)
# 9608

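Summary: the functions used in this step

(1) os.listdir(): returns the names of all entries (files and subdirectories) directly inside a directory. File names include the extension, which you can strip with os.path.splitext() or split().

(2) os.walk(): similar to listdir(), except that it also descends into subdirectories. For each directory it visits, it yields three values: root, dirs, files. root is the path of the current directory, dirs is a list of the subdirectories in it, and files is a list of the files in it.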
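
To make the difference concrete, here is a minimal sketch that builds a throwaway directory tree and lists it both ways. The helper names (list_flat, list_recursive) and the demo file names are made up for illustration; only os.listdir() and os.walk() come from the text above.

```python
import os
import tempfile

def list_flat(path_dir):
    """Names directly inside path_dir (files and subdirectories), sorted."""
    return sorted(os.listdir(path_dir))

def list_recursive(path_dir):
    """Paths of all files under path_dir, relative to it, sorted."""
    found = []
    for root, dirs, files in os.walk(path_dir):
        for name in files:
            found.append(os.path.relpath(os.path.join(root, name), path_dir))
    return sorted(found)

# Hypothetical demo tree: a.jpg, b.xml and sub/c.jpg
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.jpg", "b.xml", os.path.join("sub", "c.jpg")):
        open(os.path.join(tmp, rel), "w").close()
    flat = list_flat(tmp)       # one level only, "sub" shows up as an entry
    deep = list_recursive(tmp)  # every file, including the one inside "sub"
    print(flat)
    print(deep)
```

For a flat folder such as JPEGImages, os.listdir() is enough; os.walk() is only worth it when the images are spread over nested folders.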

1.2 Deleting the redundant images or labels

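# If the image and label counts differ, delete the redundant images or labels
# (here there are more images than labels, so the extra images are deleted)
redun_files = []
# If there are more images
if length_imgs > length_ans:
    labels = set(files_label)  # set lookup instead of scanning a list each time
    for file_ in files_data:
        if file_ not in labels:
            redun_files.append(file_)
    # List the image folder and delete every file whose name is in redun_files
    for path_file in os.listdir(path_images_dir):
        file_name, _ = os.path.splitext(path_file)
        if file_name in redun_files:
            # Delete the image
            os.remove(os.path.join(path_images_dir, path_file))

else: # There are at least as many labels as images
    images = set(files_data)
    for file_ in files_label:
        if file_ not in images:
            redun_files.append(file_)
    # List the annotation folder (not the image folder) and delete the extra labels
    for path_file in os.listdir(path_annotations_dir):
        file_name, _ = os.path.splitext(path_file)
        if file_name in redun_files:
            # Delete the label
            os.remove(os.path.join(path_annotations_dir, path_file))

print(len(redun_files))
# 345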
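
Comparing only the two counts works here because all the extras are on the image side. If both folders could contain unmatched files, a set difference in each direction finds them without relying on the counts. The following is a minimal sketch of that idea, not part of the original workflow; find_unmatched is a hypothetical helper and the stems in the demo are invented.

```python
def find_unmatched(image_stems, label_stems):
    """Return (stems that only have an image, stems that only have a label)."""
    images, labels = set(image_stems), set(label_stems)
    return sorted(images - labels), sorted(labels - images)

# Invented example: 001 has no label, 004 has no image
images_only, labels_only = find_unmatched(
    ["001", "002", "003"],
    ["002", "003", "004"],
)
print(images_only)  # ['001']
print(labels_only)  # ['004']
```

With the lists from section 1.1, the call would be find_unmatched(files_data, files_label). Printing the two lists before calling os.remove() gives a chance to review what is about to be deleted.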

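2 Splitting the Dataset

2.1 Creating the training and test set directories

I laid out the training and test (validation) directories like this: under cumtHelmet2020 there is a datasets directory for the images and a labels directory for the annotations, and each of them contains a train and a val subdirectory.

With the directory structure settled, the next step is to create those directories.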

# Create the directories
path_datasets_train = r'D:\Dataset\helmet\cumtHelmet2020\datasets\train'
path_datasets_val = r'D:\Dataset\helmet\cumtHelmet2020\datasets\val'
path_labels_train = r'D:\Dataset\helmet\cumtHelmet2020\labels\train'
path_labels_val = r'D:\Dataset\helmet\cumtHelmet2020\labels\val'
# exist_ok=True lets the script be rerun without failing on existing directories
os.makedirs(path_datasets_train, exist_ok=True)
os.makedirs(path_datasets_val, exist_ok=True)
os.makedirs(path_labels_train, exist_ok=True)
os.makedirs(path_labels_val, exist_ok=True)

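2.2 Splitting the dataset

My approach: first move 20% of the images, together with their labels, into the validation set. Whatever is left is the training set, so the remaining images and labels can then be moved to their destination in one pass.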

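# Split the dataset 8:2 (move the validation set first, then move the remaining files in one pass)
import random
import shutil
def partition_datasets(num_files_in_dir, path_dest, path_source,val_rate):
    files_source = os.listdir(path_source)
    if val_rate>0.5:
        print('Please set a validation ratio of at most 0.5')
        return []
    # Number of images to move
    num_part_imgs = int(val_rate*num_files_in_dir)
    # The population passed to random.sample must be a sequence (os.listdir returns a list)
    sample = random.sample(files_source,num_part_imgs)
    imgs_sample_list = []
    for name in sample:
        # print(os.path.join(path_source,name))
        shutil.move(os.path.join(path_source,name),os.path.join(path_dest,name))
        imgs_sample_list.append(os.path.splitext(name)[0])
    return imgs_sample_list

# length_imgs was counted before the unlabeled images were deleted, so recount here
num_imgs = len(os.listdir(path_images_dir))
imgs_list = partition_datasets(num_imgs,path_dest=path_datasets_val,path_source=path_images_dir,val_rate=0.2)

# Move the matching labels (validation labels first)
def partitioin_anno(imgs_list,path_dest, path_source):
    imgs_set = set(imgs_list)
    for path_label in os.listdir(path_source):
        label,_ = os.path.splitext(path_label)
        if label in imgs_set:
            # Move the label to its destination
            shutil.move(os.path.join(path_source,path_label),os.path.join(path_dest,path_label))
            # print(os.path.join(path_dest,path_label))

# Without this call the validation labels would stay behind and end up in the training set
partitioin_anno(imgs_list,path_dest=path_labels_val,path_source=path_annotations_dir)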

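Summary:

(1) random.sample(): draws a random sample without replacement. Its first argument is a sequence such as a list, and its second argument is the number of items to draw. It returns a new list containing the sampled items and leaves the original sequence unchanged.

(2) shutil.move(source, dest): source is the path of the file to move, and dest is the path it is moved to.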
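
Because random.sample() draws a different subset on every run, the split is not reproducible unless the random generator is seeded. Below is a minimal sketch of the same sample-then-move step with an optional seed, run against a temporary folder. The function move_random_subset, its seed parameter, and the demo file names are assumptions for illustration and do not appear in the original code.

```python
import os
import random
import shutil
import tempfile

def move_random_subset(path_source, path_dest, rate, seed=None):
    """Move a random fraction `rate` of the files in path_source to path_dest."""
    rng = random.Random(seed)                 # private generator, seeded if requested
    names = sorted(os.listdir(path_source))   # sorted so the same seed picks the same files
    chosen = rng.sample(names, int(rate * len(names)))
    os.makedirs(path_dest, exist_ok=True)
    for name in chosen:
        shutil.move(os.path.join(path_source, name), os.path.join(path_dest, name))
    return chosen

# Demo on ten empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src")
    dst = os.path.join(tmp, "val")
    os.makedirs(src)
    for i in range(10):
        open(os.path.join(src, f"{i:03d}.jpg"), "w").close()
    moved = move_random_subset(src, dst, rate=0.2, seed=0)
    remaining = os.listdir(src)
    in_val = os.listdir(dst)
    print(len(moved), len(remaining), len(in_val))  # 2 8 2
```

Sorting the listing before sampling matters because os.listdir() returns entries in arbitrary order, so a seed alone would not guarantee the same selection on every machine.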

2.3 Moving the training images and labels

# Move all remaining files into the training set
def moveLeftFile(path_source, path_dest):
    for path_file in os.listdir(path_source):
        shutil.move(os.path.join(path_source,path_file),os.path.join(path_dest,path_file))
    print("Move complete")

# Move the training images and labels
moveLeftFile(path_source = path_images_dir, path_dest = path_datasets_train)
moveLeftFile(path_source = path_annotations_dir, path_dest = path_labels_train)
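
After the move it can be worth checking that the split came out as intended. The sketch below counts the files in a train and a val folder, computes the validation fraction, and reports any name that appears in both. It is an optional check under my own assumptions, not a step from the original workflow; file_stems and summarize_split are hypothetical helpers and the demo uses placeholder files.

```python
import os
import tempfile

def file_stems(path_dir):
    """Set of file names in path_dir with the extension stripped."""
    return {os.path.splitext(name)[0] for name in os.listdir(path_dir)}

def summarize_split(path_train, path_val):
    """Count both splits, compute the validation fraction, list shared stems."""
    train, val = file_stems(path_train), file_stems(path_val)
    total = len(train) + len(val)
    return {
        "train": len(train),
        "val": len(val),
        "val_fraction": len(val) / total if total else 0.0,
        "overlap": sorted(train & val),  # should be empty for a clean split
    }

# Demo: eight placeholder training files and two validation files
with tempfile.TemporaryDirectory() as tmp:
    train_dir = os.path.join(tmp, "train")
    val_dir = os.path.join(tmp, "val")
    os.makedirs(train_dir)
    os.makedirs(val_dir)
    for i in range(8):
        open(os.path.join(train_dir, f"{i:03d}.jpg"), "w").close()
    for i in range(8, 10):
        open(os.path.join(val_dir, f"{i:03d}.jpg"), "w").close()
    summary = summarize_split(train_dir, val_dir)
    print(summary)
```

On the real data the call would be summarize_split(path_datasets_train, path_datasets_val) for the images, and the same with path_labels_train and path_labels_val for the labels; the two summaries should report identical counts.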
