CNN Project Prep: Extract, Transform, Load (ETL) for Images | PyTorch Series (Part 11)

By AI_study

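Extract, Transform, and Load (ETL) with PyTorch

Welcome back to this series on neural network programming with PyTorch. In this post, we will write the first code of Part 2 of the series.

We will demonstrate a very simple extract, transform, and load pipeline using torchvision, PyTorch's computer vision package for machine learning. Without further ado, let's get started.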

Project Overview

We will follow four steps in this project:

Prepare the data
Build the model
Train the model
Analyze the model's results

The ETL Process

In this post, we begin by preparing the data. To do so, we will follow what is known as the ETL process.

ETL: https://en.wikipedia.org/wiki/Extract,_transform,_load

Extract the data from a data source.
Transform the data into the desired format.
Load the data into a suitable structure.

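The ETL process can be thought of as a fractal process because it applies at many scales. It can be applied on a small scale, such as within a single program, or on a large scale, all the way up to the enterprise level, where large systems handle each individual part.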
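To make the three steps concrete at the smallest scale, here is a minimal sketch of ETL inside a single program, using only the Python standard library. The CSV string, the column names, and the function names are all hypothetical stand-ins chosen for illustration; they are not part of the PyTorch pipeline built later in this post.

```python
import csv
import io

# Hypothetical raw source: a small CSV string standing in for a file or web response.
RAW_CSV = "label,pixel_sum\n9,1024\n0,2048\n3,512\n"

def extract(raw):
    """Extract: read rows out of the raw source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows):
    """Transform: convert the string fields into (int, float) pairs."""
    return [(int(row["label"]), float(row["pixel_sum"])) for row in rows]

def load(records):
    """Load: put the records into a structure that is easy to access by label."""
    return {label: value for label, value in records}

table = load(transform(extract(RAW_CSV)))
print(table[9])  # 1024.0
```

The same three-stage shape shows up again below, where torchvision handles the extract and transform stages and a DataLoader handles the load stage.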

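If you would like to learn more about general data science pipelines, see the data science post, which covers them in more detail.

data science post: https://deeplizard.com/learn/video/d11chG7Z-xk

Once we have completed the ETL process, we are ready to begin building and training our deep learning model. PyTorch has some built-in packages and classes that make the ETL process quite easy.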

PyTorch Imports

We begin by importing all of the necessary PyTorch libraries.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F


import torchvision
import torchvision.transforms as transforms

This table describes each of these packages:

Package	Description
torch	The top-level PyTorch package and tensor library.
torch.nn	A subpackage that contains modules and extensible classes for building neural networks.
torch.optim	A subpackage that contains standard optimization operations like SGD and Adam.
torch.nn.functional	A functional interface that contains typical operations used for building neural networks like loss functions and convolutions.
torchvision	A package that provides access to popular datasets, model architectures, and image transformations for computer vision.
torchvision.transforms	An interface that contains common transforms for image processing.

Other Imports

The next imports are standard Python packages for data science:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


from sklearn.metrics import confusion_matrix
#from plotcm import plot_confusion_matrix


import pdb


torch.set_printoptions(linewidth=120)

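Note that pdb is the Python debugger, and the commented-out import is a local file that we will introduce in a later post for plotting the confusion matrix. The last line sets the print options for PyTorch print statements.

We are now ready to prepare our data.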

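Preparing Data with PyTorch

Our ultimate goal when preparing the data is to do the following (ETL):

Extract: get the Fashion-MNIST image data from the source.
Transform: put the data into tensor form.
Load: put the data into an object that makes it easy to access.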

For these purposes, PyTorch provides us with two classes:

Class	Description
torch.utils.data.Dataset	An abstract class for representing a dataset.
torch.utils.data.DataLoader	Wraps a dataset and provides access to the underlying data.

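An abstract class is a Python class that has methods we must implement, so we can create a custom dataset by creating a subclass that extends the functionality of the Dataset class.

Abstract classes: https://docs.python.org/3/library/abc.html

To create a custom dataset with PyTorch, we extend Dataset by creating a subclass that implements these required methods. Once we do, the new subclass can be passed to a PyTorch DataLoader object.

We will be using the Fashion-MNIST dataset that is built into the torchvision package, so we won't have to do this for our project. Just know that the built-in Fashion-MNIST dataset class is doing this behind the scenes.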

All subclasses of the Dataset class must override __len__, that provides the size of the dataset, and __getitem__, supporting integer indexing in range from 0 to len(self) exclusive.

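Specifically, there are two methods that must be implemented. The __len__ method returns the length of the dataset, and the __getitem__ method gets the element at a specific index location within the dataset.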
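Although this project does not need a custom dataset, a short sketch may help show what implementing the two methods looks like. The SquaresDataset class below is a hypothetical toy example, not something from torchvision: sample i is simply the pair (i, i squared).

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i**2)."""

    def __init__(self, n):
        self.x = torch.arange(n, dtype=torch.float32)
        self.y = self.x ** 2

    def __len__(self):
        # Size of the dataset.
        return len(self.x)

    def __getitem__(self, index):
        # Element at an integer index in the range [0, len(self)).
        return self.x[index], self.y[index]

dataset = SquaresDataset(10)

# Because the subclass implements both methods, it can be passed to a DataLoader.
loader = DataLoader(dataset, batch_size=4)
first_x, first_y = next(iter(loader))
print(len(dataset), first_x, first_y)
```

With shuffling left off, the first batch holds samples 0 through 3, so first_y contains their squares.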

The PyTorch Torchvision Package

The torchvision package gives us access to the following resources:

Datasets (like MNIST and Fashion-MNIST)
Models (like VGG16)
Transforms
Utils

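Computer Vision

All of these resources relate to deep learning computer vision tasks.

When we learned about the Fashion-MNIST dataset in an earlier post, we saw that the arXiv paper introducing it states that the authors intended it to be a drop-in replacement for the original MNIST dataset.

The idea is that frameworks like PyTorch can add Fashion-MNIST simply by changing the URL used to retrieve the data.

This is the case with PyTorch. The PyTorch FashionMNIST dataset simply extends the MNIST dataset and overrides the URLs.

Here is the class definition from PyTorch's torchvision source code: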

class FashionMNIST(MNIST):
    """`Fashion-MNIST <https://github.com/zalandoresearch/fashion-mnist>`_ Dataset.


    Args:
        root (string): Root directory of dataset where ``processed/training.pt``
            and  ``processed/test.pt`` exist.
        train (bool, optional): If True, creates dataset from ``training.pt``,
            otherwise from ``test.pt``.
        download (bool, optional): If true, downloads the dataset from the internet and
            puts it in root directory. If dataset is already downloaded, it is not
            downloaded again.
        transform (callable, optional): A function/transform that  takes in an PIL image
            and returns a transformed version. E.g, ``transforms.RandomCrop``
        target_transform (callable, optional): A function/transform that takes in the
            target and transforms it.
    """
    urls = [
        'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz',
        'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz',
        'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz',
        'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz',
    ]

Now let's see how we can take advantage of torchvision.

The PyTorch Dataset Class

To get an instance of the FashionMNIST dataset using torchvision, we just create one like so:

train_set = torchvision.datasets.FashionMNIST(
    root='./data'
    ,train=True
    ,download=True
    ,transform=transforms.Compose([
        transforms.ToTensor()
    ])
)

Note that the root argument used to be './data/FashionMNIST'. It has since changed due to torchvision updates.

We specify the following arguments:

Parameter	Description
root	The location on disk where the data is located.
train	Whether the dataset is the training set.
download	Whether the data should be downloaded.
transform	A composition of transformations that should be performed on the dataset elements.

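Since we want our images to be transformed into tensors, we use the built-in transforms.ToTensor() transformation, and since this dataset will be used for training, we name the instance train_set.

When we run this code for the first time, the Fashion-MNIST dataset is downloaded locally. Subsequent calls check for the data before downloading it, so we don't have to worry about double downloads or repeated network calls.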

The PyTorch DataLoader Class

To create a DataLoader wrapper for our training set, we do it like this:

train_loader = torch.utils.data.DataLoader(train_set
    ,batch_size=1000
    ,shuffle=True
)

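We just pass train_set as an argument. Now we can leverage the loader for tasks that would otherwise be fairly complicated to implement by hand:

batch_size (1000 in our case)
shuffle (True in our case)
num_workers (default is 0, which means the main process will be used)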
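The effect of these options can be sketched on synthetic data. Here a TensorDataset of 2,500 all-zero "images" is an assumed stand-in for train_set, chosen so that the example runs without a download and so that the last batch comes out smaller than batch_size.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for train_set: 2,500 fake 1x28x28 images with labels 0-9.
images = torch.zeros(2500, 1, 28, 28)
labels = torch.arange(2500) % 10
fake_train_set = TensorDataset(images, labels)

loader = DataLoader(
    fake_train_set,
    batch_size=1000,   # samples per batch
    shuffle=True,      # reshuffle the sample order every epoch
    num_workers=0,     # load data in the main process
)

# One pass over the loader yields the batches for one epoch.
batch_shapes = [tuple(batch_images.shape) for batch_images, _ in loader]
print(len(loader))    # number of batches per epoch
print(batch_shapes)
```

Shuffling changes which samples land in each batch but not the batch sizes: two full batches of 1000 and a final batch of 500, since the loader keeps the incomplete last batch by default.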

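ETL Summary

From an ETL perspective, we achieved the extract and the transform using torchvision when we created the dataset:

Extract: the raw data was extracted from the web.
Transform: the raw image data was transformed into a tensor.
Load: train_set was wrapped by (loaded into) the data loader, giving us access to the underlying data.

We should now have a good understanding of the torchvision module that PyTorch provides, and of how to use Datasets and DataLoaders from torch.utils.data to streamline ETL tasks.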

The original English article:

https://deeplizard.com/learn/video/8n-TGaBZnk4
