【轉載】常見醫療掃描圖像處理步驟

【文章來源】http://blog.csdn.net/u013635029/article/details/72957944

一、下載必要包

skimage需要更新到最新0.13版本，否則會報錯，ImportError: cannot import name label。

sudo pip install scikit-image  -U -i  https://pypi.tuna.tsinghua.edu.cn/simple

二、數據格式

2.1 dicom

當前醫療競賽官方數據集格式 data-science-bowl-2017。數據列表如下:

後綴爲 .dcm。

每個病人的一次掃描CT(scan)可能有幾十到一百多個dcm數據文件(slices)。可以使用 python的dicom包讀取，讀取示例代碼如下:

dicom.read_file('/data/lung_competition/stage1/7050f8141e92fa42fd9c471a8b2f50ce/498d16aa2222d76cae1da144ddc59a13.dcm')

一般使用其pixl_array數據

slices = [dicom.read_file(os.path.join(folder_name,filename)) for filename in os.listdir(folder_name)]
slices = np.stack([s.pixel_array for s in slices])

2.2 mhd格式

mhd格式是另外一種數據格式，來源於(LUNA2016)[https://luna16.grand-challenge.org/data/]。每個病人一個mhd文件和一個同名的raw文件。如下:

一個mhd通常有幾百兆，對應的raw文件只有1kb。mhd文件需要藉助python的SimpleITK包來處理。SimpleITK 示例代碼如下:

import SimpleITK as sitk
itk_img = sitk.ReadImage(img_file) 
img_array = sitk.GetArrayFromImage(itk_img) # indexes are z,y,x (notice the ordering)
num_z, height, width = img_array.shape        #heightXwidth constitute the transverse plane
origin = np.array(itk_img.GetOrigin())      # x,y,z  Origin in world coordinates (mm)
spacing = np.array(itk_img.GetSpacing())    # spacing of voxels in world coor. (mm)

需要注意的是，SimpleITK的img_array的數組不是直接的像素值，而是相對於CT掃描中原點位置的差值，需要做進一步轉換。轉換步驟參考 SimpleITK圖像轉換

2.3 查看CT掃描文件軟件

一個開源免費的查看軟件 mango

三、dicom格式數據處理過程

3.1 處理思路

首先，需要明白的是醫學掃描圖像(scan)其實是三維圖像，使用代碼讀取之後開源查看不同的切面的切片(slices)，可以從不同軸切割

如下圖展示了一個病人CT掃描圖中，其中部分切片slices

其次，CT掃描圖是包含了所有組織的，如果直接去看，看不到任何有用信息。需要做一些預處理,預處理中一個重要的概念是放射劑量，衡量單位爲HU(Hounsfield Unit)，下表是不同放射劑量對應的組織器官

Hounsfield Unit = pixel_value * rescale_slope + rescale_intercept

一般情況rescale slope = 1, intercept = -1024。
灰度值是pixel value經過重重LUT轉換得到的用來進行顯示的值，而這個轉換過程是不可逆的，也就是說，灰度值無法轉換爲ct值。只能根據窗寬窗位得到一個大概的範圍。 pixel value經過modality lut得到Hu，但是懷疑pixelvalue的讀取出了問題。dicom文件中存在（0028，0106）（0028，0107）兩個tag，分別是最大最小pixel value，可以用來檢驗你讀取的pixel value 矩陣是否正確。
LUT全稱look up table，實際上就是一張像素灰度值的映射表，它將實際採樣到的像素灰度值經過一定的變換如閾值、反轉、二值化、對比度調整、線性變換等，變成了另外一個與之對應的灰度值，這樣可以起到突出圖像的有用信息，增強圖像的光對比度的作用。
首先去除超過 -2000的pixl_array，CT掃描邊界之外的像素值固定爲-2000(dicom和mhd都是這個值)。第一步是設定這些值爲0，當前對應爲空氣（值爲0）

slices[slices == -2000] = 0

上表中肺部組織的HU數值爲-500,但通常是大於這個值，比如-320、-400。挑選出這些區域，然後做其他變換抽取出肺部像素點。抽取出這些特徵，最後將其存儲爲ndarray的npy格式，供給卷積神經網絡。

3.2 先載入必要的包

# -*- coding:utf-8 -*-
'''
this script is used for basic process of lung 2017 in Data Science Bowl
'''

import glob
import os
import pandas as pd
import SimpleITK as sitk

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import skimage, os
from skimage.morphology import ball, disk, dilation, binary_erosion, remove_small_objects, erosion, closing, reconstruction, binary_closing
from skimage.measure import label,regionprops, perimeter
from skimage.morphology import binary_dilation, binary_opening
from skimage.filters import roberts, sobel
from skimage import measure, feature
from skimage.segmentation import clear_border
from skimage import data
from scipy import ndimage as ndi
import matplotlib
#matplotlib.use('Agg')
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
import dicom
import scipy.misc
import numpy as np

DICOM是醫學圖像中標準文件，這些文件包含了諸多的元數據信息（比如像素尺寸，每個維度的一像素代表真實世界裏的長度）。如下代碼是載入一個掃描面，包含了多個切片(slices)，我們僅簡化的將其存儲爲python列表。數據集中每個目錄都是一個掃描面集（一個病人）。有個元數據域丟失，即Z軸方向上的像素尺寸，也即切片的厚度。所幸，我們可以用其他值推測出來，並加入到元數據中。

# Load the scans in given folder path
def load_scan(path):
    slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
    slices.sort(key = lambda x: int(x.ImagePositionPatient[2]))
    try:
        slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])
    except:
        slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)

    for s in slices:
        s.SliceThickness = slice_thickness

    return slices

3.3 像素轉換爲HU單元

有些掃描面有圓柱形掃描邊界，但是輸出圖像是正方形。邊界之外的像素值固定爲-2000,。第一步是設定這些值爲0，當前對應爲空氣（值爲0）。然後回到HU單元，乘以rescale比率並加上intercept(存儲在掃描面的元數據中)。

def get_pixels_hu(slices):
    image = np.stack([s.pixel_array for s in slices])
    # Convert to int16 (from sometimes int16), 
    # should be possible as values should always be low enough (<32k)
    image = image.astype(np.int16)
    # Set outside-of-scan pixels to 0
    # The intercept is usually -1024, so air is approximately 0
    image[image == -2000] = 0

# Convert to Hounsfield units (HU)
    for slice_number in range(len(slices)):

        intercept = slices[slice_number].RescaleIntercept
        slope = slices[slice_number].RescaleSlope

        if slope != 1:
            image[slice_number] = slope * image[slice_number].astype(np.float64)
            image[slice_number] = image[slice_number].astype(np.int16)

        image[slice_number] += np.int16(intercept)

    return np.array(image, dtype=np.int16)

可以查看病人的掃描HU值分佈情況

first_patient = load_scan(INPUT_FOLDER + patients[0])
first_patient_pixels = get_pixels_hu(first_patient)
plt.hist(first_patient_pixels.flatten(), bins=80, color='c')
plt.xlabel("Hounsfield Units (HU)")
plt.ylabel("Frequency")
plt.show()

3.4重新採樣

不同掃描面的像素尺寸、粗細粒度是不同的。這不利於我們進行CNN任務，我們可以使用同構採樣。
一個掃描面的像素區間可能是[2.5,0.5,0.5],即切片之間的距離爲2.5mm。可能另外一個掃描面的範圍是[1.5,0.725,0.725]。這可能不利於自動分析。

常見的處理方法是從全數據集中以固定的同構分辨率重新採樣。如果我們選擇，將所有的東西採樣爲1mmx1mmx1mm像素，我們可以使用3D卷積網絡

def resample(image, scan, new_spacing=[1,1,1]):
    # Determine current pixel spacing
    spacing = map(float, ([scan[0].SliceThickness] + scan[0].PixelSpacing))
    spacing = np.array(list(spacing))
    resize_factor = spacing / new_spacing
    new_real_shape = image.shape * resize_factor
    new_shape = np.round(new_real_shape)
    real_resize_factor = new_shape / image.shape
    new_spacing = spacing / real_resize_factor

    image = scipy.ndimage.interpolation.zoom(image, real_resize_factor, mode='nearest')

    return image, new_spacing

現在重新取樣病人的像素，將其映射到一個同構分辨率 1mm x1mm x1mm。

pix_resampled, spacing = resample(first_patient_pixels, first_patient, [1,1,1])

輸出肺部掃描的3D圖像方法

def plot_3d(image, threshold=-300):

    # Position the scan upright, 
    # so the head of the patient would be at the top facing the camera
    p = image.transpose(2,1,0)

    verts, faces = measure.marching_cubes(p, threshold)
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(111, projection='3d')
    # Fancy indexing: `verts[faces]` to generate a collection of triangles
    mesh = Poly3DCollection(verts[faces], alpha=0.1)
    face_color = [0.5, 0.5, 1]
    mesh.set_facecolor(face_color)
    ax.add_collection3d(mesh)
    ax.set_xlim(0, p.shape[0])
    ax.set_ylim(0, p.shape[1])
    ax.set_zlim(0, p.shape[2])
    plt.show()

打印函數有個閾值參數，來打印特定的結構，比如tissue或者骨頭。400是一個僅僅打印骨頭的閾值(HU對照表)

plot_3d(pix_resampled, 400)

3.5 輸出一個病人scans中所有切面slices

def plot_ct_scan(scan):
    '''
            plot a few more images of the slices
    :param scan:
    :return:
    '''
    f, plots = plt.subplots(int(scan.shape[0] / 20) + 1, 4, figsize=(50, 50))
    for i in range(0, scan.shape[0], 5):
        plots[int(i / 20), int((i % 20) / 5)].axis('off')
        plots[int(i / 20), int((i % 20) / 5)].imshow(scan[i], cmap=plt.cm.bone)

此方法的效果示例如下:

3.6 定義分割出CT切面裏面肺部組織的函數

def get_segmented_lungs(im, plot=False):

    '''
    This funtion segments the lungs from the given 2D slice.
    '''
    if plot == True:
        f, plots = plt.subplots(8, 1, figsize=(5, 40))
    '''
    Step 1: Convert into a binary image.
    '''
    binary = im < 604
    if plot == True:
        plots[0].axis('off')
        plots[0].set_title('binary image')
        plots[0].imshow(binary, cmap=plt.cm.bone)

    '''
    Step 2: Remove the blobs connected to the border of the image.
    '''
    cleared = clear_border(binary)
    if plot == True:
        plots[1].axis('off')
        plots[1].set_title('after clear border')
        plots[1].imshow(cleared, cmap=plt.cm.bone)

    '''
    Step 3: Label the image.
    '''
    label_image = label(cleared)
    if plot == True:
        plots[2].axis('off')
        plots[2].set_title('found all connective graph')
        plots[2].imshow(label_image, cmap=plt.cm.bone)
    '''
    Step 4: Keep the labels with 2 largest areas.
    '''
    areas = [r.area for r in regionprops(label_image)]
    areas.sort()
    if len(areas) > 2:
        for region in regionprops(label_image):
            if region.area < areas[-2]:
                for coordinates in region.coords:
                       label_image[coordinates[0], coordinates[1]] = 0
    binary = label_image > 0
    if plot == True:
        plots[3].axis('off')
        plots[3].set_title(' Keep the labels with 2 largest areas')
        plots[3].imshow(binary, cmap=plt.cm.bone)
    '''
    Step 5: Erosion operation with a disk of radius 2. This operation is
    seperate the lung nodules attached to the blood vessels.
    '''
    selem = disk(2)
    binary = binary_erosion(binary, selem)
    if plot == True:
        plots[4].axis('off')
        plots[4].set_title('seperate the lung nodules attached to the blood vessels')
        plots[4].imshow(binary, cmap=plt.cm.bone)
    '''
    Step 6: Closure operation with a disk of radius 10. This operation is
    to keep nodules attached to the lung wall.
    '''
    selem = disk(10)
    binary = binary_closing(binary, selem)
    if plot == True:
        plots[5].axis('off')
        plots[5].set_title('keep nodules attached to the lung wall')
        plots[5].imshow(binary, cmap=plt.cm.bone)
    '''
    Step 7: Fill in the small holes inside the binary mask of lungs.
    '''
    edges = roberts(binary)
    binary = ndi.binary_fill_holes(edges)
    if plot == True:
        plots[6].axis('off')
        plots[6].set_title('Fill in the small holes inside the binary mask of lungs')
        plots[6].imshow(binary, cmap=plt.cm.bone)
    '''
    Step 8: Superimpose the binary mask on the input image.
    '''
    get_high_vals = binary == 0
    im[get_high_vals] = 0
    if plot == True:
        plots[7].axis('off')
        plots[7].set_title('Superimpose the binary mask on the input image')
        plots[7].imshow(im, cmap=plt.cm.bone)

    return im

此方法每個步驟對圖像做不同的處理，效果如下:

3.7 肺部圖像分割

爲了減少有問題的空間，我們可以分割肺部圖像（有時候是附近的組織）。這包含一些步驟，包括區域增長和形態運算，此時，我們只分析相連組件。
步驟如下：

閾值圖像（-320HU是個極佳的閾值，但是此方法中不是必要）
處理相連的組件，以決定當前患者的空氣的標籤，以1填充這些二值圖像
可選：當前掃描的每個軸上的切片，選定最大固態連接的組織（當前患者的肉體和空氣），並且其他的爲0。以掩碼的方式填充肺部結構。
只保留最大的氣袋（人類軀體內到處都有氣袋）

def largest_label_volume(im, bg=-1):
    vals, counts = np.unique(im, return_counts=True)
    counts = counts[vals != bg]
    vals = vals[vals != bg]
    if len(counts) > 0:
        return vals[np.argmax(counts)]
    else:
        return None
def segment_lung_mask(image, fill_lung_structures=True):

    # not actually binary, but 1 and 2. 
    # 0 is treated as background, which we do not want
    binary_image = np.array(image > -320, dtype=np.int8)+1
    labels = measure.label(binary_image)

    # Pick the pixel in the very corner to determine which label is air.
    #   Improvement: Pick multiple background labels from around the patient
    #   More resistant to "trays" on which the patient lays cutting the air 
    #   around the person in half
    background_label = labels[0,0,0]

    #Fill the air around the person
    binary_image[background_label == labels] = 2


    # Method of filling the lung structures (that is superior to something like 
    # morphological closing)
    if fill_lung_structures:
        # For every slice we determine the largest solid structure
        for i, axial_slice in enumerate(binary_image):
            axial_slice = axial_slice - 1
            labeling = measure.label(axial_slice)
            l_max = largest_label_volume(labeling, bg=0)

            if l_max is not None: #This slice contains some lung
                binary_image[i][labeling != l_max] = 1

    binary_image -= 1 #Make the image actual binary
    binary_image = 1-binary_image # Invert it, lungs are now 1

    # Remove other air pockets insided body
    labels = measure.label(binary_image, background=0)
    l_max = largest_label_volume(labels, bg=0)
    if l_max is not None: # There are air pockets
        binary_image[labels != l_max] = 0

    return binary_image

查看切割效果

segmented_lungs = segment_lung_mask(pix_resampled, False)
segmented_lungs_fill = segment_lung_mask(pix_resampled, True)
plot_3d(segmented_lungs, 0)

我們可以將肺內的結構也包含進來（結節是固體），不僅僅只是肺部內的空氣

plot_3d(segmented_lungs_fill, 0)

使用掩碼時，要注意首先進行形態擴充(python的skimage的skimage.morphology)操作（即使用圓形kernel，結節是球體），參考 python形態操作。這會在所有方向（維度）上擴充掩碼。僅僅肺部的空氣+結構將不會包含所有結節，事實上有可能遺漏黏在肺部一側的結節（這會經常出現，所以建議最好是擴充掩碼）。

3.8 數據預處理

歸一化處理

當前的值範圍是[-1024,2000]。任意大於400的值都是我們所感興趣的，因爲它們都是不同反射密度下的骨頭。LUNA16競賽中常用來做歸一化處理的閾值集是-1000和400.以下代碼

#歸一化
MIN_BOUND = -1000.0
MAX_BOUND = 400.0

def normalize(image):
    image = (image - MIN_BOUND) / (MAX_BOUND - MIN_BOUND)
    image[image>1] = 1.
    image[image<0] = 0.
    return image

0值中心化
簡單來說就是所有像素值減去均值。LUNA16競賽中的均值大約是0.25.

不要對每一張圖像做零值中心化（此處像是在kernel中完成的）CT掃描器返回的是校準後的精確HU計量。不會出現普通圖像中會出現某些圖像低對比度和明亮度的情況

PIXEL_MEAN = 0.25

def zero_center(image):
    image = image - PIXEL_MEAN
    return image

3.9 存儲每個病人scan的所有slice肺部特徵

下述代碼爲main函數中整體過程

lung_head_dir = '/data/lung_competition/stage1/'
    img_save_head = '/data/lung_competition/roi_images/'
    patient_scans_list = [x for x in os.listdir(lung_head_dir) if os.path.isdir(os.path.join(lung_head_dir,x))]
    #patient_labels_file = '/data/lung_competition/stage1_labels.csv'
    #patient_labels = pd.read_csv(patient_labels_file)
    selem = ball(2)
    #print(patient_scans_list)
    for scan in patient_scans_list:
        scan_files = os.path.join(lung_head_dir,scan)
        ct_scan = read_ct_scan(scan_files)
        save_npy_path = os.path.join(img_save_head, scan)
        if os.path.exists(save_npy_path):
            os.mkdir(save_npy_path)

存儲病人scan的所有slice肺部特徵的關鍵代碼爲

segmented_ct_scan = segment_lung_from_ct_scan(ct_scan)
save_npy = save_npy_path+'.npy'
np.save(save_npy,segmented_ct_scan)
print('file %s saved ..'%save_npy)
plot_ct_scan(segmented_ct_scan)

四、mhd格式數據處理過程

4.1 處理思路

mhd的數據只是格式與dicom不一樣，其實質包含的都是病人掃描。處理MHD需要藉助SimpleIKT這個包，處理思路詳情可以參考Data Science Bowl2017的toturail Data Science Bowl 2017。需要注意的是MHD格式的數據沒有HU值，它的值域範圍與dicom很不同。
我們以LUNA2016年的數據處理流程爲例。參考代碼爲 LUNA2016數據切割

4.2 載入必要的包

import SimpleITK as sitk
import numpy as np
import csv
from glob import glob
import pandas as pd
file_list=glob(luna_subset_path+"*.mhd")
#####################
#
# Helper function to get rows in data frame associated 
# with each file
def get_filename(case):
    global file_list
    for f in file_list:
        if case in f:
            return(f)
#
# The locations of the nodes
df_node = pd.read_csv(luna_path+"annotations.csv")
df_node["file"] = df_node["seriesuid"].apply(get_filename)
df_node = df_node.dropna()
#####
#
# Looping over the image files
#
fcount = 0
for img_file in file_list:
    print "Getting mask for image file %s" % img_file.replace(luna_subset_path,"")
    mini_df = df_node[df_node["file"]==img_file] #get all nodules associate with file
    if len(mini_df)>0:       # some files may not have a nodule--skipping those 
        biggest_node = np.argsort(mini_df["diameter_mm"].values)[-1]   # just using the biggest node
        node_x = mini_df["coordX"].values[biggest_node]
        node_y = mini_df["coordY"].values[biggest_node]
        node_z = mini_df["coordZ"].values[biggest_node]
        diam = mini_df["diameter_mm"].values[biggest_node]

4.3 LUNA16的MHD格式數據的值

一直在尋找MHD格式數據的處理方法，對於dicom格式的CT有很多論文根據其HU值域可以輕易地分割肺、骨頭、血液等，但是對於MHD沒有這樣的參考。從LUNA16論壇得到的解釋是，LUNA16的MHD數據已經轉換爲HU值了，不需要再使用slope和intercept來做rescale變換了。此論壇主題下，有人提出MHD格式沒有提供pixel spacing(mm) 和 slice thickness(mm) ，而標準文件annotation.csv文件中結節的半徑和座標都是mm單位，最後確認的是MHD格式文件中只保留了體素尺寸以及座標原點位置，沒有保存slice thickness。如此來說，dicom纔是原始數據格式。

4.4 座標體系變換

MHD值得座標體系是體素，以mm爲單位。結節的位置是CT scanner座標軸裏面相對原點的mm值，需要將其轉換到真實座標軸位置，可以使用SimpleITK包中的 GetOrigin() GetSpacing()。圖像數據是以512x512數組的形式給出的。
座標變換如

相應的代碼處理如下:

k_img = sitk.ReadImage(img_file) 
img_array = sitk.GetArrayFromImage(itk_img) # indexes are z,y,x (notice the ordering)
center = np.array([node_x,node_y,node_z])   # nodule center
origin = np.array(itk_img.GetOrigin())      # x,y,z  Origin in world coordinates (mm)
spacing = np.array(itk_img.GetSpacing())    # spacing of voxels in world coor. (mm)
v_center =np.rint((center-origin)/spacing)  # nodule center in voxel space (still x,y,z ordering)

在LUNA16的標註CSV文件中標註了結節中心的X,Y,Z軸座標，但是實際取值的時候取的是Z軸最後三層的數組(img_array)。

下述代碼只提取了包含結節的最後三個slice的數據

i=0
for i_z in range(int(v_center[2])-1,int(v_center[2])+2):
    mask = make_mask(center,diam,i_z*spacing[2]+origin[2],width,height,spacing,origin)
    masks[i] = mask
    imgs[i] = matrix2int16(img_array[i_z])
    i+=1
np.save(output_path+"images_%d.npy" % (fcount) ,imgs)
np.save(output_path+"masks_%d.npy" % (fcount) ,masks)

4.5 查看結節

以下代碼用於查看原始CT和結mask

import matplotlib.pyplot as plt
imgs = np.load(output_path+'images_0.npy')
masks = np.load(output_path+'masks_0.npy')
for i in range(len(imgs)):
    print "image %d" % i
    fig,ax = plt.subplots(2,2,figsize=[8,8])
    ax[0,0].imshow(imgs[i],cmap='gray')
    ax[0,1].imshow(masks[i],cmap='gray')
    ax[1,0].imshow(imgs[i]*masks[i],cmap='gray')
    plt.show()
    raw_input("hit enter to cont : ")

4.6 包含結節位置信息的mhd格式數據特徵

參考一個開源代碼實現 UNET訓練肺組織結節分割

【轉載】常見醫療掃描圖像處理步驟

一、下載必要包

二、數據格式

三、dicom格式數據處理過程

四、mhd格式數據處理過程

如何使用 JS 判斷用戶是否處於活躍狀態

通過HPA+CronHPA組合應對業務複雜彈性伸縮場景

基於U-Net的眼底圖像血管分割實例

【轉載】使用U-Net分割方法進行癌症診斷（教程翻譯）

【轉載】常見醫療掃描圖像處理步驟

【轉載】【論文筆記】文本識別之CRNN

【每日學習】深度學習相關知識

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結