Image Processing: Extracting Tables from PDFs

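I have been working with financial data lately, and one of the harder parts is the tabular data inside financial statements. A lot of very useful information is packed into those tables, for example:

This one is a fairly regular table with neatly aligned rows and columns. The next one is a bit harder:

So, after a few days of experimenting, I put together a basically usable approach for cutting a table into cells.

Script: https://github.com/yfyvan/table_crop

The basic idea is as follows:

(1) Horizontal and vertical lines are located with kernel (morphological) operations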

    import cv2
    import numpy as np

    img = cv2.imread(file_name, 0)

    # Create the debug display windows
    for name in ['raw_img', 'img_vertical', 'img_horizontal', 'img_add', 'img_add2', 'img_point']:
        cv2.namedWindow(name, 0)
        cv2.resizeWindow(name, 1000, 800)
        cv2.moveWindow(name, 200, 20)

    """
    step 1: basic image processing
    """
    # Binarize
    _, img = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)

    if show_pro:
        cv2.imshow('raw_img', img)

    # Erosion and dilation
    rows, cols = img.shape

    # Horizontal pass: erode with a wide, 1-pixel-high kernel (thickens ink along x)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (cols // scale1, 1))
    img_vertical = cv2.erode(img, kernel, iterations=1)
    if show_pro:
        cv2.imshow("img_vertical", img_vertical)

    # Vertical pass: erode with a tall, 1-pixel-wide kernel (thickens ink along y)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, rows // scale2))
    img_horizontal = cv2.erode(img, kernel, iterations=1)
    if show_pro:
        cv2.imshow("img_horizontal", img_horizontal)

    # Overlay the two images to get the intersections
    img_add = cv2.add(img_vertical, img_horizontal)
    if show_pro:
        cv2.imshow("img_add", img_add)

    # Denoise the overlay (dilation removes small black specks)
    img_add2 = cv2.dilate(img_add, np.ones((3, 3)), iterations=1)
    if show_pro:
        cv2.imshow("img_add2", img_add2)

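The N x 1 and 1 x N kernels thicken the lines along the horizontal and the vertical direction respectively. Because the page is white with black ink, erosion (a minimum filter) makes the black strokes grow. After thickening, the image looks like this: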
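To see why an elongated kernel thickens lines, recall that the page is white (255) with black ink (0), and erosion replaces each pixel with the minimum of its neighborhood, so black spreads along the kernel's long axis. The following is a minimal, dependency-free sketch of that effect on a toy image. `erode_rows` and `erode_cols` are names made up for illustration, and they only approximate what `cv2.erode` does with an odd-sized rectangular kernel.

```python
def erode_rows(img, k):
    """Horizontal erosion: each pixel becomes the min of a k-wide window.

    img is a list of rows of grayscale values (0 = black, 255 = white).
    Pixels outside the image are ignored; k is assumed to be odd.
    """
    half = k // 2
    out = []
    for row in img:
        n = len(row)
        out.append([min(row[max(0, x - half):min(n, x + half + 1)])
                    for x in range(n)])
    return out


def transpose(img):
    """Swap rows and columns of a list-of-lists image."""
    return [list(col) for col in zip(*img)]


def erode_cols(img, k):
    """Vertical erosion, done as a horizontal erosion of the transpose."""
    return transpose(erode_rows(transpose(img), k))


# A 5x7 white page with a single black pixel in the middle.
page = [[255] * 7 for _ in range(5)]
page[2][3] = 0

wide = erode_rows(page, 3)  # the black pixel is smeared left and right
tall = erode_cols(page, 3)  # the black pixel is smeared up and down

print(wide[2])                   # [255, 255, 0, 0, 0, 255, 255]
print([row[3] for row in tall])  # [255, 0, 0, 0, 255]
```

With a kernel as long as `cols // scale1` or `rows // scale2`, the same smearing turns thin ruling lines into thick bands that are easy to detect.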

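At this point the horizontal and vertical lines are clearly visible. Next, let's find the coordinates of the points!

(2) Locating the coordinate points

The first instinct is straightforward: what we want are the intersections of the horizontal and vertical lines, so just use cv2.add: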

    # Overlay the two images to get the intersections
    img_add = cv2.add(img_vertical, img_horizontal)
    if show_pro:
        cv2.imshow("img_add", img_add)

    cv2.imwrite('tmp/add.png', img_add)

    # Denoise the overlay (dilation removes small black specks)
    img_add2 = cv2.dilate(img_add, np.ones((3, 3)), iterations=1)
    if show_pro:
        cv2.imshow("img_add2", img_add2)

After overlaying, it looks like this:

Now it is pretty obvious: the intersections stand out clearly, so all that remains is to find them.
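The reason the overlay isolates intersections is that cv2.add saturates at 255 for 8-bit images, so a pixel stays black (0) only where both inputs are black. Here is a small pure-Python sketch of that arithmetic; `saturating_add` is an illustrative stand-in, not an OpenCV function.

```python
def saturating_add(a, b):
    """Per-pixel a + b clipped to 255, as cv2.add does for uint8 images."""
    return [[min(pa + pb, 255) for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


# 0 = black, 255 = white
horizontal_band = [
    [255, 255, 255],
    [0, 0, 0],
    [255, 255, 255],
]
vertical_band = [
    [255, 0, 255],
    [255, 0, 255],
    [255, 0, 255],
]

overlap = saturating_add(horizontal_band, vertical_band)
for row in overlap:
    print(row)
# [255, 255, 255]
# [255, 0, 255]
# [255, 255, 255]
```

Only the center pixel, where the two bands cross, survives as black.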

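(3) Finding the best line coordinates

Here, based on the result in figure 2, we collect every possible horizontal-line coordinate. I define the rule as: a row that contains config.horizontal_threshold = 300 consecutive black pixels is a horizontal-line position.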

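    y_possible = []
    for index in range(img.shape[0]):
        row = img_horizontal[index]
        # +1 so that the window ending at the last pixel is checked too
        for i in range(len(row) - horizontal_threshold + 1):
            # .any() is False only when every pixel in the window is 0 (black)
            if not row[i: i + horizontal_threshold].any():
                y_possible.append(index)
                break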

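The values in y_possible are, at one-pixel resolution, all the coordinates that might be horizontal lines (some are false positives, because cells packed with dense data turn into noise during erosion and dilation).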
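The same "row contains a long enough black run" test can also be written as a single pass that tracks the current run length, which avoids re-scanning a 300-pixel window at every offset. This is a sketch of that variant on plain lists; `has_black_run` and `candidate_rows` are hypothetical helper names, not functions from the repository.

```python
def has_black_run(row, min_len):
    """Return True if row contains at least min_len consecutive 0 pixels."""
    run = 0
    for value in row:
        run = run + 1 if value == 0 else 0
        if run >= min_len:
            return True
    return False


def candidate_rows(img, min_len):
    """Indices of the rows that contain a black run of at least min_len."""
    return [y for y, row in enumerate(img) if has_black_run(row, min_len)]


toy = [
    [255, 255, 255, 255, 255, 255],
    [255, 0, 0, 0, 0, 255],    # run of 4
    [0, 0, 255, 0, 0, 255],    # two runs of 2
]
print(candidate_rows(toy, 3))  # [1]
print(candidate_rows(toy, 2))  # [1, 2]
```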

Next we pin down the position of the horizontal lines more precisely.

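    # Widen the y range to avoid one-sided suppression
    y_final = []
    tmp = []
    for i in range(len(y_possible) - 1):
        # Collect non-jumping candidates; a jump means the next horizontal line starts
        if y_possible[i + 1] - y_possible[i] < 3:
            tmp.append(i)
        else:
            # Guard: two jumps in a row leave tmp empty, which would raise IndexError below
            if not tmp:
                continue
            # Jump found: take the midpoint of the current run of consecutive candidates
            y_index = (tmp[-1] + tmp[0]) // 2
            # Keep 5 candidates before it and 10 after it (clamped so the slice cannot wrap)
            y_final.append(y_possible[max(y_index - 5, 0): y_index + 10])
            tmp = []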

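On the one hand, given the result of the previous step, when consecutive candidate coordinates are all black (pixel value 0; the code treats candidates less than 3 pixels apart as consecutive), we take them to mark the start of a thickened horizontal line, i.e. the coordinate of the first pixel row of the thick black stroke. On the other hand, once that position is found, we keep 5 values before it and 10 values after it as the input to the next step.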
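The grouping idea can be isolated into a small helper that splits sorted coordinates into clusters wherever the gap reaches a threshold. The sketch below is a simplified variant, not the author's exact logic: it returns the midpoint of each cluster instead of a widened window, and it also keeps the last cluster. The names `group_candidates` and `cluster_centers` are made up for illustration.

```python
def group_candidates(ys, max_gap=3):
    """Split sorted y coordinates into clusters; a gap >= max_gap starts a new one."""
    groups = []
    for y in ys:
        if groups and y - groups[-1][-1] < max_gap:
            groups[-1].append(y)
        else:
            groups.append([y])
    return groups


def cluster_centers(ys, max_gap=3):
    """Midpoint (first + last) // 2 of every cluster."""
    return [(g[0] + g[-1]) // 2 for g in group_candidates(ys, max_gap)]


ys = [10, 11, 12, 13, 57, 58, 59, 120]
print(group_candidates(ys))  # [[10, 11, 12, 13], [57, 58, 59], [120]]
print(cluster_centers(ys))   # [11, 58, 120]
```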

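Next we select the best horizontal line. Why are the candidates not already optimal? Look:

When the lines are overlaid there can be cells nested inside cells, and for those only half of the overlaid line is black. As shown above, we might pick 5 candidate lines of which only the 3rd is correct, although any line between the 3rd and the 5th is usable.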

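    # Locate x
    y_final_cnt = []
    # Add the region around the bottom border (shape is presumably img.shape)
    y_final.append(list(range(shape[0] - 30, shape[0])))
    # Drop empty regions up front so y_final stays index-aligned with y_final_cnt
    y_final = [y_scope for y_scope in y_final if y_scope]

    # In each region, find the first line with the largest number of intersections
    # (the second or third would do as well; in theory the mean of the first and
    #  last is the best choice, but the logic gets too complicated...)
    # (using the first line shifts the bottom edge of the boxes upward as a whole)
    for y_scope in y_final:
        cnt = 0
        y_scope_tmp = []
        for y_index in y_scope:
            y_value = img_add2[y_index]
            # Count black -> non-black transitions along this row
            for flag_index in range(len(y_value) - 1):
                if y_value[flag_index] == 0 and y_value[flag_index + 1] != 0:
                    cnt += 1
            y_scope_tmp.append(cnt)
            cnt = 0
        y_final_cnt.append(y_scope_tmp)
    # Take the index of the first maximum in each region
    y_final_cnt = list(filter(lambda x: x != [], y_final_cnt))
    max_cnt_indexs = list(map(lambda x: x.index(max(x)), y_final_cnt))
    # Look up the y coordinate that belongs to that index
    y_cors = [y_final[y_cor_index][max_cnt_indexs[y_cor_index]] for y_cor_index in range(len(max_cnt_indexs))]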

This step is defined as: among the candidate lines, choose the first one that has the most black intersections.
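As an illustration of that rule, the sketch below counts black-to-white transitions along each candidate row and returns the first row with the highest count. It works on plain lists, and `count_black_to_white` and `best_line` are hypothetical names used only here.

```python
def count_black_to_white(row):
    """Number of positions where a 0 pixel is followed by a non-zero pixel."""
    return sum(1 for left, right in zip(row, row[1:])
               if left == 0 and right != 0)


def best_line(candidates):
    """candidates: list of (y, row). Return the first y with the highest count."""
    best_y, best_cnt = None, -1
    for y, row in candidates:
        cnt = count_black_to_white(row)
        if cnt > best_cnt:  # strict '>' keeps the first maximum
            best_y, best_cnt = y, cnt
    return best_y


candidates = [
    (40, [255, 0, 255, 255, 255, 255]),  # 1 transition
    (41, [255, 0, 255, 0, 255, 255]),    # 2 transitions
    (42, [255, 0, 255, 0, 255, 255]),    # 2 transitions, but not the first
]
print(best_line(candidates))  # 41
```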

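The code above is exactly what selects that black line. At this point the best horizontal lines have been found.

Whether you are looking for horizontal or vertical lines the logic is the same, and it makes no difference which you do first.

(4, 5) Finding and assembling the intersections

With the horizontal lines found, we go back to the overlay image and, along each horizontal line, pick out the points whose pixel value is 0: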

    # Locate the x coordinates along each y
    x_cors = []
    for y in y_cors:
        x_cors_tmp = []
        xs = img_add2[y]
        xx, yy = list(zip(*enumerate(xs)))
        x_index = 0
        while x_index < len(xs):
            if xs[x_index] == 0:
                x_cors_tmp.append(x_index)
                # Skip ahead so one thick line yields a single point
                x_index += 20
                if x_index >= len(yy):
                    break
            x_index += 1
        x_cors.append(x_cors_tmp)

    # All candidate y coordinates
    # print(y_cors)
    # All candidate x coordinates for each y coordinate
    # print(x_cors)

    """
    step 4: assemble the coordinates of all candidate regions

    """
    cors = []

    # Assemble the coordinate points
    index = 0
    while index < len(y_cors):
        ys_tmp = y_cors[index]
        xs_tmp_list = x_cors[index]
        for xs_tmp in xs_tmp_list:
            cors.append((ys_tmp, xs_tmp))
        index += 1

Then we extract all the candidate intersection coordinates:

    cors = []

    # Assemble the coordinate points
    index = 0
    while index < len(y_cors):
        ys_tmp = y_cors[index]
        xs_tmp_list = x_cors[index]
        for xs_tmp in xs_tmp_list:
            cors.append((ys_tmp, xs_tmp))
        index += 1

    cors = sorted(set(cors))
    # (219, 368), (219, 694), (284, 368), (284, 694)

    # Split into rows
    new_cors = {}
    for cor in cors:
        if cor[0] not in new_cors:
            new_cors[cor[0]] = []
        new_cors[cor[0]].append(cor)

    new_cors = list(new_cors.values())

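Abstracted, the result looks like this:

We get every line intersection. The black points are wrong; we will deal with them later.

Then we assemble the four corner points of every candidate rectangle: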

    # Assemble every candidate rectangle
    i = 0
    result = []
    while i < len(new_cors) - 1:
        j = i + 1
        while j < len(new_cors):
            new_lis1, new_lis2 = combine(new_cors[i], new_cors[j])
            j += 1
            # Combinations along the y axis
            for ii in new_lis1:
                # Combinations along the x axis
                for jj in new_lis2:
                    if judge(ii, jj):
                        result.append([ii[0], ii[1], jj[0], jj[1], area([ii[0], ii[1], jj[0], jj[1]])])
        i += 1
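The helpers `combine`, `judge` and `area` live in the repository's utils and are not shown here, so the following is only a guess at the general idea, not their implementation. It assumes two rows of (y, x) points and builds one box for every pair of x coordinates that occurs in both rows; exact x matching is a simplifying assumption, and `boxes_between` is a made-up name.

```python
from itertools import combinations


def boxes_between(top_row, bottom_row):
    """Build candidate boxes from two rows of (y, x) intersection points.

    Each box is [top_left, top_right, bottom_left, bottom_right].
    Only x coordinates present in both rows are used (an assumption).
    """
    y_top = top_row[0][0]
    y_bottom = bottom_row[0][0]
    shared = sorted({x for _, x in top_row} & {x for _, x in bottom_row})
    return [[(y_top, x_left), (y_top, x_right),
             (y_bottom, x_left), (y_bottom, x_right)]
            for x_left, x_right in combinations(shared, 2)]


top = [(219, 368), (219, 694)]
bottom = [(284, 368), (284, 694)]
print(boxes_between(top, bottom))
# [[(219, 368), (219, 694), (284, 368), (284, 694)]]
```

A real implementation would likely tolerate a few pixels of drift between rows instead of requiring identical x values.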

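(6) Selecting cells with IoU + NMS

The basic idea: starting from one cell, compute the IoU between it and every other cell. Boxes whose IoU is greater than 0.08 are treated as the same cell, and NMS is used to discard the duplicates.

First, remove the black points: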

    next_result = []
    for i in result:
        # (y, x)
        p1, p2, p3, p4, area_ = i
        if p1[1] < 50:
            if (img_add2[p1[0], p1[1] + 1] == 0 or img_add2[p1[0], p1[1] - 1] == 0) and \
                    (img_add2[p3[0], p3[1] + 1] == 0 or img_add2[p3[0], p3[1] - 1] == 0):
                next_result.append([p1, p2, p3, p4, area_])
        else:
            next_result.append([p1, p2, p3, p4, area_])

    result = next_result

Second, IoU + NMS. Before NMS we sort, so that the smallest cells are guaranteed to be picked up:

    result = list(filter(lambda x: x[-1] >= min_anchor_size, result))
    # Sort by coordinates and cell area
    result = sorted(result, key=lambda x: (x[0][0], x[0][1], x[-1]))
    nms_result = nms(result, [], ratio=ratio)

Some of the hand-written helper logic lives in utils; see https://github.com/yfyvan/table_crop/blob/master/utils.py
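For readers who do not want to open the repository, here is a generic sketch of IoU plus a greedy NMS pass. It is not the `nms` function from utils (whose signature and internals may differ); boxes are assumed to be (y1, x1, y2, x2) tuples, and `iou` and `greedy_nms` are illustrative names.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (y1, x1, y2, x2)."""
    inter_h = min(a[2], b[2]) - max(a[0], b[0])
    inter_w = min(a[3], b[3]) - max(a[1], b[1])
    if inter_h <= 0 or inter_w <= 0:
        return 0.0
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def box_area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def greedy_nms(boxes, ratio=0.08):
    """Keep boxes in order; drop a box if it overlaps a kept one by more than ratio."""
    kept = []
    for box in boxes:
        if all(iou(box, k) <= ratio for k in kept):
            kept.append(box)
    return kept


cell_a = (0, 0, 10, 10)
cell_b = (0, 10, 10, 20)
merged = (0, 0, 10, 20)  # a spurious box spanning both cells

# Sort by top-left corner, then area, so the smaller cell comes first.
ordered = sorted([merged, cell_b, cell_a],
                 key=lambda b: (b[0], b[1], box_area(b)))
print(iou(cell_a, merged))  # 0.5
print(greedy_nms(ordered))  # [(0, 0, 10, 10), (0, 10, 10, 20)]
```

Sorting smallest-first matters here: if the merged box were visited first, it would suppress both real cells.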

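(7) Handling outliers

Finally we do some cleanup: remove outliers (wrong boxes caused by noise), then remove near-duplicates (similar boxes that survive because of the IoU threshold).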

    new_result = []
    for i in range(len(nms_result)):
        tmp = list(map(lambda x: x, nms_result[i][:4]))
        # (y1, x1), (y2, x2), (y3, x3), (y4, x4)
        if tmp[1][1] - tmp[0][1] < 40:
            continue
        new_result.append(tmp)

    img = cv2.imread(file_name)
    for i in new_result:
        cv2.rectangle(img, (i[0][1], i[0][0]), (i[-1][1], i[-1][0]), random.choice([(255, 0, 0), (0, 0, 255), (0, 255, 0)]), thickness=3)
    new_result = abno_data_detete(new_result)
    new_result = simi_data_delete(new_result)

(8) Cropping the image

Using the corner coordinates generated above, crop out the individual cells:

    img = cv2.imread(file_name, 0)
    pics = []
    files = []
    index = 0
    if not os.path.exists(data_root + '/' + str(notice_id)):
        os.mkdir(data_root + '/' + str(notice_id))
    file_root = data_root + '/' + str(notice_id)

    for i in anchors:
        index += 1
        img2 = img[i[0][0]:i[2][0], i[0][1]:i[1][1]]
        new_file_name = '%s/%s_%s.png' % (file_root, index, str(i))
        cv2.imwrite(new_file_name, img2)
        pics.append(img2)
        files.append(new_file_name)
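The slicing above relies on the corner order [top_left, top_right, bottom_left, bottom_right] with each point stored as (y, x). The sketch below spells out that convention on a plain list-of-lists image; `crop_cell` is a made-up helper, and with a NumPy array the same crop is `img[top:bottom, left:right]`.

```python
def crop_cell(img, corners):
    """Crop one cell from a list-of-lists image.

    corners: [top_left, top_right, bottom_left, bottom_right], each (y, x).
    """
    (top, left), (_, right), (bottom, _), _ = corners
    return [row[left:right] for row in img[top:bottom]]


# A 5x6 toy image whose pixel value encodes its position (row * 10 + col).
toy_img = [[r * 10 + c for c in range(6)] for r in range(5)]
corners = [(1, 2), (1, 5), (3, 2), (3, 5)]

print(crop_cell(toy_img, corners))  # [[12, 13, 14], [22, 23, 24]]
```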

OVER

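A few things to note:

(1) Parameter tuning

a. config.scale2 = 20

This is the absolute scale value for vertical lines. It varies with the length of the table: long tables need a larger value, and anything like 80 or 100 is possible. You can derive a ratio from the image height and width and adjust it dynamically.

b. ratio = 0.08

The NMS decision threshold. It is a bit awkward to tune: use a larger value for dense tables and a smaller one for sparse tables, anywhere from 0.05 to 0.3.

c. min_anchor_size = 10000

The minimum cell area threshold. It necessarily differs with the pixel dimensions of the image.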
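One way to keep these knobs in a single place is a small config object. The defaults below are the values quoted in this post; the `TableCropConfig` class and the `dynamic_scale2` heuristic are hypothetical and untested on real documents. They only illustrate the idea of scaling with the height/width ratio.

```python
from dataclasses import dataclass


@dataclass
class TableCropConfig:
    scale2: int = 20                 # vertical-line scale
    ratio: float = 0.08              # NMS IoU threshold
    min_anchor_size: int = 10000     # minimum cell area in pixels
    horizontal_threshold: int = 300  # consecutive black pixels for a line


def dynamic_scale2(rows, cols, base=20):
    """Hypothetical heuristic: grow scale2 with the height/width ratio.

    Never goes below base, so square or wide pages keep the default.
    """
    return max(base, round(base * rows / cols))


cfg = TableCropConfig()
print(dynamic_scale2(1000, 1000))  # 20
print(dynamic_scale2(4000, 1000))  # 80

# The resulting vertical kernel height stays the same for both page shapes.
print(1000 // dynamic_scale2(1000, 1000))  # 50
print(4000 // dynamic_scale2(4000, 1000))  # 50
```

Under this heuristic the kernel height `rows // scale2` stays roughly proportional to the page width, which is one possible reading of "derive a ratio from height and width".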

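Tuning strategies:

(1) This is a first version of the segmentation and there is plenty of room to tune it. For example: adjust the scale values dynamically, replace the many pop calls with append, and remove outliers before sorting and NMS. All of these speed things up.

(2) Object detection with yolo or frcnn would improve accuracy a great deal; see the yolo3 and frcnn repositories on GitHub. I am limited by the cost of data labeling and the lack of a GPU... which is frustrating.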
