PDF Parsing & Splitting for SAP Voucher Printing (Part 2)

Original

2020-09-06 13:02

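　　In the previous article I explained why PDF parsing and splitting is needed when batch-printing SAP vouchers. This article picks up where that one left off and focuses on how to parse and split PDF files efficiently with Python.

　　I used the third-party Python library PyPDF2, which handles PDF files with ease and supports reading, writing, splitting, merging, file conversion and more. In my tests PyPDF2 handled splitting and merging without trouble, but its text extraction only works well for English. When a PDF contains a lot of Chinese, the text PyPDF2 extracts is mostly garbled.

　　Helpful programmers on StackOverflow recommended pdfminer or tika-python. Unfortunately tika-python is implemented in Java under the hood and requires at least a Java 7 environment on the machine, so I ruled it out. I tried pdfminer as well as the widely recommended pdfplumber. The code below shows how to combine PyPDF2 and pdfplumber with regular expressions (re) to parse the PDF text, pull out the "SAP voucher number" (SAP憑證編號) and the "page number" (頁碼), and finally generate the new PDF files: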

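import re

import pdfplumber
# Legacy class names from PyPDF2 1.x/2.x; PyPDF2 3.x replaces them with PdfReader/PdfWriter.
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_dict = {}  # page index -> [voucher number, page code]

# Pass 1: read each page's text with pdfplumber and pull out the two fields.
with pdfplumber.open("test.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        print(i)
        contents = page.extract_text()
        # "SAP憑證編號" is the "SAP voucher number" label, followed by 10 digits.
        voucherCode = re.search(r"SAP憑證編號：([0-9]{10})", contents).group(1)
        # "頁碼" is the "page number" label, printed as "current/total".
        # Some vouchers span several pages, so naming files by voucher number
        # alone would produce duplicate names.
        pageCode = re.search(r"頁碼：(.*?)/", contents, re.S).group(1).strip().rjust(3, "0")
        pdf_dict[i] = [voucherCode, pageCode]

# Pass 2: write every page to its own PDF named <voucher>_<page>.pdf.
pdf = PdfFileReader("test.pdf")
for i in range(pdf.getNumPages()):
    pdf_writer = PdfFileWriter()
    pdf_writer.addPage(pdf.getPage(i))
    output = f"{pdf_dict[i][0]}_{pdf_dict[i][1]}.pdf"
    print(i, output)
    with open(output, "wb") as output_pdf:
        pdf_writer.write(output_pdf)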
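The regex step can also be pulled into a small function that is easy to test without any PDF at hand. This is only a sketch: the function name `parse_page_text` and the sample text are made up for illustration, and the page pattern is tightened to digits (`\d+`), which is an assumption about how the page label is printed. It returns `None` instead of raising when a page carries no voucher label.

```python
import re

# Label strings copied from the article's regexes; they must match the text
# that is actually printed on the voucher pages.
VOUCHER_RE = re.compile(r"SAP憑證編號：([0-9]{10})")
PAGE_RE = re.compile(r"頁碼：\s*(\d+)\s*/")  # assumption: printed as "current / total"


def parse_page_text(text):
    """Return (voucher_code, page_code), or None if either label is missing."""
    text = text or ""  # a text extractor may hand back None for an empty page
    voucher = VOUCHER_RE.search(text)
    page = PAGE_RE.search(text)
    if voucher is None or page is None:
        return None
    # Zero-pad the page number so file names sort correctly: 2 -> "002".
    return voucher.group(1), page.group(1).rjust(3, "0")


# Hypothetical page text, invented for this example.
sample = "SAP憑證編號：1000001234\n頁碼： 2 / 3\n"
print(parse_page_text(sample))    # ('1000001234', '002')
print(parse_page_text("空白頁"))   # None
```

Returning `None` lets the calling loop skip or log a page it cannot identify, rather than stopping at the first page without a voucher number.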

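In my tests, parsing one page of PDF content takes 0.8 to 1 second. That is fine for light use, and I am happy to recommend this approach. But when a PDF has hundreds or thousands of pages and there are several such files, parsing speed becomes a real concern.

To speed up text extraction further, I tried all sorts of Python PDF parsing libraries, and the effort eventually paid off: I found a solution I like, XpdfReader. Official site: https://www.xpdfreader.com/.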

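In my tests, its core product XpdfReader ships installers for all major operating systems and reads PDF files extremely fast, faster than the common Foxit PDF reader and Adobe Reader, although its feature set is comparatively simple. What I need here is the command-line tool it provides: pdftotext.exe. To read multiple languages we also need the corresponding language packages. For example, my xpdf folder sits next to the script and holds pdftotext.exe, the xpdfrc configuration file and the language packages.

Anyone interested can download the files from the official site. Once these are ready we can start extracting text; see the code example below: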

import glob
import os
import re
import subprocess
import time
import warnings
from os.path import join

from PyPDF2 import PdfFileReader, PdfFileWriter, PdfFileMerger

warnings.filterwarnings("ignore")  # silence the flood of PdfFileReader warnings; optional
start = time.perf_counter()
base_dir = os.path.dirname(os.path.abspath(__file__))
ef = join(base_dir, "xpdf/pdftotext.exe")
cfg = join(base_dir, "xpdf/xpdfrc")
results_dir = join(base_dir, "results")
os.makedirs(results_dir, exist_ok=True)  # os.rename below fails if the folder is missing
files = []
voucher_codes = []

# Step 1: split test.pdf into one temporary PDF per page.
# (The second positional argument of PdfFileReader is `strict`, not a file mode.)
pdf = PdfFileReader(join(base_dir, "test.pdf"))
for i in range(pdf.getNumPages()):
    pdf_writer = PdfFileWriter()
    pdf_writer.addPage(pdf.getPage(i))
    output = join(base_dir, f"result_{i+1}.pdf")
    print(i, output)
    with open(output, "wb") as output_pdf:
        pdf_writer.write(output_pdf)
    files.append(output)

def convert(file):
    # Every file argument in this command must be a full path, otherwise the call fails.
    bo = subprocess.check_output([ef, "-f", "1", "-l", "1", "-cfg", cfg, "-raw", file, "-"])
    return bo.decode("utf-8")

# Step 2: extract the text of each single-page PDF and rename it <voucher>_<page>.pdf.
for index, file in enumerate(files):
    print(index + 1)
    voucher_code = pageCode = None  # reset so a page never inherits the previous page's values
    for content in convert(file).splitlines():
        if "SAP憑證編號" in content:  # "SAP voucher number" label
            voucher_code = re.search(r"SAP憑證編號：([0-9]{10})", content).group(1)
            if voucher_code not in voucher_codes:
                voucher_codes.append(voucher_code)
        if "頁碼：" in content:  # "page number" label, printed as "current/total"
            pageCode = re.search(r"頁碼：(.*?)/", content).group(1).strip().rjust(3, "0")
    if voucher_code is None or pageCode is None:
        print(f"skipped {file}: voucher number or page number not found")
        continue
    os.rename(file, join(results_dir, f"{voucher_code}_{pageCode}.pdf"))
    print(voucher_code, pageCode)

# Step 3: merge the pages of each voucher into final_<voucher>.pdf.
openFiles = []
for voucher_code in voucher_codes:
    files = sorted(glob.glob(join(results_dir, f"{voucher_code}*.pdf")))
    pdf_merger = PdfFileMerger()
    for file in files:
        openFile = open(file, "rb")
        pdf_merger.append(openFile)
        openFiles.append(openFile)
    with open(join(results_dir, f"final_{voucher_code}.pdf"), "wb") as fout:
        pdf_merger.write(fout)

# Close every opened file before deleting; files left open cannot be removed with os.remove.
for openfile in openFiles:
    openfile.close()

# Step 4: delete the per-page files and keep only the merged final_*.pdf.
for file in sorted(glob.glob(join(results_dir, "*.pdf"))):
    if not os.path.basename(file).startswith("final_"):
        os.remove(file)

end = time.perf_counter()
totalTime = round(end - start, 2)
print(f"total time:{totalTime} seconds.")

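　　The core of this code is the custom function convert. It is very simple: it uses the subprocess module to run a command line, passing the arguments that pdftotext.exe expects. In my tests this method extracts PDF text extremely fast, at roughly 0.1 seconds per page.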
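To make the pdftotext arguments easier to read, the call can be split into a function that builds the argument list and one that runs it. This is a sketch rather than the author's code: `build_pdftotext_cmd` and `extract_page_text` are hypothetical names, the paths in the example are placeholders, and adding `-enc UTF-8` is my assumption so that the output encoding does not depend on what xpdfrc sets. Passing a real page number to `-f`/`-l` would also allow reading a page straight from the original PDF, which I have not timed.

```python
import subprocess


def build_pdftotext_cmd(exe, cfg, pdf_path, page):
    """Build the argument list that sends one page of text to stdout."""
    return [
        exe,
        "-f", str(page),   # first page to convert (1-based)
        "-l", str(page),   # last page to convert
        "-cfg", cfg,       # xpdfrc file that points at the language packages
        "-enc", "UTF-8",   # assumption: force UTF-8 so decoding is predictable
        "-raw",            # keep the text in content-stream order
        pdf_path,
        "-",               # "-" as the output file means stdout
    ]


def extract_page_text(exe, cfg, pdf_path, page):
    """Run pdftotext for a single page and return the decoded text."""
    raw = subprocess.check_output(build_pdftotext_cmd(exe, cfg, pdf_path, page))
    return raw.decode("utf-8", errors="replace")


# Placeholder paths for illustration; the article notes they must be full paths.
cmd = build_pdftotext_cmd(r"C:\work\xpdf\pdftotext.exe", r"C:\work\xpdf\xpdfrc",
                          r"C:\work\test.pdf", 3)
print(cmd)
```

Keeping the command as a list (rather than one shell string) avoids quoting problems when a path contains spaces.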

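　　One more point in this code deserves emphasis. When we use PdfFileMerger(), we have to open a large number of PDF files, and after the merge these open file objects are not closed automatically. That prevents us from deleting the files with remove (assuming we no longer need the per-page PDFs once they are merged). So I put every opened openFile into a pool, the openFiles list, call close() on all of them at the end, and only then remove the files.

　　If you have run into similarly slow PDF text parsing, give the method in this article a try. I think you will be surprised by how simple, direct and efficient it is.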
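An alternative to keeping a manual list of open files is `contextlib.ExitStack` from the standard library, which closes everything it opened when the `with` block ends, even if the merge raises. The sketch below is not the author's approach; `merge_then_delete` is a hypothetical helper, and the demo concatenates plain bytes as a stand-in for the PDF merge so that it runs without PyPDF2. With PyPDF2, the `merge` callback would append each handle to a PdfFileMerger and write the result.

```python
import os
import tempfile
from contextlib import ExitStack


def merge_then_delete(paths, merge):
    """Open every path, pass the open handles to `merge`, then close and delete."""
    with ExitStack() as stack:
        handles = [stack.enter_context(open(p, "rb")) for p in paths]
        result = merge(handles)
    # Leaving the with-block closed every handle, so the files can be removed.
    for p in paths:
        os.remove(p)
    return result


# Demo with throwaway files; byte concatenation stands in for the PDF merge.
tmp_dir = tempfile.mkdtemp()
demo_paths = []
for name, data in [("a.bin", b"page1"), ("b.bin", b"page2")]:
    path = os.path.join(tmp_dir, name)
    with open(path, "wb") as f:
        f.write(data)
    demo_paths.append(path)

merged = merge_then_delete(demo_paths, lambda hs: b"".join(h.read() for h in hs))
print(merged)               # b'page1page2'
print(os.listdir(tmp_dir))  # []
```

The merge has to finish writing its output inside the callback, because the source files are gone once the function returns.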
