Reading PDF documents with Python

Original

Ghost丶

2020-02-25 20:17

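This diagram shows the overall workflow. It looks a bit convoluted, so let's break it down step by step (learned from a MOOC course).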

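First, use open or urlopen to open a local or remote document and get a file object for it. Working from a local copy is the usual choice, because the documents can be large and fetching them puts a heavy load on the web server. The code below already includes a remote URL, left commented out.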

# Get the document file object
pdf0 = open('sampleFORtest.pdf', 'rb')
# pdf1 = urlopen('http://www.tencent.com/zh-cn/content/ir/an/2016/attachments/20160321.pdf')
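The PDF parser needs to seek within the file, while the response returned by urlopen is a forward-only stream. Below is a minimal sketch of one way to handle both cases. The helper name open_pdf_source is hypothetical and not part of pdfminer. It opens local paths directly and copies remote documents into an in-memory buffer.

```python
import io
from urllib.request import urlopen


def open_pdf_source(source):
    """Return a seekable binary file object for a local path or an http(s) URL.

    Hypothetical helper for illustration; the name is not part of pdfminer.
    """
    if source.startswith(('http://', 'https://')):
        # The HTTP response cannot seek, so copy it into memory first.
        with urlopen(source) as response:
            return io.BytesIO(response.read())
    # Local files opened in binary mode are already seekable.
    return open(source, 'rb')
```

With this helper, pdf0 = open_pdf_source('sampleFORtest.pdf') could replace the open call above, and a URL could be passed the same way. Reading the whole response into memory is only reasonable for documents of moderate size.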

Next, create a document parser and a PDF document object, and link them to each other.

# Create a parser bound to the file object
parser = PDFParser(pdf0)
# Create a PDF document object
doc = PDFDocument()
# Link the two together
parser.set_document(doc)
doc.set_parser(parser)

Initialize the PDF document object. If the document is encrypted, pass its password as the argument.

# Initialize the document (empty password for an unencrypted PDF)
doc.initialize('')

Then create a PDF resource manager and a layout parameter object (LAParams).

# Create the PDF resource manager
resources = PDFResourceManager()
# Create the layout analysis parameters
laparam = LAParams()

Next, create an aggregator that takes the resource manager and the layout parameters as arguments.

# Create an aggregator from the resource manager and the layout parameters
device = PDFPageAggregator(resources, laparams=laparam)

Finally, create a page interpreter, passing in the resource manager and the aggregator.

# Create a page interpreter
interpreter = PDFPageInterpreter(resources, device)

With this in place, the page interpreter can decode the PDF document and turn it into a form that Python can work with.

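To finish, call the get_pages() method of the PDF document object to read the pages of the document, and process them one at a time with the page interpreter. After each page, call the aggregator's get_result() method to obtain that page's layout, then call get_text() on each object in the layout that provides it to get the text of the page.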

for page in doc.get_pages():
    # Process the page with the page interpreter
    interpreter.process_page(page)
    # Fetch the page layout from the aggregator
    layout = device.get_result()
    for out in layout:
        if hasattr(out, 'get_text'):  # not every layout object is text
            pprint(out.get_text())

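Note that a PDF document contains more than text; it may also hold images and other objects. To avoid errors, first check whether the object has a get_text() method.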
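The hasattr check can be tried on its own, without a PDF at hand. The following is a small sketch that collects the text into a list instead of printing it. FakeTextBox, FakeFigure and collect_text are hypothetical names made up for this illustration, not pdfminer classes or functions.

```python
def collect_text(layout_objects):
    """Return the text of every object that exposes get_text()."""
    return [obj.get_text() for obj in layout_objects if hasattr(obj, 'get_text')]


class FakeTextBox:
    """Stand-in for a text object; not a pdfminer class."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


class FakeFigure:
    """Stand-in for a non-text object such as an image; it has no get_text()."""


page = [FakeTextBox('Abstract\n'), FakeFigure(), FakeTextBox('good results.\n')]
print(collect_text(page))  # ['Abstract\n', 'good results.\n']
```

The figure is skipped silently, which is the same behaviour the loop above relies on when it walks a real page layout.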

Complete code

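# encoding:utf-8
'''
@author:
@time:
'''
# Note: these imports and the PDFDocument() / set_parser() / initialize() calls
# follow the pdfminer3k API; pdfminer.six organizes these classes differently.
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pprint import pprint
from urllib.request import urlopen  # only needed for the commented-out URL variant

# Get the document file object
pdf0 = open('sampleFORtest.pdf', 'rb')
# pdf1 = urlopen('http://www.tencent.com/zh-cn/content/ir/an/2016/attachments/20160321.pdf')
# Create a parser bound to the file object
parser = PDFParser(pdf0)
# Create a PDF document object
doc = PDFDocument()
# Link the two together
parser.set_document(doc)
doc.set_parser(parser)
# Initialize the document (empty password for an unencrypted PDF)
doc.initialize('')
# Create the PDF resource manager
resources = PDFResourceManager()
# Create the layout analysis parameters
laparam = LAParams()
# Create an aggregator from the resource manager and the layout parameters
device = PDFPageAggregator(resources, laparams=laparam)
# Create a page interpreter
interpreter = PDFPageInterpreter(resources, device)
# Iterate over the pages of the document
for page in doc.get_pages():
    # Process the page with the page interpreter
    interpreter.process_page(page)
    # Fetch the page layout from the aggregator
    layout = device.get_result()
    for out in layout:
        if hasattr(out, 'get_text'):  # not every layout object is text
            pprint(out.get_text())
# Close the file once all pages have been read
pdf0.close()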

The sample file used here is the officially provided one.

Output:

'Preemptive Information Extraction using Unrestricted Relation Discovery\n'
'Yusuke Shinyama\n'
'Satoshi Sekine\n'
'New York University\n715, Broadway, 7th Floor\nNew York, NY, 10003\n'
'{yusuke,sekine}@cs.nyu.edu\n'
'Abstract\n'
('We are trying to extend the boundary of\n'
'Information Extraction (IE) systems. Ex-\n'
'isting IE systems require a lot of time and\n'
'human effort to tune for a new scenario.\n'
'Preemptive Information Extraction is an\n'
'attempt to automatically create all feasible\n'
'IE systems in advance without human in-\n'
'tervention. We propose a technique called\n'
'Unrestricted Relation Discovery that dis-\n'
'covers all possible relations from texts and\n'
'presents them as tables. We present a pre-\n'
'liminary system that obtains reasonably\n'
'good results.\n')
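The extracted text keeps the line breaks of the original page, including words hyphenated across lines such as "Ex-\nisting". Below is a minimal sketch of one possible clean-up step using only the standard library. The function name unwrap_text is hypothetical, and the rule assumes that a hyphen at the end of a line always marks a split word, which will wrongly merge genuinely hyphenated terms.

```python
import re


def unwrap_text(block):
    """Join a hard-wrapped text block into a single line."""
    # Rejoin words split by a hyphen at the end of a line ("Ex-\nisting").
    joined = re.sub(r'(\w)-\n(\w)', r'\1\2', block)
    # Turn the remaining line breaks into spaces and trim the ends.
    return re.sub(r'\s*\n\s*', ' ', joined).strip()


sample = ('Information Extraction (IE) systems. Ex-\n'
          'isting IE systems require a lot of time and\n'
          'human effort to tune for a new scenario.\n')
print(unwrap_text(sample))
```

In the loop above, this could be applied to each out.get_text() result before printing.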
