pdfminer3k: a PDF processing tool for Python 3

Using pdfminer3k

Processing PDFs with Python is a common task, and pdfminer3k is a very good tool for extracting their text.

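First, create a pip directory under your user profile directory, so that you have C:\Users\Administrator\pip, then create a text file named pip.ini inside it with the following content: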

[global]
index-url=http://mirrors.aliyun.com/pypi/simple/
[install]
trusted-host=mirrors.aliyun.com

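# It is best to install through a mirror hosted in China, such as Aliyun or the university mirrors (Peking, Tsinghua). The configuration above uses the Aliyun mirror, and every install has gone smoothly for me. Thanks, Aliyun!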
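For readers who would rather script this step, here is a minimal sketch that writes the same pip.ini using Python's standard configparser module. The function name write_pip_config and its parameters are hypothetical, introduced only for illustration; the section names and values are the ones shown above.

```python
import configparser
from pathlib import Path


def write_pip_config(pip_dir, index_url, trusted_host):
    """Write a pip.ini with the given mirror settings and return its path."""
    config = configparser.ConfigParser()
    config["global"] = {"index-url": index_url}
    config["install"] = {"trusted-host": trusted_host}

    pip_dir = Path(pip_dir)
    pip_dir.mkdir(parents=True, exist_ok=True)  # create the pip directory if missing
    ini_path = pip_dir / "pip.ini"
    with open(ini_path, "w", encoding="utf-8") as fh:
        config.write(fh)
    return ini_path
```

On the machine described above, it could be called as write_pip_config(r"C:\Users\Administrator\pip", "http://mirrors.aliyun.com/pypi/simple/", "mirrors.aliyun.com").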

Install: pip install pdfminer3k

First, a general-purpose script that reads the text out of a PDF:

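My goal was to pull a few key pieces of information out of a PDF, so I needed something those pieces had in common. Luckily, every line holding key information contains '//', so finding the lines that contain '//' is enough. That is what the script is built around.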
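As a small illustration of that idea, the sketch below applies the same two split expressions the script uses to a single line. The sample line is made up for this example, since the real PDF's contents are not shown here, and the helper name extract_entry is hypothetical. The final strip() is an addition; the script itself keeps the trailing space.

```python
def extract_entry(line):
    """Return the text before '//' with any leading 'N. ' numbering removed.

    Returns None when the line has no '//' marker.
    """
    if '//' not in line:
        return None
    before_marker = line.split('//')[0]    # keep only what precedes '//'
    entry = before_marker.split('. ')[-1]  # drop a leading "12. " style prefix
    return entry.strip()                   # addition: trim surrounding whitespace


# Hypothetical sample line; the real PDF's format may differ.
sample = "12. abandon // to leave behind"
print(extract_entry(sample))
```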

The script can be used as is. Here it is:

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf


def read_pdf(pdf):
    # Resource manager shared by the interpreter and the output device
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    # Device that converts each page to text and writes it into retstr
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdf)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    # Split the extracted text into lines
    lines = str(content).split("\n")

    # Units to keep, counted in order of appearance
    units = [1, 2, 3, 5, 7, 8, 9, 11, 12, 13]
    # \x0c is the form feed emitted at a page break, so this matches
    # a UNIT heading at the top of a page
    header = '\x0cUNIT '
    # print(lines[0:100])
    count = 0
    flag = False  # True while inside one of the wanted units
    with open('words.txt', 'w+') as text:
        for line in lines:
            if line.startswith(header):
                flag = False
                count += 1
                if count in units:
                    flag = True
                    print(line)
                    text.write(line + '\n')
            if '//' in line and flag:
                # Keep the part before '//' and drop any leading numbering
                text_line = line.split('//')[0].split('. ')[-1]
                print(text_line)
                text.write(text_line + '\n')


def _main():
    with open('t1.pdf', 'rb') as my_pdf:
        read_pdf(my_pdf)


if __name__ == '__main__':
    _main()

In fact, the line lines = str(content).split("\n") is the key part: print lines and you can see the text content of the PDF.

From there, processing a PDF file comes down to ordinary string processing.
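To make that point concrete, here is a minimal sketch of the same unit-filtering logic written as a pure function over a list of strings, so it can be tried without a PDF or pdfminer3k. The function name select_entries and the sample lines are assumptions for illustration; the header string and the '//' marker are taken from the script.

```python
def select_entries(lines, wanted_units, header='\x0cUNIT '):
    """Collect header lines and '//' entries from the wanted units only.

    Units are counted in order of appearance: the first line starting with
    `header` is unit 1, the next is unit 2, and so on.
    """
    selected = []
    unit_number = 0
    in_wanted_unit = False
    for line in lines:
        if line.startswith(header):
            unit_number += 1
            in_wanted_unit = unit_number in wanted_units
            if in_wanted_unit:
                selected.append(line)
        if '//' in line and in_wanted_unit:
            # Same extraction as the script: text before '//', numbering dropped
            selected.append(line.split('//')[0].split('. ')[-1])
    return selected


# Made-up lines standing in for text extracted from a PDF.
sample_lines = [
    '\x0cUNIT 1',
    '1. alpha // first entry',
    '\x0cUNIT 2',
    '1. beta // skipped because unit 2 is not wanted',
    '\x0cUNIT 3',
    '1. gamma // third entry',
]
print(select_entries(sample_lines, wanted_units={1, 3}))
```

Keeping the filter separate from the PDF extraction means the selection rules can be checked on a handful of strings before running them against a real file.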
