Ubuntu PDF OCR Tool: OCRmyPDF

Original

usmile

2021-06-04 13:18

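Introduction

OCRmyPDF is an open-source OCR tool that recognizes the text in scanned PDFs so that they can be searched, copied from, and so on.

OCR (Optical Character Recognition) is the process of analyzing and recognizing image files of text documents to extract their text and layout information.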

github: https://github.com/jbarlow83/OCRmyPDF

docs: https://ocrmypdf.readthedocs.io/en/latest/

Installation

Install ocrmypdf and its dependencies

sudo apt-get -y remove ocrmypdf # remove any older ocrmypdf; the author's machine had none installed
sudo apt-get -y update
sudo apt-get -y install \
    ghostscript \
    icc-profiles-free \
    liblept5 \
    libxml2 \
    pngquant \
    python3-cffi \
    python3-distutils \
    python3-pkg-resources \
    python3-reportlab \
    qpdf \
    tesseract-ocr \
    zlib1g \
    unpaper
    
wget https://bootstrap.pypa.io/get-pip.py && python3 get-pip.py

export PATH=$HOME/.local/bin:$PATH
python3 -m pip install --user ocrmypdf
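
After the install, it can help to confirm that the commands are actually reachable on PATH, since `pip install --user` places `ocrmypdf` in `$HOME/.local/bin`. The following is a minimal sketch; the helper name `missing_tools` is hypothetical and not part of OCRmyPDF.

```shell
# missing_tools: print each named command that is not found on PATH.
# (Hypothetical helper for illustration; not part of OCRmyPDF.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the main executables installed in the steps above.
missing_tools ocrmypdf tesseract gs qpdf pngquant unpaper
```

No output means every listed command was found; any name printed still needs to be installed or added to PATH.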

Install the JBIG2 encoder

git clone https://github.com/agl/jbig2enc
cd jbig2enc
./autogen.sh
./configure && make
sudo make install # sudo is only needed when installing system-wide

Problems encountered

Running ./autogen.sh

./autogen.sh: line 45: aclocal: command not found
```
sudo apt-get install automake
```
./autogen.sh: line 50: libtoolize: command not found
./autogen.sh: line 50: glibtoolize: command not found
```
sudo apt install libtool
```

Running ./configure && make

Error! Leptonica not detected.
```
sudo apt install libleptonica-dev
```
https://github.com/tesseract-ocr/tesseract/issues/215#issuecomment-369339789
Error! zlib not detected.
```
sudo apt install zlib1g-dev
```
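
Each of the four errors above maps to one Ubuntu package, so the build prerequisites can be installed in a single step before running `./autogen.sh`. The following is a minimal sketch that only collects the package names already listed above; the variable name `JBIG2_BUILD_DEPS` is made up for illustration.

```shell
# Build prerequisites for jbig2enc, taken from the fixes above:
#   automake          -> provides aclocal
#   libtool           -> provides libtoolize
#   libleptonica-dev  -> Leptonica development files
#   zlib1g-dev        -> zlib development files
JBIG2_BUILD_DEPS="automake libtool libleptonica-dev zlib1g-dev"

# Print the install command; run the printed line to actually install.
echo "sudo apt-get install -y $JBIG2_BUILD_DEPS"
```

Printing the command instead of running it keeps the snippet safe to paste; the printed line can be copied into the terminal as-is.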

Install language packs

Simplified Chinese

sudo apt install tesseract-ocr-chi-sim
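
To confirm that the language pack is visible to Tesseract, the output of `tesseract --list-langs` can be searched for `chi_sim`. The following is a minimal sketch; `has_lang` is a hypothetical helper that reads the language list from standard input, so it makes no assumption about where Tesseract is installed.

```shell
# has_lang: succeed if the language code given as $1 appears as a whole
# line in the list read from standard input.
# (Hypothetical helper for illustration.)
has_lang() {
    grep -qxF -- "$1"
}

# Usage, once tesseract is installed (2>&1 in case the list is printed
# on stderr rather than stdout):
#   tesseract --list-langs 2>&1 | has_lang chi_sim && echo "chi_sim is installed"
```

Matching the whole line keeps `chi_sim` from being confused with similarly named packs such as `chi_sim_vert`.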

Command

ocrmypdf -l chi_sim --output-type pdf [source.pdf] [ocr.pdf]

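ocrmypdf: the tool's command.

-l chi_sim: sets the OCR language to Simplified Chinese.

--output-type pdf: produces a standard PDF instead of the default PDF/A.

source.pdf: the name of the document to process.

ocr.pdf: the name of the document generated after processing.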

Example

ocrmypdf -l chi_sim --output-type pdf 正則表達式必知必會\(修訂版\).pdf 正則表達式必知必會\(修訂版\)-ocr.pdf --force-ocr
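
The example follows a naming pattern: the output is the input name with `-ocr` inserted before `.pdf`. That name can be generated rather than typed, and quoting the variable avoids having to backslash-escape the parentheses. The following is a minimal sketch; `ocr_name` is a hypothetical helper, not an ocrmypdf feature.

```shell
# ocr_name: turn "book.pdf" into "book-ocr.pdf".
# (Hypothetical helper for illustration.)
ocr_name() {
    printf '%s-ocr.pdf\n' "${1%.pdf}"
}

# Usage, once ocrmypdf is installed (same options as the example above):
#   src='正則表達式必知必會(修訂版).pdf'
#   ocrmypdf -l chi_sim --output-type pdf --force-ocr "$src" "$(ocr_name "$src")"
```

Note that `--force-ocr` makes ocrmypdf rasterize and re-OCR every page, including pages that already contain text.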
