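Installing Parsing Libraries

Background

After fetching a page's source, the next step is to extract information from it.

There are many ways to do this. Regular expressions work, but they are tedious to write. A dedicated parsing library is another option.

These libraries also offer very powerful parsing methods, such as XPath parsing and CSS selector parsing.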
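
To make the contrast concrete, here is a minimal sketch that pulls the same link texts out of a small snippet twice: once with a regular expression and once with an XPath-style query. The sample markup and the function names are made up for illustration. The standard library's xml.etree.ElementTree, which supports only a limited XPath subset and needs well-formed markup, stands in for a full parsing library so the sketch runs before anything is installed.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup used only for illustration.
HTML = """<html><body>
<ul id="books">
  <li class="item"><a href="/a">Alpha</a></li>
  <li class="item"><a href="/b">Beta</a></li>
</ul>
</body></html>"""


def titles_with_regex(markup):
    # The pattern must spell out the exact tag layout and breaks if it changes.
    pattern = r'<li class="item"><a href="[^"]*">([^<]+)</a></li>'
    return re.findall(pattern, markup)


def titles_with_xpath(markup):
    # ElementTree understands a limited XPath subset, enough for this query.
    root = ET.fromstring(markup)
    return [a.text for a in root.findall(".//li[@class='item']/a")]


if __name__ == "__main__":
    print(titles_with_regex(HTML))
    print(titles_with_xpath(HTML))
```

Both functions return the same two titles for this snippet, but only the regular expression depends on the exact tag layout. Real-world HTML is rarely well-formed, which is where a tolerant parser such as lxml becomes useful.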

Environment

[root@localhost Python-3.6.6]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.4 (Maipo)
[root@localhost Python-3.6.6]# uname -a
Linux localhost.localdomain 3.10.0-693.el7.x86_64 #1 SMP Thu Jul 6 19:56:57 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost Python-3.6.6]# getenforce 
Disabled
[root@localhost Python-3.6.6]# systemctl status firewalld.service 
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:firewalld(1)
[root@localhost Python-3.6.6]#

Installing lxml

pip3 install lxml

Installing Beautiful Soup

pip3 install beautifulsoup4

Installing pyquery

pip3 install pyquery
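
A quick way to confirm that the three installs succeeded is to ask Python whether it can locate each module. The following is a minimal sketch using only the standard library; the PARSERS mapping and the check_installed name are illustrative assumptions, not part of any of these libraries.

```python
import importlib.util

# pip package name -> import name. Note that beautifulsoup4 is imported as bs4.
PARSERS = {"lxml": "lxml", "beautifulsoup4": "bs4", "pyquery": "pyquery"}


def check_installed(packages=PARSERS):
    """Map each pip package name to True if its module can be found."""
    return {
        pip_name: importlib.util.find_spec(module) is not None
        for pip_name, module in packages.items()
    }


if __name__ == "__main__":
    for name, ok in check_installed().items():
        print(f"{name}: {'installed' if ok else 'missing'}")
```

Any package reported as missing can be installed again with the matching pip3 command.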

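Installing tesserocr

Crawlers frequently run into CAPTCHAs. In those cases, OCR can be used to recognize them directly.

tesserocr is an OCR library for Python. It is a Python wrapper around the tesseract API, so its core is tesseract, which must be installed first.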

yum install -y tesseract

[root@localhost bin]# tesseract --list-langs  # list the supported languages
List of available languages (1):
eng
# As shown above, only English can be recognized. To recognize other languages, install the language packs.
yum install -y tesseract-langpack*
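
The same language check can be scripted, for example to confirm that the language packs were picked up. This is a rough sketch that shells out to the tesseract command shown above and parses its output. The helper names are hypothetical, and the parsing assumes the output format printed above (a header line followed by one language code per line).

```python
import shutil
import subprocess


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines()]
    # Skip blank lines and the "List of available languages (N):" header.
    return [ln for ln in lines if ln and not ln.lower().startswith("list of")]


def installed_langs():
    """Return tesseract's language codes, or [] if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return []
    proc = subprocess.run(
        ["tesseract", "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merged in case the list is written to stderr
        universal_newlines=True,   # decode output as text (works on Python 3.6)
    )
    return parse_langs(proc.stdout)


if __name__ == "__main__":
    print(installed_langs())
```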

# Install Cython; tesserocr requires Cython>=0.23
pip3 install Cython

# Install tesserocr
pip3 install tesserocr pillow

# Test
# Find a CAPTCHA image online and save it locally.
tesseract timg.jpg result -l eng && cat result.txt
# The test above runs from the shell. Next, test with Python's tesserocr library:
>>> import tesserocr
>>> from PIL import Image
>>> image = Image.open('timg.jpg')
>>> print(tesserocr.image_to_text(image))
7364
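
Raw OCR output often carries stray whitespace or punctuation, and noisy CAPTCHAs tend to be recognized more reliably after binarization. The sketch below wraps the session above in a small helper under those assumptions. The function names and the threshold of 127 are illustrative choices to tune per CAPTCHA style, not part of tesserocr.

```python
import re


def clean_captcha_text(raw):
    """Drop whitespace and any non-alphanumeric characters from OCR output."""
    return re.sub(r"[^0-9A-Za-z]", "", raw)


def read_captcha(path, threshold=127):
    """Binarize the image, run OCR on it, and return the cleaned text.

    The threshold is an assumed starting point, not a recommended value.
    """
    # Imported lazily so clean_captcha_text works without the OCR stack.
    import tesserocr
    from PIL import Image

    image = Image.open(path).convert("L")  # convert to grayscale
    # Pixels brighter than the threshold become white, the rest black.
    image = image.point(lambda p: 255 if p > threshold else 0)
    return clean_captcha_text(tesserocr.image_to_text(image))


if __name__ == "__main__":
    # The cleaning step alone, on a sample string with stray spaces and newlines.
    print(clean_captcha_text(" 73 64\n\n"))
```

With the image from the test above, the call would be read_captcha('timg.jpg').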
