Python: Detecting the Language of Text with langdetect and langid

Original

SpiderLiH

2020-07-03 04:17

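The requirement:

A few days ago, work required me to identify the language of a crawled corpus, so I looked up some material online. Here I introduce two language detection tools: langdetect and langid.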

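1. The langid module

Here is the official documentation, for anyone who wants to dig deeper: https://github.com/saffsd/langid.py

Installation is simple. Just enter the following command at the DOS command line: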

pip install langid

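The langid module provides one important function, langid.classify().
It returns a tuple: the first element is the language code, and the second is a score for that language. By default this score is an unnormalized log-probability rather than a probability between 0 and 1, which is why the values in the output are negative.

The code is as follows: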

import langid

language1 = "今天是2019.11.20號，距離過年還有3個月。加油，加油！！！"
language2 = 'Thanks for his honesty and courage, the truth will not be covered by lies.'
language3 = "Temuan-temuan awal ini masih perlu untuk dikonfirmasi oleh penelitian lebih lanjut"

print(langid.classify(language1))
print(langid.classify(language2))
print(langid.classify(language3))

The output is as follows:

('zh', -259.3397614955902)  # zh is Chinese
('en', -192.87218618392944)  # en is English
('id', -95.6275782585144)  # id is Indonesian

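Note: in my own tests the detection accuracy was acceptable, but it was too slow. The language codes in the output follow the ISO 639-1 standard.
For details, see the Baidu Baike entry on ISO 639-1.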
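
The raw scores shown above are hard to threshold, since they are not on a 0 to 1 scale. As a minimal sketch, the langid.py README describes building a LanguageIdentifier with norm_probs=True to get normalized probabilities, and set_languages() to restrict the candidate set. The helper names build_identifier and classify_or_none and the 0.9 threshold are assumptions for illustration, not part of the original test.

```python
def build_identifier(languages=None):
    """Return a langid identifier whose scores are normalized to [0, 1]."""
    # Imported lazily so classify_or_none below works without langid installed.
    from langid.langid import LanguageIdentifier, model

    identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
    if languages:
        # Restrict the candidate languages, e.g. ["zh", "en", "id"].
        identifier.set_languages(languages)
    return identifier


def classify_or_none(classify, text, min_prob=0.9):
    """Return the language code, or None when the probability is below min_prob.

    `classify` is any callable returning (code, probability), for example
    build_identifier().classify. The 0.9 default is an arbitrary assumption.
    """
    lang, prob = classify(text)
    return lang if prob >= min_prob else None
```

A call such as classify_or_none(build_identifier(["zh", "en", "id"]).classify, language2) returns the detected code only when the normalized probability clears the threshold, and None otherwise.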

2. The langdetect module

Installation is also simple. Enter the following command in a DOS window:

pip install langdetect

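The langdetect module provides two important functions.

One is langdetect.detect(), which returns the detected language of the text.
The other is langdetect.detect_langs(), which returns every candidate language together with its probability.

The code is as follows: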

import langdetect
language1 = "今天是2019.11.20號，距離過年還有3個月。加油，加油！！！"
language2 = 'Thanks for his honesty and courage, the truth will not be covered by lies.'
language3 = "Hello，world。Python 生命之旅！！"

print(langdetect.detect(language1))
print(langdetect.detect(language2))
print(langdetect.detect(language3))
print(langdetect.detect_langs(language1))
print(langdetect.detect_langs(language3))

The output is as follows:

zh-cn  # Chinese
en  # English
en  # English
[zh-cn:0.9999985317701515]
[en:0.8571398609764227, cy:0.14285913702201758]  # here you can see that the accuracy is not very high
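
The probabilities above can differ between runs, because langdetect is non-deterministic unless a seed is set, and it raises an exception on text with no usable features (an empty string, for example). The sketch below follows the langdetect README in fixing DetectorFactory.seed, then converts the result to plain tuples. The function names safe_detect_langs and best_language, the seed value, and the 0.9 threshold are assumptions for illustration.

```python
def safe_detect_langs(text, seed=0):
    """Return [(code, probability), ...], or [] when detection fails."""
    # Imported lazily so best_language below works without langdetect installed.
    from langdetect import DetectorFactory, detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    # A fixed seed makes repeated runs give the same result.
    DetectorFactory.seed = seed
    try:
        return [(c.lang, c.prob) for c in detect_langs(text)]
    except LangDetectException:
        # Raised when the text has no features to analyze.
        return []


def best_language(candidates, min_prob=0.9):
    """Return the most probable code, or None if it is below min_prob."""
    if not candidates:
        return None
    lang, prob = max(candidates, key=lambda pair: pair[1])
    return lang if prob >= min_prob else None
```

With the mixed-language result shown above, best_language([("en", 0.857), ("cy", 0.143)]) returns None, which flags the text as uncertain instead of silently labeling it English.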

Note: this module's functions are quite fast, but some people online say its accuracy is not good enough.
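
The speed impressions in both notes are easy to check against your own corpus. A minimal sketch of a timing helper follows. time_detector is a hypothetical name, and the numbers it reports depend entirely on the machine and the texts used. It reports the best of several passes because langid loads its model on the first call, which would otherwise skew a single pass.

```python
import time


def time_detector(detect, texts, repeat=3):
    """Return the best average seconds per text over `repeat` passes."""
    if not texts:
        raise ValueError("texts must not be empty")
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for text in texts:
            detect(text)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / len(texts))
    return best


# Example (requires both libraries to be installed):
# import langid, langdetect
# texts = [language1, language2]
# print("langid:    ", time_detector(langid.classify, texts))
# print("langdetect:", time_detector(langdetect.detect, texts))
```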
