An Open-Source Entity Linking Tool: Dexter2

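Entity Linking

In natural language processing, entity linking, also called named entity linking (NEL), named entity disambiguation (NED), named entity recognition and disambiguation (NERD), or named entity normalization (NEN), is the task of determining the identity of an entity mentioned in text. For example, given the sentence "Paris is the capital of France", entity linking should determine that "Paris" refers to the city of Paris rather than Paris Hilton or any other entity that could be called "Paris". Likewise, for the sentence "James Bond is cool", we expect to get back "James_Bond", the complete linked name.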

Dexter2

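Dexter is an open-source entity linking framework that uses entries from the English Wikipedia to link entities.

Download

Dexter on GitHub
Both prebuilt binaries and the source code are available there; this post uses the prebuilt binary directly.
On Windows, run the following in the directory where you unpacked the archive: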

java -Xmx4000m -jar dexter-2.1.0.jar

Or on Linux:

wget http://hpc.isti.cnr.it/~ceccarelli/dexter2.tar.gz
tar -xvzf dexter2.tar.gz
cd dexter2
java -Xmx4000m -jar dexter-2.1.0.jar

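This opens port 8080 on the local machine. On Windows, or on a Linux machine with a desktop environment, open http://localhost:8080/dexter-webapp/dev/ in a browser to browse the API. If Dexter runs on a server, fetch the results with HTTP requests from Python instead (see below).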
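
Before sending real queries from a script, it can help to confirm that the server has finished starting. The following is a minimal sketch, assuming the dev page URL above answers with HTTP 200 once Dexter is ready. The function name is_dexter_up and its opener parameter are illustrative and not part of Dexter.

```python
import urllib.error
import urllib.request

# URL of the dev page mentioned above; adjust host and port for a remote server.
DEV_PAGE = "http://localhost:8080/dexter-webapp/dev/"


def is_dexter_up(url=DEV_PAGE, timeout=2.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with HTTP 200, False if it cannot be reached.

    `opener` is injectable (hypothetical parameter) so the check can be
    exercised without a running server.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("Dexter reachable:", is_dexter_up())
```

Loading the index can take a while after the java command starts, so a script may want to call this in a loop with a short sleep before giving up.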

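Usage

For the full API, see the local documentation page or the official site; both include runnable examples. This post walks through a few of them.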

1. annotate, spot

  • annotate
    Performs the entity linking on a given text, annotating maximum n entities.

  • spot
    It only performs the first step of the entity linking process, i.e., find all the mentions that could refer to an entity

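Both operate on the words of a query. The difference is that spot only finds the mentions and lists the candidate entities for each, while annotate goes on to link them and returns at most n of the most relevant links. Use whichever fits your needs.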

For example, suppose we want the linked entities in the following sentence:

Bob Dylan and Johnny Cash had formed a mutual admiration society even before they met in the early 1960s

You can try it simply by entering the URL in a browser. Here the minimum linking confidence is set to 0.5:

http://localhost:8080/dexter-webapp/api/rest/annotate?text=Bob%20Dylan%20and%20Johnny%20Cash%20had%20formed%20a%20mutual%20admiration%20society%20even%20before%20they%20met%20in%20the%20early%201960s&n=50&wn=false&debug=false&format=text&min-conf=0.5

This returns the annotate result:

"value": "<a href=\"#\" onmouseover='manage(4637590)' >Bob Dylan</a> and <a href=\"#\" onmouseover='manage(11983070)' >Johnny Cash</a> had formed a mutual admiration society even before they met in the early 1960s"

The annotate response also includes the spots:

"spots": [
    {
      "mention": "johnny cash",
      "linkProbability": 1,
      "start": 14,
      "end": 25,
      "linkFrequency": 2558,
      "documentFrequency": 1932,
      "entity": 11983070,
      "field": "body",
      "entityFrequency": 2540,
      "commonness": 0.9929632525410477,
      "score": 0.9929632525410477
    },
    {
      "mention": "bob dylan",
      "linkProbability": 1,
      "start": 0,
      "end": 9,
      "linkFrequency": 5588,
      "documentFrequency": 4275,
      "entity": 4637590,
      "field": "body",
      "entityFrequency": 5547,
      "commonness": 0.9926628489620616,
      "score": 0.9926628489620616
    }
  ]

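As you can see, Dexter linked two entities, bob dylan and johnny cash, each with a confidence above 0.5, and returned the id of each. These ids can be used for further operations, which are covered later in this post.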
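
The annotate response carries the links twice: as anchors inside the "value" string and as structured entries in "spots". The sketch below reads both. The anchor pattern is inferred from the sample response above, so treat it as an assumption rather than a documented format. The helper names links_from_value and links_from_spots are made up for illustration.

```python
import re

# Anchor shape inferred from the sample "value" string above; it is an
# assumption that other responses follow the same pattern.
ANCHOR = re.compile(r"<a [^>]*manage\((\d+)\)[^>]*>(.*?)</a>")


def links_from_value(value):
    """Return (mention, entity_id) pairs found in the annotated HTML string."""
    return [(mention, int(entity_id)) for entity_id, mention in ANCHOR.findall(value)]


def links_from_spots(text, spots):
    """Return (surface form, entity_id, score) triples in reading order.

    The start/end offsets index into the original text, so slicing recovers
    the original casing that the lowercased "mention" field loses.
    """
    ordered = sorted(spots, key=lambda spot: spot["start"])
    return [(text[s["start"]:s["end"]], s["entity"], s["score"]) for s in ordered]


text = ("Bob Dylan and Johnny Cash had formed a mutual admiration society "
        "even before they met in the early 1960s")
value = ("<a href=\"#\" onmouseover='manage(4637590)' >Bob Dylan</a> and "
         "<a href=\"#\" onmouseover='manage(11983070)' >Johnny Cash</a> had formed "
         "a mutual admiration society even before they met in the early 1960s")
spots = [
    {"mention": "johnny cash", "start": 14, "end": 25,
     "entity": 11983070, "score": 0.9929632525410477},
    {"mention": "bob dylan", "start": 0, "end": 9,
     "entity": 4637590, "score": 0.9926628489620616},
]

print(links_from_value(value))
print(links_from_spots(text, spots))
```

Reading "spots" is the more robust of the two, since it does not depend on the HTML markup.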

If we call the spot API instead:

http://localhost:8080/dexter-webapp/api/rest/spot?text=Bob%20Dylan%20and%20Johnny%20Cash%20had%20formed%20a%20mutual%20admiration%20society%20even%20before%20they%20met%20in%20the%20early%201960s&wn=false&debug=false&format=text

we get the following result:

"spots": [
    {
      "mention": "mutual admiration society",
      "linkProbability": 1,
      "field": "body",
      "start": 39,
      "end": 64,
      "linkFrequency": 33,
      "documentFrequency": 31,
      "candidates": [
        {
          "entity": 2319591,
          "freq": 13,
          "commonness": 0.3939393939393939
        },
        {
          "entity": 2648616,
          "freq": 9,
          "commonness": 0.2727272727272727
        },
        {
          "entity": 2319544,
          "freq": 6,
          "commonness": 0.18181818181818182
        },
        {
          "entity": 3001631,
          "freq": 4,
          "commonness": 0.12121212121212122
        },
        {
          "entity": 32742,
          "freq": 1,
          "commonness": 0.030303030303030304
        }
      ]
    },
    {
      "mention": "johnny cash",
      "linkProbability": 1,
      "field": "body",
      "start": 14,
      "end": 25,
      "linkFrequency": 2558,
      "documentFrequency": 1932,
      "candidates": [
        {
          "entity": 11983070,
          "freq": 2540,
          "commonness": 0.9929632525410477
        },
        {
          "entity": 12326526,
          "freq": 14,
          "commonness": 0.00547302580140735
        }
      ]
    },
    {
      "mention": "bob dylan",
      "linkProbability": 1,
      "field": "body",
      "start": 0,
      "end": 9,
      "linkFrequency": 5588,
      "documentFrequency": 4275,
      "candidates": [
        {
          "entity": 4637590,
          "freq": 5547,
          "commonness": 0.9926628489620616
        },
        {
          "entity": 438899,
          "freq": 35,
          "commonness": 0.006263421617752327
        }
      ]
    }
  ],
  "nSpots": 3,
  "querytime": 264

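So Dexter actually found not only bob dylan and johnny cash but also mutual admiration society. That mention, however, maps to several Wikipedia entries, such as Mutual_Admiration_Society_(song), Mutual_Admiration_Society_(album), Mutual_Admiration_Society_(collaboration), and Mutual_Admiration_Society_–Joe_Locke&_David_Hazeltine_Quartet.
A human reader can tell at a glance that this mutual admiration society should be a song or an album, which suggests that Dexter's algorithm here is context-free: it does not look at the surrounding text. In other words, Dexter really only provides the linking interface; resolving this kind of ambiguity requires other tools.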
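
In the spot output above, each candidate's commonness equals its freq divided by the mention's linkFrequency (for example 13 / 33 ≈ 0.3939). That relationship is read off the sample numbers, not taken from Dexter's documentation. Under that assumption, a simple context-free baseline is to pick the candidate with the highest commonness for each mention, as in this sketch (top_candidates is an illustrative name):

```python
def top_candidates(spots):
    """Map each mention to the entity id of its most common candidate."""
    best = {}
    for spot in spots:
        candidates = spot.get("candidates", [])
        if not candidates:
            continue  # nothing to choose from for this mention
        top = max(candidates, key=lambda c: c["commonness"])
        best[spot["mention"]] = top["entity"]
    return best


# Trimmed copy of the spot response shown above.
spots = [
    {"mention": "mutual admiration society", "linkFrequency": 33, "candidates": [
        {"entity": 2319591, "freq": 13, "commonness": 0.3939393939393939},
        {"entity": 2648616, "freq": 9, "commonness": 0.2727272727272727},
    ]},
    {"mention": "johnny cash", "linkFrequency": 2558, "candidates": [
        {"entity": 11983070, "freq": 2540, "commonness": 0.9929632525410477},
        {"entity": 12326526, "freq": 14, "commonness": 0.00547302580140735},
    ]},
]

# Check the inferred relationship commonness = freq / linkFrequency.
for spot in spots:
    for cand in spot["candidates"]:
        assert abs(cand["freq"] / spot["linkFrequency"] - cand["commonness"]) < 1e-9

print(top_candidates(spots))
```

For mutual admiration society the winner has a commonness of only about 0.39, which is exactly the kind of case where a context-aware disambiguation step would be needed.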

2. get-id

Given an entity title, get its id (the number assigned to it in the wiki).

http://localhost:8080/dexter-webapp/api/rest/get-id?title=johnny%20cash

http://localhost:8080/dexter-webapp/api/rest/get-id?title=johnny_cash

Both return the same result:

{
  "title": "Johnny_cash",
  "url": "",
  "id": 11983070
}
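
Typing percent-encoded URLs by hand is error-prone, so the request URLs in this post can also be built with the standard library. The sketch below is one way to do it; the helper name dexter_url is an assumption for illustration, and the base URL is the local one used throughout the post.

```python
from urllib.parse import quote, urlencode

# Base of the REST API as used in the URLs throughout this post.
BASE = "http://localhost:8080/dexter-webapp/api/rest/"


def dexter_url(endpoint, params):
    """Build a request URL, percent-encoding every parameter value.

    quote_via=quote encodes spaces as %20 (the urlencode default would
    use '+'), matching the URLs shown in this post.
    """
    return BASE + endpoint + "?" + urlencode(params, quote_via=quote)


print(dexter_url("get-id", {"title": "johnny cash"}))
print(dexter_url("get-id", {"title": "johnny_cash"}))
```

A dict is used instead of keyword arguments because some parameter names, such as min-conf and title-only, contain a hyphen.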

3. get-desc

Given an id, get its description. You can think of this as looking up the entity by its id.

http://localhost:8080/dexter-webapp/api/rest/get-desc?id=11983070&title-only=true

Remember to set the title-only parameter to true; otherwise the entity is not returned.

4. Batch processing with Python

Once port 8080 is open, you can use urllib and json to process data in bulk. For example:

import json
from urllib import parse, request

BASE_URL = 'http://localhost:8080/dexter-webapp/api/rest/'


def get_annotate_url(query, n=5, conf=0.5):
    # Percent-encode the whole query, not only the spaces, so characters
    # such as '&' or '#' in the text cannot break the URL.
    url = BASE_URL + 'annotate?text=' + parse.quote(query, safe='')
    url += '&n=' + str(n)
    url += '&min-conf=' + str(conf)
    url += '&wn=false&debug=false&format=text'
    return url


def get_id_to_entity_url(entity_id):
    # title-only=true makes get-desc return just the entity title.
    return BASE_URL + 'get-desc?title-only=true&id=' + str(entity_id)


def get_request(url):
    # Fetch the URL and decode the JSON body.
    with request.urlopen(url) as response:
        return json.loads(response.read().decode('utf-8'))


def get_entities_by_query(query, n=5, conf=0.5):
    anno_data = get_request(get_annotate_url(query, n, conf))
    # print(json.dumps(anno_data, indent=4))  # uncomment to inspect the raw JSON
    entities = {}
    for spot in anno_data['spots']:
        # Resolve each linked entity id to its Wikipedia title.
        url = get_id_to_entity_url(spot['entity'])
        entities[spot['entity']] = get_request(url)['title']
    return entities


entities = get_entities_by_query('bob dylan and johnny cash')
print(entities)
Output:
{4637590: 'Bob_Dylan', 11983070: 'Johnny_Cash'}
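
When many queries mention the same entities, the script above asks get-desc for the same id again and again. The following sketch processes a list of queries and caches the id-to-title lookups. The names fetch_json and link_batch are illustrative, and the fetch parameter exists only so the logic can be tried without a running server; the endpoints and parameters are the ones used earlier in this post.

```python
import json
from urllib import parse, request

REST = "http://localhost:8080/dexter-webapp/api/rest/"


def fetch_json(endpoint, params):
    """Call one REST endpoint and decode the JSON response."""
    url = REST + endpoint + "?" + parse.urlencode(params, quote_via=parse.quote)
    with request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def link_batch(queries, n=5, conf=0.5, fetch=fetch_json):
    """Return {query: {entity_id: title}} for every query in `queries`."""
    titles = {}   # id -> title cache shared by all queries
    results = {}
    for query in queries:
        data = fetch("annotate", {"text": query, "n": n, "min-conf": conf,
                                  "wn": "false", "debug": "false", "format": "text"})
        linked = {}
        for spot in data.get("spots", []):
            entity_id = spot["entity"]
            if entity_id not in titles:
                # Only hit get-desc the first time an id is seen.
                desc = fetch("get-desc", {"id": entity_id, "title-only": "true"})
                titles[entity_id] = desc["title"]
            linked[entity_id] = titles[entity_id]
        results[query] = linked
    return results


if __name__ == "__main__":
    # Requires a running Dexter instance on localhost:8080.
    print(link_batch(["bob dylan and johnny cash", "johnny cash at folsom prison"]))
```

The cache lives only for the duration of one link_batch call; for long-running jobs it could be moved outside the function or persisted.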

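Now you can process data in bulk. Unlike the hosted offerings on Tencent Cloud or Alibaba Cloud, which throttle traffic, a local Dexter instance has no rate limit, so you can query it as much as you like.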

Enjoy!
