Counting a Paper's Citations and Self-Citations with Python

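Yesterday I wrote a script for my boss that counts how many times a paper has been cited, and I'm sharing it here. Long live open source! The counts come from the Web of Science database, which means you need either a Web of Science account or an IP address that is allowed to use it. The script is aimed at researchers in science and engineering and works for mainstream English-language (SCI) journals. Whether Chinese core journals are covered depends on what Web of Science actually indexes.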

Required libraries

wos

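This package talks to the data interface (web services) that Web of Science provides. That interface has official documentation that is very hard to read; if you're curious, give it a try: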
http://ipscience-help.thomsonreuters.com/wosWebServicesLite/WebServicesLiteOverviewGroup/Introduction.html

re

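This module is part of the Python standard library, so Anaconda ships with it. It is used for string processing. Everything we scrape from the web is really just strings, so it is very handy when writing crawlers.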

xml

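This one also comes with Anaconda as part of the standard library. All the data returned by wos is stored as XML, so we need it to process the responses.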

sys

We only call its exit function, to terminate the script.

Implementation

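This code seems to be quite useful and plenty of people have been begging for it, so I'm posting it here. If you use it, please credit the author and the source. Commercial use is prohibited; open source means open source!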

# -*- coding: utf-8 -*-
"""
Created on Tue Feb 26 14:43:45 2019

@author: danphnis
"""

from wos import WosClient
import wos.utils
import xml.dom.minidom
import re
import sys

class paper:
    def __init__(self,xml_txt,r):
        self.year = int(xml_txt.childNodes[r*2-1].childNodes[r*1-1].childNodes[r*2-1].getAttribute('pubyear'))
        self.uid = xml_txt.childNodes[r*1-1].firstChild.data
        self.title = xml_txt.childNodes[r*2-1].childNodes[r*1-1].childNodes[r*3-1].childNodes[-r].firstChild.data
        print('Paper found, processing. Please check the title and DOI:')
        print(self.title)
        doi_temp = 'DOI not found'
        if xml_txt.childNodes[r*3-1].childNodes[r*2-1].childNodes[r*1-1].childNodes[-r].hasAttribute('value'):
            doi_temp = xml_txt.childNodes[r*3-1].childNodes[r*2-1].childNodes[r*1-1].childNodes[-r].getAttribute('value')
        self.doi = doi_temp
        print(self.doi)
        
        self.loc = []
        self.org = []
        temp = xml_txt.childNodes[r*2-1].childNodes[r*2-1].childNodes[r*5-1]
        for i in range(int(temp.getAttribute('count'))):
            self.loc.append(temp.childNodes[r*i+r-1].childNodes[r*1-1].childNodes[r*1-1].firstChild.data)
            self.org.append(temp.childNodes[r*i+r-1].childNodes[r*1-1].childNodes[r*2-1].childNodes[r*1-1].firstChild.data)
            
        self.authors = []
        temp = xml_txt.childNodes[r*2-1].childNodes[r*1-1].childNodes[r*4-1]
        for i in range(int(temp.getAttribute('count'))):
            dict_t = {'name':temp.childNodes[r*i+r-1].getElementsByTagName('wos_standard')[0].firstChild.data,'addr_no':[int(a)-1 for a in temp.childNodes[r*i+r-1].getAttribute('addr_no').split()]}
            self.authors.append(dict_t)

def str_ifsame():
    ifsame = input("Enter 1 for yes or 0 for no, then press Enter to continue; enter -1 to quit: ")
    try:
        ifsame = int(ifsame)
        return ifsame
    except ValueError:
        # The input was not an integer: ask again.
        return str_ifsame()

def compare_author(a1,a2,loc1,org1,loc2,org2):
    if a1['name'] == a2['name']:
        return True
    n1 = re.sub(r'[ .\-_]','',a1['name']).lower()
    n2 = re.sub(r'[ .\-_]','',a2['name']).lower()
    if len(n1) > len(n2):
        n1,n2 = n2,n1
    tol = 0
    count = 0
    for s in n1:
        if s in n2:
            continue
        else:
            count += 1
            if count > tol:
                return False
    
    print("Help needed: are these two the same person?\n")
    print("------------------------------------")
    print(a1['name'])
    for i in a1['addr_no']:
        print(loc1[i])
        print(org1[i])
    print("------------------------------------")
    print(a2['name'])
    for i in a2['addr_no']:
        print(loc2[i])
        print(org2[i])
    print("------------------------------------\n")
    ifsa = str_ifsame()
    if ifsa == 1:
        return True
    elif ifsa == -1:
        sys.exit()
    else:
        return False
    
    

def compare_authors(tp1,tp2):
    for author1 in tp1.authors:
        for author2 in tp2.authors:
            if compare_author(author1,author2,tp1.loc,tp1.org,tp2.loc,tp2.org):
                return True
    return False

# Pick one search query: by title (TI), by DOI (DO) or by ISSN (IS).
# querys = 'TI=Analysis of transpacific transport of black carbon during HIPPO-3: implications for black carbon aging'
# querys = 'DO=10.5194/acp-14-6315-2014'
querys = "IS=1674-2834"

with WosClient() as client:
    text = wos.utils.query(client, querys)
    paper_inf = paper(xml.dom.minidom.parseString(text).childNodes[0].childNodes[1],2)
    print("Searching for the articles that cite it")
    response = xml.dom.minidom.parseString(client.citingArticles(paper_inf.uid).records).childNodes[0].childNodes

    citing = []
    autocite2018 = 0
    cite2018 = 0
    autocite2019 = 0
    cite2019 = 0
    cite = len(response)
    print("Found "+repr(cite)+" articles")
    for i in range(cite):
        print("Processing item "+repr(i+1)+" of "+repr(cite)+"!")
        citing.append(paper(response[i],1))
        paper_t = citing[-1]
        if paper_t.year == 2018:
            cite2018 += 1
        elif paper_t.year == 2019:
            cite2019 += 1
        else:
            continue
            
        ifautocite = compare_authors(paper_t,paper_inf)
        if ifautocite:
            print("This article is a self-citation!")
            if paper_t.year == 2018:
                autocite2018 += 1
            elif paper_t.year == 2019:
                autocite2019 += 1
        else:
            print("This article is not a self-citation!")

    print("Total citations in 2018:")
    print("    "+repr(cite2018))
    print("Self-citations in 2018:")
    print("    "+repr(autocite2018))

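Once it is running, just follow the prompts. The script has to decide whether an author of one paper and an author of the other are the same person, and I don't have a particularly effective matching algorithm, so whenever the algorithm can't decide it asks for a manual judgment, like this: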

Authenticated (SID: 6ARVcXl7XJYQB6XJ1hL)
Paper found, processing. Please check the title and DOI:
Analysis of transpacific transport of black carbon during HIPPO-3: implications for black carbon aging
10.5194/acp-14-6315-2014
Searching for the articles that cite it
Found 14 articles
Processing item 1 of 14!
Paper found, processing. Please check the title and DOI:
Estimating Source Region Influences on Black Carbon Abundance, Microphysics, and Radiative Effect Observed Over South Korea
10.1029/2018JD029257
Help needed: are these two the same person?

------------------------------------
Campuzano-Jost, P
Univ Colorado Boulder, Cooperat Inst Res Environm Sci, Boulder, CO 80309 USA
Univ Colorado Boulder
Univ Calif Irvine, Dept Chem, Irvine, CA 92717 USA
Univ Calif Irvine
------------------------------------
Tao, S
Peking Univ, Coll Urban & Environm Sci, Beijing 100871, Peoples R China
Peking Univ
------------------------------------


Enter 1 for yes or 0 for no, then press Enter to continue; enter -1 to quit:

When the run finishes, it prints the citation count and the self-citation count:

Authenticated (SID: 7FXfMEXOlSvdikJeU8C)
Paper found, processing. Please check the title and DOI:
Regional earth system modeling: review and future directions
10.1080/16742834.2018.1452520
Searching for the articles that cite it
Found 2 articles
Processing item 1 of 2!
Paper found, processing. Please check the title and DOI:
The Simulation of East Asian Summer Monsoon Precipitation With a Regional Ocean-Atmosphere Coupled Model
10.1029/2018JD028541
Processing item 2 of 2!
Paper found, processing. Please check the title and DOI:
Future precipitation changes over China under 1.5 degrees C and 2.0 degrees C global warming targets by using CORDEX regional climate models
10.1016/j.scitotenv.2018.05.324
Total citations in 2018:
    2
Self-citations in 2018:
    0
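The script keeps one hard-coded counter per year (2018 and 2019). As a rough sketch of how that bookkeeping could be generalized, the snippet below tallies totals and self-citations for any year with collections.Counter. The tally function and the sample data are made up for illustration and are not part of the script above; unlike the script, this sketch counts every year instead of skipping years other than 2018 and 2019.

```python
from collections import Counter

def tally(citations):
    """Count citations per year.

    citations: iterable of (year, is_self_citation) pairs (hypothetical input format).
    Returns two Counters: total citations and self-citations, both keyed by year.
    """
    total = Counter()
    self_cites = Counter()
    for year, is_self in citations:
        total[year] += 1
        if is_self:
            self_cites[year] += 1
    return total, self_cites

# Made-up sample data, only to show the shape of the result.
sample = [(2018, False), (2018, True), (2019, False), (2017, False)]
total, self_cites = tally(sample)
for year in sorted(total):
    print(year, "total:", total[year], "self:", self_cites[year])
```

A Counter returns 0 for years it has never seen, so no year needs to be initialized in advance.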

Breakdown

After some thought I decided to write a short walkthrough after all, in case anyone reads it.

querys = 'TI=Analysis of transpacific transport of black carbon during HIPPO-3: implications for black carbon aging'

This is the search query. Written this way it searches by title (TI). You can also search by DOI by replacing the leading TI with DO.
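As a small illustration of the query format, here is a sketch that assembles the string from a field tag and a value. build_query and KNOWN_TAGS are hypothetical helpers, not part of the wos package, and only the three tags that appear in this post (TI, DO, IS) are listed.

```python
# Field tags that appear in this post; Web of Science defines more.
KNOWN_TAGS = {"TI": "title", "DO": "DOI", "IS": "ISSN"}

def build_query(tag, value):
    """Return a 'TAG=value' query string (hypothetical helper, not part of wos)."""
    tag = tag.upper()
    if tag not in KNOWN_TAGS:
        raise ValueError("unknown field tag: " + tag)
    return tag + "=" + value.strip()

querys = build_query("DO", "10.5194/acp-14-6315-2014")
print(querys)
```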

with WosClient() as client:
   text = wos.utils.query(client, querys)
   paper_inf = paper(xml.dom.minidom.parseString(text).childNodes[0].childNodes[1],2)

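This looks up the paper and stores the information we need in an object named paper_inf, an instance of the paper class defined below. The chains of childNodes pick out the useful fields, which comes down to reading the XML format; see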
https://www.cnblogs.com/xiaobingqianrui/p/8405813.html

class paper:
   def __init__(self,xml_txt,r):
       self.year = int(xml_txt.childNodes[r*2-1].childNodes[r*1-1].childNodes[r*2-1].getAttribute('pubyear'))
       self.uid = xml_txt.childNodes[r*1-1].firstChild.data
       self.title = xml_txt.childNodes[r*2-1].childNodes[r*1-1].childNodes[r*3-1].childNodes[-r].firstChild.data
       print('Paper found, processing. Please check the title and DOI:')
       print(self.title)
       doi_temp = 'DOI not found'
       if xml_txt.childNodes[r*3-1].childNodes[r*2-1].childNodes[r*1-1].childNodes[-r].hasAttribute('value'):
           doi_temp = xml_txt.childNodes[r*3-1].childNodes[r*2-1].childNodes[r*1-1].childNodes[-r].getAttribute('value')
       self.doi = doi_temp
       print(self.doi)
       
       self.loc = []
       self.org = []
       temp = xml_txt.childNodes[r*2-1].childNodes[r*2-1].childNodes[r*5-1]
       for i in range(int(temp.getAttribute('count'))):
           self.loc.append(temp.childNodes[r*i+r-1].childNodes[r*1-1].childNodes[r*1-1].firstChild.data)
           self.org.append(temp.childNodes[r*i+r-1].childNodes[r*1-1].childNodes[r*2-1].childNodes[r*1-1].firstChild.data)
           
       self.authors = []
       temp = xml_txt.childNodes[r*2-1].childNodes[r*1-1].childNodes[r*4-1]
       for i in range(int(temp.getAttribute('count'))):
           dict_t = {'name':temp.childNodes[r*i+r-1].getElementsByTagName('wos_standard')[0].firstChild.data,'addr_no':[int(a)-1 for a in temp.childNodes[r*i+r-1].getAttribute('addr_no').split()]}
           self.authors.append(dict_t)

Here we extract the authors and the institutions they belong to. Most of it is just tedious XML handling.
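The r argument is not explained in the post. My reading, which is an assumption based on how the script calls paper(..., 2) for the query result and paper(..., 1) for the citing records, is that r=2 is meant for pretty-printed XML, where minidom inserts a whitespace text node between elements, and r=1 for compact XML. Under that assumption childNodes[r*k-1] is the k-th element and childNodes[-r] is the last one. The XML strings below are invented for the demonstration and are much simpler than a real Web of Science record.

```python
import xml.dom.minidom

# Two made-up records with the same elements: one compact, one pretty-printed.
COMPACT = "<REC><UID>WOS:0001</UID><static_data><summary/></static_data></REC>"
PRETTY = """<REC>
  <UID>WOS:0001</UID>
  <static_data>
    <summary/>
  </static_data>
</REC>"""

def kth_child(node, k, r):
    """Return the k-th element child (1-based), using the same index rule as the script."""
    return node.childNodes[r * k - 1]

compact_rec = xml.dom.minidom.parseString(COMPACT).documentElement
pretty_rec = xml.dom.minidom.parseString(PRETTY).documentElement

# The pretty-printed record has extra whitespace text nodes around each element.
print(len(compact_rec.childNodes), len(pretty_rec.childNodes))

# With the matching r, both forms give the same element.
print(kth_child(compact_rec, 2, 1).tagName, kth_child(pretty_rec, 2, 2).tagName)
print(compact_rec.childNodes[-1].tagName, pretty_rec.childNodes[-2].tagName)
```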

  response = xml.dom.minidom.parseString(client.citingArticles(paper_inf.uid).records).childNodes[0].childNodes

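Next we search for the articles that cite this paper. This needs the record identifier (UID) from the previous search, which is stored in paper_inf.uid. After that we repeat the steps above to extract the useful fields of each citing article, and then compare the author information.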
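To make the author comparison easier to follow, here is the name check from compare_author restated on its own, without the interactive prompt. The function names normalize and prefilter are mine and this is a sketch, not a replacement for the script. It returns "same" for identical names, "different" when some character of the shorter name is missing from the longer one, and "ask" when the script would fall back to a manual check.

```python
import re

def normalize(name):
    """Drop spaces, dots, hyphens and underscores, then lowercase (same regex as the script)."""
    return re.sub(r"[ .\-_]", "", name).lower()

def prefilter(name1, name2):
    """Classify a pair of author names as 'same', 'different' or 'ask'."""
    if name1 == name2:
        return "same"
    # Put the shorter normalized name first, as the script does.
    short, long_ = sorted((normalize(name1), normalize(name2)), key=len)
    # Every character of the shorter name must occur somewhere in the longer one.
    if all(ch in long_ for ch in short):
        return "ask"
    return "different"

print(prefilter("Tao, S", "Tao, S"))
print(prefilter("Tao, S", "Li, W"))
print(prefilter("Tao, S", "Campuzano-Jost, P"))
```

The last pair is the one from the sample session: every character of "tao,s" also occurs in "campuzanojost,p", so the check cannot rule the pair out and the script asks for help. The test only looks at which characters occur, not at their order, which is why it prompts fairly often.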

Closing remarks

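The code is actually quite simple, and experienced readers will see what it does at a glance. The only slightly tricky part is handling the fields of the returned XML, which is tedious. Copy the response into an .xml file, open it in IE, and check it against the code a few times until the indices line up.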
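As an alternative to reading the XML in a browser, a short helper can print each element together with its position in childNodes, which is the number the paper class needs. This is only a sketch: outline is a hypothetical helper, and the sample string is invented and far smaller than a real response.

```python
import xml.dom.minidom

def outline(node, max_depth=3, depth=0):
    """Return lines of the form '[index] tagName' for element children, indented by depth."""
    lines = []
    for index, child in enumerate(node.childNodes):
        if child.nodeType != child.ELEMENT_NODE:
            continue  # skip whitespace and other text nodes, but keep counting them
        lines.append("  " * depth + "[" + str(index) + "] " + child.tagName)
        if depth + 1 < max_depth:
            lines.extend(outline(child, max_depth, depth + 1))
    return lines

# Made-up record; in practice pass the parsed response text instead.
sample = "<REC>\n<UID>x</UID>\n<static_data><summary/></static_data>\n</REC>"
doc = xml.dom.minidom.parseString(sample)
for line in outline(doc):
    print(line)
```

The index printed in brackets is the value to put inside childNodes[...] at that level.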
