Parsing XML data files with Python SAX, and how to remove special tags from the parsed text

wq

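Before parsing, the irregular text in the data (inline formatting tags such as <sub>, <b>, <i> and <sup>) has to be preprocessed. Take the following record as an example: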

<Abstract>
   <AbstractText><b>Background:</b> Lung adenocarcinoma has a strong tendency to develop 
into bone metastases, especially spinal metastases (SM). Long noncoding RNAs (lncRNAs) play 
critical roles in regulating several biological processes in cancer cells. However, the 
mechanisms underlying the roles of lncRNAs in the development of SM have not been 
elucidated to date. <b>Methods:</b> Clinical specimens were collected for analysis of differentially expressed lncRNAs. The Kyoto Encyclopedia of Genes and Genomes (KEGG) was 
used to examine the effects of these genes on pathways. RNA pull-down was utilized to 
identify the targeting protein of lncRNAs. The effects of lncRNA on its target were 
detected in A549 and SPCA-1 cells via perturbation of the lncRNA expression. Oncological 
behavioral changes in transfected cells and phosphorylation of kinases in the relevant 
pathways, with or without inhibitors, were observed. Further, tumorigenicity was found to 
occur in experimental nude mice. <b>Results:</b> LINC00852<sup>s2</sup> and the mitogen-activated 
protein kinase (MAPK) pathway were found to be associated with SM. Moreover, the LINC00852 
target S100A9 had a positive regulatory role in the progression, migration, invasion, and 
metastasis of lung adenocarcinoma cells, both <i>in vitro</i> and <i>in vivo</i>. 
Furthermore, S100A9 strongly activated the P38 and REK<sub>1/2</sub> kinases, and slightly activated 
the phosphorylation of the JNK kinase in the MAPK pathway in A549 and SPCA-1 cells. 
<b>Conclusion:</b> LINC00852 targets S100A9 to promote progression and oncogenic ability in 
lung adenocarcinoma SM through activation of the MAPK pathway. These findings suggest a 
potential novel target for early intervention against SM in lung cancer.
    </AbstractText>
</Abstract>
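To see why these tags get in the way, here is a minimal sketch (the `EventLogger` class and the shortened sample string are illustrative assumptions, not part of the original code) that records the SAX callbacks for a small piece of the abstract above:

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Record every SAX callback so the order of events is visible."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, tag, attributes):
        self.events.append(('start', tag))

    def endElement(self, tag):
        self.events.append(('end', tag))

    def characters(self, content):
        # Ignore whitespace-only chunks.
        if content.strip():
            self.events.append(('text', content.strip()))


# Shortened, hypothetical sample in the style of the abstract above.
sample = b"<AbstractText><b>Results:</b> REK<sub>1/2</sub> kinases</AbstractText>"
logger = EventLogger()
xml.sax.parseString(sample, logger)
for event in logger.events:
    print(event)
```

The abstract no longer arrives as one piece of text: each <b> or <sub> produces its own start and end events, and the surrounding text is delivered in separate fragments. A handler that only looks at the name of the most recently opened tag will therefore lose part of the abstract.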

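The fix is to preprocess the files first and parse them afterwards. I have a large amount of data to handle, so I collect the file paths in lists and process them in batch: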


import os
from variation_preprocess.pubmed_test import xml_parser

source_dir = 'G:\\Pubmed_file\\'

List_Fname = []
List_Sname = []
List_csvname = []

def listdir(path, list_Fname, list_Sname, list_csv):
    # For every .xml file in `path`, collect the input path, the path of the
    # preprocessed copy, and the path of the CSV that will hold the parsed data.
    for file in os.listdir(path):
        if file.endswith('.xml'):
            file_path = os.path.join(path, file)
            file_save = 'G:/PubMed/' + file[:-4] + '_edited.xml'
            file_csv = 'E:/PubMed/' + file[:-4] + '.csv'
            list_Fname.append(file_path)
            list_Sname.append(file_save)
            list_csv.append(file_csv)
    return list_Fname, list_Sname, list_csv


# Remove the inline tags. <sup> and <sub> are kept as '^' and '_' markers;
# <b> and <i> are dropped entirely (the original version handled only <sup>/<sub>).
def xml_process(list_Pname, list_Sname):
    for file_path, temp_save in zip(list_Pname, list_Sname):
        print(file_path, temp_save)
        # Open the output with 'w' so that a rerun overwrites the file
        # instead of appending a second copy to it.
        with open(file_path, 'r', encoding='utf-8') as tf, \
                open(temp_save, 'w', encoding='utf-8') as sf:
            for line in tf:
                line = line.replace('<sup>', '^').replace('</sup>', '')
                line = line.replace('<sub>', '_').replace('</sub>', '')
                line = line.replace('<b>', '').replace('</b>', '')
                line = line.replace('<i>', '').replace('</i>', '')
                sf.write(line)


if __name__ == '__main__':
    list_Fname, list_Sname, list_csvname = listdir(source_dir, List_Fname, List_Sname, List_csvname)
    # Step 1: write the preprocessed *_edited.xml files.
    # Comment this line out once they exist to skip the step on later runs.
    xml_process(list_Fname, list_Sname)

    # Step 2: parse each preprocessed file and save the result as CSV.
    for file_direc, file_save in zip(list_Sname, list_csvname):
        xml_parser(file_direc, file_save)
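The chained replace calls can also be written as a single regular expression. The following is only a sketch: the helper name `strip_inline_tags` and the `MARKERS` table are my own assumptions, and it assumes the inline tags carry no attributes.

```python
import re

# Matches only the opening and closing forms of the four inline tags,
# e.g. <b>, </i>, <sup>. Structural tags such as <AbstractText> are untouched.
INLINE_TAG = re.compile(r'</?(b|i|sub|sup)>')

# Opening tags that should leave a marker behind; everything else is dropped.
MARKERS = {'<sup>': '^', '<sub>': '_'}


def strip_inline_tags(line):
    """Return `line` with inline formatting tags removed or replaced by a marker."""
    return INLINE_TAG.sub(lambda m: MARKERS.get(m.group(0), ''), line)


print(strip_inline_tags('REK<sub>1/2</sub> <i>in vitro</i> LINC00852<sup>s2</sup>'))
```

Adding another inline tag then only means extending the pattern, rather than adding another pair of replace calls.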

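After that the data can be parsed. I use the SAX approach to parse the XML documents here, because it reads the file as a stream of events and is well suited to processing data in bulk.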

# -*- coding:UTF-8 -*-

import xml.sax
import pandas as pd

# Running count of the PubmedArticle records parsed so far.
i = 0

class SaxHandler(xml.sax.ContentHandler) :

    def __init__(self) :
        super().__init__()
        self.Pubmed = []   # one list of fields per parsed PubmedArticle
        self.init()        # set every per-record field to its empty default

    def startElement(self, tag, attributes) :
        # self.Tags records every start tag in document order and is never popped,
        # so Tags[-2] is the previously opened element, not necessarily the parent.
        self.Tags.append(tag)
        if self.Tags[-2] == "Journal" and self.Tags[-1] == 'ISSN' :
            try :
                self.IssnType = attributes['IssnType']
                return
            except KeyError :
                pass
        if self.Tags[-3] == "Journal" and self.Tags[-1] == 'JournalIssue':
            try :
                self.CitedMedium = attributes['CitedMedium']
                return
            except KeyError :
                pass

        # First AbstractText of an abstract: start the text with its Label, if any.
        if self.Tags[-2] == 'Abstract' and self.Tags[-1] == 'AbstractText' :
            try :
                text = attributes['Label']
                self.AbstractText = text + ":"
                return
            except KeyError :
                pass
        # Following AbstractText sections: append their Label after a " ## " separator.
        if self.Tags[-2] == 'AbstractText' and self.Tags[-1] == 'AbstractText' :
            try :
                text = attributes['Label']
                self.AbstractText = self.AbstractText + " ## " + text + ":"
                return
            except KeyError :
                pass

    def endElement(self, tag):
        global i
        self.CurrentData = tag
        # A closing PubmedArticle tag means one record is complete: store it and reset.
        if self.CurrentData == 'PubmedArticle':
            self.Pubmed.append([self.PMID, self.DateCompleted, self.DateRevised, self.ISSN, self.IssnType,
                                self.CitedMedium, self.date, self.Date_year, self.ArticleType,
                                self.ISOAbbreviation, self.ArticleTitle, self.Language, self.ELocationID,
                                self.Author,self.Author_full,  self.Affiliation, self.Keywords, self.AbstractText])
            # print(self.PMID, self.ISSN, self.date, self.Date_year, self.ArticleType, self.ArticleTitle, self.ELocationID)
            i = i + 1
            self.init()
            if (i % 4000) == 0 :
                print("Processed %d records" % i)

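    def characters(self, content) :
        # Note: SAX may deliver one text node in several chunks; the fields
        # assigned with '=' below then keep only the last chunk.
        # Skip whitespace-only chunks (indentation and newlines between elements).
        if content.strip() == '':
            return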
        if self.Tags[-2] == 'MedlineCitation' and self.Tags[-1] == "PMID":
            self.PMID = content
            return
        if self.Tags[-2] == 'Journal' and self.Tags[-1] == "ISSN":
            self.ISSN = content
            return
        if self.Tags[-2] == "PubDate" and self.Tags[-1] == 'Year':
            self.Date_year = content
            self.date = self.Date_year
            return
        if self.Tags[-3] == "PubDate" and self.Tags[-1] == 'Month':
            self.Date_month = content
            self.date = self.Date_month + '/' + self.Date_year
            return
        if self.Tags[-4] == "PubDate" and self.Tags[-1] == "Day":
            self.Date_day = content
            self.date = self.Date_day + '/' + self.Date_month + '/' + self.Date_year
            return
        if self.Tags[-2] == "DateCompleted" and self.Tags[-1] == "Year":
            self.DateCompleted = content
            return
        if self.Tags[-3] == "DateCompleted" and self.Tags[-1] == "Month":
            self.DateCompleted_Month = content
            self.DateCompleted = self.DateCompleted_Month+'/'+self.DateCompleted
            return
        if self.Tags[-4] == "DateCompleted" and self.Tags[-1] == "Day":
            self.DateCompleted = content+'/'+self.DateCompleted
            return
        if self.Tags[-2] == "DateRevised" and self.Tags[-1] == "Year":
            self.DateRevised = content
            return
        if self.Tags[-3] == "DateRevised" and self.Tags[-1] == "Month":
            self.DateRevised_Month = content
            self.DateRevised = self.DateRevised_Month + "/" + self.DateRevised
            return
        if self.Tags[-4] == "DateRevised" and self.Tags[-1] == "Day":
            self.DateRevised = content+'/'+self.DateRevised
            return
        if self.Tags[-1] == "Title":
            self.ArticleType = content
            return
        if self.Tags[-1] == "ISOAbbreviation":
            self.ISOAbbreviation = content
            return
        if self.Tags[-1] == "ArticleTitle":
            self.ArticleTitle = content
            return
        if self.Tags[-1] == "ELocationID":
            self.ELocationID = content
            return
        if self.Tags[-2] == 'Abstract' and self.Tags[-1] == 'AbstractText':
            if len(self.AbstractText) > 0:
                self.AbstractText = self.AbstractText + content
                return
            else :
                self.AbstractText = content
                return
        if self.Tags[-2] == 'AbstractText' and self.Tags[-1] == 'AbstractText':
            self.AbstractText = self.AbstractText + content
            return
        if self.Tags[-1] == 'Keyword':
            if len(self.Keywords) == 0:
                self.Keywords = content
                return
            else :
                self.Keywords = self.Keywords + ";" + content
                return
        if self.Tags[-1] == 'LastName':
            self.LastName = content
            return
        if self.Tags[-1] == 'ForeName':
            self.ForeName = content
            return
        if self.Tags[-1] == 'Initials':
            self.Initials = content
            if len(self.Author) > 0 :
                self.Author = self.Author + "; " + self.LastName + ' ' + self.Initials
                self.Author_full = self.Author_full + "; " + self.LastName + ' ' + self.ForeName
                return
            else :
                self.Author = self.LastName + ' ' + self.Initials
                self.Author_full = self.LastName + ' ' + self.ForeName
                return
        if self.Tags[-1] == 'Affiliation' :
            if len(self.Affiliation) > 0 :
                self.Affiliation = self.Affiliation + ";" + self.LastName + " " + self.ForeName + ":" + content
                return
            else :
                self.Affiliation = self.LastName + " " + self.ForeName + ":" + content
                return
        if self.Tags[-1] == 'Language':
            self.Language = content
            return

    def init(self) :
        # Reset all per-record fields before the next PubmedArticle is read.
        self.CurrentData = ''
        self.PMID = ''
        self.ISSN = ''
        self.date = ''
        self.Date_year = ''
        self.Date_month = ''
        self.Date_day = ''
        self.DateCompleted = ''
        self.DateCompleted_Month = ''
        self.DateRevised_Month = ''
        self.DateRevised = ''
        self.IssnType = ''
        self.CitedMedium = ''
        self.ArticleType = ''
        self.ISOAbbreviation = ''
        self.Journal_Title = ''
        self.ArticleTitle = ''
        self.ELocationID = ''
        self.AbstractText = ''
        self.Author = ''
        self.Author_full = ''
        self.LastName = ''
        self.ForeName = ''
        self.Initials = ''
        self.Identifier = ''
        self.Affiliation = ''
        self.Keywords = ''
        self.Language = ''
        self.PublicationType = ''
        self.Tags = ['a', 'b', 'c', 'd', 'e']


def xml_parser(file_loca, save_path) :
    # Build a SAX parser with namespace processing turned off.
    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    Handler = SaxHandler()
    parser.setContentHandler(Handler)

    parser.parse(file_loca)
    data = Handler.Pubmed
    print(len(data))           # number of records
    if data :
        print(len(data[0]))    # number of fields per record
    # The column order must match the list appended in SaxHandler.endElement.
    columns = ["PMID", "DateCompleted","DateRevised","ISSN", "IssnType", "CitedMedium", "date", "year", "Article_Type", "ISOAbbreviation",
               "ArticleTitle", "Language", "ELocationID", "Author","Author_full", "Affiliation", "Keywords", "AbstractText"]
    data2 = pd.DataFrame(data, columns=columns)
    data2.to_csv(save_path, index=False, encoding='utf-8')
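As an alternative to rewriting the files, the handler itself can be made tolerant of inline tags. The sketch below is not the approach used above: the `StackHandler` class, its `INLINE` set and the sample string are assumptions for illustration. It keeps a real stack of open elements, ignores the inline tags when updating that stack, and buffers text chunks instead of assigning them. Unlike the preprocessing step, it leaves no '^' or '_' markers for <sup> and <sub>.

```python
import xml.sax


class StackHandler(xml.sax.ContentHandler):
    """Collect the full text of every AbstractText element, inline tags included."""

    INLINE = {'b', 'i', 'sub', 'sup'}

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open structural elements, innermost last
        self.buffer = []      # text chunks of the AbstractText being read
        self.abstracts = []   # finished abstract texts

    def startElement(self, tag, attributes):
        if tag in self.INLINE:
            return            # inline tags never change the current element
        self.stack.append(tag)
        if tag == 'AbstractText':
            self.buffer = []

    def endElement(self, tag):
        if tag in self.INLINE:
            return
        if tag == 'AbstractText':
            self.abstracts.append(''.join(self.buffer).strip())
        self.stack.pop()

    def characters(self, content):
        # characters() may be called several times per text node,
        # so the chunks are appended rather than assigned.
        if self.stack and self.stack[-1] == 'AbstractText':
            self.buffer.append(content)


handler = StackHandler()
xml.sax.parseString(
    b"<Abstract><AbstractText><b>Results:</b> REK<sub>1/2</sub> kinases, "
    b"<i>in vitro</i>.</AbstractText></Abstract>",
    handler,
)
print(handler.abstracts)
```

Because the stack is popped in endElement, `self.stack[-1]` is always the enclosing structural element, so text that follows a closing </b> or </sub> is still attributed to AbstractText.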
