Plotting a Sina Weibo Repost Propagation Graph with pyecharts (link to the full code at the end)

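Task

Continuing the earlier crawler project, the idea a classmate originally proposed was to generate a graph like the one shown below: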

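Background

At the time, however, I was still new to web crawling. I had no idea how to construct this kind of data relationship or which package could display it, so I set the idea aside.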

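After analyzing Weibo text and generating a word cloud, it occurred to me that plotting regional statistics on a map must be a common need, so a ready-made library had to exist. After browsing widely, I settled on pyecharts.
After installing it, though, I found that none of the examples online worked: almost all of them were written for version 0.5, an older generation of the library. Rather than reinstall an old version, I preferred to stay on the current one, so I went to the pyecharts repository on GitHub.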

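Until then I had been using GitHub without really understanding it, so I decided to learn the basics properly. Fortunately I came across Liao Xuefeng's Git tutorial; after a day or two of reading and experimenting I understood the essentials and was no longer lost.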

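After cloning the project locally and successfully running the map example, I kept digging through this treasure trove and noticed that the relationship graphs in the gallery looked like a good fit. When I ran the local Graph example, I was pleasantly surprised to find exactly the chart I needed: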

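As the sayings go, "the flowers you plant with care refuse to bloom, while the willow you stuck in the ground without a thought grows into shade", and "you wear out iron shoes searching in vain, then find it with no effort at all". (I recently noticed my command of literary Chinese slipping, so I should review it more often.)
So I set to work at once and reverse-engineered the example to complete this project.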

Process

Plotting code

The code that draws the graph is as follows:

import json
import os

from pyecharts.commons.utils import JsCode
from pyecharts import options as opts
from pyecharts.charts import Graph, Page
from pyecharts.faker import Collector
# Lets the chart be rendered and displayed in Jupyter Lab
from pyecharts.globals import CurrentConfig,NotebookType
CurrentConfig.NOTEBOOK_TYPE = NotebookType.JUPYTER_LAB

def graph_weibo() -> Graph:
    with open(os.path.join("fixtures", "weibo.json"), "r", encoding="utf-8") as f:
        j = json.load(f)
        nodes, links, categories, cont, mid, userl = j
        
    c = (
        Graph()
        .add(
            "",
            nodes,
            links,
            categories,
            repulsion=50,
            linestyle_opts=opts.LineStyleOpts(curve=0.2),
            label_opts=opts.LabelOpts(is_show=False),  # set is_show=True to show node labels
        )
        .set_global_opts(
            legend_opts=opts.LegendOpts(is_show=False),  # set is_show=True to show the legend
            title_opts=opts.TitleOpts(title="Graph - Weibo repost relationships"),
        )
    )
    return c

Run the following statements one after another:

c = graph_weibo()

c.load_javascript()

c.render_notebook()

The propagation graph appears:

Analysis of the plotting parameters

The parameters needed for plotting are passed in by reading a JSON file:

    with open(os.path.join("fixtures", "weibo.json"), "r", encoding="utf-8") as f:
        j = json.load(f)

Opening the local weibo.json file and inspecting it confirms what the unpacking assignment suggests:

        nodes, links, categories, cont, mid, userl = j

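The JSON file is a list of six elements: the nodes, the links, the categories, the text of the Weibo post, the post's mid, and the poster's screen name.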
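To make the six-part layout concrete, here is a minimal sketch that unpacks such a list and checks that it is internally consistent. The miniature SAMPLE data and the check_layout helper are hypothetical, and the check that every node's category appears in the category list is my assumption about the format, not something the pyecharts example states.

```python
import json

# Hypothetical miniature stand-in for weibo.json with the same six-part layout.
SAMPLE = json.dumps([
    [
        {"name": "A", "symbolSize": 5, "draggable": "False", "value": 1, "category": "A"},
        {"name": "B", "symbolSize": 5, "draggable": "False", "value": 0, "category": "A"},
    ],
    [{"source": "A", "target": "B"}],
    [{"name": "A"}],
    "post text",
    "0000000000000000",
    "A",
])


def check_layout(raw):
    """Unpack the six parts and report links or categories that name unknown entries."""
    nodes, links, categories, cont, mid, userl = json.loads(raw)
    node_names = {node["name"] for node in nodes}
    category_names = {cat["name"] for cat in categories}
    bad_links = [
        link for link in links
        if link["source"] not in node_names or link["target"] not in node_names
    ]
    # Assumption: a node's category should be one of the listed category names.
    bad_categories = [
        node["name"] for node in nodes if node["category"] not in category_names
    ]
    return {
        "nodes": len(nodes),
        "links": len(links),
        "categories": len(categories),
        "bad_links": bad_links,
        "bad_categories": bad_categories,
    }


report = check_layout(SAMPLE)
print(report)
```

Running a check like this on a freshly generated file before plotting can catch dangling links early, which is easier than debugging an empty chart.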

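Node format

Each node is a dictionary containing the following information:

        {
            "name": "Camel3942",            // screen name of the reposting user
            "symbolSize": 5,                // size of the marker in the graph
            "draggable": "False",           // whether the node can be dragged
            "value": 1,                     // number of times this user's repost was reposted again
            "category": "Camel3942",        // once reposted again, the node belongs to the category named after this user; otherwise it belongs to the category of the user it reposted from
            "label": {                      // present only if this user was reposted again
                "normal": {
                    "show": "True"
                }
            }
        },
        ……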

Compare this with the node of a user whose repost was not reposted again:

        {
            "name": "超昂閃存",
            "symbolSize": 5,
            "draggable": "False",
            "value": 0,
            "category": "重工組長於彥舒"
        },
        ……
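The difference between the two node shapes can be sketched as a small builder function. The name make_node and its parameters are hypothetical; the rule it encodes is the one described in the comments above (a user who was reposted again heads a category of their own and carries a label), not code taken from pyecharts.

```python
def make_node(name, value, source_category):
    """Build a node dict in the layout shown above.

    value is how many times this user's repost was reposted again;
    source_category is the category of the user they reposted from.
    """
    node = {
        "name": name,
        "symbolSize": 5,
        "draggable": "False",
        "value": value,
        "category": source_category,
    }
    if value > 0:
        # Reposted again: the user heads their own category and gets a visible label.
        node["category"] = name
        node["label"] = {"normal": {"show": "True"}}
    return node


reposted = make_node("Camel3942", 1, "someone")
leaf = make_node("超昂閃存", 0, "重工組長於彥舒")
print(reposted)
print(leaf)
```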

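Link format

This part is simple: each link records where a repost came from (source) and the user who posted it (target).
Specifically, if the user reposted the original post directly, source is the original poster; if the user reposted someone else's repost of it, source is that other user.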

        {
            "source": "新浪體育",
            "target": "Beijingold4"
        },
        {
            "source": "麻黑浮雲",
            "target": "X一塊紅布"
        },
        ……
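The source rule can be written as a one-line decision. This is only a sketch: make_link and its parameter names are hypothetical, and via stands for whatever name was parsed out of the repost text (empty when nothing was found).

```python
def make_link(reposter, via, original_poster):
    """Return a link dict for one repost.

    via is the user named in the repost prefix, or '' for a direct repost,
    in which case the original poster is taken as the source.
    """
    source = via if via else original_poster
    return {"source": source, "target": reposter}


direct = make_link("Beijingold4", "", "新浪體育")
indirect = make_link("X一塊紅布", "麻黑浮雲", "新浪體育")
print(direct)
print(indirect)
```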

Category format

Simpler still: the screen names of all users whose reposts were reposted again:

        {
            "name": "Camel3942"
        },
        {
            "name": "Christinez"
        },
        {
            "name": "JoannaBlue"
        },
        ……
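Given a list of nodes, the category list could be derived along these lines. This is a sketch under the assumption that "reposted again" simply means value > 0; categories_from_nodes is a hypothetical helper name.

```python
def categories_from_nodes(nodes):
    """Collect one {'name': ...} entry per node with value > 0, keeping first-seen order."""
    names = {}
    for node in nodes:
        if node["value"] > 0:
            names.setdefault(node["name"], None)  # dict keys keep insertion order
    return [{"name": name} for name in names]


sample_nodes = [
    {"name": "Camel3942", "value": 1},
    {"name": "超昂閃存", "value": 0},
    {"name": "Christinez", "value": 3},
]
cats = categories_from_nodes(sample_nodes)
print(cats)
```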

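From my analysis, passing in the categories lets each category be rendered as a group, as shown below:

Overview

Extracting the repost relationships

Analyzing the Weibo text shows that the HTML carrying the reposted user's information has roughly the following structure, where SCREEN_NAME is the screen name of the user being reposted:

//<a href='/n/SCREEN_NAME'>@SCREEN_NAME</a>: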

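For example, a post body that contains repost information,

"//@宇字號湯包or湯圓:紅十字會依然是當年的紅十字會，郭美美事件一點都沒有改變它"

has the following text content:

"//<a href='/n/宇字號湯包or湯圓'>@宇字號湯包or湯圓</a>:紅十字會依然是當年的紅十字會，郭美美事件一點都沒有改變它"

Drawing on my basic knowledge of regular expressions, I used Python's re module, which provides the matching functions needed here.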
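As a first approximation, the extraction can be sketched with a single pattern. This assumes the name is wrapped in an anchor that closes with </a> before the colon, which is what the regular expressions later in this article expect; first_source and REPOST_PREFIX are hypothetical names.

```python
import re

# '//' + opening <a ...> tag + '@' + captured screen name + '</a>' + ASCII colon
REPOST_PREFIX = re.compile(r"//<a.*?>@(.*?)</a>:")


def first_source(text):
    """Return the first reposted screen name found in text, or '' if there is none."""
    match = REPOST_PREFIX.search(text)
    return match.group(1) if match else ""


sample = "//<a href='/n/宇字號湯包or湯圓'>@宇字號湯包or湯圓</a>:紅十字會依然是當年的紅十字會"
name = first_source(sample)
print(name)
```

Both quantifiers are lazy, so the capture stops at the first closing </a> followed by a colon rather than running on to a later one.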

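The first version ran successfully.

One case went wrong, however, and did not yield the correct screen name:

The same problem shows up in the example file. At the time I did not look into the cause and just treated it as a joke:

I judged it to be a character-matching error and pulled up that user's text.
From past debugging experience, I guessed the cause was the difference between Chinese and English forms of a character. Swapping the ASCII colon ":" in the program for the full-width colon "：" found in the text did indeed reproduce the kind of error that had puzzled me earlier.
The inconsistency presumably originates inside Weibo itself, and I need to work around it by adding a condition to the match: both the ASCII colon and the full-width Chinese colon must be accepted.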
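When two characters look alike on screen, printing their code points and Unicode names settles the question quickly. The following is a small diagnostic sketch using only the standard library; describe is a hypothetical helper name.

```python
import unicodedata


def describe(ch):
    """Return the code point and official Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


ascii_colon = describe(":")
fullwidth_colon = describe("：")
print(ascii_colon)      # U+003A COLON
print(fullwidth_colon)  # U+FF1A FULLWIDTH COLON
```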

The final code, which extracts the screen name of the repost source from the text, is as follows:

import re

# Utility class
class Tool:
    repostEN = re.compile('//<a.*?>@(.*?)</a>:')   # ASCII colon
    repostCN = re.compile('//<a.*?>@(.*?)</a>：')  # full-width (Chinese) colon

    @classmethod
    def findSource(cls, x):
        sourceName = ''
        xEN = re.findall(cls.repostEN, x)
        xCN = re.findall(cls.repostCN, x)

        # If only one of the two patterns matches, return its first match
        if len(xCN) == 0 and len(xEN) > 0:
            sourceName = xEN[0]
        elif len(xEN) == 0 and len(xCN) > 0:
            sourceName = xCN[0]
        # If both match, return the shorter of the two first matches
        elif len(xEN) > 0 and len(xCN) > 0:
            sourceName = xCN[0] if len(xEN[0]) > len(xCN[0]) else xEN[0]

        return sourceName
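An alternative worth noting is to fold both colon forms into one character class, so a single search suffices. This is a sketch of that variant, not the code used in this project; find_source and REPOST are hypothetical names, and I have only reasoned through the cases shown here.

```python
import re

# One pattern for both colon forms: ASCII ':' or full-width '：'
REPOST = re.compile(r"//<a.*?>@(.*?)</a>[:：]")


def find_source(text):
    """Return the nearest repost source named in text, or '' if none is found."""
    match = REPOST.search(text)
    return match.group(1) if match else ""


en = find_source("//<a href='/n/A'>@A</a>:hello")
cn = find_source("//<a href='/n/B'>@B</a>：hello")
chain = find_source("//<a href='/n/A'>@A</a>：x//<a href='/n/B'>@B</a>:y")
print(en, cn, chain)
```

Because the search stops at the first closing tag followed by either colon, a chain of reposts yields the nearest source, which is the case the "shorter match" rule above is designed to handle.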

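Building the data structure

Reusing my home-made crawling toolkit, I only had to change the configuration to obtain the data I needed.
The key data is extracted and stored in dictionaries, wrapped in a Categories class for convenient reuse:

choice = '轉發'  # "repost"; use '原文' ("original post") to switch
categories = Categories()

for name, text in zip(dataDict[choice + 'screen_name'], dataDict[choice + 'text']):
    if not categories.nameExist(name):
        categories.add(name)

    sourceName = Tool.findSource(text)
    if sourceName != '':
        categories.addTarget(sourceName, name)
    else:
        # No repost marker in the text: treat it as a direct repost of the original poster
        categories.addTarget(tweeter, name)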

Because each node's repost count is the total over all nodes downstream of it, a global pass is needed once the tallying is finished; this is the countAll(self, name) method.
Because of deleted posts, the timing of the crawl, and other reasons, some posts have missing data; the fillSource(self, tweeter) method handles that.
The final Categories class is as follows:

class Categories:

    def __init__(self):
        self.compose = {}

    def add(self, name):
        self.compose[name] = {}
        category = self.compose[name]
        category['value'] = 0  # number of times this user has been reposted
        category['target'] = {}
        category['source'] = {}

    def nameExist(self, name):
        return self.compose.get(name) is not None

    def addTarget(self, sourceName, targetName):
        if not self.nameExist(sourceName):
            self.add(sourceName)
        if not self.nameExist(targetName):
            self.add(targetName)

        # Prevent cyclic references: ignore a user reposting themselves
        if sourceName == targetName:
            return
        # Drop any existing record of this source so it is recorded afresh below
        if self.compose[targetName]['source'].get(sourceName) is not None:
            self.compose[targetName]['source'].pop(sourceName)

        if self.compose[targetName]['source'].get(sourceName) is None:
            self.compose[targetName]['source'][sourceName] = 1
        else:
            self.compose[targetName]['source'][sourceName] += 1

        # Count how many times targetName reposted from sourceName
        if self.compose[sourceName]['target'].get(targetName) is None:
            self.compose[sourceName]['target'][targetName] = 1
        else:
            self.compose[sourceName]['target'][targetName] += 1
        self.compose[sourceName]['value'] += 1

    def countAll(self, name):
        targets = self.compose[name]['target']
        if targets == {}:
            self.compose[name]['value'] = 0
        else:
            for targetName in targets:
                if self.compose[targetName]['target'] == {}:
                    self.compose[targetName]['value'] = 0
                else:
                    # Recurse first, then add the child's total to this node
                    self.countAll(targetName)
                    self.compose[name]['value'] += self.compose[targetName]['value']

    # Fill in missing data: assume the user reposted the original poster
    def fillSource(self, tweeter):
        # Iterate over a snapshot of the keys, because addTarget may insert new ones
        for item in list(self.compose):
            source = self.compose[item].get('source')
            if len(source) != 1 and item != tweeter:
                self.addTarget(tweeter, item)
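The idea behind the global pass, counting every node downstream of a given node, can be illustrated on a plain dictionary. This is a simplified sketch rather than a drop-in replacement for countAll: count_descendants and the children mapping are hypothetical, and the cycle guard is an extra precaution I added.

```python
def count_descendants(children, root):
    """Return {name: number of nodes below name} for every node reachable from root.

    children maps a user to the list of users who reposted from them.
    """
    totals = {}

    def visit(name, path):
        if name in totals:      # already computed
            return totals[name]
        if name in path:        # cycle guard: do not follow a loop back
            return 0
        path = path | {name}
        total = 0
        for child in children.get(name, ()):
            total += 1 + visit(child, path)  # the child itself plus everything below it
        totals[name] = total
        return total

    visit(root, frozenset())
    return totals


children = {"O": ["A", "B"], "A": ["C", "D"], "C": ["E"]}
totals = count_descendants(children, "O")
print(totals)
```

Caching results in totals means each node is computed once, so calling the function does not double count the way repeated in-place accumulation can. Very deep repost chains could still hit Python's recursion limit.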

Assembling the JSON file

After that, the JSON file is built from the collected data:

nodes = []
links = []
category = []

for i in categories.compose:
    value = categories.compose[i]['value']
    try:
        source = list(categories.compose[i]['source'])[0]
    except IndexError:
        # No recorded source: fall back to the original poster
        source = tweeter

    node = {
        "name": i,
        "symbolSize": 5,
        "draggable": "False",
        "value": value,
        "category": source,
    }

    if value > 0:
        if i == tweeter:
            node["category"] = i
        # A re-reposted user whose source is not the original poster gets its own category
        if source != tweeter:
            node["category"] = i
        symbolSize = value // 10
        if symbolSize > 5:
            node['symbolSize'] = symbolSize
        node['label'] = {
            "normal": {
                "show": "True"
            }
        }
        # Add one link per user who reposted from this node
        targets = categories.compose[i]['target']
        for target in targets:
            links.append({'source': i, 'target': target})

        # Register this user as a category
        category.append({'name': i})
    nodes.append(node)

content = 'wuhan'
mid = '4444444444444'

jsonData = [nodes, links, category, content, mid, tweeter]

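Write it to a file:

import json

# addrFile is defined elsewhere in the notebook; it returns the output file path
testFile = addrFile(tweeter, '.json')
with open(testFile, 'w', encoding='utf-8') as file_obj:
    json.dump(jsonData, file_obj)  # the with block closes the file automatically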
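Before pointing the plotting code at a new file, it can help to confirm that the file round-trips cleanly. The sketch below writes a six-part list to a temporary directory and reads it back; save_graph_json and load_graph_json are hypothetical helpers, and passing ensure_ascii=False is simply my preference for keeping Chinese screen names readable in the file.

```python
import json
import os
import tempfile


def save_graph_json(path, nodes, links, categories, content, mid, tweeter):
    """Write the six parts in the order the plotting code unpacks them."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump([nodes, links, categories, content, mid, tweeter], f,
                  ensure_ascii=False, indent=2)


def load_graph_json(path):
    """Read the file back and check that it has exactly six parts."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if len(data) != 6:
        raise ValueError(f"expected 6 parts, found {len(data)}")
    return data


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.json")
    save_graph_json(
        path,
        [{"name": "新浪體育", "symbolSize": 5, "draggable": "False",
          "value": 1, "category": "新浪體育"}],
        [{"source": "新浪體育", "target": "Beijingold4"}],
        [{"name": "新浪體育"}],
        "post text", "0000000000000000", "新浪體育",
    )
    loaded = load_graph_json(path)

print(loaded[5])
```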

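At this point, replacing the file path used at the beginning of this article with the new file produces the graph for the new data.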

Of course, the node sizes still need further tuning before the graph looks good.
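One possible direction for that tuning is to scale the marker size logarithmically, so that heavily reposted users stand out without dwarfing everyone else. This is only a sketch of one option; scaled_symbol_size, the factor 6, and the size bounds are all hypothetical values chosen for illustration.

```python
import math


def scaled_symbol_size(value, min_size=5, max_size=40):
    """Map a repost count to a marker size that grows with log2(value + 1)."""
    if value <= 0:
        return min_size
    size = min_size + 6 * math.log2(value + 1)
    return min(max_size, round(size))


sizes = [scaled_symbol_size(v) for v in (0, 1, 7, 1023)]
print(sizes)
```

Compared with dividing the count by a constant, a logarithmic scale keeps small differences visible among lightly reposted users while capping the largest markers.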

Code location

I have uploaded the complete, runnable Jupyter notebook to my GitHub test project for anyone who needs it.
