Collecting Websites Related to a Baidu Keyword and Generating a Word Cloud

Original

2020-06-27 22:29

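Data alone tells you little at a glance until it is visualized. So today, taking the Baidu keyword "AI" as an example, we collect the page content of the websites that appear in the search results and use matplotlib + wordcloud to generate a word cloud image.
First, let's look at what Baidu returns for "AI" at https://www.baidu.com/s?wd=AI. The results are mostly a mix of AI as in Artificial Intelligence, AI as in the Adobe Illustrator drawing tool, "ai" as the pinyin of 愛 ("love"), and other miscellaneous information. Everything except the artificial intelligence material needs to be filtered out.
So the overall approach is: collect data → filter → count word frequencies → generate the word cloud image.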
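
The filtering step could look roughly like the sketch below. It is only an illustration of the idea: the keyword lists `KEEP_TERMS` and `DROP_TERMS` and the function names are hypothetical, not part of the original script, and a simple substring match is assumed to be good enough to tell the artificial intelligence pages apart from the Adobe Illustrator and pinyin ones.

```python
# Hypothetical keyword lists; tune them for your own search term.
KEEP_TERMS = ("人工智能", "artificial intelligence", "機器學習", "machine learning")
DROP_TERMS = ("adobe illustrator", "illustrator")


def is_relevant(text):
    """Return True if the text looks like it is about artificial intelligence."""
    lowered = text.lower()
    # Any mention of an unwanted topic disqualifies the page.
    if any(term in lowered for term in DROP_TERMS):
        return False
    return any(term in lowered for term in KEEP_TERMS)


def filter_pages(pages):
    """Keep only the page texts that pass the relevance check."""
    return [text for text in pages if is_relevant(text)]


if __name__ == "__main__":
    sample_pages = [
        "人工智能（Artificial Intelligence），英文縮寫爲AI。",
        "Adobe Illustrator，簡稱AI，是一款矢量繪圖工具。",
        "愛的拼音是 ai。",
    ]
    for text in filter_pages(sample_pages):
        print(text)
```

Note that this sketch drops a page as soon as it mentions an unwanted term, even if it also talks about artificial intelligence; a scoring approach would be gentler but is left out here for brevity.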

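Preparation

Install and import the following libraries: urllib, BeautifulSoup, re (regular expressions), matplotlib (plotting), jieba (Chinese word segmentation), wordcloud (word clouds), PIL, and numpy (data processing). urllib and re ship with Python and need no installation; BeautifulSoup is installed as beautifulsoup4 and PIL as Pillow.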
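
Before going further, it can help to check which of these libraries are still missing. The following is a minimal sketch using only the standard library; the `REQUIRED` mapping and the function name `missing_packages` are made up for this example.

```python
import importlib.util

# Import name -> pip package name. urllib and re ship with Python.
REQUIRED = {
    "bs4": "beautifulsoup4",
    "matplotlib": "matplotlib",
    "jieba": "jieba",
    "wordcloud": "wordcloud",
    "PIL": "Pillow",
    "numpy": "numpy",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        # Print a ready-to-run install command for whatever is missing.
        print("pip install " + " ".join(missing))
    else:
        print("All third-party libraries are installed.")
```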

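First Draft

Let's start with an outline version that has only two simple steps: collect data → word cloud image.

Data collection:
We need to follow the results returned by the Baidu search and crawl the content of the pages that contain "AI".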

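from urllib import request
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import re
import random

def getLinks(url):
    # Send a browser-like User-Agent; Baidu may serve a stripped-down page to urllib's default one (assumption)
    req = request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = request.urlopen(req)
    bsObj = BeautifulSoup(html, "html.parser")
    # Keep only <a> tags whose href is an http(s), protocol-relative or root-relative link
    return bsObj.findAll("a", {"href": re.compile("^(https?:)?/")})
    # findAll returns a list-like ResultSet
    # Blocks with class="result-op c-container", class="HMCpkB", etc. are Baidu's own content, ads and so on, and should be excluded

random.seed()  # seed from the system; a datetime object is not accepted as a seed on Python 3.11+
url = "https://www.baidu.com/s?wd=AI"
linkList = getLinks(url)
while len(linkList) > 0:
    # An href may hold only the latter part of a link, so resolve it against the current page
    url = urljoin(url, random.choice(linkList).attrs['href'])
    print(url)
    linkList = getLinks(url)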
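
The crawl loop above only prints links, so the page text still has to be written to a file somewhere. One way to do that is sketched below. It uses the standard library's `html.parser` as a stand-in so that it runs without extra dependencies; with BeautifulSoup, `bsObj.get_text()` would serve a similar purpose. The names `TextExtractor`, `html_to_text`, and `append_text`, and the output file name in the comment, are assumptions made for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def append_text(path, text):
    """Append text to a UTF-8 file, creating the file if necessary."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(text + "\n")


if __name__ == "__main__":
    sample = ("<html><head><style>p{color:red}</style></head><body>"
              "<p>人工智能</p><script>var a=1;</script><p>AI</p></body></html>")
    print(html_to_text(sample))
    # append_text("text.txt", html_to_text(sample))  # hypothetical output file
```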

Once we have a txt file containing the collected text, we can draw a simple word cloud.
Plotting:

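import matplotlib.pyplot as plt
import jieba
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image
import numpy as np

# Read the collected text
with open(r'C:\Users\AER\Desktop\text.txt', "r", encoding="utf-8") as f:
    txt = f.read()

# Segment the Chinese text into words (precise mode) and join them with a separator
cut_text = jieba.cut(txt, cut_all=False)
result = '/'.join(cut_text)

# Load the mask image and convert the opened image (img) to an array
img = Image.open(r'C:\Users\AER\Desktop\PICPIC.png')
graph = np.array(img)

# font_path must point to a font with Chinese glyphs, otherwise the words render as empty boxes
wc = WordCloud(
    font_path=r"C:\Users\AER\testgit\Study-Notes\msyh.ttc",
    background_color='white', max_font_size=50, mask=graph)
wc.generate(result)

# Recolor the words using the colors of the mask image, then save the result
image_color = ImageColorGenerator(graph)
wc.recolor(color_func=image_color)
wc.to_file(r"C:\Users\AER\testgit\Study-Notes\5gpic.png")

# Show the word cloud in a matplotlib window
plt.figure("Word Cloud")
plt.imshow(wc)
plt.axis("off")
plt.show()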
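
The plotting code above passes the joined text straight to `wc.generate`, so the word-frequency step from the outline is not yet explicit. A rough sketch of that step follows, assuming the tokens come from `jieba.cut`; the stop-word list and the function name `count_words` are hypothetical.

```python
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"的", "是", "了", "和", "在"}


def count_words(tokens, stop_words=STOP_WORDS, min_length=2):
    """Count tokens, ignoring stop words, whitespace and very short tokens."""
    counts = Counter()
    for token in tokens:
        word = token.strip()
        if len(word) >= min_length and word not in stop_words:
            counts[word] += 1
    return counts


if __name__ == "__main__":
    # In the script above, the tokens would come from jieba.cut(txt).
    tokens = ["人工智能", "的", "發展", "人工智能", "是", "技術", " ", "AI"]
    freq = count_words(tokens)
    print(freq.most_common(3))
```

The resulting mapping can be handed to `wc.generate_from_frequencies(freq)` in place of `wc.generate(result)`, which gives direct control over which words reach the image.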

Data Processing
