Writing a Web Scraper with Spyder_CodingPark

Original

TEAM-AG

2020-05-19 00:06

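About this article

This article shows how to write a web scraper in Spyder, the IDE bundled with Anaconda.

Writing a web scraper with Spyder

Preparation

This time we use heartbeat -> cid: look up the video's cid, the number that appears in the danmaku URL used in the scripts below.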
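As a small sketch, the cid found this way slots into the danmaku URL that the full scripts request. The helper name `danmaku_url` is hypothetical, and the URL pattern is assumed from the single URL used in this article, so it may not hold for every video.

```python
def danmaku_url(cid: int) -> str:
    """Build the danmaku XML URL for a cid (pattern assumed from this article's URL)."""
    return f"https://comment.bilibili.com/{cid}.xml"


# The cid used throughout this article
print(danmaku_url(83089367))
```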

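Pitfalls to watch for

1
Press Command + Enter to run a line of the script.
⚠️ Every line has to be run once.

2
You can leave out print statements.
Instead, select the expression you want to see and press Command + Enter to run it and show its value.

3
Console output is truncated.

pd.set_option('display.max_rows', n) shows the rows that would otherwise be hidden.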

import numpy as np
import pandas as pd
pd.set_option('display.max_columns',10)
pd.set_option('display.max_rows',100)			# show at most 100 rows
df=pd.DataFrame(np.random.rand(100,10))
df.head(100)

pd.set_option('display.max_columns', n) shows the columns that would otherwise be hidden.

import numpy as np
import pandas as pd
pd.set_option('display.max_columns',10)			 # show at most 10 columns
df=pd.DataFrame(np.random.rand(2,10))
df.head()
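pd.set_option changes the limit for the rest of the session. If the wider display is only needed for one inspection, pandas also offers pd.option_context, which restores the previous values when the block ends. A minimal sketch, where the variable names are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 10))
default_rows = pd.get_option('display.max_rows')

# The raised limits apply only inside the with-block
with pd.option_context('display.max_rows', 100, 'display.max_columns', 10):
    rows_inside = pd.get_option('display.max_rows')
    print(df)

# Outside the block the earlier setting is back
rows_after = pd.get_option('display.max_rows')
```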

Full code (basic features)

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sat May  9 17:34:24 2020

@author: atom-g
"""

import requests 
from bs4 import BeautifulSoup
import pandas as pd      # data handling (tidy output)

url = 'https://comment.bilibili.com/83089367.xml'
request = requests.get(url)
request.status_code     # 200
request.encoding = 'utf-8'
request.text

soup = BeautifulSoup(request.text,'lxml')

results = soup.find_all('d')
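type(results)       # bs4.element.ResultSet

# A ResultSet has no .text of its own; take .text from each tag to get a list of strings
comments = [comment.text for comment in results]
# The cleaning is done one step at a time so that each operation is easy to follow
comments = [comment.upper() for comment in comments]  # upper-case
comments = [comment.replace(' ','') for comment in comments ]   # remove spaces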

# Define stop_words
stop_words = ['▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅',
              '▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄']

comments_fin = [comment for comment in comments if comment not in stop_words]

pd.set_option('display.max_rows',100000)    # raise the maximum number of visible rows
catalog = pd.DataFrame({'DanMu':comments_fin})
cipin = catalog['DanMu'].value_counts()

Output
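The cleaning and counting steps of the basic script can be tried without network access. The sketch below is not the author's code: it swaps in the standard library (xml.etree.ElementTree for BeautifulSoup, collections.Counter for value_counts()), and `sample_xml` is a made-up document that only imitates the one-`<d>`-element-per-comment shape.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample: one <d> element per comment, as in the danmaku XML
sample_xml = """<i>
<d p="1">hello world</d>
<d p="2">Hello World</d>
<d p="3">nice</d>
<d p="4">▄▅▆</d>
</i>"""

stop_words = ['▄▅▆']

root = ET.fromstring(sample_xml)
comments = [d.text or '' for d in root.iter('d')]            # text of every <d>
comments = [c.upper().replace(' ', '') for c in comments]    # upper-case, drop spaces
comments_fin = [c for c in comments if c not in stop_words]  # drop exact stop-word matches

counts = Counter(comments_fin)
print(counts.most_common())
```

Normalising case and spaces first is what lets "hello world" and "Hello World" be counted as the same comment.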

Full code (Plus)

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sat May  9 17:34:24 2020

@author: atom-g
"""

import requests 
from bs4 import BeautifulSoup
import pandas as pd      # data handling (tidy output)
import numpy as np

url = 'https://comment.bilibili.com/83089367.xml'
request = requests.get(url)
request.status_code     # 200
request.encoding = 'utf-8'     # 'unicode' is not a valid codec name
request.text

soup = BeautifulSoup(request.text,'lxml')

results = soup.find_all('d')
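type(results)       # bs4.element.ResultSet

# A ResultSet has no .text of its own; take .text from each tag to get a list of strings
comments = [comment.text for comment in results]
# The cleaning is done one step at a time so that each operation is easy to follow
comments = [comment.upper() for comment in comments]  # upper-case
comments = [comment.replace(' ','') for comment in comments ]   # remove spaces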

# Define stop_words
stop_words = ['▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅',
              '▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▄▄▅▆▇█▇▆▅▄▄▅▆▇█▇▆▅▄▄',
              '，','！',']','。']

comments_fin = [comment for comment in comments if comment not in stop_words]

pd.set_option('display.max_rows',100000)    # show up to 100000 rows
catalog = pd.DataFrame({'DanMu':comments_fin})
cipin = catalog['DanMu'].value_counts()

import jieba

DanMustr = ''.join(i for i in comments_fin if i not in stop_words)     # join into one string

words = list(jieba.cut(DanMustr))
words_fin_DanMustr = [word for word in words if word not in stop_words]      # lesson learned: a stop word is only removed when it matches a token exactly
words_fin_DanMustr_str = ','.join(words_fin_DanMustr)
# write the result to a local txt file
with open('/Users/atom-g/spyder/Cai.txt', mode='w', encoding='utf-8') as file_handle:
    file_handle.write(words_fin_DanMustr_str)

words_fin = [i for i in words if len(i)>1]

# np.set_printoptions(threshold=1e6)   # make numpy print everything
# cc = np.array(words_fin)
# cc.tofile('/Users/atom-g/spyder/Cai.txt')       # write txt to disk


import  wordcloud   # build the word cloud
wc = wordcloud.WordCloud(height = 1000, width = 1000, font_path = 'simsun.ttc')   # font_path must point to a font with Chinese glyphs
wc.generate(' '.join(words_fin))

from matplotlib import pyplot as plt

plt.imshow(wc)
wc.to_file('/Users/atom-g/spyder/Cai.png')      # save the image to disk

Output
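The Plus script's note about stop words needing an exact match can be seen on a few made-up tokens. This is a sketch, not part of the original script: the token list is hypothetical, and the str.translate step is one possible alternative for removing punctuation that is glued to a word.

```python
stop_words = ['，', '！', ']', '。']
tokens = ['GOOD', '，', 'GOOD！', '！', 'SONG。']   # hypothetical tokens

# Membership test: only tokens identical to a stop word are dropped
exact = [t for t in tokens if t not in stop_words]

# Deleting the characters themselves also cleans tokens that merely contain them
table = str.maketrans('', '', ''.join(stop_words))
stripped = [t.translate(table) for t in tokens]
stripped = [t for t in stripped if t]   # discard tokens that became empty

print(exact)
print(stripped)
```

With the membership test, 'GOOD！' survives because it is not equal to '！'; deleting the characters yields a second 'GOOD' instead.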

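✏️ Python: list to string

Command: separator.join(items)
The string in front of .join is the separator placed between the elements, such as ",", ";" or "\t". Every element must already be a string, so convert numbers with map(str, ...) first.
For example:
items = [1, 2, 3, 4, 5]
''.join(map(str, items)) gives: 12345
','.join(map(str, items)) gives: 1,2,3,4,5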

✏️ Python: string to list

print(list('12345'))
Output: ['1', '2', '3', '4', '5']
print(list(map(int, '12345')))
Output: [1, 2, 3, 4, 5]

str2 = "123 sjhid dhi"
list2 = str2.split()  # split(" ") gives the same result here, but differs when spaces repeat
print(list2)
Output: ['123', 'sjhid', 'dhi']

str3 = "www.google.com"
list3 = str3.split(".")
print(list3)
Output: ['www', 'google', 'com']
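The two conversions can be combined into a round trip. The sketch below uses illustrative variable names and also shows where split() and split(" ") stop agreeing.

```python
items = [1, 2, 3, 4, 5]

# list -> string: elements must be strings before joining
joined = ','.join(str(n) for n in items)

# string -> list: split on the same separator and convert back to int
restored = [int(part) for part in joined.split(',')]

# split() collapses runs of whitespace; split(" ") keeps the empty pieces
spaced = "a  b"
by_whitespace = spaced.split()
by_single_space = spaced.split(" ")

print(joined, restored, by_whitespace, by_single_space)
```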

✏️ Python: file modes for writing a local txt file
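A minimal sketch of the modes most relevant to writing a text file with Python's built-in open(). The temporary directory and the name `path` are placeholders chosen for the example and are not taken from the scripts above.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')   # throwaway location

with open(path, mode='w', encoding='utf-8') as fh:    # 'w': create, or truncate if present
    fh.write('first\n')

with open(path, mode='a', encoding='utf-8') as fh:    # 'a': append to the end
    fh.write('second\n')

with open(path, mode='r', encoding='utf-8') as fh:    # 'r': read only
    content = fh.read()

try:
    open(path, mode='x')                               # 'x': create only, never overwrite
    x_failed = False
except FileExistsError:
    x_failed = True

print(repr(content), x_failed)
```

Using mode='w' as in the Plus script overwrites the file on every run, so 'a' or 'x' is the safer choice when earlier results must be kept.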

Special thanks

📍 Python Spyder does not show all DataFrame columns and rows
https://blog.csdn.net/Arwen_H/article/details/83510364?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.nonecase&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.nonecase
📍 Python (how to write data to a local txt file)
https://blog.csdn.net/huo_1214/article/details/79153847
📍 Converting between Python list and str
https://blog.csdn.net/qq_35531549/article/details/88209377
