Two ways to scrape Hacker News

Hacker News is a popular news aggregation site abroad, of particular interest to computer scientists, entrepreneurs and data scientists

Scraping with requests and Beautiful Soup

Using requests and Beautiful Soup is the usual approach. The code below stores the scraped information in a simple list of Python dict objects

import requests
import re
from bs4 import BeautifulSoup

articles = []

url = 'https://news.ycombinator.com/news'

r = requests.get(url)
html_soup = BeautifulSoup(r.text, 'html.parser')

for item in html_soup.find_all('tr', class_='athing'):
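    # Assumed newer markup (span.titleline) first, falling back to the older storylink class
    item_a = item.select_one('span.titleline > a') or item.find('a', class_='storylink')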
    item_link = item_a.get('href') if item_a else None
    item_text = item_a.get_text(strip=True) if item_a else None
    next_row = item.find_next_sibling('tr')
    item_score = next_row.find('span', class_='score')
    item_score = item_score.get_text(strip=True) if item_score else '0 points'
    # We use regex here to find the correct element
    item_comments = next_row.find('a', text=re.compile(r'\d+(&nbsp;|\s)comment(s?)'))
    item_comments = item_comments.get_text(strip=True).replace('\xa0', ' ') \
                        if item_comments else '0 comments'
    articles.append({
        'link' : item_link,
        'title' : item_text,
        'score' : item_score,
        'comments' : item_comments})

for article in articles:
    print(article)

The script prints one dict per article, with the link, title, score and comments keys

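Scraping with the API

Hacker News also has an API that returns structured, JSON-formatted results. The code below modifies the example above so that it no longer depends on Beautiful Soup's interpretation of the HTML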

import requests

articles = []

url = 'https://hacker-news.firebaseio.com/v0'

top_stories = requests.get(url + '/topstories.json').json()

for story_id in top_stories:
    story_url = url + '/item/{}.json'.format(story_id)
    print('Fetching:', story_url)
    r = requests.get(story_url)
    story_dict = r.json()
    articles.append(story_dict)

for article in articles:
    print(article)
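
The items returned by the API use different field names from the dicts built in the HTML version. As a minimal sketch, assuming the item fields url, title, score and descendants (the total comment count), one item could be mapped onto the same four keys like this. The names summarise_story and example_item are hypothetical and only serve the illustration; no network access is needed.

```python
def summarise_story(story):
    """Map a Hacker News API item dict to the keys used in the HTML example."""
    return {
        'link': story.get('url'),                 # missing for text-only posts
        'title': story.get('title'),
        'score': story.get('score', 0),
        'comments': story.get('descendants', 0),  # total number of comments
    }


# Hand-written item for illustration, not real API output
example_item = {
    'id': 1, 'type': 'story', 'title': 'Example story',
    'url': 'https://example.com/', 'score': 42, 'descendants': 7,
}
print(summarise_story(example_item))
```

Because the list of top stories can contain several hundred IDs, slicing it (for example top_stories[:30]) before the loop keeps a test run short.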

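Scraping book information

Use requests and Beautiful Soup to scrape the book information on http://books.toscrape.com/, which lists the books page by page

For each book, we need to obtain

Title
Cover image
Price and stock availability
Rating
Product description
Other product information

The information is then stored in a SQLite database using the dataset library. The program is written in an update (upsert) style, so that running it several times does not insert duplicate records into the database

The core code is as follows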

if __name__ == "__main__":
    # Scrape the pages in the catalogue
    url = base_url
    inp = input('Do you wish to re-scrape the catalogue (y/n)? ')
    while inp == 'y':
        # Fetch the current catalogue page and parse it
        r = requests.get(url)
        html_soup = BeautifulSoup(r.text, 'html.parser')
        # Extract the book titles and links on this page and store them in the db table
        scrape_books(html_soup, url)
        # Is there a next page?
        next_a = html_soup.select('li.next > a')
        if not next_a or not next_a[0].get('href'):
            break
        url = urljoin(url, next_a[0].get('href'))

    # Now scrape book by book, oldest first
    books = db['books'].find(order_by=['last_seen'])  # rows hold each book's title and link
    for book in books:
        book_id = book['book_id']
        # Fetch the book's own page and parse it into html_soup here
        # (the request itself is omitted in these notes)
        # Extract the book's details and store them in db['books_info']
        scrape_book(html_soup, book_id)
        # Update the last seen timestamp
        db['books'].upsert({'book_id': book_id,
                            'last_seen': datetime.now()
                            }, ['book_id'])
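
The upsert calls are what keep repeated runs from inserting duplicate rows. The sketch below illustrates that behaviour with the standard sqlite3 module instead of dataset: update the row if it exists, otherwise insert it. It is an illustration of the idea under assumed table and column names, not a description of how dataset implements upsert; upsert_book and the example book id are hypothetical.

```python
import sqlite3
from datetime import datetime


def upsert_book(conn, book_id, last_seen):
    """Update the row for book_id if it exists, otherwise insert it."""
    cur = conn.execute(
        'UPDATE books SET last_seen = ? WHERE book_id = ?',
        (last_seen, book_id))
    if cur.rowcount == 0:  # nothing was updated, so the book is new
        conn.execute(
            'INSERT INTO books (book_id, last_seen) VALUES (?, ?)',
            (book_id, last_seen))
    conn.commit()


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE books (book_id TEXT PRIMARY KEY, last_seen TEXT)')

# Upserting the same book twice leaves one row with the newer timestamp
upsert_book(conn, 'example-book_1', datetime(2020, 1, 1).isoformat())
upsert_book(conn, 'example-book_1', datetime(2020, 1, 2).isoformat())
rows = conn.execute('SELECT book_id, last_seen FROM books').fetchall()
print(rows)
```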

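Scraping the number of stars of GitHub projects

The final URL we want to scrape is: https://github.com/Macuyiko?page=1&tab=repositories

For an organisation account it is: https://github.com/google?page=1&tab=repositories

The information to extract is mainly the project name, the programming language and the number of stars

The code is as follows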

#!/usr/bin/env python
# encoding: utf-8

import requests
from bs4 import BeautifulSoup
import re

session = requests.Session()

url = 'https://github.com/{}'
username = 'Macuyiko'

if __name__ == "__main__":
    r = session.get(url.format(username), params={'page': 1, 'tab': 'repositories'}, verify=False)
    html_soup = BeautifulSoup(r.text, 'html.parser')
    is_normal_user = False
    repos_element = html_soup.find(class_='repo-list') # only organisation pages have repo-list; regular user pages use user-repositories-list
    if not repos_element:
        is_normal_user = True
        repos_element = html_soup.find(id='user-repositories-list')

    repos = repos_element.find_all('li')
    for repo in repos:
        name = repo.find('h3').find('a').get_text(strip=True) # text of the <a> tag inside the element's <h3> tag
        language = repo.find(attrs={'itemprop': 'programmingLanguage'}) # the tag whose itemprop attribute equals programmingLanguage
        language = language.get_text(strip=True) if language else 'unknown'
        stars = repo.find('a', attrs={'href': re.compile(r'/stargazers')})
        stars = int(stars.get_text(strip=True).replace(',', '')) if stars else 0
        print(name, language, stars)

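Scraping and analysing interactions on a web forum

Scraping the forum interactions

Two kinds of pages are scraped from the target site. The first is the list of threads on the main page

The second is the replies inside each thread; only the user who posted each reply is extracted, not what was written

So the core of the scraping is to fetch all the relevant tags first and then match them with Beautiful Soup or XPath; several pages of data can be fetched

What gets scraped is mainly the comments and replies

together with the thread they belong to

Analysing the forum interactions

The core code is as follows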

heatmap = plt.pcolor(df, cmap='Blues')
y_vals = np.arange(0.5, len(df.index), 1)
x_vals = np.arange(0.5, len(df.columns), 1)
plt.yticks(y_vals, df.index)
plt.xticks(x_vals, df.columns, rotation='vertical')
for y in range(len(df.index)):
    for x in range(len(df.columns)):
        if df.iloc[y, x] == 0:
            continue
        plt.text(x + 0.5, y + 0.5, '%.0f' % df.iloc[y, x],
                 horizontalalignment='center',
                 verticalalignment='center')
plt.savefig("1.jpg")
plt.show()

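In the full code, posts at this point is a list in which each item is one post. Posts come in two forms: without a quote, ('bluefish', []), and with a quote, ('almostthere', ['kayman'])

This part splits up the contents of posts and stores them in this form: 'zeke': {'bluefish': 1, 'almostthere': 4}
This means that zeke quoted two people, bluefish and almostthere: bluefish once and almostthere 4 times. The processed result is stored in users, which ends up looking roughly like this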

{'zeke': {'bluefish': 1, 'almostthere': 1}, 'trinity': {'almostthere': 1}, 'paula53': {'almostthere': 1}, 'toejam': {'almostthere': 1, 'Ohm': 1}, 'stickman': {'almostthere': 1}, 'tamtrails': {'almostthere': 1}, 'almostthere': {'tamtrails': 1, 'kayman': 1}, 'kayman': {'almostthere': 1}, 'lanceman': {'almostthere': 1}, 'pollock': {'almostthere': 1}, 'mitsmit': {'almostthere': 1}, 'Christian': {'almostthere': 1}, 'softskull': {'almostthere': 1}, 'argus': {'almostthere': 1}, 'lyssa7': {'almostthere': 1}, 'kevin': {'almostthere': 1}, 'dogrescuer': {'almostthere': 1}, 'RedDoug': {'kayman': 1}, 'Richard': {'almostthere': 1}, 'rebeccad': {'almostthere': 1, 'rangewalker': 2, 'Ohm': 1}, 'assen': {'almostthere': 1}, 'james2020': {'almostthere': 1}, 'gabby': {'rangewalker': 2, 'reuben': 1}, 'texasbb': {'rangewalker': 1}, 'High Sierra Fan': {'rangewalker': 1}, 'Ohm': {'rangewalker': 2, 'Lamebeaver': 1, 'reuben': 1, 'cheaptentguy': 1}, 'Lamebeaver': {'rangewalker': 1}, 'reuben': {'rangewalker': 2, 'Ohm': 1}, 'cheaptentguy': {'rangewalker': 1}}
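
The step from posts to users can be sketched as follows, assuming posts is a list of (author, [quoted users]) tuples as described above. The function name count_quotes and the sample posts are hypothetical; the original code may organise this step differently.

```python
from collections import defaultdict


def count_quotes(posts):
    """Count how often each author quotes each other user."""
    users = defaultdict(lambda: defaultdict(int))
    for author, quoted_users in posts:
        for quoted in quoted_users:
            users[author][quoted] += 1
    # Convert to plain dicts so the result prints as nested dict literals
    return {author: dict(counts) for author, counts in users.items()}


# Made-up posts that reuse some of the usernames from the text
example_posts = [
    ('bluefish', []),
    ('almostthere', ['kayman']),
    ('zeke', ['bluefish']),
    ('zeke', ['almostthere']),
    ('gabby', ['rangewalker']),
    ('gabby', ['rangewalker']),
]
print(count_quotes(example_posts))
```

Authors who never quote anyone, such as bluefish in the sample, do not get an entry of their own in the result.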

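users is converted to a two-dimensional pd.DataFrame, so that the index holds the keys of the dict (the quoting users) and the columns hold the quoted users. The code then iterates over the values of this table and draws them with plt, which produces the heat map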

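Collecting and clustering a fashion dataset

In this example, Zalando (a German online fashion retailer) is used to collect a set of images of fashion products, which are then clustered with t-SNE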

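Getting the images

The site address is: https://www.zalando.co.uk/womens-clothing-dresses/

Downloading the images from 15 pages is enough for testing; the code is in the attachment

Clustering the images

For the principles and derivation of t-SNE, see the article "t-SNE原理與推導". The technique is particularly well suited to visualising high-dimensional datasets (such as images)

Loading the images with imread may fail with a scipy.misc error

from scipy.misc import imread fails with: ImportError: cannot import name imread

There are three ways to solve this

Downgrade scipy to version 1.2.1 (pip install scipy==1.2.1)
Use imageio.imread instead of imread to read the images
Use pillow to read the images

The t-SNE code was run on only 20 images as a test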

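Sentiment analysis of Amazon reviews

Getting the review data

Take an Amazon review page, for example

https://www.amazon.com/product-reviews/1449355730

Inspecting the page in Chrome shows that the reviews for the current page are fetched with a POST request; the submitted form fields can be read from that request in the developer tools

Assemble those fields and post them to the request path to get the review data for the current page. The response data has to be parsed by hand

Sentiment analysis of the reviews

Scoring the sentiment of each review requires the vaderSentiment library, installed with

pip install -U vaderSentiment

The nltk library is needed as well, installed with

pip install -U nltk

For a single sentence, VADER is very simple to use. The test code below imports the copy of VADER that ships with nltk (nltk.sentiment.vader)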

#!/usr/bin/env python
# encoding: utf-8

from nltk.sentiment.vader import SentimentIntensityAnalyzer
# import nltk
# nltk.download('vader_lexicon') # only needs to be run once

if __name__ == "__main__":
    analyzer = SentimentIntensityAnalyzer()
    sentence = "I'm really happy with my pyrchase"
    vs = analyzer.polarity_scores(sentence)
    print(vs) # {'neg': 0.0, 'neu': 0.556, 'pos': 0.444, 'compound': 0.6115}

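For longer texts, a simple approach is to compute the sentiment score of each sentence and average the scores over all sentences in the text, as in this example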

#!/usr/bin/env python
# encoding: utf-8

from nltk.sentiment.vader import SentimentIntensityAnalyzer
# import nltk
# nltk.download('vader_lexicon') # only needs to be run once; needed to score a sentence
# nltk.download('punkt') # only needs to be run once; needed to split a text into sentences
from nltk import tokenize

if __name__ == "__main__":
    analyzer = SentimentIntensityAnalyzer()
    paragraph = """
        I'm really happy with my pyrchase.
        I've been using the producy for two weeks now.
        It does exactly as described in the product description.
        The only problem is that it takes a long time to charge.
        However, since I recharge during nights,this is something I can live with.
    """

    sentence_list = tokenize.sent_tokenize(paragraph)
    cumulative_sentiment = 0.0
    for sentence in sentence_list:
        vs = analyzer.polarity_scores(sentence)
        cumulative_sentiment += vs["compound"]
        print(sentence, " : ", vs["compound"])

    average_score = cumulative_sentiment / len(sentence_list)
    print("Average score：", average_score)

The output is as follows

I'm really happy with my pyrchase.  :  0.6115
I've been using the producy for two weeks now.  :  0.0
It does exactly as described in the product description.  :  0.0
The only problem is that it takes a long time to charge.  :  -0.4019
However, since I recharge during nights,this is something I can live with.  :  0.0
Average score： 0.04192000000000001

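If you hit a resource error and cannot download punkt, see "nltk.download('punkt') False" in the reference links

Applying the same idea to the Amazon reviews collected above gives a sentiment score for each review. Grouping the scores by star rating then gives the average sentiment for each rating. A violin plot in matplotlib makes these distributions easy to see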
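
The grouping step can be sketched with the standard library alone, assuming each review has already been reduced to a (star rating, compound score) pair. The function name and the sample values below are made up for illustration and are not results from real reviews.

```python
from collections import defaultdict
from statistics import mean


def average_sentiment_by_rating(reviews):
    """Return {star rating: mean compound score} for (rating, score) pairs."""
    scores_by_rating = defaultdict(list)
    for rating, score in reviews:
        scores_by_rating[rating].append(score)
    return {rating: mean(scores)
            for rating, scores in sorted(scores_by_rating.items())}


# Made-up (rating, compound score) pairs, for illustration only
example_reviews = [(5, 0.5), (5, 1.0), (1, -0.5), (1, -0.25), (3, 0.0)]
print(average_sentiment_by_rating(example_reviews))
```

The per-rating lists in scores_by_rating are also the kind of input a violin plot needs, one list of scores per star rating.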

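Scraping and analysing a Wikipedia link graph

Getting the data

Two database tables are needed

pages: records the list of visited URLs and their page titles
links: contains only pairs of URLs, each pair representing a link between two pages

The joblib library is used to scrape with multiple threads

The most interesting part is probably how links are discovered. With joblib, the get_title_and_links function runs in worker threads. It returns the current URL, the title of the current HTML page and the links found on the page, and these are collected in scraped_results. After each round, the outer loop iterates over scraped_results, passing the URL and title to store_page and the discovered links to store_links

The URLs for the next round come from get_random_unvisited_pages, which picks unvisited pages from all the links discovered so far. The main part of the code is as follows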

if __name__ == '__main__':
    urls_to_visit = [base_url]
    while urls_to_visit:
        scraped_results = Parallel(n_jobs=5, backend="threading")(
            delayed(get_title_and_links)(base_url, url) for url in urls_to_visit
        )
        for url, page_title, links in scraped_results:
            store_page(url, page_title)
            store_links(url, links)
        urls_to_visit = get_random_unvisited_pages()
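
The notes do not show get_random_unvisited_pages itself. As a rough sketch of what such a function has to do, the version below uses the standard sqlite3 module and assumes hypothetical column names (pages.url, links.url_from, links.url_to). It also takes the connection as a parameter, unlike the call above; the original implementation may differ on all of these points.

```python
import sqlite3


def get_random_unvisited_pages(conn, amount=10):
    """Return up to `amount` linked URLs that have no row in pages yet."""
    rows = conn.execute(
        '''SELECT url_to FROM (
               SELECT DISTINCT url_to FROM links
               WHERE url_to NOT IN (SELECT url FROM pages)
           )
           ORDER BY RANDOM() LIMIT ?''',
        (amount,)).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT)')
conn.execute('CREATE TABLE links (url_from TEXT, url_to TEXT)')
conn.execute("INSERT INTO pages VALUES ('/wiki/A', 'A')")
conn.executemany(
    'INSERT INTO links VALUES (?, ?)',
    [('/wiki/A', '/wiki/B'), ('/wiki/A', '/wiki/C'), ('/wiki/A', '/wiki/A')])

# /wiki/A is already in pages, so only B and C are candidates
print(sorted(get_random_unvisited_pages(conn)))
```

Returning an empty list once every discovered link has been visited is what ends the while loop in the main code.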

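Drawing the link graph

With the scraped data in hand, NetworkX can be used to visualise the graph

How to use NetworkX is not the focus here; for more detail, see the article "networkx整理"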

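Scraping and visualising a graph of board members

Scraping the required data

Get the company symbols from the table on https://de.reuters.com/finance/markets/index/.SPX

Once a symbol is obtained, insert it into the URL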

officers = 'https://www.reuters.com/companies/{symbol}/people'

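Fetch the corresponding company's information from that URL

Then store it in pandas and persist it with pickle

Visualising the data

Use NetworkX to do some simple processing of the collected information and export the graph in a format that Gephi (a popular graph visualisation tool) can read. Gephi can be downloaded from https://gephi.org/users/download/

Unfortunately, I was not able to reproduce the final graph by applying the filters described in the book

This is the result I got; I still do not know how to configure the filter options

It shows the .gexf file opened in Gephi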

Cracking CAPTCHA images with deep learning

Building the training set

First install the modules that are needed

pip install -U captcha
pip install -U numpy
pip install -U opencv-python

Generate a training set of CAPTCHAs that are 4 letters long

constants.py contains some variables

CAPTCHA_FOLDER = 'generated_images'
LETTERS_FOLDER = 'letters'

CHARACTERS = list('QWERTPASDFGHKLZXBNM')
NR_CAPTCHAS = 1000
NR_CHARACTERS = 4

MODEL_FILE = 'model.hdf5'
LABELS_FILE = 'labels.dat'

MODEL_SHAPE = (100, 100)

generate.py is mainly responsible for generating the CAPTCHAs

from random import choice
from captcha.image import ImageCaptcha
import os.path
from os import makedirs
from constants import *

# exist_ok avoids an error when the folder is left over from an earlier run
makedirs(CAPTCHA_FOLDER, exist_ok=True)

image = ImageCaptcha()

for i in range(NR_CAPTCHAS):
    captcha = ''.join([choice(CHARACTERS) for c in range(NR_CHARACTERS)])
    filename = os.path.join(CAPTCHA_FOLDER, '{}_{}.png'.format(captcha, i))
    image.write(captcha, filename)
    print('Generated:', captcha)

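After running it, the generated CAPTCHA images appear in the generated_images folder

Splitting the images into separate parts and building a model

The next step is to split each CAPTCHA image into separate parts, one character per part, using OpenCV to apply thresholding, opening and contour detection to the generated images. The code below mainly does the following

Binarise the image and apply morphological operations to filter out noise
Use OpenCV's findContours method to extract the connected groups of white pixels
Call drawContours to draw the parts that were found
Save the extracted contours into separate folders, according to folder name

Below is a test script that extracts a single letter. It performs these steps

Denoise the original image to get image 1
Create a new black image (the mask) with the same size as the original image
Take one contour and draw it in white on the mask
Combine image 1 and the mask with a bitwise AND to get the letter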

import cv2
import numpy as np

# Change this to one of your generated images:
image_file = 'example.png'

image = cv2.imread(image_file)
cv2.imshow('Original image', image)

# Convert to grayscale, followed by thresholding to black and white
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
cv2.imshow('Black and white', thresh)

# Apply opening: "erosion" followed by "dilation"
denoised = thresh.copy()
kernel = np.ones((4, 3), np.uint8)
denoised = cv2.erode(denoised, kernel, iterations=1)
kernel = np.ones((6, 3), np.uint8)
denoised = cv2.dilate(denoised, kernel, iterations=1)
cv2.imshow('Denoised', denoised)

# Now find contours and overlay them over our original image
# [-2] picks the contour list in both OpenCV 3 (three return values) and OpenCV 4 (two)
cnts = cv2.findContours(denoised.copy(), cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)[-2]
contoured = image.copy()
cv2.drawContours(contoured, cnts, contourIdx=-1, color=(255, 0, 0), thickness=-1)
cv2.imshow('Contours', contoured)

# Create a fresh 'mask' image
mask = np.zeros((image.shape[0], image.shape[1]), dtype="uint8")
# We'll use the first contour as an example
contour = cnts[0]
# Draw this contour over the mask
cv2.drawContours(mask, [contour], -1, (255, 255, 255), -1)

cv2.imshow('Denoised image', denoised)
cv2.imshow('Mask after drawing contour', mask)

result = cv2.bitwise_and(denoised, mask)
cv2.imshow('Result after and operation', result)

retain = result > 0
result = result[np.ix_(retain.any(1), retain.any(0))]
cv2.imshow('Final result', result)

cv2.waitKey(0)

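Running the code shows each intermediate image in its own window

Taking the contours from left to right yields all the characters of the CAPTCHA, but overlapping characters still have to be handled. The book estimates the character width as the distance from the leftmost white pixel to the rightmost white pixel, divided by the expected number of characters (4). If a contour is wider than expected, it is cut into m equal parts, where m is the contour width divided by the expected width

This approach is wrapped up in functions.py; to get the characters for the whole training set, just run cut.py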
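
The width-based splitting described above can be sketched on plain bounding boxes, without OpenCV. This is not the code from functions.py: the function name, the sample boxes and the choice to round m to the nearest integer (with a minimum of 1) are assumptions made for the illustration.

```python
def split_wide_boxes(boxes, nr_characters=4):
    """Split bounding boxes that are wider than the expected character width.

    boxes is a list of (x, y, w, h) tuples, such as cv2.boundingRect returns.
    """
    left = min(x for x, y, w, h in boxes)
    right = max(x + w for x, y, w, h in boxes)
    expected_width = (right - left) / nr_characters

    result = []
    for x, y, w, h in sorted(boxes):  # process the boxes from left to right
        # Assumption: round the width ratio to the nearest integer, at least 1
        m = max(1, round(w / expected_width))
        part_width = w // m
        for i in range(m):
            result.append((x + i * part_width, y, part_width, h))
    return result


# Hypothetical boxes: the second one covers two overlapping characters
example_boxes = [(10, 5, 20, 40), (35, 5, 42, 40), (80, 5, 18, 40)]
print(split_wide_boxes(example_boxes))
```

With these sample boxes the expected width is 22 pixels, so the 42-pixel box is cut into two halves and the other two boxes are kept whole.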

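Using a deep learning framework

Installing keras

Install tensorflow

pip install -U tensorflow==1.9.0

Install the matching version of keras

pip install -U keras==2.2.0

The keras and TensorFlow versions have to match. My tf version is currently 1.9.0, so keras 2.2.0 is required. For matching versions, see:
https://docs.floydhub.com/guides/environments/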

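Training the model on the character set

All that has to be done is

Loop over the character images created earlier, resize them and store the resulting pixel matrices
Normalise the data so that every value lies between 0 and 1
Binarise the labels: each label is converted into a vector of output nodes in which each index corresponds to one possible character and is set to 1 or 0, so that a letter such as Q becomes [1,0,0,0,…]
Save this conversion, because the labels have to be converted back when the model is applied later
Build the neural network architecture and start training the model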
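
The normalisation and label conversion steps in the list above can be sketched in plain Python. This is an illustration only, not the code from train.py: the helper names are hypothetical, and the index order simply follows CHARACTERS from constants.py, whereas a library label encoder may order the classes differently.

```python
CHARACTERS = list('QWERTPASDFGHKLZXBNM')  # same character set as constants.py


def encode_label(char):
    """Return a one-hot list with a 1 at the index of char."""
    return [1 if c == char else 0 for c in CHARACTERS]


def decode_label(vector):
    """Return the character whose index holds the largest value."""
    best_index = max(range(len(vector)), key=vector.__getitem__)
    return CHARACTERS[best_index]


def normalise(pixels):
    """Scale 8-bit pixel values (0..255) into the range 0..1."""
    return [value / 255.0 for value in pixels]


print(encode_label('Q')[:4])
print(decode_label(encode_label('K')))
print(normalise([0, 51, 255]))
```

decode_label also works on a vector of predicted probabilities, which is why the conversion has to be kept for the moment the model is applied.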

Running train.py prints the training output

The two final output files are labels.dat (the label information) and model.hdf5 (the model)

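Testing the recognition results

Use an existing CAPTCHA, or generate a new one with the code from the beginning, and put it in the test_images folder as a test

Running apply.py shows the prediction. For the image in 219, misreading the hard-to-recognise HL is forgivable, but even the D was recognised as an A

Download link for all the code used in this article

https://download.csdn.net/download/zengraoli/12342255

Reading notes on 《數據科學實戰之網絡爬取》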
