Notes: Python Web Scraping (urllib.request and BeautifulSoup)

While learning urllib.request and BeautifulSoup, I scraped some images from Dribbble and Behance; these are my notes.

I. urllib.request

1. Constructing the URL

The main problem in constructing the request URL is pagination: Dribbble loads the next page automatically when you scroll to the bottom, so the URL in the address bar never changes. Inspecting the network traffic in the browser's developer tools, however, reveals a page field in the request URL. We can therefore construct the URLs like this:

for i in range(25):  # at most 25 pages
    url = 'https://dribbble.com/shots?page=' + str(i + 1) + '&per_page=24'

2. Constructing the headers

Different sites require different header fields; build them by referring to the request headers shown in the browser's inspector. Dribbble, for example, requires a Referer header, i.e. the page you navigated from to reach the current one; filling in a related page on the same site is usually enough.

headers = {"Accept": "text/html,application/xhtml+xml,application/xml;",
           "Referer": "https://dribbble.com/",
           "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3493.3 Safari/537.36"}

3. Fetching page content with urllib.request

Instantiate urllib.request.Request(url, headers=headers) with the URL and headers, then call urllib.request.urlopen() on it to fetch the page; read() returns the response body.

import urllib.request


def open_url(url):
    headers = {"Accept": "text/html,application/xhtml+xml,application/xml;",
               "Referer": "https://dribbble.com/",
               "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3493.3 Safari/537.36"}
    # Instantiate the Request class with the url and headers, and assign it to req
    req = urllib.request.Request(url, headers=headers)
    # Open the url; res holds the HTTP response
    res = urllib.request.urlopen(req)
    # Read the raw bytes of the page and decode them as UTF-8
    html = res.read().decode('utf-8')
    return html

Note that some pages return data as "text/html; charset=utf-8", which can be decoded directly with decode('utf-8'), while others, such as Behance's pagination endpoint, return "application/json; charset=utf-8".

In that case json.loads() is needed to parse the response; the result is a dict, and the HTML is read out of it by key:

 html = json.loads(res.read())
 return html['html']
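
Putting the two cases together, here is a minimal sketch (my own variant, not part of the original scripts) that picks the decoding strategy from the response's Content-Type header; it assumes the JSON payload wraps its markup under an 'html' key, as Behance's endpoint does above:

import json
import urllib.request


def open_url_auto(url, headers):
    # Fetch the URL and decode according to the Content-Type response header
    req = urllib.request.Request(url, headers=headers)
    res = urllib.request.urlopen(req)
    content_type = res.getheader('Content-Type', '')
    body = res.read().decode('utf-8')
    if 'application/json' in content_type:
        # JSON endpoints (e.g. Behance pagination) wrap the markup in a dict
        return json.loads(body)['html']
    # Plain HTML pages are returned as-is
    return body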

II. BeautifulSoup

BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object.

1. Creating the object

soup = BeautifulSoup(open_url(url), 'html.parser')

'html.parser' here is the parser argument. BeautifulSoup supports the HTML parser from Python's standard library and also a number of third-party parsers; if none of those is installed, the default Python parser is used. The lxml parser is more powerful and faster, so installing it is recommended.
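
For reference, the parsers BeautifulSoup commonly supports (markup below stands for any HTML/XML string):

BeautifulSoup(markup, 'html.parser')   # Python's built-in parser, no extra install
BeautifulSoup(markup, 'lxml')          # lxml HTML parser: fast and lenient, needs pip install lxml
BeautifulSoup(markup, 'xml')           # lxml XML parser: the only XML parser supported
BeautifulSoup(markup, 'html5lib')      # html5lib: parses like a browser, very lenient but slow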

2. Tag selectors

Tag selection has weak filtering power but is fast: with "soup.tag_name" we get that tag directly. If the document contains more than one such tag, only the first one is returned.

# Get the first p tag
soup.p

# Two ways to get an attribute of a p tag
soup.p.attrs['name']
soup.p['name']

# Get the text content of the first p tag
soup.p.string

# Get all children of the p tag as a list
soup.p.contents

# Get all children of the p tag as an iterator
for i, child in enumerate(soup.p.children):
    print(i, child)

# Get the parent node
soup.a.parent

# Get all ancestor nodes
list(enumerate(soup.a.parents))

# Get all following siblings (a generator)
soup.a.next_siblings

# Get all preceding siblings (a generator)
soup.a.previous_siblings

# Get the next sibling
soup.a.next_sibling

# Get the previous sibling
soup.a.previous_sibling
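
A quick runnable sketch of these selectors on a made-up snippet (the HTML here is purely illustrative):

from bs4 import BeautifulSoup

html = '<div><p name="intro">first</p><p>second</p><a href="#">link</a></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.p)               # <p name="intro">first</p>, only the first match
print(soup.p['name'])       # intro
print(soup.p.string)        # first
print(soup.p.next_sibling)  # <p>second</p>
print(soup.a.parent.name)   # div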

3. Standard selectors

find_all(name, attrs, recursive, text, **kwargs) searches the document by tag name, attributes, or text content, and returns a list of all matches, for example:

# From the Behance script below: get all img tags whose class is js-project-module--picture
# and collect each tag's src attribute into a list
image.src = [item['src'] for item in soup.find_all('img', {"class": "js-project-module--picture"})]

# .string gets each div's text content; strip() removes surrounding whitespace
desc = soup.find_all('div', {"class": "js-basic-info-description"})
if desc:
    image.desc = [item.string.strip() for item in desc]

find(name, attrs, recursive, text, **kwargs) returns only the first matching element.

Some other similar methods (a short sketch follows the list):
find_parents() returns all ancestor nodes; find_parent() returns the direct parent
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling
find_all_next() returns all matching nodes after the current node; find_next() returns the first such node
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first such node
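
A similar sketch for the find family (again with purely illustrative HTML):

from bs4 import BeautifulSoup

html = '<ul><li class="item">a</li><li class="item">b</li><li>c</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# find_all returns every match as a list; find returns only the first
print([li.string for li in soup.find_all('li', {"class": "item"})])  # ['a', 'b']
print(soup.find('li').string)                                        # a
print(soup.find('li').find_next_sibling().string)                    # b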

III. Complete code for scraping images from Dribbble

1. Collecting image page links in bulk

# -*- coding: utf-8 -*-

import random
import urllib.request
from bs4 import BeautifulSoup
import os
import time


def open_url(url):
    headers = {"Accept": "text/html,application/xhtml+xml,application/xml;",
               "Referer": "https://dribbble.com/",
               "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3493.3 Safari/537.36"}
    req = urllib.request.Request(url, headers=headers)
    res = urllib.request.urlopen(req)
    html = res.read().decode('utf-8')
    return html


# Open/create 'dribbble_list.txt'; O_CREAT: create if missing, O_WRONLY: write-only, O_APPEND: append
fd = os.open('dribbble_list.txt', os.O_CREAT | os.O_WRONLY | os.O_APPEND)
for i in range(25):
    url = 'https://dribbble.com/shots?page=' + str(i + 1) + '&per_page=24'
    soup = BeautifulSoup(open_url(url), 'html.parser')
    srcs = soup.find_all('a', {"class": "dribbble-link"})
    src_list = [src['href'] for src in srcs]
    for src in src_list:
        os.write(fd, bytes(src, 'UTF-8'))
        os.write(fd, bytes('\n', 'UTF-8'))
    time.sleep(random.random()*5)
os.close(fd)

2. Fetching images and metadata

import os
import random
import urllib.request
import re
import time
from bs4 import BeautifulSoup


class Image:
    # Class-level defaults; get_img_info reassigns each field per instance
    title = ''
    src = ''
    desc = []
    tags = []
    colors = []
    view = []
    like = []
    save = []


def open_url(url):
    headers = {"Accept": "text/html,application/xhtml+xml,application/xml;",
               "Referer": "https://dribbble.com/shots",
               "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3493.3 Safari/537.36"}
    try:
        req = urllib.request.Request(url, headers=headers)
        res = urllib.request.urlopen(req)
        html = res.read().decode('utf-8')
    except Exception:
        # Return None on any network or decoding error so the caller can skip this URL
        return None
    return html


def get_number(x):
    # Strip all non-digit characters and convert the rest to an int
    return int(re.sub(r'\D', '', x))


def get_img_info(html):
    # Create an Image instance for this shot
    image = Image()
    soup = BeautifulSoup(html, 'html.parser')
    # Title
    image.title = soup.find('div', {"class": "slat-header"}).find('h1').string.strip()
    # Image URL
    image.src = soup.find('div', {"class": "detail-shot"}).find('img')['src']
    # Description
    desc = soup.find('div', {"class": "shot-desc"})
    if desc:
        image.desc = [item.string.strip() for item in desc.find_all(text=True)]
    # Tags
    image.tags = [item.string for item in soup.find_all('a', {"rel": "tag"})]
    # Colors
    image.colors = [item.string for item in soup.find_all('a', {"style": re.compile('background-color.*')})]
    # Views
    view = soup.find('div', {"class": "shot-views"})
    if view:
        image.view = [str(get_number(item)) for item in view.stripped_strings]
    # Likes
    like = soup.find('div', {"class": "shot-likes"})
    if like:
        image.like = [str(get_number(item)) for item in like.stripped_strings]
    # Saves
    save = soup.find('div', {"class": "shot-saves"})
    if save:
        image.save = [str(get_number(item)) for item in save.stripped_strings]
    return image


def save_text(root_path, img, num):
    text = {
        'src': img.src,
        'desc': ';'.join(img.desc),
        'tags': ';'.join(img.tags),
        'colors': ';'.join(img.colors),
        'score': ';'.join([img.title, ''.join(img.view), ''.join(img.like), ''.join(img.save)])
    }
    text_list = ['src', 'desc', 'tags', 'colors', 'score']
    for item in text_list:
        save_path = root_path + item + '.txt'
        fd = os.open(save_path, os.O_CREAT | os.O_WRONLY | os.O_APPEND)
        write_str = str(num).zfill(3) + ' ' + text[item] + '\n'
        os.write(fd, bytes(write_str, 'UTF-8'))
        os.close(fd)


def read_dribbble_data(data_folder):
    import pandas as pd
    import os
    columns = ['url']
    df = pd.read_csv(os.path.join(data_folder, 'dribbble_list.txt'), names=columns)
    return df


def to_url(img_url):
    return 'https://dribbble.com{img_url}'.format(img_url=img_url)


if __name__ == '__main__':
    data_folder = './'
    df = read_dribbble_data(data_folder)
    urls = map(to_url, df['url'].values)
    for i, url in enumerate(urls):
        print(url)
        # Fetch and parse the page
        html = open_url(url)
        if html:
            image = get_img_info(html)
            # Download and save the image
            # save_path_img = 'img/' + image.title + '.jpg'
            save_path_img = 'img/' + str(i+556).zfill(3) + '.jpg'
            urllib.request.urlretrieve(image.src, save_path_img)
            # Save title, URL, description, tags, colors, views, likes, and saves
            save_path_text_root = 'dribbble_text/'
            save_text(save_path_text_root, img=image, num=i+556)
            time.sleep(random.random()*5)

IV. Complete code for scraping images from Behance

1. Collecting image page links in bulk

# -*- coding: utf-8 -*-

import random
import urllib.request
from bs4 import BeautifulSoup
import os
import time
import json


def open_url(url):
    headers = {"Accept": "*/*",
               "Referer": "https://www.behance.net/search?field=48&content=projects&sort=appreciations&time=week",
               "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3493.3 Safari/537.36",
               "Host": "www.behance.net",
               "Connection": "keep-alive",
               "X-BCP": "523bc8eb-c6a4-4eeb-a73d-0bf9ec1c06d9",
               "X-NewRelic-ID": "VgUFVldbGwACXFJSBAUF",
               "X-Requested-With": "XMLHttpRequest"}
    req = urllib.request.Request(url, headers=headers)
    res = urllib.request.urlopen(req)
    html = json.loads(res.read())
    return html['html']


fd = os.open('behance_list.txt', os.O_CREAT | os.O_WRONLY | os.O_APPEND)
for i in range(200):
    # 48 results per page; the ordinal offset starts from the 100th page
    url = 'https://www.behance.net/search?ordinal=' + str((i+100) * 48) + '&per_page=48&field=48&content=projects&sort=appreciations&time=week&location_id=&timestamp=0&mature=0'
    print(url)
    soup = BeautifulSoup(open_url(url), 'html.parser')
    srcs = soup.find_all('a', {"class": "js-project-cover-image-link"})
    src_list = [src['href'] for src in srcs]
    for src in src_list:
        os.write(fd, bytes(src, 'UTF-8'))
        os.write(fd, bytes('\n', 'UTF-8'))
    time.sleep(random.random()*5)
os.close(fd)

2. Fetching images and metadata

# -*- coding: utf-8 -*-

import os
import random
import urllib.request
import re
import time
from bs4 import BeautifulSoup


class Image:
    # Class-level defaults; get_img_info reassigns each field per instance
    title = ''
    src = []
    desc = []
    tags = []
    data = []


def open_url(url):
    headers = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
               "Referer": "https://www.behance.net/gallery/70675447/YELLOWSTONE",
               "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3493.3 Safari/537.36",
               "Host": "www.behance.net",
               "Connection": "keep-alive",
               "Upgrade-Insecure-Requests": 1,
               "Cookie": "巴啦啦小魔仙全身變"
               }
    try:
        req = urllib.request.Request(url, headers=headers)
        res = urllib.request.urlopen(req)
        html = res.read().decode('utf-8')
    except Exception:
        # Return None on any network or decoding error so the caller can skip this URL
        return None
    return html


def get_number(x):
    # Strip all non-digit characters and convert the rest to an int
    return int(re.sub(r'\D', '', x))


def get_img_info(html):
    # Create an Image instance for this project
    image = Image()
    soup = BeautifulSoup(html, 'html.parser')
    # Image URLs
    image.src = [item['src'] for item in soup.find_all('img', {"class": "js-project-module--picture"})]
    # Title
    image.title = soup.find('div', {"class": "js-project-title"}).string.strip()
    # Description
    desc = soup.find_all('div', {"class": "js-basic-info-description"})
    if desc:
        image.desc = [item.string.strip() for item in desc]
    # Tags
    tags = soup.find_all('a', {"class": "object-tag"})
    if tags:
        image.tags = [item.string.strip() for item in tags]
    # Views and appreciations (the first two project stats)
    data = soup.find_all('div', {"class": "project-stat"})
    if data:
        image.data = [item.string.strip() for item in data][:2]
    return image


def save_text(root_path, img, num):
    text = {
        'title': img.title.replace(' ', '_'),
        'score': ' '.join(img.data),
        'desc': ';' + (';'.join(img.desc)).replace('\n', ';'),
        'tags': ';' + ';'.join(img.tags),
        'src': ';' + ';'.join(img.src)
    }
    text_list = ['title', 'score', 'desc', 'tags', 'src']
    for item in text_list:
        save_path = root_path + item + '.txt'
        fd = os.open(save_path, os.O_CREAT | os.O_WRONLY | os.O_APPEND)
        write_str = str(num).zfill(5) + ' ' + text[item] + '\n'
        os.write(fd, bytes(write_str, 'UTF-8'))
        os.close(fd)


def read_behance_data(data_folder):
    import pandas as pd
    import os
    columns = ['url']
    df = pd.read_csv(os.path.join(data_folder, 'behance_list.txt'), names=columns)
    return df


if __name__ == '__main__':
    data_folder = './'
    urls = read_behance_data(data_folder)['url'].values
    for i, url in enumerate(urls):
        print(url)
        # Fetch and parse the page
        html = open_url(url)
        if html:
            image = get_img_info(html)
            # Download and save every image in the project; src[-4:] keeps the extension (e.g. '.jpg')
            for j, src in enumerate(image.src):
                save_path_img = './behance_img/' + str(i).zfill(5) + '_' + str(j).zfill(3) + src[-4:]
                urllib.request.urlretrieve(src, save_path_img)
                time.sleep(random.random()*3)
            # Save title, stats, description, tags, and image URLs
            save_path_text_root = './behance_text/'
            save_text(save_path_text_root, img=image, num=i)
            time.sleep(random.random()*5)
            time.sleep(random.random()*5)

 
