Python by Example: Scraping Handsome-Guy Photos

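Introduction:
Without further ado, this article shows a quick and blunt way to scrape a batch of handsome-guy photo posts with Python.
Technologies used:
Libraries used: os, re, sqlite3, time, requests, lxml
SQLite database file: data.db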
…

Step 1: Define the target

1. Text information

Category, title, tags, URL

2. Image information

Cover image, images inside the post page

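Step 2: Break down the target

1. Analyze the features:

Information exposed on the entry page

Information exposed on category pages

Information exposed on tag pages

Information exposed on the post page

From the screenshots above we can identify the following features: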

URL features
- Start page URL:
  http://www.shuaia.net/
- Category page URL:
  http://www.shuaia.net/ + 'category name' + /
  - Paged URL
    http://www.shuaia.net/ + 'category name' + /index_ + 'page number' + .html
- Tag page URL:
  http://www.shuaia.net/e/tags/?tagname= + 'tag name'
  - Paged URL
    http://www.shuaia.net/e/tags/index.php?page= + 'page number' + &tagname= + 'tag name'
- Post URL:
  http://www.shuaia.net/…
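
The URL patterns listed above can be captured in two small helper functions. This is only a sketch: the function names are made up for illustration, and percent-encoding non-ASCII names as UTF-8 is an assumption about what the site expects.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://www.shuaia.net/"


def category_page_url(category, page=1):
    """Return the URL of a category listing page (page numbers start at 1)."""
    root = BASE_URL + quote(category) + "/"
    # Page 1 is the category root; later pages use index_<n>.html.
    return root if page == 1 else root + "index_{}.html".format(page)


def tag_page_url(tag, page=None):
    """Return the URL of a tag listing page.

    With page=None the plain tag URL is returned; otherwise the paged
    index.php form is used and the page number is passed through unchanged.
    """
    if page is None:
        return BASE_URL + "e/tags/?" + urlencode({"tagname": tag})
    return BASE_URL + "e/tags/index.php?" + urlencode({"page": page, "tagname": tag})


if __name__ == "__main__":
    print(category_page_url("abc", 2))
    print(tag_page_url("abc", 1))
```

Keeping URL construction in one place makes it easier to adjust if the site's paging scheme turns out to differ from what was observed.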
DOM features

The following were inspected with the browser developer tools (Inspect Element).

Category DOM

Tag DOM

Post DOM

2. Build the model

Based on the features above, we can start building the model:

The SQLite database has to be created first:
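
The article does not show the schema itself. The sketch below is inferred only from the SQL statements used later in the script (tables data and image, and the columns they read and write). The column types, the UNIQUE constraint on url, the id column of the image table, and the init_db name are all assumptions.

```python
import sqlite3

# Assumed schema: column names come from the script's SQL, types are guesses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS data (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,
    class TEXT,
    tag   TEXT,
    url   TEXT UNIQUE,
    img   TEXT,
    title TEXT
);
CREATE TABLE IF NOT EXISTS image (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,
    PID   INTEGER,
    img   TEXT,
    title TEXT
);
"""


def init_db(path="data.db"):
    """Create the two tables if they are missing and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    conn.commit()
    return conn


if __name__ == "__main__":
    init_db().close()
```

Running this once before the scraper creates data.db in the current directory.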

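2.1. Category page information

On the start page, category information is stored in the href attribute and the text of the <a> nodes inside every <li> under <div class="nav_nav">.
P.S. Discard entries whose URL does not contain www.shuaia.net.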
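
The filtering rule in the P.S. can be sketched as a small pure function. It assumes the link text and href have already been extracted as pairs; the function name and the sample links are hypothetical. Note that it compares the host name exactly, which is slightly stricter than a plain substring check.

```python
from urllib.parse import urlparse

SITE_HOST = "www.shuaia.net"


def keep_site_links(links):
    """Keep only (text, href) pairs whose href points at SITE_HOST."""
    kept = []
    for text, href in links:
        # Relative links and links to other hosts have a different netloc.
        if urlparse(href).netloc == SITE_HOST:
            kept.append({"class": text.strip(), "url": href})
    return kept


if __name__ == "__main__":
    sample = [
        ("Home ", "http://www.shuaia.net/"),       # hypothetical nav entry
        ("Forum", "http://bbs.example.com/"),      # off-site, dropped
    ]
    print(keep_site_links(sample))
```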

2.2. Tag page information

On the start page, tag information is stored in the href attribute and the text of the <a> nodes inside every <li> under <div id="hot-tags">.

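2.3. Post information

a. Posts on category pages
- Loop over every category page number to collect the post URLs
  Location: on a category page, inside every <div> under <div id="content">, the href attribute of the <a class="item-img"> node and the src attribute of the <img class='attachment-weiran'> node
b. Posts on tag pages
- Loop over every tag page number to collect the post URLs
  Location: on a tag page, inside every <div> under <div id="content">, the href attribute of the <a class="item-img"> node and the src attribute of the <img class='attachment-weiran'> node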

c. Clean and deduplicate, then attach attributes to each post

post = {
'class':...    # category
'tag':...      # tag
'url':...      # URL
'img':...      # image URL
}
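
The deduplication step can also be done in memory before anything is written to the database. The sketch below merges records that share a URL and keeps the first non-empty value for each field; the function name is hypothetical, and unlike the script in Step 3 (which overwrites the stored value with the latest one) it keeps the first value seen.

```python
def merge_by_url(records):
    """Collapse records that share a URL into one record per URL.

    For class, tag and img the first non-empty value seen is kept.
    """
    merged = {}
    for rec in records:
        entry = merged.setdefault(
            rec["url"],
            {"class": None, "tag": None, "url": rec["url"], "img": None},
        )
        for key in ("class", "tag", "img"):
            if entry[key] is None and rec.get(key):
                entry[key] = rec[key]
    # Dicts preserve insertion order, so the first-seen URL order is kept.
    return list(merged.values())


if __name__ == "__main__":
    found = [
        {"class": "A", "tag": None, "url": "u1", "img": "i1"},  # from a category page
        {"class": None, "tag": "T", "url": "u1", "img": "i1"},  # same post from a tag page
    ]
    print(merge_by_url(found))
```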

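d. Get the remaining post attributes

Loop over each post's url attribute and read the element nodes:

category = ...
tag = ...
URL = ...
title = <h1> inside <div class='wr-sigle-intro'>
author = <a> inside <p class='single-meta'>
publish time = <span> inside <p class='single-meta'>
view count = <span> inside <p class='single-meta'>
likes = <a class='heart-this'>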
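
To make the field mapping above concrete, here is a sketch that reads those nodes from a hand-written sample. The sample markup is hypothetical and only reuses the class names listed above; the order of the two <span> nodes (publish time first, then view count) is an assumption. It uses the standard library's xml.etree.ElementTree so that it runs without extra packages, which only works on well-formed markup. On real pages the script's lxml etree.HTML parser with equivalent XPath expressions would be the natural choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a post page.
SAMPLE = """
<div>
  <div class="wr-sigle-intro"><h1>Sample title</h1></div>
  <p class="single-meta">
    <a href="#">someone</a>
    <span>2018-01-01</span>
    <span>123</span>
  </p>
  <a class="heart-this">45</a>
</div>
"""


def parse_post_fields(markup):
    """Extract title, author, publish time, view count and likes."""
    root = ET.fromstring(markup.strip())
    meta = root.find(".//p[@class='single-meta']")
    spans = [span.text for span in meta.findall("span")]
    return {
        "title": root.findtext(".//div[@class='wr-sigle-intro']/h1"),
        "author": meta.findtext("a"),
        "published": spans[0],  # assumed to be the first <span>
        "views": spans[1],      # assumed to be the second <span>
        "likes": root.findtext(".//a[@class='heart-this']"),
    }


if __name__ == "__main__":
    print(parse_post_fields(SAMPLE))
```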

Step 3: Write the code

Based on the model above, we can start writing the code:

#!/usr/bin/env python
# -*- coding:UTF-8 -*-
import os
import re
import sqlite3
import time

import requests
from lxml import etree
from requests.exceptions import RequestException

count = 1  # number of listing pages to crawl per category/tag
count_sleep = 0.5  # delay between requests, in seconds
url_imdex = "http://www.shuaia.net/"
headers = {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:59.0) Gecko/20100101 Firefox/59.0"}


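def getPages(urls):
    """Fetch a page and return its HTML text, or None on failure."""
    print('Fetching page: url=', urls)
    try:
        # A timeout keeps a stalled connection from hanging the crawl
        res = requests.get(urls, headers=headers, timeout=10)
        res.encoding = 'utf-8'
        if res.status_code == 200:
            return res.text
        return None
    except RequestException:
        print('Failed to fetch page: url=', urls)
        return None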


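def parseClass(context):
    """Parse the categories from the start page."""
    html = etree.HTML(context)
    # Every <li> under <div class="nav_nav"> holds one category link
    items = html.xpath('//div[@class="nav_nav"]//li')
    for item in items:
        names = item.xpath('a/text()')
        hrefs = item.xpath('a/@href')
        # Skip entries with no usable link or whose URL is outside www.shuaia.net
        if not names or not names[0].strip() or not hrefs or 'www.shuaia.net' not in hrefs[0]:
            continue
        yield {
            'class': names[0].split()[0],  # category name
            'url': hrefs[0]  # category URL
        }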


def parseTag(context):
    """Parse the tags from the start page."""
    html = etree.HTML(context)
    items = html.xpath('//div[@id="hot-tags"]/div/div/ul/li')
    for item in items:
        yield {
            'tag': item.xpath('a/text()')[0],  # tag name
            'url': 'http://www.shuaia.net' + item.xpath('a/@href')[0]  # tag URL
        }


def parseClassPages(context):
    """Parse the listing pages of one category."""
    index = 1
    while index <= count:
        try:
            if index == 1:
                html = getPages(context['url'])
            else:
                html = getPages(context['url'] + 'index_{}.html'.format(index))
            pages = etree.HTML(html)
            items = pages.xpath('//a[@class="item-img"]')
            for item in items:
                yield {
                    'class': context['class'],  # category name
                    'tag': None,  # tag name
                    'img': item.xpath('img/@src')[0],  # cover image URL
                    'url': item.xpath('@href')[0]  # post URL
                }
        except Exception:
            # Skip pages that fail to download or parse
            pass
        index += 1


def parseTagPages(context):
    """Parse the listing pages of one tag."""
    index = 1
    while index <= count:
        try:
            # The page parameter is sent as index - 1, i.e. the first page is page=0
            html = getPages(context['url'] + '&page={}'.format(index - 1))
            pages = etree.HTML(html)
            items = pages.xpath('//a[@class="item-img"]')
            for item in items:
                yield {
                    'class': None,  # category name
                    'tag': context['tag'],  # tag name
                    'img': item.xpath('img/@src')[0],  # cover image URL
                    'url': item.xpath('@href')[0]  # post URL
                }
        except Exception:
            # Skip pages that fail to download or parse
            pass
        index += 1


def parsePages(context):
    """Parse a post page."""
    try:
        html = etree.HTML(context)
    except Exception:
        return
    if html is None:
        return

    items = html.xpath('//div[@class="wr-single-right"]')

    def imgs(context):
        """Yield the image URLs from every sub-page of a post."""
        img_count = 1
        while img_count:
            if img_count == 1:
                img_html = getPages(context)
            else:
                # Later sub-pages are named <name>_2.html, <name>_3.html, ...
                img_html = getPages(context.replace('.html', '_' + str(img_count) + '.html'))
            if img_html is None:
                break
            srcs = etree.HTML(img_html).xpath('//div[@class="wr-single-content-list"]/p/a/img/@src')
            if not srcs:
                # Stop when a sub-page has no images, so the loop cannot run forever
                break
            for i in srcs:
                yield 'http://www.shuaia.net' + i
            img_count += 1

    for item in items:
        yield {
            'title': item.xpath('//div[@class="wr-sigle-intro"]/h1/text()')[0],  # title
            'img': imgs(item.xpath('//div[@id="bdshare"]/@data')[0].split("','")[0][8:])  # image URLs (lazy generator)
        }


# Connect to the database
conn = sqlite3.connect('data.db')
# Fetch the start page
page_html = getPages(url_imdex)
if page_html is None:
    raise SystemExit('Could not fetch the start page: ' + url_imdex)
# Parse the categories and store their posts, deduplicated by URL
for i in parseClass(page_html):
    print('Reading category {}'.format(i['class']))
    for ii in parseClassPages(i):
        cursor = conn.cursor()
        cursor.execute('select id from data where url=?', (ii['url'],))
        values = cursor.fetchall()
        cursor.close()
        if values:
            cursor = conn.cursor()
            cursor.execute('update data set class = ? where id = ?', (ii['class'], values[0][0]))
            conn.commit()
            cursor.close()
        else:
            cursor = conn.cursor()
            cursor.execute('insert into data (class, url, img) values (?, ?, ?)', (ii['class'], ii['url'], ii['img'],))
            conn.commit()
            cursor.close()
    time.sleep(count_sleep)
# Parse the tags and store their posts, deduplicated by URL
for i in parseTag(page_html):
    print('Reading tag {}'.format(i['tag']))
    for ii in parseTagPages(i):
        cursor = conn.cursor()
        cursor.execute('select id from data where url=?', (ii['url'],))
        values = cursor.fetchall()
        cursor.close()
        if values:
            cursor = conn.cursor()
            cursor.execute('update data set tag = ? where id = ?', (ii['tag'], values[0][0]))
            conn.commit()
            cursor.close()
        else:
            cursor = conn.cursor()
            cursor.execute('insert into data (tag, url, img) values (?, ?, ?)', (ii['tag'], ii['url'], ii['img'],))
            conn.commit()
            cursor.close()
    time.sleep(count_sleep)

# Empty the image table
cursor_del = conn.cursor()
cursor_del.execute("DELETE from image;")
conn.commit()
cursor_del.close()

# Fetch and save the images of every post
cursor = conn.cursor()
cursor_re = conn.cursor()
cursor_img = conn.cursor()
cursor.execute('select id, url from data')
# Read all rows first so that committing inside the loop cannot disturb the SELECT
for url_s in cursor.fetchall():
    page_html = getPages(url_s[1])
    for page in parsePages(page_html):
        cursor_re.execute('update data set title = ? where id = ?', (page['title'], url_s[0]))
        im = 0
        for img_url in page['img']:
            file_name = page['title'] + "_" + str(im)
            cursor_img.execute('insert into image (PID, img, title) values (?, ?, ?)', (url_s[0], img_url, file_name,))

            # Save the image to disk, one directory per post title
            try:
                path = os.path.join("./", page['title'])
                if not os.path.exists(path):
                    os.makedirs(path)
                print('Saving image: {}'.format(file_name + ".jpg"))
                save_pic = os.path.join(path, file_name + ".jpg")
                saveImg = requests.get(img_url, headers=headers, timeout=10).content
                with open(save_pic, 'wb') as f:
                    f.write(saveImg)
                time.sleep(count_sleep)
            except Exception:
                print('Failed to save image: {}'.format(file_name + ".jpg"))

            im += 1
        conn.commit()
cursor.close()
cursor_re.close()
cursor_img.close()

# Close the database
conn.close()
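
The two storage loops in the script repeat the same select, update, insert sequence and differ only in the column they write (class or tag). One possible way to factor that out is sketched below; the helper name and the column whitelist are hypothetical, and the table layout is assumed to match the SQL used in the script.

```python
import sqlite3

# Column names cannot be bound as SQL parameters, so they are whitelisted.
ALLOWED_COLUMNS = {"class", "tag"}


def save_listing_item(conn, column, item):
    """Insert a listing item, or update `column` if its URL is already stored."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError("unexpected column: {}".format(column))
    row = conn.execute(
        "select id from data where url = ?", (item["url"],)
    ).fetchone()
    if row:
        conn.execute(
            "update data set {} = ? where id = ?".format(column),
            (item[column], row[0]),
        )
    else:
        conn.execute(
            "insert into data ({}, url, img) values (?, ?, ?)".format(column),
            (item[column], item["url"], item["img"]),
        )
    conn.commit()
```

With such a helper, each loop body would reduce to a single call such as save_listing_item(conn, 'class', ii) or save_listing_item(conn, 'tag', ii).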

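Step 4: Closing notes

I feel this could still be optimized. If you have better suggestions or any questions, feel free to leave a comment.

A pretty crappy article; if something is unclear, it is surely because I did not write it well…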
