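1. Crawling multiple pages of Douban picture-book data in one run

Python手记-7 (Python Notes 7) crawled a single page of data; this section tries crawling multiple pages. The example scrapes several pages of picture-book (绘本) listings from Douban Books. First, look at the page URLs, copied from the browser address bar: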

Page 1: https://book.douban.com/tag/%E7%BB%98%E6%9C%AC?start=0&type=T

Page 2: https://book.douban.com/tag/%E7%BB%98%E6%9C%AC?start=20&type=T

Page 3: https://book.douban.com/tag/%E7%BB%98%E6%9C%AC?start=40&type=T

...

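The Chinese characters “绘本” in the original URL turn into “%E7%BB%98%E6%9C%AC” when copied. This is URL encoding (the UTF-8 bytes of each character written as %XX); either form works for crawling.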
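The encoding can be checked with the standard library. This is a minimal sketch using urllib.parse; the variable names are only for illustration:

```python
from urllib.parse import quote, unquote

# quote() percent-encodes the UTF-8 bytes of each non-ASCII character
encoded = quote('绘本')
# unquote() reverses the encoding
decoded = unquote('%E7%BB%98%E6%9C%AC')

print(encoded)  # %E7%BB%98%E6%9C%AC
print(decoded)  # 绘本
```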

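Comparing the URLs shows that only the start parameter changes, and its values form an arithmetic sequence with first term 0 and common difference 20:

Page i: https://book.douban.com/tag/%E7%BB%98%E6%9C%AC?start=(i-1)*20&type=T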
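The rule can be written as a small helper. This is only a sketch: page_url and BASE_URL are names made up for illustration, and the page size of 20 is the value inferred from the URLs above.

```python
BASE_URL = 'https://book.douban.com/tag/%E7%BB%98%E6%9C%AC'


def page_url(page_number, page_size=20):
    """Return the listing URL for a 1-based page number."""
    if page_number < 1:
        raise ValueError('page_number starts at 1')
    # start = (i - 1) * 20: first term 0, common difference 20
    start = (page_number - 1) * page_size
    return f'{BASE_URL}?start={start}&type=T'


urls = [page_url(i) for i in range(1, 4)]
for url in urls:
    print(url)
```

The scripts in this post loop over a 0-based page_index instead, which is why they multiply page_index by 20 directly.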

Here is a rough first attempt:

# -*- coding: utf-8 -*- 
# @Time : 2020/4/27 14:14
# @Author : ChengYu
# @File : requests_getbooks.py

import requests
import re

# Send a browser-like User-Agent so the site treats the request as a browser visit
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/81.0.4044.122 Safari/537.36'}
# page_index starts at 0; crawl 20 pages
for page_index in range(20):
    url = 'https://book.douban.com/tag/%E7%BB%98%E6%9C%AC?start=' + str(page_index * 20) + '&type=T'
    # url = 'https://book.douban.com/tag/绘本?start=' + str(page_index * 20) + '&type=T'

    # Fetch the page source
    res = requests.get(url, headers=headers).text
    # Extract each book's link, author/publication info, title and rating count
    # (the summary pattern is left commented out)
    p_href = '<h2 class="">.*?<a href="(.*?)"'
    href = re.findall(p_href, res, re.S)
    p_info = '<div class="pub">(.*?)</div>'
    info = re.findall(p_info, res, re.S)
    # p_title = '<h2 class="">.*?>(.*?)</a>'
    p_title = '<h2 class="">.*?title="(.*?)"'
    title = re.findall(p_title, res, re.S)
    p_appraise = '<span class="pl">(.*?)</span>'
    appraise = re.findall(p_appraise, res, re.S)
    # p_detail = '<div class="info">.*?<p>(.*?)</p>'
    # detail = re.findall(p_detail, res, re.S)

    # Note: the lists are paired by position, which assumes every pattern
    # matched exactly once per book
    for i in range(len(title)):
        title[i] = title[i].strip()
        info[i] = info[i].strip()
        appraise[i] = appraise[i].strip()
        href[i] = href[i].strip()
        # detail[i] = detail[i].strip()
        # Remove newlines
        # detail[i] = re.sub('\n', '', detail[i])
        print(str(i + 1) + '.' + title[i] + ' ' + href[i] + '\n' + info[i] + '\n' + appraise[i])

    print('Page ' + str(page_index + 1) + ' crawled successfully')

Run it, and the output lists each book's title, link, author/publication details and rating count.

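2. Saving the multi-page Douban picture-book data to a CSV file

The csv module implements classes for reading and writing tabular data in CSV format. Its main features (official docs: https://docs.python.org/3.8/library/csv.html):

(1) csv.reader(csvfile, dialect='excel', **fmtparams) / csv.writer(csvfile, dialect='excel', **fmtparams): for csv.reader, csvfile can be any object that supports the iterator protocol and returns a string each time its __next__() method is called; file objects and lists both qualify. For csv.writer, csvfile can be any object with a write() method. If csvfile is a file object, it should be opened with newline=''. The input must be CSV text: Excel workbooks (.xls/.xlsx) are a different format that the csv module does not read. The dialect and formatting parameters are: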

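Dialect.delimiter: the one-character string that separates fields; defaults to ','.
Dialect.doublequote: controls how a quotechar that appears inside a field is written. When True (the default), the character is doubled; when False, escapechar is used as a prefix to it.
Dialect.escapechar: a one-character string the writer uses to escape the delimiter when quoting is QUOTE_NONE, and the quotechar when doublequote is False. Defaults to None, which disables escaping.
Dialect.lineterminator: the string that ends each line written by the writer; defaults to '\r\n'.
Dialect.quotechar: the one-character string used to quote fields that contain special characters (such as the delimiter or the quotechar) or newlines; defaults to '"'.
Dialect.quoting: controls when the writer generates quotes and the reader recognises them; takes any of the QUOTE_* constants and defaults to QUOTE_MINIMAL.
Dialect.skipinitialspace: when True, whitespace immediately after the delimiter is ignored; defaults to False.
Dialect.strict: when True, an Error exception is raised on bad CSV input; defaults to False.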

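(2) csv.DictReader(f, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds) / csv.DictWriter(f, fieldnames, restval='', extrasaction='raise', dialect='excel', *args, **kwds):

fieldnames: the sequence of field names. For DictReader it is optional; if omitted, the values in the first row of the file are used. For DictWriter it is required.
restkey: when a row has more fields than there are field names, the extra values are collected in a list stored under the key given by restkey (default None).
restval: when a non-blank row has fewer fields than there are field names, the missing values are filled with restval (default None for DictReader, '' for DictWriter).
extrasaction: what DictWriter does when a dictionary contains a key that is not in fieldnames. The default 'raise' raises a ValueError; since that can be annoying, it can also be set to 'ignore' to drop the extra values silently.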

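(3) csvreader.fieldnames: the field names of a DictReader object. If they were not passed when the object was created, they are read from the first row of the file.

(4) csvwriter.writerow(row) / csvwriter.writerows(rows): write a single row / write multiple rows.

(5) DictWriter.writeheader(): write a row containing the field names, i.e. the CSV header.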
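To see the formatting parameters from item (1) in action, here is a small sketch that writes the same row with the defaults and with a few parameters overridden. It writes to an in-memory io.StringIO instead of a file; rows_to_text and the sample row are made up for illustration.

```python
import csv
import io


def rows_to_text(rows, **fmtparams):
    """Write rows with csv.writer into memory and return the CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, **fmtparams).writerows(rows)
    return buffer.getvalue()


# Made-up sample row: the second field contains both a quote and a comma
rows = [['Sample Book', 'a "quoted", comma field', 9.1]]

# Defaults: ',' delimiter, QUOTE_MINIMAL, doublequote=True, '\r\n' line ending
default_text = rows_to_text(rows)

# Overrides: ';' delimiter, quote every field, '\n' line ending
custom_text = rows_to_text(rows, delimiter=';', quoting=csv.QUOTE_ALL,
                           lineterminator='\n')

# Reading back needs the same delimiter; every value comes back as a string
parsed = list(csv.reader(io.StringIO(custom_text), delimiter=';'))

print(repr(default_text))
print(repr(custom_text))
print(parsed)
```

With the defaults only the field that needs quoting is quoted, and the quote inside it is doubled; with QUOTE_ALL every field is quoted, including the number.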

Writing example (being lazy here: taken from the official docs):

import csv

with open('names.csv', 'w', newline='') as csvfile:
    fieldnames = ['first_name', 'last_name']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    writer.writerow({'first_name': 'Baked', 'last_name': 'Beans'})
    writer.writerow({'first_name': 'Lovely', 'last_name': 'Spam'})
    writer.writerow({'first_name': 'Wonderful', 'last_name': 'Spam'})

Reading example (lazy again: also from the official docs):

import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
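The restkey, restval and extrasaction parameters from item (2) are easier to understand with data that does not match the field names. The sketch below uses in-memory text and made-up book rows; 'overflow' and 'n/a' are arbitrary values chosen for the example.

```python
import csv
import io

# Made-up input: row 1 has one field too many, row 2 has one too few
source = io.StringIO('title,rating\nBook A,9.0,extra\nBook B\n')
reader = csv.DictReader(source, restkey='overflow', restval='n/a')
records = list(reader)
print(reader.fieldnames)  # field names taken from the first row
print(records)

# DictWriter: 'link' is not in fieldnames, so extrasaction='ignore' drops it
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=['title', 'rating'],
                        extrasaction='ignore', lineterminator='\n')
writer.writeheader()
writer.writerow({'title': 'Book A', 'rating': '9.0', 'link': 'dropped'})
written = out.getvalue()
print(repr(written))
```

With the default extrasaction='raise', the same writerow call would raise a ValueError because of the unexpected 'link' key.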

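Now to put this to use: the next script modifies the earlier example to crawl the first 10 pages of Douban picture books, tidies up the output a little, and saves the data as a CSV file.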

# -*- coding: utf-8 -*- 
# @Time : 2020/4/27 14:14 
# @Author : ChengYu 
# @File : requests_getbooks.py

import requests
import re
import csv

# Send a browser-like User-Agent so the site treats the request as a browser visit
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/81.0.4044.122 Safari/537.36'}
# Create the CSV file and write the header row
with open('books.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['No.', 'Title', 'Rating', 'Rating count', 'Author/Publisher/Date/Price', 'Link'])
    # bookinfo = []
    count = 1
    # page_index starts at 0; crawl 10 pages
    for page_index in range(10):
        url = 'https://book.douban.com/tag/%E7%BB%98%E6%9C%AC?start=' + str(page_index * 20) + '&type=T'
        # url = 'https://book.douban.com/tag/绘本?start=' + str(page_index * 20) + '&type=T'

        # Fetch the page source
        res = requests.get(url, headers=headers).text
        # Extract each book's link, author/publication info, title, rating and rating count
        p_href = '<h2 class="">.*?<a href="(.*?)"'
        href = re.findall(p_href, res, re.S)
        p_info = '<div class="pub">(.*?)</div>'
        info = re.findall(p_info, res, re.S)
        # p_title = '<h2 class="">.*?>(.*?)</a>'
        p_title = '<h2 class="">.*?title="(.*?)"'
        title = re.findall(p_title, res, re.S)
        p_grade = '<span class="rating_nums">(.*?)</span>'
        grade = re.findall(p_grade, res, re.S)
        p_appraise = '<span class="pl">(.*?)</span>'
        appraise = re.findall(p_appraise, res, re.S)

        # p_detail = '<div class="info">.*?<p>(.*?)</p>'
        # detail = re.findall(p_detail, res, re.S)

        # Note: the lists are paired by position, which assumes every pattern
        # matched exactly once per book; a book without a rating would shift
        # the grade list out of step
        for j in range(len(title)):
            title[j] = title[j].strip()
            info[j] = info[j].strip()
            # Turn "(558486人评价)" into "558486"
            appraise[j] = appraise[j].strip().replace('(', '').replace('人评价)', '')
            grade[j] = grade[j].strip()
            href[j] = href[j].strip()
            writer.writerow([count, title[j], grade[j], appraise[j], info[j], href[j]])
            # print(title[i], info[i], href[i], appraise[i])
            # bookinfo.append([title[i], info[i], href[i], appraise[i]])
            # print('Page ' + str(i + 1) + ' crawled successfully')
            count += 1

    # No explicit csvfile.close() is needed: the with block closes the file on exit.
    # Closing it inside the loops would make the next write fail with
    # ValueError: I/O operation on closed file.

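After running it, books.csv appears in the current project directory. The sequence number no longer cycles from 1 to 20 on every page but keeps increasing from 1 until the crawl ends, the rating count changes from "(558486人评价)" to "558486", and a rating column has been added. Other fields can be added in the same way; it all comes down to regular expressions and the functions of the re module.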
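One weakness of matching each field over the whole page is that the lists are paired by position, so a book with a missing field (for example, no rating yet) shifts everything after it. A possible alternative is to cut the page into one chunk per book first and search inside each chunk. The sketch below runs on a made-up HTML fragment rather than the live site; the subject-item class used to split the entries is an assumption about the page markup, and parse_books and first_match are hypothetical helper names.

```python
import re

# Made-up fragment imitating two list entries; the second has no rating span
SAMPLE_HTML = '''
<li class="subject-item">
  <h2 class=""><a href="https://example.com/book/1" title="Sample Book A">Sample Book A</a></h2>
  <div class="pub">Author A / Press A / 2001-1 / 10.00</div>
  <span class="rating_nums">9.1</span>
  <span class="pl">(558486人评价)</span>
</li>
<li class="subject-item">
  <h2 class=""><a href="https://example.com/book/2" title="Sample Book B">Sample Book B</a></h2>
  <div class="pub">Author B / Press B / 2002-2 / 20.00</div>
  <span class="pl">(12人评价)</span>
</li>
'''


def first_match(pattern, text, default=''):
    """Return the first captured group, stripped, or default if no match."""
    match = re.search(pattern, text, re.S)
    return match.group(1).strip() if match else default


def parse_books(html):
    books = []
    # Assumption: each book sits in its own <li class="subject-item"> element.
    # The piece before the first marker is not a book, so it is skipped.
    for chunk in html.split('<li class="subject-item">')[1:]:
        count_text = first_match(r'<span class="pl">(.*?)</span>', chunk)
        # Keep the digits only when the text has the "(N人评价)" shape
        count = re.fullmatch(r'\((\d+)人评价\)', count_text)
        books.append({
            'title': first_match(r'<h2 class="">.*?title="(.*?)"', chunk),
            'href': first_match(r'<h2 class="">.*?<a href="(.*?)"', chunk),
            'info': first_match(r'<div class="pub">(.*?)</div>', chunk),
            'grade': first_match(r'<span class="rating_nums">(.*?)</span>', chunk),
            'appraise': count.group(1) if count else '',
        })
    return books


books = parse_books(SAMPLE_HTML)
for book in books:
    print(book)
```

Each dictionary could then be passed to csv.DictWriter.writerow; a missing rating simply becomes an empty cell instead of misaligning the rows.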

Ding! That is enough of a taste for now; class dismissed on this topic!

