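Scraping images with a crawler

I have recently been learning how to scrape images by following online tutorials. Along the way I saw how useful crawlers can be, and I also found that some blog posts are not very friendly to beginners, so I wrote this post.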

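How to get the page source

Open Google Chrome, go to Baidu Images, then open the browser menu and choose More tools > Developer tools.

As you scroll down the page, many image links appear in the Name column of the Network panel.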
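Each scroll triggers another request for the next batch of results. As a minimal sketch of that paging, the snippet below builds the request URL for several pages using the `pn` (offset) and `rn` (page size) parameters that `get_page` uses in the source code section of this post. The helper name `build_page_url`, the keyword `cat`, and the trimmed parameter set are assumptions for illustration; whether the server accepts fewer parameters than the full listing sends is not verified here.

```python
from urllib.parse import urlencode, urlparse, parse_qs

BASE_URL = 'https://image.baidu.com/search/acjson'
PAGE_SIZE = 30  # matches 'rn': '30' in the full listing


def build_page_url(keyword, page):
    """Hypothetical helper: build the URL for one page of results."""
    params = {
        'tn': 'resultjson_com',
        'ipn': 'rj',
        'queryWord': keyword,
        'word': keyword,
        'ie': 'utf-8',
        'oe': 'utf-8',
        'pn': page * PAGE_SIZE,  # offset of the first result on this page
        'rn': PAGE_SIZE,         # number of results per page
    }
    return BASE_URL + '?' + urlencode(params)


# URLs for the first three pages; nothing is sent over the network here.
urls = [build_page_url('cat', page) for page in range(3)]
for url in urls:
    print(url)
```

Only `pn` changes from one page to the next, which is why the full program can loop over a page index.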

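Click any image entry and a panel opens beside it showing, among other things, the Response Headers.
This is the information about your current request to Baidu Images, and it is largely the same from one request to the next.
Then right-click the page and choose the option to view the page source.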
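The same headers panel also lists the request headers the browser sent. As a hedged sketch, a script can attach similar headers to its own requests. The example uses the standard library's `urllib.request` so that it runs without extra packages; the `User-Agent` string is a placeholder, and the `Referer` value and the helper name `build_request` are assumptions, not values taken from the original post.

```python
import urllib.request


def build_request(url):
    """Hypothetical helper: attach browser-like headers to a request."""
    headers = {
        # Placeholder value: copy the real string from your own browser.
        'User-Agent': 'Mozilla/5.0 (placeholder; use the value your browser shows)',
        # Assumed value: the page the request would normally come from.
        'Referer': 'https://image.baidu.com/',
    }
    return urllib.request.Request(url, headers=headers)


# The request object is only constructed here, not sent.
request = build_request('https://image.baidu.com/search/acjson?word=cat')
print(request.header_items())
```

With the `requests` library used in the full program, the same dictionary can be passed as `requests.get(url, headers=headers)`.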

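Next, press Ctrl/Command + F and use the find tool to search for url, which is the image's download address. Other related fields, such as data, can be located quickly in the same way, and they all appear in the code.

Searching for fromPageTitle finds the title of the image.

Source code

The source code is commented in detail to make it easier to follow.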
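To make these field names concrete, here is a small sketch that pulls the image address and title out of a response body. The sample body is hypothetical and heavily trimmed, with plain `example.com` URLs in `objURL`; the helper name `extract_records` is also an assumption. Real responses carry many more fields and may obfuscate `objURL`.

```python
import json

# Hypothetical, trimmed-down response body for illustration only.
SAMPLE_BODY = '''
{"data": [
    {"objURL": "https://example.com/cat.jpg", "fromPageTitle": "A cat"},
    {"objURL": "https://example.com/dog.jpg"},
    {}
]}
'''


def extract_records(body):
    """Return one {'image', 'title'} dict per entry that has an objURL."""
    payload = json.loads(body)
    records = []
    for item in payload.get('data') or []:
        url = item.get('objURL')
        if not url:
            continue  # skip entries without an image address
        title = item.get('fromPageTitle') or 'noTitle'
        records.append({'image': url, 'title': title})
    return records


records = extract_records(SAMPLE_BODY)
for record in records:
    print(record['title'], record['image'])
```

Entries without a title fall back to `noTitle`, the same default the full program uses.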

# Scrape Baidu Images
from urllib.parse import urlencode
import requests
import re
import os

# Directory where downloaded images are stored
save_dir = '百度图片/'


# Decode the obfuscated image address (objURL) used by Baidu
def baidtu_uncomplie(url):
    res = ''
    c = ['_z2C$q', '_z&e3B', 'AzdH3F']
    d = {'w': 'a', 'k': 'b', 'v': 'c', '1': 'd', 'j': 'e', 'u': 'f', '2': 'g', 'i': 'h', 't': 'i', '3': 'j', 'h': 'k',
         's': 'l', '4': 'm', 'g': 'n', '5': 'o', 'r': 'p', 'q': 'q', '6': 'r', 'f': 's', 'p': 't', '7': 'u', 'e': 'v',
         'o': 'w', '8': '1', 'd': '2', 'n': '3', '9': '4', 'c': '5', 'm': '6', '0': '7', 'b': '8', 'l': '9', 'a': '0',
         '_z2C$q': ':', '_z&e3B': '.', 'AzdH3F': '/'}
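    if url is None or 'http' in url:  # missing or already a plain URL: return unchanged
        return url
    else:
        j = url
        # Replace the multi-character tokens first, then map single characters
        for m in c:
            j = j.replace(m, d[m])
        for char in j:
            if re.match(r'^[a-w\d]+$', char):    # only a-w and digits are remapped
                char = d[char]
            res = res + char
        return res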


# Fetch one page of search results
def get_page(offset):
    params = {
        'tn': 'resultjson_com',
        'ipn': 'rj',
        'ct': '201326592',
        'is': '',
        'fp': 'result',
        'queryWord': '中国人',  # search keyword
        'cl': '2',
        'lm': '-1',
        'ie': 'utf-8',
        'oe': 'utf-8',
        'adpicid': '',
        'st': '-1',
        'z': '',
        'ic': '0',
        'word': '中国人',  # search keyword
        's': '',
        'se': '',
        'tab': '',
        'width': '',
        'height': '',
        'face': '0',
        'istype': '2',
        'qc': '',
        'nc': '1',
        'fr': '',
        'expermode': '',
        'pn': offset * 30,
        'rn': '30',
        'gsm': '1e',
        '1537355234668': '',
    }
    url = 'https://image.baidu.com/search/acjson?' + urlencode(params)
    try:    # try to reach the server
        response = requests.get(url)
        if response.status_code == 200:     # HTTP 200 means the request succeeded
            return response.json()
    except requests.ConnectionError as d:
        print('Error', d.args)
    return None     # non-200 status or connection failure


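# Extract image records from the parsed response
def get_images(json):
    if json and json.get('data'):   # get_page returns None when the request fails
        for item in json.get('data'):   # each item is a dict describing one image
            if item.get('fromPageTitle'):   # title of the image
                title = item.get('fromPageTitle')
            else:
                title = 'noTitle'
            image = baidtu_uncomplie(item.get('objURL'))    # decoded image address
            if image:
                yield {     # one record per image
                    'image': image,
                    'title': title
                }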


# Download one image and save it as <count>.jpg
def save_image(item, count):
    try:
        response = requests.get(item.get('image'))
        if response.status_code == 200:      # HTTP 200 means the request succeeded
            file_path = save_dir + '{0}.{1}'.format(str(count), 'jpg')        # file name is the running count
            if not os.path.exists(file_path):   # skip files that already exist
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:    # connection error while downloading
        print('Failed to Save Image')


def main(pageIndex, count):
    json = get_page(pageIndex)
    for image in get_images(json):
        save_image(image, count)
        count += 1
    return count


if __name__ == '__main__':
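    if not os.path.exists(save_dir):    # create the output directory if it does not exist
        os.mkdir(save_dir)
    count = 1
    for i in range(1, 200):        # loop over result pages
        count = main(i, count)     # i is the page index; count is the next image number
    print('total:', count - 1)     # count starts at 1, so subtract 1 for the number of images processed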
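To see the decoding step in isolation, the sketch below restates it with `str.translate`, using the same substitution table as `baidtu_uncomplie`. This is an alternative formulation, not the author's code. The names `decode_obj_url` and `encode_obj_url` are hypothetical, and the encoder exists only to produce a test string for a round trip on a made-up `example.com` address.

```python
# Multi-character tokens and their plain replacements (from the listing).
TOKENS = {'_z2C$q': ':', '_z&e3B': '.', 'AzdH3F': '/'}

# Single-character table from the listing: CIPHER[i] decodes to PLAIN[i].
CIPHER = 'wkv1ju2it3hs4g5rq6fp7eo8dn9cm0bla'
PLAIN = 'abcdefghijklmnopqrstuvw1234567890'
DECODE_TABLE = str.maketrans(CIPHER, PLAIN)
ENCODE_TABLE = str.maketrans(PLAIN, CIPHER)


def decode_obj_url(encoded):
    """Decode an obfuscated address; plain or missing values pass through."""
    if not encoded or 'http' in encoded:
        return encoded
    # Tokens must be replaced before the per-character mapping,
    # because the tokens themselves contain mappable characters.
    for token, plain in TOKENS.items():
        encoded = encoded.replace(token, plain)
    return encoded.translate(DECODE_TABLE)


def encode_obj_url(url):
    """Hypothetical inverse, used here only to build sample input."""
    encoded = url.translate(ENCODE_TABLE)
    for token, plain in TOKENS.items():
        encoded = encoded.replace(plain, token)
    return encoded


original = 'https://example.com/1.jpg'
obfuscated = encode_obj_url(original)
restored = decode_obj_url(obfuscated)
print(obfuscated)
print(restored)
```

Characters outside the table, such as `x`, `y`, `z`, and uppercase letters, are left unchanged, which matches the behavior of the regular-expression check in the listing.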
