Python Web Crawlers (14): threading

Introduction

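When a crawler downloads images, or a novel with a very large number of chapters, fetching everything synchronously is slow, and that is a serious weakness for a web crawler. Multithreading avoids the problem: while one thread waits on the network, the others keep working, which shortens the overall crawl.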

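Multithreading is the technique of running multiple threads concurrently, implemented in software or in hardware. A machine with hardware support for multithreading can execute more than one thread at the same instant, which improves overall throughput.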

threading

threading is the Python standard-library module for working with threads.

Basic usage

import threading
import time

def func1():
    for i in range(3):
        print('func1')
        time.sleep(1)

def func2():
    for i in range(3):
        print('func2')
        time.sleep(1)

if __name__ == '__main__':
    thread1 = threading.Thread(target=func1)
    thread2 = threading.Thread(target=func2)

    thread1.start()
    thread2.start()

Output:

func1
func2
func2
func1
func1
func2

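The output shows that func2 does not wait for func1 to finish: the two functions run concurrently and their output is interleaved. That is multithreading. The exact interleaving can differ from run to run.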
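
The listing above starts its threads but never waits for them or passes them arguments. A minimal sketch of both, assuming a hypothetical download function that stands in for a slow network call (the names download, results and the 0.2 second delay are illustrative, not part of the original example):

```python
import threading
import time


def download(name, delay, results):
    """Stand-in for a slow network call; `delay` simulates the waiting time."""
    time.sleep(delay)
    results.append(name)  # list.append is safe to call from several threads in CPython


results = []
threads = [
    threading.Thread(target=download, args=('page-%d' % i, 0.2, results))
    for i in range(3)
]

start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()  # block until this thread has finished
elapsed = time.perf_counter() - start

print(sorted(results))
print('elapsed: %.2fs' % elapsed)  # typically close to 0.2s, not 3 * 0.2s
```

join() is what lets the main thread know that all the work is done before it reads results.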

Subclassing Thread

You can also subclass Thread and override its run method. Calling start() then runs run() automatically in the new thread:

import threading
import time

class func1Thread(threading.Thread):
    def run(self):
        for i in range(3):
            print('func1',threading.current_thread())
            time.sleep(1)

class func2Thread(threading.Thread):
    def run(self):
        for i in range(3):
            print('func2',threading.current_thread())
            time.sleep(1)

if __name__ == '__main__':
    thread1 = func1Thread()
    thread2 = func2Thread()

    thread1.start()
    thread2.start()
    print(threading.enumerate())

Output:

func1 <func1Thread(Thread-1, started 7736)>
func2 <func2Thread(Thread-2, started 2972)>
[<_MainThread(MainThread, started 804)>, <func1Thread(Thread-1, started 7736)>, <func2Thread(Thread-2, started 2972)>]
func1 <func1Thread(Thread-1, started 7736)>
func2 <func2Thread(Thread-2, started 2972)>
func1 <func1Thread(Thread-1, started 7736)>
func2 <func2Thread(Thread-2, started 2972)>

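The program above has three threads: the main thread MainThread, which runs the top-level script (the if __name__ == '__main__' block), and Thread-1 and Thread-2, created from func1Thread and func2Thread.

Two functions were used above:

  • threading.enumerate() returns a list of all Thread objects that are currently alive
  • threading.current_thread() returns the Thread object for the calling thread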
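
A Thread subclass can also take its own constructor arguments and keep its result as an attribute. The sketch below is one way to do that; SquareThread, value, result and ran_in are hypothetical names chosen for illustration:

```python
import threading


class SquareThread(threading.Thread):
    def __init__(self, value, **kwargs):
        super().__init__(**kwargs)  # forwards name=... and similar options to Thread
        self.value = value
        self.result = None
        self.ran_in = None

    def run(self):
        # current_thread() returns this very object while run() is executing
        self.ran_in = threading.current_thread().name
        self.result = self.value ** 2


workers = [SquareThread(n, name='square-%d' % n) for n in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print([(w.ran_in, w.result) for w in workers])
print([t.name for t in threading.enumerate()])  # finished threads are no longer listed
```

Giving each thread a name makes output like the listing above much easier to read than the default Thread-N names.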

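Global variables

Within one process, multithreading can greatly speed up a task. But all threads share the process's global variables, and that sharing can cause problems.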

import threading
import time

NUM = 0

def func():
    global NUM
    for i in range(3):
        NUM += 1
        time.sleep(1)
    print(NUM)

if __name__ == '__main__':
    for i in range(3):
        th = threading.Thread(target=func)
        th.start()

Output:

9
9
9

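What we wanted was 3, 6, 9. But the three threads run interleaved, not one after another: each thread sleeps after every increment, so all nine increments have already happened by the time any thread reaches print(NUM), and every thread prints 9.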
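
The same effect can be reproduced without relying on timing. The sketch below is not the original program: it swaps the sleep calls for a threading.Barrier, which makes every thread wait until all of them have finished incrementing, and it guards the increments with a threading.Lock so that the count itself is reliable. The names N_THREADS and seen are illustrative.

```python
import threading

NUM = 0
N_THREADS = 3
lock = threading.Lock()
barrier = threading.Barrier(N_THREADS)
seen = []


def func():
    global NUM
    for _ in range(3):
        with lock:  # keep each increment safe
            NUM += 1
    barrier.wait()  # returns only once all threads have finished incrementing
    with lock:
        seen.append(NUM)


threads = [threading.Thread(target=func) for _ in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(seen)  # [9, 9, 9]
```

Every thread reads NUM only after all nine increments are done, which is the situation the sleep calls create in the listing above.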

Lock

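To handle shared variables under multithreading, threading provides the Lock class. A thread acquires the lock before touching the shared variable; any other thread that tries to acquire the same lock blocks until it is released. The lock does not protect the variable by itself: it only works if every thread that accesses the variable acquires the same lock first.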

import threading
import time

NUM = 0
glock = threading.Lock()

def func():
    global NUM
    glock.acquire()
    for i in range(3):
        NUM += 1
        time.sleep(1)
    print(NUM)  # print while still holding the lock, so no other thread can change NUM first
    glock.release()

if __name__ == '__main__':
    for i in range(3):
        th = threading.Thread(target=func)
        th.start()

Output:

3
6
9

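Now only the thread holding the lock can modify the global variable; the others block in acquire() until it is released, so each thread finishes its three increments before the next one starts. Because the lock is held across the whole loop, including the sleep calls, the three threads effectively run one after another.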
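
A Lock can also be used as a context manager, which releases it even if the protected code raises an exception. The following sketch (counter, counter_lock and add_many are illustrative names) holds the lock only around the single statement that needs it:

```python
import threading

counter = 0
counter_lock = threading.Lock()


def add_many(n):
    global counter
    for _ in range(n):
        with counter_lock:  # acquire() on entry, release() on exit
            counter += 1


threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```

Keeping the locked region this small lets the threads overlap everywhere else, unlike a lock held across a whole loop.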

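Producer-consumer

The producer-consumer pattern is common in multithreaded programs. Put simply, producers generate data, consumers process it, and a buffer sits between the two.

Basic implementation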

import threading
import random
import time

TOTAL = 100
TIMES = 0

class Generant(threading.Thread):
    def run(self):
        global TIMES,TOTAL
        while True:
            produce = random.randint(0,100)
            if TIMES >= 10:
                break
            TOTAL += produce
            print('%s,TIMES = %d,produce = %d,TOTAL = %d' % (threading.current_thread(),TIMES,produce,TOTAL))
            TIMES += 1
            time.sleep(1)

class Consumer(threading.Thread):
    def run(self):
        global TIMES,TOTAL
        while True:
            consumption = random.randint(0,100)
            if TOTAL > consumption:
                TOTAL -= consumption
                print('%s,TIMES = %d,consumption = %d,TOTAL = %d' % (threading.current_thread(),TIMES,consumption,TOTAL))
                time.sleep(1)
            else:
                if TIMES >= 10:
                    break
                print('buffer error.')

if __name__ == '__main__':
    for i in range(3):
        Consumer(name='ConsumerThread%d' % i).start()

    for i in range(3):
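        # Note: the producers reuse the 'ConsumerThread%d' name, so in the output
        # they can only be told apart from consumers by their class name (Generant)
        Generant(name='ConsumerThread%d' % i).start()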

Output:

<Consumer(ConsumerThread0, started 15168)>,TIMES = 0,consumption = 43,TOTAL = 57
buffer error.
<Consumer(ConsumerThread1, started 15016)>,TIMES = 0,consumption = 12,TOTAL = 45
buffer error.
buffer error.
<Generant(ConsumerThread0, started 6100)>,TIMES = 0,produce = 49,TOTAL = 94
<Consumer(ConsumerThread2, started 12324)>,TIMES = 1,consumption = 90,TOTAL = 4
<Generant(ConsumerThread1, started 8116)>,TIMES = 1,produce = 54,TOTAL = 58
<Generant(ConsumerThread2, started 8032)>,TIMES = 2,produce = 7,TOTAL = 65
buffer error.
<Consumer(ConsumerThread1, started 15016)>,TIMES = 3,consumption = 9,TOTAL = 56
<Generant(ConsumerThread0, started 6100)>,TIMES = 3,produce = 86,TOTAL = 142
<Consumer(ConsumerThread2, started 12324)>,TIMES = 4,consumption = 88,TOTAL = 54
<Consumer(ConsumerThread0, started 15168)>,TIMES = 4,consumption = 10,TOTAL = 44
<Generant(ConsumerThread2, started 8032)>,TIMES = 4,produce = 9,TOTAL = 53
<Generant(ConsumerThread1, started 8116)>,TIMES = 5,produce = 77,TOTAL = 130
<Consumer(ConsumerThread1, started 15016)>,TIMES = 6,consumption = 54,TOTAL = 76
<Generant(ConsumerThread0, started 6100)>,TIMES = 6,produce = 39,TOTAL = 115
<Generant(ConsumerThread1, started 8116)>,TIMES = 7,produce = 7,TOTAL = 122
<Generant(ConsumerThread2, started 8032)>,TIMES = 7,produce = 37,TOTAL = 159
<Consumer(ConsumerThread2, started 12324)>,TIMES = 7,consumption = 6,TOTAL = 153
<Consumer(ConsumerThread0, started 15168)>,TIMES = 7,consumption = 63,TOTAL = 90
<Generant(ConsumerThread0, started 6100)>,TIMES = 9,produce = 21,TOTAL = 111
<Consumer(ConsumerThread1, started 15016)>,TIMES = 10,consumption = 17,TOTAL = 94
<Consumer(ConsumerThread2, started 12324)>,TIMES = 10,consumption = 29,TOTAL = 65
<Consumer(ConsumerThread0, started 15168)>,TIMES = 10,consumption = 20,TOTAL = 45
<Consumer(ConsumerThread1, started 15016)>,TIMES = 10,consumption = 41,TOTAL = 4

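The code above creates three producer threads and three consumer threads. TOTAL plays the role of the buffer, and TIMES limits how many times the producers may produce; consumption is not limited. The output looks plausible at first glance, but TIMES and TOTAL are global variables that several threads read and modify with no synchronization. The effect is already visible above: two producers print TIMES = 7, and the value 8 never appears. To be safe, these accesses should be protected with a Lock.

Implementation with Lock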

import threading
import random
import time

TOTAL = 100
TIMES = 0
LOCK = threading.Lock()

class Generant(threading.Thread):
    def run(self):
        global TIMES,TOTAL,LOCK
        while True:
            produce = random.randint(0,100)
            LOCK.acquire()
            if TIMES >= 10:
                LOCK.release()
                break
            TOTAL += produce
            print('%s,TIMES = %d,produce = %d,TOTAL = %d' % (threading.current_thread(),TIMES,produce,TOTAL))
            TIMES += 1
            LOCK.release()
            time.sleep(1)

class Consumer(threading.Thread):
    def run(self):
        global TIMES,TOTAL,LOCK
        while True:
            consumption = random.randint(0,100)
            LOCK.acquire()
            if TOTAL > consumption:
                TOTAL -= consumption
                print('%s,TIMES = %d,consumption = %d,TOTAL = %d' % (threading.current_thread(),TIMES,consumption,TOTAL))
                time.sleep(1)
            else:
                if TIMES >= 10:
                    LOCK.release()
                    break
                print('buffer error.')
            LOCK.release()

if __name__ == '__main__':
    for i in range(3):
        Consumer(name='ConsumerThread%d' % i).start()

    for i in range(3):
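        # Note: the producers reuse the 'ConsumerThread%d' name, so in the output
        # they can only be told apart from consumers by their class name (Generant)
        Generant(name='ConsumerThread%d' % i).start()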

Output:

<Consumer(ConsumerThread0, started 14616)>,TIMES = 0,consumption = 79,TOTAL = 21
buffer error.
buffer error.
<Generant(ConsumerThread0, started 11508)>,TIMES = 0,produce = 17,TOTAL = 38
<Generant(ConsumerThread1, started 9180)>,TIMES = 1,produce = 42,TOTAL = 80
<Generant(ConsumerThread2, started 12116)>,TIMES = 2,produce = 61,TOTAL = 141
<Consumer(ConsumerThread0, started 14616)>,TIMES = 3,consumption = 42,TOTAL = 99
<Consumer(ConsumerThread1, started 6692)>,TIMES = 3,consumption = 59,TOTAL = 40
buffer error.
buffer error.
buffer error.
<Consumer(ConsumerThread0, started 14616)>,TIMES = 3,consumption = 37,TOTAL = 3
<Generant(ConsumerThread0, started 11508)>,TIMES = 3,produce = 66,TOTAL = 69
buffer error.
<Consumer(ConsumerThread2, started 13116)>,TIMES = 4,consumption = 52,TOTAL = 17
<Generant(ConsumerThread2, started 12116)>,TIMES = 4,produce = 35,TOTAL = 52
<Generant(ConsumerThread1, started 9180)>,TIMES = 5,produce = 76,TOTAL = 128
<Consumer(ConsumerThread0, started 14616)>,TIMES = 6,consumption = 13,TOTAL = 115
<Consumer(ConsumerThread1, started 6692)>,TIMES = 6,consumption = 77,TOTAL = 38
buffer error.
buffer error.
buffer error.
buffer error.
buffer error.
buffer error.
buffer error.
<Consumer(ConsumerThread2, started 13116)>,TIMES = 6,consumption = 2,TOTAL = 36
<Generant(ConsumerThread0, started 11508)>,TIMES = 6,produce = 14,TOTAL = 50
<Generant(ConsumerThread1, started 9180)>,TIMES = 7,produce = 87,TOTAL = 137
<Consumer(ConsumerThread0, started 14616)>,TIMES = 8,consumption = 94,TOTAL = 43
<Generant(ConsumerThread2, started 12116)>,TIMES = 8,produce = 55,TOTAL = 98
<Consumer(ConsumerThread1, started 6692)>,TIMES = 9,consumption = 81,TOTAL = 17
buffer error.
<Generant(ConsumerThread1, started 9180)>,TIMES = 9,produce = 18,TOTAL = 35
<Consumer(ConsumerThread2, started 13116)>,TIMES = 10,consumption = 8,TOTAL = 27

Adding the Lock does not visibly change the output, but access to the global variables is now much safer.
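
One thing to watch in a Lock-based version is how long the lock is held: sleeping while holding it keeps every other thread waiting. The sketch below is a variation, not the original program. It uses with to release the lock on every exit path and does its simulated work outside the lock. The names stock, rounds, produced and consumed, and the 0.01 second delay, are assumptions made for illustration.

```python
import random
import threading
import time

INITIAL = 100
MAX_ROUNDS = 10  # total number of production rounds

stock = INITIAL
rounds = 0
produced = 0
consumed = 0
lock = threading.Lock()


def producer():
    global stock, rounds, produced
    while True:
        amount = random.randint(0, 100)
        with lock:  # released automatically, also on return
            if rounds >= MAX_ROUNDS:
                return
            stock += amount
            produced += amount
            rounds += 1
        time.sleep(0.01)  # simulated work happens outside the lock


def consumer():
    global stock, consumed
    while True:
        amount = random.randint(0, 100)
        with lock:
            if stock > amount:
                stock -= amount
                consumed += amount
            elif rounds >= MAX_ROUNDS:
                return  # buffer too small and nothing more will be produced
        time.sleep(0.01)


threads = [threading.Thread(target=producer) for _ in range(3)]
threads += [threading.Thread(target=consumer) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(stock, produced, consumed)
```

Because every update happens under the lock, the bookkeeping always adds up: stock equals INITIAL + produced - consumed, whatever random amounts were drawn.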

Condition

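The Lock version makes access to the globals safe, but both the producer and the consumer are built around a while True loop that exits through an if ... break. Every iteration acquires the lock and every exit path releases it, and a consumer that finds the buffer too small simply loops and tries again. For a short program like this the cost hardly matters, but in a larger project this kind of polling can burn a lot of CPU time, so a different mechanism is worth using.

With threading.Condition, a thread that cannot get the data it needs blocks in wait() until another thread, having made the condition true, calls notify to wake the waiting threads. This avoids the repeated lock/unlock polling of the Lock version and can improve performance.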

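The main methods of threading.Condition are:

  • acquire: acquires the underlying lock, as before
  • release: releases the underlying lock, as before
  • wait: releases the lock and blocks the calling thread until another thread wakes it with notify or notify_all; after waking it re-acquires the lock and then continues
  • notify: wakes up one thread waiting on the condition by default (notify(n) wakes up at most n)
  • notify_all: wakes up all threads waiting on the condition

wait, notify and notify_all must be called while the lock is held.

import threading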
import random
import time

TOTAL = 100
TIMES = 0
TOTAL_TIMES = 5
CONDITION = threading.Condition()


class Generant(threading.Thread):
    def run(self):
        global TIMES,TOTAL,CONDITION
        while True:
            produce = random.randint(0,100)
            CONDITION.acquire()
            if TIMES >= TOTAL_TIMES:
                CONDITION.release()
                break
            TOTAL += produce
            print('%s,TIMES = %d,produce = %d,TOTAL = %d' % (threading.current_thread(),TIMES,produce,TOTAL))
            TIMES += 1
            time.sleep(0.5)
            CONDITION.notify_all()
            CONDITION.release()


class Consumer(threading.Thread):
    def run(self):
        global TOTAL,TIMES,TOTAL_TIMES,CONDITION
        while True:
            consumption = random.randint(0,100)
            CONDITION.acquire()
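            # Check with while rather than if: after being woken up, this thread
            # may well find that TOTAL < consumption still holds.
            while TOTAL < consumption:
                # TIMES >= TOTAL_TIMES is the exit condition. If a 'Buffer Error' counter were used as the flag
                # instead, all consumers could end up in wait() after TIMES reaches its maximum but before the
                # counter does, and they would wait forever. TIMES also matches the producers' own stop condition.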
                if TIMES >= TOTAL_TIMES:
                    CONDITION.release()
                    return
                print('Buffer Error.')
                CONDITION.wait()
            TOTAL -= consumption
            print('%s,TIMES = %d,consumption = %d,TOTAL = %d' % (threading.current_thread(), TIMES, consumption, TOTAL))
            time.sleep(0.5)
            CONDITION.release()


if __name__ == '__main__':
    for i in range(3):
        Consumer(name='ConsumerThread%d' % i).start()

    for i in range(3):
        Generant(name='GenerantThread%d' % i).start()

Output:

<Consumer(ConsumerThread0, started 12300)>,TIMES = 0,consumption = 16,TOTAL = 84
<Consumer(ConsumerThread1, started 6488)>,TIMES = 0,consumption = 67,TOTAL = 17
Buffer Error.
<Generant(GenerantThread0, started 4132)>,TIMES = 0,produce = 75,TOTAL = 92
<Generant(GenerantThread1, started 13176)>,TIMES = 1,produce = 12,TOTAL = 104
<Generant(GenerantThread2, started 15288)>,TIMES = 2,produce = 10,TOTAL = 114
<Consumer(ConsumerThread0, started 12300)>,TIMES = 3,consumption = 76,TOTAL = 38
Buffer Error.
<Generant(GenerantThread0, started 4132)>,TIMES = 3,produce = 11,TOTAL = 49
<Consumer(ConsumerThread2, started 8716)>,TIMES = 4,consumption = 18,TOTAL = 31
<Generant(GenerantThread1, started 13176)>,TIMES = 4,produce = 32,TOTAL = 63
<Consumer(ConsumerThread0, started 12300)>,TIMES = 5,consumption = 43,TOTAL = 20
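
Condition also works as a context manager, and its wait_for method wraps the "re-check in a while loop" pattern used above. A minimal sketch with one producer and one consumer follows; items, done and received are illustrative names, and the sketch is a simplified stand-in rather than a rewrite of the program above:

```python
import threading

items = []
done = False
received = []
cond = threading.Condition()


def producer():
    global done
    for i in range(5):
        with cond:  # acquires the underlying lock
            items.append(i)
            cond.notify()  # wake one waiting consumer
    with cond:
        done = True
        cond.notify_all()  # let every waiter see that production is over


def consumer():
    while True:
        with cond:
            # wait_for re-checks the predicate after every wake-up
            cond.wait_for(lambda: items or done)
            if items:
                received.append(items.pop(0))
            else:
                return  # nothing left and the producer is done


c = threading.Thread(target=consumer)
p = threading.Thread(target=producer)
c.start()
p.start()
p.join()
c.join()

print(received)  # [0, 1, 2, 3, 4]
```

The consumer returns only when the list is empty and done is set, so no item is lost even if the producer finishes before the consumer wakes up.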

queue

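As noted earlier, threads do not run in a fixed order, so their output can come out shuffled. Can multithreading still be used when data has to be handed between threads in order?

Python's standard library includes queue, a thread-safe queue module. It provides a FIFO queue (Queue) and a LIFO queue (LifoQueue), among others. They do their locking internally, so they can be used directly from multiple threads. Commonly used functions:

  • Queue(): creates a first-in, first-out queue
  • qsize(): returns the approximate size of the queue
  • empty(): returns True if the queue is empty
  • full(): returns True if the queue is full
  • get(): removes and returns an item from the queue (blocks by default while the queue is empty)
  • put(): adds an item to the queue (blocks by default while the queue is full)

In multithreaded code the results of qsize(), empty() and full() may already be out of date by the time they are acted on.

The example below uses two queues: producer threads parse list pages and put image URLs on img_queue, and consumer threads download them.

import threading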
import requests
from urllib import request
from lxml import etree
import re
import queue
import os
import time

headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"
}

class Generrant(threading.Thread):
    def __init__(self,url_queue,img_queue,*arg,**argv):
        super(Generrant,self).__init__(*arg,**argv)
        self.url_queue = url_queue
        self.img_queue = img_queue

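    def run(self):
        while True:
            # Stop when there are no more pages. get_nowait() avoids blocking forever
            # if another thread takes the last URL between an empty() check and get()
            try:
                url = self.url_queue.get_nowait()
            except queue.Empty:
                break
            self.parse_queue(url)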

    def parse_queue(self,url):
        global headers
        response = requests.get(url,headers=headers)
        text = response.text
        html = etree.HTML(text)
        imgs = html.xpath("//div[@class='page-content text-center']//img")
        for img in imgs:
            if img.get('class') == 'gif':
                continue
            img_url = img.xpath('.//@data-original')[0]
            suffix = os.path.splitext(img_url)[1]
            alt = img.xpath('.//@alt')[0]
            alt = re.sub(r'[,。??,/\\·]','',alt)
            img_name = alt+suffix
            self.img_queue.put((img_url,img_name))



class Consumer(threading.Thread):
    def __init__(self,url_queue,img_queue,*arg,**argv):
        super(Consumer,self).__init__(*arg,**argv)
        self.url_queue = url_queue
        self.img_queue = img_queue

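    def run(self):
        while True:
            # Stop once both queues look drained. This check is only approximate:
            # a producer may still be parsing a page it has already taken.
            if self.img_queue.empty() and self.url_queue.empty():
                return
            try:
                # The timeout (an arbitrary 5 seconds) keeps this thread from blocking
                # forever if another consumer takes the last item first
                url, filename = self.img_queue.get(timeout=5)
            except queue.Empty:
                continue
            request.urlretrieve(url,'images/'+filename)
            time.sleep(1)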


if __name__ == '__main__':
    os.makedirs('images', exist_ok=True)  # urlretrieve does not create the target directory
    url_queue = queue.Queue(100)
    img_queue = queue.Queue(500)
    for i in range(1,10):
        url = "http://www.doutula.com/photo/list/?page=%d" % i
        url_queue.put(url)

    for i in range(5):
        Thread_G = Generrant(url_queue,img_queue)
        Thread_G.start()

    for i in range(5):
        Thread_C = Consumer(url_queue,img_queue)
        Thread_C.start()
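
Checking empty() to decide when to stop is fragile, because the answer can change before the next call. A common alternative, sketched below without any network access, is to push one sentinel value per worker and to use task_done() and join() to wait for completion. The worker function, the None sentinel and the squaring task are assumptions made for illustration:

```python
import queue
import threading

N_WORKERS = 3


def worker(tasks, results):
    while True:
        item = tasks.get()  # blocks until an item is available
        if item is None:  # sentinel: no more work for this worker
            tasks.task_done()
            return
        results.put((item, item * item))
        tasks.task_done()


tasks = queue.Queue()
results = queue.Queue()

workers = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(N_WORKERS)]
for w in workers:
    w.start()

for n in range(1, 6):
    tasks.put(n)
for _ in workers:
    tasks.put(None)  # one sentinel per worker

tasks.join()  # returns once task_done() was called for every item
for w in workers:
    w.join()

squares = sorted(results.get() for _ in range(results.qsize()))
print(squares)  # [(1, 1), (2, 4), (3, 9), (4, 16), (5, 25)]
```

The results are sorted before printing because the order in which the workers finish is not fixed.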

GIL

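  • The Python interpreter downloaded from the official site is CPython. Modern computers are usually multi-core, but threads in the CPython interpreter effectively use only one core at a time when executing Python code.
  • Only one thread executes Python bytecode at any given moment; CPython enforces this with the Global Interpreter Lock (GIL).
  • CPython's memory management is not thread-safe, which is why the GIL is needed.

Not every interpreter has a GIL. Under the GIL, CPython threads do not run Python code truly in parallel, but for I/O-bound work they still improve efficiency considerably, because a thread releases the GIL while it waits on I/O. For CPU-bound work, multiple processes are the better choice.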
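
The I/O-bound case can be illustrated with a small sketch. It uses concurrent.futures.ThreadPoolExecutor from the standard library, which is not covered in this article, and a hypothetical fake_request function in which time.sleep stands in for waiting on the network:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(i):
    time.sleep(0.2)  # the GIL is released while the thread sleeps
    return i * 2


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    doubled = list(pool.map(fake_request, range(4)))  # results keep the input order
elapsed = time.perf_counter() - start

print(doubled)  # [0, 2, 4, 6]
print('elapsed: %.2fs' % elapsed)  # typically close to 0.2s, not 4 * 0.2s
```

For CPU-bound functions the same code would gain little from threads under the GIL; the multiprocessing module or ProcessPoolExecutor is the usual choice there.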
