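1. Concurrency and Parallelism

Concurrency and parallelism are two similar concepts. A common way to put it: concurrency means several events occur within the same period of time, while parallelism means several events occur at the same instant.

The difference is easiest to see by comparing a single-core CPU with a multi-core CPU. On a single-core CPU, multiple tasks run concurrently: there is only one CPU, so each task takes a turn using it for a short time slice. If a task does not finish within its slice, the CPU switches to another task, and the unfinished task continues the next time it gets the CPU, until it completes. Because the time slices are short and the switching is frequent, the tasks appear to run "at the same time". On a multi-core CPU, tasks on different cores really do run at the same time; that is parallelism.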
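The time-slicing idea can be imitated in a few lines. The following is only a toy sketch, not how an operating system scheduler is implemented: the names task and round_robin are made up for illustration, and each yield stands in for the end of a time slice on a single simulated core.

```python
from collections import deque

def task(name, steps):
    """A toy task that pauses after each unit of work, like a time slice ending."""
    for i in range(1, steps + 1):
        yield f"{name} step {i}"

def round_robin(tasks):
    """Run tasks on one simulated core, switching to the next task after every step."""
    ready = deque(tasks)
    log = []
    while ready:
        current = ready.popleft()
        try:
            log.append(next(current))
            ready.append(current)  # not finished: go to the back of the line
        except StopIteration:
            pass  # finished: drop the task
    return log

schedule = round_robin([task("A", 3), task("B", 2)])
for entry in schedule:
    print(entry)
```

The steps of A and B come out interleaved, even though only one step runs at any moment. That is concurrency without parallelism.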

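2. Synchronous and Asynchronous

Synchronous and asynchronous execution are another pair of concepts worth comparing. They are easiest to understand on top of the concurrency and parallelism picture above.

Synchronous means that the concurrent or parallel tasks do not run independently: there is an ordering between them, and one task may have to finish and produce its result before another can start. It is like a relay race, where the next runner cannot start until the baton has been handed over.

Asynchronous means that the concurrent or parallel tasks run independently, and one task is not affected by another. The tasks are like runners in separate lanes, whose speed does not depend on the runners in the other lanes.

In a web crawler, suppose you need to open 4 different websites. The IO is the loading of each site, and the CPU work is your click. Clicking is fast, but the sites load slowly. With synchronous IO, after clicking a URL you wait until that site has fully loaded before clicking the next one. With asynchronous IO, after clicking a URL you do not wait for the server to respond; you immediately open another URL in a new browser window, and in the end you wait for all 4 sites to finish loading together.

When most of the time is spent waiting, the asynchronous approach finishes much sooner, because the waits overlap.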
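The four-website analogy can be sketched without touching the network. In this sketch, open_site and the site names are hypothetical stand-ins, time.sleep plays the role of the slow page load, and a thread pool is used only as a convenient way to overlap the waits.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def open_site(name, delay=0.2):
    """Stand-in for a slow page load; time.sleep plays the role of the IO wait."""
    time.sleep(delay)
    return name

sites = ["site-1", "site-2", "site-3", "site-4"]  # hypothetical names

# Synchronous: wait for each page before clicking the next one.
start = time.perf_counter()
sync_result = [open_site(s) for s in sites]
sync_time = time.perf_counter() - start

# Asynchronous style: start all four loads, then wait for them together.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    async_result = list(pool.map(open_site, sites))
async_time = time.perf_counter() - start

print(f"one at a time: {sync_time:.2f}s, overlapped: {async_time:.2f}s")
```

The first timing is about the sum of the four waits, while the second is close to a single wait, because the four loads wait at the same time.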

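3. Multithreaded Crawlers

A multithreaded crawler runs concurrently. That is, the threads do not truly execute at the same time; the crawler gains speed by switching quickly between threads.

Python's design limits multithreaded execution. Early in Python's design, a GIL (Global Interpreter Lock) was introduced for the sake of data safety. In Python, a thread's execution cycle consists of acquiring the GIL, running code until it is suspended, and releasing the GIL.

For example, a thread that wants to run must first acquire the GIL. We can think of the GIL as a "pass", and each Python process has only one. A thread that cannot get the pass is not allowed onto the CPU.

Every time the GIL is released, the threads compete for it, and switching threads costs resources. Because of the GIL, a Python process can only ever execute one thread at a time (the one holding the GIL), which is why Python multithreading is inefficient on multi-core CPUs.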
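A rough way to observe this is to time a CPU-bound loop run twice in a row against the same loop run in two threads. This is a minimal sketch, assuming a standard CPython build with the GIL enabled; count_down and the workload size N are arbitrary choices for illustration, and the exact timings depend on the machine and the Python version.

```python
import threading
import time

def count_down(n):
    """Pure-Python CPU-bound loop; the running thread holds the GIL while it executes."""
    while n > 0:
        n -= 1

N = 2_000_000  # arbitrary workload size chosen for the illustration

# Run the work twice, one after the other.
start = time.perf_counter()
count_down(N)
count_down(N)
serial_time = time.perf_counter() - start

# Run the same work in two threads.
start = time.perf_counter()
workers = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
threaded_time = time.perf_counter() - start

print(f"serial: {serial_time:.2f}s, two threads: {threaded_time:.2f}s")
```

With the GIL in place, the two-thread version typically takes about as long as the serial one, since only one thread executes Python bytecode at a time.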

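Does the GIL make multithreading useless, then?

Not for web crawlers. Crawling is IO-bound, and multithreading improves its efficiency considerably. A single thread that performs IO has to wait for it to finish, which wastes time. With multiple threads, execution switches automatically to thread B while thread A is waiting, so CPU time is not wasted and the program runs more efficiently.

Python's multithreading suits IO-bound code well, so a crawler can use multiple threads while fetching pages to speed things up.

The following sections time the fetching of the 1000 most-visited Chinese websites and compare multithreaded crawlers against a single-threaded one to confirm the speedup.

The 1000 URLs are stored in the file alexa.txt, one per line, with the URL in the second tab-separated field.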

3.1 A Simple Single-Threaded Crawler

import requests
import time

# Each line of alexa.txt is tab-separated; the URL is the second field.
link_list = []
with open('alexa.txt', 'r') as file:
    for line in file:
        link = line.split('\t')[1].strip()
        link_list.append(link)

start = time.time()
# Fetch the URLs one after another.
for eachone in link_list:
    try:
        r = requests.get(eachone)
        print(r.status_code, eachone)
    except Exception as e:
        print("Error: ", e)
end = time.time()
print("Total time for the serial crawler:", end-start)

The measured time was roughly 2030.428 seconds.

3.2 A Simple Multithreaded Crawler

import threading
import time
import requests

# Each line of alexa.txt is tab-separated; the URL is the second field.
link_list = []
with open('alexa.txt', 'r') as file:
    for line in file:
        link = line.split('\t')[1].strip()
        link_list.append(link)

start = time.time()

class MyThread(threading.Thread):
    def __init__(self, name, link_range):
        threading.Thread.__init__(self)
        self.name = name
        self.link_range = link_range
    def run(self):
        print("Starting " + self.name)
        crawler(self.name, self.link_range)
        print("Exiting " + self.name)

def crawler(threadName, link_range):
    # link_range is an inclusive (first, last) pair of indices into link_list.
    for i in range(link_range[0], link_range[1]+1):
        try:
            r = requests.get(link_list[i], timeout=20)
            print(threadName, r.status_code, link_list[i])
        except Exception as e:
            print(threadName, 'Error: ', e)

thread_list = []
# Five equal slices covering indices 0-999 of the 1000 URLs.
link_range_list = [(0,199), (200,399), (400,599), (600,799), (800,999)]

# Create and start the threads
for i in range(1,6):
    thread = MyThread("Thread-" + str(i), link_range_list[i-1])
    thread.start()
    thread_list.append(thread)

# Wait for all threads to finish
for thread in thread_list:
    thread.join()

end = time.time()
print("Total time for the simple multithreaded crawler:", end-start)
print("Exiting Main Thread")

The result was roughly 532.81 seconds, about 3.8 times faster than the serial crawler.
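The index ranges in link_range_list are written out by hand, which is easy to get wrong when the number of URLs or threads changes. As a small sketch, they could be computed instead. The helper split_ranges below is hypothetical (it is not part of the original crawler) and returns inclusive (first, last) pairs, the form that crawler() expects.

```python
def split_ranges(total, parts):
    """Split indices 0..total-1 into `parts` contiguous (first, last) ranges.

    Both ends are inclusive, matching the link_range tuples used by crawler().
    """
    base, extra = divmod(total, parts)
    ranges = []
    first = 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first ranges
        if size == 0:
            continue  # more threads than links: skip empty ranges
        ranges.append((first, first + size - 1))
        first += size
    return ranges

print(split_ranges(1000, 5))
print(split_ranges(10, 3))
```

For 1000 URLs and 5 threads this gives five slices of 200 indices each, and for uneven cases such as 10 URLs over 3 threads the leftover index goes to the first slice.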

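3.3 A Multithreaded Crawler Using Queue

Here we no longer assign each thread a fixed number of URLs. A thread may finish its share before the others and then sit idle, which is wasteful. Instead, we store all the URLs in a queue, and each thread repeatedly takes one URL from the queue and processes it.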

import threading
import time
import requests
import queue as Queue

# Each line of alexa.txt is tab-separated; the URL is the second field.
link_list = []
with open('alexa.txt', 'r') as file:
    for line in file:
        link = line.split('\t')[1].strip()
        link_list.append(link)

start = time.time()

class MyThread(threading.Thread):
    def __init__(self, name, q):
        threading.Thread.__init__(self)
        self.name = name
        self.q = q
    def run(self):
        print("Starting " + self.name)
        while True:
            try:
                crawler(self.name, self.q)
            except Queue.Empty:
                # No URL arrived within the timeout: the queue is drained.
                break
        print("Exiting " + self.name)

def crawler(threadName, q):
    # Take one URL; raises Queue.Empty if none is available within 2 seconds.
    url = q.get(timeout=2)
    try:
        r = requests.get(url, timeout=20)
        print(q.qsize(), threadName, r.status_code, url)
    except Exception as e:
        print(q.qsize(), threadName, url, 'Error: ', e)

threadList = ["Thread-1", "Thread-2", "Thread-3", "Thread-4", "Thread-5"]
workQueue = Queue.Queue(1000)
threads = []

# Fill the queue with all URLs
for url in link_list:
    workQueue.put(url)

# Create and start the threads
for tName in threadList:
    thread = MyThread(tName, workQueue)
    thread.start()
    threads.append(thread)

# Wait for all threads to finish
for thread in threads:
    thread.join()

end = time.time()
print("Total time for the Queue-based multithreaded crawler:", end-start)
print("Exiting Main Thread")

The result was roughly 441.40 seconds, about 17% less than the fixed-range version.
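The same "idle workers take the next URL" behaviour is also available from the standard library's concurrent.futures.ThreadPoolExecutor, which manages the queue and the threads internally. This is an alternative technique, not the approach used in this article. In the sketch below, fetch is a hypothetical stand-in that sleeps instead of calling requests.get, and the example.com URLs are placeholders, so the block runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    """Hypothetical stand-in for requests.get(url, timeout=20).status_code."""
    time.sleep(0.05)  # pretend to wait on the network
    return 200

# Placeholder URLs; a real crawler would use link_list instead.
urls = [f"http://example.com/page{i}" for i in range(20)]

results = {}
with ThreadPoolExecutor(max_workers=5) as pool:
    # submit() queues every URL; an idle worker picks up the next one.
    futures = {pool.submit(fetch, url): url for url in urls}
    for future in as_completed(futures):
        url = futures[future]
        try:
            results[url] = future.result()
        except Exception as e:
            results[url] = e  # keep the error so one failure does not stop the rest

print(len(results), "URLs fetched")
```

Replacing the body of fetch with a real request would turn this into a crawler comparable to the Queue version, though its timing has not been measured here.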

4. Two Ways to Use Multithreading in Python

(1) Function style: call start_new_thread() from the _thread module to spawn a new thread:

import _thread
import time

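# Define the function the threads will run
def print_time(threadName, delay):
    count = 0
    while count < 3:
        time.sleep(delay)
        count += 1
        print(threadName, time.ctime())

_thread.start_new_thread(print_time, ("Thread-1", 1))
_thread.start_new_thread(print_time, ("Thread-2", 2))
print("Main Finished")
# _thread offers no join(); keep the main thread alive until both threads
# are done (Thread-2 needs about 6 seconds), or a standalone script may
# exit before they print anything.
time.sleep(7)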

The output is as follows:

Main Finished
Thread-1 Wed Apr 8 21:44:57 2020
Thread-2 Wed Apr 8 21:44:58 2020
Thread-1 Wed Apr 8 21:44:58 2020
Thread-1 Wed Apr 8 21:44:59 2020
Thread-2 Wed Apr 8 21:45:00 2020
Thread-2 Wed Apr 8 21:45:02 2020

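(2) Class style: use the threading module and create a thread by subclassing threading.Thread

run(): the method representing the thread's activity.
start(): starts the thread's activity.
join([timeout]): blocks the calling thread until the thread whose join() was called terminates, or until the optional timeout elapses.
is_alive(): returns whether the thread is alive (the older isAlive() spelling was removed in Python 3.9).
getName(): returns the thread name (deprecated since Python 3.10 in favor of the name attribute).
setName(): sets the thread name (deprecated since Python 3.10 in favor of the name attribute).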
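A short sketch of how these methods behave in practice follows. It uses the is_alive() spelling and the name attribute that current Python versions provide; the nap function and the delays are arbitrary choices for illustration.

```python
import threading
import time

def nap(seconds):
    """Sleep briefly to stand in for real work."""
    time.sleep(seconds)

t = threading.Thread(target=nap, args=(0.3,), name="Worker-1")
alive_before_start = t.is_alive()   # not started yet

t.start()
t.join(0.05)                        # wait at most 0.05 s, then return anyway
alive_after_timeout = t.is_alive()  # the 0.3 s nap should still be running

t.join()                            # no timeout: wait until the thread finishes
alive_after_join = t.is_alive()     # the thread has terminated

print(t.name, alive_before_start, alive_after_timeout, alive_after_join)
```

Note that join() with a timeout returns without raising when the time runs out, so is_alive() is the way to tell whether the thread actually finished.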

import threading
import time

class MyThread(threading.Thread):
    def __init__(self, name, delay):
        threading.Thread.__init__(self)
        self.name = name
        self.delay = delay
    def run(self):
        print("Starting " + self.name)
        print_time(self.name, self.delay)
        print("Exiting " + self.name)

def print_time(threadingName, delay):
    counter = 0
    while counter < 3:
        time.sleep(delay)
        print(threadingName, time.ctime())
        counter += 1

threads = []

# Create the threads
thread1 = MyThread("Thread-1", 1)
thread2 = MyThread("Thread-2", 2)

# Start the threads
thread1.start()
thread2.start()

# Add the threads to the thread list
threads.append(thread1)
threads.append(thread2)

# Wait for all threads to finish
for t in threads:
    t.join()
    
print("Exiting Main Thread")

The result is as follows:
Starting Thread-1
Starting Thread-2
Thread-1 Wed Apr 8 21:53:14 2020
Thread-1 Wed Apr 8 21:53:15 2020
Thread-2 Wed Apr 8 21:53:15 2020
Thread-1 Wed Apr 8 21:53:16 2020
Exiting Thread-1
Thread-2 Wed Apr 8 21:53:17 2020
Thread-2 Wed Apr 8 21:53:19 2020
Exiting Thread-2
Exiting Main Thread
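Besides subclassing, threading.Thread also accepts a target function directly, which combines the brevity of the function style with join() support. The sketch below is a variation on the examples above, not one of the two methods this section covers: print_time is adapted to record names in a shared list instead of printing timestamps, and the delays are shortened so that it finishes quickly.

```python
import threading
import time

results = []
results_lock = threading.Lock()

def print_time(thread_name, delay, repeats=3):
    """Record the thread name `repeats` times, pausing `delay` seconds each time."""
    for _ in range(repeats):
        time.sleep(delay)
        with results_lock:  # protect the shared list
            results.append(thread_name)

# Shorter delays than in the examples above, so the sketch finishes quickly.
threads = [
    threading.Thread(target=print_time, args=("Thread-1", 0.1)),
    threading.Thread(target=print_time, args=("Thread-2", 0.2)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))
```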
