Study Notes: Web Crawlers

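A web crawler (also known as a web spider or web robot, and often as a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.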
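
To make the definition concrete, here is a minimal sketch of a crawl loop that uses only the Python standard library. The in-memory `PAGES` dictionary, the `LinkParser` class, and the `crawl` helper are hypothetical names made up for this example; `PAGES` stands in for real HTTP requests so the code runs without network access, whereas a real crawler would fetch each URL instead.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> HTML source (stands in for HTTP fetches).
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url):
    """Visit pages breadth-first by following links; return the visit order."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        html = PAGES.get(url)
        if html is None:                    # unknown page: nothing to fetch
            continue
        order.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:        # rule: never queue a URL twice
                seen.add(absolute)
                queue.append(absolute)
    return order


print(crawl("http://example.com/"))
```

The "rule" in this sketch is simply "follow every link once"; real crawlers usually add further rules such as staying on one domain or limiting depth.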

Modules: scrapy, requests

Environment: CentOS

****************** For an in-depth treatment of Scrapy, please look elsewhere *************************

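When scraping web pages, the most common task is extracting data from the HTML source. Several existing libraries can do this:

BeautifulSoup is a very popular web page parsing library among programmers. It builds a Python object from the structure of the HTML code and handles bad markup reasonably well, but it has one drawback: it is slow.
lxml is an XML parsing library (it can also parse HTML) with a Pythonic API based on ElementTree. lxml is not part of the Python standard library.
(Approach 2 below includes an example.)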

Scrapy has its own mechanism for extracting data. These are called selectors, because they "select" parts of an HTML document using XPath or CSS expressions.

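XPath is a language for selecting nodes in XML documents, and it can also be used on HTML. CSS is a language for applying styles to HTML documents; it defines selectors that associate those styles with specific HTML elements.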
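
As a small illustration of XPath, the sketch below runs two expressions with the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath (Scrapy's selectors support much more). The sample markup is made up for this example and has to be well-formed XML for this parser to accept it.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup (ElementTree needs valid XML).
doc = """
<html>
  <body>
    <span>good</span>
    <span>buy</span>
    <ul>
      <li class="video_part_list">aa</li>
      <li class="audio_part_list">cc</li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(doc)

# ".//span" selects every <span> anywhere below the root element.
span_texts = [node.text for node in root.findall(".//span")]

# A predicate in square brackets filters elements by attribute value.
video_items = [node.text for node in root.findall(".//li[@class='video_part_list']")]

print(span_texts)
print(video_items)
```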

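Scrapy selectors are built on top of the lxml library, which means they are very similar to it in speed and parsing accuracy.

For details on selectors, see https://scrapy-chs.readthedocs.io/zh_CN/master/topics/selectors.html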

# Extracting page data with Selector, method 1 (XPath)
from scrapy import Selector

html = "<html><body><span>good</span><span>buy</span></body></html>"
sel = Selector(text=html)
nodes = sel.xpath('//span')
for node in nodes:
    print(node.extract())

# Extracting page data with Selector, method 2 (CSS)
from scrapy import Selector

html = """
        <html>
            <body>
                 <span>good</span>
                 <span>buy</span>
                 <ul>
                    <li class="video_part_list">aa</li>
                    <li class="video_part_list">bb</li>
                    <li class="audio_part_list">cc</li>
                    <li class="video_part_list">
                                 <a href="/index">index</a>
                    </li></ul></body></html>
          """
sel = Selector(text=html)
nodes = sel.css('li.video_part_list')

for node in nodes:
    # href of the <a> inside each matching <li>, or None if the <li> has no link
    print(node.css('a::attr(href)').extract_first())

Approach 1: using the Scrapy crawler module

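1. Create a crawler project: scrapy startproject project_name

Example: scrapy startproject tututu  # tututu is the project directory name

2. Create the spider file under project_name/project_name/spiders/

Example: vim pachong.py

3. Write the spider code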

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dadada"                              # spider name; must be set as name = '...'
    start_urls = [
        "http://www.cnblogs.com/wangkongming/default.html?page=22",
        "http://www.cnblogs.com/wangkongming/default.html?page=21",
    ]                                            # starting URLs; must be set as start_urls = [...]

    def parse(self, response):
        # response.url is the URL that was crawled; its last segment gives each page its own file
        filename = response.url.split("/")[-1].replace("?", "_")
        with open(filename, "wb") as f:
            f.write(response.body)               # response.body is the raw page content (bytes)

4. Run the spider: scrapy crawl dadada  # dadada is the spider's name
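
The file name in step 3 is derived by splitting response.url on "/". As a sketch of a more explicit alternative, the hypothetical helper `filename_for` below uses `urllib.parse` from the standard library to build a name from the URL path and its query string. The naming scheme is an assumption chosen for illustration, not something Scrapy requires.

```python
import posixpath
from urllib.parse import parse_qs, urlsplit


def filename_for(url):
    """Build a file name such as 'default_page22.html' from a page URL."""
    parts = urlsplit(url)
    # Last path segment, e.g. "default.html"; fall back to "index.html".
    base = posixpath.basename(parts.path) or "index.html"
    stem, ext = posixpath.splitext(base)
    # Append every query parameter so different pages get different names.
    query = parse_qs(parts.query)
    suffix = "".join(
        "_%s%s" % (key, "".join(values)) for key, values in sorted(query.items())
    )
    return stem + suffix + ext


print(filename_for("http://www.cnblogs.com/wangkongming/default.html?page=22"))
print(filename_for("http://www.cnblogs.com/wangkongming/default.html?page=21"))
```

Inside a spider, such a helper would be called as `filename_for(response.url)` in `parse`.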

Approach 2: using the requests module

#coding:utf-8

import requests
from bs4 import BeautifulSoup



url = "http://www.ivsky.com/"

def download_url(url):
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
    response = requests.get(url,headers=headers)                 # request the page
    return response.text                                         # return the page source


def connet_image_url(html):

    soup = BeautifulSoup(html,'lxml')                           # parse the HTML so tags are easy to look up
    body = soup.body                                            # the <body> of the page
    data_main = body.find('div',{"class":"ileft"})              # find the <div> with class="ileft" inside body

    if data_main:
        images = data_main.find_all('img')                      # all <img> tags inside data_main
        with open('img_url','w') as f:
            for image in images:
                image_url = image.get('src')                    # value of the src attribute
                if image_url:                                   # skip <img> tags without a src
                    f.write(image_url+'\n')
        save_image()                                            # download only when the URL file was written
def save_image():
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
    with open('img_url','r') as url_file:
        i=0
        for line in url_file:
            line=line.strip()                                  # strip the trailing newline
            if line:
                i+=1
                response = requests.get(url=line,headers=headers)
                filename=str(i)+'.jpg'
                with open(filename,'wb') as img_file:
                    img_file.write(response.content)           # write the image bytes to the file
                print('This is image number %s'%i)
connet_image_url(download_url(url))
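
One case the script above does not handle is an <img> whose src is relative or protocol-relative, which requests.get cannot fetch as-is. A minimal sketch of resolving such values against the page URL with `urllib.parse.urljoin` follows. The sample src values are invented for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

page_url = "http://www.ivsky.com/"

# Invented examples of the forms a src attribute can take.
src_values = [
    "http://img.example.com/a.jpg",   # already absolute
    "//img.example.com/b.jpg",        # protocol-relative
    "/img/c.jpg",                     # relative to the site root
    "img/d.jpg",                      # relative to the current page
]

# urljoin leaves absolute URLs alone and resolves the rest against page_url.
absolute_urls = [urljoin(page_url, src) for src in src_values]
for image_url in absolute_urls:
    print(image_url)
```

In the script, this would mean writing `urljoin(url, image_url)` to the img_url file rather than the raw src value.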

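These are a beginner's notes. If you spot any mistakes, please point them out in the comments below so I can fix them.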
