1. Data extraction methods

XPath

# Unfinished

import requests
from lxml import html as lxml_html  # pip install lxml; used to parse HTML

response = requests.get(url='http://www.baidu.com')
response.encoding = 'utf-8'
page = response.text
doc = lxml_html.fromstring(page)
title = doc.xpath('your XPath expression goes here')

>>>title
['matched content']
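
The placeholder expression above can be made concrete with a small sketch. To keep it runnable without any third-party install, it swaps in the standard library's xml.etree.ElementTree, which understands only a limited subset of XPath, and it runs against a made-up, well-formed snippet rather than a real page. With lxml, doc.xpath() accepts full XPath 1.0, so the same kind of expression (plus things like /text()) should carry over.

```python
import xml.etree.ElementTree as ET

# A small, well-formed snippet standing in for a downloaded page (made up for this example).
page = """
<html>
  <body>
    <div class="post"><h1>First title</h1></div>
    <div class="post"><h1>Second title</h1></div>
    <div class="ad"><h1>Sponsored</h1></div>
  </body>
</html>
"""

root = ET.fromstring(page.strip())

# Select every <h1> that is a direct child of a <div class="post">, anywhere in the tree.
titles = [h1.text for h1 in root.findall(".//div[@class='post']/h1")]
print(titles)  # ['First title', 'Second title']
```

Like doc.xpath(), the query returns a list, so an expression that matches nothing gives an empty list rather than an error.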

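Regular expressions

The most commonly used pattern is (.*?), a non-greedy match. By default . matches every character except a newline, so if the text you want spans several lines, use ([\s\S]*) instead, which also matches newlines. Note that ([\s\S]*) is greedy: it runs to the last possible match.
Example: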

import re
string = '{"vm_type":"kvm","ve_status":"running","ve_mac1":"***********","ve_used_disk_space_b":5088407552,"ve_disk_quota_gb":"10","is_cpu_throttled":"","ssh_port":29364,"live_hostname":"ubuntu","load_average":"0.00 0.00 0.00 1\\/167 27303","mem_available_kb":345856,"swap_total_kb":135164,"swap_available_kb":100160,"hostname":"localhost.localdomain","node_ip":"**********","node_alias":"v7415","node_location":"US, California","node_location_id":"USCA_3","node_datacenter":"US: Los Angeles, California (DC3 CN2)","location_ipv6_ready":false,"plan":"kvmv3-10g-512m-500m-ca-cn2","plan_monthly_data":536870912000,"monthly_data_multiplier":1,"plan_disk":10737418240,"plan_ram":536870912,"plan_swap":0,"plan_max_ipv6s":0,"os":"ubuntu-16.04-x86_64","email":"[email protected]","data_counter":14346176667,"data_next_reset":1532495173,"ip_addresses":["**********"],"rdns_api_available":true,"ptr":{"********":null},"suspended":false,"error":0,"veid":938308}'
re_plan = '"plan":"(.*?)"'
plan = re.findall(re_plan,string)

>>>plan
['kvmv3-10g-512m-500m-ca-cn2']


# Greedy version: [\s\S]* keeps going until the last '"' in the string
re_plan = r'"plan":"([\s\S]*)"'
plan = re.findall(re_plan, string)

>>>plan
['kvmv3-10g-512m-500m-ca-cn2","plan_monthly_data":536870912000,"monthly_data_multiplier":1,"plan_disk":10737418240,"plan_ram":536870912,"plan_swap":0,"plan_max_ipv6s":0,"os":"ubuntu-16.04-x86_64","email":"[email protected]","data_counter":14346176667,"data_next_reset":1532495173,"ip_addresses":["**********"],"rdns_api_available":true,"ptr":{"********":null},"suspended":false,"error":0,"veid']
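
The newline issue mentioned above is easier to see on text that really contains a line break. The following is a minimal sketch on a made-up HTML fragment; it compares the default behaviour, the re.S (DOTALL) flag, and the [\s\S] character class, all kept non-greedy.

```python
import re

# Made-up fragment: the wanted text contains a newline.
text = '<div class="intro">line one\nline two</div><div class="other">x</div>'

# '.' does not match '\n' by default, so this finds nothing.
no_flag = re.findall(r'<div class="intro">(.*?)</div>', text)

# re.S (DOTALL) lets '.' match newlines; the non-greedy '?' still stops at the first </div>.
with_flag = re.findall(r'<div class="intro">(.*?)</div>', text, re.S)

# [\s\S] matches any character without a flag; '*?' keeps it non-greedy.
char_class = re.findall(r'<div class="intro">([\s\S]*?)</div>', text)

print(no_flag)     # []
print(with_flag)   # ['line one\nline two']
print(char_class)  # ['line one\nline two']
```

Writing the patterns as raw strings (r'...') avoids Python treating \s as a string escape.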

CSS selectors

I haven't used these yet...

2. Converting a string to JSON

import json
from pprint import pprint
string = '{"vm_type":"kvm","ve_status":"running","ve_mac1":"***********","ve_used_disk_space_b":5088407552,"ve_disk_quota_gb":"10","is_cpu_throttled":"","ssh_port":29364,"live_hostname":"ubuntu","load_average":"0.00 0.00 0.00 1\\/167 27303","mem_available_kb":345856,"swap_total_kb":135164,"swap_available_kb":100160,"hostname":"localhost.localdomain","node_ip":"**********","node_alias":"v7415","node_location":"US, California","node_location_id":"USCA_3","node_datacenter":"US: Los Angeles, California (DC3 CN2)","location_ipv6_ready":false,"plan":"kvmv3-10g-512m-500m-ca-cn2","plan_monthly_data":536870912000,"monthly_data_multiplier":1,"plan_disk":10737418240,"plan_ram":536870912,"plan_swap":0,"plan_max_ipv6s":0,"os":"ubuntu-16.04-x86_64","email":"[email protected]","data_counter":14346176667,"data_next_reset":1532495173,"ip_addresses":["**********"],"rdns_api_available":true,"ptr":{"********":null},"suspended":false,"error":0,"veid":938308}'
j_son = json.loads(string)
pprint(j_son)  # pretty-prints the parsed result with indentation
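
Once the string is parsed, the result is an ordinary dict. Here is a short sketch of reading fields out of it and of guarding against a response that is not valid JSON. The shortened payload, the IP address and the helper name parse_or_none are made up for illustration.

```python
import json

# Shortened, made-up payload in the same shape as the string above.
raw = ('{"plan": "kvmv3-10g-512m-500m-ca-cn2", "plan_ram": 536870912, '
       '"suspended": false, "ip_addresses": ["10.0.0.1"]}')

data = json.loads(raw)                     # str -> dict
plan = data["plan"]                        # JSON string -> str
ram_mb = data["plan_ram"] // (1024 ** 2)   # JSON number -> int (bytes to MiB)
suspended = data["suspended"]              # JSON false -> False
first_ip = data["ip_addresses"][0]         # JSON array -> list


def parse_or_none(text):
    """Return the parsed object, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


print(plan, ram_mb, suspended, first_ip)
print(parse_or_none("<html>not json</html>"))  # None
```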

3. A general fix for common encoding problems

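import requests
from pprint import pprint

url = "http://www.baidu.com"
response = requests.get(url)
# Set the encoding explicitly before reading .text, otherwise the Chinese text may come out garbled
response.encoding = 'utf-8'
pprint(response.text)

Approximate result: the page's HTML is printed with its Chinese text readable rather than garbled.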
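
Why the explicit encoding matters can be shown without any network access. The sketch below works on made-up bytes and assumes the usual failure mode: the server sends UTF-8, but the client decodes it as ISO-8859-1 (which requests falls back to for text responses whose headers name no charset).

```python
# Bytes as a server would send them (UTF-8), made up for this example.
page_bytes = "百度一下，你就知道".encode("utf-8")

# Decoding with the wrong codec does not fail, it just produces garbage.
garbled = page_bytes.decode("iso-8859-1")

# Decoding with the right codec gives the original text back.
correct = page_bytes.decode("utf-8")

# Garbled text of this kind can be repaired by reversing the wrong decode.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(garbled)
print(correct)
print(repaired == correct)  # True
```

If the page's encoding is not known in advance, response.apparent_encoding in requests offers a guess based on the content; treat it as a guess.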

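4. logging

Why use logging:

I remember a joker once answering it like this: for a simple calculation, computing the result takes 1% of the time and print takes the other 99%.

Of course, you can write Python however you like, but a large project really should use logging to keep records. For a small project, print plus a file write is fine too.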

import logging
import requests

class Project(object):
    def __init__(self):
        # If some unimportant log records get in the way, e.g. the records for each URL
        # request made through requests, raise that logger's level with the line below
        # (on recent requests versions these records come from the "urllib3" logger)
        #logging.getLogger("requests").setLevel(logging.WARNING)
        self.logger = logging.getLogger()
        handler = logging.StreamHandler()
        formatter = logging.Formatter('%(asctime)s [%(threadName)s][%(levelname)s] %(message)s')
        handler.setFormatter(formatter)
        self.logger.addHandler(handler)
        self.logger.setLevel(logging.DEBUG)  # set the logging level here

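Logging at a specific step: whether you call debug, info or warning, pass the message as a string. Other objects are converted with str(), but a string, optionally a %-style format string followed by its arguments, is the usual form.

self.logger.debug("record some content here")  # not limited to debug: info and warning work the same way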
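
Here is a self-contained sketch of the same setup that can be run as is. It departs from the class above in a few labelled ways: it uses a named logger instead of the root logger, it skips adding a handler when one is already attached (otherwise each new instance would add another handler and every line would be printed several times), and it writes to an in-memory buffer so the output can be inspected. The names build_logger and spider_demo are made up.

```python
import io
import logging


def build_logger(name, stream, level=logging.DEBUG):
    """Return a named logger that writes '[LEVEL] message' lines to stream."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid attaching a second handler on repeated calls
        handler = logging.StreamHandler(stream)
        handler.setFormatter(logging.Formatter('[%(levelname)s] %(message)s'))
        logger.addHandler(handler)
    logger.propagate = False  # keep records out of the root logger's handlers
    return logger


buffer = io.StringIO()
log = build_logger("spider_demo", buffer)

log.debug("fetching %s", "http://www.baidu.com")  # %-style arguments are filled in lazily
log.info("parsed %d items", 3)
log.warning({"status": 503})                      # non-string messages go through str()

output = buffer.getvalue()
print(output)
```

Passing sys.stderr (or nothing, when calling logging.StreamHandler() directly) instead of the buffer gives the console output the original class produces.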

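5. Some commonly used string operations

Stripping whitespace and newlines from a string

string.strip()  # removes leading and trailing whitespace, including newlines

Splitting a string

# e.g. split on ':'
string.split(':')

Replacing within a string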

string = "w:w:m"
a=string.replace(':','/')

>>>a
'w/w/m'
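
The three operations are often chained when cleaning scraped text. The following is a small sketch on made-up input; it also shows two details that are easy to miss: strip() leaves inner whitespace alone, and both split() and replace() take an optional count.

```python
raw = "  name:horsun \n"

cleaned = raw.strip()                # 'name:horsun' (only the ends are trimmed)
key, value = cleaned.split(":", 1)   # split at the first ':' only

inner = " a b ".strip()              # 'a b' (the inner space stays)
parts = "10:30:45".split(":")        # ['10', '30', '45']
first_only = "w:w:m".replace(":", "/", 1)  # 'w/w:m' (replace just one occurrence)

print(key, value, inner, parts, first_only)
```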

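6. Python slicing

Why talk about slicing?
When a scraper collects information, the result is never going to be perfect; some records will always be incomplete.
For example:
The complete information we expect looks like this: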

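info = {
'title':'title_content',
'content':['1','2','3'],
'tags':['life','love'],
'author':['name:horsun','introduce:~~~~~~~~']  # e.g. author is a list of length 2
}

But what we actually get looks like this:

info = {
'title':'title_content',
'content':['1','2','3'],
'tags':['life','love'],
'author':['name:horsun']  # some content is missing, so the list has only one element
}

To get the introduction we would normally use info['author'][1]. With the second kind of data, though, info['author'][1] raises IndexError (list index out of range).
A slice never raises. info['author'][0:1] returns a list holding the element at index 0, that is, ['name:horsun'], and info['author'][1:2] returns an empty list when index 1 does not exist.
In info['author'][0:1], the 0 is the start position and the 1 is the end position, like a half-open interval in mathematics: [4,8) ----> 4,5,6,7
So we can check whether the element exists by testing the length of the slice: len(info['author'][1:2]) == 0 means it is missing.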


info = {
'title':'title_content',
'content':['1','2','3'],
'tags':['life','love'],
'author':['name:horsun']  # some content is missing, so the list has only one element
}
article = Article()  # Article is assumed to be a model class defined elsewhere
article.create(
title = info['title'],
content= info['content'],
tags= info['tags'],
author_name= info['author'][0] if len(info['author'][0:1]) != 0 else ' ',
author_intro= info['author'][1] if len(info['author'][1:2]) != 0 else ' ',
)

About the expression "X if len(info['author'][1:2]) != 0 else ' '": if len(info['author'][1:2]) != 0 holds, then author_intro is set to info['author'][1]; otherwise author_intro is set to ' '.
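
The slice check can be wrapped in a small helper so that it is not repeated for every field. This is only a sketch; the name item_or_default is made up, and it is meant for non-negative indexes.

```python
def item_or_default(items, index, default=' '):
    """Return items[index] if it exists, else default (index must be >= 0)."""
    found = items[index:index + 1]  # a slice never raises; it is [] when out of range
    return found[0] if found else default


complete = ['name:horsun', 'introduce:~~~~~~~~']
partial = ['name:horsun']

print(item_or_default(complete, 1))  # introduce:~~~~~~~~
print(item_or_default(partial, 1))   # prints a single space, the default
print(item_or_default([], 0, default='unknown'))  # unknown
```

An empty list is falsy in Python, so "if found" does the same job as comparing the length with 0.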

7. Some common ways around anti-scraping measures

Set headers, i.e. the request headers
-For most URLs, setting a UA (User-Agent) is enough, except for special cases where you have to follow the headers of the captured request exactly
Example:
headers2 = {
    'Host': 'jwxt.zwu.edu.cn',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Referer': 'http://jwxt.zwu.edu.cn/xs_main.aspx?xh=2014014701',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh,en;q=0.9,zh-CN;q=0.8',
}
time.sleep  # don't sleep for a fixed interval; randomize it, otherwise the site may still decide you are a bot (no human keeps time that precisely)
Example:

import time
import random

time.sleep(random.randint(5, 10))
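
Here is a slightly more reusable sketch of the random delay. random.uniform gives fractional seconds rather than whole ones, which looks a little less regular than randint. The name polite_sleep and its sleep parameter are made up; the parameter exists only so the demo below can run without actually waiting.

```python
import random
import time


def polite_sleep(min_seconds=5.0, max_seconds=10.0, sleep=time.sleep):
    """Sleep for a random, non-integer number of seconds and return the delay used."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Demo: record the delays instead of really sleeping.
waits = []
for _ in range(3):
    polite_sleep(5, 10, sleep=waits.append)

print(waits)
```

In a real crawl loop, call polite_sleep() with the defaults between requests.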

Set a proxy (the most efficient approach), using requests as an example

import requests
url = 'http://www.baidu.com'
proxies = {
    "http": "http://110.110.110.110:110"   # "protocol": "scheme://ip:port"
}
response = requests.get(url, proxies=proxies)
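
With more than one proxy available, a common pattern is to pick one at random per request. The sketch below only builds the proxies dict, so it runs without network access; the addresses are placeholders and the names PROXY_POOL, build_proxies and pick_proxies are made up.

```python
import random

# Placeholder addresses; replace them with proxies you actually control.
PROXY_POOL = ["110.110.110.110:110", "120.120.120.120:120"]


def build_proxies(address):
    """Build a requests-style proxies dict routing http and https through one proxy."""
    proxy_url = "http://" + address
    return {"http": proxy_url, "https": proxy_url}


def pick_proxies(pool, rng=random):
    """Choose one proxy from the pool at random."""
    return build_proxies(rng.choice(pool))


proxies = pick_proxies(PROXY_POOL)
print(proxies)
# Usage with requests (not executed here):
# requests.get(url, proxies=proxies, timeout=10)
```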

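Scrape through selenium (works on anything, but slow)
If you have proxies, who would use selenium?
And some requests that carry odd POST parameters
Keep the login session
Below is an example of odd POST parameters from ASP.NET
It is a scraper for the personal grades query page of a university academic affairs site. The page is written in ASP.NET.
--------------------
First log in through the login page, url1, to establish the session. Then request the grades query page (the page that has no data yet), url2, and finally request all grades, url3.
In principle, on an ordinary site, once you hold a logged-in session you can visit any page freely, so url2 loads normally. But when I tried to fetch all my grades by requesting url3, the request failed and the grade data I wanted did not appear.
Capturing the traffic showed that url2 and url3 are actually the same URL: url2 is a GET and url3 is a POST, and the POST to url3 sends a '__VIEWSTATE' and a '__VIEWSTATEGENERATOR' parameter.
Searching online showed that these two parameters are embedded in the HTML returned by the GET to url2 and are then sent along with the POST to url3. The server validates both parameters when it receives the request and only then responds.

`Note: the headers of these two requests are not exactly the same either. Be sure to send the headers exactly as they appear in the captured requests.`

Code snippet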

    # Needs at module level: import re; from lxml import html as lxml_html
    def get_response_data(self):
        """
        The first request to the target URL (self.url4) provides the parameters needed by the second one.
        The first request is a GET.
        The second request is a POST; the data it sends has to be taken from the HTML returned by the first request
        (the HTML returned by the GET contains the two key parameters __VIEWSTATE and __VIEWSTATEGENERATOR).
        :return:
        """
        response = self.session.get(url=self.url4.format(self.student_id),
                                    headers=self.headers2,
                                    )
        response.encoding = 'gb2312'
        html = response.text
        __VIEWSTATE = re.findall('name="__VIEWSTATE" value="(.*?)"', html)
        __VIEWSTATEGENERATOR = re.findall('name="__VIEWSTATEGENERATOR" value="(.*?)"', html)
        self.data = {
            '__VIEWSTATE': ''.join(__VIEWSTATE),
            '__VIEWSTATEGENERATOR': ''.join(__VIEWSTATEGENERATOR),
            'ddlXN': '',
            'ddlXQ': '',
            # URL-encoded GB2312 for the button label '按学期查询' (query by semester)
            'Button1': '%B0%B4%D1%A7%C6%DA%B2%E9%D1%AF', }

    def get_score(self):
        """
        Fetch the HTML of the target URL.
        After this the target page can be parsed,
        using regular expressions or XPath to extract the data.
        :return:
        """
        cookies = self.session.cookies
        response = self.session.post(url=self.url4.format(self.student_id),
                                     data=self.data,
                                     headers=self.score_headers,
                                     cookies=cookies
                                     )
        response.encoding = 'gb2312'
        html = response.text
        print(html)  # at this point we have all of the page's data ✌
        doc = lxml_html.fromstring(html)  # parsed tree, ready for XPath queries

Full code: https://github.com/helloworld19951213/get_my_school_scroe/blob/master/spider.py
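
The extraction step from the snippet can be tried on its own, without logging in anywhere. The sketch below runs against a made-up form; the field values and the helper names extract_hidden_field and build_post_data are invented for illustration, and the regex is a slightly more tolerant variant that allows other attributes between name and value.

```python
import re

# Made-up HTML in the shape an ASP.NET page returns on the first GET.
sample_html = '''
<form method="post">
<input type="hidden" name="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5Ozs+" />
<input type="hidden" name="__VIEWSTATEGENERATOR" value="9A3B1C2D" />
</form>
'''


def extract_hidden_field(html, name):
    """Return the value of the hidden input called name, or '' if it is absent."""
    pattern = r'name="%s"[^>]*?value="(.*?)"' % re.escape(name)
    match = re.search(pattern, html)
    return match.group(1) if match else ''


def build_post_data(html):
    """Collect the two fields the server checks before answering the POST."""
    return {
        '__VIEWSTATE': extract_hidden_field(html, '__VIEWSTATE'),
        '__VIEWSTATEGENERATOR': extract_hidden_field(html, '__VIEWSTATEGENERATOR'),
    }


data = build_post_data(sample_html)
print(data)
```

The resulting dict would then be merged with the remaining form fields and sent through the same session that performed the GET.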

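8. Incremental crawling / deduplication (repeat crawls) / update crawling

The main idea behind deduplication and incremental crawling is to keep the data that has already been crawled in a list, and to decide whether to crawl something by checking whether its url, or some other parameter, is already in that list.

For details, see my other article: https://blog.csdn.net/qq_33042187/article/details/78929834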
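
Here is a minimal sketch of that idea. It swaps the list for a set, since membership checks on a set stay fast as the number of crawled URLs grows. The name crawl_new, the example URLs and the fake fetch function are all made up.

```python
def crawl_new(urls, seen, fetch):
    """Fetch only the URLs not in seen; record them and return what was fetched."""
    fetched = []
    for url in urls:
        if url in seen:  # already crawled: skip it
            continue
        fetch(url)
        seen.add(url)
        fetched.append(url)
    return fetched


seen = set()
visited = []  # stands in for real downloading

first_run = crawl_new(
    ["http://example.com/a", "http://example.com/b", "http://example.com/a"],
    seen, visited.append)
second_run = crawl_new(
    ["http://example.com/b", "http://example.com/c"],
    seen, visited.append)

print(first_run)   # ['http://example.com/a', 'http://example.com/b']
print(second_run)  # ['http://example.com/c']
```

For the crawl to stay incremental across runs, the seen set has to be saved somewhere (a file or a database) and loaded again at start-up.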
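-------------------- divider --------------------

To be continued 2018/7/4

TODO
-> ~~Effective ways to deal with common anti-scraping measures~~ 7.5
->~~Data deduplication~~
->~~Incremental crawling~~
->Multithreaded crawlers, and multithreading + queues for communication between threads
->Some packet-capture techniques
To be continued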
