Python Web Scraping Review

Time to get back on the Python path, and hopefully I can find the feel for it again quickly. Lately it has been all Java, JSP, SSH, and DB work.

The library I use most is requests: calling requests.get returns a response object, a concept familiar from JSP.

Of the parameters that get accepts, timeout, proxies, headers, cookies, and verify are the ones I have used.
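As a rough sketch of how those five keyword arguments fit together in one call: the proxy address, cookie value, User-Agent string, and the names REQUEST_OPTIONS and fetch below are placeholders invented for illustration, not values from this post.

```python
import requests

# Hypothetical values, for illustration only.
REQUEST_OPTIONS = {
    "timeout": 10,                                  # seconds to wait before giving up
    "headers": {"User-Agent": "Mozilla/5.0"},       # extra HTTP headers to send
    "cookies": {"sessionid": "placeholder"},        # sent as a Cookie header
    "proxies": {"https": "http://127.0.0.1:8888"},  # route https traffic through a proxy
    "verify": True,                                 # check the server's TLS certificate
}


def fetch(url, **overrides):
    """GET `url` with the default options; the caller may override any of them."""
    options = {**REQUEST_OPTIONS, **overrides}
    return requests.get(url, **options)
```

Something like fetch("https://example.com", timeout=5) would then return a response object with the shorter timeout applied.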

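Attributes and methods of the response object

text: the body decoded to a string; use it for textual content.
content: the raw bytes; use it for images, files, and other binary data.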
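To make the difference concrete: text is essentially content decoded with the response's encoding. The sketch below mimics that relationship with plain bytes instead of a live response, so it runs offline; the helper names as_text and save_binary and the sample bytes are assumptions made for the example.

```python
import os
import tempfile


def as_text(content: bytes, encoding: str = "utf-8") -> str:
    """Decode raw bytes to a string, replacing any undecodable bytes."""
    return content.decode(encoding, errors="replace")


def save_binary(content: bytes, path: str) -> int:
    """Write raw bytes to disk unchanged, as you would for an image or a file."""
    with open(path, "wb") as fh:
        return fh.write(content)


body = "閱讀數：40927".encode("utf-8")   # stand-in for response.content
print(type(body), type(as_text(body)))   # bytes versus str

with tempfile.TemporaryDirectory() as tmp:
    written = save_binary(body, os.path.join(tmp, "demo.bin"))
print(written, "bytes written")
```

With a real response, the equivalent would be writing res.content in "wb" mode and reading res.text for parsing.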

hashlib: an introduction to digest algorithms

Python's hashlib module provides common digest algorithms such as MD5 and SHA1.

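What is a digest algorithm? Also called a hash algorithm, it uses a function to convert data of arbitrary length into a fixed-length string (usually written in hexadecimal).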
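The fixed-length property is easy to check. A minimal sketch, where the helper name digest_lengths is made up for this example:

```python
import hashlib


def digest_lengths(data: bytes) -> tuple:
    """Return the hex-digest lengths of MD5 and SHA1 for the given bytes."""
    return (len(hashlib.md5(data).hexdigest()),
            len(hashlib.sha1(data).hexdigest()))


# Inputs of very different sizes all give digests of the same length.
for data in (b"", b"a", b"a" * 10_000):
    print(len(data), digest_lengths(data))
```

MD5 produces 128 bits (32 hex digits) and SHA1 produces 160 bits (40 hex digits), whatever the input size.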

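For example, suppose you wrote an article whose content is the string 'how to use python hashlib - by Michael', and you published its digest, '2d73d4f15c0db7f5ecb321b6a65e5d6d', alongside it. If someone tampers with the article and republishes it as 'how to use python hashlib - by Bob', you can immediately show that Bob altered it, because the digest computed from 'how to use python hashlib - by Bob' differs from the digest of the original.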

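In short, a digest algorithm applies a digest function f() to data of arbitrary length and produces a fixed-length digest. Its purpose is to reveal whether the original data has been tampered with.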

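A digest algorithm can reveal tampering because the digest function is one-way: computing f(data) is easy, but recovering data from digest is very hard. Moreover, changing even a single bit of the original data produces a completely different digest.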
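The tampering check described above can be sketched in a few lines. The helper names md5_hex and is_unmodified are invented for this example, and MD5 is used only because it is the algorithm this post works with.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the given bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


def is_unmodified(data: bytes, expected_digest: str) -> bool:
    """Recompute the digest and compare it with the published one."""
    return md5_hex(data) == expected_digest


original = b"how to use python hashlib - by Michael"
tampered = b"how to use python hashlib - by Bob"
published_digest = md5_hex(original)

print(is_unmodified(original, published_digest))  # True
print(is_unmodified(tampered, published_digest))  # False

# b"alice" and b"alicd" differ in a single bit (0x65 versus 0x64 in the last byte),
# yet their digests share almost nothing.
a, b = md5_hex(b"alice"), md5_hex(b"alicd")
differing = sum(x != y for x, y in zip(a, b))
print(differing, "of", len(a), "hex digits differ")
```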

import hashlib


def get_MD5(st="alice"):
    # Hash the UTF-8 bytes of the string and print the hex digest.
    md5 = hashlib.md5()
    md5.update(st.encode(encoding="utf-8"))
    print(md5.hexdigest())


get_MD5()
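Because update can be called repeatedly, the same pattern extends to data that arrives in pieces, such as a large downloaded file. The following is a sketch under that assumption; md5_of_file and its chunk_size parameter are names chosen for this example.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str, chunk_size: int = 8192) -> str:
    """Hash a file piece by piece so it never has to fit in memory at once."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


payload = b"alice" * 5000
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.bin")
    with open(sample, "wb") as fh:
        fh.write(payload)
    chunked_digest = md5_of_file(sample, chunk_size=1024)

# Feeding the data in chunks gives the same digest as hashing it in one call.
print(chunked_digest == hashlib.md5(payload).hexdigest())  # True
```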

Handling proxies and headers

import random

import requests


def get_html(url):
    headers = {'Accept': '*/*',
               'Accept-Language': 'en-US,en;q=0.8',
               'Cache-Control': 'max-age=0',
               'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36',
               'Connection': 'keep-alive',
               'Referer': 'http://www.baidu.com/'
               }
    proxy = [
        # The source listed many entries here, but the password and host of
        # each were obfuscated on extraction, leaving identical lines.
        # Substitute real 'user:password@host:port' proxy URLs.
        {'https': 'http://yx827w:[email protected]:888'},
    ]
    pro = random.choice(proxy)  # pick one proxy at random for this request
    print(type(pro))
    print(pro)
    res = requests.get(url, headers=headers, proxies=pro)
    html = res.text  # the response body as a string
    print(html)
    return html
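A single random proxy may be dead, so one possible extension is to retry with another random pick when a request fails. This is only a sketch: fetch_with_rotation, its attempts and timeout parameters, and the idea of passing the getter in as an argument are assumptions made so the example can be exercised without network access.

```python
import random


def fetch_with_rotation(url, proxies, get, headers=None, attempts=3, timeout=10):
    """Return the page text, choosing a random proxy afresh on every attempt.

    `get` is any callable that accepts the same keyword arguments as
    requests.get; in real use you would pass requests.get itself.
    """
    last_error = None
    for _ in range(attempts):
        proxy = random.choice(proxies)
        try:
            response = get(url, headers=headers, proxies=proxy, timeout=timeout)
            response.raise_for_status()  # treat 4xx and 5xx answers as failures
            return response.text
        except Exception as exc:  # with requests, requests.RequestException is narrower
            last_error = exc
    raise RuntimeError(f"all {attempts} attempts failed for {url}") from last_error
```

In the function above, this would replace the single requests.get call with fetch_with_rotation(url, proxy, requests.get, headers=headers).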

XPath

1.0 Use the HTML function of etree to parse the page; it returns a node (element) object.

from lxml import etree
html=get_html("https://blog.csdn.net/u014595019/article/details/51884529")
print(html)
page=etree.HTML(html)
print(type(page),page)
xp='//*[@id="mainBox"]/main/div[1]/div/div/div[2]/div[1]/span[2]'
readnum=page.xpath(xp)

for a in readnum:
    print(a.attrib)
    print(a.text)
    print(a.get("class"))

The output is as follows:

<class 'lxml.etree._Element'> <Element html at 0x47a7288>
{'class': 'read-count'}
閱讀數：40927
read-count
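The absolute path used above depends on the exact page layout at the time of writing. Selecting by attribute is often less brittle. Here is a sketch on an inline HTML snippet invented for this example, reusing only the read-count class name seen in the output:

```python
from lxml import etree

# A made-up fragment standing in for the downloaded page.
SAMPLE_HTML = """
<html><body>
  <div id="mainBox">
    <span class="title">demo</span>
    <span class="read-count">閱讀數：40927</span>
  </div>
</body></html>
"""

page = etree.HTML(SAMPLE_HTML)

# xpath() returns a list, even when only one node matches.
spans = page.xpath('//span[@class="read-count"]')
for span in spans:
    print(span.tag, span.get("class"), span.text)

# Ending the expression in text() or @attr yields strings instead of elements.
texts = page.xpath('//span[@class="read-count"]/text()')
classes = page.xpath('//div[@id="mainBox"]/span/@class')
print(texts, classes)
```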

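References: 摘要算法簡介 (Introduction to digest algorithms); 學習lxml解析html兩小時後總結 (Summary after two hours of learning to parse HTML with lxml)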
