A Simple Web Crawler Example in Python

Original

2018-12-07 23:39

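I had always thought "web crawler" sounded impressive, so I wanted to try writing one. After a quick search on Baidu it turned out to be less difficult than I expected: even a beginner can write a simple crawler.
A simple crawler has two parts: downloading the HTML and parsing the HTML document. The example below crawls the jokes on the home page of Qiushibaike (糗事百科) and is offered as a reference.
Environment: VS Code, Python 2.7
Required packages: requests and BeautifulSoup (installed as beautifulsoup4 and imported as bs4). The 'lxml' parser used in the second example also requires the lxml package.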

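How to install a package:
Open cmd.
Run python -m pip install XXX (where XXX is the package name) to install it.

To check: type python and press Enter, then run
import XXX (where XXX is the import name). If no error is reported, the package is installed and can be imported.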
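The manual check above can also be scripted. The following is a minimal sketch, not part of the original post: the helper name is_importable is hypothetical, and the list of names assumes the three imports this article relies on (requests, bs4, lxml). It uses only the standard library and is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import importlib


def is_importable(name):
    """Return True if the module `name` can be imported, else False."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


# Import names assumed for this article; adjust to your own needs.
for name in ("requests", "bs4", "lxml"):
    status = "OK" if is_importable(name) else "missing"
    print("%s: %s" % (name, status))
```

Any name reported as missing can be installed with the python -m pip install command shown above. Note that the import name bs4 corresponds to the pip package beautifulsoup4.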
The following code fetches the content of the whole page:

# -*- coding=utf-8 -*-
import requests
from bs4 import BeautifulSoup
# Fetch the HTML document
def get_html(url):
    """get the content of the url"""
    response = requests.get(url)
    response.encoding = 'utf-8'
    return response.text

url_joke = "https://www.qiushibaike.com"
html = get_html(url_joke)
print(html)

The output above is the same as right-clicking the page in a browser and choosing "View Source".
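The download step can be made a little more defensive. The sketch below is an illustration rather than the original author's code: the function name get_html_safe, the 10-second timeout, the User-Agent string, and the optional session parameter (added only so the function can be exercised without a network) are all assumptions. Some sites reject requests that lack a browser-like User-Agent; the original post does not say whether this site does.

```python
# -*- coding: utf-8 -*-
import requests


def get_html_safe(url, timeout=10, session=None):
    """Return the page text, or None if the request fails."""
    # Use the given session-like object if provided, else the requests module.
    http = session or requests
    # Hypothetical User-Agent value, chosen only for illustration.
    headers = {"User-Agent": "Mozilla/5.0 (simple-crawler-example)"}
    try:
        response = http.get(url, headers=headers, timeout=timeout)
        # Raise an exception for 4xx/5xx status codes.
        response.raise_for_status()
    except requests.RequestException as error:
        print("request failed: %s" % error)
        return None
    response.encoding = 'utf-8'
    return response.text
```

Calling get_html_safe("https://www.qiushibaike.com") returns the page source as text when the request succeeds and None when it fails, so the caller can check for None before parsing.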
To output only specific content, you also need the BeautifulSoup package. The example below prints a joke:

# -*- coding=utf-8 -*-

import requests
from bs4 import BeautifulSoup

# Fetch the HTML document
def get_html(url):
    """get the content of the url"""
    response = requests.get(url)
    response.encoding = 'utf-8'
    return response.text

# Extract a joke
def get_certain_joke(html):
    """get the joke of the html"""
    soup = BeautifulSoup(html, 'lxml')
    joke_content = soup.select('div.content')[0].get_text()
    return joke_content

url_joke = "https://www.qiushibaike.com"
html = get_html(url_joke)
joke_content = get_certain_joke(html)
print(joke_content)

Code walkthrough:
joke_content = soup.select('div.content')[0].get_text()
prints the first joke.
To print the second joke, change the code to:
joke_content = soup.select('div.content')[1].get_text()
To print the page title instead, change it to:
joke_content = soup.select('title')[0].get_text()
which prints: 糗事百科 - 超搞笑的原創糗事笑話分享社區
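Since select() returns a list of every matching element, the indexing shown above extends naturally to a loop over all jokes. The following is a small sketch under stated assumptions: SAMPLE_HTML is made-up markup standing in for the real page, get_all_jokes is a hypothetical helper name, and the built-in 'html.parser' is used in place of 'lxml' so that the example needs no extra parser package.

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Made-up HTML standing in for the downloaded page (assumption).
SAMPLE_HTML = """
<html><head><title>Sample page</title></head>
<body>
<div class="content"><span>First joke</span></div>
<div class="content"><span>Second joke</span></div>
<div class="author">Not a joke</div>
</body></html>
"""


def get_all_jokes(html):
    """Return the text of every <div class="content"> element."""
    soup = BeautifulSoup(html, 'html.parser')
    # select() returns a list, which is empty when nothing matches.
    return [div.get_text().strip() for div in soup.select('div.content')]


jokes = get_all_jokes(SAMPLE_HTML)
for number, joke in enumerate(jokes, 1):
    print("%d. %s" % (number, joke))
```

Looping this way also avoids the IndexError that select(...)[0] raises when the selector matches nothing, for example if the site changes its markup.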

Reference: https://blog.csdn.net/fujianjun6/article/details/72979643/
