[Python] Reading cnblogs Posts from the Command Line

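A Python script fetches the paginated post list from cnblogs and extracts each post's title, summary, and link, so that the posts can be read comfortably in a terminal.

Reading this article will familiarize you with how a typical crawler works and with building a command-driven interactive interface.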

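1. Notes

Environment: Windows 10 / Python 3.5 (Windows 10 users may want to try Windows Terminal Preview; it is really quite good).

Main modules: requests (sends HTTP requests), lxml.etree (builds the DOM tree and runs XPath queries), sys (command-line arguments, replacing standard output, and so on), os (system operations such as clearing the screen).

Caveats: time was short, so the script has not been tested thoroughly; if you run into a problem, please report it here. All English commands are case-insensitive.

Planned: later versions will add viewing the full post, opening it in a browser, and more.

Article URL: https://www.cnblogs.com/reader/p/11487398.html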

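2. Interactive Screens

2.1 Start

Run the script directly and the menu shown in the figure appears. Enter 1, 2, or Q to choose an action.

2.2 Start reading

As shown in the figure above, entering 1 opens the reading screen, with the current page number at the top. More commands are available here: N (Next) for the next page, B (Back) for the previous page, H (Home) for the first page, Q (Quit) to return to the main menu, and D {num} (Detail) to show the summary and link of the post whose number appears before its title.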
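The command handling described above can be sketched as a small pure function that is easy to test apart from the screen drawing. This is a minimal sketch, not the script's own code: the names `parse_command` and `next_page` are hypothetical, and it assumes the command letters N, B, H, Q, and D {num} listed above.

```python
def parse_command(raw):
    """Turn raw input into (action, argument); action is None when invalid."""
    parts = raw.strip().split()
    if not parts:
        return None, None
    action = parts[0].upper()          # commands are case-insensitive
    if action in ("N", "B", "H", "Q") and len(parts) == 1:
        return action, None
    if action == "D" and len(parts) == 2 and parts[1].isdecimal():
        return action, int(parts[1])   # D {num}: the number shown before the title
    return None, None


def next_page(page, action):
    """Apply a paging command; the page number never drops below 1."""
    if action == "N":
        return page + 1
    if action == "B":
        return max(1, page - 1)
    if action == "H":
        return 1
    return page


print(parse_command("d 3"))
print(next_page(1, "B"))
```

Keeping parsing separate from the loop means that invalid input such as `D abc` is rejected in one place instead of raising an error deeper in the program.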

2.3 Reading the summary and link

As shown in the figure, the lines show the title, the link, and the summary, in that order.

3. Code Walkthrough

3.1 Approach

Fetch the page --> parse the page content --> collect the needed information
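The three steps can be sketched end to end with only the standard library. This is an illustration under stated assumptions rather than the script's implementation: the sample markup is made up, the fetch step is stubbed so the sketch runs offline, and `fetch`, `LinkCollector`, and `collect` are hypothetical names. The script itself uses requests and lxml instead.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the real cnblogs page is larger and may differ.
SAMPLE_HTML = """
<div id="post_list">
  <div class="post_item_body"><h3><a href="https://example.com/a">First post</a></h3></div>
  <div class="post_item_body"><h3><a href="https://example.com/b">Second post</a></h3></div>
</div>
"""


def fetch(page):
    """Step 1: fetch the page (stubbed here so the sketch runs offline)."""
    return SAMPLE_HTML


class LinkCollector(HTMLParser):
    """Step 2: walk the HTML and record every <a> found inside an <h3>."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.current = None   # entry being filled while inside <h3><a>
        self.entries = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
        elif tag == "a" and self.in_h3:
            self.current = {"title": "", "link": dict(attrs).get("href", "")}

    def handle_data(self, data):
        if self.current is not None:
            self.current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.entries.append(self.current)
            self.current = None
        elif tag == "h3":
            self.in_h3 = False


def collect(page):
    """Step 3: keep only the needed fields, keyed by the number shown in the list."""
    parser = LinkCollector()
    parser.feed(fetch(page))
    return {k: entry for k, entry in enumerate(parser.entries, start=1)}


posts = collect(1)
for k, post in posts.items():
    print(k, post["title"], post["link"])
```

Keying the result by list number mirrors how the D {num} command later looks up a post.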

3.2 Data collection

The code below collects data by page number. Note how the summary is obtained:

li.xpath('p')[0].xpath('string(.)')

    def download(self, page):
        """Download the HTML of the given list page."""
        self.set_target_url(page)
        response = requests.get(self.target_url, headers=self.headers)
        if response.status_code == 200:
            return response.content
        else:
            print("download fail")
            return ""

    def parse(self, content):
        """Parse the HTML and print the numbered list of titles."""
        if not content:
            # download() returns "" on failure; there is nothing to parse
            return
        html = etree.HTML(content)
        lists = html.xpath('//div[@id="post_list"]//div[@class="post_item_body"]')

        del html
        self.lists = {}  # drop entries left over from the previous page
        k = 1
        print('+', '--' * 50, '+')
        # The original padded its Chinese label with ljust(95); an ASCII label needs all 100 columns
        print('|', ("Current page: " + str(self.page)).ljust(100), '|')
        print('+', '--' * 50, '+')
        for li in lists:
            title = str(li.xpath('h3/a/text()')[0])
            link = li.xpath('h3/a/@href')[0]
            # string(.) joins all text inside <p>, including text in nested tags
            desc = li.xpath('p')[0].xpath('string(.)')

            self.lists[k] = {
                'title': title,
                'desc': desc.strip(),
                'link': link
            }

            print('|', k, self.formatByWidth(title, 100-1-len(str(k))), '|')
            k += 1
        del lists
        print('+', '--' * 50, '+')
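The reason for reading the summary with `string(.)` is that the XPath `string()` function returns the concatenated text of an element and all of its descendants, whereas `text()` selects only the text nodes that are direct children. The sketch below shows the same distinction with the standard library's `xml.etree.ElementTree`, where `itertext()` plays the role of `string(.)`. It is a stand-in for illustration, not the lxml call used in the script, and the sample paragraph is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical summary paragraph: an avatar link followed by text with nested markup.
p = ET.fromstring(
    '<p><a href="#"><img src="avatar.png"/></a> A summary with <b>bold</b> text. </p>'
)

# Only the text directly before the first child element; here there is none.
print(repr(p.text))

# All text inside <p> in document order, like XPath string(.)
full = "".join(p.itertext())
print(repr(full.strip()))
```

Reading only the direct text would lose the word inside `<b>`, which is why the whole subtree is flattened and then stripped.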

4. Source Code

Note: the code is for learning only; any other use is prohibited. Please cite the original URL when reposting.

# -*- coding:UTF-8 -*-
import requests
from lxml import etree
import sys
import io
import os


# Re-wrap stdout as gb18030, presumably so Chinese titles print on a Chinese-locale Windows console
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')


class CnBlogs:
    """
        Auth: reader
        Published at: https://www.cnblogs.com/reader/p/11487398.html
        Author page: https://www.cnblogs.com/reader
    """
    def __init__(self):
        self.headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}

        self.target_domain = "https://www.cnblogs.com"
        self.page = 1
        self.lists = {}

    def clearscreen(self):
        """Clear the screen with the command that matches the operating system."""
        # "cls" is the Windows command; other systems use "clear"
        os.system("cls" if os.name == "nt" else "clear")

    def set_target_url(self, page):
        """Page 1 is the site home page; later pages live under /sitehome/p/."""
        if page == 1:
            self.target_url = self.target_domain
        else:
            self.target_url = 'https://www.cnblogs.com/sitehome/p/'+str(page)

    def download(self, page):
        """Download the HTML of the given list page."""
        self.set_target_url(page)
        response = requests.get(self.target_url, headers=self.headers)
        if response.status_code == 200:
            return response.content
        else:
            print("download fail")
            return ""

    def isascii(self, ch):
        return ch <= u'\u007f'

    def formatByWidth(self, text, width):
        """Pad text with spaces to a display width, counting each non-ASCII character as two columns."""
        count = 0
        for u in text:
            if not self.isascii(u):
                count += 1
        return text + " " * (width - count - len(text))

    def parse(self, content):
        """Parse the HTML and print the numbered list of titles."""
        if not content:
            # download() returns "" on failure; there is nothing to parse
            return
        html = etree.HTML(content)
        lists = html.xpath('//div[@id="post_list"]//div[@class="post_item_body"]')

        del html
        self.lists = {}  # drop entries left over from the previous page
        k = 1
        print('+', '--' * 50, '+')
        # The original padded its Chinese label with ljust(95); an ASCII label needs all 100 columns
        print('|', ("Current page: " + str(self.page)).ljust(100), '|')
        print('+', '--' * 50, '+')
        for li in lists:
            title = str(li.xpath('h3/a/text()')[0])
            link = li.xpath('h3/a/@href')[0]
            # string(.) joins all text inside <p>, including text in nested tags
            desc = li.xpath('p')[0].xpath('string(.)')

            self.lists[k] = {
                'title': title,
                'desc': desc.strip(),
                'link': link
            }

            print('|', k, self.formatByWidth(title, 100-1-len(str(k))), '|')
            k += 1
        del lists
        print('+', '--' * 50, '+')

    def descopt(self, k):
        """Show the title, link, and summary of entry k."""
        try:
            k = int(k)
        except ValueError:
            # the argument was not a number, e.g. "D abc"
            return
        if k not in self.lists:
            return
        self.clearscreen()
        print('+', '--' * 50, '+')
        print('|', self.formatByWidth(self.lists[k]['title'], 100), '|')
        print('+', '--' * 50, '+')
        print('|', self.formatByWidth(self.lists[k]['link'], 100), '|')
        print('+', '--' * 50, '+')

        print('|', self.formatByWidth(self.lists[k]['desc'], 100), '|')

        print('+', '--' * 50, '+')
        input("Press Enter to return...\r\n")

    def readopt(self):
        """Reading loop: show one page of titles, then handle a command."""
        while True:
            self.clearscreen()
            print("\r\n")
            html = self.download(page=self.page)
            self.parse(html)

            print("[N]: next page, [B]: previous page, [H]: first page, [D {num}]: summary, [Q]: back")

            cmd = input("Enter a command [N, B, H, D, Q]: ")

            if cmd == 'Q' or cmd == 'q':    # back to the main menu
                break
            elif cmd == 'N' or cmd == 'n':  # next page
                self.page += 1
            elif cmd == 'B' or cmd == 'b':  # previous page
                self.page -= 1
                if self.page <= 0:
                    self.page = 1
            elif cmd == 'H' or cmd == 'h':  # first page
                self.page = 1
            else:
                # split() without arguments tolerates repeated spaces
                cmd = cmd.split()

                if len(cmd) != 2:
                    continue
                # show the summary
                if cmd[0] == 'D' or cmd[0] == 'd':
                    self.descopt(cmd[1])

    def aboutopt(self):
        """Show the author's cnblogs address."""
        self.clearscreen()
        print("cnblogs address: https://www.cnblogs.com/reader\r\n")
        input("Press Enter to return...\r\n")

    def start(self):
        """Main menu loop."""
        self.clearscreen()
        while True:
            print('+', '--'*50, '+')
            # The original centered Chinese labels in 88 or 95 columns; ASCII labels need all 100
            print('|', "Welcome to the cnblogs reader (by reader)".center(100), '|')
            print('+', '--' * 50, '+')
            print('|', "[1]: Start reading".center(100), '|')
            print('|', "[2]: About the author".center(100), '|')
            print('|', "[Q]: Quit".center(100), '|')
            print('+', '--' * 50, '+')

            cmd = input("Enter a command [1, 2, Q]: ")
            if cmd == '1':
                self.readopt()
            elif cmd == '2':
                self.aboutopt()
            elif cmd == 'Q' or cmd == 'q':
                break

            self.clearscreen()

        print("Exited. Thanks for using the reader!")


if __name__ == "__main__":
    obj = CnBlogs()
    obj.start()
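The `formatByWidth` method above keeps the right-hand border aligned by counting every non-ASCII character as two columns. That holds for Chinese text but overcounts characters such as accented Latin letters. A possible refinement, offered only as a sketch and not part of the original script, is to ask the standard library's `unicodedata.east_asian_width` which characters are actually wide. The names `display_width` and `pad_to_width` are hypothetical.

```python
import unicodedata


def display_width(text):
    """Count terminal columns, treating wide (W) and fullwidth (F) characters as 2."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


def pad_to_width(text, width):
    """Right-pad text with spaces until it fills `width` columns."""
    return text + " " * max(0, width - display_width(text))


print(display_width("abc"))
print(display_width("博客園"))
print(repr(pad_to_width("博客", 10)))
```

The `max(0, ...)` guard means a title longer than the box is left as it is. Truncating long titles would be a separate step.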
