Data Scraping: Basic Concepts

Original

2020-06-14 02:20

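A First Look at HTML

I am new to this area myself. HTML (HyperText Markup Language) is a language for describing web pages: the pages we see rendered in a browser are, behind the scenes, text written in HTML. The key building block is the HTML tag, a keyword enclosed in angle brackets. Tags usually come in pairs, such as <td> and </td> in the sample file below, marking where that element starts and ends. To scrape data, we generally only need to locate the tags that hold what we want and extract their contents. The finer details probably call for front-end knowledge, so I will refine this section over time. For now, let's practice on a small example found online.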

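# -*- coding: utf-8 -*-
# First import urllib.request; there are many other libraries for opening URLs that you can try
import urllib.request
# Open the target page with the library's urlopen function
response = urllib.request.urlopen(r'file:///Users/herenyi/Downloads/6/6.1/html.html')
# Then call the read() method to get the page source
html = response.read()
# A very simple practice HTML file: \r is a carriage return (back to the start of the line), \n is a line feed (move to the next line), and \t is a tab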
html
b'<html>\r\n\t<body>\r\n\t\t<table>\r\n\t\t\t<tr><td>name</td><td>age</td></tr>\r\n\t\t\t<tr><td>Ken</td><td>28</td></tr>\r\n\t\t\t<tr><td>John</td><td>30</td></tr>\r\n\t\t</table>\r\n\t</body>\r\n</html>\r\n'
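
Note that read() returns bytes rather than str, which is why the output above starts with b'...'. The following is a minimal sketch of turning those bytes into text. It uses a shortened literal copy of the bytes above instead of the file, and it assumes the content is UTF-8 (the sample is plain ASCII, so that assumption is harmless here).

```python
# A shortened copy of the bytes shown above (assumption: UTF-8 encoded).
raw = b'<html>\r\n\t<body>\r\n\t\t<table>\r\n\t\t</table>\r\n\t</body>\r\n</html>\r\n'

# decode() converts bytes to str, so the b'' prefix disappears.
text = raw.decode('utf-8')

# splitlines() breaks the text at each \r\n; strip() removes the leading tabs.
tags = [line.strip() for line in text.splitlines()]
print(tags)
```

Working with str instead of bytes makes later string operations, such as searching or splitting, behave as expected.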

Once the page has been read, it is BeautifulSoup's turn to shine.

from bs4 import BeautifulSoup
# Name the parser explicitly; without it bs4 guesses one and emits a warning
soup = BeautifulSoup(html, 'html.parser')
soup
Out[197]: 
<html>
<body>
<table>
<tr><td>name</td><td>age</td></tr>
<tr><td>Ken</td><td>28</td></tr>
<tr><td>John</td><td>30</td></tr>
</table>
</body>
</html>
# The output is much cleaner now, isn't it?
# Find all tr tags
soup.find_all('tr')
Out[199]: 
[<tr><td>name</td><td>age</td></tr>,
 <tr><td>Ken</td><td>28</td></tr>,
 <tr><td>John</td><td>30</td></tr>]

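A First Look at JSON

JSON is a lightweight data-interchange format; the name stands for JavaScript Object Notation. As the name suggests, its syntax comes from JavaScript object literals, although the format itself is language-independent. Let's type through an example first and look into the details later.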

# As before, import the library, open the file, and call the read() method
import urllib.request
response = urllib.request.urlopen(r'file:///Users/herenyi/Downloads/6/6.2/json.json')
jsonString = response.read()
# The raw content looks like this
jsonString
Out[202]: b'{\r\n\t"employees": [\r\n\t\t{ "firstName":"Bill" , "lastName":"Gates" },\r\n\t\t{ "firstName":"George" , "lastName":"Bush" },\r\n\t\t{ "firstName":"Thomas" , "lastName":"Carter" }\r\n\t]\r\n}\r\n'
# Then import the json library
import json
# Decode the bytes to str, then parse the text with the json library's loads function
jsonObject = json.loads(jsonString.decode())
# The parsed result
jsonObject
Out[206]: 
{'employees': [{'firstName': 'Bill', 'lastName': 'Gates'},
  {'firstName': 'George', 'lastName': 'Bush'},
  {'firstName': 'Thomas', 'lastName': 'Carter'}]}
# It can now be accessed just like an ordinary dictionary

jsonObject['employees']
Out[207]: 
[{'firstName': 'Bill', 'lastName': 'Gates'},
 {'firstName': 'George', 'lastName': 'Bush'},
 {'firstName': 'Thomas', 'lastName': 'Carter'}]

jsonObject['employees'][0]
Out[208]: {'firstName': 'Bill', 'lastName': 'Gates'}

jsonObject['employees'][0]['lastName']
Out[209]: 'Gates'
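
Because the parsed result is made of ordinary dicts and lists, it can also be looped over, and json.dumps converts it back to JSON text. The following is a minimal sketch; the JSON is written inline (same structure as the sample file) so that it runs on its own, and the variable names are only illustrative.

```python
import json

# Same structure as the sample file above, written inline for this sketch.
json_string = (
    '{"employees": ['
    '{"firstName": "Bill", "lastName": "Gates"},'
    '{"firstName": "George", "lastName": "Bush"},'
    '{"firstName": "Thomas", "lastName": "Carter"}'
    ']}'
)

# loads: JSON text -> Python dicts and lists.
data = json.loads(json_string)

# Each employee is a plain dict, so ordinary comprehensions work.
full_names = [f"{e['firstName']} {e['lastName']}" for e in data['employees']]
print(full_names)

# dumps: Python object -> JSON text (indent only affects readability).
round_trip = json.dumps(data, indent=2)
print(round_trip)
```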

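A First Look at Crawlers

What is a crawler? A crawler is a program that uses tools to fetch data from websites.
The general approach to designing a crawler:

First, determine the URL of the page to crawl
Fetch the corresponding HTML page over HTTP
Extract the useful data from the HTML page and save it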
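
The three steps above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not a production crawler: a temporary local file stands in for a real website (an http:// or https:// URL would go through the same urlopen call), and the names CellCollector, crawl, page.html and rows.csv are hypothetical, chosen for this example.

```python
import csv
import tempfile
import urllib.request
from html.parser import HTMLParser
from pathlib import Path


class CellCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])          # start a new row
        elif tag == 'td' and self.rows:
            self._in_cell = True
            self.rows[-1].append('')      # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # accumulate text inside the cell


def crawl(url, out_path):
    # Step 2: fetch the page (assumption: it is UTF-8 encoded).
    with urllib.request.urlopen(url) as response:
        html = response.read().decode('utf-8')
    # Step 3: extract the table cells and save them as CSV.
    parser = CellCollector()
    parser.feed(html)
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows(parser.rows)
    return parser.rows


# Step 1: decide on the URL. Here a temporary file plays the role of the site.
work_dir = Path(tempfile.mkdtemp())
page = work_dir / 'page.html'
page.write_text(
    '<html><body><table>'
    '<tr><td>name</td><td>age</td></tr>'
    '<tr><td>Ken</td><td>28</td></tr>'
    '<tr><td>John</td><td>30</td></tr>'
    '</table></body></html>',
    encoding='utf-8',
)
rows = crawl(page.as_uri(), work_dir / 'rows.csv')
print(rows)
```

A real crawler would also need error handling for failed requests and would have to respect the target site's terms; those parts are left out to keep the three steps visible.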
