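Preface: While working through the decision-tree example that predicts whether a team wins or loses, I tried to download the dataset from the URL given in the book, but the download never succeeded. Even when I did get an Excel file, it was corrupted. Well, I have learned Python, so I decided to scrape the data myself. (The resource files are linked further down.)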
Code:
'''
Scrape the NBA game results mentioned in 《Python數據挖掘入門與實踐》 (Learning Data Mining with Python)
https://www.basketball-reference.com/leagues/NBA_2014_games-october.html
Usage: run this .py file, then call save()
'''
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

BASE_URL = 'https://www.basketball-reference.com/leagues/NBA_2014_games-{month}.html'
ALL_MONTHS = ['october', 'november', 'december', 'january', 'february', 'march', 'april', 'may', 'june']

def get_content():
    rows_data = []  # avoid naming this `list`, which would shadow the builtin
    for month in ALL_MONTHS:
        url = BASE_URL.format(month=month)
        print(url)
        html = urlopen(url).read()
        bsObj = BeautifulSoup(html, 'lxml')
        rows = bsObj.select('tbody tr')  # select() accepts nested CSS selectors
        for row in rows:
            # Within each tr tag, filter again for the td tags
            cell = [td.text for td in row.find_all('td')]
            rows_data.append(cell)
    return rows_data  # a two-dimensional list

# Save in CSV format
def save():
    path = 'D:\\Python\\PythonProject\\nba_decisiontree_test\\matches.csv'  # change this path to your own
    data = get_content()
    df_data = pd.DataFrame(columns=[1, 2, 3, 4, 5, 6, 7, 8, 9], data=data)
    # Passing the path lets pandas open and close the file itself
    df_data.to_csv(path)
    print('done')
Output:
>>> save()
https://www.basketball-reference.com/leagues/NBA_2014_games-october.html
https://www.basketball-reference.com/leagues/NBA_2014_games-november.html
https://www.basketball-reference.com/leagues/NBA_2014_games-december.html
https://www.basketball-reference.com/leagues/NBA_2014_games-january.html
https://www.basketball-reference.com/leagues/NBA_2014_games-february.html
https://www.basketball-reference.com/leagues/NBA_2014_games-march.html
https://www.basketball-reference.com/leagues/NBA_2014_games-april.html
https://www.basketball-reference.com/leagues/NBA_2014_games-may.html
https://www.basketball-reference.com/leagues/NBA_2014_games-june.html
done
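The row-extraction step inside get_content() can be tried offline before hitting the site. The following is a minimal sketch, assuming a small hand-written HTML snippet (SAMPLE_HTML, with placeholder team names and scores) that only imitates the shape of a schedule table. The helper name parse_rows is hypothetical, and the built-in 'html.parser' is used instead of 'lxml' so nothing extra needs installing.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: placeholder teams and scores, not real data.
# The real page has more columns and extra attributes on each cell.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Date</th><th>Visitor</th><th>PTS</th><th>Home</th><th>PTS</th></tr></thead>
  <tbody>
    <tr><th>Day 1</th><td>Team A</td><td>87</td><td>Team B</td><td>97</td></tr>
    <tr class="thead"><th>Date</th><th>Visitor</th><th>PTS</th><th>Home</th><th>PTS</th></tr>
    <tr><th>Day 2</th><td>Team C</td><td>110</td><td>Team D</td><td>105</td></tr>
  </tbody>
</table>
"""

def parse_rows(html):
    """Return the text of every td cell, one inner list per tbody row."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.select('tbody tr'):  # only rows inside tbody
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if cells:  # skip rows made only of th cells (repeated headers)
            rows.append(cells)
    return rows

rows = parse_rows(SAMPLE_HTML)
print(rows)
```

In this sample, find_all('td') does not pick up cells marked as th, so the first column is not collected. If a page stores a value such as the date in a th cell, find_all(['th', 'td']) would include it. Skipping empty rows is a choice made for this sketch; the script above keeps every row.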
Data preview:
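Supplement: Reading further in the book, I found that a second dataset is needed, but the code above cannot be reused for it. The reason is that the team standings table is commented out in the page (you can see this by viewing the page source). So this time a regular expression is used to pull the table out of the comment.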
Code:
'''
#get_standing_data.py
Fetch the team standings data for the decision-tree NBA prediction in 《Python數據挖掘入門與實踐》 (Learning Data Mining with Python)
Change the output path to suit your machine
'''
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import re

# pattern = re.compile(r'<!--[\s\S]*?-->')  # regex for an HTML comment: <!--[\s\S]*?-->
pattern = re.compile(r'<tbody>[\s\S]*?</tbody>')  # modelled on the HTML-comment regex
url = 'https://www.basketball-reference.com/leagues/NBA_2013_standings.html'
html = urlopen(url).read()
bsObj = BeautifulSoup(html, 'lxml')
content = bsObj.find(id='all_expanded_standings').prettify()
match = re.search(pattern, content)
str_tbody = match.group()
html_tbody = BeautifulSoup(str_tbody, 'lxml')  # parse the matched string back into an HTML object
rows_data = []  # avoid naming this `list`, which would shadow the builtin
for tr in html_tbody.find_all('tr'):
    rows = [td.text for td in tr.find_all('td')]
    rows_data.append(rows)
# Convert to CSV format
file = 'D:\\Python\\PythonProject\\nba_decisiontree_test\\standing.csv'  # change this path to your own
df_data = pd.DataFrame(data=rows_data)
df_data.to_csv(file)
print('done')
Partial data preview:
>>> df_data
0 1 2 3 4 5 6 7 \
0 Miami Heat 66-16 37-4 29-12 41-11 25-5 14-4 12-6
1 Oklahoma City Thunder 60-22 34-7 26-15 21-9 39-13 7-3 8-2
2 San Antonio Spurs 58-24 35-6 23-18 25-5 33-19 8-2 9-1
3 Denver Nuggets 57-25 38-3 19-22 19-11 38-14 5-5 10-0
4 Los Angeles Clippers 56-26 32-9 24-17 21-9 35-17 7-3 8-2
5 Memphis Grizzlies 56-26 32-9 24-17 22-8 34-18 8-2 8-2
6 New York Knicks 54-28 31-10 23-18 37-15 17-13 10-6 12-6
7 Brooklyn Nets 49-33 26-15 23-18 36-16 13-17 11-5 13-5
8 Indiana Pacers 49-32 30-11 19-21 31-20 18-12 6-11 13-3
9 Golden State Warriors 47-35 28-13 19-22 19-11 28-24 7-3 5-5
10 Chicago Bulls 45-37 24-17 21-20 34-18 11-19 13-5 9-7
11 Houston Rockets 45-37 29-12 16-25 21-9 24-28 7-3 7-3
12 Los Angeles Lakers 45-37 29-12 16-25 17-13 28-24 6-4 6-4
13 Atlanta Hawks 44-38 25-16 19-22 29-23 15-15 7-11 11-7
14 Utah Jazz 43-39 30-11 13-28 17-13 26-26 5-5 5-5
15 Boston Celtics 41-40 27-13 14-27 27-24 14-16 7-9 8-9
16 Dallas Mavericks 41-41 24-17 17-24 17-13 24-28 5-5 6-4
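The idea from the supplement, a table hidden inside an HTML comment and recovered with a regular expression, can be reproduced on a tiny offline example. This is only a sketch under stated assumptions: SAMPLE_HTML is hand-written with placeholder teams and records, the helper name rows_from_commented_table is hypothetical, and the comment pattern is the one shown (commented out) in the script above, with a capture group added to drop the <!-- and --> markers.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: a table wrapped in an HTML comment, as described
# in the supplement. Teams and records are placeholders, not real data.
SAMPLE_HTML = """
<div id="all_expanded_standings">
  <!--
  <table>
    <tbody>
      <tr><th>1</th><td>Team A</td><td>10-2</td></tr>
      <tr><th>2</th><td>Team B</td><td>8-4</td></tr>
    </tbody>
  </table>
  -->
</div>
"""

# Non-greedy match of one HTML comment; group 1 is the text inside it.
COMMENT_RE = re.compile(r'<!--([\s\S]*?)-->')

def rows_from_commented_table(html):
    """Parse the first HTML comment as markup and return its td texts per row."""
    match = COMMENT_RE.search(html)
    if match is None:  # no comment found, so nothing to parse
        return []
    inner = BeautifulSoup(match.group(1), 'html.parser')
    return [[td.get_text(strip=True) for td in tr.find_all('td')]
            for tr in inner.find_all('tr')]

standings = rows_from_commented_table(SAMPLE_HTML)
print(standings)
```

Checking for a missing match first avoids the AttributeError that match.group() raises when the pattern finds nothing, which could happen if the page layout changes.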
File resources (a like is appreciated if they help):
Link: https://pan.baidu.com/s/1eUfa914 Password: 5ptu