"Learning Data Mining with Python" (《Python數據挖掘入門與實踐》): getting the NBA dataset for the decision tree chapter

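Preface: When I reached the chapter on using decision trees to predict which team wins, I tried to download the dataset from the URL given in the book, but the download never succeeded, and the Excel file I did manage to get was corrupted. Since I know Python, I decided to scrape the data myself. (The resulting files are linked at the end.)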

Code:

'''
    Scrape the NBA game results used in "Learning Data Mining with Python"
    https://www.basketball-reference.com/leagues/NBA_2014_games-october.html
    Usage: run the .py file, then call save()
'''
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

BASE_URL = 'https://www.basketball-reference.com/leagues/NBA_2014_games-{month}.html'
ALL_MONTHS = ['october', 'november', 'december', 'january', 'february',
              'march', 'april', 'may', 'june']

def get_content():
    rows = []  # avoid naming this "list", which would shadow the built-in
    for month in ALL_MONTHS:
        url = BASE_URL.format(month=month)
        print(url)
        html = urlopen(url).read()
        bsObj = BeautifulSoup(html, 'lxml')
        # select() takes a CSS selector, so nested tags can be matched in one call
        for tr in bsObj.select('tbody tr'):
            # within each tr, keep only the text of its td cells
            cells = [td.text for td in tr.find_all('td')]
            rows.append(cells)
    return rows  # two-dimensional list, one inner list per table row

# Save as CSV
def save():
    path = 'D:\\Python\\PythonProject\\nba_decisiontree_test\\matches.csv'  # change this path to suit your machine
    rows = get_content()
    df_data = pd.DataFrame(columns=[1, 2, 3, 4, 5, 6, 7, 8, 9], data=rows)
    # passing the path lets pandas open and close the file itself
    df_data.to_csv(path)
    print('done')

Output:

>>> save()
https://www.basketball-reference.com/leagues/NBA_2014_games-october.html
https://www.basketball-reference.com/leagues/NBA_2014_games-november.html
https://www.basketball-reference.com/leagues/NBA_2014_games-december.html
https://www.basketball-reference.com/leagues/NBA_2014_games-january.html
https://www.basketball-reference.com/leagues/NBA_2014_games-february.html
https://www.basketball-reference.com/leagues/NBA_2014_games-march.html
https://www.basketball-reference.com/leagues/NBA_2014_games-april.html
https://www.basketball-reference.com/leagues/NBA_2014_games-may.html
https://www.basketball-reference.com/leagues/NBA_2014_games-june.html
done

Data preview:
[Image: preview of the scraped data]
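get_content() returns one inner list per tr, so any tr that happens to contain no td cells would show up as an empty list. The following is a minimal sketch of dropping such rows before saving. The sample rows are made-up placeholders, not real results, and the standard csv module is swapped in for pandas here only to keep the example dependency-free; drop_empty_rows and rows_to_csv_text are hypothetical helper names.

```python
import csv
import io

# Made-up placeholder rows standing in for the output of get_content().
# The empty list mimics a tr that contained no td cells (an assumption
# about what the page may contain, not something verified here).
sample_rows = [
    ['Team A', '100', 'Team B', '98'],
    [],
    ['Team C', '91', 'Team D', '105'],
]

def drop_empty_rows(rows):
    """Return only the rows that contain at least one cell."""
    return [row for row in rows if row]

def rows_to_csv_text(rows):
    """Serialize rows to CSV text with the standard csv module."""
    buffer = io.StringIO(newline='')  # newline='' leaves the csv line endings untouched
    writer = csv.writer(buffer)
    writer.writerows(rows)
    return buffer.getvalue()

clean_rows = drop_empty_rows(sample_rows)
csv_text = rows_to_csv_text(clean_rows)
print(csv_text)
```

In the pandas version, the same filter could be applied to the list before it is passed to pd.DataFrame.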

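Addendum: Further into the chapter a second dataset is needed, and the code above does not work for it. The reason is that the team standings table is commented out in the page's HTML (view the page source to see this). So this time a regular expression is used to get at the commented-out markup.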
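To illustrate the idea on its own, here is a minimal offline sketch that pulls a table out of an HTML comment. SAMPLE_HTML is a hypothetical, heavily simplified stand-in for the real page, and CellCollector and extract_commented_rows are names invented for this example. It uses only the standard library (re plus html.parser) rather than BeautifulSoup, so it is not the script used in this post.

```python
import re
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the real page: the table sits
# inside an HTML comment, so a normal tag search would not find its rows.
SAMPLE_HTML = """
<div id="all_expanded_standings">
<!--
<table><tbody>
<tr><td>Miami Heat</td><td>66-16</td></tr>
<tr><td>Oklahoma City Thunder</td><td>60-22</td></tr>
</tbody></table>
-->
</div>
"""

# Non-greedy match of one HTML comment; group 1 is the text inside it.
COMMENT_RE = re.compile(r'<!--([\s\S]*?)-->')

class CellCollector(HTMLParser):
    """Collect the text of every td, grouped by tr."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag == 'td':
            if not self.rows:  # tolerate a td that appears outside any tr
                self.rows.append([])
            self._in_td = True
            self.rows[-1].append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.rows[-1][-1] += data

def extract_commented_rows(html):
    """Return the table rows found inside the first HTML comment, if any."""
    match = COMMENT_RE.search(html)
    if match is None:
        return []
    parser = CellCollector()
    parser.feed(match.group(1))
    parser.close()
    return parser.rows

rows = extract_commented_rows(SAMPLE_HTML)
print(rows)
```

The key step is the same in both approaches: treat the comment body as a string, then parse that string as HTML in a second pass.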

Code:

'''
    # get_standing_data.py
    Fetch the team standings data for the NBA decision tree example in
    "Learning Data Mining with Python".
    Change the output path to suit your machine.
'''
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import re

# Regex for an HTML comment, kept for reference: <!--[\s\S]*?-->
# pattern = re.compile(r'<!--[\s\S]*?-->')
# Same non-greedy idea, but matching the tbody inside the comment.
# Raw strings keep \s and \S from being read as string escapes.
pattern = re.compile(r'<tbody>[\s\S]*?</tbody>')
url = 'https://www.basketball-reference.com/leagues/NBA_2013_standings.html'
html = urlopen(url).read()
bsObj = BeautifulSoup(html, 'lxml')
content = bsObj.find(id='all_expanded_standings').prettify()
match = re.search(pattern, content)
str_tbody = match.group()
html_tbody = BeautifulSoup(str_tbody, 'lxml')  # parse the matched string back into a BeautifulSoup object
rows = []  # avoid naming this "list", which would shadow the built-in
for tr in html_tbody.find_all('tr'):
    cells = [td.text for td in tr.find_all('td')]
    rows.append(cells)

# Save as CSV
file = 'D:\\Python\\PythonProject\\nba_decisiontree_test\\standing.csv'  # change as needed
df_data = pd.DataFrame(data=rows)
df_data.to_csv(file)
print('done')



Partial data preview:

>>> df_data
                        0      1      2      3      4      5     6     7   \
0               Miami Heat  66-16   37-4  29-12  41-11   25-5  14-4  12-6   
1    Oklahoma City Thunder  60-22   34-7  26-15   21-9  39-13   7-3   8-2   
2        San Antonio Spurs  58-24   35-6  23-18   25-5  33-19   8-2   9-1   
3           Denver Nuggets  57-25   38-3  19-22  19-11  38-14   5-5  10-0   
4     Los Angeles Clippers  56-26   32-9  24-17   21-9  35-17   7-3   8-2   
5        Memphis Grizzlies  56-26   32-9  24-17   22-8  34-18   8-2   8-2   
6          New York Knicks  54-28  31-10  23-18  37-15  17-13  10-6  12-6   
7            Brooklyn Nets  49-33  26-15  23-18  36-16  13-17  11-5  13-5   
8           Indiana Pacers  49-32  30-11  19-21  31-20  18-12  6-11  13-3   
9    Golden State Warriors  47-35  28-13  19-22  19-11  28-24   7-3   5-5   
10           Chicago Bulls  45-37  24-17  21-20  34-18  11-19  13-5   9-7   
11         Houston Rockets  45-37  29-12  16-25   21-9  24-28   7-3   7-3   
12      Los Angeles Lakers  45-37  29-12  16-25  17-13  28-24   6-4   6-4   
13           Atlanta Hawks  44-38  25-16  19-22  29-23  15-15  7-11  11-7   
14               Utah Jazz  43-39  30-11  13-28  17-13  26-26   5-5   5-5   
15          Boston Celtics  41-40  27-13  14-27  27-24  14-16   7-9   8-9   
16        Dallas Mavericks  41-41  24-17  17-24  17-13  24-28   5-5   6-4   
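The cells in this table are win-loss strings such as 66-16, so they usually need converting to numbers before they can be compared. Below is a minimal sketch, assuming every cell has the form "wins-losses" as in the preview above; parse_record and win_ratio are hypothetical helper names, not part of the book's code.

```python
def parse_record(record):
    """Split a 'wins-losses' string such as '66-16' into two integers."""
    wins, losses = record.split('-')
    return int(wins), int(losses)

def win_ratio(record):
    """Fraction of games won; 0.0 when no games were played."""
    wins, losses = parse_record(record)
    games = wins + losses
    return wins / games if games else 0.0

print(parse_record('66-16'))
print(round(win_ratio('66-16'), 3))
```

With pandas, the same function could be applied to a column, for example df_data[1].apply(win_ratio), assuming column 1 holds the overall record as it appears to in the preview.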

Files: if this was useful, please give it a like.

Link: https://pan.baidu.com/s/1eUfa914 Password: 5ptu

