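This job uses no new techniques; it is much the same as the four chemistry ones. I don't know what scores they got in the end. I hope they were on the high side, otherwise I would feel bad about it. The job came with a scraper, so I used it as is.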

#!/usr/bin/env python
# coding: utf-8
import requests
import time
import re
import csv
from bs4 import BeautifulSoup
 
# Fixed part of the URL
url='http://www.cbooo.cn/year?year='
# Request headers
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}
# Fetch the list page for each year from 2009 to 2018 and concatenate the raw HTML
html=b''
for year in range(2009,2019):
    r=requests.get(url=url+str(year),headers=headers)
    html=html+r.content
    # Pause 0.5 seconds between requests
    time.sleep(0.5)
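lj=BeautifulSoup(html,'html.parser')
#print(lj)
# Extract title, genre, total box office (in units of 10,000), average ticket price, average attendance per screening, and country/region
result=lj.find_all('td')
#print(result)
#print(len(result))
mname=[]
index=1
for i in result:
    i=str(i)
    # In the title cell the film name sits between </span> and </p>
    title=re.findall(r'</span>(.*?)</p>',i,re.I|re.M)
    if len(title)>0:
        # Prepend a running id before each film's title
        mname.append(index)
        index=index+1
        mname.append(title[0])
    else:
        info=re.findall(r'<td>(.*?)</td>',i,re.I|re.M)
        # Append '' when the cell does not match, so the 8-fields-per-film stride is kept
        mname.append(info[0] if info else '')
#print(len(mname))
#print(mname)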
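k=0
data=[]
# mname holds 8 fields per film and 25 films per year, so every 200 entries cover one year
while k<2000:
    year=2009+(k//200)
    data.append([mname[k],mname[k+1],mname[k+2],mname[k+3],mname[k+4],mname[k+5],mname[k+6],mname[k+7],year,1])
    k=k+8
#print(data)
print(len(data)) # 250 rows in total
# Save the result to a CSV file
with open('data.csv','w') as fout:
    cin= csv.writer(fout,lineterminator='\n')
    # Write the header row
    # The analysis below reads data.csv with Chinese column names and no "mark" column,
    # so the file was apparently re-saved with renamed headers before that step
    cin.writerow(["index","name","type","zpf","price","mantimes","area","datatime","year","mark"])
    for item in data:
        cin.writerow(item)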
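
The fixed-stride loop that builds `data` can also be written as a slice-based chunker. The following is a minimal sketch, not the author's code: `rows_from_flat` and its parameters are hypothetical names, and the sample list is made up to keep the example self-contained. It assumes, as the loop above does, 8 fields per film and a fixed number of films per year.

```python
def rows_from_flat(flat, fields_per_row=8, rows_per_year=25, first_year=2009):
    """Group a flat list into rows and append the year and a constant mark of 1."""
    rows = []
    # Stop at the last complete row, so a short input never raises IndexError.
    for start in range(0, len(flat) - fields_per_row + 1, fields_per_row):
        row_number = start // fields_per_row
        year = first_year + row_number // rows_per_year
        rows.append(flat[start:start + fields_per_row] + [year, 1])
    return rows

# Made-up input: 3 films of 8 fields each, with 2 films per year for brevity.
flat = []
for film_id in range(1, 4):
    flat += [film_id, 'film%d' % film_id, 'genre', 100, 30, 50, 'region', '2009/1/1']

rows = rows_from_flat(flat, rows_per_year=2)
print(len(rows))      # 3
print(rows[2][-2:])   # [2010, 1]
```

Unlike `while k<2000`, the range here is derived from `len(flat)`, so a page that returns fewer cells yields fewer rows instead of an `IndexError`.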

import pandas as pd
import numpy  as np
import matplotlib.pyplot as plt
%matplotlib inline
from pylab import mpl
mpl.rcParams['font.sans-serif'] = ['FangSong'] # set the default font so Chinese labels render
mpl.rcParams['axes.unicode_minus'] = False # keep the minus sign '-' from showing as a box in saved figures

test=pd.read_csv('data.csv',encoding='gbk')
test.head()

	id	影片名	類型	總票房	平均票價	場均人次	國家及地區	上映日期	年
0	1	2012世界末日	災難	44745	32	68	美國	2009/11/13	2009
1	2	變形金剛2	科幻/動作	40364	32	53	美國	2009/6/24	2009
2	3	建國大業	劇情	39288	32	54	中國/中國香港	2009/9/16	2009
3	4	赤壁(下)	動作	24353	34	49	中國/中國香港	2009/1/7	2009
4	5	三槍拍案驚奇	喜劇	22011	33	49	中國	2009/12/10	2009

Check data validity and clean the data

Inspect missing values

test.isnull().sum()

id       0
影片名      0
類型       1
總票房      0
平均票價     0
場均人次     0
國家及地區    1
上映日期     2
年        0
dtype: int64

test.loc[test['類型'].isnull()]
test.drop([36],inplace=True)
test.loc[test['上映日期'].isnull()]
test.drop([65],inplace=True)
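
The two `drop` calls above rely on row labels 36 and 65, which were read off the output by eye. A sketch of a label-free alternative follows, assuming pandas and NumPy are installed. The small frame is made-up stand-in data that reuses two of the column names; it is not the scraped file.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the scraped table; NaN marks a missing cell.
df = pd.DataFrame({
    '影片名': ['A', 'B', 'C', 'D'],
    '類型': ['劇情', np.nan, '動作', '喜劇'],
    '上映日期': ['2010/1/1', '2010/2/1', np.nan, '2010/4/1'],
})

# Drop every row that is missing a value in either of the two columns.
cleaned = df.dropna(subset=['類型', '上映日期'])
print(cleaned.index.tolist())             # [0, 3]
print(int(cleaned.isnull().sum().sum()))  # 0
```

Applied to `test`, the same `dropna` call would not need to change if a re-scrape moved the incomplete rows to other positions.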

test.isnull().sum()

id       0
影片名      0
類型       0
總票房      0
平均票價     0
場均人次     0
國家及地區    0
上映日期     0
年        0
dtype: int64

test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 248 entries, 0 to 249
Data columns (total 9 columns):
id       248 non-null int64
影片名      248 non-null object
類型       248 non-null object
總票房      248 non-null int64
平均票價     248 non-null int64
場均人次     248 non-null int64
國家及地區    248 non-null object
上映日期     248 non-null object
年        248 non-null int64
dtypes: int64(5), object(4)
memory usage: 19.4+ KB

test.hist(figsize=(20,10))

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000021685CAD710>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000021685F6C898>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000021685F94F28>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000021685FC75C0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000021685FEDC50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000021685FEDC88>]],
      dtype=object)

[Figure output_10_1.png: histograms of the numeric columns]

Sort the data by release date

test=test.sort_values(by='上映日期')
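
Because `上映日期` is stored as text (dtype `object` in the `info()` output), `sort_values` orders it character by character rather than chronologically. Here is a minimal standard-library sketch of the difference, using the five release dates from the `head()` output above.

```python
from datetime import datetime

dates = ['2009/11/13', '2009/6/24', '2009/9/16', '2009/1/7', '2009/12/10']

# Plain string sort: '2009/11/13' lands before '2009/6/24'.
as_text = sorted(dates)
# Parse each string first, then sort by the resulting date.
as_dates = sorted(dates, key=lambda s: datetime.strptime(s, '%Y/%m/%d'))

print(as_text)   # ['2009/1/7', '2009/11/13', '2009/12/10', '2009/6/24', '2009/9/16']
print(as_dates)  # ['2009/1/7', '2009/6/24', '2009/9/16', '2009/11/13', '2009/12/10']
```

In pandas, converting the column with `pd.to_datetime` before calling `sort_values` should have the same effect. The yearly totals further down are grouped by `年`, so they do not depend on this ordering.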

test_num=test.groupby(by=['年']).sum(numeric_only=True)
test_num

	id	總票房	平均票價	場均人次
年
2009	325	395890	797	1058
2010	913	648652	856	961
2011	1509	710355	856	824
2012	2200	1011515	931	803
2013	2825	1174380	939	727
2014	3450	1633415	913	749
2015	4075	2495002	900	799
2016	4700	2513007	861	655
2017	5325	3287129	882	558
2018	5950	3916309	894	544

test_num['總票房'].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x21685cad940>

[Figure output_14_1.png: line plot of 總票房 by year]

test_num['平均票價'].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x21686107cc0>

[Figure output_15_1.png: line plot of 平均票價 by year]
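
Note that `test_num` comes from `groupby(...).sum()`, so its `平均票價` column is the sum of each film's average price within a year, not a yearly average. A minimal sketch of the conversion follows, using the sums from the table above. The per-year film counts are derived from the `id` sums: 25 films per year, less the two dropped rows, which fall in 2010 and 2011.

```python
years = list(range(2009, 2019))
# Yearly sums of 平均票價 copied from the test_num table.
price_sum = [797, 856, 856, 931, 939, 913, 900, 861, 882, 894]
# Films remaining per year after the two incomplete rows were dropped.
film_count = [25, 24, 24, 25, 25, 25, 25, 25, 25, 25]

mean_price = {y: round(s / n, 2) for y, s, n in zip(years, price_sum, film_count)}
print(mean_price[2009])  # 31.88
print(mean_price[2015])  # 36.0
```

In pandas, `test.groupby('年')['平均票價'].mean()` should give the same figures directly.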

y = test_num['總票房']
X = test_num.drop(['總票房','id'],axis=1)
# y is a continuous target, so only the shape of the feature matrix is reported
print('data shape: {0}'.format(X.shape))
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

data shape: (10, 2)

from sklearn import linear_model
model =linear_model.LinearRegression()
model.fit(X_train, y_train)
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print('train score: {train_score:.6f}; test score: {test_score:.6f}'.format(
    train_score=train_score, test_score=test_score))

train score: 0.785987; test score: 0.901816
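
With only 10 yearly rows, `test_size=0.2` leaves 2 rows for testing, and no `random_state` is set, so the test score above will change from run to run. One possible way to get a steadier estimate is leave-one-out cross-validation. The sketch below assumes scikit-learn and NumPy are installed and re-enters the yearly sums from the `test_num` table by hand. It illustrates the idea and is not the author's procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Yearly sums copied from the test_num table (2009 to 2018).
price_sum = [797, 856, 856, 931, 939, 913, 900, 861, 882, 894]
attendance_sum = [1058, 961, 824, 803, 727, 749, 799, 655, 558, 544]
box_office = [395890, 648652, 710355, 1011515, 1174380,
              1633415, 2495002, 2513007, 3287129, 3916309]

X = np.column_stack([price_sum, attendance_sum])
y = np.array(box_office)

# Each of the 10 years is predicted by a model fitted on the other 9.
pred = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
loo_r2 = r2_score(y, pred)
print('leave-one-out R^2: {:.3f}'.format(loo_r2))
```

Every row is used for testing exactly once, so the result does not depend on a random split.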

Happy丶lazy

20191126_1_Movie box office analysis
