Original blog post: http://blog.csdn.net/u012609509/article/details/51910405
I. How LSTM networks work
1. Key points
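(1) LSTM networks are used for data with a sequential nature: time series such as daily stock prices or the time-domain waveform of a mechanical vibration signal, and naturally ordered data such as natural language, which is made up of words in sequence.
(2) LSTM is not a standalone network architecture. It is one part of a larger neural network: LSTM units replace the hidden-layer units of the original network.
(3) An LSTM network has "memory". The reason is that the network is connected across different time steps, rather than having only feedforward or feedback connections within a single time step. See the LSTM unit (hidden-layer unit) in Figure 2; Figure 3 shows the network unrolled over time. In the figures, dashed lines are the connections between time steps, and solid lines are the connections of the network itself.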
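The effect of connections across time steps in point (3) can be illustrated with a minimal sketch. It assumes a plain one-unit recurrent layer rather than a full LSTM, and the function name `run_rnn` and the weight values are hypothetical, chosen only for illustration. Because the hidden state of one time step feeds into the next, the same final input gives a different output when the earlier history differs.

```python
import numpy as np

def run_rnn(inputs, w_in, w_rec):
    """Unroll a one-unit recurrent layer over a sequence.

    h_t = tanh(w_in * x_t + w_rec * h_{t-1}); w_rec is the connection
    between consecutive time steps.
    """
    h = 0.0
    states = []
    for x in inputs:
        h = np.tanh(w_in * x + w_rec * h)
        states.append(h)
    return np.array(states)

# Two sequences that end with the same value but differ earlier.
seq_a = [0.0, 0.0, 1.0]
seq_b = [1.0, 1.0, 1.0]

# With a recurrent connection the final states differ: the layer "remembers".
final_a = run_rnn(seq_a, w_in=0.5, w_rec=0.8)[-1]
final_b = run_rnn(seq_b, w_in=0.5, w_rec=0.8)[-1]

# Without it (w_rec = 0) only the current input matters.
flat_a = run_rnn(seq_a, w_in=0.5, w_rec=0.0)[-1]
flat_b = run_rnn(seq_b, w_in=0.5, w_rec=0.0)[-1]
```

Setting `w_rec` to zero removes the link between time steps, which is why `flat_a` and `flat_b` coincide while `final_a` and `final_b` do not.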
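2. Structure of the LSTM unit
Figures 4 and 5 show LSTM unit structures in common use today:
The main components are:
(1) Input node: receives the output of the hidden units at the previous time step and the sample input at the current time step.
(2) Input gate: its value is multiplied by the value of the input node, and the product forms one part of the LSTM's internal state. When the gate outputs 1, the input node's activation flows entirely into the internal state; when the gate outputs 0, the input node has no effect on the internal state.
(3) Internal state.
(4) Forget gate: refreshes the internal state by controlling how much the previous internal state influences the current one.
Figures 4 and 5 show how the nodes and gates relate to the output of the hidden unit.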
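The components listed above can be sketched as a single forward step in NumPy. This is a minimal illustration of the commonly used LSTM formulation, not the implementation inside Keras. The names `lstm_step` and `random_params` are hypothetical, and the output gate is an assumption taken from the standard formulation, since the list above does not name it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, params):
    """One forward step of a single LSTM layer.

    x:      input at the current time step, shape (n_in,)
    h_prev: hidden output of the previous time step, shape (n_hid,)
    s_prev: internal state of the previous time step, shape (n_hid,)
    params: dict mapping 'g', 'i', 'f', 'o' to (W, U, b) with W of shape
            (n_hid, n_in), U of shape (n_hid, n_hid), b of shape (n_hid,)
    """
    def pre(name):
        W, U, b = params[name]
        return W.dot(x) + U.dot(h_prev) + b

    g = np.tanh(pre('g'))    # input node
    i = sigmoid(pre('i'))    # input gate
    f = sigmoid(pre('f'))    # forget gate
    o = sigmoid(pre('o'))    # output gate (assumed, see the note above)
    s = g * i + s_prev * f   # new internal state
    h = np.tanh(s) * o       # output of the hidden unit
    return h, s

def random_params(n_in, n_hid, rng):
    """Small random weights and zero biases, for demonstration only."""
    return {name: (rng.normal(scale=0.1, size=(n_hid, n_in)),
                   rng.normal(scale=0.1, size=(n_hid, n_hid)),
                   np.zeros(n_hid))
            for name in 'gifo'}

rng = np.random.RandomState(0)
params = random_params(n_in=3, n_hid=4, rng=rng)
h, s = np.zeros(4), np.zeros(4)
for x in rng.normal(size=(5, 3)):   # a sequence of 5 time steps
    h, s = lstm_step(x, h, s, params)
```

The line `s = g * i + s_prev * f` is where the two gates act: the input gate scales what the input node contributes, and the forget gate scales how much of the previous internal state is kept.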
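II. Code example
1. About the example
The example is based on the "2016 Alibaba Popular Music Trend Prediction" competition, which I took part in this year.
Time flies: today is already the last day of the second season. I first looked at the task on May 18, and the first season closed at 10 a.m. on June 14. Because that season was run offline, any model could be used, and since my research is in deep learning, I chose a recurrent neural network based on LSTM. Fortunately, the result was good enough to reach the second season. I have been working with deep learning for a little over half a year, and I am pleased that I could apply what I learned to a real-world problem with a decent result. After so many years as a student, I have come to think that "learning something well and putting it to use" should be my direction and motivation from now on.
Here is a brief description of this year's task:
Official "input": two tables. The user action table mars_tianchi_user_actions (covering 20150301-20150830) records users' collect, download and play actions on songs. The song table mars_tianchi_songs describes each song's artist and related information such as release time, initial popularity and language.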
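Required official "output": the predicted daily play count of every artist for the following two months (20150901-20151030). Output format: one row per artist per day containing the artist id, the predicted play count and the date, in that order, which is the order the code in Section III writes them.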
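As a minimal sketch of producing rows in the order the code in Section III writes them (artist id, predicted play count, date), the following uses Python 3 and the standard csv module. The function name `write_predictions`, the artist ids and the counts are made up for illustration and do not come from the competition data.

```python
import csv
import io
from datetime import date, timedelta

def write_predictions(out, predictions, start):
    """Write one row per artist per day: artist_id, plays, YYYYMMDD.

    predictions: dict mapping artist id -> list of daily play counts,
                 the first entry belonging to the date `start`.
    """
    writer = csv.writer(out, lineterminator="\n")
    for artist_id in sorted(predictions):
        for offset, plays in enumerate(predictions[artist_id]):
            day = start + timedelta(days=offset)
            # int() truncates, as the original code does before writing
            writer.writerow([artist_id, int(plays), day.strftime("%Y%m%d")])

# Hypothetical artists and counts, for illustration only.
demo = {"artist_a": [120.7, 98.2], "artist_b": [15.0, 17.9]}
buf = io.StringIO()
write_predictions(buf, demo, start=date(2015, 9, 1))
rows = buf.getvalue().splitlines()
```

Writing to an in-memory buffer keeps the sketch free of side effects; passing an open file object instead would produce the submission file.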
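2. Model used in the preliminary round
Because the task is to predict artists' play counts, I aggregated the play count of each artist directly and examined how it changes over the period covered by the data (20150301-20150830). Each time point becomes one sample made of the daily play count, the mean of the plays over 3 consecutive days, and the variance of the plays over 3 consecutive days, and the training set for the network is built by sliding over these samples. The network is structured as follows:
(1) Input layer: 3 neurons, for the play count, the play mean and the play variance.
(2) First hidden layer: an LSTM layer with 35 LSTM units.
(3) Second hidden layer: an LSTM layer with 10 LSTM units.
(4) Output layer: 3 neurons, with the same meaning as the input layer.
Objective function: reconstruction error (mean squared error in the code below).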
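The sample construction described above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the competition code: `make_features` and `make_windows` are hypothetical names, the window width is a parameter, the play counts are made up, and the spread is measured with the standard deviation because that is what the code in Section III computes for its "variance" feature.

```python
import numpy as np

def make_features(plays, width=3):
    """Return an array of shape (len(plays), 3): play, trailing mean, trailing std.

    The trailing window covers the current day and up to width - 1 earlier
    days; at the start of the series it is shorter.
    """
    plays = np.asarray(plays, dtype=float)
    rows = []
    for t in range(len(plays)):
        window = plays[max(0, t - width + 1):t + 1]
        rows.append([plays[t], window.mean(), window.std()])
    return np.array(rows)

def make_windows(features, seq_len):
    """Slide over the rows: seq_len consecutive rows form one input,
    and the row that follows them is the target."""
    X, y = [], []
    for start in range(len(features) - seq_len):
        X.append(features[start:start + seq_len])
        y.append(features[start + seq_len])
    return np.array(X), np.array(y)

plays = [10, 12, 11, 15, 14, 13, 18, 17]   # made-up daily play counts
features = make_features(plays, width=3)
X, y = make_windows(features, seq_len=4)
```

`X` has shape (samples, seq_len, 3) and `y` has shape (samples, 3), which matches a network with 3 input neurons and 3 output neurons.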
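The figure below shows the play statistics curves of some artists:
3. Prediction results
Blue is the artist's true play curve; green is the predicted curve: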
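The predicted curves are produced by a recursive multi-step forecast in the code of Section III: the model predicts one day, the prediction is appended to the input window, the oldest day is dropped, and the step repeats. A minimal sketch of that loop follows. The name `forecast` is hypothetical, and `mean_model` is only a stand-in for the trained network (it returns the mean of the window) so that the example runs without Keras.

```python
import numpy as np

def forecast(window, predict_next, n_days):
    """Predict n_days ahead by feeding each prediction back as input.

    window:       array of shape (seq_len, n_features), the last known days
    predict_next: function mapping a window to the next row, shape (n_features,)
    """
    window = np.array(window, dtype=float)
    out = []
    for _ in range(n_days):
        nxt = predict_next(window)
        out.append(nxt)
        # drop the oldest day and append the prediction
        window = np.vstack((window[1:], nxt))
    return np.array(out)

def mean_model(window):
    """Stand-in for the trained network: the column-wise mean of the window."""
    return window.mean(axis=0)

last_window = np.array([[1.0, 2.0], [3.0, 4.0]])
preds = forecast(last_window, mean_model, n_days=3)
```

Because every step consumes earlier predictions, errors can accumulate over a two-month horizon, which is worth keeping in mind when reading the curves.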
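III. Code
Environment: Spyder on Windows.
Language: Python 2.7, with the Keras deep learning library.
Before looking at this task I had no Python background at all, so I learned Python while working out the approach. I was not familiar with Python's data structures, so the code is rather messy, but the whole thing runs without errors. This is the final version of the code from the preliminary round.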
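# -*- coding: utf-8 -*-
"""
Created on Wed Jun 01 16:34:45 2016
@author: Richer
"""
#%% Change log
# 1. Changed the activation function of the last layer to linear
# 2. Averaged each artist's play curve by number of songs (replaced by item 4)
# 3. Added mean filtering and a mean feature
# 4. Normalized each artist separately (artists differ too much in scale)
# 5. Clustered the artists (did not work well)
#%% Date sequences and dictionaries
from __future__ import division
import pandas as pd
import pdb
#import time
_DEBUG = False
_ISTEST = False
tempList = pd.date_range(start = '20150301',end = '20150830')
i = 0
dateList = [] # dates covered by the given data set, as 'YYYYMMDD' strings
while i < len(tempList):
    strTemp = str(tempList[i])[:10]
    strTemp = strTemp.replace('-','')
    dateList.append(strTemp)
    i = i + 1
recDict = {}.fromkeys(dateList,0) # per-date counter dictionary for the given data set
del tempList,i,strTemp
tempList = pd.date_range(start = '20150831', end = '20151030')
i = 0
objDateL = [] # target dates to predict; the first entry (20150831) is skipped when the results are written
while i < len(tempList):
    strTemp = str(tempList[i])[:10]
    strTemp = strTemp.replace('-','')
    objDateL.append(strTemp)
    i += 1
del strTemp, i
## Anomaly counters
newSongExcep = 0 # actions in the user table on songs missing from the song table
userDsExcep = 0 # user actions outside 20150301-20150830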
#%% Table processing: song and artist data
from copy import deepcopy
fileSong = open("p2_mars_tianchi_songs.csv")
songData = fileSong.readlines()
bigSongDict = {} # big table keyed by song id
for songInfo in songData:
    songInfo = songInfo.replace('\n','')
    arrayInfo = songInfo.split(',')
    bigSongDict[arrayInfo[0]] = {} # must be initialized here, otherwise the assignments below fail
    bigSongDict[arrayInfo[0]]['artist_id'] = arrayInfo[1]
    bigSongDict[arrayInfo[0]]['publish_time'] = arrayInfo[2]
    bigSongDict[arrayInfo[0]]['song_init_plays'] = arrayInfo[3]
    bigSongDict[arrayInfo[0]]['Language'] = arrayInfo[4]
    bigSongDict[arrayInfo[0]]['Gender'] = arrayInfo[5]
    bigSongDict[arrayInfo[0]]['nUser'] = 0 # number of user actions on this song
    bigSongDict[arrayInfo[0]]['playRec'] = deepcopy(recDict) # play record
    bigSongDict[arrayInfo[0]]['downloadRec'] = deepcopy(recDict) # download record
    bigSongDict[arrayInfo[0]]['colloctRec'] = deepcopy(recDict) # collect record
fileSong.close()
del songData,arrayInfo,songInfo
# User action data
fileUser = open("p2_mars_tianchi_user_actions.csv")
userData = fileUser.readlines()
for userInfo in userData:
    userInfo = userInfo.replace('\n','')
    arrUser = userInfo.split(',')
    if (arrUser[1] in bigSongDict):
        bigSongDict[arrUser[1]]['nUser'] += 1
        # action type: '1' = play, '2' = download, '3' = collect
        if arrUser[3] == '1':
            bigSongDict[arrUser[1]]['playRec'][arrUser[4]] += 1
        if arrUser[3] == '2':
            bigSongDict[arrUser[1]]['downloadRec'][arrUser[4]] += 1
        if arrUser[3] == '3':
            bigSongDict[arrUser[1]]['colloctRec'][arrUser[4]] += 1
    else:
        newSongExcep = newSongExcep + 1
fileUser.close()
del userData,userInfo,arrUser
#%% Aggregate each artist's play, download and collect curves (20150301-20150830)
from collections import Counter
singerDict = {} # per-artist statistics
for songKey in bigSongDict.keys():
    theArtist = bigSongDict[songKey]['artist_id']
    if (theArtist in singerDict):
        # Adding two Counters sums the values of matching keys,
        # but the result drops every key whose count is 0
        singerDict[theArtist]['playRec'] = dict(Counter(singerDict[theArtist]['playRec']) + Counter(bigSongDict[songKey]['playRec']))
        singerDict[theArtist]['downloadRec'] = dict(Counter(singerDict[theArtist]['downloadRec']) + Counter(bigSongDict[songKey]['downloadRec']))
        singerDict[theArtist]['colloctRec'] = dict(Counter(singerDict[theArtist]['colloctRec']) + Counter(bigSongDict[songKey]['colloctRec']))
        singerDict[theArtist]['nSongs'] += 1
    else:
        singerDict[theArtist] = {}
        singerDict[theArtist]['playRec'] = deepcopy(bigSongDict[songKey]['playRec'])
        singerDict[theArtist]['downloadRec'] = deepcopy(bigSongDict[songKey]['downloadRec'])
        singerDict[theArtist]['colloctRec'] = deepcopy(bigSongDict[songKey]['colloctRec'])
        singerDict[theArtist]['nSongs'] = 1
#%% Convert the dictionaries in singerDict to lists ordered by date
import numpy as np
singerInfoList = {}
tpPlayList = [] # plays
tpDownList = [] # downloads
tpCollectList = [] # collects
artList = [] # artist ids
i = 0
for singer in singerDict.keys():
    artList.append(singer)
    singerInfoList[singer] = {}
    #numSongs = singerDict[singer]['nSongs'] # number of songs of this artist
    while i < len(dateList):
        # dates dropped by the Counter addition above are filled back in with 0
        if (dateList[i] in singerDict[singer]['playRec'].keys()):
            tpPlayList.append(singerDict[singer]['playRec'][dateList[i]])
        else:
            tpPlayList.append(0)
        if (dateList[i] in singerDict[singer]['downloadRec'].keys()):
            tpDownList.append(singerDict[singer]['downloadRec'][dateList[i]])
        else:
            tpDownList.append(0)
        if(dateList[i] in singerDict[singer]['colloctRec'].keys()):
            tpCollectList.append(singerDict[singer]['colloctRec'][dateList[i]])
        else:
            tpCollectList.append(0)
        i += 1
    i = 0
    meanPlays = np.mean(tpPlayList)
    stdPlays = np.std(tpPlayList)
    singerInfoList[singer]['meanPlay'] = meanPlays
    singerInfoList[singer]['stdPlay'] = stdPlays
    # largest absolute z-score, used later to scale the curve into [-1, 1]
    singerInfoList[singer]['maxPlay'] = (abs((np.array(tpPlayList) - meanPlays) / stdPlays)).max()
    singerInfoList[singer]['playRec'] = deepcopy(tpPlayList)
    singerInfoList[singer]['downloadRec'] = deepcopy(tpDownList)
    singerInfoList[singer]['colloctRec'] = deepcopy(tpCollectList)
    del tpPlayList, tpDownList, tpCollectList
    tpPlayList = []
    tpDownList = []
    tpCollectList = []
del tpPlayList, tpDownList, tpCollectList, singer,meanPlays,stdPlays
#%% FFT of each artist's play curve
import matplotlib.pyplot as plt
import math
#i = 0
#if _ISTEST == True:
# while i < len(singerInfoList):
# flagY = i % 9
# if flagY ==0:
# plt.figure(figsize = (10,8), dpi = 150)
# plt.suptitle('FFT process')
# plt.subplot(3,3,flagY + 1)
# fAmp = np.fft.fft(singerInfoList[artList[i]]['playRec']) / len(dateList)
# plt.stem(abs(fAmp[1:(len(fAmp)/2)]))
# i += 1
# del fAmp
#
#pdb.set_trace()
#predictTestFFT = {} # predictions from the FFT regression
#playLth = 0 # length of the play sequence used for the FFT
#chsNum = np.ones(len(singerInfoList),dtype=np.int) * 1 # number of top peaks used for trend prediction
##chsNum[0] = 10
##chsNum[5] = 10
##chsNum[7] = 10
##chsNum[8] = 10
##chsNum[10] = 10
##chsNum[17] = 10
##chsNum[21] = 10
##chsNum[22] = 10
#
#if _ISTEST == True:
# playLth = len(dateList) - len(objDateL)
#else:
# playLth = len(dateList)
#
#j = 0 # artist index
#i = 0 # FFT index
#while j < len(singerInfoList):
# i = 0
# ampFFT = np.fft.fft(singerInfoList[artList[j]]['playRec'][:playLth]) / playLth
# sortInd = sorted(xrange(len(ampFFT)),key = (abs(ampFFT)).__getitem__,reverse = True) # descending order
# chsAmp = np.zeros(chsNum[j])
# while i < chsNum[j]:
# chsAmp[i] = ampFFT[sortInd[i]]
# i += 1
# dateRcon = np.zeros((playLth + len(objDateL)))
# ind = np.arange(0,len(dateRcon),1.0) / len(ampFFT) * (2 * np.pi)
# for k, p in enumerate(chsAmp):
# if k != 0:
# p *= 2
# dateRcon += np.real(p) * np.cos(k * ind)
# dateRcon -= np.imag(p) * np.sin(k * ind)
# predictTestFFT[artList[j]] = {}
# predictTestFFT[artList[j]]['playRec'] = deepcopy((list(dateRcon))[playLth:(playLth + len(objDateL))])
#
# if _ISTEST == True:
# flagY = j % 9
# if flagY == 0:
# plt.figure(figsize = (10,8),dpi = 150)
# plt.suptitle('predict test play - use fft')
# plt.subplot(3,3,flagY + 1)
# plt.plot(singerInfoList[artList[j]]['playRec'][playLth:(playLth + len(objDateL))],'b')
# plt.plot(predictTestFFT[artList[j]]['playRec'],'g')
# j += 1
# del ampFFT,sortInd,chsAmp,dateRcon,ind
#
#
#pdb.set_trace()
#%% Plot each artist's play, download and collect curves
xVal = range(len(dateList)) # x-axis values
i = 0
while i < len(singerInfoList): # one subplot per artist, nine per figure
    flagY = i % 9
    if flagY == 0:
        plt.figure(figsize = (10,8), dpi = 150)
        plt.suptitle('every singer average playK-downloadB-colloctR line')
    plt.subplot(3,3,flagY + 1)
    plt.plot(singerInfoList[artList[i]]['playRec'],'k')
    plt.plot(singerInfoList[artList[i]]['downloadRec'],'b')
    plt.plot(singerInfoList[artList[i]]['colloctRec'],'r')
    i += 1
del flagY
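#%% Extract each artist's standard deviation and sort by it
#nCls = 1 # number of classes
#clsTh = 0 # index of the class to use
#
#nSgrToCls = [] # number of artists in each class
#stdPlayList = [] # standard deviations of all artists
#indStdList = [] # indices of the sorted values in the original sequence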
#
#i = 0
#while i < len(artList):
# stdPlayList.append(singerInfoList[artList[i]]['stdPlay'])
# i += 1
#
#indStdList = sorted(xrange(len(stdPlayList)),key = stdPlayList.__getitem__) # ascending order (the sorted() default)
#
#i = 0
#while i < (nCls - 1):
# nSgrToCls.append(int(len(singerInfoList) / nCls))
# i += 1
#if nCls == 1:
# nSgrToCls.append(int(len(singerInfoList)))
#else:
# nSgrToCls.append(int(len(singerInfoList) - (nCls - 1) * nSgrToCls[0]))
#
#nObjSgr = nSgrToCls[clsTh] # number of target artists
#objInd = [] # initialize the corresponding indices
#if clsTh == (nCls -1):
# objInd = indStdList[( (nCls - 1) * nSgrToCls[0] ):]
#else:
# objInd = indStdList[(clsTh * nSgrToCls[0]):((clsTh + 1) * nSgrToCls[0])]
nObjSgr = len(singerInfoList)
objInd = range(nObjSgr)
#%% Concatenate playRec, downloadRec and colloctRec of all artists into flat lists in date order,
# normalizing each artist's play data separately
playList = [] # plays of all artists, concatenated
downList = [] # downloads of all artists, concatenated
collectList = [] # collects of all artists, concatenated
avePlayList = [] # mean-filtered play curve
varPlayList = [] # despite the name, this is the standard-deviation curve
i = 0
while i < nObjSgr:
    artSg = artList[objInd[i]]
    meanPlays = singerInfoList[artSg]['meanPlay']
    stdPlays = singerInfoList[artSg]['stdPlay']
    maxPlays = singerInfoList[artSg]['maxPlay']
    # z-score divided by its largest absolute value, so each artist lies in [-1, 1]
    playList = playList + list( (np.array(singerInfoList[artSg]['playRec']) - meanPlays) / (stdPlays * maxPlays) )
    downList = downList + singerInfoList[artSg]['downloadRec']
    collectList = collectList + singerInfoList[artSg]['colloctRec']
    i += 1
del meanPlays,stdPlays,maxPlays,artSg
# Plot the play, download and collect curves of all artists together
plt.figure(figsize = (10,8), dpi = 150)
plt.plot(playList,'k')
plt.plot(downList,'b')
plt.plot(collectList,'r')
plt.title('overall playK-downB-colloctR')
# Parameters (these strongly affect the result)
seqLength = 10 # sequence length
testSetRate = 0 # fraction of the data held out as the test set
if _ISTEST == True:
    testSetRate = len(objDateL) / len(dateList)
else:
    testSetRate = 0
lenDate = len(dateList) # number of days in the given data set
nSinger = nObjSgr #len(singerInfoList) # number of artists
batchSize = 50
validRate = 0.2
aveFilter = 4 # mean filter length
in_out_neurons = 3 # number of input and output neurons
firLSTM = 35 # number of units in the first LSTM layer
secLSTM = 10 # number of units in the second LSTM layer
epochD = 600 # number of epochs
#%% Mean-filter playList and compute the standard-deviation curve
i = 0
while i < nSinger:
    j = i * lenDate
    fj = i * lenDate # start index of this artist
    ej = (i + 1) * lenDate # end index of this artist
    while j < ej:
        if j < (i * lenDate + aveFilter -1):
            # fewer than aveFilter days available: use everything since the start
            avePlayList.append(np.mean(playList[fj:(j+1)]))
            varPlayList.append(np.std(playList[fj:(j+1)]))
        else:
            avePlayList.append(np.mean(playList[(j-aveFilter+1):(j+1)]))
            varPlayList.append(np.std(playList[(j-aveFilter+1):(j+1)]))
        j +=1
    i +=1
# Show the result of the mean filter
i = 0
while i < nSinger:
    flagY = i % 9
    if flagY == 0:
        plt.figure(figsize = (10,8), dpi =150)
        plt.suptitle('average filter-play-originalK filterB')
    plt.subplot(3,3,flagY + 1)
    stPt = i * lenDate
    endPt = (i + 1) * lenDate
    plt.plot(playList[stPt:endPt],'k')
    plt.plot(avePlayList[stPt:endPt],'b')
    i += 1
# Full data set; the columns are ordered avePlay, play, varPlay (indices 0, 1, 2 below)
dateSet = pd.DataFrame({"avePlay":avePlayList,"play":playList,"varPlay":varPlayList})
dateSet.to_csv("originalDataSet.csv")
dateSetOrigin = deepcopy(dateSet) # keep a copy of the original data set
# Preprocessing: remove the mean, scale to unit variance, rescale to [-1, 1]
#if _DEBUG == True:
# pdb.set_trace()
#avePlayMean = dateSet['avePlay'].mean()
##downMean = dateSet['down'].mean()
#playMean = dateSet['play'].mean()
#
#dateSet['avePlay'] = dateSet['avePlay'] - avePlayMean
##dateSet['down'] = dateSet['down'] - downMean
#dateSet['play'] = dateSet['play'] - playMean
#
#avePlayStd = dateSet['avePlay'].std()
##downStd = dateSet['down'].std()
#playStd = dateSet['play'].std()
#
#dateSet['avePlay'] = dateSet['avePlay'] / avePlayStd
##dateSet['down'] = dateSet['down'] / downStd
#dateSet['play'] = dateSet['play'] / playStd
#
#factorMax = abs(dateSet).max().max() + 0.05
#
#dateSet = dateSet / factorMax
#dateSet.to_csv("preproceeDataSet.csv")
# Play curves of all artists
plt.figure(figsize = (10,8), dpi = 150)
plt.plot(dateSet['play'],'k')
plt.plot(dateSet['avePlay'],'b')
plt.plot(dateSet['varPlay'],'g')
plt.xlabel('index')
plt.ylabel('playK-avePlayB')
plt.title('overall playK-avePlayB-varPlayG - preprocessed')
#%% Train/test split
def load_data(data, n_prev = 14):
    # Each sample is n_prev consecutive rows; its target is the row that follows.
    # as_matrix() is the pandas API of the time; current pandas uses .values instead.
    docX, docY = [], []
    for i in range(len(data)-n_prev):
#        pdb.set_trace()
        docX.append(data.iloc[i:i+n_prev].as_matrix())
        docY.append(data.iloc[i+n_prev].as_matrix())
#    alsX = np.array(docX)
#    alsY = np.array(docY)
    return docX, docY
def train_test_split(df, test_size = 1 / 3, seqL = 14):
    ntrn = int(round(len(df) * (1 - test_size)))
    X_train, y_train = load_data(df.iloc[0:ntrn],seqL)
    X_test, y_test = load_data(df.iloc[ntrn:],seqL)
    return (X_train, y_train), (X_test, y_test)
# Split into training and test sets
if _DEBUG == True:
    pdb.set_trace()
# Initialize with the first artist
(xTrain,yTrain), (xTest,yTest) = train_test_split(dateSet[0:lenDate],testSetRate,seqLength)
needPredict = [] # true values of the sequences to be predicted
tempIndex = int(round(lenDate * (1 - testSetRate)))
if _ISTEST == True:
    needPredict.append(dateSet[0:lenDate].iloc[tempIndex:].as_matrix()) # 3-D: one block per artist with the sequence to predict
i = 1
while i < nSinger:
    startPt = i * lenDate
    endPt = (i + 1) * lenDate
    tempData = dateSet[startPt:endPt]
    (xTrainTp,yTrainTp), (xTestTp,yTestTp) = train_test_split(tempData,testSetRate,seqLength)
    xTrain = np.vstack((xTrain,xTrainTp))
    yTrain = np.vstack((yTrain,yTrainTp))
    xTest = np.vstack((xTest,xTestTp))
    yTest = np.vstack((yTest,yTestTp))
    tempIndex = int(round(len(tempData) * (1 - testSetRate)))
    if _ISTEST == True:
        needPredict.append(tempData.iloc[tempIndex:].as_matrix())
    i += 1
X_Train = np.array(xTrain)
Y_Train = np.array(yTrain)
X_Test = np.array(xTest)
Y_Test = np.array(yTest)
del xTrain, yTrain, xTest, yTest
#%% Plot the sequences to be predicted, to compare them across artists
if _ISTEST == True:
    i = 0
    plt.figure(figsize = (10,8), dpi = 150)
    while i < nSinger:
        orgValue = pd.DataFrame(needPredict[i])
        plt.plot(orgValue[1])
        i += 1
        del orgValue
    plt.suptitle('need predict test data - preprocess data')
#%% Train the model
if _DEBUG == True:
    pdb.set_trace()
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.layers.recurrent import LSTM
from keras.callbacks import EarlyStopping
model = Sequential()
# First LSTM layer: input dimension in_out_neurons, output dimension firLSTM
model.add(LSTM(firLSTM, input_dim=in_out_neurons, input_length=seqLength,return_sequences=True))
model.add(LSTM(secLSTM,return_sequences=False))
#model.add(LSTM(thiLSTM))
# Standard fully connected layer: input dimension secLSTM, output dimension in_out_neurons
model.add(Dense(in_out_neurons,activation='linear'))
model.compile(loss="mse", optimizer="rmsprop") # mse = mean_squared_error
# Stop training early when the validation loss stops improving
earlyStopping = EarlyStopping(monitor = 'val_loss', patience = 10)
# X_Train is a 3-D array; each entry is one sequence
hist = model.fit(X_Train, Y_Train, batch_size=batchSize, nb_epoch=epochD, verbose=0, shuffle = False,validation_split=validRate,callbacks = [earlyStopping])
#print(hist.history)
# Predict on the training set (for debugging)
predictTrain = model.predict(X_Train) # 2-D array; each row is one prediction
predictDF = pd.DataFrame(predictTrain)
Y_TrainDF = pd.DataFrame(Y_Train)
plt.figure(figsize = (10,8), dpi = 150)
plt.plot(list(predictDF[1]),'g')
plt.plot(list(Y_TrainDF[1]),'b')
plt.title('train set predict check')
if _DEBUG == True:
    pdb.set_trace()
#%% Prediction
i = 0
j = 0
predictTest = {} # final predictions for all artists
while j < nSinger:
    artSg = artList[objInd[j]]
    predictTest[artSg] = {}
    predictTest[artSg]['playRec'] = []
    predictTest[artSg]['avePlay'] = []
    predictTest[artSg]['varPlay'] = []
    j += 1
del artSg
if _DEBUG == True:
    pdb.set_trace()
i = 0
j = 0
lastIndex = len(X_Train) / nSinger # number of training windows per artist
while j < nSinger:
    # Start from this artist's last training window.
    # NOTE: load_data ends that window one day before the end of the training range,
    # so the first prediction estimates the last training day, not the first forecast day.
    lastData = np.array([X_Train[int(lastIndex * (j+1) -1)]])
    while i < len(objDateL): # number of days to predict
        predictTp = model.predict(lastData)
        artSg = artList[objInd[j]]
        predictTest[artSg]['varPlay'].append(predictTp[0][2])
        predictTest[artSg]['playRec'].append(predictTp[0][1])
        predictTest[artSg]['avePlay'].append(predictTp[0][0])
        # slide the window: drop the oldest day and append the prediction
        lastData = np.array([np.vstack((lastData[0][1:],predictTp))])
        i += 1
    j += 1
    i = 0
    del lastData, predictTp
del artSg
# Analyze the predictions (before mapping back to the original scale)
i = 0
xIndex = range(len(objDateL))
if _ISTEST == True:
    while i < nSinger: # predicted play curves
        flagY = i % 9
        if flagY == 0:
            plt.figure(figsize = (10,8), dpi = 150)
            plt.suptitle('test set: predict play')
        plt.subplot(3,3,flagY + 1)
        orgValue = pd.DataFrame(needPredict[i]) # needPredict is 3-D: one block of true values per artist
        artSg = artList[objInd[i]]
        plt.plot(xIndex,predictTest[artSg]['playRec'],'g')
        plt.plot(xIndex,orgValue[1],'b')
        i += 1
        del orgValue
    del artSg
i = 0
if _ISTEST == True:
    while i < nSinger: # predicted mean curves
        flagY = i % 9
        if flagY == 0:
            plt.figure(figsize = (10,8), dpi = 150)
            plt.suptitle('test-predict avePlay')
        plt.subplot(3,3,flagY + 1)
        orgValue = pd.DataFrame(needPredict[i])
        artSg = artList[objInd[i]]
        plt.plot(xIndex,predictTest[artSg]['avePlay'],'g')
        plt.plot(xIndex,orgValue[0],'b')
        i += 1
        del orgValue
    del artSg
#i = 0
#while i <nSinger: # predicted collect curves
# flagY = i % 9
# if flagY == 0:
# plt.figure(figsize = (10,8), dpi = 150)
#
# plt.subplot(3,3,flagY +1)
# orgValue = pd.DataFrame(needPredict[i])
# plt.plot(xIndex,predictTest[artList[i]]['colloctRec'],'g')
# plt.plot(xIndex,orgValue[0],'b')
#
# i += 1
# del orgValue
#plt.suptitle('test-predict colloct')
#%% Prediction: map back to the original scale
if _ISTEST == True:
    i = 0
    while i < nSinger:
        flagY = i % 9
        if flagY == 0:
            plt.figure(figsize = (10,8), dpi =150)
            plt.suptitle('test-predict play- back to original')
        plt.subplot(3,3,flagY + 1)
        artSg = artList[objInd[i]]
        meanPlays = singerInfoList[artSg]['meanPlay']
        stdPlays = singerInfoList[artSg]['stdPlay']
        maxPlays = singerInfoList[artSg]['maxPlay']
        # undo the per-artist normalization: x * maxPlay * std + mean
        orgValue = ((pd.DataFrame(needPredict[i]))[1]) * maxPlays * stdPlays + meanPlays
        aftValue = ((pd.DataFrame(predictTest[artSg]['playRec']))[0]) * maxPlays * stdPlays + meanPlays
        plt.plot(xIndex,orgValue,'b')
        plt.plot(xIndex,aftValue,'g')
        i +=1
        del orgValue, aftValue
    del artSg
# Use the predicted avePlay as the forecast of the real play curve
if _ISTEST == True:
    i = 0
    while i < nSinger:
        flagY = i % 9
        if flagY == 0:
            plt.figure(figsize = (10,8), dpi =150)
            plt.suptitle('use avePlay to predict real play line')
        plt.subplot(3,3,flagY + 1)
        artSg = artList[objInd[i]]
        meanPlays = singerInfoList[artSg]['meanPlay']
        stdPlays = singerInfoList[artSg]['stdPlay']
        maxPlays = singerInfoList[artSg]['maxPlay']
        orgValue = ((pd.DataFrame(needPredict[i]))[1]) * maxPlays * stdPlays + meanPlays
        aftValue = ((pd.DataFrame(predictTest[artSg]['avePlay']))[0]) * maxPlays * stdPlays + meanPlays
        plt.plot(xIndex,orgValue,'b')
        plt.plot(xIndex,aftValue,'g')
        i +=1
        del orgValue, aftValue
    del artSg
#%% Blend with SVR
# svr.csv holds one line per artist: artist id, integer SVR result
svrResult = {}
fileSVR = open("svr.csv")
svrData = fileSVR.readlines()
for svrInfo in svrData:
    svrInfo = svrInfo.replace('\n','')
    arrInfo = svrInfo.split(',')
    svrResult[arrInfo[0]] = int(arrInfo[1])
fileSVR.close()
del svrData,svrInfo,arrInfo
#%% Evaluation metric
if _ISTEST == True:
    singerF = [] # metric value F of each artist (LSTM prediction only)
    sumF = 0
    i = 0
    while i < nSinger:
        artSg = artList[objInd[i]]
        meanPlays = singerInfoList[artSg]['meanPlay']
        stdPlays = singerInfoList[artSg]['stdPlay']
        maxPlays = singerInfoList[artSg]['maxPlay']
        orgValue = ((pd.DataFrame(needPredict[i]))[1]) * maxPlays * stdPlays + meanPlays
        aftValue = ((pd.DataFrame(predictTest[artSg]['playRec']))[0]) * maxPlays * stdPlays + meanPlays
        # theta: root mean square of the relative errors; tempFi: square root of the total true plays
        tempArr = (np.array(aftValue) - np.array(orgValue)) / (np.array(orgValue))
        tempS = ((tempArr * tempArr).sum()) / len(objDateL)
        theta = math.sqrt(tempS)
        tempFi = math.sqrt((np.array(orgValue)).sum())
        sumF = sumF + (1-theta) * tempFi
        singerF.append((1-theta) * tempFi)
        i += 1
        del orgValue,aftValue,tempArr
    del artSg
if _ISTEST == True:
    singerFA = [] # metric value F of each artist (LSTM averaged 50/50 with SVR)
    sumF = 0
    i = 0
    while i < nSinger:
        artSg = artList[objInd[i]]
        meanPlays = singerInfoList[artSg]['meanPlay']
        stdPlays = singerInfoList[artSg]['stdPlay']
        maxPlays = singerInfoList[artSg]['maxPlay']
        orgValue = ((pd.DataFrame(needPredict[i]))[1]) * maxPlays * stdPlays + meanPlays
        aftValue = (((pd.DataFrame(predictTest[artSg]['playRec']))[0]) * maxPlays * stdPlays + meanPlays) * 0.5 + svrResult[artSg] * 0.5
        tempArr = (np.array(aftValue) - np.array(orgValue)) / (np.array(orgValue))
        tempS = ((tempArr * tempArr).sum()) / len(objDateL)
        theta = math.sqrt(tempS)
        tempFi = math.sqrt((np.array(orgValue)).sum())
        sumF = sumF + (1-theta) * tempFi
        singerFA.append((1-theta) * tempFi)
        i += 1
        del orgValue,aftValue,tempArr
    del artSg
#    resF = pd.DataFrame({"singerf":singerF})
#    resF.to_csv("singerF.csv")
#%% Metric value when predicting with the mean
#singerF_AVG = [] # metric value F of each artist
#sumF = 0
#i = 0
#while i < nSinger:
# meanPlays = singerInfoList[artList[i]]['meanPlay']
# stdPlays = singerInfoList[artList[i]]['stdPlay']
# maxPlays = singerInfoList[artList[i]]['maxPlay']
#
# orgValue = ((pd.DataFrame(needPredict[i]))[1]) * maxPlays * stdPlays + meanPlays
# aftValue = ((pd.DataFrame(predictTest[artList[i]]['avePlay']))[0]) * maxPlays * stdPlays + meanPlays
#
# tempArr = (np.array(aftValue) - np.array(orgValue)) / (np.array(orgValue))
# tempS = ((tempArr * tempArr).sum()) / len(objDateL)
# theta = math.sqrt(tempS)
#
# tempFi = math.sqrt((np.array(orgValue)).sum())
# sumF = sumF + (1-theta) * tempFi
#
# singerF_AVG.append((1-theta) * tempFi)
#
# i += 1
# del orgValue,aftValue,tempArr
#sum(singerF_AVG[:36]) + sum(singerF_AVG[37:56]) + sum(singerF_AVG[57:])
#%% Write the prediction file
if _ISTEST == False:
    import csv
    resFile = open("mars_tianchi_artist_plays_predict.csv","wb")
    writerRes = csv.writer(resFile)
    i = 0
    j = 1 # objDateL[0] is 20150831, outside the required range, so start at index 1
    while i < nSinger:
        artSg = artList[objInd[i]]
        meanPlays = singerInfoList[artSg]['meanPlay']
        stdPlays = singerInfoList[artSg]['stdPlay']
        maxPlays = singerInfoList[artSg]['maxPlay']
        # LSTM prediction mapped back to the original scale, averaged 50/50 with the SVR result
        aftValue = (((pd.DataFrame(predictTest[artSg]['playRec']))[0]) * maxPlays * stdPlays + meanPlays) * 0.5 + svrResult[artSg] * 0.5
        while j < len(objDateL):
            oneLineData = [artSg,str(int(aftValue[j])),objDateL[j]] # artist id, plays, date
            writerRes.writerow(oneLineData)
            del oneLineData
            j += 1
        del aftValue
        j = 1
        i += 1
    resFile.close()
    del artSg
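The evaluation block in the listing above can be condensed into one function. This is a sketch of what that code computes, not the official competition definition: for each artist, sigma is the root mean square of the daily relative errors, phi is the square root of the artist's total true plays, and the score sums (1 - sigma) * phi over artists. The function name `score` and the sample numbers are hypothetical.

```python
import numpy as np

def score(true_plays, pred_plays):
    """true_plays, pred_plays: dicts mapping artist id -> daily play counts."""
    total = 0.0
    for artist, truth in true_plays.items():
        truth = np.asarray(truth, dtype=float)
        pred = np.asarray(pred_plays[artist], dtype=float)
        rel_err = (pred - truth) / truth        # assumes no day with zero true plays
        sigma = np.sqrt(np.mean(rel_err ** 2))  # 'theta' in the listing
        phi = np.sqrt(truth.sum())              # 'tempFi' in the listing
        total += (1.0 - sigma) * phi
    return total

# Made-up numbers, for illustration only.
truth = {"artist_a": [100.0, 100.0], "artist_b": [50.0, 50.0]}
perfect = score(truth, truth)
off = score(truth, {"artist_a": [110.0, 90.0], "artist_b": [50.0, 50.0]})
```

Because phi grows with an artist's total plays, errors on popular artists cost more than the same relative errors on less popular ones.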
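IV. References
1. A good introductory article on LSTM: A Critical Review of Recurrent Neural Networks for Sequence Learning.
2. An approach to learning LSTM, from a detailed answer on Zhihu: https://www.zhihu.com/question/29411132
3. Introductory Python video course: the open course "用Python玩轉數據" on Coursera by Zhang Li of Nanjing University, which has practical worked examples: https://www.coursera.org/learn/hipython/home/welcome
4. Keras: see the official documentation at http://keras.io/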