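Background: my boss handed me a txt data set with statistics on video-on-demand play durations. It has two variables, video ID and play duration. Part of the data looks like this:
“video id” “play duration”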
“00000000020000047018” “00:29:59”
“00000000020000047031” “00:34:59”
“00000000040001292551” “01:05:00”
“00000000040001294405” “01:05:00”
“00000000040001242053” “00:41:00”
“00000000020000675981” “0”
“00000000020000050729” “00:30:00”
“00000000020000050735” “00:09:34”
“00000000020000050741” “00:04:53”
“00000000020000799816” “0”
“00000000020000675988” “0”
“00000000020000675989” “0”
“00000000020000050777” “00:16:22”
“00000000040001297877” “01:05:00”
… …
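To find the statistical pattern in the play durations, the second field has to be extracted, converted to a number of seconds, sorted, and then drawn as a scatter plot. I processed the data set with Python on Ubuntu and it worked well. The code is as follows: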
import csv
import re
import matplotlib.pyplot as plt
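## convert a time string of the form H:MM:SS into seconds
def time2itv(sTime):
    p = "^([0-9]+):([0-5][0-9]):([0-5][0-9])$"
    cp = re.compile(p)
    try:
        mTime = cp.match(sTime)
    except TypeError:
        return "[InModuleError]:time2itv(sTime) invalid argument type"
    if mTime:
        # unpack rather than index, so this also works when map() returns an iterator (Python 3)
        h, m, s = map(int, mTime.group(1, 2, 3))
        return 3600 * h + 60 * m + s
    else:
        # anything that does not match the pattern, such as "0", counts as 0 seconds
        return 0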
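## write the list of durations (in seconds) to a csv file
def write2csv(stime):
    # Python 3: open in text mode with newline='' (Python 2 would use 'wb')
    with open('result.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['time'])
        # note: [stime] is one row, so every value ends up on the same line
        writer.writerows([stime])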
## read the id and time fields from the txt file, which are separated by '\t'
timelist = []
with open("test.txt") as f:
    for line in f:
        numbers, time = line.split("\t")
        # remove surrounding whitespace and double quotes
        time = time.strip()
        time = time.strip('"')
        time = time2itv(time)
        timelist.append(time)
# sort the durations in descending order before writing and plotting
timelist.sort(reverse=True)
write2csv(timelist)
## plot the sorted durations as points
y = timelist
plt.plot(y, marker='o')
plt.show()
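The result is shown in the figure below:
The long-tail distribution is very pronounced, which is consistent with the Pareto principle.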
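One way to put a number on the long-tail claim is to measure how much of the total play time comes from the longest 20% of entries. The following is only a minimal sketch and was not part of the original script: the function name `top_share` and the sample values are hypothetical, and the 20% cut-off is simply the conventional Pareto figure.

```python
def top_share(durations, fraction=0.2):
    """Return the share of the total duration contributed by the top `fraction` of items."""
    if not durations:
        return 0.0
    ordered = sorted(durations, reverse=True)
    # Always count at least one item as the "top" group.
    k = max(1, int(len(ordered) * fraction))
    total = sum(ordered)
    if total == 0:
        return 0.0
    return sum(ordered[:k]) / total


# A handful of durations in seconds, for illustration only.
sample = [3900, 3900, 2460, 1800, 574, 293, 0, 0, 0, 0]
print(top_share(sample))
```

A value well above 0.2 means that a small group of long videos accounts for a disproportionate share of the total, which is what a long-tail shape suggests.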
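Problems solved:
1. Reading a txt file and writing a csv file in Python;
2. Removing unnecessary characters, such as spaces and quotation marks, from the required field;
3. Converting the time format from XX:YY:ZZ to XXX (a plain number of seconds);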
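Open problems:
1. To extract usable numbers for later use, I chose to store the play durations in a list, but writing them to the csv file went wrong: the data is laid out in a single row rather than in a column as I expected.
2. The next step is to fit the data using the SciPy library installed for Python.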
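On the first open problem: `csv.writer.writerows` treats each element of its argument as one row, so passing a list that contains the whole duration list produces a single row holding every value. Wrapping each value in its own one-element sequence gives one value per row instead. Below is a minimal sketch, assuming that a single-column layout is what was intended. `write_column` is a hypothetical helper, and it writes to any file-like object so that it can be tried without touching the disk.

```python
import csv
import io


def write_column(values, fileobj, header="time"):
    """Write `values` as a single CSV column, one value per row."""
    writer = csv.writer(fileobj)
    writer.writerow([header])
    # Each row must itself be a sequence, so wrap every value in a list.
    writer.writerows([v] for v in values)


# Demonstrate with an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_column([3900, 1800, 0], buffer)
print(buffer.getvalue())
```

To write to disk in Python 3, the same function can be given a file opened with `open('result.csv', 'w', newline='')`.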
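A month later, my own code looks like garbage to me.
The only part that looks decent is the regular expression for computing the time, and I seem to have copied that from someone else.
So I rewrote the code.
At least it is easier on the eyes now.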
from numpy import array
from matplotlib import pyplot
def get_time(filename):
    stime = []
    with open(filename) as readfile:
        for line in readfile:
            video_id, time = line.split("\t")
            # remove surrounding whitespace and double quotes
            time = time.strip()
            time = time.strip('"')
            if time != '0':
                # split HH:MM:SS into its parts and convert to seconds
                time = time.split(':')
                hour = int(time[0])
                minute = int(time[1])
                second = int(time[2])
                total_time = 3600 * hour + 60 * minute + second
            else:
                total_time = 0
            stime.append(total_time)
    return array(stime)
def draw_hist(lengths):
    # histogram of the play durations, using 100 bins
    pyplot.hist(lengths, 100)
    pyplot.xlabel('Length (seconds)')
    pyplot.xlim(0.0, 10000)
    pyplot.ylabel('Frequency')
    pyplot.title('Length Of Video Play Time')
    pyplot.show()
stime=get_time("STAT_CONTENT_TIME.txt")
draw_hist(stime)
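If the input file contains the header row shown in the sample data, or any field that is neither "0" nor in HH:MM:SS form, the `int()` calls in the rewritten version above raise `ValueError`. A more defensive variant might look like the sketch below. Skipping such rows is an assumption about how they should be handled, the names `parse_duration` and `read_durations` are hypothetical, and the pattern is the regular expression from the first version.

```python
import io
import re

# Same pattern as in the first version: H:MM:SS with minutes and seconds in 00-59.
TIME_PATTERN = re.compile(r"([0-9]+):([0-5][0-9]):([0-5][0-9])")


def parse_duration(field):
    """Convert a field such as '"01:05:00"' or '"0"' to seconds; return None if unparseable."""
    text = field.strip().strip('"')
    if text == "0":
        return 0
    match = TIME_PATTERN.fullmatch(text)
    if match is None:
        return None
    hours, minutes, seconds = map(int, match.groups())
    return 3600 * hours + 60 * minutes + seconds


def read_durations(lines):
    """Yield durations in seconds from tab-separated lines, skipping rows that do not parse."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue
        seconds = parse_duration(fields[1])
        if seconds is not None:
            yield seconds


# Inline sample in the described format; a real run would pass an open file instead.
sample = io.StringIO(
    '"video id"\t"play duration"\n'
    '"00000000020000047018"\t"00:29:59"\n'
    '"00000000020000675981"\t"0"\n'
    '"00000000040001292551"\t"01:05:00"\n'
)
print(list(read_durations(sample)))
```

Because `read_durations` accepts any iterable of lines, it can be used as `array(list(read_durations(open(filename))))` in place of the loop in `get_time`.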