Extracting a field from a data set and analyzing its distribution

Original

2020-02-22 09:44

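Background: my boss gave me a txt data set of video-on-demand play durations. It has two variables, the video ID and the play duration. Part of the data looks like this: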

“video id” “play duration”
“00000000020000047018” “00:29:59”
“00000000020000047031” “00:34:59”
“00000000040001292551” “01:05:00”
“00000000040001294405” “01:05:00”
“00000000040001242053” “00:41:00”
“00000000020000675981” “0”
“00000000020000050729” “00:30:00”
“00000000020000050735” “00:09:34”
“00000000020000050741” “00:04:53”
“00000000020000799816” “0”
“00000000020000675988” “0”
“00000000020000675989” “0”
“00000000020000050777” “00:16:22”
“00000000040001297877” “01:05:00”
… …

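To find the statistical pattern of the play durations, the second field has to be extracted, converted to seconds, sorted, and drawn as a scatter plot. I processed the data set with Python on Ubuntu and it worked well. The code is as follows: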

import csv
import re

import matplotlib.pyplot as plt

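# Convert an "HH:MM:SS" string to a number of seconds.
def time2itv(sTime):
    p = "^([0-9]+):([0-5][0-9]):([0-5][0-9])$"
    cp = re.compile(p)
    try:
        mTime = cp.match(sTime)
    except TypeError:
        return "[InModuleError]:time2itv(sTime) invalid argument type"

    if mTime:
        # Build a list so the fields can be indexed (map() is lazy in Python 3).
        t = [int(x) for x in mTime.group(1, 2, 3)]
        return 3600 * t[0] + 60 * t[1] + t[2]
    else:
        # Anything that is not HH:MM:SS (for example "0") counts as 0 seconds.
        return 0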

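# Write the durations (already in seconds and sorted) to result.csv.
def write2csv(stime):
    # Text mode with newline='' is what the csv module expects in Python 3.
    with open('result.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['time'])
        # [stime] is a single row, so all values end up on one line.
        writer.writerows([stime])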

# Read test.txt: each line holds a video id and a duration separated by '\t'.
timelist = []

with open("test.txt") as infile:
    for line in infile:
        numbers, duration = line.split("\t")
        # Drop surrounding whitespace and the double quotes around the field.
        duration = duration.strip().strip('"')
        timelist.append(time2itv(duration))

timelist.sort(reverse=True)

write2csv(timelist)

# Plot the sorted durations as points.
y = timelist

plt.plot(y, marker='o')
plt.show()

The result is shown in the figure below:

The long-tail pattern is very pronounced, consistent with the Pareto principle.
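
One way to put a number on the long-tail claim is to ask what share of the total play time comes from the longest 20% of videos. The sketch below is only illustrative: the function name `top_share` and the sample durations are made up for this example and are not taken from the data set.

```python
def top_share(durations, fraction=0.2):
    """Return the share of the total contributed by the largest `fraction` of values."""
    ordered = sorted(durations, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    # Keep at least one item so very short lists still give an answer.
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / total


# Hypothetical durations in seconds, for illustration only.
sample = [3900, 3900, 2460, 1800, 982, 574, 293, 0, 0, 0]
print(top_share(sample))
```

Under the usual 80/20 reading of the Pareto principle, a result close to 0.8 for `fraction=0.2` would support the claim. A much lower value would suggest the tail is lighter than it looks in the plot.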

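Problems solved:
1. Reading a txt file and writing a csv file with Python;
2. Removing unneeded characters such as spaces and quotation marks from the required field;
3. Converting the time format from XX:YY:ZZ to a plain number of seconds (XXX).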
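
The three steps above can also be sketched with the standard `csv` module, which removes the quotes itself when told the delimiter and quote character. This assumes the file uses tabs and straight double quotes, as the code in this post implies. The helper names `parse_duration` and `read_durations` and the in-memory sample are hypothetical.

```python
import csv
import io


def parse_duration(text):
    """Convert 'HH:MM:SS' to seconds; anything else (such as '0') becomes 0."""
    parts = text.split(":")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return 0
    hours, minutes, seconds = (int(p) for p in parts)
    return 3600 * hours + 60 * minutes + seconds


def read_durations(lines):
    """Yield play durations in seconds from tab-separated, double-quoted lines."""
    reader = csv.reader(lines, delimiter="\t", quotechar='"')
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) >= 2:
            yield parse_duration(row[1].strip())


# In-memory stand-in for the text file, assuming straight quotes and tabs.
sample = io.StringIO(
    '"video id"\t"play duration"\n'
    '"00000000020000047018"\t"00:29:59"\n'
    '"00000000020000675981"\t"0"\n'
    '"00000000040001292551"\t"01:05:00"\n'
)
print(list(read_durations(sample)))  # [1799, 0, 3900]
```

For a real file, an open file object can be passed in place of the `io.StringIO` sample.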

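Open problems:
1. To extract the numeric values and make them easy to reuse later, I stored the play durations in a list, but writing them to the csv file did not go as expected: the data came out along a row instead of down a column.
2. The next step is to fit the data using the SciPy library for Python.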
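
The row-wise output in item 1 most likely comes from `writerows([stime])`, which hands the writer one row containing every value. A minimal sketch of a column-wise alternative, assuming Python 3; the function name `write_column` and the file name `result_column.csv` are made up for this example:

```python
import csv


def write_column(path, values, header="time"):
    """Write `values` to a CSV file as one column under `header`."""
    with open(path, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow([header])
        # One single-element list per value gives one row per value.
        writer.writerows([v] for v in values)


write_column("result_column.csv", [3900, 1799, 0])
```

Each value is wrapped in its own one-element list, so the writer emits one line per duration.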

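A month later, the code I wrote looks like garbage to me.
The only part that looks decent is the regular expression for the time, and I think I copied that from someone else.
So I rewrote this code;
at least it is easier on the eyes now.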

from numpy import array
from matplotlib import pyplot

# Read the file and return the play durations in seconds as a NumPy array.
def get_time(filename):
    stime = []
    with open(filename) as readfile:
        for line in readfile:
            video_id, time = line.split("\t")
            time = time.strip()
            time = time.strip('"')

            if time != '0':
                # A non-numeric line (such as a header row) would raise ValueError here.
                hour, minute, second = (int(part) for part in time.split(':'))
                total_time = 3600 * hour + 60 * minute + second
            else:
                total_time = 0
            stime.append(total_time)
    return array(stime)

# Draw a 100-bin histogram of the durations.
def draw_hist(lengths):
    pyplot.hist(lengths, 100)

    pyplot.xlabel('Length (seconds)')
    pyplot.xlim(0.0, 10000)
    pyplot.ylabel('Frequency')
    pyplot.title('Distribution of play durations')
    pyplot.show()

stime = get_time("STAT_CONTENT_TIME.txt")
draw_hist(stime)

霸都湯抖森
