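Sampling means selecting sample data from a dataset according to some rule. It falls roughly into three categories: random sampling, systematic sampling, and stratified sampling.
Random sampling: a specified number of items is drawn at random from the dataset, either with replacement or without replacement.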
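import random
def noRepetRandomSampling(dataMat, number):
    '''
    Sampling without replacement.
    :param dataMat: the dataset
    :param number: number of items to draw
    :return: sample, the list of sampled items
    '''
    try:
        # random.sample raises ValueError when number > len(dataMat)
        sample = random.sample(dataMat, number)
        return sample
    except Exception as e:
        # On failure the error is printed and the function returns None
        print(e)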
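def repetRandomSampling(dataMat, number):
    '''
    Sampling with replacement.
    :param dataMat: the dataset
    :param number: number of items to draw
    :return: sample, the list of sampled items
    '''
    sample = []
    i = 0
    while i < number:
        # randint(a, b) includes both endpoints (a <= x <= b), so the upper bound must be len(dataMat) - 1
        sample.append(dataMat[random.randint(0, len(dataMat) - 1)])
        i += 1
    return sample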
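For comparison, Python's standard library covers both cases directly: `random.sample` draws without replacement and `random.choices` draws with replacement. The following is only a minimal sketch; the helper names `sample_without_replacement` and `sample_with_replacement` and the optional `seed` parameter are hypothetical and not part of the code above.

```python
import random

def sample_without_replacement(data, number, seed=None):
    """Draw `number` distinct positions from `data` (hypothetical helper)."""
    rng = random.Random(seed)  # a private generator, so a seed gives repeatable draws
    # Raises ValueError if number > len(data) instead of silently returning None
    return rng.sample(data, number)

def sample_with_replacement(data, number, seed=None):
    """Draw `number` items from `data`; the same item may appear more than once."""
    rng = random.Random(seed)
    return rng.choices(data, k=number)

if __name__ == "__main__":
    data = list(range(1, 21))
    print(sample_without_replacement(data, 6))
    print(sample_with_replacement(data, 6))
```

Letting the `ValueError` propagate is a design choice: the caller finds out immediately that more items were requested than exist, rather than receiving `None`.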
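Systematic sampling: usually done without replacement and also called equal-interval sampling. The ordered dataset is first split into n equal segments, and then the k-th item is taken from each segment. The function below takes the first item of each segment; note that its variable k holds the segment length (the sampling interval), not a position within a segment.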
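import random
def systemSampling(dataMat, number):
    '''
    Systematic sampling.
    :param dataMat: the dataset
    :param number: number of items to draw
    :return: sample, the list of sampled items
    '''
    length = len(dataMat)
    k = length // number  # sampling interval (segment length)
    sample = []
    i = 0
    if k > 0:
        while i < number:
            sample.append(dataMat[0 + i * k])  # first item of the i-th segment
            i += 1
        return sample
    else:
        # More items requested than available: fall back to sampling with replacement
        return repetRandomSampling(dataMat, number)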
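The function above always starts at index 0, so it returns the same items on every call. A common variant picks a random starting offset inside the first segment and then steps through the data at a fixed interval. The sketch below illustrates that idea under stated assumptions: the name `systematic_sample` and the `seed` parameter are hypothetical, and it raises an error instead of falling back to sampling with replacement.

```python
import random

def systematic_sample(data, number, seed=None):
    """Equal-interval sampling with a random start (hypothetical helper)."""
    if number <= 0 or number > len(data):
        raise ValueError("number must be between 1 and len(data)")
    step = len(data) // number                # sampling interval
    start = random.Random(seed).randrange(step)  # random offset in [0, step)
    # The largest index is start + (number - 1) * step <= number * step - 1 < len(data)
    return [data[start + i * step] for i in range(number)]

if __name__ == "__main__":
    print(systematic_sample(list(range(1, 21)), 6))
```

With 20 items and 6 samples the interval is 3, so the result is one of three possible sequences, depending on the starting offset.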
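Stratified sampling: the data is first divided into several categories (strata), a given number of samples is then drawn at random from each stratum, and the samples are combined.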
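import random
def stratifiedSampling(dataMat1, dataMat2, dataMat3, number):
    '''
    Stratified sampling.
    :param dataMat1: dataset 1 (first stratum)
    :param dataMat2: dataset 2 (second stratum)
    :param dataMat3: dataset 3 (third stratum)
    :param number: total number of items to draw
    :return: sample, a list holding one sub-list of sampled items per stratum
    '''
    # Each stratum gets an equal share; if number is not a multiple of 3,
    # fewer than number items are returned in total
    subNumber = number // 3
    sample = []
    sample.append(noRepetRandomSampling(dataMat1, subNumber))
    sample.append(noRepetRandomSampling(dataMat2, subNumber))
    sample.append(noRepetRandomSampling(dataMat3, subNumber))
    return sample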
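The function above is fixed to three strata of equal share. When strata differ in size, one common option is proportional allocation, where each stratum contributes in proportion to its size. The sketch below assumes that approach; the name `stratified_sample`, the `strata` list-of-lists argument, and the `seed` parameter are hypothetical. Because each share is rounded, the total can differ slightly from `number` for some inputs.

```python
import random

def stratified_sample(strata, number, seed=None):
    """Proportional stratified sampling over any number of strata (hypothetical helper)."""
    rng = random.Random(seed)
    total = sum(len(stratum) for stratum in strata)
    sample = []
    for stratum in strata:
        # Share of this stratum, proportional to its size and capped at its length
        share = min(round(number * len(stratum) / total), len(stratum))
        sample.extend(rng.sample(stratum, share))  # without replacement inside a stratum
    return sample

if __name__ == "__main__":
    strata = [list(range(101, 111)), list(range(201, 221)), list(range(301, 331))]
    print(stratified_sample(strata, 6))
```

Here the result is a single flat list rather than one sub-list per stratum, which is usually easier to pass on to later processing.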
Test code:
dataMat=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
dataMat1=[101,102,103,104,105,106,107,108,109,110]
dataMat2=[201,202,203,204,205,206,207,208,209,210]
dataMat3=[301,302,303,304,305,306,307,308,309,310]
print(repetRandomSampling(dataMat,6))
print(noRepetRandomSampling(dataMat,6))
print(systemSampling(dataMat,6))
print(stratifiedSampling(dataMat1,dataMat2,dataMat3,6))
Output:
E:\Anaconda3\python.exe E:/數據採樣.py
[8, 1, 8, 13, 19, 3]
[14, 8, 5, 1, 17, 16]
[1, 4, 7, 10, 13, 16]
[[108, 105], [201, 208], [301, 308]]