A Comprehensive Collection of Random Sampling Algorithms (Drawing M out of N Numbers, and More)

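1. Drawing M numbers out of N with equal probability

Drawing M samples out of N with equal probability (M<N) is a common requirement. Here we use an array to stand in for the samples and look at how to implement it.
The most obvious approach is to draw directly with equal probability. Each time, draw a random number in [0, N-1] (assuming the first sample has index 0) and compare it with the numbers drawn so far. If it equals one of them, draw again until you get a number that differs from all previous ones. If it is new, add it to the result set and keep drawing until the result set contains M numbers.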

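    // Draws 3 distinct numbers from [0, 24), retrying whenever a duplicate comes up.
    // RandomUtils is assumed to be org.apache.commons.lang3.RandomUtils,
    // whose nextInt(a, b) returns a value in [a, b).
    public static Set<Integer> sampletest() {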
        Set<Integer> set = new HashSet<>();
        int first = RandomUtils.nextInt(0, 24);
        int second = RandomUtils.nextInt(0, 24);
        int third = RandomUtils.nextInt(0, 24);
        set.add(first);
        while(set.contains(second)) {
            second = RandomUtils.nextInt(0, 24);
        }
        set.add(second);
        while(set.contains(third)) {
            third = RandomUtils.nextInt(0, 24);
        }
        set.add(third);
        return set;
    }
    public static void samplemassive() {
        Map<Integer, Integer> map = new HashMap<>();
        for(int i=0; i<10000; i++) {
            Set<Integer> res = sampletest();
            for(int each: res) {
                map.put(each, map.getOrDefault(each, 0) + 1);
            }
        }
        for(Map.Entry<Integer, Integer> entry: map.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }

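The code above draws 3 numbers at random out of 24, repeats that 10000 times, and prints the totals.
Running samplemassive prints how many times each of the 24 numbers was drawn.

In total 10000*3=30000 numbers are drawn, so each number should appear 30000/24=1250 times on average. The counts roughly match a uniform distribution.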

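The problem with this algorithm is that as m grows, each call to random is more and more likely to return a number that was already drawn, so the while loop calls random more and more often and the running time goes up.
So what exactly is the time complexity? We can analyze it quantitatively.
Suppose x numbers have already been generated and we now generate the (x+1)-th.
The probability that the (x+1)-th number is obtained on the first call to random is $1 - \frac{x}{n}$
The probability that it is obtained on exactly the second call is $(1 - \frac{x}{n})\frac{x}{n}$
The probability that it is obtained on exactly the k-th call is $(1 - \frac{x}{n}){(\frac{x}{n})}^{k-1}$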

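The expected number of calls to random needed to generate the (x+1)-th number is therefore:
$E(x+1) = (1 - \frac{x}{n}) \cdot 1 + (1 - \frac{x}{n})\frac{x}{n} \cdot 2 + \cdots + (1 - \frac{x}{n}){(\frac{x}{n})}^{k-1} \cdot k + \cdots = \frac{n}{n-x}$
For how to sum this arithmetico-geometric series, see reference 1; it only takes secondary-school mathematics.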
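
For readers who want the intermediate steps, here is one way to reach the closed form $\frac{n}{n-x}$. The symbols $p$ and $S$ are introduced only for this sketch: $p = \frac{x}{n}$ is the probability that a single call collides with an earlier number, and the derivation assumes $x < n$ so that $p < 1$.

$E(x+1) = (1-p) S$, where $S = \sum_{k=1}^{\infty} k p^{k-1}$
$S - pS = \sum_{k=1}^{\infty} k p^{k-1} - \sum_{k=1}^{\infty} k p^{k} = \sum_{k=1}^{\infty} p^{k-1} = \frac{1}{1-p}$
$S = \frac{1}{(1-p)^2}$, so $E(x+1) = \frac{1}{1-p} = \frac{n}{n-x}$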

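Summing over x = 0, 1, ..., m-1, the expected total number of calls to random is:
$E(random) = \frac{n}{n} + \frac{n}{n-1} + \cdots + \frac{n}{n-m+1} \approx n(\ln n - \ln(n-m))$, that is, $O(n(\lg n - \lg(n-m)))$
When m is close to n, the time complexity approaches $O(n \log n)$, which is fairly high.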
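
The estimate can be checked empirically. The sketch below is not the article's code: the class RejectionCost and its two methods are hypothetical names, and it uses java.util.Random instead of RandomUtils so that it runs without extra libraries. It counts how many random calls the retry approach makes and compares the average with the exact sum $\sum_{x=0}^{m-1} \frac{n}{n-x}$.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

class RejectionCost {
    // Draws m distinct values from [0, n) by retrying on duplicates and
    // returns how many times the random generator was called.
    static long countCalls(int n, int m, Random rng) {
        Set<Integer> seen = new HashSet<>();
        long calls = 0;
        while (seen.size() < m) {
            calls++;
            seen.add(rng.nextInt(n)); // adding a duplicate leaves the set unchanged
        }
        return calls;
    }

    // Expected number of calls: sum over x = 0..m-1 of n / (n - x).
    static double expectedCalls(int n, int m) {
        double sum = 0.0;
        for (int x = 0; x < m; x++) {
            sum += (double) n / (n - x);
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1000, m = 900, trials = 2000;
        Random rng = new Random(42);
        long total = 0;
        for (int t = 0; t < trials; t++) {
            total += countCalls(n, m, rng);
        }
        System.out.printf("measured average: %.1f%n", (double) total / trials);
        System.out.printf("predicted:        %.1f%n", expectedCalls(n, m));
    }
}
```

The two printed numbers should be close to each other, and the gap between them and m widens as m approaches n.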

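The three-draw sampletest method above is rather clumsy because everything is hard-coded, so here is a general version that draws M numbers out of N.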

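    // Draws m distinct numbers from [0, n), retrying whenever a duplicate comes up.
    public static Set<Integer> sampletest(int n, int m) {
        if (m < 0 || m > n) {
            // with m > n the loop below could never finish
            throw new IllegalArgumentException("m must be between 0 and n");
        }
        Set<Integer> set = new HashSet<>();
        while(set.size() < m) {
            int tmp = RandomUtils.nextInt(0, n);
            while(set.contains(tmp)) {
                tmp = RandomUtils.nextInt(0, n);
            }
            set.add(tmp);
        }
        return set;
    }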

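Here n is the size of the original sample set and m is the number of samples to draw. m must not exceed n, otherwise the loop can never finish.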

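2. An O(n) algorithm for drawing M numbers out of N

The algorithm above has an expected time complexity of $O(n \lg n)$ when m is close to n. Is there an algorithm with lower time complexity?
There is: reservoir sampling does the job in a single pass over the n elements.
For how reservoir sampling works in detail, see reference 2.
Let's go straight to an example.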

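    // Keeps the first m elements, then lets element i (0-based) replace a random
    // slot of the result with probability m / (i + 1).
    public static int[] reservoir(int[] array, int m) {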
        int[] result = new int[m];
        int n = array.length;
        for(int i=0; i<n; i++) {
            int current_num = array[i];
            if(i < m) {
                result[i] = current_num;
            } else {
                int tmp = RandomUtils.nextInt(0, i+1);
                if(tmp < m) {
                    result[tmp] = current_num;
                }
            }
        }
        return result;
    }

    public static void massive_reservoir() {
        int[] array = {0, 1, 2, 3, 4};
        int m = 2;
        Map<Integer, Integer> map = new HashMap<>();
        for(int i=0; i<10000; i++) {
            int[] result = reservoir(array, m);
            for(int each: result) {
                map.put(each, map.getOrDefault(each, 0) + 1);
            }
        }
        for(Map.Entry<Integer, Integer> entry: map.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
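
Why the reservoir method gives every element the same chance can be checked directly against the loop in reservoir. This is only a sketch, using 0-based positions as in the code. The element at position $j \ge m$ enters the result with probability $\frac{m}{j+1}$ and overwrites one particular slot with probability $\frac{1}{j+1}$, so an element that is already in the result survives step $j$ with probability $\frac{j}{j+1}$. For an element at position $i \ge m$:

$P(i \text{ in result}) = \frac{m}{i+1} \cdot \frac{i+1}{i+2} \cdots \frac{n-1}{n} = \frac{m}{n}$

For one of the first $m$ elements, which start out in the result:

$P = \frac{m}{m+1} \cdot \frac{m+1}{m+2} \cdots \frac{n-1}{n} = \frac{m}{n}$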

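The massive_reservoir method simulates drawing two numbers at random from {0, 1, 2, 3, 4}, repeated 10000 times.
Running it prints the count for each of the five numbers. With 10000*2=20000 draws over five numbers, each count should come out near 4000.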

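3. Randomly drawing an ordered list

The samples above are unordered; the only requirement is that every element ends up with the same probability. For example, when drawing two numbers from {0, 1, 2, 3, 4}, the first one drawn might be 0 or it might be 4. What if we need the sample to keep the original order?
This comes up often in practice. In a traffic allocation system, for instance, requests arrive as a stream, that is, in order. Suppose ten requests arrive one after another and we need to pick three of them at random to show three ads, with every request equally likely to get an ad. That scenario amounts to drawing an ordered list.
Knuth gives an algorithm for this in The Art of Computer Programming, Volume 2: Seminumerical Algorithms.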

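    /* Rand(a, b) is assumed to return a uniform random integer in [a, b]. */
    void GenerateKnuth(int n, int m)
    {
        int t = m;  /* number of items still to be selected */
        for (int i = 0; i < n; i++)
            if (Rand(0, n - 1 - i) < t)  /* i.e. run the body with probability t/(n-i) */
            {
                printf("%d\n", i);
                t--;
            }
    }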

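Here n is the total length of the list to sample from and m is the number of results wanted. A Java version with n = 10 and m = 3: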

    public static List<Integer> randomtest() {
        int m = 3;
        int tmp = m;
        int[] array = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        int len = array.length;
        List<Integer> list = new ArrayList<>();
        for(int i=0; i<len; i++) {
            if(RandomUtils.nextInt(0, len - i) < tmp) {
                list.add(array[i]);
                tmp--;
            }
        }
        return list;
    }

    public static void massive_randomtest() {
        Map<Integer, Integer> map = new HashMap<>();
        for(int i=0; i<10000; i++) {
            List<Integer> list = randomtest();
            for(int each: list) {
                map.put(each, map.getOrDefault(each, 0) + 1);
            }
        }
        for(Map.Entry<Integer, Integer> entry: map.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }

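The code above follows the algorithm as stated. While implementing this in a project, I came up with another formulation that is easier to understand and more intuitive. A simple argument for why it works:
1. Every sample must be selected with probability $\frac{m}{n}$ .
2. The first sample is simply selected with probability $\frac{m}{n}$ .
3. For the second sample: if the first sample was selected, the second is selected with probability $\frac{m-1}{n-1}$ . If the first sample was not selected, the second is selected with probability $\frac{m}{n-1}$ . Overall, the second sample is selected with probability $\frac{m-1}{n-1} \cdot \frac{m}{n} + \frac{m}{n-1} \cdot (1 - \frac{m}{n}) = \frac{m}{n}$ .
4. The i-th sample is selected with probability $\frac{m - k}{n - i + 1}$ , where k is the number of samples already selected, k<=m. In other words, when we reach the i-th sample with k already selected, we still have to pick m-k out of the remaining n-i+1 samples.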
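
Step 3 only works out the second sample by hand. As a sanity check for the general case, the sketch below computes the exact selection probability of every position by tracking the distribution of k, the number of samples selected so far. The class and method names are hypothetical, positions are 0-based (so the denominator is n - i rather than n - i + 1), and it assumes 0 <= m <= n.

```java
class SelectionSamplingCheck {
    // Returns, for each position i in 0..n-1, the exact probability that
    // item i is selected when m of n items are requested.
    static double[] inclusionProbabilities(int n, int m) {
        double[] pick = new double[n];
        // state[k] = probability that exactly k items were selected so far
        double[] state = new double[m + 1];
        state[0] = 1.0;
        for (int i = 0; i < n; i++) {
            double[] next = new double[m + 1];
            for (int k = 0; k <= m; k++) {
                if (state[k] == 0.0) {
                    continue; // this count cannot occur at position i
                }
                double p = (double) (m - k) / (n - i); // chance of selecting item i
                pick[i] += state[k] * p;
                if (k < m) {
                    next[k + 1] += state[k] * p;
                }
                next[k] += state[k] * (1.0 - p);
            }
            state = next;
        }
        return pick;
    }

    public static void main(String[] args) {
        // Every printed value should equal m/n = 0.3 up to rounding error.
        for (double p : inclusionProbabilities(10, 3)) {
            System.out.println(p);
        }
    }
}
```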

Here is the implementation again following my own reasoning. The code is simpler and the idea is clearer:

    public static List<Integer> randomtest() {
        int m = 3;
        double costednum = 0.0;
        int[] array = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        int len = array.length;
        List<Integer> list = new ArrayList<>();
        for(int i=0; i<len; i++) {
            double probability = (m - costednum) / (len - i);
            double value = Math.random();
            if(probability > value) {
                list.add(array[i]);
                costednum += 1;
            }
        }
        return list;
    }

    public static void massive_randomtest() {
        Map<Integer, Integer> map = new HashMap<>();
        for(int i=0; i<10000; i++) {
            List<Integer> list = randomtest();
            for(int each: list) {
                map.put(each, map.getOrDefault(each, 0) + 1);
            }
        }
        for(Map.Entry<Integer, Integer> entry: map.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
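
Both randomtest listings hard-code the array and m. A parameterized variant might look like the sketch below. The name orderedSample and the argument check are my own additions rather than part of the original code, and java.util.Random replaces RandomUtils so that the example runs without extra libraries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class OrderedSample {
    // Selects exactly m items from array, keeping them in their original order.
    static List<Integer> orderedSample(int[] array, int m, Random rng) {
        int n = array.length;
        if (m < 0 || m > n) {
            throw new IllegalArgumentException("m must be between 0 and array.length");
        }
        List<Integer> picked = new ArrayList<>(m);
        int remaining = m; // items still to be selected
        for (int i = 0; i < n && remaining > 0; i++) {
            // select array[i] with probability remaining / (n - i)
            if (rng.nextInt(n - i) < remaining) {
                picked.add(array[i]);
                remaining--;
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        int[] array = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        // Prints three values from the array in ascending order.
        System.out.println(orderedSample(array, 3, new Random()));
    }
}
```

Because the selection probability reaches 1 once the number of remaining items equals the number still needed, the method always returns exactly m items.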

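Running massive_randomtest prints the count for each of the ten numbers. With 10000*3=30000 selections over ten numbers, each count should come out near 3000.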

References:
1. https://zh.wikipedia.org/wiki/等差-等比數列
2. https://blog.csdn.net/bitcarmanlee/article/details/52719202
