面試題之10億正整數問題--完整解答

好了，不說這個了。下面進入正題。

這個題目來源於某公司的面試題，是absolute同學在我的“面試題收集貼”中提出的，之後CMGS同學在回覆中提到，騰訊今年的面試題中有類似題目，問題規模擴大了10倍，但是本質相同。下面我們來看一下題目：

10億個正整數，只有其中1個數重複出現過，要在O(n)的時間裏面找出這個數，內存要儘可能少（小於100M）。

這個問題，下面有同學提出題目不嚴謹，我看了一下，的確是和我希望表達的意思有些模糊，我這裏將題目從我自己的角度澄清一下：

對於正整數，其範圍爲1-10億，然後從中隨機的選擇出k個數（可以重複，也可以不重複，從具體題目的要求來看），選擇完之後開始做題目，無論是對其排序，還是找重複數。

不知道這樣表達是不是清楚。

我們首先不急着來解決這個問題，而是從其來源來慢慢看。這裏提到的題目與面試題有一些不同（在面試題中是要求找重複的數，而這裏的題目是對其進行排序），所以不要感覺疑惑。

[來源]

這種類型的題目，其來源爲《編程珠璣》這本書，這裏推薦這本書一下，但是由於這本書比較薄，所以裏面涉及到的知識只能簡單提到，但是並不能夠完全覆蓋，所以讀這本書還是要經歷將其讀厚的過程，對涉及到的知識點，通過其它參考書來得到關於其的全部知識。另外，儘管這本書的翻譯還不錯，但是還是推薦下載或得到其英文版進行參考，對於中文版中感覺瞭解不是很清楚的地方，再來對照英文版看看。

該題的原型是在《編程珠璣》的開頭，題目在“面試題之10億正整數問題”中我已經詳細介紹過了，而且園子中應該大部分兄弟都有這本書吧，我就不詳細將書中內容重新打一遍了。

大概介紹一下，作者通過和程序員交談，瞭解到程序員需要在一個大系統中實現一個電話號碼的文本數據庫，讀入電話號碼，輸出排序好的以800開頭的文件。

通過作者的整理，得到精確的問題陳述如下

輸入：

所輸入的是一個文件，至多包含n個正整數，每個正整數都要小於n，這裏的n爲10^7。如果輸入時某一個整數出現了兩次，就會產生一個致命的錯誤。這些整數與其他任何數據都不關聯。

輸出：

以增序形式輸出經過排序的整數列表。（這裏應該補充一下是文件形式嗎？）

約束：

至多（大概）只有1MB的可用主存，但是可用磁盤空間非常充足。運行時間至多隻允許幾分鐘，最適宜的時間大概爲10秒鐘。

[解答]

從上面的這些描述來看一下，題1爲10億正整數的問題，題2爲《編程珠璣》中的問題。

題1中問題規模爲10億，也就是10^9，內存要求爲小於100M，問題要求是找出重複的數；

題2中問題規模爲10^7，而內存要求爲1M左右，問題要求是對這些數進行排序。

同時題1中要求算法的時間複雜度爲O(n)。

這裏我們就來看一下《編程珠璣》中是怎樣來解決這個問題的。

《編程珠璣》中使用了三種方法來對其進行解決。一種爲Merge Sort，一種爲Multi-pass Sort，還有一種爲作者起名的Wonder Sort。這裏我們會一個個分析一下這些算法，包括時間複雜度和空間複雜度，同時在後面還會有實例的說明。

Merge Sort

Merge Sort大家應該很熟悉，是慣常使用的一種外排序方法，其主要思想是採用了分而治之的思想，該思想被應用於多個算法領域。其中文稱爲歸併排序，在大部分的算法書中都會提到，同時也是外排序中經常使用到的排序方法。

歸併排序可以通過迭代和遞歸兩種方式來實現。

這些實現在殷人昆的那本黃書中可以找到（大家應該知道我說的是哪本書吧），但是他的那本書的實現有些不太好，使用了datalist和staticlinklist這兩個與歸併排序本身無關的數據結構，而這裏我們用來示例的話只需要int的數組即可。（另外加上文件讀寫操作，針對題目的話），所以這裏我們僅僅參考一部分黃書中的實現，來自己實現歸併排序方法。

歸併排序可以使用多種方法來進行實現，這裏只談常規方式下的歸併排序，在這種情況下，我們需要和被排序數組同樣大小的一個額外空間來輔助進行排序。

1 #include <iostream>
2 using namespace std;
3
4 void merge(int initlist[], int mergedlist[],int l,int m, int n)
5 {
6     int i=l,j=m+1,k=l;
7     while(i<=m && j<=n)
8     {
9         if(initlist[i]<=initlist[j])
10         {
11             mergedlist[k]=initlist[i];
12             i++;
13             k++;
14         }
15         else
16         {
17             mergedlist[k]=initlist[j];
18             j++;
19             k++;
20         }
21     }
22     if(i<=m)
23     {
24         for(int n1=k,n2=i;n1<=n&&n2<=m;n1++,n2++)
25         {
26             mergedlist[n1]=initlist[n2];
27         }
28     }
29     else
30     {
31         for(int n1=k,n2=j;n1<=n&&n2<=n;n1++,n2++)
32         {
33             mergedlist[n1]=initlist[n2];
34         }
35     }
36 }
37
38 void mergepass(int initlist[], int mergedlist[], const int len,const int listlen)
39 {
40     int i=0;
41     while(i+2*len<=listlen-1)
42     {
43         merge(initlist,mergedlist,i,i+len-1,i+2*len-1);
44         i+=2*len;
45     }
46     if(i+len<=listlen-1)
47     {
48         merge(initlist,mergedlist,i,i+len-1,listlen-1);
49     }
50     else
51     {
52         for(int j=i;j<=listlen-1;j++)
53         {
54             mergedlist[j]=initlist[j];
55         }
56     }
57 }
58
59 void mergesort(int list[], int listlen)
60 {
61     int *templist=new int[listlen];
62     int len=1;
63     while(len<listlen-1)
64     {
65         mergepass(list,templist,len,listlen);
66         len*=2;
67         mergepass(templist,list,len,listlen);
68         len*=2;
69     }
70     delete []templist;
71 }
72
73 int main()
74 {
75     int initlist[]={21,25,49,25,93,62,72,8,37,16,54};
76
77     int i;
78     for(i=0;i<sizeof(initlist)/sizeof(int);i++)
79         cout<<initlist[i]<<" ";
80     cout<<endl;
81
82     mergesort(initlist,sizeof(initlist)/sizeof(int));
83     for(i=0;i<sizeof(initlist)/sizeof(int);i++)
84         cout<<initlist[i]<<" ";
85     cout<<endl;
86
87
88     return 0;
89 }

歸併排序的時間複雜度和空間複雜度:

時間複雜度爲O(nlgn)，空間複雜度爲O(n)。實例說明見後。

Multi-pass Sort

在中文翻譯中，稱其爲多通道排序。但是從其描述來看，似乎是多趟排序更爲確切。

補充：eaglet提到所謂多通道排序的原型就是B+樹。對於B+樹沒有具體研究過，不過multipass在算法書中似乎沒有查找到，倒是B+樹會出現，而且看描述似乎的確是一樣的。

僞碼(using tmp file)：

1 while(not the endof file)
2    read one record from input file      (I/O Operation)
3    if(record in the region)
4        put the record into memory
5    else
6        write the record into the file(s).  (I/O Operation)
7    end if
8 end while
9
10  sort the records in the memory using some inner sort method.
11  write the sorted records into the output file.  (I/O Operation)
12 while(there are tmp files left)
13    dump one tmp file (begin from the smallest) to memory
14    sort the records in the memory using some inner sort method.
15    write the sorted records into the output file.  (I/O Operation)
16 end while
17 DONE.

僞碼(not using tmp file)：

1 for(region++)
2     while(not the endof file)
3    read one record from input file      (I/O Operation)
4      if(record in the region)
5    put the record into memory
6    end if
7     end while
8    sort the records in the memory using some inner sort method.
9     write the sorted records into the output file.  (I/O Operation)
10 end for
11 DONE

多趟排序的時間複雜度和空間複雜度

可以看到，通過存儲到硬盤文件，我們可以控制使用空間的大小，但是這樣同樣也就加大了I/O讀寫的次數，而大家都很清楚，I/O操作的效率遠遠低於內存操作，這樣在整個耗時方面，I/O讀寫次數越多的，其實際運行時間就越長。要提高算法的效率，就需要減少I/O操作的次數。

實例說明見後。

Wonder Sort

OK，現在就進入最精彩的部分，也是這道題目最希望的解法。還記得張柏芝在有一部電影中叫wonderful，很喜歡這個名字，看來作者也很喜歡這個名字哈~

在上面讀入文件記錄的時候，我們可以使用string, int等類型來表示我們所讀入的這一條記錄的具體值。而對於32位的機器，其int值爲32位，也就是4個byte。而1M空間則有1024*1024=1048576個字節，一共能夠表達的int值爲262144。而對於10的7次方也就是1千萬這樣的數，我們除下來得到38.14697265625也就是差不多要做40次。

而Wonder Sort中使用位圖來做，以每一位是0還是1表示數是否存在，這樣1M空間共有8388608個位，也就能表示大概800萬個數。這裏我們暫時考慮不存在一個對內存的嚴格限制。所以如果要表示1000萬，我們需要的內存空間大概是1.25M。

僞碼：

1 setup a bit-map which have 1000,0000 bits. And set all bits to zero.
2 while(not end of the input file)
3     read one record from input file
4     if(correspond bit in the bit-map is 0)
5     change the correspond bit in the bit-map to 1. (bit-map[record]=1)
6     else
7    //if we want to sort, just do nothing.
8    //if we want to find the num that came twice, just print it out and break.
9     end if
10 end while
11
12 write the records into the output file.  (I/O Operation)
13

精彩排序的時間複雜度和空間複雜度

空間複雜度肯定是最低的，而時間複雜度爲O(n)。實例見後。

【在該問題上面的擴展】

上面三個解法中，當然是Wonder Sort最好。但是作者也提到了，使用Wonder Sort時，我們需要1.25M的空間（加上其他一些操作，還要更多些，不過最大的需求肯定是1.25M的用來進行位圖法的區域），但是如果嚴格要求使用1M空間的話，我們怎麼做？

從上面解法的描述來看，我們可以將第二種解法與第三種解法結合起來，這樣就可以解決。

僞碼：

1 setup a bit-map which have 800,0000 bits. And set all bits to zero.
2 while(not the endof file)
3    read one record from input file      (I/O Operation)
4    if(record in the region, which means 0 – 800,0000)
5        if(correspond bit in the bit-map is 0)
6   change the correspond bit in the bit-map to 1. (bit-map[record]=1)
7   else
8            //if we want to sort, just do nothing.
9            //if we want to find the num that came twice, just print it out and break.
10   end if
11   else
12    write the record into the file(s).  (I/O Operation)
13   end if
14 end while
15
16 write the records into the output file.  (I/O Operation)
17
18 //dump the tmp file (800,0000 – 1000,0000) to memory
19 while(not the endof file)
20     if(correspond bit in the bit-map is 0)
21     change the correspond bit in the bit-map to 1. (bit-map[record]=1)
22     else
23    //if we want to sort, just do nothing.
24    //if we want to find the num that came twice, just print it out and break.
25     end if
26 end while
27 write the records into the output file.  (I/O Operation)
28
29 DONE.
30

下面就是實例了，在做實例之前，我們有一個問題，就是如何生成題目中要求的測試數據？

測試數據的生成，必須比較隨機，我們需要生成範圍從0-10000000的數，這裏我生成爲8000000個數。同時這些數要求不重複。

[回答]

在《編程珠璣》的第12章中給出了回答，這裏我直接寫出代碼。我使用這個生成了一個random.txt文件，一共67.8M。

測試文件生成代碼：

1 #include <iostream>
2 #include <fstream>
3 using namespace std;
4
5 #include <ctime>
6 //total use 126 seconds
7
8 ofstream ofs("random.txt");
9
10 void myswap(int &a,int &b)
11 {
12     int tmp=a;
13     a=b;
14     b=tmp;
15 }
16
17 int bigrand()
18 {
19     return RAND_MAX*rand()+rand();
20 }
21
22 int randint(int l,int u)
23 {
24     return l+bigrand()%(u-l+1);
25 }
26
27 void generate(int x[],int k,int n)
28 {
29     int i=0;
30     for(i=0;i<n;i++)
31     {
32         x[i]=i;
33     }
34     for(i=0;i<k;i++)
35     {
36         //myswap(i,randint(i,n-1));
37         int j=randint(i,n-1);
38         int t=x[i];
39         x[i]=x[j];
40         x[j]=t;
41         ofs<<x[i]<<endl;
42     }
43 }
44
45 int main()
46 {
47     clock_t Start, Finish;
48     Start = clock();
49     int *x=new int[10000000];
50     generate(x,8000000,10000000);
51     //generate(x,100000,1000000);
52     delete []x;
53     Finish=clock();
54     int second=double(Finish-Start)/CLOCKS_PER_SEC;
55     cout<<"total use "<<second<<" seconds"<<endl;
56     return 0;
57 }

該代碼生成八百萬的測試數據共用時126秒，機器配置爲1G內存，P4處理器。這裏，能否更快生成測試數據，當然，要保證隨機性和正確性。

問題：

這裏我們得到程序的時間是使用clock函數，那如何得到程序佔用的內存呢？

還有一個問題就是，我們如何像linux中那樣得到用戶時間和系統調用時間？

這個還不清楚，這裏提出來，大家有知道的可以共享一下。

在測試文件生成好之後，我們就可以來實際的看一下這3種方法所具體使用的時間了。

因爲是使用debug版來測試，同時還有其他程序運行，所以可能不是十分精準。但是都是在同一臺機器上面測試，所以數量級上面還是可以參考的。

結果：

使用merge sort來排序800萬的測試數據，我們共用了337.7秒

使用multipass sort來排序800萬的測試數據，我們共用了581.67秒,其實大家仔細看一下，這裏我做得不太公平，我這裏一趟是使用了一半的數，也就是400萬，這樣來達到兩趟的目的，所以這裏來比較是有些問題的。

使用wonder sort時，居然也用了很長時間，共爲334.171秒。稍微換了一下，將函數調用去掉一層，最後還是得到total time is 333.328 seconds。

這和我們的預期相差還是比較大的。到底是什麼原因？

是否是I/O操作歷時比較長的原因？

單純I/O操作total time is 354.343seconds。

下面列出我使用的代碼

1. merge sort

1 //total running time is 337.751 seconds
  2
  3 #include <iostream>
  4 #include <fstream>
  5 using namespace std;
  6
  7 #include <ctime>
  8
  9 ifstream randfile("random.txt");
10 ofstream sortedfile("sorted.txt");
11
12 const int NUMS=8000000;
13
14 void merge(int initlist[], int mergedlist[],int l,int m, int n)
15 {
16     int i=l,j=m+1,k=l;
17     while(i<=m && j<=n)
18     {
19         if(initlist[i]<=initlist[j])
20         {
21             mergedlist[k]=initlist[i];
22             i++;
23             k++;
24         }
25         else
26         {
27             mergedlist[k]=initlist[j];
28             j++;
29             k++;
30         }
31     }
32     if(i<=m)
33     {
34         for(int n1=k,n2=i;n1<=n&&n2<=m;n1++,n2++)
35         {
36             mergedlist[n1]=initlist[n2];
37         }
38     }
39     else
40     {
41         for(int n1=k,n2=j;n1<=n&&n2<=n;n1++,n2++)
42         {
43             mergedlist[n1]=initlist[n2];
44         }
45     }
46 }
47
48 void mergepass(int initlist[], int mergedlist[], const int len,const int listlen)
49 {
50     int i=0;
51     while(i+2*len<=listlen-1)
52     {
53         merge(initlist,mergedlist,i,i+len-1,i+2*len-1);
54         i+=2*len;
55     }
56     if(i+len<=listlen-1)
57     {
58         merge(initlist,mergedlist,i,i+len-1,listlen-1);
59     }
60     else
61     {
62         for(int j=i;j<=listlen-1;j++)
63         {
64             mergedlist[j]=initlist[j];
65         }
66     }
67 }
68
69 void mergesort(int list[], int listlen)
70 {
71     int *templist=new int[listlen];
72     int len=1;
73     while(len<listlen-1)
74     {
75         mergepass(list,templist,len,listlen);
76         len*=2;
77         mergepass(templist,list,len,listlen);
78         len*=2;
79     }
80     delete []templist;
81 }
82
83 int main()
84 {
85     clock_t start,finish;
86     start=clock();
87     int *initlist=new int[NUMS];
88
89     int i;
90     for(i=0;i<NUMS;i++)
91         randfile>>initlist[i];
92
93     mergesort(initlist,NUMS);
94     for(i=0;i<NUMS;i++)
95         sortedfile<<initlist[i]<<endl;
96
97     delete []initlist;
98
99     finish=clock();
100     double seconds=(double)(finish-start)/CLOCKS_PER_SEC;
101     cout<<"total running time is "<<seconds<<" seconds"<<endl;
102     return 0;
103 }

2. multipass Sort

1 //total time is 581.67 seconds
2
3 #include <iostream>
4 #include <fstream>
5 using namespace std;
6
7 #include <ctime>
8
9 ifstream randfile("random.txt");
10 ofstream sortedfile("sorted.txt");
11 ifstream helpfileout;
12 ofstream helpfile("tmp.txt");
13
14 int cmp(const void *a, const void *b)
15 {
16     return *(int *)a - *(int *)b;
17 }
18
19 void multipassSort(int x[])
20 {
21     int record;
22     int i=0;
23     int j;
24     while(randfile>>record)
25     {
26         if(record<4000000)
27             x[i++]=record;
28         else
29             helpfile<<record<<endl;
30     }
31     qsort(x,i,sizeof(int),cmp);
32     for(j=0;j<i;j++)
33         sortedfile<<x[j]<<endl;
34     i=0;
35     helpfile.close();
36     helpfileout.open("tmp.txt");
37     while(helpfileout>>record)
38     {
39         x[i++]=record;
40     }
41     qsort(x,i,sizeof(int),cmp);
42     for(j=0;j<i;j++)
43         sortedfile<<x[j]<<endl;
44 }
45
46 int main()
47 {
48     clock_t start,finish;
49     start=clock();
50     int *x=new int[8000000];
51     multipassSort(x);
52     delete []x;
53
54     finish=clock();
55     double secs=(double)(finish-start)/CLOCKS_PER_SEC;
56     cout<<"total time is "<<secs<<" seconds"<<endl;
57     return 0;
58 }
59

3. wonder Sort

1 #include <iostream>
2 #include <fstream>
3 using namespace std;
4
5 #include <ctime>
6
7 ifstream randfile("random.txt");
8 ofstream sortedfile("sorted.txt");
9
10 const int NUMS=8000000;
11
12 const int BITSPERINT=32;
13 const int SHIFT=5;
14 const int MASK=0x1f;
15
16 void seti(int x[],int i)
17 {
18     x[i>>SHIFT] |= (1<<(i & MASK));
19 }
20
21 void clri(int x[],int i)
22 {
23     x[i>>SHIFT] &= ~(1<<(i & MASK));
24 }
25
26 int test(int x[],int i)
27 {
28     return x[i>>SHIFT] & (1<<(i & MASK));
29 }
30
31 void wonderSort(int x[])
32 {
33     int tmp;
34     int i=0;
35     for(i=0;i<NUMS;i++)
36     {
37         randfile>>tmp;
38         if(test(x,tmp)==0)
39         {
40             seti(x,tmp);
41         }
42     }
43     for(i=0;i<10000000;i++)
44     {
45         if(test(x,i)!=0)
46         {
47             sortedfile<<i<<endl;
48         }
49     }
50 }
51
52 int main()
53 {
54     clock_t start,finish;
55     start=clock();
56     int *x=new int[10000000/BITSPERINT];
57     memset(x,0,10000000/BITSPERINT*sizeof(int));
58
59     wonderSort(x);
60     delete[] x;
61     finish=clock();
62     double seconds=(double)(finish-start)/CLOCKS_PER_SEC;
63     cout<<"total use time is "<<seconds<<" seconds"<<endl;
64
65     return 0;
66 }

4. 單純讀寫文件

如果不使用文件讀寫，就可以看出算法的效率來了，但是如何做到呢？

排序部分結束，然後回到面試題，如果只是來查找是否有重複數的話，是否有其他解法，如何做？

在上次的回覆中，winter-cn提到了字典樹和哈希表的解決方案。

字典樹的確是一個好辦法，而且僅僅是對於查找，如果是排序，字典樹就不合適了。

而哈希還沒有想到好的哈希函數，具體也沒有細想，但是應該也是可以的。

【在其基礎上衍生的面試問題】

其實上面的內容都解決了的話，那一開始的面試題也就解決了，因爲其儘管問題規模變大了，但是其能夠使用的內存也一樣變大了，其基本方法還是一樣的。

現在很多公司，因爲其篩選人員的目的，同時也由於其工作是處理海量數據，所以在面試的時候，會出一些這樣的關於海量數據處理的問題，但是我們很多人，包括我，都一般不會接觸到這樣海量的數據，所以，適量的減小問題的規模，但是同時將其內存限制變得更加嚴格，其實其本質還是一樣的。

【TODO – 海量數據處理的話題】

這裏的話題可以歸入“海量數據處理”，關於海量數據處理，現在有很多公司的面試題會提相關的問題，如果沒有對這個話題思考過的話，是不太可能會有很好的答案的，關於這個話題，以後還可以找到其他的問題來進行討論，當問題規模不大的時候，體現不出算法的優勢，但是當問題的規模到達一定數量級的時候，O(n)或者O(lgn)的算法的優勢就能夠體現出來了。

因爲內容比較多，整理也花了一些時間，還有編寫代碼，如果其中有錯誤，歡迎大家指出，本來想分爲幾篇來寫的，最後還是一下子寫成一篇寫完算了，篇幅就比較長，而且代碼也較多，多謝大家能夠看到最後~

原文鏈接：http://www.cnblogs.com/cnyao/archive/2009/11/26/interview8.html

huagong_adu

發佈了30 篇原創文章 · 獲贊 66 · 訪問量 65萬+

私信關注

面試題之10億正整數問題--完整解答

C++程序的編譯過程及g++與之對應的幾個參數

讓我思潮翻滾的IBM面試內容

如何不生成XML文件通過Socket傳XML文件內容

C++複製構造函數的詭異行爲研究

從月薪3500到700萬——一個大學生的成長經歷

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結