Bitmap Sort and Its Extended Applications: Reading Notes on Programming Pearls

1. Basic bitmap sort

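Problem 1: The input is a file containing n = 1 million positive integers, each less than N = 10 million, with no duplicates. Sort the numbers and save the result to a file, using as little memory as possible and running as fast as possible.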

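Analysis and solution: If each positive integer is stored in an int, and an int is 4 bytes, 1 million numbers take 4 million bytes, about 4 MB. Sorting them with quicksort takes O(n log n) time.

Given the special nature of the problem (all numbers are positive integers and none of them repeats), it can be solved with a bitmap. Each number corresponds to one bit in the bitmap: the bit is set to 1 if the number appears and stays 0 otherwise. A 4-byte int can record 32 numbers. Since every number is less than 10 million, a bitmap of 10 million bits is enough to record the 1 million numbers; scanning the bitmap from the start and outputting every number whose bit is 1 then yields the sorted result. Bitmap sort needs about 1.25 MB of space (10 million bits / 8) and O(N) time, so for this problem it beats quicksort in both space and time.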

The pseudocode is as follows:

/* phase 1: initialize set to empty */
for i = [0, N)
        bit[i] = 0
/* phase 2: insert present elements into the set */
for each i in the input file
        bit[i] = 1
/* phase 3: write the sorted output */
for i = [0, N)
        if bit[i] = 1
                write i on the output file
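
As a rough illustration, the three phases of the pseudocode can be sketched in C++ on an in-memory vector instead of a file. The names `bitmap_sort` and `limit` are made up for this sketch, and it assumes C++11 and that every input value lies in [0, limit) with no duplicates.

```cpp
#include <cstddef>
#include <vector>

// Sort distinct non-negative integers that are all smaller than `limit`
// by marking each one in a bitmap and then scanning the bitmap in order.
// `bitmap_sort` and `limit` are illustrative names, not from the book.
std::vector<int> bitmap_sort(const std::vector<int> &input, int limit)
{
    // phase 1: start with an all-zero bitmap
    std::vector<bool> bits(static_cast<std::size_t>(limit), false);

    // phase 2: mark every value that is present
    for (int value : input)
        bits[static_cast<std::size_t>(value)] = true;

    // phase 3: emit the marked positions in increasing order
    std::vector<int> sorted;
    sorted.reserve(input.size());
    for (int i = 0; i < limit; ++i)
        if (bits[static_cast<std::size_t>(i)])
            sorted.push_back(i);
    return sorted;
}
```

Calling `bitmap_sort({5, 1, 9, 3}, 10)` marks four bits and returns the values in increasing order. The sketch uses `std::vector<bool>` only for brevity; the hand-written int-array bitmap is developed later in this post.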

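Implementation: First, generate a file of 1 million distinct positive integers, each less than 10 million. For how to generate them, see my earlier post "The Sampling Problem: Reading Notes on Programming Pearls". I use Floyd's method. The sampled numbers are collected in a std::set and therefore come out in sorted order, so they need to be shuffled; for how to shuffle them, see my post on the shuffle program. The program that generates the distinct random numbers is as follows: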

#include <iostream>
#include <cstdlib>
#include <ctime>
#include <set>
#include <vector>
#include <fstream>

using namespace std;

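// generate a random number between i and j,
// both i and j are inclusive
// note: rand() returns at most RAND_MAX, which may be as small as 32767,
// so ranges wider than RAND_MAX are not fully covered on such platforms
int randint(int i, int j)
{
	if (j < i)
	{ int t = i; i = j; j = t; }
	int ret = i + rand() % (j - i + 1);
	return ret;
}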
// Floyd sampling: pick m distinct random numbers from [0, n)
// (note that 0 can be picked as well)
void floyd_f2(int n, int m, set<int> &S)
{
	for (int i = n - m; i < n; ++i)
	{
		int j = randint(0, i);
		if (S.insert(j).second)
			continue;
		else
			S.insert(i);
	}
}
// shuffle the data set V
void knuth_shuffle(vector<int> &V)
{
	int n = V.size();
	for (int i = n - 1; i != 0; --i)
	{
		int j = randint(0, i);
		int t = V[i]; V[i] = V[j]; V[j] = t;
	}
}

// write the range [beg, end) to the file, one value per line
template<typename T>
void output_file(T beg, T end, const char *file)
{
	ofstream outfile(file);
	if (!outfile)
	{
		cout << "cannot create file \"" << file << "\"" << endl;
		return;
	}
	while (beg != end)
	{
		outfile << *beg << endl;
		++beg;
	}
	outfile.close();
}

void help()
{
	cout << "usage:" << endl;
	cout << "./Floyd_F2 n m output_file_name" << endl;
}

int main(int argc, char* argv[])
{
	if (argc != 4)
	{
		help();
		return 1;
	}
	srand(time(NULL));
	int n = atoi(argv[1]);
	int m = atoi(argv[2]);
	set<int> S;
	// sample
	floyd_f2(n, m, S);
	// shuffle
	vector<int> V(S.begin(), S.end());
	knuth_shuffle(V);
	// output
	vector<int>::iterator VBeg = V.begin();
	vector<int>::iterator VEnd = V.end();
	//output(VBeg, VEnd);
	output_file(VBeg, VEnd, argv[3]);

	return 0;
}

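With the data in hand, the next step is to sort it with the bitmap algorithm. We represent the bitmap with an int array; a bitmap of 10 million bits needs an array of (10 million / 32 + 1) elements (the extra 1 is because 10 million / 32 may leave a remainder, and the leftover bits need one more int).

Given a number i, we first need to know where it goes in the bitmap. Let the array be array. Because one int can represent 32 numbers, i falls in element i/32, that is array[i/32], and the bit within array[i/32] is given by i%32. Once the position is known, the number can be placed into the bitmap, and its bit can be set, tested and cleared. The C++ code for these operations is shown below; it uses bitwise operators for the calculation (i >> 5 for i/32 and i & 0x1F for i%32):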

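#define BITWORD 	32
#define SHIFT 		5
#define MARK 		0x1F
#define N 			10000000
#define COUNT 		((N) / (BITWORD))

// unsigned words, so that setting bit 31 is well defined
unsigned int ary[COUNT + 1];

// set bit i: the word index is i / 32, the bit offset is i % 32
void set(int i)
{
	ary[i >> SHIFT] |= (1u << (i & MARK));
}

// return true if bit i is set
bool test(int i)
{
	return (ary[i >> SHIFT] & (1u << (i & MARK))) != 0;
}

// clear bit i
void clr(int i)
{
	ary[i >> SHIFT] &= ~(1u << (i & MARK));
}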
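
To make the index arithmetic concrete, here is a small self-contained sketch of the same three operations wrapped in a struct. The name `Bitmap` and its members `word_index` and `bit_offset` are invented for this example; it assumes non-negative indices and 32-bit words.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A small bitmap built on 32-bit unsigned words.
// The struct name `Bitmap` and its member names are illustrative only.
struct Bitmap
{
    std::vector<std::uint32_t> words;

    // one extra word covers the remainder when nbits is not a multiple of 32
    explicit Bitmap(int nbits)
        : words(static_cast<std::size_t>(nbits / 32 + 1), 0u) {}

    static int word_index(int i) { return i >> 5; }    // same as i / 32 for i >= 0
    static int bit_offset(int i) { return i & 0x1F; }  // same as i % 32 for i >= 0

    void set(int i)        { words[word_index(i)] |=  (std::uint32_t(1) << bit_offset(i)); }
    void clr(int i)        { words[word_index(i)] &= ~(std::uint32_t(1) << bit_offset(i)); }
    bool test(int i) const { return ((words[word_index(i)] >> bit_offset(i)) & 1u) != 0; }
};
```

For example, the number 70 lands in word 70 / 32 = 2 at bit 70 % 32 = 6, so `set(70)` turns on bit 6 of `words[2]` and leaves every other bit alone.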

The complete C++ implementation of the bitmap sort is as follows:

#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <sstream>
#include <ctime>

using namespace std;

#define BITWORD 	32
#define SHIFT 		5
#define MARK 		0x1F
#define N 			10000000
#define COUNT 		((N) / (BITWORD))

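// unsigned words, so that setting bit 31 is well defined
unsigned int ary[COUNT + 1];

// set bit i
void set(int i)
{
	ary[i >> SHIFT] |= (1u << (i & MARK));
}

// return true if bit i is set
bool test(int i)
{
	return (ary[i >> SHIFT] & (1u << (i & MARK))) != 0;
}

// clear bit i
void clr(int i)
{
	ary[i >> SHIFT] &= ~(1u << (i & MARK));
}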

void help()
{
	cout << "usage:" << endl;
	cout << "./BitSort inputfile outputfile" << endl;
}

int main(int argc, char *argv[])
{
	if (argc != 3)
	{
		help();
		return 1;
	}
	ifstream infile(argv[1]);
	if (!infile)
	{
		cout << "cannot open file \"" << argv[1] << "\"" << endl;
		return 1;
	}

	time_t t_start, t_end;
	t_start = time(NULL);

	// read the data and set the data in the bit map
	string line;
	istringstream istream;
	int num = 0;
	while (getline(infile, line))
	{
		istream.str(line);
		istream >> num; // read the number
		set(num); // set the number
		istream.clear();
	}
	infile.close();

	ofstream outfile(argv[2]);
	if (!outfile)
	{
		cout << "create output file \"" << argv[2] << "\" failed" << endl;
		return 1;
	}
	// scan the bit map and write the numbers to the file (all inputs are < N)
	for (int i = 0; i < N; ++i)
	{
		if (test(i))
			outfile << i << endl;
	}
	outfile.close();

	t_end = time(NULL);
	cout << "time elapsed: " << difftime(t_end, t_start) << " s" << endl;
	cout << "need " << ((double)N / (8 * 1000000)) << " MB of memory" << endl;
	return 0;
}

2. Extending bitmap sort

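Problem 2: If the input positive integers may contain duplicates, with each number appearing at most 10 times, how should these 1 million numbers be sorted?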

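Analysis and solution: The approach for Problem 1 only handles distinct positive integers; once the input contains duplicates, that bitmap algorithm no longer applies. Consider the constraint that each number appears at most 10 times. The original algorithm uses one bit per number, and a bit has only two states, 1 and 0, meaning the number is present or absent. With a small change, using several bits per number and letting their value record how many times the number has appeared, a bitmap can still do the sorting. Since a number appears at most 10 times, 4 bits per number are enough (a 4-bit counter holds 0 to 15). This takes 4 times the space of the basic bitmap sort, about 5 MB of memory, and the time complexity is still O(N).

Implementation: The position of each number in the array is found much as before. One int can represent 32/4 = 8 numbers, so for a positive integer i we first find its array index, i/8, and then its starting bit, 4*(i%8).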

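Set: each time i appears, add 1 at its starting bit.

Test the count of i: since each number occupies 4 bits, shift 0x0F to i's position, AND it with the word, and shift the result back to the low bits to get the number of times i has appeared.

Clear: the opposite of test. Shift 0x0F to i's position and AND the word with the complement of that mask, which zeroes i's 4 bits and leaves the other counters unchanged.

The C++ implementation of these operations is as follows: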

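#define BITWORD 	8
#define SHIFT 		3
#define MARK 		0x07
#define TEST 		0x0Fu
// starting bit of i's 4-bit counter; relies on a variable named i being in scope
#define POS 		((i & MARK) << 2)
#define N 			10000000
#define COUNT 		((N) / (BITWORD))

// unsigned words: with a signed int, a count of 8 or more in the top 4 bits
// would make the word negative and the right shift in test() would go wrong
unsigned int ary[COUNT + 1];

// add one to the counter of number i (the counter must stay below 16)
void set(int i)
{
	ary[i >> SHIFT] += 1u << POS;
}

// return the presence count of number i, used for output
int test(int i)
{
	return (ary[i >> SHIFT] & (TEST << POS)) >> POS;
}

// reset the counter of number i to zero
void clr(int i)
{
	ary[i >> SHIFT] &= ~(TEST << POS);
}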

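The rest of the implementation is much the same as the original bitmap sort, except that when writing the result each number is output as many times as it occurred: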

	// read the bit map and write to the file
	for (int i = 0; i < N; ++i)
	{
		int count = test(i); // get the count of number i's presence
		for (int j = 0; j != count; ++j)
			outfile << i << endl;
	}
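
The counting variant can also be sketched end to end on an in-memory vector. This is only an illustration under stated assumptions: the names `nibble_sort` and `limit` are invented here, the code assumes C++11, values in [0, limit), and no value occurring more than 15 times (the most a 4-bit counter can hold).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Counting variant: each value gets a 4-bit counter, so eight counters
// fit in one 32-bit word. Assumes no value occurs more than 15 times.
// `nibble_sort` and `limit` are illustrative names.
std::vector<int> nibble_sort(const std::vector<int> &input, int limit)
{
    std::vector<std::uint32_t> words(static_cast<std::size_t>(limit / 8 + 1), 0u);

    for (int value : input)
    {
        int pos = (value & 0x07) << 2;                  // starting bit: 4 * (value % 8)
        words[value >> 3] += std::uint32_t(1) << pos;   // increment that counter
    }

    std::vector<int> sorted;
    sorted.reserve(input.size());
    for (int i = 0; i < limit; ++i)
    {
        int pos = (i & 0x07) << 2;
        unsigned count = (words[i >> 3] >> pos) & 0x0Fu; // extract the 4-bit counter
        for (unsigned j = 0; j < count; ++j)
            sorted.push_back(i);                         // output i once per occurrence
    }
    return sorted;
}
```

The words are unsigned on purpose: a value such as 7 uses the top 4 bits of its word, and ten occurrences set the highest bit, which a signed int would treat as the sign bit.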

3. Extended applications of bitmaps

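Bitmaps have two advantages. The first is that they save space: an int normally holds just one number, while an int used as a bitmap can represent several. The second is speed: the position of a number can be indexed directly. Besides sorting, they can also be used for:

Finding duplicated numbers: call test on each number before setting it; if test returns a nonzero value, the number has already been seen.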
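
A minimal sketch of the duplicate check, again on an in-memory vector. The name `find_duplicates` is hypothetical, and the second bitmap `reported` is an addition of this sketch (not part of the description above) so that each duplicated value is listed only once; values are assumed to lie in [0, limit).

```cpp
#include <cstddef>
#include <vector>

// Return every value that occurs more than once, each listed once,
// in the order in which its second occurrence is met.
// `find_duplicates` and `limit` are illustrative names.
std::vector<int> find_duplicates(const std::vector<int> &input, int limit)
{
    std::vector<bool> seen(static_cast<std::size_t>(limit), false);
    std::vector<bool> reported(static_cast<std::size_t>(limit), false);
    std::vector<int> duplicates;

    for (int value : input)
    {
        std::size_t idx = static_cast<std::size_t>(value);
        if (!seen[idx])
        {
            seen[idx] = true;            // first occurrence: just mark it
        }
        else if (!reported[idx])
        {
            reported[idx] = true;        // repeated occurrence: report it once
            duplicates.push_back(value);
        }
    }
    return duplicates;
}
```

The test-then-set step costs one bit lookup per input number, so the whole pass is linear in the input size.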
