#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <string>
using namespace std;
#define FILE_NUM 10   // number of split files (appears unused; main hard-codes n = 10)
#define WORDLEN 30    // maximum word length, including the terminating '\0'
#define HASHLEN 7303  // number of hash-table buckets (a prime)
// Hash-table node: the word string is heap-allocated to its exact size.
typedef struct node_no_space{
char *word;
int count;
struct node_no_space *next;
}node_no_space, *p_node_no_space;
// Min-heap node: fixed-size inline word buffer (used for the top-N selection).
typedef struct node_has_space{
char word[WORDLEN];
int count;
struct node_has_space *next;
}node_has_space, *p_node_has_space;
// Global chained hash table: bin[i] heads the singly linked list of words hashing to i.
p_node_no_space bin[HASHLEN] = {NULL};
// Exchange the two integers pointed to by a and b.
void swap(int *a, int *b) {
    const int saved = *a;
    *a = *b;
    *b = saved;
}
// Map a NUL-terminated word to a bucket index in [0, HASHLEN).
// Per character the accumulator becomes acc * 32 + c (the original wrote it
// as acc += acc * 31 + c, which is the same thing); unsigned overflow wraps.
unsigned int hash(char *p_word) {
    unsigned int acc = 0;
    for(char *p = p_word; *p != '\0'; ++p) {
        acc = acc * 31 + acc + *p;
    }
    return acc % HASHLEN;
}
// True for the characters kept by trimming: [0-9A-Za-z].
static int is_word_char(char c) {
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}
// Strip non-alphanumeric characters from both ends of `word`, in place.
// Returns 1 if any characters remain, 0 if the word trims to empty.
// Fixes two defects in the original: the rear-trim loop could walk past the
// start of the buffer (reading word[-1]) when the word contains no
// alphanumeric character at all, and strcpy was used on overlapping
// source/destination buffers, which is undefined behavior.
int trim_word(char *word) {
    int n = (int)strlen(word) - 1;
    int i = 0;
    /* trim from the rear; stop cleanly at -1 if every character is junk */
    while(n >= 0 && !is_word_char(word[n])) {
        word[n] = '\0';
        n--;
    }
    if(n < 0)
        return 0;
    /* trim from the front; word[n] is alphanumeric, so this terminates */
    while(!is_word_char(word[i])) {
        i++;
    }
    /* overlapping regions: memmove, not strcpy; +1 copies the '\0' too */
    memmove(word, word + i, (size_t)(n - i + 1) + 1);
    return 1;
}
// Insert p_word into the global hash table `bin`, or increment its count if
// it is already present. New nodes are pushed at the head of their bucket.
// Fixes the original's unchecked malloc() results: on allocation failure the
// word is silently dropped instead of dereferencing a null pointer.
void insert_word(char *p_word) {
    unsigned int index = hash(p_word);
    node_no_space *p;
    /* linear scan of the bucket's chain for an existing entry */
    for(p = bin[index]; p != NULL; p = p->next) {
        if(strcmp(p_word, p->word) == 0) {
            (p->count)++;
            return;
        }
    }
    /* not found: allocate a node plus its private copy of the word */
    p = (node_no_space*)malloc(sizeof(node_no_space));
    if(p == NULL)
        return;
    p->word = (char*)malloc(strlen(p_word) + 1);
    if(p->word == NULL) {
        free(p);
        return;
    }
    strcpy(p->word, p_word);
    p->count = 1;
    p->next = bin[index];
    bin[index] = p;
}
// Sift the element at position i down a 1-based min-heap of `len` elements,
// ordered by count, until the subtree rooted at i satisfies the heap property.
// Iterative form of the original tail recursion; count and word travel together.
void min_heap(node_has_space *heap, int i, int len) {
    while(1) {
        int l = 2 * i;
        int r = 2 * i + 1;
        int smallest = i;
        if(l <= len && heap[l].count < heap[smallest].count)
            smallest = l;
        if(r <= len && heap[r].count < heap[smallest].count)
            smallest = r;
        if(smallest == i)
            break;
        /* exchange count and word between parent and the smaller child */
        int c = heap[smallest].count;
        heap[smallest].count = heap[i].count;
        heap[i].count = c;
        char buffer[WORDLEN];
        strcpy(buffer, heap[smallest].word);
        strcpy(heap[smallest].word, heap[i].word);
        strcpy(heap[i].word, buffer);
        i = smallest;
    }
}
// Establish the min-heap property over heap[1..len] by sifting every
// internal node down, bottom-up (classic heapify).
void build_min_heap(node_has_space *heap, int len) {
    int i;
    for(i = len / 2; i >= 1; i--)
        min_heap(heap, i, len);
}
void destroy_bin() {
node_no_space *p, *q;
int i = 0;
while(i < HASHLEN) {
p = bin[i];
while(p) {
q = p->next;
if(p->word) {
free(p->word);
p->word = NULL;
}
free(p);
p = NULL;
p = q;
}
bin[i] = NULL;
i++;
}
}
// Dump every (word, count) pair of the global hash table to `path`, one
// "word count" per line, then release the whole table via destroy_bin()
// so the next input file starts with an empty table.
void write_to_file(char *path) {
    FILE *out = fopen(path, "w");
    if(out == NULL) {
        cout << "error, open " << path << " failed!" << endl;
        return;
    }
    int i;
    for(i = 0; i < HASHLEN; i++) {
        node_no_space *p;
        for(p = bin[i]; p != NULL; p = p->next)
            fprintf(out, "%s %d\n", p->word, p->count);
    }
    fclose(out);
    destroy_bin();
}
void main() {
char word[WORDLEN];
char path[20];
int count;
int n = 10;
unsigned int index = 0;
int i;
FILE *fin[10];
FILE *fout;
FILE *f_message;
node_has_space *heap = (node_has_space*)malloc(sizeof(node_has_space) * (n + 1));
// divide word into n files
if((f_message = fopen("words.txt", "r")) == NULL) {
cout << "error, open source file failed!" << endl;
return;
}
for(i = 0; i < n; i++) {
sprintf(path, "tmp%d.txt", i);
fin[i] = fopen(path, "w");
}
while(fscanf(f_message, "%s", word) != EOF) {
if(trim_word(word)) {
index = hash(word) % n;
fprintf(fin[index], "%s\n", word);
}
}
for(i = 0; i < n; i++) {
fclose(fin[i]);
}
// do hash count
for(i = 0; i < n; i++) {
sprintf(path, "tmp%d.txt", i);
fin[i] = fopen(path, "r");
while(fscanf(fin[i], "%s", word) != EOF) {
insert_word(word);
}
fclose(fin[i]);
write_to_file(path);
}
// heap find
for(i = 1; i <= n; i++) {
strcpy(heap[i].word, "");
heap[i].count = 0;
}
build_min_heap(heap, n);
for(i = 0; i < n; i++) {
sprintf(path, "tmp%d.txt", i);
fin[i] = fopen(path, "r");
while(fscanf(fin[i], "%s %d", word, &count) != EOF) {
if(count > heap[1].count) {
heap[1].count = count;
strcpy(heap[1].word, word);
min_heap(heap, 1, n);
}
}
fclose(fin[i]);
}
for(i = 1; i <= n; i++)
cout << heap[i].word << ":" << heap[i].count << endl;
}
首先,我們看到這個題目應該做一下計算,大概的計算,因爲大家都清楚的知道1G的文件不可能用1M的內存空間處理。所以我們要按照1M的上限來計算,假設每個單詞都爲16個字節,那麼1M的內存可以處理多少個單詞呢? 1M = 1024 KB = 1024 * 1024 B 。然後1M / 16B = 2^16個單詞,那麼1G大概有多少個單詞呢? 有2^26個單詞,但是實際中遠遠不止這些,因爲我們是按照最大單詞長度算的。我們需要把這1G的單詞分批處理,根據上面的計算,可以分成大於2^10個文件。索性就分成2000個文件吧,怎麼分呢,不能隨便分,不能簡單的按照單詞的順序然後模2000劃分,因爲這樣有可能相同的單詞被劃分到不同的文件中去了。這樣在統計個數的時候被當成了不同的單詞,因爲我們沒有能力把在不同文件中相同單詞出現的次數跨越文件的相加,這就迫使我們要把不同序號的同一個單詞劃分到同一個文件中:應用hash統計吧。稍後代碼會給出方法。然後呢,我們對每個文件進行分別處理。按照key-value的方法處理每個單詞,最終得出每個文件中包含每個單詞和單詞出現的次數。然後再建立大小爲100的小根堆。依次遍歷文件進行處理。我沒有弄1G的文件,弄1M的,簡單的實現了一下,不過原理就是這樣的。這是單詞:http://download.csdn.net/detail/zzran/4934173