A Big-Data Processing Problem from Baidu's Intern Recruitment (Part 1)

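The problem: files A and B are each 200 GB in size. Both contain unordered positive integers, one per line, none larger than 2^63. Design a scheme that outputs the numbers appearing in both files, using a single machine with at most 16 GB of memory and ample disk space. State which key utility classes you would use when programming in Java, and why.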

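For a problem with this much data (big data for a single machine, at least), MapReduce is the simplest approach. The best open-source distributed computing framework that supports MapReduce today is Hadoop. Hadoop is written in Java, and its whole runtime needs a Java virtual machine. That is probably why the question was given to the Java group, though this is only my guess.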

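I will post the MapReduce code and experimental results another day. Today's topic is not MapReduce but how to solve the problem with a shared-memory programming model. The shared-memory parallel programming models in common use today include multithreading, GPUs, and OpenMP. Here I use multithreading.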

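The analysis below changes the assumption about how the files store their data: the data is stored in binary form and each element is a uint64_t, i.e. 8 bytes, so each file holds 25G elements of type uint64_t (200 GB / 8 B).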

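First consider the problem from the data-structure angle. For the general problem of finding the common elements of two lists listA and listB, the simplest algorithm is: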

foreach itemA in listA
    foreach itemB in listB
        if(itemB == itemA)
            print itemA;
            break;
        end if;
    end foreach;
end foreach;

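Let len(listA) be M and len(listB) be N. The time complexity of the algorithm is O(MN); assuming M = N, it is O(M^2). For 200 GB files this means scanning file B once for every number in file A. Assuming a disk read speed of 50 MB/s, the time needed is 25G * 200 GB / (50 MB/s) ≈ 10^14 s ≈ 3,170,979 years. In other words, this method would take roughly three million years.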

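Nobody would want to do that, of course; it is only worth considering when N is very small. Anyone who has studied some data structures knows the data should go into a tree structure such as a balanced binary tree, a binary search tree, a B+ tree, or a B-tree. Building the tree from the first file takes O(N log N). Looking up one element of the second file in the tree takes O(log N), so looking up all of them takes O(N log N). The total is 2N log N, i.e. O(N log N). Big-O analysis of this kind generally applies to algorithms whose data fits in memory, so for this problem it is far too optimistic. First, building the balanced tree would blow the memory limit, and by a factor of tens. Even with an on-disk index that keeps part of the tree on disk, lookups become more painful still, because following the tree's search path keeps moving the disk head to new index positions. With data at this scale, once you have to keep moving the disk head, you have already lost.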
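As a rough check on how far the tree overshoots memory, assume (this node layout is an assumption for illustration, not something the post states) that each node holds an 8-byte key plus two 8-byte child pointers, 24 bytes in all, with no allocator overhead:

```latex
\[
25 \times 10^{9}\ \text{nodes} \times 24\ \mathrm{B}
  = 6 \times 10^{11}\ \mathrm{B}
  = 600\ \mathrm{GB},
\qquad
\frac{600\ \mathrm{GB}}{16\ \mathrm{GB}} = 37.5
\]
```

Under that assumption the tree for a single file is about 37 times larger than the available memory; even the bare keys (200 GB) are 12.5 times too large.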

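Another approach borrows MapReduce's stream-processing style: sort the data of both files, then compare them for equal values. In Big-O terms the cost is N log N + N log N + N + N, so the time complexity is O(N log N), which looks about the same as the tree approach. The significant advantage is that the data can be read from disk in large contiguous blocks, so the disk head never has to jump around. For the mechanical hard disk, an old technology with a nostalgic air of the last century's machine age, that pays off handsomely.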

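So the processing goes roughly as follows. First sort the 200 GB file A block by block, writing each sorted block to its own file, then do the same for file B. The sorted block files of A are A1, A2, A3, ..., AN (Figure 1) and those of B are B1, B2, B3, ..., BN (Figure 2). Then open one file pointer on each of A1, A2, A3, ..., AN and B1, B2, B3, ..., BN. What follows closely resembles merge sort. Take the smallest value, minA, among the elements the A pointers currently point to, and the smallest value, minB, among the elements the B pointers point to, and compare them. If minA == minB, write the value to the result file (the file fpResult points to) and advance both pointers. If minA > minB, advance the file pointer that minB came from by one element and take the next minimum from B1...BN. The case minA < minB is handled the same way with the roles swapped (Figure 3).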

Figure 1: Block-wise sorting of file A

Figure 2: Block-wise sorting of file B

Figure 3: Merging to find common values

The merge can use the tournament algorithm rather than an ordinary linear search for the minimum among the N block heads.

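Next, a closer look at how each block is sorted. Assuming a block size of 1 GB, three steps are needed:

1. Load 1 GB of data from disk into memory

2. Sort the 1 GB of data

3. Write the sorted result to disk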

First we need a timing program to measure how long each module of the program takes. It consists of timer.h and timer.c:

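/*timer.h*/

/*
* Author: Chaos Lee
* Date: 2012-06-30
* Description: interfaces for public to use timer
*/
#ifndef __TIMER_H_
#define __TIMER_H_
#include<stdio.h>
#include<sys/time.h>
void start_timer();
/* declared here because single_thread_sort.c calls it */
void restart_timer();
/* whole seconds elapsed since the last start_timer()/restart_timer() */
int get_elapsed_time();
#endif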

/*timer.c*/

/*
* Author: Chaos Lee
* Date: 2012-06-30
* Description: implementation of the timer
*/
#include "timer.h"
static struct timeval start,end;

void start_timer()
{
        gettimeofday(&start,NULL);
}

void restart_timer()
{
        start_timer();
}

int get_elapsed_time()
{
        long long usec;
        gettimeofday(&end,NULL);
        /* combine seconds and microseconds before dividing, so that a
           negative tv_usec difference borrows from the seconds */
        usec = (long long)(end.tv_sec-start.tv_sec)*1000000 + (end.tv_usec-start.tv_usec);
        return (int)(usec/1000000);
}

Next we need a program that generates a file of random numbers. Its source code follows. It takes one argument N and produces a file of 2^N bytes.

/*random_generator.c*/

/*
* Author: Chaos Lee
* Date: 2012-06-30
* Description: Generating a file containing a given number of random elements whose type are uint64_t
*/
#include<stdio.h>
#include<stdlib.h>
#include<time.h>
#include<stdint.h>
#include<sys/time.h>

#include "timer.h"

int main(int argc,char *argv[])
{
        int shift,tmp[2];
        FILE * fp;
        int64_t size,i;
        int elapsed_seconds;
        start_timer();
        if(2 > argc)
        {
                fprintf(stderr,"Usage:%s NUMBER\n",argv[0]);
                exit(1);
        }
        shift = atoi(argv[1]);
        /* 2^N bytes hold 2^(N-3) elements of 8 bytes each */
        shift -= 3;
        if(0 > shift)
        {
                fprintf(stderr,"too small\n");
                exit(1);
        }
        /* shift in 64 bits so that large N does not overflow an int */
        size = (int64_t)1 << shift;
        srand(time(NULL));
        fp = fopen("data.dat","wb");
        if(NULL == fp)
        {
                fprintf(stderr,"file open error.");
                exit(1);
        }
        for(i=0;i<size;i++)
        {
                /* each element is written as two ints (8 bytes when int is 4 bytes);
                   rand() never exceeds RAND_MAX, so the values do not cover
                   the full 64-bit range */
                tmp[0] = rand();
                tmp[1] = rand();
                if( 2 != fwrite(&tmp[0],sizeof(int),2,fp))
                {
                        fprintf(stderr,"writing file failure...\n");
                        exit(1);
                }
        }
        elapsed_seconds = get_elapsed_time();
        fprintf(stdout,"generating cost %d seconds.\n",elapsed_seconds);
        fclose(fp);
        return 0;
}

The data can be sorted with either a single-threaded or a multi-threaded method. The source code of the single-threaded version is:

/* single_thread_sort.c */

/*
* Author: Chaos Lee
* Date: 2012-06-30
* Description: load,sort,store data with single core
*/

#include<stdio.h>
#include<stdlib.h>
#include<stdint.h>
#include<sys/types.h>
#include<sys/stat.h>

/* error.h is the author's own helper header and is not shown in this post;
   error_abort() below presumably comes from it */
#include "../error.h"
#include "timer.h"

int uint64_compare(const void * ptr1,const void * ptr2)
{
        /* compare explicitly: the difference of two uint64_t does not fit in an int */
        uint64_t a = *((const uint64_t *)ptr1);
        uint64_t b = *((const uint64_t *)ptr2);
        return a > b ? 1 : a < b ? -1 : 0;
}

int main(int argc,char * argv[])
{
        struct stat data_stat;
        int status,elapsed_seconds;
        uint64_t size;
        uint64_t *buffer;
        FILE * fp;
        FILE * fp_result;
        status = stat("data.dat",&data_stat);
        if(0 != status)
                error_abort("stat file error.\n");
        size = data_stat.st_size;
        buffer = (uint64_t *) malloc(size);
        if(NULL == buffer)
        {
                fprintf(stderr,"mallocing error.");
                exit(1);
        }
        fp = fopen("data.dat","rb");
        if(NULL == fp)
        {
                fprintf(stderr,"file open error.");
                exit(1);
        }
        /* step 1: load the whole block into memory */
        start_timer();
        if(1 != fread(buffer,size,1,fp))
        {
                fprintf(stderr,"reading file failure...\n");
                exit(1);
        }
        elapsed_seconds = get_elapsed_time();
        fprintf(stdout,"loading cost %d seconds\n",elapsed_seconds);
        /* step 2: sort in memory */
        restart_timer();
        qsort(buffer,size/sizeof(uint64_t),sizeof(uint64_t),uint64_compare);
        elapsed_seconds = get_elapsed_time();
        fprintf(stdout,"sorting cost %d seconds\n",elapsed_seconds);
        fp_result = fopen("single_result.dat","wb");
        if(NULL == fp_result)
        {
                fprintf(stderr,"open result file error.\n");
                exit(1);
        }
        /* step 3: write the sorted block back to disk */
        restart_timer();
        if(size/sizeof(uint64_t) != fwrite(buffer,sizeof(uint64_t),size/sizeof(uint64_t),fp_result))
        {
                fprintf(stderr,"writing file failure...\n");
                exit(1);
        }
        elapsed_seconds = get_elapsed_time();
        fprintf(stdout,"writing results cost %d seconds\n",elapsed_seconds);
        free(buffer);
        fclose(fp);
        fclose(fp_result);
        return 0;
}

The test procedure and running times of the single-threaded version are as follows:

[lichao@sg01 thread_power]$ gcc -c timer.c -o timer.o
[lichao@sg01 thread_power]$ gcc random_generator.c -o random_generator timer.o
[lichao@sg01 thread_power]$ ./random_generator 30
generating cost 36 seconds.

So creating the 1 GB file took 36 seconds, a write speed of 29826161 B/s (2^30 B / 36 s), or roughly 30 MB/s. Now compile the single-threaded version and time it:

[lichao@sg01 thread_power]$ gcc single_thread_sort.c -o single_thread_sort timer.o -lpthread
[lichao@sg01 thread_power]$ ./single_thread_sort
loading cost 44 seconds
sorting cost 85 seconds
writing results cost 81 seconds

Figure 4: CPU utilization during the sorting phase

For reasons of space, the rest continues in the next post: A Big-Data Processing Problem from Baidu's Intern Recruitment (Part 2)
