Today on CSDN I once again came across July's post: "Teach you how to quickly crack 99% of the massive data processing interview questions".
I had read the article before; today, with some time to spare, I thought I would use it as MapReduce practice.
MapReduce is a big blade, and when the blade is big enough the problem can hardly stand up to its edge. Honestly, I started out thinking that this many problems would take quite some effort, but after finishing the first one and moving on to the next, I found that under the MapReduce model these problems all turn out to be much the same. It seems that someone who wields a big blade doesn't care whether he is chopping wood or heads: he abstracts the target into a cylinder, raises the blade, and brings it straight down.
Straight to the point:
Problem 1: From massive log data, extract the top K IPs that visited Baidu most often on a given day. [slightly modified from the original]
Note: every page visit writes the visitor's IP to the log as a line of its own; a small sample dataset can be downloaded here.
The plan is two chained jobs: job1 sums up the visit count of each IP, and job2 sorts the (ip, count) pairs by count and keeps the top K. First, a WritableComparable key type that sorts by count:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
public class IPAndCount implements WritableComparable<IPAndCount> {
    private Text ip;
    private IntWritable count;

    public IPAndCount() {
        this.ip = new Text("");
        this.count = new IntWritable(1);
    }

    public IPAndCount(Text ip, IntWritable count) {
        this.ip = ip;
        this.count = count;
    }

    public IPAndCount(String ip, int count) {
        this.ip = new Text(ip);
        this.count = new IntWritable(count);
    }

    public void readFields(DataInput in) throws IOException {
        ip.readFields(in);
        count.readFields(in);
    }

    public void write(DataOutput out) throws IOException {
        ip.write(out);
        count.write(out);
    }

    // Sort by count in descending order (hence o.count is compared to this.count)
    // and break ties by ip. If we compared only the count, IPs with equal counts
    // would collapse into a single key in the shuffle phase and we would lose data.
    public int compareTo(IPAndCount o) {
        int byCount = o.count.compareTo(count);
        return byCount == 0 ? ip.compareTo(o.ip) : byCount;
    }

    public int hashCode() {
        return ip.hashCode();
    }

    public boolean equals(Object o) {
        if (!(o instanceof IPAndCount))
            return false;
        IPAndCount other = (IPAndCount) o;
        return ip.equals(other.ip) && count.equals(other.count);
    }

    public String toString() {
        return "[ip=" + ip + ",count=" + count + "]";
    }

    public Text getIp() {
        return ip;
    }

    public void setIp(Text ip) {
        this.ip = ip;
    }

    public IntWritable getCount() {
        return count;
    }

    public void setCount(IntWritable count) {
        this.count = count;
    }
}
Job2 then reads job1's output back in, using KeyValueTextInputFormat, the counterpart of the TextOutputFormat that job1 writes; the pairing is visible in job1's configuration, where the output key and value are separated by a comma.
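To make the hand-off concrete (the IP and count here are assumed sample values), job1 writes lines such as

192.168.0.21,7

and KeyValueTextInputFormat splits each line at the configured ',' into the key 192.168.0.21 and the value 7, which is exactly the (Text, Text) pair that BeforeSortIPMapper below receives.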
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class FindActiveIP extends Configured implements Tool {

    // Job1: count the number of visits per IP.
    public static class SumUpIPMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        IntWritable one = new IntWritable(1);

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, one); // each log line is one IP
        }
    }

    public static class SumUpIPReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        // As an optimization, we could emit only this task's local top K here
        // (see the sketch after this listing).
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Job2: wrap each (ip, count) pair in an IPAndCount key so the shuffle
    // sorts by count in descending order.
    public static class BeforeSortIPMapper extends Mapper<Text, Text, IPAndCount, Text> {
        public void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            IPAndCount tmp = new IPAndCount(key, new IntWritable(Integer.valueOf(value.toString())));
            context.write(tmp, new Text());
        }
    }

    // The number of reduce tasks of job2 must be set to one, so that this
    // reducer sees all keys in globally sorted order and can take the first K.
    public static class SelectTopKIPReducer extends Reducer<IPAndCount, Text, IPAndCount, Text> {
        int counter = 0;
        int K = 10;

        public void reduce(IPAndCount key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            if (counter < K) {
                context.write(key, null);
                counter++;
            }
        }
    }

    public int run(String[] args) throws Exception {
        // Job1: sum up the visits per IP and write "ip,count" lines.
        Configuration conf = new Configuration();
        Job job = new Job(conf, "SumUpIP");
        job.setJarByClass(FindActiveIP.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.getConfiguration().set("mapred.textoutputformat.separator", ",");
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        job.setMapperClass(SumUpIPMapper.class);
        job.setReducerClass(SumUpIPReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(7);

        // Job2: sort by count and select the top K.
        Configuration conf2 = new Configuration();
        Job job2 = new Job(conf2, "SortAndFindTopK");
        job2.setJarByClass(FindActiveIP.class);
        job2.setInputFormatClass(KeyValueTextInputFormat.class);
        job2.getConfiguration().set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
        job2.setOutputFormatClass(TextOutputFormat.class);
        Path in2 = new Path(args[1]);
        Path out2 = new Path(args[2]);
        FileInputFormat.setInputPaths(job2, in2);
        FileOutputFormat.setOutputPath(job2, out2);
        job2.setMapperClass(BeforeSortIPMapper.class);
        job2.setReducerClass(SelectTopKIPReducer.class);
        job2.setMapOutputKeyClass(IPAndCount.class);
        job2.setMapOutputValueClass(Text.class);
        job2.setOutputKeyClass(IPAndCount.class);
        job2.setOutputValueClass(Text.class);
        job2.setNumReduceTasks(1);

        // Chain the two jobs: job2 depends on job1's output.
        JobControl jobControl = new JobControl("FindTopKIP");
        ControlledJob cJob1 = new ControlledJob(conf);
        cJob1.setJob(job);
        ControlledJob cJob2 = new ControlledJob(conf2);
        cJob2.setJob(job2);
        jobControl.addJob(cJob1);
        jobControl.addJob(cJob2);
        cJob2.addDependingJob(cJob1);
        // JobControl.run() does not return by itself when the jobs finish,
        // so run it in a background thread and poll until all jobs are done.
        Thread controller = new Thread(jobControl);
        controller.setDaemon(true);
        controller.start();
        while (!jobControl.allFinished()) {
            Thread.sleep(500);
        }
        jobControl.stop();
        return 0;
    }

    public static void main(String args[]) throws Exception {
        int res = ToolRunner.run(new FindActiveIP(), args);
        System.exit(res);
    }
}
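The comment in SumUpIPReducer above mentions that job1 could emit only its local top K. Here is a minimal sketch of that idea (the class name LocalTopKReducer is my own): each reduce task keeps an at-most-K TreeMap keyed by the sum and flushes it in cleanup(), so job2 receives at most K lines per reduce task instead of one line per IP. The global top K is still computed by job2.

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LocalTopKReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private static final int K = 10;
    // sum -> ip, kept sorted by sum in ascending order, never more than K entries
    private final TreeMap<Integer, String> topK = new TreeMap<Integer, String>();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // NOTE: ties on sum overwrite each other; a TreeMap<Integer, List<String>>
        // would avoid that, but this keeps the sketch short.
        topK.put(sum, key.toString());
        if (topK.size() > K) {
            topK.remove(topK.firstKey()); // evict the current minimum
        }
    }

    protected void cleanup(Context context) throws IOException, InterruptedException {
        // flush the local top K once this reduce task has seen all of its keys
        for (Map.Entry<Integer, String> e : topK.entrySet()) {
            context.write(new Text(e.getValue()), new IntWritable(e.getKey()));
        }
    }
}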
Once you are used to the big blade, yesterday's big blade becomes today's embroidery needle; it's not that the needle is bad, it just doesn't sit right in the hand. And then, while you are listening to music and writing MapReduce in Java, someone yells in your ear: Pig~Pig~Pig~
grunt> records = LOAD 'input/ipdata' AS (ip:chararray);
grunt> grouped_records = GROUP records BY ip;
grunt> counted_records = FOREACH grouped_records GENERATE group, COUNT(records);
grunt> sorted_records = ORDER counted_records BY $1 DESC;
grunt> topK = LIMIT sorted_records 10;
grunt> DUMP topK;
(192.168.0.21,7)
(192.168.0.14,4)
(192.168.0.10,4)
(192.168.0.12,4)
(192.168.0.32,4)
(192.168.0.13,3)
(192.168.0.3,3)
(192.168.0.2,2)
(192.168.0.11,1)
I won't spell out the rest in Java; at my current level the results simply wouldn't be elegant:
Problem 3: There is a 1GB file with one word per line; each word is at most 16 bytes, and the memory limit is 1MB. Return the 100 most frequent words.
Problem 4: Massive data is spread across 100 machines; find a way to efficiently compute the TOP 10 of the whole dataset.
Problem 5: There are 10 files of 1GB each; every line of every file is a user query, and queries may repeat across files. Sort the queries by frequency.
Problem 7: How do you find the most frequently repeated item in massive data?
Problem 8: Given tens of millions or hundreds of millions of items (with duplicates), find the N items that occur most often.
Problem 9: A text file has about ten thousand lines, one word per line; report the 10 most frequent words.
Problems 1, 2, 3, 4, 5, 7, 8 and 9 all follow essentially the same pattern: group, count, order, limit. One thing worth noting: sometimes we know certain properties of the data we want in advance. In the hot-query problem above, for example, a hot query string is certain to be issued more than 1000 times, so we can FILTER early to cut down the data to be processed (and the pressure on bandwidth): [filted_records = FILTER grouped_records BY SIZE(records) > 1000;].
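In the Java version the same early filtering is a one-line guard in job1's reducer; a sketch, with the 1000 threshold being just the assumed figure from the example above:

// inside SumUpIPReducer.reduce(), after computing sum:
if (sum > 1000) { // only items known to matter survive, shrinking job2's input
    context.write(key, new IntWritable(sum));
}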
Problem 6: Given two files a and b, each holding 5 billion urls of 64 bytes each, with a memory limit of 4GB, find the urls common to a and b.
See [Hadoop] using DistributedCache for replicated joins, as well as using hadoop's datajoin package for relational joins; you can also consult Data-Intensive Text Processing with MapReduce to see how a join is carried out from first principles.
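Before the Pig version below, here is a minimal reduce-side sketch of problem 6 in Java (all class names are mine, and the "url1" substring test is an assumption about how the two input paths are named): tag each url with the file it came from, and a url is common exactly when both tags arrive at the reducer. Duplicates within one file are harmless here, because the reducer only looks at the set of tags.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class CommonUrls {
    public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text tag = new Text();

        protected void setup(Context context) {
            // remember which input file this map task is reading
            String path = ((FileSplit) context.getInputSplit()).getPath().toString();
            tag.set(path.contains("url1") ? "A" : "B");
        }

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, tag); // key: the url, value: its source tag
        }
    }

    public static class CommonUrlReducer extends Reducer<Text, Text, Text, NullWritable> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Set<String> tags = new HashSet<String>();
            for (Text t : values) {
                tags.add(t.toString());
            }
            if (tags.size() == 2) { // the url appeared in both files
                context.write(key, NullWritable.get());
            }
        }
    }
}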
grunt> A = LOAD 'input/url1' AS (url:chararray);
grunt> B = LOAD 'input/url2' AS (url:chararray);
grunt> grouped_A = GROUP A BY url;
grunt> non_duplicated_A = FOREACH grouped_A GENERATE group; -- deduplicate A
grunt> grouped_B = GROUP B BY url;
grunt> non_duplicated_B = FOREACH grouped_B GENERATE group; -- deduplicate B
grunt> C = JOIN non_duplicated_B BY group, non_duplicated_A BY group; -- inner join A and B
grunt> D = FOREACH C GENERATE $0; -- the urls common to both files
grunt> DUMP D;
Problem 10: 10 million strings, some of which are duplicated; remove every string that is duplicated and keep only the strings that never repeat.
Using Pig:
grunt> records = LOAD 'input/retrived_strings' AS (str:chararray);
grunt> grouped_records = GROUP records BY str;
grunt> filted_records = FILTER grouped_records BY SIZE(records) <= 1; -- keep only strings that occur exactly once
grunt> DUMP filted_records;
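For completeness, the same selection in Java is once again the wordcount pattern with the filter flipped; a sketch (the class name is assumed), to be paired with a mapper that emits each string with a count of 1:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UniqueOnlyReducer extends Reducer<Text, IntWritable, Text, NullWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        if (sum == 1) { // keep only strings that never repeat
            context.write(key, NullWritable.get());
        }
    }
}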