Hive--笔试题05_1--求TopN

现在有一个面试题

 

场景举例

北京市学生成绩分析

 

成绩的数据格式

exercise5_1.txt 文件中的每一行就是一个学生的成绩信息。字段之间的分隔符是","

时间,学校,年纪,姓名,科目,成绩


样例数据

2013,北大,1,黄渤,语文,97
2013,北大,1,徐峥,语文,52
2013,北大,1,刘德华,语文,85
2012,清华,0,马云,英语,61
2015,北理工,3,李彦宏,物理,81
2016,北科,4,马化腾,化学,92
2014,北航,2,刘强东,数学,70
2012,清华,0,刘诗诗,英语,59
2014,北航,2,刘亦菲,数学,49
2014,北航,2,刘嘉玲,数学,77

 

建表导数据

create database if not exists exercise;
use exercise;
drop table if exists exercise5_1;
create table exercise5_1(year int, school string, grade int, name string, course string, score int) row format delimited fields terminated by ',';
load data local inpath "/home/hadoop/exercise5_1.txt" into table exercise5_1;
select * from exercise5_1;
desc exercise5_1;

 

 

需求

1、分组TopN,选出今年每个学校、每个年级、分数前三的科目

select t.* 
from 
(select school, grade, course, score,
row_number() over (partition by school, grade, course order by score desc) rank_code 
from exercise5_1 
where year = "2017"
) t
where t.rank_code <= 3;

详解如下:
  row_number函数:row_number() 按指定的列进行分组生成行序列,从 1 开始,如果两行记录的分组列相同,则行序列 +1。
  over 函数:是一个窗口函数。
  over (order by score) 按照 score 排序进行累计,order by 是个默认的开窗函数。
  over (partition by grade) 按照班级分区。
  over (partition by grade order by score) 按照班级分区,并按着分数排序。
  over (order by score range between 2 preceding and 2 following) 窗口范围为当前行的数据幅度减2加2后的范围内的数据求和。

 

2、今年,北航,每个班级,每科的分数,及分数上下浮动2分的总和

select school, grade, course, score,
sum(score) over (order by score range between 2 preceding and 2 following) sscore
from exercise5_1 
where year = "2017" and school="北航";

 

3、where与having:今年,清华1年级,总成绩大于200分的学生以及学生数

select school, grade, name, sum(score) as total_score,
count(1) over (partition by school, grade) nct
from exercise5_1
where year = "2017" and school="清华" and grade = 1
group by school, grade, name
having total_score > 200;

having 是分组(group by)后的筛选条件,分组后的数据组内再筛选,也就是说 HAVING 子句可以让我们筛选成组后的各组数据。
where 则是在分组,聚合前先筛选记录。也就是说作用在 GROUP BY 子句和 HAVING 子句前。

 

 

發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章