理解Hive中的窗口函数:ROW_NUMBER、RANK、DENSE_RANK与NTILE的实践应用与剖析
最编程
2024-07-23 07:20:32
...
窗口分析函数
- 分析函数:ROW_NUMBER、RANK、DENSE_RANK、NTILE、cume_dist、percent_rank
- 总结:
- ROW_NUMBER[重点]
- RANK
- DENSE_RANK【重点】
- NTILE
分析函数:ROW_NUMBER、RANK、DENSE_RANK、NTILE、cume_dist、percent_rank
总结:
ROW_NUMBER、RANK、DENSE_RANK --并列的话123 113 112
-
不指定窗口时,不排序默认第一行到最后一行,排序默认第一行到当前行
-
指定窗口时 --rows between 起始位置 and 结束位置
- N preceding:往前多少行
- N following:往后多少行
- current row:当前行
- unbounded:起点或者终点,没有边界
- unbounded preceding 表示从前面的起点
- unbounded following:表示到后面的终点
-
ROW_NUMBER
- 功能:用于实现分区内记录编号
- 语法:row_number() over (partition by col1 order by col2)
-
RANK
- 功能:用于实现分区内排名编号[会留空位]
- 语法:rank() over (partition by col1 order by col2)
-
DENSE_RANK
- 功能:用于实现分区内排名编号[不留空位]
- 语法:dense_rank() over (partition by col1 order by col2)
-
NTILE
- 功能:将每个分区内排序后的结果均分成N份【如果不能均分,优先分配编号小的】
- 本质:将每个分区拆分成更小的分区
- 语法:NTILE(N) over (partition by col1 order by col2)
-
cume_dist
- 功能:计算某个窗口或分区中某个值的累积分布。
- 语法:cume_dist(N) over (partition by col1 order by col2)
- 假定升序排序,则使用以下公式确定累积分布:小于等于当前值x的行数 / 窗口或partition分区内的总行数。其中,x 等于 order by 子句中指定的列的当前行中的值。
-
percent_rank
- 功能:返回某列或某列组合后每行的百分比排序
- 语法:percent_rank(N) over (partition by col1 order by col2)
- 可以用来计算超过了百分之多少的人。(当前行的rank值-1)/(分组内的总行数-1)
ROW_NUMBER[重点]
- 功能:用于实现分区内记录编号
- 语法:row_number() over (partition by col1 order by col2)
- 示例:统计每个部门薪资最高的前两名
select
empno,
ename,
salary,
deptno,
row_number() over (partition by deptno order by salary desc) as numb
from
db_emp.tb_emp;
--此时为排名,要求取前两名,作为临时表t再套一层查询即可
select
*
from
(select
empno,
ename,
salary,
deptno,
row_number() over (partition by deptno order by salary desc) as numb
from
db_emp.tb_emp) t
where t.numb < 3;
+--------+---------+---------+---------+-------+--+
| empno | ename | salary | deptno | numb |
+--------+---------+---------+---------+-------+--+
| 7839 | KING | 5000.0 | 10 | 1 |
| 7782 | CLARK | 2450.0 | 10 | 2 |
| 7934 | MILLER | 1300.0 | 10 | 3 |
| 7788 | SCOTT | 3000.0 | 20 | 1 |
| 7902 | FORD | 3000.0 | 20 | 2 |
| 7566 | JONES | 2975.0 | 20 | 3 |
| 7876 | ADAMS | 1100.0 | 20 | 4 |
| 7369 | SMITH | 800.0 | 20 | 5 |
| 7698 | BLAKE | 2850.0 | 30 | 1 |
| 7499 | ALLEN | 1600.0 | 30 | 2 |
| 7844 | TURNER | 1500.0 | 30 | 3 |
| 7654 | MARTIN | 1250.0 | 30 | 4 |
| 7521 | WARD | 1250.0 | 30 | 5 |
| 7900 | JAMES | 950.0 | 30 | 6 |
+--------+---------+---------+---------+-------+--+
RANK
- 功能:用于实现分区内排名编号[会留空位]
- 与row_number的区别:
- row_number:如果排序时数值相同,继续编号
- rank:如果排序时数值相同,编号相同,但留下空位
- 语法:rank() over (partition by col1 order by col2)
- 示例:统计每个部门薪资排名
select
empno,
ename,
salary,
deptno,
rank() over (partition by deptno order by salary desc) as numb
from
db_emp.tb_emp;
+--------+---------+---------+---------+-------+--+
| empno | ename | salary | deptno | numb |
+--------+---------+---------+---------+-------+--+
| 7839 | KING | 5000.0 | 10 | 1 |
| 7782 | CLARK | 2450.0 | 10 | 2 |
| 7934 | MILLER | 1300.0 | 10 | 3 |
| 7788 | SCOTT | 3000.0 | 20 | 1 |
| 7902 | FORD | 3000.0 | 20 | 1 |
| 7566 | JONES | 2975.0 | 20 | 3 |
| 7876 | ADAMS | 1100.0 | 20 | 4 |
| 7369 | SMITH | 800.0 | 20 | 5 |
| 7698 | BLAKE | 2850.0 | 30 | 1 |
| 7499 | ALLEN | 1600.0 | 30 | 2 |
| 7844 | TURNER | 1500.0 | 30 | 3 |
| 7654 | MARTIN | 1250.0 | 30 | 4 |
| 7521 | WARD | 1250.0 | 30 | 4 |
| 7900 | JAMES | 950.0 | 30 | 6 |
+--------+---------+---------+---------+-------+--+
DENSE_RANK【重点】
- 功能:用于实现分区内排名编号[不留空位]
- 与rank的区别:
- rank:如果排序时数值相同,编号不变,并留下排名空位
- dense_rank:如果排序时数值相同,编号不变,不留空位
- 语法:dense_rank() over (partition by col1 order by col2)
- 示例:统计每个部门薪资排名
select
empno,
ename,
salary,
deptno,
dense_rank() over (partition by deptno order by salary desc) as numb
from
db_emp.tb_emp;
NTILE
- 功能:将每个分区内排序后的结果均分成N份【如果不能均分,优先分配编号小的】
- 本质:将每个分区拆分成更小的分区
- 语法:NTILE(N) over (partition by col1 order by col2)
- 示例:统计每个部门薪资排名,将每个部门的薪资分为两个部分,区分高薪和低薪
select
empno,
ename,
salary,
deptno,
NTILE(2) over (partition by deptno order by salary desc) as numb
from
db_emp.tb_emp;