欢迎您访问 最编程 本站为您分享编程语言代码,编程技术文章!
您现在的位置是: 首页

探索Hive中的窗口函数:理解SUM、AVG、COUNT、MAX和MIN的窗口聚合操作

最编程 2024-07-23 07:24:16
...

窗口聚合函数

  • 窗口聚合函数:SUM、AVG、COUNT、MAX、MIN
    • 以SUM为例
    • 总结
      • 实现分区内起始到当前行的pv累加,指定排序默认窗口:取分区第一行开始到当前行的和
      • 实现分区内起始到当前行的pv累加,指定排序默认窗口:取分区第一行开始到当前行的和,每组的相同level分享同一累加值。
      • 实现分区内所有pv的累加,不指定排序默认窗口:从第一行到最后一行
      • 实现分区内起始到当前行的pv累加,手动指定窗口大小从第一行到当前行
      • 实现分区内指定前N行到当前行的pv累加
      • 实现分区内指定前N行到后N行的pv累加
      • 实现分区内指定当前行到后N行的pv累加
  • 注意:窗口函数不支持distinct

窗口聚合函数:SUM、AVG、COUNT、MAX、MIN

以SUM为例

总结

  • 1.如果不使用 over(),窗口大小是针对查询产生的所有数据
  • 功能:用于实现数据分区后的聚合
  • 语法:fun_name(col1) over (partition by col2 order by col3)
  • 示例:实现分区内的累加,其他的原理类似
  • 不指定窗口时,不排序默认第一行到最后一行,排序默认第一行到当前行
  • 指定窗口时 --rows between 起始位置 and 结束位置
    • N preceding:往前多少行
    • N following:往后多少行
    • current row:当前行
    • unbounded:起点或者终点,没有边界
    • unbounded preceding 表示从前面的起点
    • unbounded following:表示到后面的终点
  • 注意
    • 分组列表,组内按排列顺序求累加和, 每组的相同level(排序字段)分享同一累加值。
    • 如果level(排序字段)有相同,需要注意是否是你想要的结果,是否需要考虑指定窗口位置

实现分区内起始到当前行的pv累加,指定排序默认窗口:取分区第一行开始到当前行的和

--创建文件:vim /export/datas/window.txt
cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie2,2018-04-13,3
cookie2,2018-04-14,2
cookie2,2018-04-15,4
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4
cookie2,2018-04-10,1
cookie2,2018-04-11,5
cookie2,2018-04-12,7
--创建表
create database db_function;
use db_function;
create table hanjiaxiaozhi_f1(
cookieid string,
daystr string,
pv int
) row format delimited fields terminated by ',';
--加载数据
load data local inpath '/export/datas/window.txt' into table hanjiaxiaozhi_f1;
set hive.exec.mode.local.auto=true;
--实现分区内起始到当前行的pv累加,默认窗口:取分区第一行开始到当前行的和
select 
  cookieid,
  daystr,
  pv,
  sum(pv) over(partition by cookieid order by daystr) as pv1 
from 
  hanjiaxiaozhi_f1;
  
+-----------+-------------+-----+------+--+
| cookieid  |   daystr    | pv  | pv3  |
+-----------+-------------+-----+------+--+
| cookie1   | 2018-04-10  | 1   | 1    |
| cookie1   | 2018-04-11  | 5   | 6    |
| cookie1   | 2018-04-12  | 7   | 13   |
| cookie1   | 2018-04-13  | 3   | 16   |
| cookie1   | 2018-04-14  | 2   | 18   |
| cookie1   | 2018-04-15  | 4   | 22   |
| cookie1   | 2018-04-16  | 4   | 26   |
| cookie2   | 2018-04-10  | 1   | 1    |
| cookie2   | 2018-04-11  | 5   | 6    |
| cookie2   | 2018-04-12  | 7   | 13   |
| cookie2   | 2018-04-13  | 3   | 16   |
| cookie2   | 2018-04-14  | 2   | 18   |
| cookie2   | 2018-04-15  | 4   | 22   |
+-----------+-------------+-----+------+--+

实现分区内起始到当前行的pv累加,指定排序默认窗口:取分区第一行开始到当前行的和,每组的相同level分享同一累加值。

select id,score,
sum(score) over(patition by id order by score) cumm_scores  
from GRADE

id score cumm_scores
1 98 98
1 100 198
2 95 190
2 95 190

实现分区内所有pv的累加,不指定排序默认窗口:从第一行到最后一行

select 
  cookieid,
  daystr,
  pv,
  sum(pv) over(partition by cookieid ) as pv2
from 
  hanjiaxiaozhi_f1;
  
+-----------+-------------+-----+------+--+
| cookieid  |   daystr    | pv  | pv2  |
+-----------+-------------+-----+------+--+
| cookie1   | 2018-04-16  | 4   | 26   |
| cookie1   | 2018-04-15  | 4   | 26   |
| cookie1   | 2018-04-13  | 3   | 26   |
| cookie1   | 2018-04-12  | 7   | 26   |
| cookie1   | 2018-04-11  | 5   | 26   |
| cookie1   | 2018-04-14  | 2   | 26   |
| cookie1   | 2018-04-10  | 1   | 26   |
| cookie2   | 2018-04-12  | 7   | 22   |
| cookie2   | 2018-04-11  | 5   | 22   |
| cookie2   | 2018-04-10  | 1   | 22   |
| cookie2   | 2018-04-14  | 2   | 22   |
| cookie2   | 2018-04-13  | 3   | 22   |
| cookie2   | 2018-04-15  | 4   | 22   |
+-----------+-------------+-----+------+--+
  

手动指定窗口的大小:分区

rows between 起始位置 and 结束位置
指定窗口的关键字:rows betweenand- N preceding:往前多少行
- N following:往后多少行
- current row:当前行
- unbounded:起点或者终点,没有边界
- unbounded preceding 表示从前面的起点
- unbounded following:表示到后面的终点
rows between unbounded preceding and current row
--实现分区内从第一行到当前行的pv累加
--手动指定窗口的大小:分区
rows between 起始位置 and 结束位置
rows between unbounded preceding and current row

实现分区内起始到当前行的pv累加,手动指定窗口大小从第一行到当前行

select 
  cookieid,
  daystr,
  pv,
  sum(pv) over(partition by cookieid order by daystr rows between unbounded preceding and current row) as pv3
from 
  hanjiaxiaozhi_f1;
  
  
+-----------+-------------+-----+------+--+
| cookieid  |   daystr    | pv  | pv3  |
+-----------+-------------+-----+------+--+
| cookie1   | 2018-04-10  | 1   | 1    |
| cookie1   | 2018-04-11  | 5   | 6    |
| cookie1   | 2018-04-12  | 7   | 13   |
| cookie1   | 2018-04-13  | 3   | 16   |
| cookie1   | 2018-04-14  | 2   | 18   |
| cookie1   | 2018-04-15  | 4   | 22   |
| cookie1   | 2018-04-16  | 4   | 26 
| cookie2   | 2018-04-10  | 1   | 1    |
| cookie2   | 2018-04-11  | 5   | 6    |
| cookie2   | 2018-04-12  | 7   | 13   |
| cookie2   | 2018-04-13  | 3   | 16   |
| cookie2   | 2018-04-14  | 2   | 18   |
| cookie2   | 2018-04-15  | 4   | 22   |
+-----------+-------------+-----+------+--+

实现分区内指定前N行到当前行的pv累加

select 
  cookieid,
  daystr,
  pv,
  sum(pv) over(partition by cookieid order by daystr rows between 3 preceding and current row) as pv4
from 
  hanjiaxiaozhi_f1;
  
+-----------+-------------+-----+------+--+
| cookieid  |   daystr    | pv  | pv4  |
+-----------+-------------+-----+------+--+
| cookie1   | 2018-04-10  | 1   | 1    |
| cookie1   | 2018-04-11  | 5   | 6    |
| cookie1   | 2018-04-12  | 7   | 13   |
| cookie1   | 2018-04-13  | 3   | 16   |
| cookie1   | 2018-04-14  | 2   | 17   |
| cookie1   | 2018-04-15  | 4   | 16   |
| cookie1   | 2018-04-16  | 4   | 13   |
| cookie2   | 2018-04-10  | 1   | 1    |
| cookie2   | 2018-04-11  | 5   | 6    |
| cookie2   | 2018-04-12  | 7   | 13   |
| cookie2   | 2018-04-13  | 3   | 16   |
| cookie2   | 2018-04-14  | 2   | 17   |
| cookie2   | 2018-04-15  | 4   | 16   |
+-----------+-------------+-----+------+--+

实现分区内指定前N行到后N行的pv累加

select 
  cookieid,
  daystr,
  pv,
  sum(pv) over(partition by cookieid order by daystr rows between 3 preceding and 1 following) as pv5
from 
  hanjiaxiaozhi_f1;
  
+-----------+-------------+-----+------+--+
| cookieid  |   daystr    | pv  | pv5  |
+-----------+-------------+-----+------+--+
| cookie1   | 2018-04-10  | 1   | 6    |
| cookie1   | 2018-04-11  | 5   | 13   |
| cookie1   | 2018-04-12  | 7   | 16   |
| cookie1   | 2018-04-13  | 3   | 18   |
| cookie1   | 2018-04-14  | 2   | 21   |
| cookie1   | 2018-04-15  | 4   | 20   |
| cookie1   | 2018-04-16  | 4   | 13   |
| cookie2   | 2018-04-10  | 1   | 6    |
| cookie2   | 2018-04-11  | 5   | 13   |
| cookie2   | 2018-04-12  | 7   | 16   |
| cookie2   | 2018-04-13  | 3   | 18   |
| cookie2   | 2018-04-14  | 2   | 21   |
| cookie2   | 2018-04-15  | 4   | 16   |
+-----------+-------------+-----+------+--+

实现分区内指定当前行到后N行的pv累加

select 
  cookieid,
  daystr,
  pv,
  sum(pv) over(partition by cookieid order by daystr rows between current row and unbounded following) as pv6
from 
  hanjiaxiaozhi_f1;
  
+-----------+-------------+-----+------+--+
| cookieid  |   daystr    | pv  | pv6  |
+-----------+-------------+-----+------+--+
| cookie1   | 2018-04-10  | 1   | 26   |
| cookie1   | 2018-04-11  | 5   | 25   |
| cookie1   | 2018-04-12  | 7   | 20   |
| cookie1   | 2018-04-13  | 3   | 13   |
| cookie1   | 2018-04-14  | 2   | 10   |
| cookie1   | 2018-04-15  | 4   | 8    |
| cookie1   | 2018-04-16  | 4   | 4    |
| cookie2   | 2018-04-10  | 1   | 22   |
| cookie2   | 2018-04-11  | 5   | 21   |
| cookie2   | 2018-04-12  | 7   | 16   |
| cookie2   | 2018-04-13  | 3   | 9    |
| cookie2   | 2018-04-14  | 2   | 6    |
| cookie2   | 2018-04-15  | 4   | 4    |
+-----------+-------------+-----+------+--+

注意:窗口函数不支持distinct

  • 会报错Error in query: Distinct window functions are not supported: 。。。。
  • 解决:size( collect_set(xxx)over( ) )