理解Hive中的窗口函数操作

最编程 2024-07-23 07:55:55

...

1、基本语法

    Function (arg1,..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>]
    [<window_expression>])
Function (arg1,..., argn) 可以是下面的函数：
    Aggregate Functions: 聚合函数,比如：sum(...)、 max(...)、min(...)、avg(...)等.
    Sort Functions: 数据排序函数, 比如 ：rank(...)、row_number(...)等.
    Analytics Functions: 统计和比较函数, 比如：lead(...)、lag(...)、 first_value(...)等.

2、前期准备：

2.1、建表语句:

CREATE TABLE IF NOT EXISTS temp.test (
                `name` string COMMENT '姓名',
                `dept_num` int COMMENT '编号',
                `employee_id` int COMMENT 'id',
                `salary` int COMMENT '工资',
                `type` string COMMENT '岗位类型',
                `start_date` date COMMENT '入职时间'
) ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  STORED as TEXTFILE;

2.2、在本地创建test文件：

name    dept_num    employee_id salary  type    start_date
Michael 1000        100         5000    full    2014-01-29
Will    1000        101         4000    full    2013-10-02
Wendy   1000        101         4000    part    2014-10-02
Steven  1000        102         6400    part    2012-11-03
Lucy    1000        103         5500    full    2010-01-03
Lily    1001        104         5000    part    2014-11-29
Jess    1001        105         6000    part    2014-12-02
Mike    1001        106         6400    part    2013-11-03
Wei     1002        107         7000    part    2010-04-03
Yun     1002        108         5500    full    2014-01-29
Richard 1002        109         8000    full    2013-09-01

2.3、将创建好的本地文上传至hive库中：

load data local inpath '/root/test' into table temp.test;

3、窗口聚合函数

3.1、查询姓名、部门编号、工资以及部门人数

select `name`,`dept_num`,`salary`,
count(*) over (partition by dept_num) as cnt 
from employee;

3.1.1、输出结果为：

Total MapReduce CPU Time Spent: 1 seconds 780 msec
OK
name    dept_num    salary    cnt
Lucy    1000        5500      5
Steven  1000        6400      5
Wendy   1000        4000      5
Will    1000        4000      5
Michael 1000        5000      5
Mike    1001        6400      3
Jess    1001        6000      3
Lily    1001        5000      3
Richard 1002        8000      3
Yun     1002        5500      3
Wei     1002        7000      3
Time taken: 22.624 seconds, Fetched: 12 row(s)

3.2、查询姓名、部门编号、工资以及每个部门的总工资，部门总工资按照降序输出

select name,dept_num,salary,
sum(salary) over (partition by dept_num order by dept_num) as sum_dept_salary 
from employee order by sum_dept_salary desc;

3.2.1、输出结果为：

Total MapReduce CPU Time Spent: 3 seconds 470 msec
OK
name    dept_num    salary    sum_dept_salary
Michael 1000        5000      24900
Will    1000        4000      24900
Wendy   1000        4000      24900
Steven  1000        6400      24900
Lucy    1000        5500      24900
Wei     1002        7000      20500
Yun     1002        5500      20500
Richard 1002        8000      20500
Lily    1001        5000      17400
Jess    1001        6000      17400
Mike    1001        6400      17400
Time taken: 47.313 seconds, Fetched: 12 row(s)

4、窗口排序函数

4.1、查询姓名、部门编号、工资、排名编号(按工资的多少排名)

select `name`,`dept_num`,`salary`,
row_number() over (order by salary desc ) rnum 
from employee;

4.1.1、输出结果为：

Total MapReduce CPU Time Spent: 1 seconds 890 msec
OK
name    dept_num    salary    rnum
Richard 1002        8000      1
Wei     1002        7000      2
Mike    1001        6400      3
Steven  1000        6400      4
Jess    1001        6000      5
Yun     1002        5500      6
Lucy    1000        5500      7
Lily    1001        5000      8
Michael 1000        5000      9
Wendy   1000        4000      10
Will    1000        4000      11
Time taken: 22.453 seconds, Fetched: 12 row(s)

4.2、查询每个部门工资最高的两个人的信息(姓名、部门、薪水)

select name,dept_num,salary 
from (
select `name`,`dept_num`,`salary`,
row_number() over (partition by dept_num order by salary desc ) rnum 
from employee) t1 where rnum <= 2;

4.2.1、输出结果为：

Total MapReduce CPU Time Spent: 2 seconds 680 msec
OK
name    dept_num    salary
Steven  1000        6400
Lucy    1000        5500
Mike    1001        6400
Jess    1001        6000
Richard 1002        8000
Wei     1002        7000
Time taken: 24.083 seconds, Fetched: 7 row(s)

4.3、查询每个部门的员工工资排名信息

select `name`,`dept_num`,`salary`,
row_number() over (partition by dept_num order by salary desc ) rnum 
from employee;

4.3.1、输出结果为：

Total MapReduce CPU Time Spent: 1 seconds 860 msec
OK
name    dept_num    salary    rnum
Steven  1000        6400      1
Lucy    1000        5500      2
Michael 1000        5000      3
Wendy   1000        4000      4
Will    1000        4000      5
Mike    1001        6400      1
Jess    1001        6000      2
Lily    1001        5000      3
Richard 1002        8000      1
Wei     1002        7000      2
Yun     1002        5500      3
Time taken: 23.202 seconds, Fetched: 12 row(s)

4.4、使用rank函数进行排名

select `name`,`dept_num`,`salary`,
rank() over (order by salary desc) rank
from employee;

4.4.1、输出结果为：

Total MapReduce CPU Time Spent: 1 seconds 830 msec
OK
name    dept_num    salary    rank
Richard 1002        8000      1
Wei     1002        7000      2
Mike    1001        6400      3
Steven  1000        6400      3
Jess    1001        6000      5
Yun     1002        5500      6
Lucy    1000        5500      6
Lily    1001        5000      8
Michael 1000        5000      8
Wendy   1000        4000      10
Will    1000        4000      10
Time taken: 21.547 seconds, Fetched: 12 row(s)

4.5、使用dense_rank进行排名

select `name`,`dept_num`,`salary`,
dense_rank() over (order by salary desc) rank
from employee;

4.5.1、输出结果为：

Total MapReduce CPU Time Spent: 1 seconds 710 msec
OK
name    dept_num    salary    rank
Richard 1002        8000      1
Wei     1002        7000      2
Mike    1001        6400      3
Steven  1000        6400      3
Jess    1001        6000      4
Yun     1002        5500      5
Lucy    1000        5500      5
Lily    1001        5000      6
Michael 1000        5000      6
Wendy   1000        4000      7
Will    1000        4000      7
Time taken: 21.879 seconds, Fetched: 12 row(s)

4.6、使用percent_rank()进行排名

select name,dept_num,salary,
percent_rank() over (order by salary desc) rank
from employee;

4.6.1、输出结果为：

Total MapReduce CPU Time Spent: 1 seconds 940 msec
OK
name    dept_num    salary  rank
Richard 1002        8000    0.0
Wei     1002        7000    0.09090909090909091
Mike    1001        6400    0.18181818181818182
Steven  1000        6400    0.18181818181818182
Jess    1001        6000    0.36363636363636365
Yun     1002        5500    0.45454545454545453
Lucy    1000        5500    0.45454545454545453
Lily    1001        5000    0.6363636363636364
Michael 1000        5000    0.6363636363636364
Wendy   1000        4000    0.8181818181818182
Will    1000        4000    0.8181818181818182
Time taken: 22.401 seconds, Fetched: 12 row(s)

4.7、使用ntile进行数据分片排名

SELECT name,dept_num as deptno,salary,
ntile(4) OVER(ORDER BY salary desc) as ntile
FROM employee;

4.7.1、输出结果为：

Total MapReduce CPU Time Spent: 1 seconds 940 msec
OK
name    deptno  salary  ntile
Richard 1002    8000    1
Wei     1002    7000    1
Mike    1001    6400    1
Steven  1000    6400    2
Jess    1001    6000    2
Yun     1002    5500    2
Lucy    1000    5500    3
Lily    1001    5000    3
Michael 1000    5000    3
Wendy   1000    4000    4
Will    1000    4000    4
Time taken: 28.829 seconds, Fetched: 12 row(s)

5、窗口分析函数

5.1、统计小于等于当前工资的人数占总人数的比例

SELECT name,dept_num,salary,
cume_dist() OVER (ORDER BY salary) as cume
FROM employee;

5.1.1、输出结果为：

name    deptno  salary  cume
Wendy   1000    4000    0.25
Will    1000    4000    0.25
Lily    1001    5000    0.4166666666666667
Michael 1000    5000    0.4166666666666667
Yun     1002    5500    0.5833333333333334
Lucy    1000    5500    0.5833333333333334
Jess    1001    6000    0.6666666666666666
Mike    1001    6400    0.8333333333333334
Steven  1000    6400    0.8333333333333334
Wei     1002    7000    0.9166666666666666
Richard 1002    8000    1.0
Time taken: 20.869 seconds, Fetched: 12 row(s)

5.2、统计小于等于当前工资的人数占总人数的比例

SELECT name,dept_num,salary,
cume_dist() OVER (ORDER BY salary desc) as cume
FROM employee;

5.2.1、输出结果为：

Total MapReduce CPU Time Spent: 1 seconds 790 msec
OK
name    dept_num    salary  cume
Richard 1002        8000    0.08333333333333333
Wei     1002        7000    0.16666666666666666
Mike    1001        6400    0.3333333333333333
Steven  1000        6400    0.3333333333333333
Jess    1001        6000    0.4166666666666667
Yun     1002        5500    0.5833333333333334
Lucy    1000        5500    0.5833333333333334
Lily    1001        5000    0.75
Michael 1000        5000    0.75
Wendy   1000        4000    0.9166666666666666
Will    1000        4000    0.9166666666666666
Time taken: 21.672 seconds, Fetched: 12 row(s)

5.3、统计小于等于当前工资的人数占总人数的比例

SELECT name,dept_num,salary,
cume_dist() OVER (PARTITION BY dept_num ORDER BY salary) as cume
FROM employee;

5.3.1、输出结果为：

Total MapReduce CPU Time Spent: 2 seconds 130 msec
OK
name    dept_num    salary  cume
Wendy   1000        4000    0.4
Will    1000        4000    0.4
Michael 1000        5000    0.6
Lucy    1000        5500    0.8
Steven  1000        6400    1.0
Lily    1001        5000    0.3333333333333333
Jess    1001        6000    0.6666666666666666
Mike    1001        6400    1.0
Yun     1002        5500    0.3333333333333333
Wei     1002        7000    0.6666666666666666
Richard 1002        8000    1.0
Time taken: 22.055 seconds, Fetched: 12 row(s)

5.4、统计小于等于当前工资的人数占总人数的比例

SELECT name,dept_num,salary,
lead(salary,1) OVER (PARTITION BY dept_num ORDER BY salary) as lead
FROM employee;

5.4.1、输出结果为：

Total MapReduce CPU Time Spent: 1 seconds 880 msec
OK
name    dept_num    salary  lead
Wendy   1000        4000    4000
Will    1000        4000    5000
Michael 1000        5000    5500
Lucy    1000        5500    6400
Steven  1000        6400    NULL
Lily    1001        5000    6000
Jess    1001        6000    6400
Mike    1001        6400    NULL
Yun     1002        5500    7000
Wei     1002        7000    8000
Richard 1002        8000    NULL
Time taken: 21.57 seconds, Fetched: 12 row(s)

5.5、统计小于等于当前工资的人数占总人数的比例

SELECT name,dept_num,salary,
lag(salary,1) OVER (PARTITION BY dept_num ORDER BY salary) as lead
FROM employee;

5.5.1、输出结果为：

Total MapReduce CPU Time Spent: 1 seconds 700 msec
OK
name    dept_num    salary  lead
Wendy   1000        4000    NULL
Will    1000        4000    4000
Michael 1000        5000    4000
Lucy    1000        5500    5000
Steven  1000        6400    5500
Lily    1001        5000    NULL
Jess    1001        6000    5000
Mike    1001        6400    6000
Yun     1002        5500    NULL
Wei     1002        7000    5500
Richard 1002        8000    7000
Time taken: 21.423 seconds, Fetched: 12 row(s)

5.6、统计小于等于当前工资的人数占总人数的比例

SELECT name,dept_num,salary,
first_value(salary) OVER (PARTITION BY dept_num ORDER BY salary) as fval
FROM employee;

5.6.1、输出结果为：

Total MapReduce CPU Time Spent: 1 seconds 720 msec
OK
name    dept_num    salary  fval
Wendy   1000        4000    4000
Will    1000        4000    4000
Michael 1000        5000    4000
Lucy    1000        5500    4000
Steven  1000        6400    4000
Lily    1001        5000    5000
Jess    1001        6000    5000
Mike    1001        6400    5000
Yun     1002        5500    5500
Wei     1002        7000    5500
Richard 1002        8000    5500
Time taken: 20.379 seconds, Fetched: 12 row(s)

5.7、统计小于等于当前工资的人数占总人数的比例

SELECT name,dept_num,salary,
last_value(salary) OVER (PARTITION BY dept_num ORDER BY salary RANGE
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as lval
FROM employee;

5.7.1、输出结果为：

Total MapReduce CPU Time Spent: 1 seconds 770 msec
OK
name    dept_num    salary  lval
Wendy   1000        4000    6400
Will    1000        4000    6400
Michael 1000        5000    6400
Lucy    1000        5500    6400
Steven  1000        6400    6400
Lily    1001        5000    6400
Jess    1001        6000    6400
Mike    1001        6400    6400
Yun     1002        5500    8000
Wei     1002        7000    8000
Richard 1002        8000    8000
Time taken: 21.649 seconds, Fetched: 12 row(s)

上一篇：巧用窗口函数OVER(PARTITION BY)对数据库数据进行智能排序

下一篇：用MySQL实现窗口函数操作详解