理解Hive中的窗口函数操作
最编程
2024-07-23 07:55:55
...
1、基本语法
Function (arg1,..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>]
[<window_expression>])
Function (arg1,..., argn) 可以是下面的函数:
Aggregate Functions: 聚合函数,比如:sum(...)、 max(...)、min(...)、avg(...)等.
Sort Functions: 数据排序函数, 比如 :rank(...)、row_number(...)等.
Analytics Functions: 统计和比较函数, 比如:lead(...)、lag(...)、 first_value(...)等.
2、前期准备:
2.1、建表语句:
CREATE TABLE IF NOT EXISTS temp.test (
`name` string COMMENT '姓名',
`dept_num` int COMMENT '编号',
`employee_id` int COMMENT 'id',
`salary` int COMMENT '工资',
`type` string COMMENT '岗位类型',
`start_date` date COMMENT '入职时间'
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED as TEXTFILE;
2.2、在本地创建test文件:
name dept_num employee_id salary type start_date
Michael 1000 100 5000 full 2014-01-29
Will 1000 101 4000 full 2013-10-02
Wendy 1000 101 4000 part 2014-10-02
Steven 1000 102 6400 part 2012-11-03
Lucy 1000 103 5500 full 2010-01-03
Lily 1001 104 5000 part 2014-11-29
Jess 1001 105 6000 part 2014-12-02
Mike 1001 106 6400 part 2013-11-03
Wei 1002 107 7000 part 2010-04-03
Yun 1002 108 5500 full 2014-01-29
Richard 1002 109 8000 full 2013-09-01
2.3、将创建好的本地文上传至hive库中:
load data local inpath '/root/test' into table temp.test;
3、窗口聚合函数
3.1、查询姓名、部门编号、工资以及部门人数
select `name`,`dept_num`,`salary`,
count(*) over (partition by dept_num) as cnt
from employee;
3.1.1、输出结果为:
Total MapReduce CPU Time Spent: 1 seconds 780 msec
OK
name dept_num salary cnt
Lucy 1000 5500 5
Steven 1000 6400 5
Wendy 1000 4000 5
Will 1000 4000 5
Michael 1000 5000 5
Mike 1001 6400 3
Jess 1001 6000 3
Lily 1001 5000 3
Richard 1002 8000 3
Yun 1002 5500 3
Wei 1002 7000 3
Time taken: 22.624 seconds, Fetched: 12 row(s)
3.2、查询姓名、部门编号、工资以及每个部门的总工资,部门总工资按照降序输出
select name,dept_num,salary,
sum(salary) over (partition by dept_num order by dept_num) as sum_dept_salary
from employee order by sum_dept_salary desc;
3.2.1、输出结果为:
Total MapReduce CPU Time Spent: 3 seconds 470 msec
OK
name dept_num salary sum_dept_salary
Michael 1000 5000 24900
Will 1000 4000 24900
Wendy 1000 4000 24900
Steven 1000 6400 24900
Lucy 1000 5500 24900
Wei 1002 7000 20500
Yun 1002 5500 20500
Richard 1002 8000 20500
Lily 1001 5000 17400
Jess 1001 6000 17400
Mike 1001 6400 17400
Time taken: 47.313 seconds, Fetched: 12 row(s)
4、窗口排序函数
4.1、查询姓名、部门编号、工资、排名编号(按工资的多少排名)
select `name`,`dept_num`,`salary`,
row_number() over (order by salary desc ) rnum
from employee;
4.1.1、输出结果为:
Total MapReduce CPU Time Spent: 1 seconds 890 msec
OK
name dept_num salary rnum
Richard 1002 8000 1
Wei 1002 7000 2
Mike 1001 6400 3
Steven 1000 6400 4
Jess 1001 6000 5
Yun 1002 5500 6
Lucy 1000 5500 7
Lily 1001 5000 8
Michael 1000 5000 9
Wendy 1000 4000 10
Will 1000 4000 11
Time taken: 22.453 seconds, Fetched: 12 row(s)
4.2、查询每个部门工资最高的两个人的信息(姓名、部门、薪水)
select name,dept_num,salary
from (
select `name`,`dept_num`,`salary`,
row_number() over (partition by dept_num order by salary desc ) rnum
from employee) t1 where rnum <= 2;
4.2.1、输出结果为:
Total MapReduce CPU Time Spent: 2 seconds 680 msec
OK
name dept_num salary
Steven 1000 6400
Lucy 1000 5500
Mike 1001 6400
Jess 1001 6000
Richard 1002 8000
Wei 1002 7000
Time taken: 24.083 seconds, Fetched: 7 row(s)
4.3、查询每个部门的员工工资排名信息
select `name`,`dept_num`,`salary`,
row_number() over (partition by dept_num order by salary desc ) rnum
from employee;
4.3.1、输出结果为:
Total MapReduce CPU Time Spent: 1 seconds 860 msec
OK
name dept_num salary rnum
Steven 1000 6400 1
Lucy 1000 5500 2
Michael 1000 5000 3
Wendy 1000 4000 4
Will 1000 4000 5
Mike 1001 6400 1
Jess 1001 6000 2
Lily 1001 5000 3
Richard 1002 8000 1
Wei 1002 7000 2
Yun 1002 5500 3
Time taken: 23.202 seconds, Fetched: 12 row(s)
4.4、使用rank函数进行排名
select `name`,`dept_num`,`salary`,
rank() over (order by salary desc) rank
from employee;
4.4.1、输出结果为:
Total MapReduce CPU Time Spent: 1 seconds 830 msec
OK
name dept_num salary rank
Richard 1002 8000 1
Wei 1002 7000 2
Mike 1001 6400 3
Steven 1000 6400 3
Jess 1001 6000 5
Yun 1002 5500 6
Lucy 1000 5500 6
Lily 1001 5000 8
Michael 1000 5000 8
Wendy 1000 4000 10
Will 1000 4000 10
Time taken: 21.547 seconds, Fetched: 12 row(s)
4.5、使用dense_rank进行排名
select `name`,`dept_num`,`salary`,
dense_rank() over (order by salary desc) rank
from employee;
4.5.1、输出结果为:
Total MapReduce CPU Time Spent: 1 seconds 710 msec
OK
name dept_num salary rank
Richard 1002 8000 1
Wei 1002 7000 2
Mike 1001 6400 3
Steven 1000 6400 3
Jess 1001 6000 4
Yun 1002 5500 5
Lucy 1000 5500 5
Lily 1001 5000 6
Michael 1000 5000 6
Wendy 1000 4000 7
Will 1000 4000 7
Time taken: 21.879 seconds, Fetched: 12 row(s)
4.6、使用percent_rank()进行排名
select name,dept_num,salary,
percent_rank() over (order by salary desc) rank
from employee;
4.6.1、输出结果为:
Total MapReduce CPU Time Spent: 1 seconds 940 msec
OK
name dept_num salary rank
Richard 1002 8000 0.0
Wei 1002 7000 0.09090909090909091
Mike 1001 6400 0.18181818181818182
Steven 1000 6400 0.18181818181818182
Jess 1001 6000 0.36363636363636365
Yun 1002 5500 0.45454545454545453
Lucy 1000 5500 0.45454545454545453
Lily 1001 5000 0.6363636363636364
Michael 1000 5000 0.6363636363636364
Wendy 1000 4000 0.8181818181818182
Will 1000 4000 0.8181818181818182
Time taken: 22.401 seconds, Fetched: 12 row(s)
4.7、使用ntile进行数据分片排名
SELECT name,dept_num as deptno,salary,
ntile(4) OVER(ORDER BY salary desc) as ntile
FROM employee;
4.7.1、输出结果为:
Total MapReduce CPU Time Spent: 1 seconds 940 msec
OK
name deptno salary ntile
Richard 1002 8000 1
Wei 1002 7000 1
Mike 1001 6400 1
Steven 1000 6400 2
Jess 1001 6000 2
Yun 1002 5500 2
Lucy 1000 5500 3
Lily 1001 5000 3
Michael 1000 5000 3
Wendy 1000 4000 4
Will 1000 4000 4
Time taken: 28.829 seconds, Fetched: 12 row(s)
5、窗口分析函数
5.1、统计小于等于当前工资的人数占总人数的比例
SELECT name,dept_num,salary,
cume_dist() OVER (ORDER BY salary) as cume
FROM employee;
5.1.1、输出结果为:
name deptno salary cume
Wendy 1000 4000 0.25
Will 1000 4000 0.25
Lily 1001 5000 0.4166666666666667
Michael 1000 5000 0.4166666666666667
Yun 1002 5500 0.5833333333333334
Lucy 1000 5500 0.5833333333333334
Jess 1001 6000 0.6666666666666666
Mike 1001 6400 0.8333333333333334
Steven 1000 6400 0.8333333333333334
Wei 1002 7000 0.9166666666666666
Richard 1002 8000 1.0
Time taken: 20.869 seconds, Fetched: 12 row(s)
5.2、统计小于等于当前工资的人数占总人数的比例
SELECT name,dept_num,salary,
cume_dist() OVER (ORDER BY salary desc) as cume
FROM employee;
5.2.1、输出结果为:
Total MapReduce CPU Time Spent: 1 seconds 790 msec
OK
name dept_num salary cume
Richard 1002 8000 0.08333333333333333
Wei 1002 7000 0.16666666666666666
Mike 1001 6400 0.3333333333333333
Steven 1000 6400 0.3333333333333333
Jess 1001 6000 0.4166666666666667
Yun 1002 5500 0.5833333333333334
Lucy 1000 5500 0.5833333333333334
Lily 1001 5000 0.75
Michael 1000 5000 0.75
Wendy 1000 4000 0.9166666666666666
Will 1000 4000 0.9166666666666666
Time taken: 21.672 seconds, Fetched: 12 row(s)
5.3、统计小于等于当前工资的人数占总人数的比例
SELECT name,dept_num,salary,
cume_dist() OVER (PARTITION BY dept_num ORDER BY salary) as cume
FROM employee;
5.3.1、输出结果为:
Total MapReduce CPU Time Spent: 2 seconds 130 msec
OK
name dept_num salary cume
Wendy 1000 4000 0.4
Will 1000 4000 0.4
Michael 1000 5000 0.6
Lucy 1000 5500 0.8
Steven 1000 6400 1.0
Lily 1001 5000 0.3333333333333333
Jess 1001 6000 0.6666666666666666
Mike 1001 6400 1.0
Yun 1002 5500 0.3333333333333333
Wei 1002 7000 0.6666666666666666
Richard 1002 8000 1.0
Time taken: 22.055 seconds, Fetched: 12 row(s)
5.4、统计小于等于当前工资的人数占总人数的比例
SELECT name,dept_num,salary,
lead(salary,1) OVER (PARTITION BY dept_num ORDER BY salary) as lead
FROM employee;
5.4.1、输出结果为:
Total MapReduce CPU Time Spent: 1 seconds 880 msec
OK
name dept_num salary lead
Wendy 1000 4000 4000
Will 1000 4000 5000
Michael 1000 5000 5500
Lucy 1000 5500 6400
Steven 1000 6400 NULL
Lily 1001 5000 6000
Jess 1001 6000 6400
Mike 1001 6400 NULL
Yun 1002 5500 7000
Wei 1002 7000 8000
Richard 1002 8000 NULL
Time taken: 21.57 seconds, Fetched: 12 row(s)
5.5、统计小于等于当前工资的人数占总人数的比例
SELECT name,dept_num,salary,
lag(salary,1) OVER (PARTITION BY dept_num ORDER BY salary) as lead
FROM employee;
5.5.1、输出结果为:
Total MapReduce CPU Time Spent: 1 seconds 700 msec
OK
name dept_num salary lead
Wendy 1000 4000 NULL
Will 1000 4000 4000
Michael 1000 5000 4000
Lucy 1000 5500 5000
Steven 1000 6400 5500
Lily 1001 5000 NULL
Jess 1001 6000 5000
Mike 1001 6400 6000
Yun 1002 5500 NULL
Wei 1002 7000 5500
Richard 1002 8000 7000
Time taken: 21.423 seconds, Fetched: 12 row(s)
5.6、统计小于等于当前工资的人数占总人数的比例
SELECT name,dept_num,salary,
first_value(salary) OVER (PARTITION BY dept_num ORDER BY salary) as fval
FROM employee;
5.6.1、输出结果为:
Total MapReduce CPU Time Spent: 1 seconds 720 msec
OK
name dept_num salary fval
Wendy 1000 4000 4000
Will 1000 4000 4000
Michael 1000 5000 4000
Lucy 1000 5500 4000
Steven 1000 6400 4000
Lily 1001 5000 5000
Jess 1001 6000 5000
Mike 1001 6400 5000
Yun 1002 5500 5500
Wei 1002 7000 5500
Richard 1002 8000 5500
Time taken: 20.379 seconds, Fetched: 12 row(s)
5.7、统计小于等于当前工资的人数占总人数的比例
SELECT name,dept_num,salary,
last_value(salary) OVER (PARTITION BY dept_num ORDER BY salary RANGE
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as lval
FROM employee;
5.7.1、输出结果为:
Total MapReduce CPU Time Spent: 1 seconds 770 msec
OK
name dept_num salary lval
Wendy 1000 4000 6400
Will 1000 4000 6400
Michael 1000 5000 6400
Lucy 1000 5500 6400
Steven 1000 6400 6400
Lily 1001 5000 6400
Jess 1001 6000 6400
Mike 1001 6400 6400
Yun 1002 5500 8000
Wei 1002 7000 8000
Richard 1002 8000 8000
Time taken: 21.649 seconds, Fetched: 12 row(s)
推荐阅读
-
14-傅里叶变换的代码实现-一、numpy实现傅里叶变换和逆傅里叶变换 1.numpy实现傅里叶变换numpy.fft.fft2实现傅里叶变换,返回一个复数数组(complex ndarray),也就是频谱图像numpy.fft.fftshift将零频率分量移到频谱中心(将左上角的低频区域,移到中心位置) 20*np.log(np.abs(fshift))设置频谱的范围。可以理解为,之前通过傅里叶变换得到复数的数组,是不能通过图像的方法展示出来的,需要转换为灰度图像(映射到[0,255]区间)需要注意的是1> 傅里叶得到低频、高频信息,针对低频、高频处理能够实现不同的目的2> 傅里叶过程是可逆的,图像经过傅里叶变换、逆傅里叶变换后,能够恢复到原始图像3> 在频域对图像进行处理,在频域的处理会反映在逆变换图像上 # 将绘制的图显示在窗口 %matplotlib qt5 import cv2 import numpy as np import matplotlib.pyplot as plt img = cv2.imread(r"image\lena.bmp",cv2.IMREAD_GRAYSCALE) # 傅里叶变换 f = np.fft.fft2(img) # 移动中心位置 fshift = np.fft.fftshift(f) # 调整值范围 result = 20*np.log(np.abs(fshift)) plt.subplot(1,2,1) plt.imshow(img,cmap=plt.cm.gray) plt.title("original") plt.axis("off") plt.subplot(1,2,2) plt.imshow(result,cmap=plt.cm.gray) plt.title("result") plt.axis("off") plt.show 傅里叶变换的频谱图像: 2.numpy实现逆傅里叶变换numpy.fft.ifft2实现逆傅里叶变换,返回一个复数数组(complex ndarray)numpy.fft.ifftshiftfftshift函数的逆函数,将中心位置的低频,重新移到左上角iimg = np.abs(逆傅里叶变化结果)设置值的范围,映射到[0,255]区间 # 将绘制的图显示在窗口 %matplotlib qt5 import cv2 import numpy as np import matplotlib.pyplot as plt img = cv2.imread(r"image\boat.bmp",cv2.IMREAD_GRAYSCALE) # 傅里叶变换 f = np.fft.fft2(img) fshift = np.fft.fftshift(f) # 逆傅里叶变换 ishift = np.fft.ifftshift(fshift) iimg = np.fft.ifft2(ishift) iimg = np.abs(iimg) plt.subplot(1,2,1) plt.imshow(img,cmap=plt.cm.gray) plt.title("original") plt.axis("off") plt.subplot(1,2,2) plt.imshow(iimg,cmap=plt.cm.gray) plt.title("iimg") plt.axis("off") plt.show 将一副图像,进行傅里叶变换和逆傅里叶变换后,进行对比(一样的) 实例:通过numpy实现高通滤波,保留图像的边缘信息 获取图像的形状rows,cols = img.shape获取图像的中心点crow,ccol = int(rows/2),int(cols/2)将频谱图像的中心区域(低频区域)设置为0(黑色)fshift[crow-30:crow+30,ccol-30:ccol+30] = 0 # 将绘制的图显示在窗口 %matplotlib qt5 import cv2 import numpy as np import matplotlib.pyplot as plt img = cv2.imread(r"image\boat.bmp",cv2.IMREAD_GRAYSCALE) # 傅里叶变换 f = np.fft.fft2(img) fshift = np.fft.fftshift(f) # 高通滤波 rows,cols = img.shape crow,ccol = int(rows/2),int(cols/2) fshift[crow-30:crow+30,ccol-30:ccol+30] = 0 # 逆傅里叶变换 ishift = np.fft.ifftshift(fshift) iimg = np.fft.ifft2(ishift) iimg = np.abs(iimg) plt.subplot(1,2,1) plt.imshow(img,cmap=plt.cm.gray) plt.title("original") plt.axis("off") plt.subplot(1,2,2) plt.imshow(iimg,cmap=plt.cm.gray) plt.title("iimg") plt.axis("off") plt.show 使用numpy实现高通滤波的实验结果: 二、opencv实现傅里叶变换和逆傅里叶变换 1.opencv实现傅里叶变换 返回结果 = cv2.dft(原始图像,转换标识)1> 返回结果:是双通道的,第一个通道是结果的实数部分,第二个通道是结果的虚数部分2> 原始图像:输入图像要首先转换成np.float32(img)格式3> 转换标识:flags = cv2.DFT_COMPLEX_OUTPUT,输出一个复数阵列numpy.fft.fftshift将零频率分量移到频谱中心(将左上角的低频区域,移到中心位置)调整频谱的范围,将上面频谱图像的复数数组,转换为可以显示的灰度图像(映射到[0,255]区间)返回值 = 20*np.log(cv2.magnitude(参数1,参数2))1> 参数1:浮点型X坐标值,也就是实部2> 参数2:浮点型Y坐标值,也就是虚部 # 将绘制的图显示在窗口 %matplotlib qt5 import cv2 import numpy as np import matplotlib.pyplot as plt img = cv2.imread(r"image\lena.bmp",cv2.IMREAD_GRAYSCALE) # 傅里叶变换 dft = cv2.dft(np.float32(img),flags = cv2.DFT_COMPLEX_OUTPUT) # 移动中心位置 dftShift = np.fft.fftshift(dft) # 调整频谱的范围 result = 20*np.log(cv2.magnitude(dftShift[:,:,0],dftShift[:,:,1])) plt.subplot(1,2,1) plt.imshow(img,cmap=plt.cm.gray) plt.title("original") plt.axis("off") plt.subplot(1,2,2) plt.imshow(result,cmap=plt.cm.gray) plt.title("result") plt.axis("off") plt.show 傅里叶变换的频谱图像: 2.opencv实现逆傅里叶变换返回结果 = cv2.idft(原始数据)1> 返回结果:取决于原始数据的类型和大小2> 原始数据:实数或者复数均可numpy.fft.ifftshiftfftshift函数的逆函数,将中心位置的低频,重新移到左上角调整频谱的范围,映射到[0,255]区间返回值 = cv2.magnitude(参数1,参数2)1> 参数1:浮点型X坐标值,也就是实部2> 参数2:浮点型Y坐标值,也就是虚部 # 将绘制的图显示在窗口 %matplotlib qt5 import cv2 import numpy as np import matplotlib.pyplot as plt img = cv2.imread(r"image\lena.bmp",cv2.IMREAD_GRAYSCALE) # 傅里叶变换 dft = cv2.dft(np.float32(img),flags = cv2.DFT_COMPLEX_OUTPUT) dftShift = np.fft.fftshift(dft) # 逆傅里叶变换 ishift = np.fft.ifftshift(dftShift) iimg = cv2.idft(ishift) iimg = cv2.magnitude(iimg[:,:,0],iimg[:,:,1]) plt.subplot(1,2,1) plt.imshow(img,cmap=plt.cm.gray) plt.title("original") plt.axis("off") plt.subplot(1,2,2) plt.imshow(iimg,cmap=plt.cm.gray) plt.title("inverse") plt.axis("off") plt.show 将一副图像,进行傅里叶变换和逆傅里叶变换后,进行对比(一样的) 实例:通过opencv实现低通滤波,模糊一副图像
-
移位操作函数的应用:circshift、fftshift和ifftshift在matlab中的用法
-
C语言第七章:深入理解C语言函数中的复杂递归——C语言中间递归函数
-
理解JSP中的9大内置对象与6个基础操作
-
理解MySQL中的ASCII函数:获取字符的ASCII码值
-
在.NET中利用ufun函数获取CAM操作的进给速度
-
理解并探索newlib中的printf库函数实现
-
深入理解Linux高级I/O操作中的select方法
-
理解JavaScript中的求模和取余操作:为什么摸的值会和x的符号相同?
-
深入理解C语言网络编程中的send函数