常用的PostgreSQL窗口函数的案例学习

窗口函数简介

窗口函数主要是用来对表的数值字段做统计分析的。既然是统计分析就必然涉及到多条记录。
窗口函数一般配合avgmax等聚合函数一起使用,通过跨越多行记录的统计数据来影响当前行。
它们跟普通聚合函数最大的不同在于它们不改变结果集记录行数。
窗口函数的调用语法类似:

1
function_name(expression) over(window_name)

function_name可以为max,min,avg,sum,count等普通聚合函数,也可以为rank,dense_rank等排名函数。
over子句确定了窗口划分方法,也就是我们通常意义上说的分组,类似group by

窗口函数注意事项

执行顺序

窗口函数在GROUP BYHAVINGWHERE子句以及聚合函数之后、排序子句之前执行。因此只允许出现在查询的SELECTORDER BY子句中。

窗口函数案例学习

下面用一些实际的使用场景来说明窗口函数的作用。

数据准备

先准备一些测试数据:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
-- postgresql创建自增序列和表结构
CREATE SEQUENCE employee_empno_seq START 1;
CREATE TABLE employee_tbl (
empno int4 NOT NULL DEFAULT nextval('employee_empno_seq'::regclass),
empname varchar(100) COLLATE "pg_catalog"."default",
deptname varchar(100) COLLATE "pg_catalog"."default",
salary int4,
PRIMARY KEY ("empno")
);
-- 准备测试数据
INSERT INTO employee_tbl(empno,deptname,empname,salary) VALUES
(1,'sales','s1',5000),
(2,'personnel','p1',3900),
(3,'sales','s2',4800),
(4,'sales','s3',5000),
(5,'personnel','p2',3500),
(6,'personnel','p3',4100),
(7,'develop','d1',4200),
(8,'develop','d2',6000),
(9,'develop','d3',4500),
(10,'develop','d4',5200),
(11,'develop','d5',5200);

准备好的表数据如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
 empno | empname | deptname  | salary
-------+---------+-----------+--------
1 | s1 | sales | 5000
2 | p1 | personnel | 3900
3 | s2 | sales | 4800
4 | s3 | sales | 5000
5 | p2 | personnel | 3500
6 | p3 | personnel | 4100
7 | d1 | develop | 4200
8 | d2 | develop | 6000
9 | d3 | develop | 4500
10 | d4 | develop | 5200
11 | d5 | develop | 5200
(11 rows)

统计员工工资和所在部门平均工资的差异百分比

使用窗口函数计算各部门平均工资:

1
2
3
SELECT empno,empname,deptname,salary,(avg(salary) over(partition by deptname)) as avg_salary
FROM employee_tbl
order by empno asc;

结果如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
 empno | empname | deptname  | salary |      avg_salary
-------+---------+-----------+--------+-----------------------
1 | s1 | sales | 5000 | 4933.3333333333333333
2 | p1 | personnel | 3900 | 3833.3333333333333333
3 | s2 | sales | 4800 | 4933.3333333333333333
4 | s3 | sales | 5000 | 4933.3333333333333333
5 | p2 | personnel | 3500 | 3833.3333333333333333
6 | p3 | personnel | 4100 | 3833.3333333333333333
7 | d1 | develop | 4200 | 5020.0000000000000000
8 | d2 | develop | 6000 | 5020.0000000000000000
9 | d3 | develop | 4500 | 5020.0000000000000000
10 | d4 | develop | 5200 | 5020.0000000000000000
11 | d5 | develop | 5200 | 5020.0000000000000000
(11 rows)

对上述sql美化一下,把平均工资取整,再加上员工工资跟部门平均工资的差异比例(这里暂不考虑性能调优啥的):

1
2
3
SELECT empno,empname,deptname,salary,(round(avg(salary) over(partition by deptname))) as avg_salary, round(((salary/(avg(salary) over(partition by deptname)))-1), 2) as diff_ratio 
FROM employee_tbl
order by empno asc;

最终得到结果如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
 empno | empname | deptname  | salary | avg_salary | diff_ratio
-------+---------+-----------+--------+------------+------------
1 | s1 | sales | 5000 | 4933 | 0.01
2 | p1 | personnel | 3900 | 3833 | 0.02
3 | s2 | sales | 4800 | 4933 | -0.03
4 | s3 | sales | 5000 | 4933 | 0.01
5 | p2 | personnel | 3500 | 3833 | -0.09
6 | p3 | personnel | 4100 | 3833 | 0.07
7 | d1 | develop | 4200 | 5020 | -0.16
8 | d2 | develop | 6000 | 5020 | 0.20
9 | d3 | develop | 4500 | 5020 | -0.10
10 | d4 | develop | 5200 | 5020 | 0.04
11 | d5 | develop | 5200 | 5020 | 0.04
(11 rows)

这样我们就能很直观的看到部门内每个员工工资相对平均工资的偏离情况了。
如果窗口函数的over子句括号内容为空,则表示把所有结果集作为一个分组来处理。例如,以下SQL可以在员工工资记录后追加一列“公司平均工资”:

1
2
3
SELECT empno,empname,deptname,salary,round(avg(salary) over()) as avg_salary
FROM employee_tbl
order by empno asc;

得到结果如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
 empno | empname | deptname  | salary | avg_salary
-------+---------+-----------+--------+------------
1 | s1 | sales | 5000 | 4673
2 | p1 | personnel | 3900 | 4673
3 | s2 | sales | 4800 | 4673
4 | s3 | sales | 5000 | 4673
5 | p2 | personnel | 3500 | 4673
6 | p3 | personnel | 4100 | 4673
7 | d1 | develop | 4200 | 4673
8 | d2 | develop | 6000 | 4673
9 | d3 | develop | 4500 | 4673
10 | d4 | develop | 5200 | 4673
11 | d5 | develop | 5200 | 4673
(11 rows)

按员工工资由高到低排名

先在全公司排名:

1
select empno, empname, deptname, salary, rank() over(order by salary desc) from employee_tbl;

得到结果如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
 empno | empname | deptname  | salary | rank
-------+---------+-----------+--------+------
8 | d2 | develop | 6000 | 1
11 | d5 | develop | 5200 | 2
10 | d4 | develop | 5200 | 2
1 | s1 | sales | 5000 | 4
4 | s3 | sales | 5000 | 4
3 | s2 | sales | 4800 | 6
9 | d3 | develop | 4500 | 7
7 | d1 | develop | 4200 | 8
6 | p3 | personnel | 4100 | 9
2 | p1 | personnel | 3900 | 10
5 | p2 | personnel | 3500 | 11
(11 rows)

接着在各自部门内排名:

1
select empno, empname, deptname, salary, rank() over(partition by deptname order by salary desc) from employee_tbl;

得到结果如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
 empno | empname | deptname  | salary | rank
-------+---------+-----------+--------+------
8 | d2 | develop | 6000 | 1
11 | d5 | develop | 5200 | 2
10 | d4 | develop | 5200 | 2
9 | d3 | develop | 4500 | 4
7 | d1 | develop | 4200 | 5
6 | p3 | personnel | 4100 | 1
2 | p1 | personnel | 3900 | 2
5 | p2 | personnel | 3500 | 3
1 | s1 | sales | 5000 | 1
4 | s3 | sales | 5000 | 1
3 | s2 | sales | 4800 | 3
(11 rows)

观察上述两个案例中的rank()得到的排名,如果有并列第N名,则排名序号会不连续。比如上面示例,并列第2名后直接跳到了第4名,没有第3名。如果想要名次编号连续,可以换成dense_rank()

1
select empno,empname,deptname,salary,dense_rank() over(order by salary desc) from employee_tbl;

得到结果如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
 empno | empname | deptname  | salary | dense_rank
-------+---------+-----------+--------+------------
8 | d2 | develop | 6000 | 1
11 | d5 | develop | 5200 | 2
10 | d4 | develop | 5200 | 2
1 | s1 | sales | 5000 | 3
4 | s3 | sales | 5000 | 3
3 | s2 | sales | 4800 | 4
9 | d3 | develop | 4500 | 5
7 | d1 | develop | 4200 | 6
6 | p3 | personnel | 4100 | 7
2 | p1 | personnel | 3900 | 8
5 | p2 | personnel | 3500 | 9
(11 rows)

可见,使用dense_rank()后,并列第2名之后就是并列第3名。

在员工工资记录后展示各部门工资总额列

直接使用sum()聚合函数对应的窗口函数:

1
select empno,empname,deptname,salary,(sum(salary) over(partition by deptname)) as dept_total_salary from employee_tbl order by empno asc;

结果如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
 empno | empname | deptname  | salary | dept_total_salary
-------+---------+-----------+--------+-------------------
1 | s1 | sales | 5000 | 14800
2 | p1 | personnel | 3900 | 11500
3 | s2 | sales | 4800 | 14800
4 | s3 | sales | 5000 | 14800
5 | p2 | personnel | 3500 | 11500
6 | p3 | personnel | 4100 | 11500
7 | d1 | develop | 4200 | 25100
8 | d2 | develop | 6000 | 25100
9 | d3 | develop | 4500 | 25100
10 | d4 | develop | 5200 | 25100
11 | d5 | develop | 5200 | 25100
(11 rows)

在员工工资记录后展示各部门截止当前员工记录行累加得到的工资总额列

直接使用sum()聚合函数对应的窗口函数, 注意over子句要使用order by

1
select empno,empname,deptname,salary,(sum(salary) over(partition by deptname order by empno asc)) as dept_total_salary from employee_tbl order by deptname asc, empno asc;

结果如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
 empno | empname | deptname  | salary | dept_total_salary
-------+---------+-----------+--------+-------------------
7 | d1 | develop | 4200 | 4200
8 | d2 | develop | 6000 | 10200
9 | d3 | develop | 4500 | 14700
10 | d4 | develop | 5200 | 19900
11 | d5 | develop | 5200 | 25100
2 | p1 | personnel | 3900 | 3900
5 | p2 | personnel | 3500 | 7400
6 | p3 | personnel | 4100 | 11500
1 | s1 | sales | 5000 | 5000
3 | s2 | sales | 4800 | 9800
4 | s3 | sales | 5000 | 14800
(11 rows)

这里需要特别注意一下,dept_total_salary是首先按部门分组,再对部门内员工记录按empno升序排序,然后累加计算组内(本例中即部门内)第一条记录到当前记录的工资得到的。

获取工资前3高的员工

需要先用窗口函数对员工按工资排序,然后排序号<4的几条记录。

1
2
3
select * from (
select empno,empname,deptname,salary,(rank() over(order by salary desc)) as pos from employee_tbl
) as a where a.pos < 4;

也可以根据需要取前3条记录:

1
2
3
select * from (
select empno,empname,deptname,salary,(rank() over(order by salary desc)) as pos from employee_tbl
) as a limit 3;

或:

1
select empno,empname,deptname,salary,(rank() over(order by salary desc)) as pos from employee_tbl limit 3;

在员工工资记录后展示所在部门截止当前员工记录的工资总额和平均工资

这里需要两个窗口函数,可以复用partition by xxx order by yyy部分:

1
2
select empno,empname,deptname,(sum(salary) over w) as dept_total_salary, round(avg(salary) over w) as avg_salary FROM employee_tbl WINDOW w as
(partition by deptname order by salary asc);

结果如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
 empno | empname | deptname  | dept_total_salary | avg_salary
-------+---------+-----------+-------------------+------------
7 | d1 | develop | 4200 | 4200
9 | d3 | develop | 8700 | 4350
10 | d4 | develop | 19100 | 4775
11 | d5 | develop | 19100 | 4775
8 | d2 | develop | 25100 | 5020
5 | p2 | personnel | 3500 | 3500
2 | p1 | personnel | 7400 | 3700
6 | p3 | personnel | 11500 | 3833
3 | s2 | sales | 4800 | 4800
1 | s1 | sales | 14800 | 4933
4 | s3 | sales | 14800 | 4933
(11 rows)

吐槽一下,感觉这个需求好没意义,纯粹为了炫技。
同样的,开头计算部门平均工资以及员工工资和部门平均工资偏差比例的sql也可以稍微简化一下:

1
2
3
4
SELECT empno,empname,deptname,salary,round(avg(salary) over w) as avg_salary, round(salary/(avg(salary) over w)-1, 2) as diff_ratio 
FROM employee_tbl
WINDOW w as (partition by deptname)
order by empno asc;

执行结果跟原始语句相同:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
 empno | empname | deptname  | salary | avg_salary | diff_ratio
-------+---------+-----------+--------+------------+------------
1 | s1 | sales | 5000 | 4933 | 0.01
2 | p1 | personnel | 3900 | 3833 | 0.02
3 | s2 | sales | 4800 | 4933 | -0.03
4 | s3 | sales | 5000 | 4933 | 0.01
5 | p2 | personnel | 3500 | 3833 | -0.09
6 | p3 | personnel | 4100 | 3833 | 0.07
7 | d1 | develop | 4200 | 5020 | -0.16
8 | d2 | develop | 6000 | 5020 | 0.20
9 | d3 | develop | 4500 | 5020 | -0.10
10 | d4 | develop | 5200 | 5020 | 0.04
11 | d5 | develop | 5200 | 5020 | 0.04
(11 rows)