【转】如何在ClickHouse中实现RANKOVER排序(开窗函数)
原⽂链接:
如何在ClickHouse中实现ROW_NUMBER OVER 和DENSE_RANK OVER等同效果的查询,它们在⼀些其他数据库中可⽤于RANK排序。
同样的,CH中并没有直接提供对应的开窗函数,需要利⽤⼀些特殊函数变相实现,主要会⽤到下⾯⼏个数组函数,它们分别是:
arrayEnumerate
arrayEnumerateDense
arrayEnumerateUniq
这些函数均接受⼀个数组作为输⼊参数,并返回数组中元素出现的位置,例如:
ch7.nauu :) SELECT arrayEnumerate([10,20,30,10,40]) AS row_number, arrayEnumerateDense([10,20,30,10,40]) AS dense_rank, arrayEnumerateUniq([10,20,30,10,40]) AS uniq_rank SELECT
arrayEnumerate([10, 20, 30, 10, 40]) AS row_number,
arrayEnumerateDense([10, 20, 30, 10, 40]) AS dense_rank,
arrayEnumerateUniq([10, 20, 30, 10, 40]) AS uniq_rank
┌─row_number──┬─dense_rank──┬─uniq_rank───┐
│[1,2,3,4,5]│[1,2,3,1,4]│[1,1,1,2,1]│
└─────────────┴─────────────┴─────────────┘
1 rows in set. Elapsed: 0.005 sec.
熟悉开窗函数的看官应该⼀眼就能明⽩
arrayEnumerate 的效果等同于 ROW_NUMBER
arrayEnumerateDense 的效果等同于 DENSE_RANK
⽽ arrayEnumerateUniq 相对特殊,它只返回元素第⼀次出现的位置
在知道了上述⼏个函数的作⽤之后,接下来我⽤⼀个具体⽰例,逐步演⽰如何实现最终需要的查询效果。
⾸先准备测试数据集,创建⼀张测试表
CREATE TABLE test_data engine = Memory AS
WITH(  SELECT['A','A','A','A','B','B','B','B','B','A','59','90','80','80','65','75','78','88','99','70'])AS dictSELECT dict[number%10+1]AS id, dict[number+11]AS val FROM system.numbers LIMIT 10
这是⼀张典型的分数表:
ch7.nauu :) SELECT*FROM test_data
SELECT*FROM test_data
┌─id─┬─val─┐
│ A  │59│
│ A  │90│
│ A  │80│
│ A  │80│
│ B  │65│
│ B  │75│
│ B  │78│
│ B  │88│
│ B  │99│
│ A  │70│
└────┴─────┘
10 rows in set. Elapsed: 0.002 sec.
我们的⽬标,是要实现如下语义的查询:
ROW_NUMBER() OVER( PARTITION BY id ORDER BY val )
DENSE_RANK() OVER( PARTITION BY id ORDER BY val )
UNIQ_RANK() OVER( PARTITION BY id ORDER BY val )
即按照 id 分组后,基于val 排序并得出RANK。
第⼀步,按 val 排序,因为条件是 ORDER BY val :
SELECT*FROM test_data ORDER BY val
(因为要返回所有字段,所以这⾥可以使⽤ * )
第⼆步,按 id 分组,因为条件是 PARTITION BY id :
SELECT id
FROM (
SELECT*FROM test_data    ORDER BY val ASC
)
GROUP BY id
┌─id─┐
│ B  │
│ A  │
└────┘
2 rows in set. Elapsed: 0.006 sec.
第三步,计算val的RANK,需要⽤到刚才介绍的⼏个arrayEnumerate*函数,由于它们的⼊参要求数组,所以先使⽤ groupArray将 val 转成数组: SELECT
id,
groupArray(val) AS arr_val,
arrayEnumerate(arr_val) AS row_number,
arrayEnumerateDense(arr_val) AS dense_rank,
arrayEnumerateUniq(arr_val) AS uniq_rank
FROM (
SELECT*FROM test_data    ORDER BY val ASC
)
GROUP BY id
┌─id─┬─arr_val────────────────────┬─row_number──┬─dense_rank──┬─uniq_rank───┐
│ B  │['65','75','78','88','99']│[1,2,3,4,5]│[1,2,3,4,5]│[1,1,1,1,1]│
│ A  │['59','70','80','80','90']│[1,2,3,4,5]│[1,2,3,3,4]│[1,1,1,2,1]│
└────┴────────────────────────────┴─────────────┴─────────────┴─────────────┘
可以看到,到这⼀步各种形式的RANK排序已经查出来了。第四步,数组展开,利⽤ARRAY JOIN将数组展开,并按照 id 、RANK列排序: SELECT
id,
val,
row_number,
dense_rank,
uniq_rank
FROM
(
SELECT
id,
groupArray(val) AS arr_val,
arrayEnumerate(arr_val) AS row_number,
arrayEnumerateDense(arr_val) AS dense_rank,
arrayEnumerateUniq(arr_val) AS uniq_rank
FROM
(
SELECT*FROM test_data        ORDER BY val ASC
)
GROUP BY id
)
ARRAY JOIN
arr_val AS val,
row_number,
dense_rank,
uniq_rank
ORDER BY
id ASC,
row_number ASC,
dense_rank ASC
┌─id─┬─val─┬─row_number─┬─dense_rank─┬─uniq_rank─┐
│ A  │59│1│1│1│
│ A  │70│2│2│1│
│ A  │80│3│3│1│
│ A  │80│4│3│2│
│ A  │90│5│4│1│
│ B  │65│1│1│1│
│ B  │75│2│2│1│
│ B  │78│3│3│1│
│ B  │88│4│4│1│
│ B  │99│5│5│1│
└────┴─────┴────────────┴────────────┴───────────┘
10 rows in set. Elapsed: 0.004 sec.
⾄此,整个查询就完成了,我们实现了如下三种语义的查询:
ROW_NUMBER() OVER( PARTITION BY id ORDER BY val )
DENSE_RANK() OVER( PARTITION BY id ORDER BY val )
UNIQ_RANK() OVER( PARTITION BY id ORDER BY val )
利⽤RANK排序,进⼀步还能回答哪些问题呢?
分组TOP N,例如按id分组后,查询排名前3的分数: SELECT
id,
val,
dense_rank
FROM (
SELECT
id,
val,
dense_rank
FROM
(
SELECT
id,
groupArray(val) AS arr_val,
arrayEnumerateDense(arr_val) AS dense_rank        FROM
(
SELECT
DISTINCT val,
id
FROM test_data
ORDER BY val DESC
)
GROUP BY id
)
ARRAY JOIN
arr_val AS val,
dense_rank
ORDER BY
id ASC,
dense_rank ASC
)WHERE dense_rank <=3
┌─id─┬─val─┬─dense_rank─┐
│ A  │90│1│
│ A  │80│2│
│ A  │70│3│
│ B  │99│1│
│ B  │88│2│
│ B  │78│3│
└────┴─────┴────────────┘
6 rows in set. Elapsed: 0.008 sec.
由于分数val存在重复数据,此处使⽤了DISTINCT去重指定id的分数排名,查询 id = A,val = 70的排名: SELECT
id,
val,
dense_rankFROM
(
SELECT
id,
val,
dense_rank
FROM
(
SELECT
id,
groupArray(val) AS arr_val,
arrayEnumerateDense(arr_val) AS dense_rank        FROM
(
SELECT
DISTINCT val,
id
FROM test_data
ORDER BY val DESCrank函数怎么排名
)
GROUP BY id
)
ARRAY JOIN
arr_val AS val,
dense_rank
ORDER BY
id ASC,
dense_rank ASC
)WHERE id ='A'AND val ='70'
┌─id─┬─val─┬─dense_rank─┐
│ A  │70│3│
└────┴─────┴────────────┘1 rows in set. Elapsed: 0.006 sec.

版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系QQ:729038198,我们将在24小时内删除。