ClickHouse⽀持的join类型说明
ClickHouse ⽀持的join类型说明
按照代码Join.h的说明,ClickHouse⽀持14种Join,如下所⽰:
* JOIN-s could be of these types:
* - ALL × LEFT/INNER/RIGHT/FULL
* - ANY × LEFT/INNER/RIGHT
* - SEMI/ANTI x LEFT/RIGHT
* - ASOF x LEFT/INNER
* - CROSS
All和Any的区别如官⽹⽂档所⽰:
ANY与ALL
在使⽤ALL修饰符对JOIN进⾏修饰时,如果右表中存在多个与左表关联的数据,那么系统则将右表中所有可以与左表关联的数据全部返回在结果中。这与SQL标准的JOIN⾏为相同。
在使⽤ANY修饰符对JOIN进⾏修饰时,如果右表中存在多个与左表关联的数据,那么系统仅返回第⼀个与左表匹配的结果。如果左表与右表⼀⼀对应,不存在多余的⾏时,ANY与ALL的结果相同。
以INNER JOIN为例说明ANY和ALL的区别,先准备数据:
1、创建join_test库
create database join_test engine=Ordinary;
2、创建left_t1和right_t1表
create table left_t1(a UInt16,b UInt16,create_date date)Engine=MergeTree(create_date,a,8192);
create table right_t1(a UInt16,b UInt16,create_date date)Engine=MergeTree(create_date,a,8192);
3、插⼊数据
insert into left_t1 values(1,11,2020-3-20);
insert into left_t1 values(2,22,2020-3-20);
insert into left_t1 values(3,22,2020-3-20);
insert into right_t1 values(1,111,2020-3-20);
insert into right_t1 values(2,222,2020-3-20);
insert into right_t1 values(2,2222,2020-3-20);
4、查看分别增加ANY和ALL对INNER JOIN输出结果的影响
ALL INNER JOIN
select * from left_t1 all inner join right_t1 on left_t1.a=right_t1.a;
SELECT *
FROM left_t1
ALL INNER JOIN right_t1 ON left_t1.a = right_t1.a
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─ate_date─┐
│ 2 │ 22 │  1975-06-21 │          2 │      2222 │          1975-06-21 │
│ 2 │ 22 │  1975-06-21 │          2 │        222 │          1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─ate_date─┐
│ 1 │ 11 │  1975-06-21 │          1 │        111 │          1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
3 rows in set. Elapsed: 0.019 sec.
右表right_t1存在两条与左表left_t1匹配的结果,两条全部返回。
ANY INNER JOIN
select * from left_t1 any inner join right_t1 on left_t1.a=right_t1.a;
SELECT *
FROM left_t1
ANY INNER JOIN right_t1 ON left_t1.a = right_t1.a
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─ate_date─┐
│ 2 │ 22 │  1975-06-21 │          2 │      2222 │          1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─ate_date─┐
│ 1 │ 11 │  1975-06-21 │          1 │        111 │          1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
2 rows in set. Elapsed: 0.02
3 sec.
右表right_t1存在两条与左表left_t1匹配的结果,但是只返回⼀条。
INNER JOIN
内连接,将left_t1表和right_t1表所有满⾜left_t1.a=right_t1.a条件的记录进⾏连接,如下图所⽰:
select * from left_t1 all inner join right_t1 on left_t1.a=right_t1.a;
SELECT *
FROM left_t1
ALL INNER JOIN right_t1 ON left_t1.a = right_t1.a
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─ate_date─┐
│ 1 │ 11 │  1975-06-21 │          1 │        111 │          1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─ate_date─┐
│ 2 │ 22 │  1975-06-21 │          2 │      2222 │          1975-06-21 │
│ 2 │ 22 │  1975-06-21 │          2 │        222 │          1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
3 rows in set. Elapsed: 0.13
4 sec.
LEFT JOIN
左连接,在内连接的基础上,对于那些在right_t1表中不到匹配记录的left_t1表中的记录,⽤空值或0进⾏连接,如下图所⽰:select * from left_t1 all left join right_t1 on left_t1.a=right_t1.a;
SELECT *
FROM left_t1
ALL LEFT JOIN right_t1 ON left_t1.a = right_t1.a
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─ate_date─┐
│ 3 │ 22 │  1975-06-21 │          0 │          0 │          0000-00-00 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─ate_date─┐
│ 1 │ 11 │  1975-06-21 │          1 │        111 │          1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─ate_date─┐
│ 2 │ 22 │  1975-06-21 │          2 │      2222 │          1975-06-21 │
│ 2 │ 22 │  1975-06-21 │          2 │        222 │          1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
4 rows in set. Elapsed: 0.013 sec.
RIGHT JOIN
右连接,在内连接的基础上,对于那些在left_t1表中不到匹配记录的right_t1表中的记录,⽤空值或0进⾏连接,如下图所⽰:select * from left_t1 all right join right_t1 on left_t1.a=right_t1.a;
SELECT *
FROM left_t1
ALL RIGHT JOIN right_t1 ON left_t1.a = right_t1.a
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─ate_date─┐
│ 2 │ 22 │  1975-06-21 │          2 │      2222 │          1975-06-21 │
│ 2 │ 22 │  1975-06-21 │          2 │        222 │          1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─ate_date─┐
│ 1 │ 11 │  1975-06-21 │          1 │        111 │          1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
3 rows in set. Elapsed: 0.021 sec.
FULL JOIN
全连接,在内连接的基础上,对于那些在left_t1表中不到匹配记录的right_t1表中的记录和在right_t1表中不到匹配记录的left_t1表中的记录,都⽤空值或0进⾏连接,如下图所⽰:
select * from left_t1 all full join right_t1 on left_t1.a=right_t1.a;
SELECT *
FROM left_t1
ALL FULL OUTER JOIN right_t1 ON left_t1.a = right_t1.a
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─ate_date─┐
│ 2 │ 22 │  1975-06-21 │          2 │      2222 │          1975-06-21 │
│ 2 │ 22 │  1975-06-21 │          2 │        222 │          1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─ate_date─┐
│ 1 │ 11 │  1975-06-21 │          1 │        111 │          1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─ate_date─┐
│ 3 │ 22 │  1975-06-21 │          0 │          0 │          0000-00-00 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
4 rows in set. Elapsed: 0.046 sec.
SEMI LEFT JOIN  和  SEMI RIGHT JOIN    ANTI LEFT JOIN  和  ANTI RIGHT JOIN        Join.h中的解释如下:
* SEMI JOIN filter left table by keys that are present in right table for LEFT JOIN, and filter right table by keys from left table
* for RIGHT JOIN. In other words SEMI JOIN returns only rows which joining keys present in another
table.
* ANTI JOIN is the same as SEMI JOIN but returns rows with joining keys that are NOT present in another table.
* SEMI/ANTI JOINs allow to get values from both tables. For filter table it gets any row with joining same key. For ANTI JOIN it returns
* defaults other table columns.
意思是:使⽤SEMI LEFT JOIN时,使⽤右表中存在的key去过滤左表中的key,如果左表存在与右表相同的key,则输出。
使⽤SEMI RIGHT JOIN时,使⽤左表中存在的key去过滤右表中的key,如果右表中存在与左表相同的key,则输出。
换句话说,SEMI JOIN返回key在另外⼀个表中存在的记录⾏。
ANTI JOIN和SEMI JOIN相反,他返回的是key在另外⼀个表中不存在的记录⾏。
SEMI JOIN和ANTI JOIN都允许从两个表中获取数据。对于被过滤的表,返回的是与key相同的记录⾏。对于ANTI JOIN,另外⼀个表返回的是默认值,⽐如空值或0。
select * from left_t1 semi left join right_t1 on left_t1.a=right_t1.a;
SELECT *
FROM left_t1
SEMI LEFT JOIN right_t1 ON left_t1.a = right_t1.a
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─ate_date─┐
│ 2 │ 22 │  1975-06-21 │          2 │      2222 │          1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─ate_date─┐
│ 1 │ 11 │  1975-06-21 │          1 │        111 │          1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
2 rows in set. Elapsed: 0.052 sec.
select * from left_t1 semi right join right_t1 on left_t1.a=right_t1.a;
SELECT *
FROM left_t1
SEMI RIGHT JOIN right_t1 ON left_t1.a = right_t1.a
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─ate_date─┐
│ 1 │ 11 │  1975-06-21 │          1 │        111 │          1975-06-21 │
│ 2 │ 22 │  1975-06-21 │          2 │        222 │          1975-06-21 │
│ 2 │ 22 │  1975-06-21 │          2 │      2222 │          1975-06-21 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
3 rows in set. Elapsed: 1.327 sec.
select * from left_t1 anti left join right_t1 on left_t1.a=right_t1.a;
SELECT *
FROM left_t1
ANTI LEFT JOIN right_t1 ON left_t1.a = right_t1.a
┌─a─┬──b─┬─create_date─┬─right_t1.a─┬─right_t1.b─┬─ate_date─┐
│ 3 │ 22 │  1975-06-21 │          3 │          0 │          0000-00-00 │
└───┴────┴─────────────┴────────────┴────────────┴──────────────────────┘
1 rows in set. Elapsed: 0.061 sec.
select * from left_t1 anti right join right_t1 on left_t1.a=right_t1.a;
SELECT *
FROM left_t1
ANTI RIGHT JOIN right_t1 ON left_t1.a = right_t1.a
Ok.
0 rows in set. Elapsed: 0.024 sec.
ASOF LEFT  和  ASOF INNER 没有具体的语法,本来想通过查看执⾏计划来看看,但是采⽤下述⽅式后,没看到选择什么⽅式,暂时不知道怎么能⾛到这两个类型的处理⽅式上来。
clickhouse-client --send_logs_level=trace <<< 'select * from join_test.left_t1,join_test.right_t1 where join_test.left_t1.a<>1 and
join_test.right_t1.a<>1' > /dev/null
下⾯为Join.h中的说明:
* ASOF JOIN is not-equi join. For one key column it finds nearest value to join according to join inequality.
* It's expected that ANY|SEMI LEFT JOIN is more efficient that ALL one.
*
* If INNER is specified - leave only rows that have matching rows from "right" table.
* If LEFT is specified - in case when there is no matching row in "right" table, fill it with default values instead.
* If RIGHT is specified - first process as INNER, but track what rows from the right table was joined,
*  and at the end, add rows from right table that was not joined and substitute default values for columns of left table.
* If FULL is specified - first process as LEFT, but track what rows from the right table was joined,
*  and at the end, add rows from right table that was not joined and substitute default values for columns of left table.
*
* Thus, LEFT and RIGHT JOINs are not symmetric in terms of implementation.
*
connect和join的区别* All JOINs (except CROSS) are done by equality condition on keys (equijoin).
* Non-equality and other conditions are not supported.
仅⽀持等值条件的Join,不⽀持⾮等值和其他条件的Join。
*
* Implementation:实现机制如下:
*
* 1. Build hash table in memory from "right" table.
* This hash table is in form of keys -> row in case of ANY or keys -> [] in case of ALL.
* This is done in insertFromBlock method.
*⼀般将⼩表作为右表,根据右表在内存中构建hash表。这部分实现在insertFromBlock中完成。
* 2. Process "left" table and join corresponding rows from "right" table by lookups in the map.
* This is done in joinBlock methods.
*遍历左表,根据右表在内存中的map来连接对应⾏,这部分实现在joinBlock中完成。
* In case of ANY LEFT JOIN - form new columns with found values or default values.
* This is the most simple. Number of rows in left table does not change.
*ANY LEFT JOIN左表⾏数量不变,使⽤匹配的值或默认值填充新列。
* In case of ANY INNER JOIN - form new columns with found values,
*  and also build a filter - in what rows nothing was found.
* Then filter columns of "left" table.
*ANY INNER JOIN ⽤满⾜条件的值构建新列,⽤不满⾜条件的⾏构建filter,然后⽤filter过滤左表。
* In case of ALL ... JOIN - form new columns with all found rows,
*  and also fill 'offsets' array, describing how many times we need to replicate values of "left" table.
* Then replicate columns of "left" table.
*JOIN 将所有到的⾏合并为新列,并填充offsets数组,并描述需要把左表的值复制多少次,然后复制左表的列。
* How Nullable keys are processed:
*如何处理NULL值:
* NULLs never join to anything, even to each other.
NULL永远不会和任何值做JOIN,即使是NULL之间。
* During building of map, we just skip keys with NULL value of any component.
构建Hash表的过程中,跳过任何NULL值。
* During joining, we simply treat rows with any NULLs in key as non joined.
*Join期间,将NULL值⾏视为未JOIN
* Default values for outer joins (LEFT, RIGHT, FULL):
*外部连接的默认值
* Behaviour is controlled by 'join_use_nulls' settings.
⾏为由join_use_nulls参数控制。
* If it is false, we substitute (global) default value for the data type, for non-joined rows
*  (zero, empty string, etc. and NULL for Nullable data types).
* If it is true, we always generate Nullable column and substitute NULLs for non-joined rows,
*  as in standard SQL.
分两种情况:当join_use_nulls参数为false时,⽤默认值替代未连接的⾏;当join_use_nulls为true时,⽤NULL替代未连接的⾏。ANTI RIGHT JOIN

版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系QQ:729038198,我们将在24小时内删除。