HIVE中使⽤fulljoin联合查询使⽤coalesce()函数性能慢的解
决⽅案
众所周知hive多表联查在实际⽣产中有很⼴泛的应⽤,使⽤较为快捷的当然是right join,left join;要避免的查询那就是inner join,当表很⼤的时候性能差异特别明显,但是还有⼀种⽤法就是full join,其实full join本⾝性能并不差,与right join,left join性能旗⿎相当,但仅仅使⽤full join出来的效果可能并不是⾃⼰想要的,效果如下:
(举例四个表table1,table2,table3,table4)(为保护隐私数据做简化处理)
SELECT *
FROM
(SELECT name,
datas
FROM table1
WHERE dt="20190110")t1
FULL JOIN
(SELECT name,
datas
FROM table2
WHERE dt="20190110")t2 ON t1.name=t2.name
FULL JOIN
(SELECT name,
datas
FROM table3
WHERE dt="20190110")t3 ON t1.name=t3.name
FULL JOIN
多表left join(SELECT name,
datas
FROM table4
WHERE dt="20190110")t4 ON t1.name=t4.name
结果呢是下⾯这种
Time taken: 23.468 seconds
当然你想要的就是公司名字合并了,于是想到了coalesce()函数
SELECT coalesce(t1.name,t2.name,t3.name,t4.name) AS name,
t1.datas,
t2.datas,
t3.datas,
t4.datas
FROM
(SELECT name,
datas
FROM table1
WHERE dt="20190110")t1
FULL JOIN
(SELECT name,
datas
FROM table2
WHERE dt="20190110")t2 ON t1.name=t2.name
FULL JOIN
(SELECT name,
datas
FROM table3
WHERE dt="20190110")t3 ON coalesce(t1.name,t2.name)=t3.name
FULL JOIN
(SELECT name,
datas
FROM table4
WHERE dt="20190110")t4 ON coalesce(t1.name,t2.name,t3.name)=t4.name
效果如下
Time taken: 76.187 seconds
效果实现了,可性能真的差到了极点。⾜⾜多了三倍时间,原因很简单,正是因为每次join都会使⽤ON coalesce()进⾏校验,判断前⾯是否有空值(不这样写,只要前⾯table1,table2,table3数据有空值结果就会多出来结果,导致错误)
最后划重点:综上所述,性能主要体现在在字段合并上,解决⽅案得从这⾥思考,⽅法当然是使⽤left join替代full join
答案是:原始表⽤原始表的distinct(name)去left join (table1,table2,table3,table4)的name
SELECT t0.name,
t1.datas,
t2.datas,
t3.datas,
t4.datas
FROM
(SELECT distinct(name)
FROM ooodata)t0
LEFT JOIN
(SELECT name,
datas
FROM table1
WHERE dt=dt="20190110")t1 ON t0.name=t1.name LEFT JOIN
(SELECT name,
datas
FROM table2
WHERE dt=dt="20190110")t2 ON t0.name=t2.name LEFT JOIN
(SELECT name,
datas
FROM table3
WHERE dt=dt="20190110")t3 ON t0.name=t3.name LEFT JOIN
(SELECT name,
datas
FROM table4
WHERE dt=dt="20190110")t4 ON t0.name=t4.name 结果当然如下
Time taken: 25.468 seconds
ok,圆满完成,祝你⼯作愉快
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系QQ:729038198,我们将在24小时内删除。
发表评论