SparkSqlMAPJOIN优化之⼩表leftjoin⼤表
⾸先我们要了解MAPJOIN优化原理,这⾥简要说明下
Spark Broadcast hash join(Hive map join同理)
1,把⼩表⼴播到所有⼤表分布的节点上,在每个节点上分别进⾏单机hash join
2,left join时只能⼴播右表
执⾏基本条件:⼩表必须⼩于参数spark.sql.autoBroadcastJoinThreshold, 默认为10M
sql 场景⼩表需要left join⼤表  ⼩表150M左右  ⼤表1T左右
原始sql(⼴播左表,未成功,因为left join只⽀持⼴播右表)
create table tmp.iapp_mkt_per_imei_pkg_repair_t2_not_default_20200812_1 as
select /*+MAPJOIN(a) */ a.imei,b.package,1 status,1 install_type,b.imei imei_b
from
(select imei from tmp.iapp_mkt_per_imei_only_repair_20200812 where install_type = 0 and tail != 'default') a
left join join
(select imei,package from edw.sys_app_list_fact  where data_date = 20200812 and tail != 'default' and imei is not null and status in (1,2))b
on a.imei = b.imei
优化思路:使⽤with将⼩表提前处理,多次复⽤,将left join转化为可以进⾏⼴播变量的join,然后再加进⾏⼩表与⼩表right join进⾏数据补偿,此⽅法只适合⼩表left join ⼤表
优化后sql
create table tmp.lanfz_222 as
with iapp_mkt as (select imei from tmp.iapp_mkt_per_imei_only_repair_20200812 where install_type = 0 and tail != 'default')
select /*+BROADCAST(a) */
c.imei,b.package,1 status,1 install_type,b.imei imei_b
from
iapp_mkt a
join
(select imei,package from edw.sys_app_list_fact  where data_date = 20200812 and tail != 'default' and imei is not null and status in (1,2))b
on a.imei = b.imei
right join
iapp_mkt c
on c.imei = b.imei;
结果数据量对⽐,完全⼀致
多表left join
sql耗时对⽐:
原始sql:Time taken: 1369.239 seconds
优化后sql:Time taken: 680.239 seconds

版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系QQ:729038198,我们将在24小时内删除。