groupby后乱序问题原因和解决⽬录
group by后乱序问题原因和解决
1.解决⽅案
先说⼀下解决的办法
1.1 放到集合内排序
SELECT uid
,regexp_replace(
concat_ws(
','
,sort_array(
collect_list(
concat_ws(':',lpad(CAST(rn AS STRING),3,'0'),page_id)
)
)
)
,'\\d+\:'
,''
)
FROM tmp_t
WHERE uid IS NOT NULLgroupby分组
GROUP BY uid
;
1.2 在分组内排序后在分组求和
SELECT uid
,collect_list(page_id) AS page_id
FROM (
SELECT *
,ROW_NUMBER() OVER (PARTITION BY uid ORDER BY rn) AS rank
FROM tmp_t
WHERE uid IS NOT NULL
)
GROUP BY uid
;
2.问题经历和原因
求⽤户的访问轨迹以特殊分隔符隔开(已经有数仓表记录⽤户访问顺序)
SELECT uid
,collect_list(page_id) AS visit_track
FROM tmp_t
;
我以为这时已经排好序的表,是直接按照接过排序,可是我发现这个顺序根本对不上
⼀开始我想过是因为 collect_list 的原因,但是'解决⽅案⾥在分组内排序后在分组求和'就是正常的
⽽且由此可以知道原因还是 group by 的问题,它导致了我已经排序好插⼊表⾥的数据分组后取值时顺序变了数据量⼤我是可以理解的,要分组打散,但是数据量⼩的时候在spark跑时有序的,在odps上就⽆序了3.⽤到的表和数据
## 模仿⼀下数仓⽤户访问顺序表
CREATE TABLE IF NOT EXISTS tmp_t
(
uid STRING,
page_id STRING,
rn BIGINT
) ;
insert overwrite table tmp_t values
('1422','4888','1'),
('1422','4883','2'),
('1422','4881','3'),
('1057','3305','1'),
('1057','3342','2'),
('1057','3261','3'),
('1057','3262','4'),
('1057','3316','5'),
('1057','3467','6'),
('1057','4882','7'),
('1057','3265','8'),
('1057','3422','9'),
('1061','3461','1'),
('1061','3100','2'),
('1085','1952','1'),
('1097','3177','1'),
('1097','3246','2'),
('1097','3080','3'),
('1097','3321','4'),
('1097','3370','5'),
('1097','4634','6'),
('1097','3471','7'),
('1097','3079','8'),
('1097','3076','9'),
('1151','3048','1'),
('1151','3232','2'),
('1151','3457','3'),
('1151','3307','4'),
('1153','4939','1'),
('1153','4934','2'),
('1153','4938','3'),
('1153','4543','4'),
('1153','4942','5'),
('1153','4942','5'), ('1153','4460','6'), ('1153','5107','7'), ('1083','3439','1'), ('1093','4927','1'), ('1093','3278','2'), ('1059','3365','1'), ('1059','3371','2'), ('1059','3182','3'), ('1059','3092','4'), ('1059','3089','5'), ('1059','3377','6'), ('1059','3348','7'), ('1059','4323','8'), ('1059','3395','9'), ('1079','4276','1'), ('1067','4261','1'), ('1117','4901','1'), ('1117','5120','2'), ('1117','5109','3'), ('1117','5108','4'), ('1117','4840','5'), ('1117','4923','6'), ('1117','4703','7'), ('1117','4922','8'), ('1117','4920','9'), ('1065','4243','1'), ('1065','4631','2'), ('1155','3052','1'), ('1155','3053','2'), ('1155','341
3','3'), ('1155','3055','4'), ('1155','3056','5'), ('1155','3176','6'), ('1155','3054','7'), ('1155','3406','8'), ('1155','4281','9'), ('1147','3456','1'), ('1149','3299','1'), ('1149','3470','2'), ('1149','3315','3'), ('1149','3292','4'), ('1149','3268','5'), ('1149','3057','6'), ('1149','3060','7'), ('1149','3181','8'), ('1149','3405','9'), ('1145','3303','1'), ('1099','3131','1');
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系QQ:729038198,我们将在24小时内删除。
发表评论