
Hive: querying the first record among rows that share a duplicated field


2025-03-12 22:52:42


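Scenario: rows in the table share the same id, toapp, topin, and toclienttype but differ in receivetime. We need to return the row with the smallest receivetime in each group and discard the rest.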

select *
from (
    select
        *,
        row_number() over (partition by id order by receivetime asc) num
    from xxxxxxxxxxxxxxxxxxxx
    where dt = '2019-01-14'
      and toapp = 'xxxxxx'
      and toclienttype = 'xxxxxx'
      and msgrectype = '1'
      and systemtime is not null
) as t
where t.num = 1

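The key piece of code here is row_number() OVER (PARTITION BY COL1 ORDER BY COL2).

It partitions the rows by COL1 and sorts each partition by COL2; row_number() then numbers the rows of each partition in that order. Taking only the row numbered 1 from each partition gives exactly what we need.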
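
The query above can be tried out locally with a small sketch. It assumes SQLite 3.25 or later (bundled with recent Python builds), which accepts the same row_number() window syntax; the table name msg and its rows are made up for illustration and are not from the original data.

```python
import sqlite3

# Hypothetical stand-in for the Hive table; SQLite is used only because it
# supports the same row_number() over (partition by ... order by ...) syntax.
conn = sqlite3.connect(":memory:")
conn.execute("create table msg (id text, toapp text, receivetime integer)")
conn.executemany(
    "insert into msg values (?, ?, ?)",
    [
        ("a", "app1", 300),
        ("a", "app1", 100),  # earliest row for id 'a'
        ("a", "app1", 200),
        ("b", "app1", 50),   # earliest row for id 'b'
        ("b", "app1", 70),
    ],
)

# Number the rows of each id by ascending receivetime, then keep number 1.
first_rows = conn.execute(
    """
    select id, toapp, receivetime
    from (
        select *,
               row_number() over (partition by id order by receivetime asc) as num
        from msg
    ) as t
    where t.num = 1
    order by id
    """
).fetchall()

print(first_rows)
```

Only one row per id survives, and it is the one with the smallest receivetime.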

Example:

empid  deptid  salary
-----  ------  --------
1      10      5500.00
2      10      4500.00
3      20      1900.00
4      20      4800.00
5      40      6500.00
6      40      14500.00
7      40      44500.00
8      50      6500.00
9      50      7500.00

row_number() OVER (PARTITION BY deptid ORDER BY salary)

SELECT *, Row_Number() OVER (partition by deptid ORDER BY salary desc) rank FROM employee

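Result (derived from the sample rows above; rank restarts at 1 in each department and follows descending salary):

empid  deptid  salary    rank
-----  ------  --------  ----
1      10      5500.00   1
2      10      4500.00   2
4      20      4800.00   1
3      20      1900.00   2
7      40      44500.00  1
6      40      14500.00  2
5      40      6500.00   3
9      50      7500.00   1
8      50      6500.00   2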

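Problem: a table turns out to contain pairs of identical rows.

Goal: keep only one row of each.

Method:

How it works: compare count with count(distinct) to confirm the duplicates, then rebuild the table from the distinct rows: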

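select count(*) from table1;
-- result: 200 rows

select count(distinct col1, col2) from table1;
-- result: 100 rows, i.e. the second query counts the rows that remain after deduplication

create table table1_bak as select distinct col1, col2 from table1; -- back up the deduplicated rows

delete from table1; -- in Hive, DELETE works only on transactional (ACID) tables

insert into table1 select * from table1_bak;
-- on a non-transactional table, "insert overwrite table table1 select * from table1_bak" can replace the last two steps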
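
The three steps can be checked end to end with a small sketch. It runs against an in-memory SQLite database rather than Hive, and it assumes table1 consists of exactly the two columns being deduplicated; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table table1 (col1 text, col2 text)")
conn.executemany(
    "insert into table1 values (?, ?)",
    [("x", "1"), ("x", "1"), ("y", "2"), ("y", "2"), ("z", "3")],
)

total_before = conn.execute("select count(*) from table1").fetchone()[0]

# Step 1: copy the distinct rows into a backup table.
conn.execute("create table table1_bak as select distinct col1, col2 from table1")
# Step 2: empty the original table.
conn.execute("delete from table1")
# Step 3: reload it from the deduplicated backup.
conn.execute("insert into table1 select * from table1_bak")

rows_after = conn.execute(
    "select col1, col2 from table1 order by col1"
).fetchall()

print(total_before, rows_after)
```

If the table has more columns than the ones listed in the distinct clause, the backup would lose them, so the distinct list should cover every column that has to survive.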

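A pitfall when using DISTINCT in Hive

Problem description:

While using Hive, I filtered duplicate data with DISTINCT and got a result that contradicted my expectations, and I could not work out why.

Assume the test table holds 1,000,000 rows and we deduplicate it on a, b, c, d, e.

I. The SQL using DISTINCT:

SQL1: select count(distinct a, b, c, d, e) from test;

Result: more than 20,000.

Given what I knew about the data, my first impression was that it could not contain that many duplicates. I began to doubt my use of distinct, so I cross-checked the result with group by.

II. The SQL using GROUP BY: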

SQL2: select count(*) from (select a, b, c, d, e from test group by a, b, c, d, e) t;

Result: more than 800,000.

This result matches the characteristics of the data.

III. Modify SQL1 by dropping one column:

SQL3: select count(distinct b, c, d, e) from test;

Result: more than 900,000.

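IV. Comparing SQL1 and SQL3

In theory, a distinct count over four columns can never exceed the distinct count over five columns that include those four, yet the test showed exactly the opposite.

The cause is that column a contains NULL. I had assumed that all NULL values would be folded into a single value, but Hive does not do that: according to the Hive documentation, count(distinct ...) only counts rows in which every listed expression is non-NULL, so each row with a NULL in column a is left out of the count.

So if any column inside distinct contains NULL, the result will be off, and the NULL values need to be replaced with one uniform value first, for example by wrapping the column as coalesce(a, '') before counting.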
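
The effect can be reproduced without a cluster. The sketch below emulates the counting rules in plain Python, under the assumption stated in its comments that count(distinct ...) skips any row containing NULL while group by treats NULL as an ordinary key; the rows and helper names are hypothetical.

```python
# Emulates, in plain Python, how COUNT(DISTINCT ...) and GROUP BY treat NULL.
# Assumption: COUNT(DISTINCT c1, c2, ...) only counts rows in which every
# listed column is non-NULL, as Hive's documentation describes.

rows = [
    (None, "b1", "c1"),
    (None, "b2", "c2"),
    (None, "b3", "c3"),
    ("a1", "b4", "c4"),
    ("a1", "b4", "c4"),  # exact duplicate of the previous row
]

def count_distinct(rows):
    """COUNT(DISTINCT ...): skip any row that contains a NULL (None)."""
    return len({r for r in rows if None not in r})

def count_groups(rows):
    """Number of GROUP BY groups: NULL is an ordinary grouping key."""
    return len(set(rows))

def coalesce_row(row, default=""):
    """Replace NULLs with one shared placeholder, like coalesce(col, '')."""
    return tuple(default if v is None else v for v in row)

all_cols = count_distinct(rows)                    # distinct over a, b, c
without_a = count_distinct([r[1:] for r in rows])  # distinct over b, c only
grouped = count_groups(rows)                       # group by a, b, c
fixed = count_distinct([coalesce_row(r) for r in rows])

print(all_cols, without_a, grouped, fixed)
```

Including the NULL-bearing column shrinks the distinct count below the count without it, and replacing NULL with a placeholder brings it back in line with the group by figure. The placeholder should be a value that cannot occur in the real data, otherwise genuine values would be merged with former NULLs.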

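ROW_NUMBER() OVER syntax and examples in Hive SQL

Basic usage of the ROW_NUMBER() OVER function

Syntax: ROW_NUMBER() OVER (PARTITION BY COLUMN ORDER BY COLUMN)

Details:

row_number() OVER (PARTITION BY COL1 ORDER BY COL2) groups the rows by COL1 and sorts each group by COL2; the value it returns is each row's sequence number within its group after sorting (consecutive and unique within the group).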
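
To make the "consecutive and unique" property concrete, here is a rough plain-Python emulation of the numbering rule. The helper row_number and the sample rows are hypothetical; the point is that tied rows still receive different numbers. In Hive, which of two tied rows gets the lower number is generally not guaranteed, whereas this sketch simply keeps input order.

```python
from itertools import groupby
from operator import itemgetter

def row_number(rows, partition_key, order_key, descending=False):
    """Emulate ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...).

    Returns (row, number) pairs; numbering restarts at 1 in each partition.
    """
    # Sort by the ordering column first, then stably by the partition column,
    # so that each partition ends up contiguous and internally ordered.
    ordered = sorted(rows, key=itemgetter(order_key), reverse=descending)
    ordered.sort(key=itemgetter(partition_key))
    numbered = []
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        for n, row in enumerate(group, start=1):
            numbered.append((row, n))
    return numbered

# Hypothetical rows: (empid, deptid, salary); dept 10 contains a salary tie.
employees = [
    (1, 10, 5000),
    (2, 10, 5000),
    (3, 10, 4000),
    (4, 20, 3000),
]

numbered = row_number(employees, partition_key=1, order_key=2, descending=True)
for row, n in numbered:
    print(row, n)
```

The two employees with salary 5000 receive numbers 1 and 2 rather than sharing a number, which is why filtering on number 1 always yields exactly one row per group.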

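Scenario:

The employee table in Hive has three columns: empid, deptid, and salary. Group the rows by department and show each employee's salary rank within the department.

1. Inspect the source table: the employee table holds the nine rows listed in the example above.

2. Run the SQL.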

SELECT *, Row_Number() OVER (partition by deptid ORDER BY salary desc) rank FROM employee

3. View the result: within each department, rows are numbered from 1 in descending order of salary.

My own usage:

(SELECT * FROM tm_data_room_${this_ds}_${this_ts}_per_hour_tmp1) as a
left join
(
    SELECT DISTINCT(k.setid), k.ds, k.ts, k.game_ate_clubid as clubid,
           k.leagueid, k.is_satellite, k.is_sure_chips, k.gameset_end_time
    from
    (
        select * from
        (
            select *, Row_Number() OVER (partition by setid ORDER BY gameset_end_time desc) rank
            from gameset_info_log_flow
            where ds = ${this_ds} and ts = ${this_ts} and room_mode = 3
              and gameset_status = 100 and gameset_start_time != 0 and gameset_end_time != 0
        ) as t
        where t.rank = 1
    ) as k
    --gameset_info_log_flow
    --where ds = ${this_ds} and ts = ${this_ts} and room_mode = 3 and gameset_status=100 and gameset_start_time != 0 and gameset_end_time !=0
) as b
on a.setid = b.setid
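
The shape of this query (keep the latest log row per setid, then left join onto it) can be sketched with SQLite as a stand-in for Hive. The tables room and gameset_log, their columns, and the rows are simplified, hypothetical versions of the ones above; SQLite 3.25 or later is assumed for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, heavily simplified stand-ins for the two tables in the query.
conn.execute("create table room (setid integer, room_name text)")
conn.execute(
    "create table gameset_log (setid integer, clubid integer, gameset_end_time integer)"
)
conn.executemany("insert into room values (?, ?)", [(1, "r1"), (2, "r2"), (3, "r3")])
conn.executemany(
    "insert into gameset_log values (?, ?, ?)",
    [(1, 100, 10), (1, 101, 20), (2, 200, 5)],  # setid 1 is logged twice
)

joined = conn.execute(
    """
    select a.setid, a.room_name, b.clubid, b.gameset_end_time
    from room as a
    left join (
        select setid, clubid, gameset_end_time
        from (
            select *,
                   row_number() over (partition by setid order by gameset_end_time desc) as rn
            from gameset_log
        ) as t
        where t.rn = 1
    ) as b
    on a.setid = b.setid
    order by a.setid
    """
).fetchall()

print(joined)
```

Because the right side holds at most one row per setid, the left join cannot multiply rows from the left table; setid values with no log entry simply come back with NULL columns.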
