19. SQL optimization: window functions, row_number() over, sum() over(), and a window function reference
Part 1. Common SQL optimization patterns: ranking and running totals
1. Top 3 per group:
1. Core SQL:
select * from
(select id, name, clazz, score,
row_number() over(partition by clazz order by score desc) rank
from person) res
where res.rank <= 3
2. Result:
+---+----+------+-----+----+
| id|name| clazz|score|rank|
+---+----+------+-----+----+
|  5|   e| spark|   60|   1|
|  9|   i| spark|   50|   2|
|  2|   b| spark|   30|   3|
| 11|   k|hadoop|   70|   1|
| 10|   j|hadoop|   60|   2|
|  4|   d|hadoop|   50|   3|
| 12|   l|  hive|   90|   1|
| 13|   m|  hive|   80|   2|
|  6|   f|  hive|   70|   3|
+---+----+------+-----+----+
3. Full code:
package com.lifecycle.demo01
import org.apache.spark.sql.{DataFrame, SparkSession}
object Demo03 {
  def main(args: Array[String]): Unit = {
    // 1. Spark SQL context
    val spark = SparkSession.builder()
      .master("local[2]")
      // the config keys on these lines were garbled in the original; typical settings
      // would be e.g. .config("spark.executor.memory", "2g") and .config("spark.driver.memory", "2g")
      .appName("SparkDemoFromS3")
      .getOrCreate()
    // 2. Set the log level
    spark.sparkContext.setLogLevel("ERROR")
    // 3. Import implicit conversions
    import spark.implicits._
    // Read the CSV file and name the columns
    val df01: DataFrame = spark.read.option("delimiter", ",").csv("person.csv")
    val df02: DataFrame = df01.toDF("id", "name", "clazz", "score")
    // Register a temporary view so the SQL below can refer to the "person" table
    df02.createOrReplaceTempView("person")
    spark.sql("select * from " +
      "(select id,name,clazz,score,row_number() over(partition by clazz order by score desc) rank from person) res " +
      "where res.rank <=3").show(100)
    spark.close()
  }
}
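The same top-three-per-group query can also be written with the DataFrame API instead of a SQL string. A minimal sketch, assuming the df02 DataFrame built above is in scope; Window and row_number are the standard Spark SQL APIs:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// One window per class, highest score first
val byClazz = Window.partitionBy("clazz").orderBy(col("score").desc)

// Number the rows inside each window, then keep the top three of each class
val top3 = df02
  .withColumn("rank", row_number().over(byClazz))
  .filter(col("rank") <= 3)
top3.show(100)

Both the SQL string and the DataFrame version go through the same optimizer, so the choice is mostly a matter of readability.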
2. Group aggregation:
1. SQL:
select clazz, sum(score) sumscore from person group by clazz
2. Result:
+------+--------+
| clazz|sumscore|
+------+--------+
| spark|   170.0|
|hadoop|   220.0|
|  hive|   280.0|
+------+--------+
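For a plain group-by aggregation like this one, the DataFrame API equivalent is a one-liner. A sketch, again assuming the df02 DataFrame from the previous example:

import org.apache.spark.sql.functions.sum

// Same result as the SQL above: one row per class with the total score
df02.groupBy("clazz")
  .agg(sum("score").as("sumscore"))
  .show()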
3. Running total within groups (sum over):
1. Source data:
1438,2016-05-13,165
1438,2016-05-14,595
1438,2016-05-15,105
1629,2016-05-13,12340
1629,2016-05-14,13850
1629,2016-05-15,227
2. SQL:
select pcode, event_date, duration,
sum(duration) over(partition by pcode order by event_date asc) as sum_duration
from userlogs_date
3. Result:
+-----+----------+--------+------------+
|pcode|event_date|duration|sum_duration|
+-----+----------+--------+------------+
| 1438|2016-05-13|     165|       165.0|
| 1438|2016-05-14|     595|       760.0|
| 1438|2016-05-15|     105|       865.0|
| 1629|2016-05-13|   12340|     12340.0|
| 1629|2016-05-14|   13850|     26190.0|
| 1629|2016-05-15|     227|     26417.0|
+-----+----------+--------+------------+
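Because the window is partitioned by pcode, the running total restarts for every pcode. A DataFrame API sketch of the same query; the name userlogs is hypothetical and stands for a DataFrame with the three columns shown above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// One window per pcode, rows ordered by date; the sum accumulates within each pcode only
val byPcode = Window.partitionBy("pcode").orderBy("event_date")

// userlogs is a hypothetical DataFrame with columns pcode, event_date, duration
userlogs
  .withColumn("sum_duration", sum("duration").over(byPcode))
  .show(100)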
4. Running total without grouping (an order by is still required):
1. SQL:
select pcode, event_date, duration,
sum(duration) over(order by pcode, event_date asc) as sum_duration
from userlogs_date
2. Result:
+-----+----------+--------+------------+
|pcode|event_date|duration|sum_duration|
+-----+----------+--------+------------+
| 1438|2016-05-13|     165|       165.0|
| 1438|2016-05-14|     595|       760.0|
| 1438|2016-05-15|     105|       865.0|
| 1629|2016-05-13|   12340|     13205.0|
| 1629|2016-05-14|   13850|     27055.0|
| 1629|2016-05-15|     227|     27282.0|
+-----+----------+--------+------------+
3. Full code:
package com.lifecycle.demo01
import org.apache.spark.sql.{DataFrame, SparkSession}
object Demo05 {
  def main(args: Array[String]): Unit = {
    // 1. Spark SQL context
    val spark = SparkSession.builder()
      .master("local[2]")
      // the config keys on these lines were garbled in the original; typical settings
      // would be e.g. .config("spark.executor.memory", "2g") and .config("spark.driver.memory", "2g")
      .appName("SparkDemoFromS3")
      .getOrCreate()
    // 2. Set the log level
    spark.sparkContext.setLogLevel("ERROR")
    // 3. Read the CSV file and name the columns (the source path was garbled in the original)
    val df01: DataFrame = spark.read.option("delimiter", ",").csv("ppt")
    val df02: DataFrame = df01.toDF("pcode", "event_date", "duration")
    // Register a temporary view so the SQL below can refer to the "userlogs_date" table
    df02.createOrReplaceTempView("userlogs_date")
    spark.sql("select pcode,event_date,duration," +
      "sum(duration) over (order by pcode,event_date asc) as sum_duration " +
      "from userlogs_date").show(100)
    spark.close()
  }
}
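When over() contains an order by but no explicit frame clause, Spark uses the default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which is what turns sum into a running total here. The frame can also be written out explicitly. A sketch with the DataFrame API, assuming the df02 DataFrame from the code above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// The same running total, with the frame spelled out as a ROWS frame
val runningFrame = Window
  .orderBy("pcode", "event_date")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

df02
  .withColumn("sum_duration", sum("duration").over(runningFrame))
  .show(100)

With a ROWS frame, ties in the ordering accumulate one row at a time, while the default RANGE frame adds all tied rows at once; since every (pcode, event_date) pair here is unique, both produce the same output.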
Part 2. Window function reference
1. Categories (a combined example follows this list):
1. ranking functions
2. analytic functions
3. aggregate functions
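A sketch showing one function from each category, run against the userlogs_date view registered in the code of section 4 (the query itself is only illustrative, assuming that SparkSession and view are in scope):

spark.sql(
  """select pcode, event_date, duration,
    |  rank() over (partition by pcode order by duration desc) as duration_rank,        -- ranking
    |  lag(duration, 1) over (partition by pcode order by event_date) as prev_duration, -- analytic
    |  avg(duration) over (partition by pcode) as avg_duration                          -- aggregate
    |from userlogs_date""".stripMargin).show(100)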
2. rank: ranking without grouping
1. Source data:
1438,2016-05-13,165
1438,2016-05-14,595
1438,2016-05-15,105
1629,2016-05-13,12340
1629,2016-05-14,13850
1629,2016-05-15,227
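A minimal sketch of a rank query over this data, assuming the same userlogs_date view as in section 4; unlike row_number, rank gives tied rows the same number and then leaves a gap after them:

// Rank all rows globally (no partition by) from the longest duration down
spark.sql("select pcode,event_date,duration," +
  "rank() over(order by duration desc) as rnk " +
  "from userlogs_date").show(100)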