SparkSQL statements
(1) in does not support subqueries, e.g. select * from src where key in(select key from test);
It does support a list of values, e.g. select * from src where key in(1,2,3,4,5);
an in list with 40,000 values takes 25.766 seconds
an in list with 80,000 values takes 78.827 seconds
(2) union all/union
Top-level union all is not supported, e.g. select key from src UNION ALL select key from test;
Supported: select * from (select key from src union all select key from test)aa;
union is not supported
Supported: select distinct key from (select key from src union all select key from test)aa;
3. intersect is not supported
4. minus is not supported
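Since neither is supported, a common workaround is to rewrite them with joins. A minimal sketch, reusing the src/test tables from the examples above (this rewrite is not from the original notes):
-- intersect: keys present in both src and test
select distinct aa.key from src aa left semi join test bb on aa.key = bb.key;
-- minus: keys present in src but not in test
select distinct aa.key from src aa left outer join test bb on aa.key = bb.key where bb.key is null;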
6. inner join/join/left outer join/right outer join/full outer join/left semi join are all supported
left outer join/right outer join/full outer join must include the word outer
join is the simplest association: only keys present on both sides are kept;
left outer join is driven by the left table; rows whose key does not exist in the right table get null for the right-side columns;
right outer join is driven by the right table; rows whose key does not exist in the left table get null for the left-side columns;
full outer join keeps all rows from both tables; whichever side has no matching key is filled with null;
the main use case of left semi join is to replace exist/in;
Hive does not support subqueries in the where clause; the exist/in clause common in SQL is not supported in Hive
Subqueries are not supported, e.g. select * from src aa where aa.key in(select bb.key from test bb);
They can be replaced in either of the following two ways:
select * from src aa left outer join test bb on aa.key=bb.key where bb.key is not null;
select * from src aa left semi join test bb on aa.key=bb.key;
In most cases JOIN ... ON and LEFT SEMI JOIN ... ON are equivalent.
When tables A and B are joined and B contains duplicate rows for a key:
with JOIN ... ON, A and B produce two output records, because the ON condition matches both;
with LEFT SEMI JOIN, once a record from A finds a match in B it is returned and B is not scanned further for that record,
so even if B contains duplicates, no duplicate output records are produced.
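A minimal worked sketch of the difference (tables A and B and their contents are illustrative, not from the original notes):
-- assume A(key) contains {1} and B(key) contains {1, 1}
select A.key from A join B on A.key = B.key;            -- two rows: 1, 1
select A.key from A left semi join B on A.key = B.key;  -- one row: 1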
left outer join supports subqueries, e.g. select aa.* from src aa left outer join (select * from test111)bb on aa.key=bb.a;
7. Four ways to import data into Hive
1) Import data into a Hive table from the local file system
create table wyp(id int,name string) ROW FORMAT delimited fields terminated by '\t' STORED AS TEXTFILE;
load data local inpath '' into table wyp;
2) Import data into a Hive table from HDFS
[wyp@master /home/q/hadoop-2.2.0]$ bin/hadoop fs -cat /home/
hive> load data inpath '/home/' into table wyp;
3) Query data from another table and insert it into a Hive table
hive> create table test(
> id int, name string
> ,tel string)
> partitioned by
> (age int)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE;
Note: the test table uses age as its partition column. Partitioning: in Hive, each partition of a table corresponds to a directory under the table, and all of a partition's data is stored in that directory. For example, if table wyp had two partition columns dt and city, the partition dt=20131218, city=BJ would correspond to the table directory /user/hive/warehouse/wyp/dt=20131218/city=BJ,
and all data belonging to that partition is stored in that directory.
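A hedged sketch of that layout (the table name wyp_part and the values are illustrative, not from the original notes):
create table wyp_part(id int, name string)
partitioned by (dt string, city string)
row format delimited fields terminated by '\t';
-- data loaded into the partition dt='20131218', city='BJ' is stored under
-- /user/hive/warehouse/wyp_part/dt=20131218/city=BJ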
hive> insert into table test
> partition (age='25')
> select id, name, tel
> from wyp;
The partition can also be specified dynamically from the partition column's value in the select statement:
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert into table test
> partition (age)
> select id, name,
> tel, age
> from wyp;
Hive also supports inserting data with insert overwrite
hive> insert overwrite table test
> PARTITION (age)
> select id, name, tel, age
> from wyp;
Hive also supports multi-table inserts
hive> from wyp
> insert into table test
> partition(age)
> select id, name, tel, age
> insert into table test3
> select id, name
> where age>25;
4) Create a table by querying the needed records from another table and inserting them into the new table
hive> create table test4
> as
> select id, name, tel
> from wyp;
8. View the create-table statement
hive> show create table test3;
9. Rename a table
hive> ALTER TABLE events RENAME TO 3koobecaf;
10. Add a column to a table
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
11. Add a column with a column comment
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
12. Drop a table
hive> DROP TABLE pokes;

13. Query a table
hive> select * from test order by key limit 10;
14. Create a database
Create Database baseball;
15. alter table tablename change oldColumn newColumn column_type: change a column's name and type
alter table yangsy CHANGE product_no phone_no string
16. Run the SQL in a .sql file
spark-sql --driver-class-path /home/hadoop/hive/lib/mysql-connector-java-5.1.30-bin.jar -f testsql.sql
insert into table CI_CUSER_20141117154351522 select
mainResult.PRODUCT_NO,dw_coclbl_m02_3848.L1_01_02_01,dw_coclbl_d01_3845.L2_01_01_04 from (select PRODUCT_NO from
CI_CUSER_20141114203632267) mainResult left join DW_COCLBL_M02_201407 dw_coclbl_m02_3848 on mainResult.PRODUCT_NO = dw_coclbl_m02_3848.PRODUCT_NO left join DW_COCLBL_D01_20140515 dw_coclbl_d01_3845 on
dw_coclbl_m02_3848.PRODUCT_NO = dw_coclbl_d01_3845.PRODUCT_NO
insert into CI_CUSER_20141117142123638 ( PRODUCT_NO,ATTR_COL_0000,ATTR_COL_0001) select
mainResult.PRODUCT_NO,dw_coclbl_m02_3848.L1_01_02_01,dw_coclbl_m02_3848.L1_01_03_01 from (select PRODUCT_NO from
CI_CUSER_20141114203632267) mainResult left join DW_COCLBL_M02_201407 dw_coclbl_m02_3848 on mainResult.PRODUCT_NO = dw_coclbl_m02_3848.PRODUCT_NO
CREATE TABLE ci_cuser_yymmddhhmisstttttt_tmp(product_no string) row format serde 'com.bizo.hive.serde.csv.CSVSerde' ;
LOAD DATA LOCAL INPATH '/home/ocdc/coc/yuli/test123.csv' OVERWRITE INTO TABLE test_yuli2;
Create a table stored as textfile that supports the CSV format
CREATE TABLE test_yuli7 row format serde 'com.bizo.hive.serde.csv.CSVSerde' as select * from CI_CUSER_20150310162729786;
Create a comma-delimited table without depending on the CSVSerde jar
"create table " +listName+ " ROW FORMAT DELIMITED FIELDS TERMINATED BY ','" +
" as select * from " + listName1;
create table aaaa ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE as select *
from
ThriftServer: enabling FAIR mode
To enable FAIR scheduling for the SparkSQL Thrift Server:
1. Edit $SPARK_HOME/conf/spark-defaults.conf and add:
spark.scheduler.mode FAIR
spark.scheduler.allocation.file /Users/tianyi/github/community/apache-spark/conf/fairscheduler.xml
2. Edit $SPARK_HOME/conf/fairscheduler.xml (or create the file) with content in the following format:
<?xml version="1.0"?>
<allocations>
<pool name="production">
<schedulingMode>FAIR</schedulingMode>
<!-- weight: the ratio of resources the pools may use when their minShare is the same -->
<weight>1</weight>
<!-- minShare: the amount of resources guaranteed to the pool first -->
<minShare>2</minShare>
</pool>
<pool name="test">
<schedulingMode>FIFO</schedulingMode>
<weight>2</weight>
<minShare>3</minShare>
</pool>
</allocations>
3. Restart the Thrift Server
4. Before running SQL, execute
set spark.sql.thriftserver.scheduler.pool=<the designated pool name>
After those operations finish, run create table yangsy555 like CI_CUSER_YYMMDDHHMISSTTTTTT and then insert into yangsy555 select * from yangsy555.
To create a table with a sequence number, use row_number() over() to add a sequence column for paginated queries:
create table yagnsytest2 as SELECT ROW_NUMBER() OVER() as id,* from yangsytest;
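A hedged sketch of how the sequence column can then drive paging (page size 10, second page; the table name follows the example above):
select * from yagnsytest2 where id > 10 and id <= 20;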
II. API
Spark SQL offers three API approaches:
SQL
the DataFrames API
the Datasets API.
but all three use the same execution engine.
(1) Converting data to a DataFrame
1. (Semi-)structured data (HDFS files)
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Parquet files are self-describing, so the schema is preserved.
DataFrame df = sqlContext.read().parquet("people.parquet");
// sqlContext.read().json() works on either an RDD of String or a JSON file; note this is not a typical JSON file: each line must contain a separate, self-contained JSON object
DataFrame df = sqlContext.read().json("/testDir/people.json");
load() defaults to the parquet format; use format() to specify another format:
DataFrame df = sqlContext.read().load("examples/src/main/resources/users.parquet");
DataFrame df = sqlContext.read().format("json").load("main/resources/people.json");
Old API, now deprecated:
DataFrame df2 =sqlContext.jsonFile("/xxx.json");
DataFrame df2 =sqlContext.parquetFile("/xxx.parquet");
2. RDD data
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc)
a. Via a class, using the class's reflection mechanism
Given: JavaRDD<Person> people
DataFrame df = sqlContext.createDataFrame(people, Person.class);
b. Converting an RDD with a schema
Given: StructType schema = DataTypes.createStructType(fields);
and JavaRDD<Row> rowRDD
DataFrame df = sqlContext.createDataFrame(rowRDD, schema);
3. Hive data (the correspondence between HDFS files and tables (schemas) in Hive)
HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
DataFrame df = sqlContext.sql("select count(*) from wangke.wangke where ns_date=20161224");
//(if configured,sparkSQL caches metadata)
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)");
sqlContext.sql("LOAD DATA LOCAL INPATH '' INTO TABLE src");
Row[] results = sqlContext.sql("FROM src SELECT key, value").collect();
4. Special usages
DataFrame df = sqlContext.sql("SELECT * FROM parquet.`main/resources/users.parquet`");
// Query the temporary table people
DataFrame teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19");
(2) Using DataFrames
1. Display
df.show();
df.printSchema();
2. Filtering and selecting
df.select("name").show();
df.select(df.col("name"), df.col("age").plus(1)).show();
df.filter(df.col("age").gt(21)).show();
3. Writing files
df.select("name", "favorite_color").write().save("namesAndFavColors.parquet");
df.select("name", "age").write().format("parquet").save("namesAndAges.parquet");
df.write().parquet("people.parquet");
4. Registering a temporary table
df.registerTempTable("people");
Afterwards you can query it with SQL:
DataFrame teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19");
5. Saving as a Hive table
When working with a HiveContext, DataFrames can also be saved as persistent tables using the saveAsTable command; only a DataFrame created from a HiveContext can call saveAsTable to persist it as a Hive table.
(3) Direct SQL operations
sqlContext.sql("create p ");
sqlContext.sql("insert into p partition(day=20160816) select * where day=20160816");
sqlContext.sql("insert overwrite partition(day=20160816) select * p where day=20160816");
