Notes on the union, unionAll, and unionByName methods of Spark DataFrames:
union: merges two DataFrames by column position, not by column name; the result keeps the column names of the left-hand DataFrame (a.union(b) uses a's column order).
unionAll: same behavior as union; since Spark 2.0 it is just a deprecated alias for union (hence the deprecation warning in the output below).
unionByName: merges by column name instead of by position. A short sketch contrasting the two matching rules follows these notes.
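A minimal sketch of the difference, using toy column names made up for illustration (assumes a spark-shell session, so the implicits behind toDF are already in scope):

val x = Seq(("1", "a")).toDF("id", "tag")
val y = Seq(("b", "2")).toDF("tag", "id")   // same columns, opposite order

x.union(y).show()        // by position: y's row lands as id="b", tag="2"
x.unionByName(y).show()  // by name:     y's row lands as id="2", tag="b"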
Example:
In table b, the id_num and CST_NO columns appear in the opposite order from table a:
// run in spark-shell, where the implicits behind toDF are already in scope
val a = Seq(
  ("1", "ke", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("1", "ke", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("200", "ming", "hlj", "2019-09-06 17:15:15", "2002", "192.196", "win7", "13", "ATM")
).toDF("id_num", "CST_NO", "distribution", "dayId", "AMOUNT_cnt", "CLIENT_IP", "CLIENT_MAC", "PAYER_CODE_num", "CHANNEL_CODE")
a.show()
val b = Seq(
  ("ke", "9999", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("ke", "9999", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("ming", "787878", "hlj", "2019-09-06 17:15:15", "2002", "192.196", "win7", "13", "ATM")
).toDF("CST_NO", "id_num", "distribution", "dayId", "AMOUNT_cnt", "CLIENT_IP", "CLIENT_MAC", "PAYER_CODE_num", "CHANNEL_CODE")
b.show()
val r = a.union(b)
r.show()
val p = a.unionAll(b)
p.show()
val t = a.unionByName(b)
t.show()
The output is:
a: org.apache.spark.sql.DataFrame = [id_num: string, CST_NO: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|id_num|CST_NO|distribution|              dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|   200|  ming|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
b: org.apache.spark.sql.DataFrame = [CST_NO: string, id_num: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|CST_NO|id_num|distribution|              dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|    ke|  9999|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|    ke|  9999|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|  ming|787878|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
r: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id_num: string, CST_NO: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|id_num|CST_NO|distribution|              dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|   200|  ming|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
|    ke|  9999|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|    ke|  9999|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|  ming|787878|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
warning: there was one deprecation warning; re-run with -deprecation for details
p: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id_num: string, CST_NO: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|id_num|CST_NO|distribution|              dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|   200|  ming|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
|    ke|  9999|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|    ke|  9999|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|  ming|787878|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
t: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id_num: string, CST_NO: string ... 7 more fields]
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|id_num|CST_NO|distribution|              dayId|AMOUNT_cnt|CLIENT_IP|CLIENT_MAC|PAYER_CODE_num|CHANNEL_CODE|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|     1|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|   200|  ming|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
|  9999|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|  9999|    ke|          hb|2019-09-04 21:15:15|      1001|  192.196|       mac|            43|         ATM|
|787878|  ming|         hlj|2019-09-06 17:15:15|      2002|  192.196|      win7|            13|         ATM|
+------+------+------------+-------------------+----------+---------+----------+--------------+------------+
As the outputs show, r and p simply stack rows by position, so b's swapped columns end up under the wrong headers; only t, built with unionByName, lines the columns up correctly.
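If you are limited to the positional union (unionByName only appeared in Spark 2.3), a common workaround is to reselect the right-hand DataFrame's columns in the left-hand order before merging. A minimal sketch, reusing a and b from above:

import org.apache.spark.sql.functions.col

// reorder b's columns to match a's schema, then union by position
val aligned = a.union(b.select(a.columns.map(col): _*))
aligned.show()  // same rows as t above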
If the two DataFrames' column names do not match exactly, unionByName throws an error:
val a = Seq(
  ("1", "ke", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("1", "ke", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("200", "ming", "hlj", "2019-09-06 17:15:15", "2002", "192.196", "win7", "13", "ATM")
).toDF("id_num", "CST_NO", "distribution", "dayId", "AMOUNT_cnt", "CLIENT_IP", "CLIENT_MAC", "PAYER_CODE_num", "CHANNEL_CODE")
val b = Seq(
  ("ke", "9999", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("ke", "9999", "hb", "2019-09-04 21:15:15", "1001", "192.196", "mac", "43", "ATM"),
  ("ming", "787878", "hlj", "2019-09-06 17:15:15", "2002", "192.196", "win7", "13", "ATM")
).toDF("vvv", "id_num", "distribution", "dayId", "AMOUNT_cnt", "CLIENT_IP", "CLIENT_MAC", "PAYER_CODE_num", "CHANNEL_CODE")
val t = a.unionByName(b)
t.show()
Error: org.apache.spark.sql.AnalysisException: Cannot resolve column name "CST_NO" among (vvv, id_num, distribution, dayId, AMOUNT_cnt, CLIENT_IP, CLIENT_MAC, PAYER_CODE_num, CHANNEL_CODE);
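Since Spark 3.1, unionByName also accepts an allowMissingColumns flag, which fills columns that exist on only one side with null instead of failing. A minimal sketch (requires Spark 3.1+; a and b here are the mismatched pair just above):

// CST_NO exists only in a, vvv only in b; both appear in the result, null-filled
val merged = a.unionByName(b, allowMissingColumns = true)
merged.show()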
Note: all three methods require the two DataFrames to have the same number of columns (not rows); if the column counts differ, they fail immediately with an AnalysisException.
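Also note that, as the duplicated "ke" rows in the outputs above show, all three methods keep duplicate rows: they behave like SQL UNION ALL rather than SQL UNION. To get SQL UNION semantics, chain distinct() after the merge (using the first a/b pair above, whose column names match):

// drop duplicate rows after merging, mimicking SQL UNION
val deduped = a.unionByName(b).distinct()
deduped.show()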
