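Using the groupby Function in Python
Grouping and Aggregation in pandas: Using groupby
I. What groupby does
groupby splits a pandas object by one or more keys and computes a summary statistic for each group, such as a count, mean, standard deviation, or a user-defined function.
II. How groupby works
Grouping and aggregating with groupby can be viewed as a two-step process:
1. Split: partition the data by the given key values or grouping variables.
2. Combine: apply a built-in or user-defined function to each group and aggregate the results.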
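The two steps above can be sketched by hand and compared with a single groupby call. This is a minimal illustration, not how pandas implements grouping internally; the data values and the names pieces, manual and auto are made up for the example.

```python
import pandas as pd

# Small fixed dataset (hypothetical values, chosen so results are easy to check)
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 5.0, 6.0],
})

# Step 1, split: collect the data1 values that belong to each key
pieces = {key: df.loc[df["key1"] == key, "data1"] for key in df["key1"].unique()}

# Step 2, combine: aggregate each piece and assemble the results into one Series
manual = pd.Series({key: piece.mean() for key, piece in pieces.items()})

# groupby performs both steps in one call
auto = df["data1"].groupby(df["key1"]).mean()

print(manual)
print(auto)
```

Both Series hold the same per-key means; the groupby version is shorter and does not need an explicit loop over keys.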
Example:
1. Splitting
# Create the data
import pandas as pd
import numpy as np
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
print(df)
>>>
  key1 key2     data1     data2
0    a  one  0.342001 -0.026749
1    a  two  0.645837 -0.460551
2    b  one -0.060859  0.199347
3    b  two -1.043132 -0.551104
4    a  one  0.312109 -1.595615
# Split the data
grouped = df['data1'].groupby(df['key1'])
print(grouped)
>>>
<pandas.core.groupby.SeriesGroupBy object at 0x114033278>
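2. Aggregating
Here grouped is a GroupBy object. Nothing has been computed yet; it only holds intermediate data about the group key df['key1']. Calling an aggregation function on it performs the actual computation.
# Aggregate
print(grouped.mean())
>>>
key1
a   -0.290544
b    0.211538
Name: data1, dtype: float64
Because np.random.randn is not seeded, these values (and those in section III.1) come from a different random draw than the table printed above.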
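To see what the GroupBy object holds before any aggregation is run, its grouping metadata can be inspected. The sketch below uses fixed, made-up data so the results are reproducible; ngroups, groups and size() are standard pandas GroupBy members.

```python
import pandas as pd

# Hypothetical fixed data, so the grouping can be checked by eye
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 5.0, 6.0],
})

grouped = df["data1"].groupby(df["key1"])

# Number of distinct group keys
print(grouped.ngroups)

# Mapping from each key to the row labels that belong to it
print(grouped.groups)

# Number of rows per group
print(grouped.size())
```

None of these calls aggregates data1 itself; they only describe how the rows were partitioned.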
III. Using groupby
1. Sum, standard deviation, maximum, minimum, and so on
print(grouped.sum())
>>>
key1
a -0.871633
b 0.423076
Name: data1, dtype: float64
print(grouped.std())
>>>
key1
a 0.274760
b 0.918468
Name: data1, dtype: float64
print(grouped.max())
>>>
key1
a -0.095071
b 0.860993
Name: data1, dtype: float64
print(grouped.min())
>>>
key1
a -0.604697
b -0.437917
Name: data1, dtype: float64
Written as a single expression, the pattern is
DataFrame['column_to_aggregate'].groupby(DataFrame['grouping_column']).function_name()
# Group by key1 and key2, then compute the mean of data1
df['data1'].groupby([df['key1'], df['key2']]).mean()
>>>
key1  key2
a     one     0.327055
      two     0.645837
b     one    -0.060859
      two    -1.043132
Name: data1, dtype: float64
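The pattern above selects the column first and then groups by a key array. A commonly used equivalent groups the whole DataFrame by column name and selects the column afterwards. The sketch below, with made-up data, checks that both spellings give the same result.

```python
import pandas as pd

# Hypothetical fixed data
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 5.0, 6.0],
})

# Select the column, then group it by the key array
by_array = df["data1"].groupby(df["key1"]).sum()

# Group the frame by column name, then select the column
by_name = df.groupby("key1")["data1"].sum()

print(by_array)
print(by_name)
```

Which form to use is largely a matter of taste; the second avoids repeating the DataFrame name.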
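2. Passing several arrays at once (similar to a pivot table)
Note the syntax: the key arrays are passed together as a list.
# Group by key1 and key2 and compute the mean
df['data1'].groupby([df['key1'], df['key2']]).mean()
>>>
key1  key2
a     one     0.327055
      two     0.645837
b     one    -0.060859
      two    -1.043132
Name: data1, dtype: float64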
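The pivot-table analogy can be made concrete: unstacking the inner key of the grouped result gives a wide table with one key on the rows and the other on the columns. The sketch below uses made-up data and compares the result with DataFrame.pivot_table; the names means, wide and pivot are chosen for this example only.

```python
import pandas as pd

# Hypothetical fixed data
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 5.0, 6.0],
})

# Group by two key arrays; the result is a Series with a two-level index
means = df["data1"].groupby([df["key1"], df["key2"]]).mean()

# Move the inner level (key2) to the columns to get a pivot-style layout
wide = means.unstack()

# The same table built with pivot_table
pivot = df.pivot_table(index="key1", columns="key2", values="data1", aggfunc="mean")

print(wide)
print(pivot)
```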
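3. Using column names as group keys
Note the syntax difference from passing arrays: no particular column is selected for aggregation here, only the grouping keys are given, so every remaining numeric column is aggregated.
# Group by the values in key1 (numeric_only=True skips the string column key2)
print(df.groupby('key1').mean(numeric_only=True))
>>>
         data1     data2
key1
a     0.433316 -0.694305
b    -0.551996 -0.175878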
# Group by the values in key1 and key2
print(df.groupby(['key1', 'key2']).mean())
>>>
              data1     data2
key1 key2
a    one   0.327055 -0.811182
     two   0.645837 -0.460551
b    one  -0.060859  0.199347
     two  -1.043132 -0.551104
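When grouping by column name, non-numeric columns that are not grouping keys need some care. As far as I know, pandas 2.0 and later raise a TypeError when mean() meets a string column, whereas older versions silently dropped it. Two ways to stay explicit are sketched below with made-up data: pass numeric_only=True, or select the columns to aggregate.

```python
import pandas as pd

# Hypothetical fixed data; key2 is a string column that is not a grouping key below
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 5.0, 6.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Option 1: ask mean() to ignore non-numeric columns
means = df.groupby("key1").mean(numeric_only=True)

# Option 2: name the columns to aggregate
selected = df.groupby("key1")[["data1", "data2"]].mean()

print(means)
print(selected)
```

Option 2 states the intent more directly and does not depend on which columns happen to be numeric.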
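4. Iterating over groups
A GroupBy object supports iteration and yields a sequence of 2-tuples, each consisting of the group name and the corresponding chunk of data.
# Iterate over the groups and print each one
for name, group in df.groupby('key1'):
    print(name)
    print(group)
>>>
a
  key1 key2     data1     data2
0    a  one  0.342001 -0.026749
1    a  two  0.645837 -0.460551
4    a  one  0.312109 -1.595615
b
  key1 key2     data1     data2
2    b  one -0.060859  0.199347
3    b  two -1.043132 -0.551104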
# With multiple keys, the group name is a tuple of key values
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print(k1, k2)
    print(group)
>>>
a one
  key1 key2     data1     data2
0    a  one  0.342001 -0.026749
4    a  one  0.312109 -1.595615
a two
  key1 key2     data1     data2
1    a  two  0.645837 -0.460551
b one
  key1 key2     data1     data2
2    b  one -0.060859  0.199347
b two
  key1 key2     data1     data2
3    b  two -1.043132 -0.551104
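Because iteration yields (name, chunk) pairs, the groups can be collected into a dict for later lookup. The sketch below uses made-up data; pieces is a name chosen for this example, and get_group is the pandas method for fetching a single group directly.

```python
import pandas as pd

# Hypothetical fixed data
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 5.0, 6.0],
})

# Each iteration step yields (group name, sub-DataFrame), so dict() accepts the pairs
pieces = dict(list(df.groupby("key1")))
print(pieces["b"])

# Fetch one group without iterating over all of them
group_a = df.groupby("key1").get_group("a")
print(group_a)
```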
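5. Computing several values with agg()
# Apply a different aggregation to each column by passing agg() a dict that maps column names to functions
.agg({'data1': 'mean', 'data2': 'sum', 'data3': 'std'})
Example
# Using agg() together with groupby
>>>
              data1    data2
key1 key2
a    one  -0.517858  0.20524
     two   0.118211      NaN
b    one   0.512116      NaN
     two   1.218486      NaN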
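A runnable sketch of the dict form of agg() follows. The choice of 'mean' for data1 and 'std' for data2 is an assumption made for illustration, not taken from the original call; it is picked because the sample standard deviation of a single-row group is NaN, which would produce a NaN pattern like the one in the output above. The data values are made up.

```python
import pandas as pd

# Hypothetical fixed data; only the (a, one) group has more than one row
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 5.0, 6.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Map each column to the aggregation that should be applied to it
result = df.groupby(["key1", "key2"]).agg({"data1": "mean", "data2": "std"})

print(result)
```

Columns that are not listed in the dict do not appear in the result.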
# Applying several aggregations to the same column
# Method 1: pass the functions as a list
>>>
      returns
          sum    mean
dummy
1       0.285  0.0285
# Method 2: pass the functions as a dict
      returns
          Sum    Mean
dummy
1       0.285  0.0285
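The calls that produced the two outputs above are not shown, so the following is a sketch under assumptions: a frame with columns dummy and returns (values made up), a list of functions for the first form, and keyword-based named aggregation for the renamed form. Older pandas accepted a nested dict such as {'returns': {'Sum': 'sum'}} for renaming; that form was removed in pandas 1.0, and named aggregation is the documented replacement.

```python
import pandas as pd

# Hypothetical data; values chosen so the sum and mean are exact in floating point
df = pd.DataFrame({
    "dummy": [1, 1, 1, 1],
    "returns": [0.25, 0.5, 0.75, 0.5],
})

# Form 1: a list of functions gives one sub-column per function under 'returns'
as_list = df.groupby("dummy").agg({"returns": ["sum", "mean"]})

# Form 2: named aggregation, where each keyword becomes an output column name
named = df.groupby("dummy")["returns"].agg(Sum="sum", Mean="mean")

print(as_list)
print(named)
```

The first result has two-level column labels, while the second has flat columns named Sum and Mean.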
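6. Combining groupby with apply
import pandas as pd
import numpy as np
import seaborn as sns
# Use the tips dataset as an example
tips = sns.load_dataset('tips')
# Return the n rows with the smallest tips: rows are sorted by tip in descending order
# and the last n are kept (use [:n] instead to get the n largest tips)
def top(x, n=5):
    return x.sort_values(by='tip', ascending=False)[-n:]
print(tips.groupby('sex').apply(top))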
>>>
            total_bill   tip     sex smoker  day    time  size  sextipMean
sex
Male   43         9.68  1.32    Male     No  Sun  Dinner     2    3.089618
       235       10.07  1.25    Male     No  Sat  Dinner     2    3.089618
       75        10.51  1.25    Male     No  Sat  Dinner     2    3.089618
       237       32.83  1.17    Male    Yes  Sat  Dinner     2    3.089618
       236       12.60  1.00    Male    Yes  Sat  Dinner     2    3.089618
Female 215       12.90  1.10  Female    Yes  Sat  Dinner     2    2.833448
       0         16.99  1.01  Female     No  Sun  Dinner     2    2.833448
       111        7.25  1.00  Female     No  Sat  Dinner     1    2.833448
       67         3.07  1.00  Female    Yes  Sat  Dinner     1    2.833448
       92         5.75  1.00  Female    Yes  Fri  Dinner     2    2.833448
The sextipMean column is not part of the tips dataset as loaded above; it appears to hold the mean tip for each sex and was presumably added in a step that is not shown.
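The apply pattern can be tried without downloading the seaborn dataset. The sketch below uses a small made-up frame with the same column names and a hypothetical helper top_tips that keeps the n largest tips per group, taking the first n rows after a descending sort. Selecting the value columns before apply is a choice made here so that the helper never receives the grouping column, which keeps the behaviour the same across pandas versions.

```python
import pandas as pd

# Hypothetical stand-in for the tips data
tips = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "total_bill": [10.0, 30.0, 20.0, 40.0, 15.0, 25.0],
    "tip": [1.0, 3.0, 2.0, 4.0, 1.5, 2.5],
})

def top_tips(group, n=2):
    # Sort by tip in descending order and keep the first n rows (the n largest tips)
    return group.sort_values(by="tip", ascending=False)[:n]

# apply() calls top_tips once per group and stacks the pieces under the group key
result = tips.groupby("sex")[["total_bill", "tip"]].apply(top_tips)

print(result)
```

The result is indexed by the group key and the original row label, in the same layout as the output above.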