Data binning in Python: equal-frequency binning, equal-width binning, chi-square binning, and computing WOE and IV (688IT编程网)

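1. Advantages of discretization:

(1) Discretized features are robust to outliers. Suppose a feature is 1 when age > 30 and 0 otherwise. Without discretization, an outlier such as "age 300" would disturb the model heavily; after discretization it simply falls into the "age > 30" bucket.

(2) Logistic regression is a generalized linear model, so its expressive power is limited. When a single variable is discretized into N indicator variables, each one gets its own weight. This introduces nonlinearity into the model and increases its capacity to fit the data.

(3) Discretized features can be crossed, turning M+N variables into M*N variables. This introduces further nonlinearity and expressive power.

(4) Missing values can enter the model as a category of their own.

(5) All variables are brought onto a similar scale.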
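Points (1) and (4) can be sketched in a few lines of plain Python. The function name `age_bucket`, the cut points, and the `"missing"` label are assumptions made up for this illustration; they are not part of the original post.

```python
import math

def age_bucket(age, edges=(18, 30, 45, 60)):
    """Map a raw age to a bucket index; `edges` are hypothetical cut points."""
    # Missing values get a category of their own instead of being imputed.
    if age is None or (isinstance(age, float) and math.isnan(age)):
        return "missing"
    # The bucket index is the number of cut points the value exceeds.
    return sum(age > e for e in edges)

ages = [25, 31, 300, None, float("nan")]
buckets = [age_bucket(a) for a in ages]
print(buckets)
```

The outlier 300 ends up in the same top bucket as any other age above 60, so it cannot pull a model weight around the way the raw value could.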

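WOE:

WOE stands for "Weight of Evidence". It is a way of encoding the original independent variable. To WOE-encode a variable, the variable must first be binned. After binning, the WOE of group i is computed as:

$$\mathrm{WOE}_i = \ln\left(\frac{y_i / y_T}{n_i / n_T}\right)$$

y_i is the number of responding customers (target value 1) in this group, and y_T is the number of responding customers in the whole sample.

n_i is the number of non-responding customers (target value 0) in this group, and n_T is the number of non-responding customers in the whole sample.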

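IV:

IV stands for Information Value. It measures the predictive power of an independent variable.

The IV of group i:

$$\mathrm{IV}_i = \left(\frac{y_i}{y_T} - \frac{n_i}{n_T}\right)\ln\left(\frac{y_i / y_T}{n_i / n_T}\right)$$

The IV of the whole variable, where n is the number of groups:

$$\mathrm{IV} = \sum_{i=1}^{n} \mathrm{IV}_i$$

An unusually high IV can signal a potential risk and deserves a closer look.

The finer a feature is binned, the higher its IV tends to be.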
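To make the two quantities concrete, here is a small worked example. It uses the standard definitions WOE_i = ln((y_i/y_T)/(n_i/n_T)) and IV_i = (y_i/y_T - n_i/n_T) * WOE_i. The bin names and counts are invented for illustration only.

```python
import math

# Hypothetical counts per bin: (responders y_i, non-responders n_i).
bins = {"low": (10, 90), "mid": (30, 70), "high": (60, 40)}

y_total = sum(y for y, _ in bins.values())  # y_T
n_total = sum(n for _, n in bins.values())  # n_T

woe, iv = {}, 0.0
for name, (y, n) in bins.items():
    p_y = y / y_total  # share of all responders that fall in this bin
    p_n = n / n_total  # share of all non-responders that fall in this bin
    # A bin with y == 0 or n == 0 would break the logarithm; this sketch
    # assumes every bin contains both classes.
    woe[name] = math.log(p_y / p_n)
    iv += (p_y - p_n) * woe[name]

print(woe, iv)
```

Each term of the IV sum is non-negative, because the difference and the logarithm always share the same sign. Some implementations use the reciprocal ratio inside the logarithm; this flips the sign of WOE but leaves IV unchanged.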

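import numpy as np
import pandas as pd


def compute_WOE_IV(df, col, target):
    """
    param df: DataFrame | contains the feature and the label
    param col: str | feature name; this column must already be binned
    param target: str | label name, values 0/1
    return: the WOE of each bin (dict) and the total IV.
            Watch out for zero numerators or denominators in the calculation.
    """
    total = df.groupby([col])[target].count()  # number of samples in each bin of col
    total = pd.DataFrame({'total': total})
    bad = df.groupby([col])[target].sum()  # number of samples with target == 1 in each bin
    bad = pd.DataFrame({'bad': bad})
    regroup = total.merge(bad, left_index=True, right_index=True, how='left')
    regroup = regroup.reset_index()  # turn the bin labels back into a regular column
    N = sum(regroup['total'])  # total number of samples
    B = sum(regroup['bad'])  # total number of samples with target == 1
    regroup['good'] = regroup['total'] - regroup['bad']  # samples with target == 0 in each bin
    G = N - B  # total number of samples with target == 0
    regroup['bad_pcnt'] = regroup['bad'].map(lambda x: x * 1.0 / B)
    regroup['good_pcnt'] = regroup['good'].map(lambda x: x * 1.0 / G)
    # This code takes ln(good share / bad share), the reciprocal of ln((y_i/y_T)/(n_i/n_T)):
    # the sign of WOE flips, the IV stays the same.
    regroup["WOE"] = regroup.apply(lambda x: np.log(x.good_pcnt * 1.0 / x.bad_pcnt), axis=1)
    WOE_dict = regroup[[col, "WOE"]].set_index(col).to_dict(orient="index")
    IV = regroup.apply(lambda x: (x.good_pcnt - x.bad_pcnt) * np.log(x.good_pcnt * 1.0 / x.bad_pcnt), axis=1)
    IV = sum(IV)
    return {"WOE": WOE_dict, "IV": IV}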

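Equal-frequency binning

The interval boundaries are chosen so that every interval contains roughly the same number of instances. With N = 10 intervals, for example, each interval should contain about 10% of the instances.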
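The idea can be sketched without any library by ranking the values and cutting the ranks into equal slices. The function name `equal_frequency_bins` and the sample data are assumptions for illustration. Unlike an edge-based approach, this rank-based sketch may put equal values into different bins.

```python
def equal_frequency_bins(values, n_bins):
    """Give each value a bin index 0..n_bins-1 so that bins hold near-equal counts."""
    # Indices of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        # rank * n_bins // len(values) grows from 0 to n_bins - 1 in equal steps.
        labels[i] = rank * n_bins // len(values)
    return labels

values = [5, 1, 9, 3, 7, 2, 8, 4, 6, 10, 100, 0]
labels = equal_frequency_bins(values, 4)
counts = [labels.count(b) for b in range(4)]
print(counts)
```

With 12 values and 4 bins, every bin receives 3 values, and the outlier 100 simply lands in the top bin.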

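Equal-width binning

The range from the minimum to the maximum is divided into N equal parts. If A and B are the minimum and maximum, each interval has width W = (B − A)/N, and the interval boundaries are A + W, A + 2W, ..., A + (N − 1)W. Only the boundaries are considered here, so the number of instances in each part may differ.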
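The boundary formula translates directly into code. The helper names `equal_width_edges` and `assign_bin` and the sample data are assumptions for illustration; intervals are treated as closed on the right.

```python
def equal_width_edges(values, n_bins):
    """Return the interior boundaries A + W, A + 2W, ..., A + (N - 1)W."""
    a, b = min(values), max(values)
    w = (b - a) / n_bins
    return [a + k * w for k in range(1, n_bins)]

def assign_bin(x, edges):
    """Index of the bin that contains x: the number of boundaries x exceeds."""
    return sum(x > e for e in edges)

values = [0, 1, 2, 3, 4, 50, 100]
edges = equal_width_edges(values, 4)  # W = (100 - 0) / 4 = 25
counts = [0] * 4
for v in values:
    counts[assign_bin(v, edges)] += 1
print(edges, counts)
```

The boundaries are evenly spaced, but the counts are not: five of the seven values fall into the first interval and one interval stays empty.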

import pandas as pd
import seaborn as sn
from sklearn.model_selection import train_test_split

df = sn.load_dataset(name="titanic")
train, test = train_test_split(df, test_size=0.2)

# ---- Equal-frequency binning ----
train["age_bin"] = pd.qcut(train["age"], 10)
group_by_age_bin = train.groupby(["age_bin"], as_index=True)
df_min_max_bin = pd.DataFrame()  # records the minimum and maximum of each bin
df_min_max_bin["min_bin"] = group_by_age_bin.age.min()
df_min_max_bin["max_bin"] = group_by_age_bin.age.max()
df_min_max_bin.reset_index(inplace=True)

# ---- Equal-width binning ----
train["age_bin"] = pd.cut(train["age"], 10)
group_by_age_bin = train.groupby(["age_bin"], as_index=True)
df_min_max_bin = pd.DataFrame()  # records the minimum and maximum of each bin
df_min_max_bin["min_bin"] = group_by_age_bin.age.min()
df_min_max_bin["max_bin"] = group_by_age_bin.age.max()
df_min_max_bin.reset_index(inplace=True)

Chi-square binning

# -*- coding: utf-8 -*-
"""
Created on Sun Oct 28 21:39:24 2018

@author: WZD
"""
import numpy as np
import pandas as pd


def _pair_chi2(a, b):
    # Chi-square value of two adjacent rows [value, positives, negatives].
    # Helper factored out of the expression that the original repeated inline.
    return ((a[1] * b[2] - a[2] * b[1]) ** 2 * (a[1] + a[2] + b[1] + b[2])
            / ((a[1] + a[2]) * (b[1] + b[2]) * (a[1] + b[1]) * (a[2] + b[2])))


def ChiMerge(df, variable, flag, confidenceVal=3.841, bin=10, sample=None):
    '''
    param df: DataFrame | must contain the label column
    param variable: str | name of the variable to bin
    param flag: str | name of the column marking positive and negative samples
    param confidenceVal: float | chi-square threshold (default 3.841, the 95% level at 1 degree of freedom)
    param bin: int | maximum number of bins
    param sample: int | number of rows to sample (default: no sampling); large inputs run slowly
    note: merging stops once every adjacent chi-square value reaches the threshold
          and the number of bins does not exceed bin
    return: DataFrame | the binning result
    '''
    # Optional sampling
    if sample is not None:
        df = df.sample(n=sample)

    # Count positive and negative samples for each distinct value of the variable
    total_num = df.groupby([variable])[flag].count()  # samples per value
    total_num = pd.DataFrame({'total_num': total_num})
    positive_class = df.groupby([variable])[flag].sum()  # positive samples per value
    positive_class = pd.DataFrame({'positive_class': positive_class})
    regroup = pd.merge(total_num, positive_class, left_index=True, right_index=True, how='inner')
    regroup.reset_index(inplace=True)  # the variable value becomes column 0
    regroup['negative_class'] = regroup['total_num'] - regroup['positive_class']  # negative samples per value
    regroup = regroup.drop('total_num', axis=1)
    np_regroup = np.array(regroup)  # columns: value, positives, negatives

    # Merge consecutive intervals that both lack positives or both lack negatives
    # (otherwise the chi-square calculation would divide by zero)
    i = 0
    while i <= np_regroup.shape[0] - 2:
        if ((np_regroup[i, 1] == 0 and np_regroup[i + 1, 1] == 0)
                or (np_regroup[i, 2] == 0 and np_regroup[i + 1, 2] == 0)):
            np_regroup[i, 1] = np_regroup[i, 1] + np_regroup[i + 1, 1]  # positives
            np_regroup[i, 2] = np_regroup[i, 2] + np_regroup[i + 1, 2]  # negatives
            np_regroup[i, 0] = np_regroup[i + 1, 0]
            np_regroup = np.delete(np_regroup, i + 1, 0)
            i = i - 1
        i = i + 1

    # Chi-square value of every pair of adjacent intervals
    chi_table = np.array([_pair_chi2(np_regroup[i], np_regroup[i + 1])
                          for i in range(np_regroup.shape[0] - 1)])

    # Core of chi-square binning: keep merging the pair with the smallest chi-square value
    while len(chi_table) > 0:  # guard added: stop when only one interval is left
        if len(chi_table) <= (bin - 1) and min(chi_table) >= confidenceVal:
            break
        idx = int(np.argmin(chi_table))  # position of the smallest chi-square value
        np_regroup[idx, 1] = np_regroup[idx, 1] + np_regroup[idx + 1, 1]
        np_regroup[idx, 2] = np_regroup[idx, 2] + np_regroup[idx + 1, 2]
        np_regroup[idx, 0] = np_regroup[idx + 1, 0]
        np_regroup = np.delete(np_regroup, idx + 1, 0)
        if idx > 0:  # guard added: index -1 would overwrite the last entry
            # chi-square value between the merged interval and the previous one
            chi_table[idx - 1] = _pair_chi2(np_regroup[idx - 1], np_regroup[idx])
        if idx == np_regroup.shape[0] - 1:
            # the last two intervals were merged: drop the stale value
            chi_table = np.delete(chi_table, idx, axis=0)
        else:
            # chi-square value between the merged interval and the next one
            chi_table[idx] = _pair_chi2(np_regroup[idx], np_regroup[idx + 1])
            chi_table = np.delete(chi_table, idx + 1, axis=0)

    # Store the result in a DataFrame
    result_data = pd.DataFrame()
    result_data['variable'] = [variable] * np_regroup.shape[0]  # column 1: variable name
    list_temp = []
    for i in np.arange(np_regroup.shape[0]):
        if i == 0:
            x = '0' + ',' + str(np_regroup[i, 0])
        elif i == np_regroup.shape[0] - 1:
            x = str(np_regroup[i - 1, 0]) + '+'
        else:
            x = str(np_regroup[i - 1, 0]) + ',' + str(np_regroup[i, 0])
        list_temp.append(x)
    result_data['interval'] = list_temp  # column 2: interval
    result_data['flag_0'] = np_regroup[:, 2]  # column 3: number of negative samples
    result_data['flag_1'] = np_regroup[:, 1]  # column 4: number of positive samples
    return result_data

# ---- Test ----

from sklearn.model_selection import train_test_split
import seaborn as sn
import pandas as pd

df = sn.load_dataset(name="titanic")
train, test = train_test_split(df, test_size=0.2)

result_data = ChiMerge(df=df, variable="age", flag="survived", confidenceVal=3.841, bin=10, sample=None)

bins = []  # interval boundaries found by chi-square binning
bins.append(-float('inf'))
for i in range(result_data["interval"].shape[0] - 1):
    St = result_data["interval"][i].split(",")
    bins.append(float(St[1]))
bins.append(float('inf'))

# One odd label per bin (1, 3, 5, ...). The original hard-coded nine labels,
# which fails whenever ChiMerge returns a different number of bins.
labels = [2 * k + 1 for k in range(len(bins) - 1)]
train["age"] = pd.cut(x=train["age"], bins=bins, labels=labels)
test["age"] = pd.cut(x=test["age"], bins=bins, labels=labels)
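The merge decision inside ChiMerge rests on a single quantity: the chi-square value of two adjacent intervals, compared against the threshold 3.841. The sketch below isolates that calculation. The function name `pair_chi2`, the handling of a zero denominator, and the sample counts are assumptions for illustration.

```python
def pair_chi2(pos_a, neg_a, pos_b, neg_b):
    """Chi-square value of the 2x2 table formed by two adjacent intervals."""
    n = pos_a + neg_a + pos_b + neg_b
    denom = (pos_a + neg_a) * (pos_b + neg_b) * (pos_a + pos_b) * (neg_a + neg_b)
    if denom == 0:
        # Assumption: treat a degenerate table as "no evidence of a difference".
        return 0.0
    return (pos_a * neg_b - neg_a * pos_b) ** 2 * n / denom

CRITICAL_95 = 3.841  # chi-square critical value, 1 degree of freedom, 95% level

similar = pair_chi2(10, 90, 12, 88)    # nearly identical positive rates
different = pair_chi2(10, 90, 60, 40)  # very different positive rates
print(similar, different)
```

Two intervals with almost the same positive rate score well below the threshold and are candidates for merging. Two intervals with clearly different rates score far above it and stay separate.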
